lumin.optimisation package
Submodules
lumin.optimisation.features module
-
lumin.optimisation.features.get_rf_feat_importance(rf, inputs, targets, weights=None)
Compute feature importance for a Random Forest model using rfpimp.
- Parameters
rf (ForestRegressor) – trained Random Forest model
inputs (DataFrame) – input data as Pandas DataFrame
targets (ndarray) – target data as Numpy array
weights (Optional[ndarray]) – optional data weights as Numpy array
- Return type
DataFrame
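As a rough illustration of the permutation-importance idea behind rfpimp (shuffle one feature at a time and measure how much the model's error grows), here is a minimal, library-free sketch. It is not the rfpimp or LUMIN implementation: `toy_model`, `mse`, and `permutation_importance` are hypothetical names, and a hand-written function stands in for a trained forest.

```python
import random

def mse(model, rows, targets):
    """Mean squared error of a callable model over the given rows."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(model, rows, targets, n_feats, seed=0):
    """Increase in MSE when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mse(model, rows, targets)
    importances = []
    for j in range(n_feats):
        col = [r[j] for r in rows]
        rng.shuffle(col)  # break the link between feature j and the target
        shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, col)]
        importances.append(mse(model, shuffled, targets) - baseline)
    return importances

# Hypothetical stand-in for a trained model: it only uses feature 0
def toy_model(row):
    return 3.0 * row[0]

rows = [[float(i), float((i * 7) % 5)] for i in range(20)]
targets = [3.0 * r[0] for r in rows]
imps = permutation_importance(toy_model, rows, targets, n_feats=2)
```

Feature 1 is never read by the toy model, so shuffling it leaves the error unchanged, whereas shuffling feature 0 degrades the predictions.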
-
lumin.optimisation.features.rf_rank_features(train_df, val_df, objective, train_feats, targ_name='gen_target', wgt_name=None, importance_cut=0.0, n_estimators=40, n_rfs=1, savename=None, plot_settings=<lumin.plotting.plot_settings.PlotSettings object>)
Compute relative permutation importance of input features using Random Forests. A reduced set of ‘important features’ is obtained by cutting on relative importance, and a new model is trained and evaluated on this reduced set. RFs have their hyper-parameters roughly optimised, both when training on all features and when training on the important features. Relative importances may be computed multiple times (via n_rfs) and averaged, in which case the standard error is also computed.
- Parameters
train_df (DataFrame) – training data as Pandas DataFrame
val_df (DataFrame) – validation data as Pandas DataFrame
objective (str) – string representation of objective: either ‘classification’ or ‘regression’
train_feats (List[str]) – complete list of training features
targ_name (str) – name of column containing target data
wgt_name (Optional[str]) – name of column containing weight data. If set, will use weights for training and evaluation, otherwise will not
importance_cut (float) – minimum importance required to be considered an ‘important feature’
n_estimators (int) – number of trees to use in each forest
n_rfs (int) – number of trainings to perform on all training features in order to compute importances
savename (Optional[str]) – optional name of file to which to save the plot of feature importances
plot_settings (PlotSettings) – PlotSettings class to control figure appearance
- Return type
List[str]
- Returns
List of features passing importance_cut, ordered by importance
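The averaging and cutting steps described above can be sketched as follows. This is an illustrative reading under stated assumptions, not LUMIN's code: the helper names and the feature names (`pt`, `eta`, `phi`) are hypothetical, the importances are made-up numbers, and a strict `>` comparison against `importance_cut` is assumed.

```python
from math import sqrt
from statistics import mean, stdev

def average_importances(runs):
    """Combine per-forest importances into (mean, standard error) per feature.

    runs: list of dicts mapping feature name -> importance, one dict per forest.
    """
    out = {}
    for feat in runs[0]:
        vals = [r[feat] for r in runs]
        # Standard error of the mean; undefined for a single run, so use 0.0
        se = stdev(vals) / sqrt(len(vals)) if len(vals) > 1 else 0.0
        out[feat] = (mean(vals), se)
    return out

def select_features(avg, importance_cut=0.0):
    """Features whose mean importance exceeds the cut, most important first."""
    passing = [f for f, (m, _) in avg.items() if m > importance_cut]  # assumed strict cut
    return sorted(passing, key=lambda f: avg[f][0], reverse=True)

# Made-up importances from two hypothetical forests (n_rfs=2)
runs = [
    {"pt": 0.5, "eta": 0.1, "phi": -0.02},
    {"pt": 0.7, "eta": 0.3, "phi": 0.0},
]
avg = average_importances(runs)
important = select_features(avg, importance_cut=0.0)
```

Here `phi` has a non-positive mean importance and is dropped, while `pt` and `eta` are kept in descending order of importance.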
lumin.optimisation.hyper_param module
-
lumin.optimisation.hyper_param.get_opt_rf_params(x_trn, y_trn, x_val, y_val, objective, w_trn=None, w_val=None, params={'max_features': [0.3, 0.5, 0.7, 0.9], 'min_samples_leaf': [1, 3, 5, 10, 25, 50, 100]}, n_estimators=40, verbose=True)
Use an ordered parameter-scan to roughly optimise Random Forest hyper-parameters.
- Parameters
x_trn (ndarray) – training input data
y_trn (ndarray) – training target data
x_val (ndarray) – validation input data
y_val (ndarray) – validation target data
objective (str) – string representation of objective: either ‘classification’ or ‘regression’
w_trn (Optional[ndarray]) – training weights
w_val (Optional[ndarray]) – validation weights
params (OrderedDict) – ordered dictionary mapping parameters to optimise to list of values to consider
n_estimators (int) – number of trees to use in each forest
verbose – print extra information and show a live plot of model performance
- Returns
params: dictionary mapping parameters to their optimised values
rf: best performing Random Forest
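One plausible reading of an "ordered parameter-scan" is a greedy, one-parameter-at-a-time search: scan the first parameter's values, fix the best, then move on to the next. The sketch below illustrates that reading only. `ordered_scan` and `toy_score` are hypothetical, and a simple analytic score (higher is better) replaces training and validating a forest.

```python
from collections import OrderedDict

def ordered_scan(score_fn, params):
    """Greedily optimise parameters in the order given.

    score_fn: callable taking keyword arguments and returning a score
              (higher is better); unscanned parameters keep its defaults.
    params:   ordered mapping of parameter name -> list of candidate values.
    """
    best = {}
    for name, values in params.items():
        # Score every candidate with earlier parameters fixed at their best values
        scored = [(score_fn(**{**best, name: v}), v) for v in values]
        best[name] = max(scored, key=lambda sv: sv[0])[1]
    return best

# Hypothetical stand-in for "train a forest and score it on validation data"
def toy_score(max_features=0.5, min_samples_leaf=1):
    return -((max_features - 0.7) ** 2) - ((min_samples_leaf - 5) / 100) ** 2

grid = OrderedDict([
    ("max_features", [0.3, 0.5, 0.7, 0.9]),
    ("min_samples_leaf", [1, 3, 5, 10, 25, 50, 100]),
])
best_params = ordered_scan(toy_score, grid)
```

A scan like this evaluates the sum of the list lengths (11 candidates here) instead of their product (28 for a full grid), at the cost of ignoring interactions between parameters, which fits the "roughly optimise" wording.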
-
lumin.optimisation.hyper_param.fold_lr_find(fy, model_builder, bs, train_on_weights=True, shuffle_fold=True, n_folds=-1, lr_bounds=[1e-05, 10], callback_partials=None, plot_settings=<lumin.plotting.plot_settings.PlotSettings object>)
Wrapper function for training using LRFinder, which runs a Smith LR range test (https://arxiv.org/abs/1803.09820) using folds in FoldYielder. Trains models for 1 fold, interpolating LR between set bounds. This repeats for each fold in FoldYielder, and loss evolution is averaged.
- Parameters
fy (FoldYielder) – FoldYielder providing training data
model_builder (ModelBuilder) – ModelBuilder providing networks and optimisers
bs (int) – batch size
train_on_weights (bool) – if weights are present, whether to use them for training
shuffle_fold (bool) – whether to shuffle data in folds
n_folds (int) – if >= 1, will only train n_folds number of models, otherwise will train one model per fold
lr_bounds (Tuple[float, float]) – starting and ending LR values
callback_partials (Optional[List[partial]]) – optional list of functools.partial, each of which will instantiate a Callback when called
plot_settings (PlotSettings) – PlotSettings class to control figure appearance
- Return type
List[LRFinder]
- Returns
List of LRFinder which were used for each model trained
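The two numerical ingredients of the range test described above, stepping the learning rate between the bounds and averaging the loss curves over folds, can be sketched without any training code. The helpers below are hypothetical and the loss values are made up. The text does not say how LRFinder interpolates, so geometric (log-spaced) steps are assumed here.

```python
def lr_schedule(lr_bounds, n_steps):
    """Log-spaced learning rates from lr_bounds[0] to lr_bounds[1] (needs n_steps >= 2)."""
    lo, hi = lr_bounds
    mult = (hi / lo) ** (1 / (n_steps - 1))  # constant multiplicative step (assumed)
    return [lo * mult ** i for i in range(n_steps)]

def average_losses(per_fold_losses):
    """Element-wise mean of loss histories, one history per fold."""
    n_folds = len(per_fold_losses)
    return [sum(step) / n_folds for step in zip(*per_fold_losses)]

# Default-style bounds with seven steps: roughly one step per decade
lrs = lr_schedule((1e-5, 10), n_steps=7)

# Made-up loss histories from two folds
mean_loss = average_losses([[1.0, 0.5], [3.0, 1.5]])
```

Plotting the averaged loss against the learning rate on a log axis is the usual way to read off a range where the loss falls steadily before it diverges.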
lumin.optimisation.threshold module
-
lumin.optimisation.threshold.binary_class_cut(df, top_perc=5.0, min_pred=0.9, wgt_factor=1.0, br=0.0, syst_unc_b=0.0, pred_name='pred', targ_name='gen_target', wgt_name='gen_weight', plot_settings=<lumin.plotting.plot_settings.PlotSettings object>)
Attention
Deprecated: renamed to binary_class_cut_by_ams(). Will be removed in v0.4.
- Return type
Tuple[float, float, float]
-
lumin.optimisation.threshold.binary_class_cut_by_ams(df, top_perc=5.0, min_pred=0.9, wgt_factor=1.0, br=0.0, syst_unc_b=0.0, pred_name='pred', targ_name='gen_target', wgt_name='gen_weight', plot_settings=<lumin.plotting.plot_settings.PlotSettings object>)
Optimise a cut on a signal-background classifier prediction by the Approximate Median Significance (AMS). The cut is set to the mean class prediction of the top top_perc percentage of points as ranked by AMS, which should generalise better.
- Parameters
df (DataFrame) – Pandas DataFrame containing data
top_perc (float) – top percentage of events to consider as ranked by AMS
min_pred (float) – minimum prediction to consider
wgt_factor (float) – single multiplicative coefficient for rescaling signal and background weights before computing AMS
br (float) – background offset bias
syst_unc_b (float) – fractional systematic uncertainty on background
pred_name (str) – column to use as predictions
targ_name (str) – column to use as truth labels for signal and background
wgt_name (str) – column to use as weights for signal and background events
plot_settings (PlotSettings) – PlotSettings class to control figure appearance
- Return type
Tuple[float, float, float]
- Returns
Optimised cut, AMS at cut, Maximum AMS
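To make the procedure concrete, here is a minimal sketch of a cut chosen by AMS. It uses the common AMS definition with a background offset, AMS = sqrt(2((s + b + br) ln(1 + s/(b + br)) - s)), where s and b are the summed signal and background weights passing the cut. It is an illustration, not LUMIN's implementation: the function names and event data are hypothetical, `syst_unc_b` and `wgt_factor` are left out, and returning 0 when there is no background is an assumption.

```python
from math import log, sqrt

def ams(s, b, br=0.0):
    """Approximate Median Significance for summed signal s and background b."""
    if b + br <= 0:
        return 0.0  # assumption: no background and no offset gives no usable AMS
    return sqrt(max(0.0, 2 * ((s + b + br) * log(1 + s / (b + br)) - s)))

def ams_at(events, cut, br):
    """AMS of the events whose prediction is at or above the cut."""
    s = sum(w for p, y, w in events if p >= cut and y == 1)
    b = sum(w for p, y, w in events if p >= cut and y == 0)
    return ams(s, b, br)

def cut_by_ams(events, top_perc=5.0, min_pred=0.9, br=0.0):
    """Return (cut, AMS at cut, maximum AMS).

    events: list of (prediction, is_signal, weight) tuples.
    The cut is the mean prediction of the top top_perc percent of candidate
    cuts as ranked by AMS, considering only predictions >= min_pred.
    """
    cuts = sorted({p for p, _, _ in events if p >= min_pred})
    if not cuts:
        raise ValueError("no predictions at or above min_pred")
    ranked = sorted(((ams_at(events, c, br), c) for c in cuts), reverse=True)
    n_top = max(1, int(len(ranked) * top_perc / 100))
    cut = sum(c for _, c in ranked[:n_top]) / n_top
    return cut, ams_at(events, cut, br), ranked[0][0]

# Made-up events: (prediction, is_signal, weight)
events = [
    (0.95, 1, 1.0), (0.92, 0, 1.0), (0.97, 1, 1.0),
    (0.99, 1, 1.0), (0.91, 0, 1.0), (0.50, 0, 1.0),
]
cut, ams_cut, max_ams = cut_by_ams(events, top_perc=5.0, min_pred=0.9, br=1.0)
cut2, ams_cut2, _ = cut_by_ams(events, top_perc=40.0, min_pred=0.9, br=1.0)
```

With so few events the top 5% contains a single candidate, so the cut coincides with the maximum-AMS point. Raising `top_perc` to 40 averages the two best candidates and moves the cut between them.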