lumin.optimisation package

Submodules

lumin.optimisation.features module

lumin.optimisation.features.get_rf_feat_importance(rf, inputs, targets, weights=None)

Compute feature importance for a Random Forest model using rfpimp.

Parameters

rf (ForestRegressor) – trained Random Forest model
inputs (DataFrame) – input data as Pandas DataFrame
targets (ndarray) – target data as Numpy array
weights (Optional[ndarray]) – Optional data weights as Numpy array

Return type

DataFrame
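As a rough illustration of what permutation importance measures, the sketch below shuffles one feature column at a time and records the drop in accuracy. It is a toy, standard-library version written for this page: it does not call rfpimp or lumin, it ignores weights, and the names `permutation_importance`, `predict`, `rows` and `targets` are hypothetical.

```python
import random

def permutation_importance(predict, rows, targets, n_feats, seed=0):
    """Drop in accuracy when each feature column is shuffled (toy sketch)."""
    def accuracy(data):
        hits = sum(predict(r) == t for r, t in zip(data, targets))
        return hits / len(targets)

    rng = random.Random(seed)
    baseline = accuracy(rows)
    importances = []
    for i in range(n_feats):
        column = [r[i] for r in rows]
        rng.shuffle(column)  # break the link between feature i and the target
        permuted = [r[:i] + (v,) + r[i + 1:] for r, v in zip(rows, column)]
        importances.append(baseline - accuracy(permuted))
    return importances

# Hypothetical toy data: the target depends only on feature 0
rows = [(x, (7 * x) % 5) for x in range(20)]
targets = [int(r[0] >= 10) for r in rows]
predict = lambda r: int(r[0] >= 10)  # stand-in for a trained model
imps = permutation_importance(predict, rows, targets, n_feats=2)
```

A feature the model never uses, such as feature 1 here, gets an importance of exactly zero because shuffling it cannot change any prediction.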

lumin.optimisation.features.rf_rank_features(train_df, val_df, objective, train_feats, targ_name='gen_target', wgt_name=None, importance_cut=0.0, n_estimators=40, n_rfs=1, savename=None, plot_settings=<lumin.plotting.plot_settings.PlotSettings object>)

Compute the relative permutation importance of input features using Random Forests. A reduced set of ‘important features’ is obtained by cutting on relative importance, and a new model is trained and evaluated on this reduced set. The RFs have their hyper-parameters roughly optimised, both when training on all features and when training on the important features. Relative importances may be computed multiple times (via n_rfs) and averaged, in which case the standard error is also computed.

Parameters

train_df (DataFrame) – training data as Pandas DataFrame
val_df (DataFrame) – validation data as Pandas DataFrame
objective (str) – string representation of objective: either ‘classification’ or ‘regression’
train_feats (List[str]) – complete list of training features
targ_name (str) – name of column containing target data
wgt_name (Optional[str]) – name of column containing weight data. If set, will use weights for training and evaluation, otherwise will not
importance_cut (float) – minimum importance required to be considered an ‘important feature’
n_estimators (int) – number of trees to use in each forest
n_rfs (int) – number of trainings to perform on all training features in order to compute importances
savename (Optional[str]) – Optional name of file to which to save the plot of feature importances
plot_settings (PlotSettings) – PlotSettings class to control figure appearance

Return type

List[str]

Returns

List of features passing importance_cut, ordered by importance
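The averaging and cutting steps described above can be sketched as follows. This is an illustration under stated assumptions, not lumin's implementation: the helper `average_importances` and the feature names are hypothetical, the standard error uses the sample standard deviation, and a feature passes when its mean importance is strictly greater than the cut.

```python
from math import sqrt
from statistics import mean, stdev

def average_importances(runs, importance_cut=0.0):
    """runs: list of dicts mapping feature name -> relative importance."""
    summary = {}
    for f in runs[0]:
        vals = [run[f] for run in runs]
        # Standard error of the mean (sample standard deviation is an assumption)
        err = stdev(vals) / sqrt(len(vals)) if len(vals) > 1 else 0.0
        summary[f] = (mean(vals), err)
    ranked = sorted(summary, key=lambda f: summary[f][0], reverse=True)
    # Strict inequality is an assumption; the cut may equally be inclusive
    important = [f for f in ranked if summary[f][0] > importance_cut]
    return summary, important

runs = [  # hypothetical importances from three trainings (n_rfs=3)
    {"pt": 0.50, "eta": 0.10, "phi": -0.01},
    {"pt": 0.60, "eta": 0.20, "phi": 0.00},
    {"pt": 0.40, "eta": 0.15, "phi": -0.02},
]
summary, important = average_importances(runs)
```

With a single run there is no spread to estimate, so the sketch reports an error of zero in that case.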

lumin.optimisation.hyper_param module

lumin.optimisation.hyper_param.get_opt_rf_params(x_trn, y_trn, x_val, y_val, objective, w_trn=None, w_val=None, params={'max_features': [0.3, 0.5, 0.7, 0.9], 'min_samples_leaf': [1, 3, 5, 10, 25, 50, 100]}, n_estimators=40, verbose=True)

Use an ordered parameter-scan to roughly optimise Random Forest hyper-parameters.

Parameters

x_trn (ndarray) – training input data
y_trn (ndarray) – training target data
x_val (ndarray) – validation input data
y_val (ndarray) – validation target data
objective (str) – string representation of objective: either ‘classification’ or ‘regression’
w_trn (Optional[ndarray]) – training weights
w_val (Optional[ndarray]) – validation weights
params (OrderedDict) – ordered dictionary mapping parameters to optimise to list of values to consider
n_estimators (int) – number of trees to use in each forest
verbose (bool) – print extra information and show a live plot of model performance

Returns

params: dictionary mapping parameters to their optimised values
rf: best performing Random Forest

Return type

Tuple of (params, rf)
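One reading of "ordered parameter-scan" is a one-parameter-at-a-time search: each parameter is scanned in dictionary order while the best values found so far are held fixed. The sketch below shows that idea with a toy scoring function in place of training and validating a forest. `ordered_param_scan` and `toy_score` are hypothetical names, and higher scores are assumed to be better.

```python
from collections import OrderedDict

def ordered_param_scan(score, params):
    """Optimise one parameter at a time, in order, keeping earlier winners fixed."""
    best = OrderedDict()
    best_score = None
    for name, values in params.items():
        scores = {}
        for v in values:
            trial = dict(best, **{name: v})  # earlier winners plus this candidate
            scores[v] = score(trial)
        best[name] = max(scores, key=scores.get)
        best_score = scores[best[name]]
    return dict(best), best_score

# Hypothetical stand-in for "train a forest and score it on validation data"
def toy_score(p):
    return -abs(p.get("min_samples_leaf", 1) - 5) - abs(p.get("max_features", 0.5) - 0.7)

params = OrderedDict([
    ("min_samples_leaf", [1, 3, 5, 10, 25, 50, 100]),
    ("max_features", [0.3, 0.5, 0.7, 0.9]),
])
best, best_score = ordered_param_scan(toy_score, params)
```

Such a scan evaluates the sum of the list lengths rather than their product, which is why it is only a rough optimisation: interactions between parameters are explored only through the order of the scan.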

lumin.optimisation.hyper_param.fold_lr_find(fy, model_builder, bs, train_on_weights=True, shuffle_fold=True, n_folds=-1, lr_bounds=[1e-05, 10], callback_partials=None, plot_settings=<lumin.plotting.plot_settings.PlotSettings object>)

Wrapper function for training using LRFinder, which runs a Smith LR range test (https://arxiv.org/abs/1803.09820) using folds in FoldYielder. A model is trained for one fold, interpolating the LR between the set bounds. This is repeated for each fold in FoldYielder, and the loss evolution is averaged.

Parameters

fy (FoldYielder) – FoldYielder providing training data
model_builder (ModelBuilder) – ModelBuilder providing networks and optimisers
bs (int) – batch size
train_on_weights (bool) – If weights are present, whether to use them for training
shuffle_fold (bool) – whether to shuffle data in folds
n_folds (int) – if >= 1, will only train n_folds number of models, otherwise will train one model per fold
lr_bounds (Tuple[float, float]) – starting and ending LR values
callback_partials (Optional[List[partial]]) – optional list of functools.partial, each of which will instantiate a Callback when called
plot_settings (PlotSettings) – PlotSettings class to control figure appearance

Return type

List[LRFinder]

Returns

List of LRFinder which were used for each model trained
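The two ingredients of the range test, an LR that sweeps between the bounds and a loss curve averaged over folds, can be sketched without any training code. The sketch assumes geometric (constant-ratio) interpolation, which is common for LR range tests but is not confirmed by the text above; `lr_schedule` and `average_losses` are hypothetical helpers and the loss values are made up.

```python
def lr_schedule(lr_bounds, n_steps):
    """Geometrically interpolate the learning rate over n_steps (n_steps >= 2)."""
    lo, hi = lr_bounds
    mult = (hi / lo) ** (1 / (n_steps - 1))  # constant ratio between steps
    return [lo * mult ** i for i in range(n_steps)]

def average_losses(per_fold_losses):
    """Average the loss recorded at each step over all folds."""
    return [sum(step) / len(step) for step in zip(*per_fold_losses)]

lrs = lr_schedule((1e-5, 10), n_steps=7)
avg = average_losses([[1.0, 0.8, 0.5], [1.2, 0.6, 0.7]])  # made-up losses, two folds
```

With the default bounds and seven steps the LR grows by roughly a factor of ten per step, from 1e-5 up to 10.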

lumin.optimisation.threshold module

lumin.optimisation.threshold.binary_class_cut(df, top_perc=5.0, min_pred=0.9, wgt_factor=1.0, br=0.0, syst_unc_b=0.0, pred_name='pred', targ_name='gen_target', wgt_name='gen_weight', plot_settings=<lumin.plotting.plot_settings.PlotSettings object>)

Attention

Deprecated: renamed to binary_class_cut_by_ams(). Will be removed in v0.4.

Return type: Tuple[float, float, float]

lumin.optimisation.threshold.binary_class_cut_by_ams(df, top_perc=5.0, min_pred=0.9, wgt_factor=1.0, br=0.0, syst_unc_b=0.0, pred_name='pred', targ_name='gen_target', wgt_name='gen_weight', plot_settings=<lumin.plotting.plot_settings.PlotSettings object>)

Optimise a cut on a signal-background classifier prediction by maximising the Approximate Median Significance (AMS). The cut should generalise better because it is taken as the mean class prediction of the top top_perc percent of points, as ranked by AMS.

Parameters

df (DataFrame) – Pandas DataFrame containing data
top_perc (float) – top percentage of events to consider as ranked by AMS
min_pred (float) – minimum prediction to consider
wgt_factor (float) – single multiplicative coefficient for rescaling signal and background weights before computing AMS
br (float) – background offset bias
syst_unc_b (float) – fractional systematic uncertainty on background
pred_name (str) – column to use as predictions
targ_name (str) – column to use as truth labels for signal and background
wgt_name (str) – column to use as weights for signal and background events
plot_settings (PlotSettings) – PlotSettings class to control figure appearance

Return type

Tuple[float, float, float]

Returns

Optimised cut, AMS at cut, maximum AMS
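The smoothing idea can be sketched in plain Python. The AMS expression used here is the form popularised by the Higgs ML challenge, and it is an assumption that lumin computes the same quantity; the sketch also omits wgt_factor and syst_unc_b, and the helpers `ams`, `ams_at` and `cut_by_ams` and the toy events are hypothetical.

```python
from math import log, sqrt

def ams(s, b, br=0.0):
    """Approximate Median Significance; br offsets the background."""
    if b + br <= 0:
        return 0.0  # guard chosen for this sketch, not taken from lumin
    return sqrt(2 * ((s + b + br) * log(1 + s / (b + br)) - s))

def ams_at(events, cut, br):
    """AMS of the weighted events whose prediction is at or above cut."""
    s = sum(w for p, t, w in events if p >= cut and t == 1)
    b = sum(w for p, t, w in events if p >= cut and t == 0)
    return ams(s, b, br)

def cut_by_ams(events, top_perc=5.0, min_pred=0.9, br=0.0):
    """events: list of (prediction, target, weight) tuples."""
    cuts = sorted({p for p, _, _ in events if p >= min_pred})
    if not cuts:
        raise ValueError("no predictions at or above min_pred")
    # Rank every candidate cut by the AMS it yields, best first
    scan = sorted(((ams_at(events, c, br), c) for c in cuts), reverse=True)
    n_top = max(1, int(len(scan) * top_perc / 100))
    cut = sum(c for _, c in scan[:n_top]) / n_top  # mean of the top-ranked cuts
    return cut, ams_at(events, cut, br), scan[0][0]

events = [(0.91, 0, 1.0), (0.93, 1, 1.0), (0.95, 0, 1.0),
          (0.97, 1, 1.0), (0.99, 1, 1.0), (0.50, 0, 1.0)]  # toy data
cut, ams_cut, max_ams = cut_by_ams(events, top_perc=50, min_pred=0.9, br=1.0)
```

The AMS at the smoothed cut is generally a little below the maximum AMS; the trade is a small loss in apparent significance for a cut that depends less on a single fluctuation.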
