lumin.optimisation package

Submodules

lumin.optimisation.features module

lumin.optimisation.features.auto_filter_on_linear_correlation(train_df, val_df, check_feats, objective, targ_name, strat_key=None, wgt_name=None, corr_threshold=0.8, n_estimators=40, rf_params=None, optimise_rf=True, n_rfs=5, subsample_rate=None, savename=None, plot_settings=<lumin.plotting.plot_settings.PlotSettings object>)

Filters a list of possible training features by identifying pairs of linearly correlated features and then attempting to remove either feature from each pair by checking whether doing so would not decrease the performance of Random Forests trained to perform classification or regression.

Linearly correlated features are identified by computing Spearman’s rank-order correlation coefficients for every pair of features. Hierarchical clustering is then used to group features. Clusters of features with a correlation coefficient greater than a set threshold are candidates for removal. Candidate sets of features are tested, in order of decreasing correlation, by computing the mean performance of Random Forests trained on all remaining training features and on all remaining training features except each feature in the set in turn. If the RF trained on all remaining features consistently outperforms the other trainings, then no feature from the set is removed; otherwise the feature whose removal causes the largest mean increase in performance is removed. This test is then repeated on the remaining features in the set, until either no features are removed, or only one feature remains.
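As a rough illustration of the first step only, the following standard-library sketch computes Spearman’s coefficient for every pair of features and lists the pairs above a threshold. It is not LUMIN’s implementation: the helper names (rank, pearson, spearman, correlated_pairs) and the toy data are hypothetical, taking the absolute value of the coefficient is an assumption, and the hierarchical clustering step is omitted.

```python
from itertools import combinations


def rank(values):
    """Return 1-based ranks, sharing the mean rank between tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]]
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rank-order coefficient: Pearson correlation of the ranks."""
    return pearson(rank(x), rank(y))


def correlated_pairs(data, corr_threshold=0.8):
    """data maps feature name -> list of values.
    Returns (feat_a, feat_b, rho) for pairs with |rho| above the threshold
    (using the absolute value is an assumption), most correlated first."""
    pairs = []
    for a, b in combinations(data, 2):
        rho = spearman(data[a], data[b])
        if abs(rho) > corr_threshold:
            pairs.append((a, b, rho))
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)


# Hypothetical toy features: pt_squared is monotonic in pt, phi is unrelated
toy = {
    "pt": [1.0, 2.0, 3.0, 4.0, 5.0],
    "pt_squared": [1.0, 4.0, 9.0, 16.0, 25.0],
    "phi": [0.3, -1.2, 2.5, 0.1, -0.4],
}
pairs = correlated_pairs(toy)  # only ('pt', 'pt_squared') passes the threshold
```

Because the coefficient is rank-based, any monotonic relationship (not only a strictly linear one) gives a coefficient of magnitude one, which is why pt and pt_squared are flagged here.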

Since this function involves training many models, it can be slow on large datasets. In such cases one can use the subsample_rate argument to randomly sample a fraction of the whole dataset (with optional stratification). Resampling is performed prior to each RF training for maximum generalisation, and any weights in the data are automatically renormalised to the original weight sum (within each class).
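The subsampling and weight renormalisation described above might look roughly like the sketch below. It uses only the standard library; the function name subsample_and_renormalise, stratifying on the class label, and sampling without replacement are all assumptions made for illustration, not a description of LUMIN’s internals.

```python
import random
from collections import defaultdict


def subsample_and_renormalise(classes, weights, subsample_rate, seed=0):
    """Sample a fraction of events within each class, then rescale the sampled
    weights so that each class keeps its original weight sum.

    Returns (selected indices, renormalised weights for those indices)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, cls in enumerate(classes):
        by_class[cls].append(idx)

    selected, new_weights = [], []
    for cls, idxs in by_class.items():
        n_keep = max(1, round(subsample_rate * len(idxs)))
        keep = rng.sample(idxs, n_keep)  # without replacement (assumption)
        original_sum = sum(weights[i] for i in idxs)
        sampled_sum = sum(weights[i] for i in keep)
        scale = original_sum / sampled_sum  # restores the per-class weight sum
        selected.extend(keep)
        new_weights.extend(weights[i] * scale for i in keep)
    return selected, new_weights


# Hypothetical toy data: six events of class 0 and four of class 1
classes = [0] * 6 + [1] * 4
weights = [1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 0.5, 0.5, 1.5, 1.5]
idxs, wgts = subsample_and_renormalise(classes, weights, subsample_rate=0.5)
```

Rescaling per class keeps the relative normalisation of the classes unchanged, so a weighted metric computed on the subsample stays comparable to one computed on the full dataset.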

Attention

This function combines plot_rank_order_dendrogram() with rf_check_feat_removal(). This is purely for convenience and should not be treated as a ‘black box’. We encourage users to convince themselves that it really is reasonable to remove the features which are identified as redundant.

Parameters:

train_df (DataFrame) – training data as Pandas DataFrame
val_df (DataFrame) – validation data as Pandas DataFrame
check_feats (List[str]) – complete list of features to consider for training and removal
objective (str) – string representation of objective: either ‘classification’ or ‘regression’
targ_name (str) – name of column containing target data
strat_key (Optional[str]) – name of column to use to stratify data when resampling
wgt_name (Optional[str]) – name of column containing weight data. If set, will use weights for training and evaluation, otherwise will not
corr_threshold (float) – minimum threshold on Spearman’s rank-order correlation coefficient for pairs to be considered ‘correlated’
n_estimators (int) – number of trees to use in each forest
rf_params (Union[Dict, OrderedDict, None]) – either: a dictionary of keyword hyper-parameters to use for the Random Forests, if optimise_rf is False; or an OrderedDict of a range of hyper-parameters to test during optimisation. See get_opt_rf_params() for more details.
optimise_rf (bool) – whether to optimise the Random Forest hyper-parameters for the (sub-sampled) dataset
n_rfs (int) – number of trainings to perform during each performance impact test
subsample_rate (Optional[float]) – float between 0 and 1. If set, will subsample the training data to the requested fraction
savename (Optional[str]) – Optional name of file to which to save the first plot of feature clustering
plot_settings (PlotSettings) – PlotSettings class to control figure appearance

Return type:

List[str]

Returns:

Filtered list of training features

lumin.optimisation.features.auto_filter_on_mutual_dependence(train_df, val_df, check_feats, objective, targ_name, strat_key=None, wgt_name=None, md_threshold=0.8, n_estimators=40, rf_params=None, optimise_rf=True, n_rfs=5, subsample_rate=None, plot_settings=<lumin.plotting.plot_settings.PlotSettings object>)

Filters a list of possible training features via mutual dependence, i.e. by identifying features whose values can be accurately predicted using the other features. Features with a high ‘dependence’ are then checked to see whether removing them would not decrease the performance of Random Forests trained to perform classification or regression. For best results, the features to check should be supplied in order of decreasing importance.

Dependent features are identified by training Random Forest regressors on the other features. Features with a dependence greater than a set threshold are candidates for removal. Candidate features are tested, in order of increasing importance, by computing the mean performance of Random Forests trained on: all remaining training features; and all remaining training features except the candidate feature. If the RF trained on all remaining features except the candidate feature consistently outperforms or matches the training which uses all remaining features, then the candidate feature is removed; otherwise the feature remains and is no longer tested.
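The removal test can be pictured as a comparison of two sets of repeated scores. The sketch below is one possible reading of ‘consistently outperforms or matches’: the helper names are hypothetical, and using the combined standard error of the two means as the tolerance is an assumption, since the exact criterion is not spelled out here.

```python
from statistics import mean, stdev


def mean_and_stderr(scores):
    """Mean and standard error of the mean over repeated trainings (needs >= 2 scores)."""
    return mean(scores), stdev(scores) / len(scores) ** 0.5


def can_remove(scores_all, scores_without):
    """True if dropping the candidate matches or beats keeping it,
    within the combined standard error (this tolerance is an assumption)."""
    m_all, e_all = mean_and_stderr(scores_all)
    m_wo, e_wo = mean_and_stderr(scores_without)
    tolerance = (e_all**2 + e_wo**2) ** 0.5
    return m_wo >= m_all - tolerance


# Hypothetical scores from n_rfs=5 trainings with and without a candidate feature
with_feature = [0.80, 0.81, 0.79, 0.80, 0.80]
redundant_removed = [0.80, 0.80, 0.81, 0.79, 0.81]  # performance unchanged
useful_removed = [0.70, 0.71, 0.69, 0.70, 0.70]     # performance drops

remove_redundant = can_remove(with_feature, redundant_removed)
remove_useful = can_remove(with_feature, useful_removed)
```

Repeating the training several times (n_rfs) is what makes such a comparison meaningful, because a single Random Forest score fluctuates from one training to the next.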

Since evaluating the mutual dependence via regression then allows the important features used by the regressor to be identified, it is possible to test multiple feature removals at once, provided a removal candidate is not important for predicting another removal candidate.

Since this function involves training many models, it can be slow on large datasets. In such cases one can use the subsample_rate argument to randomly sample a fraction of the whole dataset (with optional stratification). Resampling is performed prior to each RF training for maximum generalisation, and any weights in the data are automatically renormalised to the original weight sum (within each class).

Attention

This function combines RFPImp’s feature_dependence_matrix with rf_check_feat_removal(). This is purely for convenience and should not be treated as a ‘black box’. We encourage users to convince themselves that it really is reasonable to remove the features which are identified as redundant.

Note

Technicalities related to RFPImp’s use of SVG for plots mean that the mutual dependence plots can have low resolution when shown or saved. Therefore this function does not take a savename argument. Users wishing to save the plots as PNG or PDF should compute the dependence matrix themselves using feature_dependence_matrix and then plot using plot_dependence_heatmap, calling .save([savename]) on the returned object. The plotting backend might need to be set to SVG, using: %config InlineBackend.figure_format = 'svg'.

Parameters:

train_df (DataFrame) – training data as Pandas DataFrame
val_df (DataFrame) – validation data as Pandas DataFrame
check_feats (List[str]) – complete list of features to consider for training and removal
objective (str) – string representation of objective: either ‘classification’ or ‘regression’
targ_name (str) – name of column containing target data
strat_key (Optional[str]) – name of column to use to stratify data when resampling
wgt_name (Optional[str]) – name of column containing weight data. If set, will use weights for training and evaluation, otherwise will not
md_threshold (float) – minimum threshold on the mutual dependence coefficient for a feature to be considered ‘predictable’
n_estimators (int) – number of trees to use in each forest
rf_params (Optional[OrderedDict]) – either: a dictionary of keyword hyper-parameters to use for the Random Forests, if optimise_rf is False; or an OrderedDict of a range of hyper-parameters to test during optimisation. See get_opt_rf_params() for more details.
optimise_rf (bool) – whether to optimise the Random Forest hyper-parameters for the (sub-sampled) dataset
n_rfs (int) – number of trainings to perform during each performance impact test
subsample_rate (Optional[float]) – float between 0 and 1. If set, will subsample the training data to the requested fraction
plot_settings (PlotSettings) – PlotSettings class to control figure appearance

Return type:

List[str]

Returns:

Filtered list of training features

lumin.optimisation.features.get_rf_feat_importance(rf, inputs, targets, weights=None)

Compute feature importance for a Random Forest model using rfpimp.

Parameters:

rf (Union[RandomForestRegressor, RandomForestClassifier]) – trained Random Forest model
inputs (DataFrame) – input data as Pandas DataFrame
targets (ndarray) – target data as Numpy array
weights (Optional[ndarray]) – Optional data weights as Numpy array

Return type:

DataFrame

lumin.optimisation.features.repeated_rf_rank_features(train_df, val_df, n_reps, min_frac_import, objective, train_feats, targ_name='gen_target', wgt_name=None, strat_key=None, subsample_rate=None, resample_val=True, importance_cut=0.0, n_estimators=40, rf_params=None, optimise_rf=True, n_rfs=1, n_max_display=30, n_threads=1, savename=None, plot_settings=<lumin.plotting.plot_settings.PlotSettings object>)

Runs rf_rank_features() multiple times on bootstrap resamples of the training data and computes the fraction of times each feature passes the importance cut. It then returns the list of features whose fractional selection as important is greater than min_frac_import. In cases where rf_rank_features() is unstable (the list of important features changes from run to run), this function can be used to help stabilise the list of important features.

Parameters:

train_df (DataFrame) – training data as Pandas DataFrame
val_df (DataFrame) – validation data as Pandas DataFrame
n_reps (int) – number of times to resample and run rf_rank_features()
min_frac_import (float) – minimum fraction of times feature must be selected as important by rf_rank_features() in order to be considered generally important
objective (str) – string representation of objective: either ‘classification’ or ‘regression’
train_feats (List[str]) – complete list of training features
targ_name (str) – name of column containing target data
wgt_name (Optional[str]) – name of column containing weight data. If set, will use weights for training and evaluation, otherwise will not
strat_key (Optional[str]) – name of column to use to stratify data when resampling
subsample_rate (Optional[float]) – if set, will subsample the training data to the provided fraction. Subsample is repeated per Random Forest training
resample_val (bool) – whether to also resample the validation set, or use the original set for all evaluations
importance_cut (float) – minimum importance required to be considered an ‘important feature’
n_estimators (int) – number of trees to use in each forest
rf_params (Optional[Dict[str, Any]]) – optional dictionary of keyword parameters for SK-Learn Random Forests, or an ordered dictionary mapping parameters to optimise to lists of values to consider. If None, will optimise parameters using lumin.optimisation.hyper_param.get_opt_rf_params()
optimise_rf (bool) – if true will optimise RF params, passing rf_params to get_opt_rf_params()
n_rfs (int) – number of trainings to perform on all training features in order to compute importances
n_max_display (int) – maximum number of features to display in importance plot
n_threads (int) – number of rankings to run simultaneously
savename (Optional[str]) – Optional name of file to which to save the plot of feature importances
plot_settings (PlotSettings) – PlotSettings class to control figure appearance

Return type:

Tuple[List[str], DataFrame]

Returns:

List of features with fractional selection greater than min_frac_import, ordered by decreasing fractional selection
DataFrame of number of selections and fractional selections for all features
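The bookkeeping behind the fractional selection can be sketched as follows. This standard-library example assumes the per-repetition lists of important features have already been obtained; the function name fractional_selection and the toy feature names are hypothetical, and a plain dictionary stands in for the returned DataFrame.

```python
from collections import Counter


def fractional_selection(selections, train_feats, min_frac_import):
    """selections holds one list of 'important' features per repetition.
    Returns (features selected more often than min_frac_import, ordered by
    decreasing fraction; fraction of repetitions selecting each feature)."""
    n_reps = len(selections)
    counts = Counter(f for run in selections for f in set(run))
    fractions = {f: counts[f] / n_reps for f in train_feats}
    kept = [f for f in train_feats if fractions[f] > min_frac_import]
    kept.sort(key=lambda f: fractions[f], reverse=True)
    return kept, fractions


# Hypothetical outcome of n_reps=4 rankings
runs = [["pt", "eta"], ["pt", "mass"], ["pt", "eta"], ["pt"]]
kept, fractions = fractional_selection(
    runs, ["pt", "eta", "phi", "mass"], min_frac_import=0.4
)
```

Here mass is selected in only one of the four repetitions, so it falls below the requested fraction even though a single ranking might have kept it.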

lumin.optimisation.features.rf_check_feat_removal(train_df, objective, train_feats, check_feats, targ_name='gen_target', wgt_name=None, val_df=None, subsample_rate=None, strat_key=None, n_estimators=40, n_rfs=1, rf_params=None)

Checks, using Random Forests, whether features can be removed from the set of training features without degrading model performance. Computes scores for a model trained on all training features, then, for each feature listed in check_feats, computes scores for a model trained on all training features except that feature. E.g. if two features are highly correlated, this function could be used to check whether one of them could be removed.

Parameters:

train_df (DataFrame) – training data as Pandas DataFrame
objective (str) – string representation of objective: either ‘classification’ or ‘regression’
train_feats (List[str]) – complete list of training features
check_feats (List[str]) – list of features to try removing
targ_name (str) – name of column containing target data
wgt_name (Optional[str]) – name of column containing weight data. If set, will use weights for training and evaluation, otherwise will not
val_df (Optional[DataFrame]) – optional validation data as Pandas DataFrame. If set, will compute validation scores in addition to Out Of Bag scores, and will optimise RF parameters if rf_params is None
subsample_rate (Optional[float]) – if set, will subsample the training data to the provided fraction. Subsample is repeated per Random Forest training
strat_key (Optional[str]) – column name to use for stratified subsampling, if desired
n_estimators (int) – number of trees to use in each forest
n_rfs (int) – number of trainings to perform on all training features in order to compute importances
rf_params (Optional[Dict[str, Any]]) – optional dictionary of keyword parameters for SK-Learn Random Forests. If None and val_df is None, will use default parameters of ‘min_samples_leaf’:3, ‘max_features’:0.5. If None and val_df is not None, will optimise parameters using lumin.optimisation.hyper_param.get_opt_rf_params()

Return type:

Dict[str, float]

Returns:

Dictionary of results
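The structure of the check is a leave-one-feature-out loop, sketched below with a stand-in scoring function in place of Random Forest training. Everything here is hypothetical: check_feat_removal, toy_score, and the keys of the returned dictionary are chosen for illustration and are not claimed to match the dictionary returned by the real function.

```python
def check_feat_removal(train_feats, check_feats, score_fn):
    """Score the full feature set, then each set with one checked feature removed."""
    results = {"All": score_fn(train_feats)}
    for feat in check_feats:
        reduced = [f for f in train_feats if f != feat]
        results[f"{feat}_removed"] = score_fn(reduced)
    return results


def toy_score(features):
    """Stand-in for training and scoring a Random Forest on `features`.
    'px' and 'pt' are assumed to carry the same information."""
    score = 0.5
    if "px" in features or "pt" in features:
        score += 0.3
    if "mass" in features:
        score += 0.1
    return score


results = check_feat_removal(["px", "pt", "mass"], ["px", "pt"], toy_score)
```

In this toy case removing either px or pt alone leaves the score unchanged. Removing both would lower it, which is why candidate removals should be rechecked after each feature is dropped.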

lumin.optimisation.features.rf_rank_features(train_df, val_df, objective, train_feats, targ_name='gen_target', wgt_name=None, importance_cut=0.0, n_estimators=40, rf_params=None, optimise_rf=True, n_rfs=1, n_max_display=30, plot_results=True, retrain_on_import_feats=True, verbose=True, savename=None, plot_settings=<lumin.plotting.plot_settings.PlotSettings object>)

Compute relative permutation importance of input features using Random Forests. A reduced set of ‘important features’ is obtained by cutting on relative importance, and a new model is trained and evaluated on this reduced set. RFs will have their hyper-parameters roughly optimised, both when training on all features and once when training on important features. Relative importances may be computed multiple times (via n_rfs) and averaged, in which case the standard error is also computed.

Parameters:

train_df (DataFrame) – training data as Pandas DataFrame
val_df (DataFrame) – validation data as Pandas DataFrame
objective (str) – string representation of objective: either ‘classification’ or ‘regression’
train_feats (List[str]) – complete list of training features
targ_name (str) – name of column containing target data
wgt_name (Optional[str]) – name of column containing weight data. If set, will use weights for training and evaluation, otherwise will not
importance_cut (float) – minimum importance required to be considered an ‘important feature’
n_estimators (int) – number of trees to use in each forest
rf_params (Optional[Dict[str, Any]]) – optional dictionary of keyword parameters for SK-Learn Random Forests, or an ordered dictionary mapping parameters to optimise to lists of values to consider. If None, will optimise parameters using lumin.optimisation.hyper_param.get_opt_rf_params()
optimise_rf (bool) – if true will optimise RF params, passing rf_params to get_opt_rf_params()
n_rfs (int) – number of trainings to perform on all training features in order to compute importances
n_max_display (int) – maximum number of features to display in importance plot
plot_results (bool) – whether to plot the feature importances
retrain_on_import_feats (bool) – whether to train a new model on important features to compare to full model
verbose (bool) – whether to report results and progress
savename (Optional[str]) – Optional name of file to which to save the plot of feature importances
plot_settings (PlotSettings) – PlotSettings class to control figure appearance

Return type:

List[str]

Returns:

List of features passing importance_cut, ordered by decreasing importance
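The averaging over n_rfs forests and the subsequent cut could be sketched as below. The importances are assumed to have been computed already (one dictionary per forest); the function name average_importances, the toy values, and the use of a strict ‘greater than’ comparison for the cut are assumptions.

```python
from statistics import mean, stdev


def average_importances(importances_per_rf, importance_cut=0.0):
    """importances_per_rf holds one {feature: importance} dict per forest.
    Returns (features above the cut, most important first;
    {feature: (mean importance, standard error)})."""
    n = len(importances_per_rf)
    summary = {}
    for feat in importances_per_rf[0]:
        vals = [imp[feat] for imp in importances_per_rf]
        # The standard error is only defined for more than one forest
        err = stdev(vals) / n**0.5 if n > 1 else 0.0
        summary[feat] = (mean(vals), err)
    important = sorted(
        (f for f in summary if summary[f][0] > importance_cut),
        key=lambda f: summary[f][0],
        reverse=True,
    )
    return important, summary


# Hypothetical importances from n_rfs=3 forests
imps = [
    {"pt": 0.30, "eta": 0.10, "phi": -0.01},
    {"pt": 0.34, "eta": 0.08, "phi": 0.00},
    {"pt": 0.32, "eta": 0.12, "phi": -0.02},
]
important, summary = average_importances(imps, importance_cut=0.005)
```

Permutation importances can be slightly negative for uninformative features, as for phi here, so a small positive cut removes them.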

lumin.optimisation.hyper_param module

lumin.optimisation.hyper_param.get_opt_rf_params(x_trn, y_trn, x_val, y_val, objective, w_trn=None, w_val=None, params=None, n_estimators=40, verbose=True)

Use an ordered parameter-scan to roughly optimise Random Forest hyper-parameters.

Parameters:

x_trn (ndarray) – training input data
y_trn (ndarray) – training target data
x_val (ndarray) – validation input data
y_val (ndarray) – validation target data
objective (str) – string representation of objective: either ‘classification’ or ‘regression’
w_trn (Optional[ndarray]) – training weights
w_val (Optional[ndarray]) – validation weights
params (Optional[OrderedDict]) – ordered dictionary mapping parameters to optimise to list of values to consider
n_estimators (int) – number of trees to use in each forest
verbose – Print extra information and show a live plot of model performance

Returns:

params: dictionary mapping parameters to their optimised values
rf: best performing Random Forest
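An ordered parameter scan optimises one hyper-parameter at a time, in the order given, keeping the best values found so far fixed. The sketch below shows that idea with a stand-in scoring function instead of a trained forest. The names ordered_param_scan and toy_score are hypothetical, and details such as the starting values and tie handling are assumptions rather than a description of get_opt_rf_params().

```python
from collections import OrderedDict


def ordered_param_scan(params, score_fn):
    """Scan each parameter in dictionary order, keeping a trial value only if
    it strictly improves the score obtained with the best values so far."""
    best = {name: values[0] for name, values in params.items()}
    best_score = score_fn(best)
    for name, values in params.items():
        for value in values:
            trial = {**best, name: value}
            score = score_fn(trial)
            if score > best_score:
                best, best_score = trial, score
    return best, best_score


def toy_score(p):
    """Stand-in for the validation score of a forest trained with parameters p."""
    return -abs(p["min_samples_leaf"] - 4) - 10 * abs(p["max_features"] - 0.5)


scan = OrderedDict(
    [("min_samples_leaf", [1, 2, 4, 8]), ("max_features", [0.3, 0.5, 0.7, 0.9])]
)
best, best_score = ordered_param_scan(scan, toy_score)
```

Such a scan needs only the sum of the list lengths in trainings (8 here) rather than their product (16 for a full grid), at the cost of possibly missing interactions between parameters, which is consistent with the ‘rough’ optimisation described above.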

lumin.optimisation.hyper_param.lr_find(fy, model_builder, bs, n_epochs=1, train_on_weights=True, n_repeats=-1, lr_bounds=[1e-05, 10], cb_partials=None, plot_settings=<lumin.plotting.plot_settings.PlotSettings object>, bulk_move=True, plot_savename=None, show_plot=True)

Wrapper function for training using LRFinder which runs a Smith LR range test (https://arxiv.org/abs/1803.09820) using folds in FoldYielder. Trains models for a set number of repeats, interpolating LR between set bounds. This repeats for each fold in FoldYielder, and loss evolution is averaged.

Parameters:

fy (FoldYielder) – FoldYielder providing training data
model_builder (ModelBuilder) – ModelBuilder providing networks and optimisers
bs (int) – batch size
n_epochs (int) – number of epochs to train per fold
train_on_weights (bool) – If weights are present, whether to use them for training
shuffle_fold – whether to shuffle data in folds
n_folds – if >= 1, will only train n_folds number of models, otherwise will train one model per fold
lr_bounds (Tuple[float, float]) – starting and ending LR values
cb_partials (Optional[List[partial]]) – optional list of functools.partial, each of which will instantiate a Callback when called
plot_settings (PlotSettings) – PlotSettings class to control figure appearance
savename – Optional name of file to which to save the plot
show_plot (bool) – whether to show the plot, or just save it

Return type:

List[LRFinder]

Returns:

List of LRFinder which were used for each model trained
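The two ingredients of the range test, a learning rate that sweeps between the bounds and a loss curve averaged over folds, can be sketched as follows. Geometric (log-spaced) interpolation is an assumption, since the text above only says the LR is interpolated between the bounds; lr_schedule and average_losses are hypothetical helper names and the loss values are made up.

```python
def lr_schedule(lr_bounds, n_steps):
    """Learning rates spaced geometrically from lr_bounds[0] to lr_bounds[1]
    (geometric spacing is an assumption; requires n_steps >= 2)."""
    lr_min, lr_max = lr_bounds
    mult = (lr_max / lr_min) ** (1 / (n_steps - 1))
    return [lr_min * mult**i for i in range(n_steps)]


def average_losses(losses_per_fold):
    """Average the loss recorded at each step over all folds."""
    return [sum(step) / len(step) for step in zip(*losses_per_fold)]


lrs = lr_schedule((1e-5, 10), n_steps=7)            # one decade per step
mean_loss = average_losses([[1.0, 0.5], [3.0, 1.5]])  # two folds, two steps
```

Plotting the averaged loss against the learning rate on a logarithmic axis then shows the region where the loss falls fastest, which is the usual way to read an LR range test.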

lumin.optimisation.threshold module

lumin.optimisation.threshold.binary_class_cut_by_ams(df, top_perc=5.0, min_pred=0.9, wgt_factor=1.0, br=0.0, syst_unc_b=0.0, pred_name='pred', targ_name='gen_target', wgt_name='gen_weight', plot_settings=<lumin.plotting.plot_settings.PlotSettings object>)

Optimise a cut on a signal-background classifier prediction using the Approximate Median Significance (AMS). The cut should generalise better by taking the mean class prediction of the top top_perc percentage of points as ranked by AMS.

Parameters:

df (DataFrame) – Pandas DataFrame containing data
top_perc (float) – top percentage of events to consider as ranked by AMS
min_pred (float) – minimum prediction to consider
wgt_factor (float) – single multiplicative coefficient for rescaling signal and background weights before computing AMS
br (float) – background offset bias
syst_unc_b (float) – fractional systematic uncertainty on background
pred_name (str) – column to use as predictions
targ_name (str) – column to use as truth labels for signal and background
wgt_name (str) – column to use as weights for signal and background events
plot_settings (PlotSettings) – PlotSettings class to control figure appearance

Return type:

Tuple[float, float, float]

Returns:

Optimised cut
AMS at cut
Maximum AMS
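A simplified sketch of the procedure is given below, assuming the commonly used form of the AMS, sqrt(2((s + b + br) ln(1 + s / (b + br)) - s)), for weighted signal s and background b passing the cut. It ignores wgt_factor and syst_unc_b, works on plain lists rather than a DataFrame, and the scanning strategy (treating every prediction above min_pred as a candidate cut) is an assumption; ams and cut_by_ams are hypothetical names.

```python
import math


def ams(s, b, br=0.0):
    """Approximate Median Significance for weighted signal s and background b.
    Returns 0.0 when b + br is not positive, to avoid dividing by zero."""
    if b + br <= 0:
        return 0.0
    return math.sqrt(2 * ((s + b + br) * math.log(1 + s / (b + br)) - s))


def cut_by_ams(preds, targets, weights, top_perc=5.0, min_pred=0.9, br=0.0):
    """Return (cut, AMS at cut, maximum AMS)."""

    def ams_at(cut):
        s = sum(w for p, t, w in zip(preds, targets, weights) if p >= cut and t == 1)
        b = sum(w for p, t, w in zip(preds, targets, weights) if p >= cut and t == 0)
        return ams(s, b, br)

    # Treat every prediction above min_pred as a candidate cut, best AMS first
    scanned = sorted(((ams_at(p), p) for p in preds if p >= min_pred), reverse=True)
    n_top = max(1, math.ceil(len(scanned) * top_perc / 100))
    # Mean prediction of the top-ranked candidates, rather than the single best
    cut = sum(p for _, p in scanned[:n_top]) / n_top
    return cut, ams_at(cut), scanned[0][0]


# Hypothetical toy sample: three signal and three background events, unit weights
preds = [0.99, 0.97, 0.95, 0.93, 0.91, 0.5]
targets = [1, 1, 1, 0, 0, 0]
weights = [1.0] * 6
cut, ams_cut, max_ams = cut_by_ams(preds, targets, weights, top_perc=20, br=1.0)
```

Averaging over the top-ranked candidates instead of taking the single best point makes the chosen cut less sensitive to statistical fluctuations in the sample used for the optimisation. With more than one point in the average, the AMS at the chosen cut is generally a little below the maximum AMS.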
