
lumin.optimisation package

Submodules

lumin.optimisation.features module

lumin.optimisation.features.auto_filter_on_linear_correlation(train_df, val_df, check_feats, objective, targ_name, strat_key=None, wgt_name=None, corr_threshold=0.8, n_estimators=40, rf_params=None, optimise_rf=True, n_rfs=5, subsample_rate=None, savename=None, plot_settings=<lumin.plotting.plot_settings.PlotSettings object>)[source]

Filters a list of possible training features by identifying pairs of linearly correlated features and then attempting to remove one feature from each pair, checking that doing so does not decrease the performance of Random Forests trained to perform classification or regression.

Linearly correlated features are identified by computing Spearman’s rank-order correlation coefficients for every pair of features. Hierarchical clustering is then used to group features. Clusters of features with a correlation coefficient greater than a set threshold are candidates for removal. Candidate sets of features are tested, in order of decreasing correlation, by computing the mean performance of Random Forests trained on all remaining training features, and on all remaining training features except each feature in the set in turn. If the RF trained on all remaining features consistently outperforms the other trainings, then no feature from the set is removed; otherwise the feature whose removal causes the largest mean increase in performance is removed. This test is then repeated on the remaining features in the set, until either no features are removed or only one feature remains.

Since this function involves training many models, it can be slow on large datasets. In such cases one can use the subsample_rate argument to randomly sample a fraction of the whole dataset (with optional stratification). Resampling is performed prior to each RF training for maximum generalisation, and any weights in the data are automatically renormalised to the original weight sum (within each class).

Attention

This function combines plot_rank_order_dendrogram() with rf_check_feat_removal(). This is purely for convenience and should not be treated as a ‘black box’. We encourage users to convince themselves that it really is reasonable to remove the features which are identified as redundant.

Parameters:
  • train_df (DataFrame) – training data as Pandas DataFrame

  • val_df (DataFrame) – validation data as Pandas DataFrame

  • check_feats (List[str]) – complete list of features to consider for training and removal

  • objective (str) – string representation of objective: either ‘classification’ or ‘regression’

  • targ_name (str) – name of column containing target data

  • strat_key (Optional[str]) – name of column to use to stratify data when resampling

  • wgt_name (Optional[str]) – name of column containing weight data. If set, will use weights for training and evaluation, otherwise will not

  • corr_threshold (float) – minimum threshold on Spearman’s rank-order correlation coefficient for pairs to be considered ‘correlated’

  • n_estimators (int) – number of trees to use in each forest

  • rf_params (Union[Dict, OrderedDict, None]) – either: a dictionary of keyword hyper-parameters to use for the Random Forests, if optimise_rf is False; or an OrderedDict of ranges of hyper-parameters to test during optimisation. See get_opt_rf_params() for more details.

  • optimise_rf (bool) – whether to optimise the Random Forest hyper-parameters for the (sub-sampled) dataset

  • n_rfs (int) – number of trainings to perform during each performance impact test

  • subsample_rate (Optional[float]) – float between 0 and 1. If set, will subsample the training data to the requested fraction

  • savename (Optional[str]) – Optional name of file to which to save the first plot of feature clustering

  • plot_settings (PlotSettings) – PlotSettings class to control figure appearance

Return type:

List[str]

Returns:

Filtered list of training features
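
As an illustration, the sketch below applies the filter to a small synthetic classification dataset; the data, column names, and settings are hypothetical, not part of the API:

    import numpy as np
    import pandas as pd
    from lumin.optimisation.features import auto_filter_on_linear_correlation

    # Hypothetical synthetic data: feat_b is almost a copy of feat_a
    rng = np.random.RandomState(42)
    def make_df(n):
        a, c = rng.normal(size=n), rng.normal(size=n)
        return pd.DataFrame({'feat_a': a,
                             'feat_b': a + rng.normal(scale=0.05, size=n),
                             'feat_c': c,
                             'gen_target': (a + c > 0).astype(int)})
    train_df, val_df = make_df(1000), make_df(500)

    filtered = auto_filter_on_linear_correlation(
        train_df, val_df,
        check_feats=['feat_a', 'feat_b', 'feat_c'],
        objective='classification', targ_name='gen_target',
        corr_threshold=0.8, n_rfs=5)
    print(filtered)  # expect one of feat_a/feat_b to have been removed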

lumin.optimisation.features.auto_filter_on_mutual_dependence(train_df, val_df, check_feats, objective, targ_name, strat_key=None, wgt_name=None, md_threshold=0.8, n_estimators=40, rf_params=None, optimise_rf=True, n_rfs=5, subsample_rate=None, plot_settings=<lumin.plotting.plot_settings.PlotSettings object>)[source]

Filters a list of possible training features via mutual dependence: by identifying features whose values can be accurately predicted using the other features. Features with a high ‘dependence’ are then checked to see whether removing them would decrease the performance of Random Forests trained to perform classification or regression. For best results, the features to check should be supplied in order of decreasing importance.

Dependent features are identified by training Random Forest regressors on the other features. Features with a dependence greater than a set threshold are candidates for removal. Candidate features are tested, in order of increasing importance, by computing the mean performance of Random Forests trained on: all remaining training features; and all remaining training features except the candidate feature. If the RF trained on all remaining features except the candidate feature consistently outperforms or matches the training which uses all remaining features, then the candidate feature is removed; otherwise the feature remains and is no longer tested.

Since evaluating the mutual dependence via regression then allows the important features used by the regressor to be identified, it is possible to test multiple feature removals at once, provided a removal candidate is not important for predicting another removal candidate.

Since this function involves training many models, it can be slow on large datasets. In such cases one can use the subsample_rate argument to randomly sample a fraction of the whole dataset (with optional stratification). Resampling is performed prior to each RF training for maximum generalisation, and any weights in the data are automatically renormalised to the original weight sum (within each class).

Attention

This function combines RFPImp’s feature_dependence_matrix with rf_check_feat_removal(). This is purely for convenience and should not be treated as a ‘black box’. We encourage users to convince themselves that it really is reasonable to remove the features which are identified as redundant.

Note

Technicalities related to RFPImp’s use of SVG for plots mean that the mutual dependence plots can have low resolution when shown or saved. Therefore this function does not take a savename argument. Users wishing to save the plots as PNG or PDF should compute the dependence matrix themselves using feature_dependence_matrix and then plot using plot_dependence_heatmap, calling .save([savename]) on the returned object. The plotting backend might need to be set to SVG, using: %config InlineBackend.figure_format = ‘svg’.

Parameters:
  • train_df (DataFrame) – training data as Pandas DataFrame

  • val_df (DataFrame) – validation data as Pandas DataFrame

  • check_feats (List[str]) – complete list of features to consider for training and removal

  • objective (str) – string representation of objective: either ‘classification’ or ‘regression’

  • targ_name (str) – name of column containing target data

  • strat_key (Optional[str]) – name of column to use to stratify data when resampling

  • wgt_name (Optional[str]) – name of column containing weight data. If set, will use weights for training and evaluation, otherwise will not

  • md_threshold (float) – minimum threshold on the mutual dependence coefficient for a feature to be considered ‘predictable’

  • n_estimators (int) – number of trees to use in each forest

  • rf_params (Optional[OrderedDict]) – either: a dictionary of keyword hyper-parameters to use for the Random Forests, if optimise_rf is False; or an OrderedDict of ranges of hyper-parameters to test during optimisation. See get_opt_rf_params() for more details.

  • optimise_rf (bool) – whether to optimise the Random Forest hyper-parameters for the (sub-sampled) dataset

  • n_rfs (int) – number of trainings to perform during each performance impact test

  • subsample_rate (Optional[float]) – float between 0 and 1. If set, will subsample the training data to the requested fraction

  • plot_settings (PlotSettings) – PlotSettings class to control figure appearance

Return type:

List[str]

Returns:

Filtered list of training features
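
Usage mirrors auto_filter_on_linear_correlation(); a minimal sketch, assuming train_df, val_df, and the ordered feature list important_feats have already been prepared (all names hypothetical):

    from lumin.optimisation.features import auto_filter_on_mutual_dependence

    # important_feats: candidate features, ordered by decreasing importance,
    # e.g. as returned by rf_rank_features()
    filtered = auto_filter_on_mutual_dependence(
        train_df, val_df,
        check_feats=important_feats,
        objective='classification', targ_name='gen_target',
        md_threshold=0.8,
        subsample_rate=0.2,       # train on 20% subsamples for speed
        strat_key='gen_target')   # stratify subsamples by class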

lumin.optimisation.features.get_rf_feat_importance(rf, inputs, targets, weights=None)[source]

Compute feature importance for a Random Forest model using rfpimp.

Parameters:
  • rf (Union[RandomForestRegressor, RandomForestClassifier]) – trained Random Forest model

  • inputs (DataFrame) – input data as Pandas DataFrame

  • targets (ndarray) – target data as Numpy array

  • weights (Optional[ndarray]) – Optional data weights as Numpy array

Return type:

DataFrame
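
A minimal sketch, assuming a forest has already been fitted; X_trn/X_val (DataFrames) and y_trn/y_val (arrays) are hypothetical data:

    from sklearn.ensemble import RandomForestClassifier
    from lumin.optimisation.features import get_rf_feat_importance

    # X_trn, X_val: DataFrames of input features; y_trn, y_val: target arrays
    rf = RandomForestClassifier(n_estimators=40).fit(X_trn, y_trn)
    fi = get_rf_feat_importance(rf, inputs=X_val, targets=y_val)
    print(fi)  # DataFrame of permutation importances per feature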

lumin.optimisation.features.repeated_rf_rank_features(train_df, val_df, n_reps, min_frac_import, objective, train_feats, targ_name='gen_target', wgt_name=None, strat_key=None, subsample_rate=None, resample_val=True, importance_cut=0.0, n_estimators=40, rf_params=None, optimise_rf=True, n_rfs=1, n_max_display=30, n_threads=1, savename=None, plot_settings=<lumin.plotting.plot_settings.PlotSettings object>)[source]

Runs rf_rank_features() multiple times on bootstrap resamples of the training data and computes the fraction of times each feature passes the importance cut. It then returns the list of features whose fractional selection as important is greater than min_frac_import. I.e. in cases where rf_rank_features() can be unstable (the list of important features changes each run), this method can be used to help stabilise the list of important features.

Parameters:
  • train_df (DataFrame) – training data as Pandas DataFrame

  • val_df (DataFrame) – validation data as Pandas DataFrame

  • n_reps (int) – number of times to resample and run rf_rank_features()

  • min_frac_import (float) – minimum fraction of times feature must be selected as important by rf_rank_features() in order to be considered generally important

  • objective (str) – string representation of objective: either ‘classification’ or ‘regression’

  • train_feats (List[str]) – complete list of training features

  • targ_name (str) – name of column containing target data

  • wgt_name (Optional[str]) – name of column containing weight data. If set, will use weights for training and evaluation, otherwise will not

  • strat_key (Optional[str]) – name of column to use to stratify data when resampling

  • subsample_rate (Optional[float]) – if set, will subsample the training data to the provided fraction. Subsample is repeated per Random Forest training

  • resample_val (bool) – whether to also resample the validation set, or use the original set for all evaluations

  • importance_cut (float) – minimum importance required to be considered an ‘important feature’

  • n_estimators (int) – number of trees to use in each forest

  • rf_params (Optional[Dict[str, Any]]) – optional dictionary of keyword parameters for SK-Learn Random Forests, or an ordered dictionary mapping parameters to optimise to lists of values to consider. If None, will optimise parameters using lumin.optimisation.hyper_param.get_opt_rf_params()

  • optimise_rf (bool) – if true will optimise RF params, passing rf_params to get_opt_rf_params()

  • n_rfs (int) – number of trainings to perform on all training features in order to compute importances

  • n_max_display (int) – maximum number of features to display in importance plot

  • n_threads (int) – number of rankings to run simultaneously

  • savename (Optional[str]) – Optional name of file to which to save the plot of feature importances

  • plot_settings (PlotSettings) – PlotSettings class to control figure appearance

Return type:

Tuple[List[str], DataFrame]

Returns:

  • List of features with fractional selection greater than min_frac_import, ordered by decreasing fractional selection

  • DataFrame of number of selections and fractional selections for all features
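
For example, a feature might be required to survive the importance cut in at least 80% of ten bootstrap rankings before being kept; a sketch with hypothetical data and settings:

    from lumin.optimisation.features import repeated_rf_rank_features

    stable_feats, selection_df = repeated_rf_rank_features(
        train_df, val_df,
        n_reps=10,                  # ten bootstrap resamples
        min_frac_import=0.8,        # keep features selected >=80% of the time
        objective='classification',
        train_feats=all_feats,      # hypothetical full feature list
        targ_name='gen_target',
        importance_cut=0.005,
        n_threads=4)                # run four rankings simultaneously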

lumin.optimisation.features.rf_check_feat_removal(train_df, objective, train_feats, check_feats, targ_name='gen_target', wgt_name=None, val_df=None, subsample_rate=None, strat_key=None, n_estimators=40, n_rfs=1, rf_params=None)[source]

Checks whether features can be removed from the set of training features without degrading model performance, using Random Forests. Computes scores for a model trained on all training features, then, for each feature listed in check_feats, computes scores for a model trained on all training features except that feature. E.g. if two features are highly correlated, this function could be used to check whether one of them could be removed.

Parameters:
  • train_df (DataFrame) – training data as Pandas DataFrame

  • objective (str) – string representation of objective: either ‘classification’ or ‘regression’

  • train_feats (List[str]) – complete list of training features

  • check_feats (List[str]) – list of features to try removing

  • targ_name (str) – name of column containing target data

  • wgt_name (Optional[str]) – name of column containing weight data. If set, will use weights for training and evaluation, otherwise will not

  • val_df (Optional[DataFrame]) – optional validation data as Pandas DataFrame. If set, will compute validation scores in addition to Out-Of-Bag scores, and will optimise RF parameters if rf_params is None

  • subsample_rate (Optional[float]) – if set, will subsample the training data to the provided fraction. Subsample is repeated per Random Forest training

  • strat_key (Optional[str]) – column name to use for stratified subsampling, if desired

  • n_estimators (int) – number of trees to use in each forest

  • n_rfs (int) – number of trainings to perform on all training features in order to compute importances

  • rf_params (Optional[Dict[str, Any]]) – optional dictionary of keyword parameters for SK-Learn Random Forests. If None and val_df is None, will use default parameters of ‘min_samples_leaf’:3, ‘max_features’:0.5. If None and val_df is not None, will optimise parameters using lumin.optimisation.hyper_param.get_opt_rf_params()

Return type:

Dict[str, float]

Returns:

Dictionary of results
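
A minimal sketch for the correlated-pair case mentioned above, with hypothetical DataFrames and feature names:

    from lumin.optimisation.features import rf_check_feat_removal

    # feat_a and feat_b are suspected to be redundant with one another
    results = rf_check_feat_removal(
        train_df,
        objective='classification',
        train_feats=all_feats,
        check_feats=['feat_a', 'feat_b'],
        targ_name='gen_target',
        val_df=val_df,   # enables validation scores and RF optimisation
        n_rfs=5)
    print(results)       # dictionary of scores per removal test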

lumin.optimisation.features.rf_rank_features(train_df, val_df, objective, train_feats, targ_name='gen_target', wgt_name=None, importance_cut=0.0, n_estimators=40, rf_params=None, optimise_rf=True, n_rfs=1, n_max_display=30, plot_results=True, retrain_on_import_feats=True, verbose=True, savename=None, plot_settings=<lumin.plotting.plot_settings.PlotSettings object>)[source]

Compute relative permutation importance of input features using Random Forests. A reduced set of ‘important features’ is obtained by cutting on relative importance, and a new model is trained and evaluated on this reduced set. RFs will have their hyper-parameters roughly optimised, both when training on all features and again when training on important features. Relative importances may be computed multiple times (via n_rfs) and averaged, in which case the standard error is also computed.

Parameters:
  • train_df (DataFrame) – training data as Pandas DataFrame

  • val_df (DataFrame) – validation data as Pandas DataFrame

  • objective (str) – string representation of objective: either ‘classification’ or ‘regression’

  • train_feats (List[str]) – complete list of training features

  • targ_name (str) – name of column containing target data

  • wgt_name (Optional[str]) – name of column containing weight data. If set, will use weights for training and evaluation, otherwise will not

  • importance_cut (float) – minimum importance required to be considered an ‘important feature’

  • n_estimators (int) – number of trees to use in each forest

  • rf_params (Optional[Dict[str, Any]]) – optional dictionary of keyword parameters for SK-Learn Random Forests, or an ordered dictionary mapping parameters to optimise to lists of values to consider. If None, will optimise parameters using lumin.optimisation.hyper_param.get_opt_rf_params()

  • optimise_rf (bool) – if true will optimise RF params, passing rf_params to get_opt_rf_params()

  • n_rfs (int) – number of trainings to perform on all training features in order to compute importances

  • n_max_display (int) – maximum number of features to display in importance plot

  • plot_results (bool) – whether to plot the feature importances

  • retrain_on_import_feats (bool) – whether to train a new model on important features to compare to full model

  • verbose (bool) – whether to report results and progress

  • savename (Optional[str]) – Optional name of file to which to save the plot of feature importances

  • plot_settings (PlotSettings) – PlotSettings class to control figure appearance

Return type:

List[str]

Returns:

List of features passing importance_cut, ordered by decreasing importance
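
A minimal sketch, assuming prepared DataFrames and a hypothetical feature list:

    from lumin.optimisation.features import rf_rank_features

    important_feats = rf_rank_features(
        train_df, val_df,
        objective='classification',
        train_feats=all_feats,
        targ_name='gen_target',
        importance_cut=0.005,   # keep features above 0.5% relative importance
        n_rfs=5,                # average importances over five trainings
        savename='feat_importance')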

lumin.optimisation.hyper_param module

lumin.optimisation.hyper_param.get_opt_rf_params(x_trn, y_trn, x_val, y_val, objective, w_trn=None, w_val=None, params=None, n_estimators=40, verbose=True)[source]

Use an ordered parameter-scan to roughly optimise Random Forest hyper-parameters.

Parameters:
  • x_trn (ndarray) – training input data

  • y_trn (ndarray) – training target data

  • x_val (ndarray) – validation input data

  • y_val (ndarray) – validation target data

  • objective (str) – string representation of objective: either ‘classification’ or ‘regression’

  • w_trn (Optional[ndarray]) – training weights

  • w_val (Optional[ndarray]) – validation weights

  • params (Optional[OrderedDict]) – ordered dictionary mapping parameters to optimise to lists of values to consider

  • n_estimators (int) – number of trees to use in each forest

  • verbose – Print extra information and show a live plot of model performance

Returns:

  • params – dictionary mapping parameters to their optimised values

  • rf – best performing Random Forest
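
A minimal sketch; the scan ranges are hypothetical, and parameters are scanned in the order in which they appear in the OrderedDict:

    from collections import OrderedDict
    from lumin.optimisation.hyper_param import get_opt_rf_params

    # Hypothetical ranges to scan, in order
    params = OrderedDict({'min_samples_leaf': [1, 3, 5, 10, 25],
                          'max_features': [0.3, 0.5, 0.7, 0.9]})
    opt_params, rf = get_opt_rf_params(
        x_trn, y_trn, x_val, y_val,   # Numpy arrays of inputs and targets
        objective='classification',
        params=params, n_estimators=40)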

lumin.optimisation.hyper_param.lr_find(fy, model_builder, bs, n_epochs=1, train_on_weights=True, n_repeats=-1, lr_bounds=[1e-05, 10], cb_partials=None, plot_settings=<lumin.plotting.plot_settings.PlotSettings object>, bulk_move=True, plot_savename=None, show_plot=True)[source]

Wrapper function for training using LRFinder, which runs a Smith LR-range test (https://arxiv.org/abs/1803.09820) using folds in a FoldYielder. Trains models for a set number of repeats, interpolating the LR between set bounds. This repeats for each fold in the FoldYielder, and the loss evolution is averaged.

Parameters:
  • fy (FoldYielder) – FoldYielder providing training data

  • model_builder (ModelBuilder) – ModelBuilder providing networks and optimisers

  • bs (int) – batch size

  • n_epochs (int) – number of epochs to train per fold

  • train_on_weights (bool) – If weights are present, whether to use them for training

  • shuffle_fold – whether to shuffle data in folds

  • n_repeats (int) – if >= 1, will only train n_repeats models, otherwise will train one model per fold

  • lr_bounds (Tuple[float, float]) – starting and ending LR values

  • cb_partials (Optional[List[partial]]) – optional list of functools.partial, each of which will instantiate a Callback when called

  • plot_settings (PlotSettings) – PlotSettings class to control figure appearance

  • plot_savename (Optional[str]) – Optional name of file to which to save the plot

  • show_plot (bool) – whether to show the plot, or just save it

Return type:

List[LRFinder]

Returns:

List of LRFinder which were used for each model trained
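
A minimal sketch, assuming a FoldYielder fy and a ModelBuilder model_builder have already been constructed:

    from lumin.optimisation.hyper_param import lr_find

    lr_finders = lr_find(
        fy, model_builder, bs=256,
        lr_bounds=[1e-5, 10],          # interpolate LR between these bounds
        n_repeats=3,                   # only run the range test on 3 folds
        plot_savename='lr_find.png')   # also save the averaged loss plot
    # Inspect the plot to choose an LR below the loss minimum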

lumin.optimisation.threshold module

lumin.optimisation.threshold.binary_class_cut_by_ams(df, top_perc=5.0, min_pred=0.9, wgt_factor=1.0, br=0.0, syst_unc_b=0.0, pred_name='pred', targ_name='gen_target', wgt_name='gen_weight', plot_settings=<lumin.plotting.plot_settings.PlotSettings object>)[source]

Optimise a cut on a signal-background classifier prediction using the Approximate Median Significance (AMS). The chosen cut should generalise better, since it is the mean class prediction of the top top_perc percent of points as ranked by AMS.

Parameters:
  • df (DataFrame) – Pandas DataFrame containing data

  • top_perc (float) – top percentage of events to consider as ranked by AMS

  • min_pred (float) – minimum prediction to consider

  • wgt_factor (float) – single multiplicative coefficient for rescaling signal and background weights before computing AMS

  • br (float) – background offset bias

  • syst_unc_b (float) – fractional systematic uncertainty on background

  • pred_name (str) – column to use as predictions

  • targ_name (str) – column to use as truth labels for signal and background

  • wgt_name (str) – column to use as weights for signal and background events

  • plot_settings (PlotSettings) – PlotSettings class to control figure appearance

Return type:

Tuple[float, float, float]

Returns:

  • Optimised cut

  • AMS at cut

  • Maximum AMS
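
A minimal sketch, assuming df holds predictions, targets, and weights under the default column names:

    from lumin.optimisation.threshold import binary_class_cut_by_ams

    # df: DataFrame with 'pred', 'gen_target', and 'gen_weight' columns
    cut, ams_at_cut, max_ams = binary_class_cut_by_ams(
        df, top_perc=5.0, min_pred=0.9,
        syst_unc_b=0.1)   # hypothetical 10% background systematic
    print(f'Apply pred > {cut:.3f} (AMS {ams_at_cut:.2f})')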

Module contents
