lumin.optimisation package
Submodules
lumin.optimisation.features module
- lumin.optimisation.features.auto_filter_on_linear_correlation(train_df, val_df, check_feats, objective, targ_name, strat_key=None, wgt_name=None, corr_threshold=0.8, n_estimators=40, rf_params=None, optimise_rf=True, n_rfs=5, subsample_rate=None, savename=None, plot_settings=<lumin.plotting.plot_settings.PlotSettings object>)
Filters a list of possible training features by identifying pairs of linearly correlated features and then attempting to remove either feature from each pair, checking that doing so would not decrease the performance of Random Forests trained to perform classification or regression.
Linearly correlated features are identified by computing Spearman’s rank-order correlation coefficients for every pair of features. Hierarchical clustering is then used to group features. Clusters of features with a correlation coefficient greater than a set threshold are candidates for removal. Candidate sets of features are tested, in order of decreasing correlation, by computing the mean performance of Random Forests trained on all remaining training features, and on all remaining training features except each feature in the set in turn. If the RF trained on all remaining features consistently outperforms the other trainings, then no feature from the set is removed; otherwise the feature whose removal causes the largest mean increase in performance is removed. This test is then repeated on the remaining features in the set, until either no features are removed or only one feature remains.
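The first step described above, finding feature pairs whose rank correlation exceeds corr_threshold, can be sketched in plain Python. This is an illustration only, not the library's implementation: the helper names (rank, spearman, correlated_pairs) and the toy data are hypothetical, the hierarchical clustering step is omitted, and comparing the absolute value of the coefficient to the threshold is an assumption.

```python
from itertools import combinations
from statistics import mean


def rank(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's coefficient: the Pearson correlation of the ranks.

    Constant inputs (zero rank variance) are not handled in this sketch.
    """
    rx, ry = rank(x), rank(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def correlated_pairs(data, corr_threshold=0.8):
    """List feature pairs with |rho| above the threshold, strongest first.

    Using the absolute value is an assumption of this sketch.
    """
    pairs = []
    for a, b in combinations(data, 2):
        rho = spearman(data[a], data[b])
        if abs(rho) > corr_threshold:
            pairs.append((a, b, rho))
    return sorted(pairs, key=lambda p: -abs(p[2]))


# Hypothetical toy data: "pt_sq" is a monotonic function of "pt"
toy = {
    "pt": [1.0, 2.0, 3.0, 4.0, 5.0],
    "pt_sq": [1.0, 4.0, 9.0, 16.0, 25.0],
    "phi": [0.3, -1.2, 2.0, 0.1, -0.7],
}
pairs = correlated_pairs(toy)
```

Because Spearman's coefficient works on ranks, any monotonic relationship (such as pt and pt_sq in the toy data) is flagged, not only strictly linear ones.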
Since this function involves training many models, it can be slow on large datasets. In such cases one can use the subsample_rate argument to randomly sample a fraction of the whole dataset (with optional stratification). Resampling is performed prior to each RF training for maximum generalisation, and any weights in the data are automatically renormalised to the original weight sum (within each class).
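The weight renormalisation mentioned above can be illustrated with a small, self-contained sketch. It is not taken from the library: the subsample function, the "target" and "weight" keys, sampling without replacement, and stratifying by class are all assumptions made for the example.

```python
import random
from collections import defaultdict


def subsample(rows, rate, seed=0):
    """Keep a fraction of the rows in each class, then rescale the kept
    weights so every class recovers its original weight sum."""
    rng = random.Random(seed)

    # Group rows by class and record the original per-class weight sums
    by_class = defaultdict(list)
    original = defaultdict(float)
    for r in rows:
        by_class[r["target"]].append(r)
        original[r["target"]] += r["weight"]

    sampled = []
    for target, members in by_class.items():
        n_keep = max(1, round(rate * len(members)))
        kept = [dict(r) for r in rng.sample(members, n_keep)]  # copies, no replacement
        scale = original[target] / sum(r["weight"] for r in kept)
        for r in kept:
            r["weight"] *= scale
        sampled.extend(kept)
    return sampled


# Hypothetical toy data: 10 events per class with different weights
rows = [{"target": t, "weight": w} for t, w in [(0, 1.0)] * 10 + [(1, 0.5)] * 10]
half = subsample(rows, rate=0.5)
```

Rescaling within each class keeps the relative normalisation of the classes unchanged, so a weighted metric computed on the subsample remains comparable to one computed on the full dataset.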
Attention
This function combines plot_rank_order_dendrogram() with rf_check_feat_removal(). This is purely for convenience and should not be treated as a ‘black box’. We encourage users to convince themselves that it really is reasonable to remove the features which are identified as redundant.
- Parameters:
train_df (DataFrame) – training data as Pandas DataFrame
val_df (DataFrame) – validation data as Pandas DataFrame
check_feats (List[str]) – complete list of features to consider for training and removal
objective (str) – string representation of objective: either ‘classification’ or ‘regression’
targ_name (str) – name of column containing target data
strat_key (Optional[str]) – name of column to use to stratify data when resampling
wgt_name (Optional[str]) – name of column containing weight data. If set, will use weights for training and evaluation, otherwise will not
corr_threshold (float) – minimum threshold on Spearman’s rank-order correlation coefficient for pairs to be considered ‘correlated’
n_estimators (int) – number of trees to use in each forest
rf_params (Union[Dict, OrderedDict, None]) – either: a dictionary of keyword hyper-parameters to use for the Random Forests, if optimise_rf is False; or an OrderedDict of a range of hyper-parameters to test during optimisation. See get_opt_rf_params() for more details.
optimise_rf (bool) – whether to optimise the Random Forest hyper-parameters for the (sub-sampled) dataset
n_rfs (int) – number of trainings to perform during each performance impact test
subsample_rate (Optional[float]) – float between 0 and 1. If set, will subsample the training data to the requested fraction
savename (Optional[str]) – Optional name of file to which to save the first plot of feature clustering
plot_settings (PlotSettings) – PlotSettings class to control figure appearance
- Return type:
List[str]
- Returns:
Filtered list of training features
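The removal test applied to each cluster of correlated features can be sketched as a greedy loop. This is a simplified illustration under stated assumptions: filter_cluster and score_fn are hypothetical names, score_fn stands in for training and evaluating a Random Forest, and "consistently outperforms" is reduced here to a comparison of mean scores with no uncertainty estimate.

```python
from statistics import mean


def filter_cluster(all_feats, cluster, score_fn, n_rfs=5):
    """Greedily drop features from one correlated cluster while the score holds up.

    score_fn(feats, seed) is a hypothetical callable returning a validation
    score (higher is better) for a model trained on `feats`.
    """
    feats, cluster = list(all_feats), list(cluster)
    while len(cluster) > 1:
        baseline = mean(score_fn(feats, s) for s in range(n_rfs))
        gains = {}
        for f in cluster:
            reduced = [x for x in feats if x != f]
            gains[f] = mean(score_fn(reduced, s) for s in range(n_rfs)) - baseline
        best = max(gains, key=gains.get)
        if gains[best] < 0:  # the full set beats every reduced set: keep all
            break
        feats.remove(best)
        cluster.remove(best)
    return feats


# Hypothetical toy scorer: one point per distinct piece of information,
# minus a small penalty per feature, so duplicates are worth removing
info = {"a": "g1", "a_copy": "g1", "b": "g2"}


def toy_score(feats, seed):
    return len({info[f] for f in feats}) - 0.01 * len(feats)


kept = filter_cluster(["a", "a_copy", "b"], ["a", "a_copy"], toy_score)
```

In the toy case one of the two duplicated features is dropped and the loop stops once a single feature of the cluster remains, matching the stopping conditions described for this function.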
- lumin.optimisation.features.auto_filter_on_mutual_dependence(train_df, val_df, check_feats, objective, targ_name, strat_key=None, wgt_name=None, md_threshold=0.8, n_estimators=40, rf_params=None, optimise_rf=True, n_rfs=5, subsample_rate=None, plot_settings=<lumin.plotting.plot_settings.PlotSettings object>)
Filters a list of possible training features via mutual dependence, i.e. by identifying features whose values can be accurately predicted using the other features. Features with a high ‘dependence’ are then checked to see whether removing them would not decrease the performance of Random Forests trained to perform classification or regression. For best results, the features to check should be supplied in order of decreasing importance.
Dependent features are identified by training Random Forest regressors on the other features. Features with a dependence greater than a set threshold are candidates for removal. Candidate features are tested, in order of increasing importance, by computing the mean performance of Random Forests trained on: all remaining training features; and all remaining training features except the candidate feature. If the RF trained on all remaining features except the candidate feature consistently outperforms or matches the training which uses all remaining features, then the candidate feature is removed; otherwise the feature remains and is no longer tested.
Since evaluating the mutual dependence via regression also identifies the features that are important to each regressor, it is possible to test multiple feature removals at once, provided a removal candidate is not important for predicting another removal candidate.
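One way to read the condition for testing several removals at once is sketched below. This is an interpretation, not the library's code: pick_simultaneous_candidates and both input dictionaries are hypothetical stand-ins for the information in a dependence matrix, and candidates are considered in the order supplied rather than by importance.

```python
def pick_simultaneous_candidates(dependence, predictors, md_threshold=0.8):
    """Choose removal candidates that can be tested together.

    dependence: feature -> dependence score (how well other features predict it)
    predictors: feature -> set of features important for predicting it
    """
    candidates = [f for f, d in dependence.items() if d > md_threshold]
    chosen = []
    for f in candidates:
        # Skip f if it helps predict an already chosen candidate, or vice versa
        clash = any(f in predictors[c] or c in predictors[f] for c in chosen)
        if not clash:
            chosen.append(f)
    return chosen


# Hypothetical toy inputs: "a" and "b" predict each other, "c" depends on "d"
dependence = {"a": 0.95, "b": 0.90, "c": 0.85, "d": 0.20}
predictors = {"a": {"b"}, "b": {"a"}, "c": {"d"}, "d": {"c"}}
batch = pick_simultaneous_candidates(dependence, predictors)
```

Here "a" and "c" can be tested together, while "b" has to wait: if both "a" and "b" were removed at once, the information they share would be lost entirely.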
Since this function involves training many models, it can be slow on large datasets. In such cases one can use the subsample_rate argument to randomly sample a fraction of the whole dataset (with optional stratification). Resampling is performed prior to each RF training for maximum generalisation, and any weights in the data are automatically renormalised to the original weight sum (within each class).
Attention
This function combines RFPImp’s feature_dependence_matrix with rf_check_feat_removal(). This is purely for convenience and should not be treated as a ‘black box’. We encourage users to convince themselves that it really is reasonable to remove the features which are identified as redundant.
Note
Technicalities related to RFPImp’s use of SVG for plots mean that the mutual dependence plots can have low resolution when shown or saved. Therefore this function does not take a savename argument. Users wishing to save the plots as PNG or PDF should compute the dependence matrix themselves using feature_dependence_matrix and then plot using plot_dependence_heatmap, calling .save([savename]) on the returned object. The plotting backend might need to be set to SVG, using: %config InlineBackend.figure_format = 'svg'.
- Parameters:
train_df (DataFrame) – training data as Pandas DataFrame
val_df (DataFrame) – validation data as Pandas DataFrame
check_feats (List[str]) – complete list of features to consider for training and removal
objective (str) – string representation of objective: either ‘classification’ or ‘regression’
targ_name (str) – name of column containing target data
strat_key (Optional[str]) – name of column to use to stratify data when resampling
wgt_name (Optional[str]) – name of column containing weight data. If set, will use weights for training and evaluation, otherwise will not
md_threshold (float) – minimum threshold on the mutual dependence coefficient for a feature to be considered ‘predictable’
n_estimators (int) – number of trees to use in each forest
rf_params (Optional[OrderedDict]) – either: a dictionary of keyword hyper-parameters to use for the Random Forests, if optimise_rf is False; or an OrderedDict of a range of hyper-parameters to test during optimisation. See get_opt_rf_params() for more details.
optimise_rf (bool) – whether to optimise the Random Forest hyper-parameters for the (sub-sampled) dataset
n_rfs (int) – number of trainings to perform during each performance impact test
subsample_rate (Optional[float]) – float between 0 and 1. If set, will subsample the training data to the requested fraction
plot_settings (PlotSettings) – PlotSettings class to control figure appearance
- Return type:
List[str]
- Returns:
Filtered list of training features
- lumin.optimisation.features.get_rf_feat_importance(rf, inputs, targets, weights=None)
Compute feature importance for a Random Forest model using rfpimp.
- Parameters:
rf (Union[RandomForestRegressor, RandomForestClassifier]) – trained Random Forest model
inputs (DataFrame) – input data as Pandas DataFrame
targets (ndarray) – target data as Numpy array
weights (Optional[ndarray]) – Optional data weights as Numpy array
- Return type:
DataFrame
- lumin.optimisation.features.repeated_rf_rank_features(train_df, val_df, n_reps, min_frac_import, objective, train_feats, targ_name='gen_target', wgt_name=None, strat_key=None, subsample_rate=None, resample_val=True, importance_cut=0.0, n_estimators=40, rf_params=None, optimise_rf=True, n_rfs=1, n_max_display=30, n_threads=1, savename=None, plot_settings=<lumin.plotting.plot_settings.PlotSettings object>)
Runs rf_rank_features() multiple times on bootstrap resamples of the training data and computes the fraction of times each feature passes the importance cut. It then returns a list of features whose fractional selection as important is greater than min_frac_import. I.e. in cases where rf_rank_features() is unstable (the list of important features changes from run to run), this method can be used to help stabilise the list of important features.
- Parameters:
train_df (DataFrame) – training data as Pandas DataFrame
val_df (DataFrame) – validation data as Pandas DataFrame
n_reps (int) – number of times to resample and run rf_rank_features()
min_frac_import (float) – minimum fraction of times feature must be selected as important by rf_rank_features() in order to be considered generally important
objective (str) – string representation of objective: either ‘classification’ or ‘regression’
train_feats (List[str]) – complete list of training features
targ_name (str) – name of column containing target data
wgt_name (Optional[str]) – name of column containing weight data. If set, will use weights for training and evaluation, otherwise will not
strat_key (Optional[str]) – name of column to use to stratify data when resampling
subsample_rate (Optional[float]) – if set, will subsample the training data to the provided fraction. Subsample is repeated per Random Forest training
resample_val (bool) – whether to also resample the validation set, or use the original set for all evaluations
importance_cut (float) – minimum importance required to be considered an ‘important feature’
n_estimators (int) – number of trees to use in each forest
rf_params (Optional[Dict[str, Any]]) – optional dictionary of keyword parameters for SK-Learn Random Forests, or an ordered dictionary mapping parameters to optimise to lists of values to consider. If None, will optimise parameters using lumin.optimisation.hyper_param.get_opt_rf_params()
optimise_rf (bool) – if true, will optimise RF params, passing rf_params to get_opt_rf_params()
n_rfs (int) – number of trainings to perform on all training features in order to compute importances
n_max_display (int) – maximum number of features to display in importance plot
n_threads (int) – number of rankings to run simultaneously
savename (Optional[str]) – Optional name of file to which to save the plot of feature importances
plot_settings (PlotSettings) – PlotSettings class to control figure appearance
- Return type:
Tuple[List[str], DataFrame]
- Returns:
List of features with fractional selection greater than min_frac_import, ordered by decreasing fractional selection
DataFrame of number of selections and fractional selections for all features
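The selection-fraction bookkeeping behind this function can be sketched without any model training. In the example below, stable_features and the toy selections are hypothetical; each inner list stands for the features returned by one call to rf_rank_features() on a resample, and each feature is assumed to appear at most once per list.

```python
from collections import Counter


def stable_features(selections, min_frac_import):
    """Return the features selected in more than min_frac_import of the
    repetitions (most frequently selected first), plus all fractions."""
    n_reps = len(selections)
    counts = Counter(f for sel in selections for f in sel)
    fracs = {f: n / n_reps for f, n in counts.items()}
    kept = [f for f, frac in fracs.items() if frac > min_frac_import]
    return sorted(kept, key=lambda f: -fracs[f]), fracs


# Hypothetical outcome of four repetitions
selections = [
    ["pt", "eta", "mass"],
    ["pt", "mass"],
    ["pt", "phi"],
    ["pt", "mass"],
]
kept, fracs = stable_features(selections, min_frac_import=0.5)
```

Features that pass the importance cut only occasionally ("eta" and "phi" in the toy data) are filtered out, which is the stabilising effect described above.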
- lumin.optimisation.features.rf_check_feat_removal(train_df, objective, train_feats, check_feats, targ_name='gen_target', wgt_name=None, val_df=None, subsample_rate=None, strat_key=None, n_estimators=40, n_rfs=1, rf_params=None)
Checks whether features can be removed from the set of training features without degrading model performance, using Random Forests. Computes scores for a model with all training features, then, for each feature listed in check_feats, computes scores for a model trained on all training features except that feature. E.g. if two features are highly correlated, this function could be used to check whether one of them could be removed.
- Parameters:
train_df (DataFrame) – training data as Pandas DataFrame
objective (str) – string representation of objective: either ‘classification’ or ‘regression’
train_feats (List[str]) – complete list of training features
check_feats (List[str]) – list of features to try removing
targ_name (str) – name of column containing target data
wgt_name (Optional[str]) – name of column containing weight data. If set, will use weights for training and evaluation, otherwise will not
val_df (Optional[DataFrame]) – optional validation data as Pandas DataFrame. If set, will compute validation scores in addition to Out Of Bag scores, and will optimise RF parameters if rf_params is None
subsample_rate (Optional[float]) – if set, will subsample the training data to the provided fraction. Subsample is repeated per Random Forest training
strat_key (Optional[str]) – column name to use for stratified subsampling, if desired
n_estimators (int) – number of trees to use in each forest
n_rfs (int) – number of trainings to perform on all training features in order to compute importances
rf_params (Optional[Dict[str, Any]]) – optional dictionary of keyword parameters for SK-Learn Random Forests. If None and val_df is None, will use default parameters of ‘min_samples_leaf’:3, ‘max_features’:0.5. If None and val_df is not None, will optimise parameters using lumin.optimisation.hyper_param.get_opt_rf_params()
- Return type:
Dict[str, float]
- Returns:
Dictionary of results
- lumin.optimisation.features.rf_rank_features(train_df, val_df, objective, train_feats, targ_name='gen_target', wgt_name=None, importance_cut=0.0, n_estimators=40, rf_params=None, optimise_rf=True, n_rfs=1, n_max_display=30, plot_results=True, retrain_on_import_feats=True, verbose=True, savename=None, plot_settings=<lumin.plotting.plot_settings.PlotSettings object>)
Compute relative permutation importance of input features using Random Forests. A reduced set of ‘important features’ is obtained by cutting on relative importance, and a new model is trained and evaluated on this reduced set. RFs will have their hyper-parameters roughly optimised, both when training on all features and once when training on important features. Relative importances may be computed multiple times (via n_rfs) and averaged, in which case the standard error is also computed.
- Parameters:
train_df (DataFrame) – training data as Pandas DataFrame
val_df (DataFrame) – validation data as Pandas DataFrame
objective (str) – string representation of objective: either ‘classification’ or ‘regression’
train_feats (List[str]) – complete list of training features
targ_name (str) – name of column containing target data
wgt_name (Optional[str]) – name of column containing weight data. If set, will use weights for training and evaluation, otherwise will not
importance_cut (float) – minimum importance required to be considered an ‘important feature’
n_estimators (int) – number of trees to use in each forest
rf_params (Optional[Dict[str, Any]]) – optional dictionary of keyword parameters for SK-Learn Random Forests, or an ordered dictionary mapping parameters to optimise to lists of values to consider. If None, will optimise parameters using lumin.optimisation.hyper_param.get_opt_rf_params()
optimise_rf (bool) – if true, will optimise RF params, passing rf_params to get_opt_rf_params()
n_rfs (int) – number of trainings to perform on all training features in order to compute importances
n_max_display (int) – maximum number of features to display in importance plot
plot_results (bool) – whether to plot the feature importances
retrain_on_import_feats (bool) – whether to train a new model on important features to compare to full model
verbose (bool) – whether to report results and progress
savename (Optional[str]) – Optional name of file to which to save the plot of feature importances
plot_settings (PlotSettings) – PlotSettings class to control figure appearance
- Return type:
List[str]
- Returns:
List of features passing importance_cut, ordered by decreasing importance
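The averaging over n_rfs trainings and the subsequent importance cut can be sketched as follows. The sketch assumes the per-forest importances are already available as dictionaries; average_importances and the toy numbers are hypothetical, and treating the cut as a strict "greater than" is an assumption.

```python
from statistics import mean, stdev


def average_importances(runs, importance_cut=0.0):
    """Average per-feature importances over several forests and apply the cut.

    runs: list of dicts mapping feature -> importance, one per trained forest.
    Returns (feature, mean, standard error) tuples passing the cut, best first.
    """
    summary = []
    for feat in runs[0]:
        vals = [r[feat] for r in runs]
        # Standard error of the mean; undefined for a single run, so use 0
        sem = stdev(vals) / len(vals) ** 0.5 if len(vals) > 1 else 0.0
        summary.append((feat, mean(vals), sem))
    passing = [s for s in summary if s[1] > importance_cut]
    return sorted(passing, key=lambda s: -s[1])


# Hypothetical importances from three forests (n_rfs=3)
runs = [
    {"pt": 0.30, "eta": 0.02, "phi": -0.01},
    {"pt": 0.34, "eta": 0.04, "phi": 0.00},
    {"pt": 0.32, "eta": 0.03, "phi": -0.02},
]
ranked = average_importances(runs)
```

Permutation importances can come out slightly negative for uninformative features, which is why a cut at 0.0 already removes "phi" in the toy data.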
lumin.optimisation.hyper_param module
- lumin.optimisation.hyper_param.get_opt_rf_params(x_trn, y_trn, x_val, y_val, objective, w_trn=None, w_val=None, params=None, n_estimators=40, verbose=True)
Use an ordered parameter-scan to roughly optimise Random Forest hyper-parameters.
- Parameters:
x_trn (ndarray) – training input data
y_trn (ndarray) – training target data
x_val (ndarray) – validation input data
y_val (ndarray) – validation target data
objective (str) – string representation of objective: either ‘classification’ or ‘regression’
w_trn (Optional[ndarray]) – training weights
w_val (Optional[ndarray]) – validation weights
params (Optional[OrderedDict]) – ordered dictionary mapping parameters to optimise to list of values to consider
n_estimators (int) – number of trees to use in each forest
verbose – Print extra information and show a live plot of model performance
- Returns:
params: dictionary mapping parameters to their optimised values
rf: best performing Random Forest
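An ordered parameter scan can be read as optimising one hyper-parameter at a time, in the order given, and fixing each best value before moving to the next. The sketch below illustrates that reading only; whether the library's scan works exactly this way is an assumption, and ordered_scan, score_fn and the toy scorer are hypothetical.

```python
from collections import OrderedDict


def ordered_scan(params, score_fn):
    """Optimise parameters one at a time, in order, fixing each best value.

    score_fn(settings) is a hypothetical callable returning a validation
    score (higher is better) for a model built with the given settings.
    """
    best = {}
    for name, values in params.items():
        scores = {v: score_fn({**best, name: v}) for v in values}
        best[name] = max(scores, key=scores.get)
    return best


def toy_score(settings):
    """Hypothetical stand-in for training and scoring a Random Forest."""
    msl = settings["min_samples_leaf"]
    mf = settings.get("max_features", 1.0)  # default until it is scanned
    return -((msl - 5) ** 2) - 10 * (mf - 0.5) ** 2


grid = OrderedDict([
    ("min_samples_leaf", [1, 3, 5, 10]),
    ("max_features", [0.3, 0.5, 0.7, 1.0]),
])
best = ordered_scan(grid, toy_score)
```

Such a scan trains one model per listed value rather than one per combination, which is why the result is only roughly optimal and why the order of the dictionary matters.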
- lumin.optimisation.hyper_param.lr_find(fy, model_builder, bs, n_epochs=1, train_on_weights=True, n_repeats=-1, lr_bounds=[1e-05, 10], cb_partials=None, plot_settings=<lumin.plotting.plot_settings.PlotSettings object>, bulk_move=True, plot_savename=None, show_plot=True)
Wrapper function for training using LRFinder, which runs a Smith LR range test (https://arxiv.org/abs/1803.09820) using folds in FoldYielder. Trains models for a set number of repeats, interpolating LR between set bounds. This repeats for each fold in FoldYielder, and loss evolution is averaged.
- Parameters:
fy (FoldYielder) – FoldYielder providing training data
model_builder (ModelBuilder) – ModelBuilder providing networks and optimisers
bs (int) – batch size
n_epochs (int) – number of epochs to train per fold
train_on_weights (bool) – If weights are present, whether to use them for training
shuffle_fold – whether to shuffle data in folds
n_folds – if >= 1, will only train n_folds number of models, otherwise will train one model per fold
lr_bounds (Tuple[float, float]) – starting and ending LR values
cb_partials (Optional[List[partial]]) – optional list of functools.partial, each of which will instantiate a Callback when called
plot_settings (PlotSettings) – PlotSettings class to control figure appearance
savename – Optional name of file to which to save the plot
show_plot (bool) – whether to show the plot, or just save it
- Return type:
List[LRFinder]
- Returns:
List of LRFinder which were used for each model trained
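The two ingredients of the range test described above, a learning rate that sweeps between lr_bounds and a loss curve averaged over folds, can be sketched in plain Python. Both helper names are hypothetical, and the geometric (log-uniform) interpolation is an assumption; the source only states that the LR is interpolated between the bounds.

```python
def lr_range_schedule(lr_bounds, n_steps):
    """Learning rates interpolated geometrically from lr_bounds[0] to lr_bounds[1]."""
    lo, hi = lr_bounds
    if n_steps < 2:
        return [lo]
    ratio = (hi / lo) ** (1 / (n_steps - 1))
    return [lo * ratio ** i for i in range(n_steps)]


def average_losses(per_fold_losses):
    """Average the loss recorded at each step across folds."""
    return [sum(step) / len(step) for step in zip(*per_fold_losses)]


# Seven steps spanning the default bounds: one learning rate per decade
lrs = lr_range_schedule((1e-5, 10), n_steps=7)

# Hypothetical loss curves from two folds, one value per step
mean_loss = average_losses([[1.0, 0.5, 2.0], [3.0, 1.5, 4.0]])
```

A suitable learning rate is usually read off the averaged curve somewhat below the point where the loss starts to diverge.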
lumin.optimisation.threshold module
- lumin.optimisation.threshold.binary_class_cut_by_ams(df, top_perc=5.0, min_pred=0.9, wgt_factor=1.0, br=0.0, syst_unc_b=0.0, pred_name='pred', targ_name='gen_target', wgt_name='gen_weight', plot_settings=<lumin.plotting.plot_settings.PlotSettings object>)
Optimise a cut on a signal-background classifier prediction by the Approximate Median Significance (AMS). The cut should generalise better by taking the mean class prediction of the top top_perc percentage of points as ranked by AMS.
- Parameters:
df (DataFrame) – Pandas DataFrame containing data
top_perc (float) – top percentage of events to consider as ranked by AMS
min_pred (float) – minimum prediction to consider
wgt_factor (float) – single multiplicative coefficient for rescaling signal and background weights before computing AMS
br (float) – background offset bias
syst_unc_b (float) – fractional systematic uncertainty on background
pred_name (str) – column to use as predictions
targ_name (str) – column to use as truth labels for signal and background
wgt_name (str) – column to use as weights for signal and background events
plot_settings (PlotSettings) – PlotSettings class to control figure appearance
- Return type:
Tuple[float, float, float]
- Returns:
Optimised cut
AMS at cut
Maximum AMS
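The cut optimisation can be sketched with the commonly used AMS definition, sqrt(2((s + b + br) ln(1 + s / (b + br)) - s)), where s and b are the weighted signal and background yields passing the cut. This is an illustration under assumptions: the source does not confirm that the library uses exactly this formula, the wgt_factor and syst_unc_b options are omitted, and the function names and toy events are hypothetical.

```python
import math


def ams(s, b, br=0.0):
    """Approximate Median Significance for signal yield s and background yield b."""
    if b + br <= 0:
        return 0.0  # guard chosen for this sketch: no background estimate
    return math.sqrt(2 * ((s + b + br) * math.log(1 + s / (b + br)) - s))


def cut_by_ams(events, top_perc=5.0, min_pred=0.9, br=0.0):
    """events: list of (prediction, is_signal, weight) tuples.

    Returns (cut, AMS at cut, maximum AMS).
    """
    def yields(threshold):
        s = sum(w for p, sig, w in events if p >= threshold and sig)
        b = sum(w for p, sig, w in events if p >= threshold and not sig)
        return s, b

    # AMS obtained when cutting at each event's prediction
    scan = [(ams(*yields(p), br), p) for p, _, _ in events if p >= min_pred]
    if not scan:
        raise ValueError("no predictions at or above min_pred")
    scan.sort(reverse=True)

    # Mean prediction of the top top_perc percent of points, ranked by AMS
    n_top = max(1, math.ceil(len(scan) * top_perc / 100))
    cut = sum(p for _, p in scan[:n_top]) / n_top
    return cut, ams(*yields(cut), br), scan[0][0]


# Hypothetical toy events: (prediction, is_signal, weight)
events = [
    (0.99, True, 1.0), (0.97, True, 1.0), (0.95, False, 2.0),
    (0.93, True, 1.0), (0.91, False, 3.0), (0.50, False, 10.0),
]
cut, ams_at_cut, max_ams = cut_by_ams(events, top_perc=50.0)
```

Averaging over the top-ranked points rather than taking the single best one makes the cut less sensitive to statistical fluctuations in the scan, so the AMS at the returned cut can be lower than the maximum AMS.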