Core concepts¶
The fold file¶
The fold file is the core data structure used throughout LUMIN. It is stored on disk as an HDF5 file. At the top level are several groups. The meta_data group stores various datasets containing information about the data, such as the names of features. The other top-level groups are the folds. These store subsamples of the full dataset, are designed to be read into memory individually, and provide several advantages:
- Memory requirements are reduced
- Specific fold indices can be designated for training and others for validation, e.g. for k-fold cross-validation
- Some methods can compute averaged metrics over folds and produce uncertainties based on standard deviation
Each fold group contains several datasets:
- targets: will be used to provide target data for training NNs
- inputs: contains the input data
- weights: if present, will be used to weight losses during training
- matrix_inputs: can be used to store 2D matrix, or higher-order (sparse) tensor data
Additional datasets can be added, too, e.g. extra features that are necessary for interpreting results. Named predictions can also be saved to the fold file, e.g. during Model.predict. Datasets can also be compressed to reduce size and loading time.
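Because the fold file is ordinary HDF5, its layout can be inspected directly with h5py. A minimal sketch (the file path is hypothetical; the fold-group and dataset names follow the description above):

```python
import h5py

# Inspect the structure of a fold file (path assumed for illustration)
with h5py.File('data/example.hdf5', 'r') as f:
    print(list(f.keys()))            # e.g. ['fold_0', 'fold_1', ..., 'meta_data']
    print(list(f['fold_0'].keys()))  # e.g. ['inputs', 'targets', 'weights']
```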
Creating fold files¶
lumin.data_processing.file_proc contains the recommended methods for constructing fold files from pandas.DataFrame objects, with the main construction method being df2foldfile, although the methods it calls can be used directly for complex or large data.
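As a rough sketch of fold-file construction (the column names and save path are invented for illustration, and the argument names should be checked against the df2foldfile docs):

```python
import pandas as pd
from lumin.data_processing.file_proc import df2foldfile

# Toy DataFrame: two continuous features, one categorical feature,
# a per-event weight, and a binary target (all column names are examples)
df = pd.DataFrame({'feat_0': [0.1, 0.5, 0.3, 0.9],
                   'feat_1': [1.2, 0.7, 0.4, 0.8],
                   'channel': [0, 1, 1, 0],
                   'weight': [1.0, 0.8, 1.1, 0.9],
                   'target': [0, 1, 0, 1]})

# Split the data into 2 folds and write them to data/example.hdf5,
# stratifying the split by the target to balance classes across folds
df2foldfile(df=df, n_folds=2,
            cont_feats=['feat_0', 'feat_1'],
            cat_feats=['channel'],
            targ_feats='target', targ_type='int',
            wgt_feat='weight',
            strat_key='target',
            savename='data/example')
```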
Reading fold files¶
The main interface class is the FoldYielder. Its primary function is to load data from the fold files, however it can also act as a hub for meta information and objects concerning the dataset, such as feature names and processing pipelines. Specific features can be marked as ‘ignore’ and will be filtered out when loading folds.
Calling fy.get_fold(i) or indexing an instance (fy[i]) will return a dictionary of inputs, targets, and weights for fold i, via the get_data method. Flat inputs will be passed through np.nan_to_num. If matrix or tensor inputs are present, they will be processed into a tuple with the flat data ([flat inputs, dense tensor]).
fy.get_df can be used to construct a pandas.DataFrame from the data (either specific folds, or all folds together). The method has various arguments for controlling which columns should be included; by default only targets, weights, and predictions are included. Additional datasets can also be loaded via the get_column method.
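A minimal sketch of reading the file back (the import path and the inc_inputs argument are assumptions; check the FoldYielder docs):

```python
from lumin.nn.data.fold_yielder import FoldYielder

# Open the fold file written in the construction example above
fy = FoldYielder('data/example.hdf5')

fold = fy[0]  # equivalent to fy.get_fold(0)
print(fold['inputs'].shape, fold['targets'].shape, fold['weights'].shape)

# Collect all folds into a single DataFrame, including the input features
df = fy.get_df(inc_inputs=True)
```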
Since during training and inference folds are loaded into memory one at a time, used once, and overwritten, LUMIN can optionally apply data augmentation when loading folds. The inheriting class HEPAugFoldYielder provides an example of this, in which particle collision events can be rotated and flipped.
Models¶
Model building¶
In contrast to other high-level interfaces, in LUMIN the user is expected to define how models, optimisers, and loss functions should be built, rather than building them themselves. The ModelBuilder class helps to capture these definitions and, once instantiated, can be used to produce models on demand.
LUMIN models consist of three types of blocks:

- The Head, which takes all inputs from the data and processes them if necessary.
    - The default head is CatEmbHead, which passes continuous inputs through an optional dropout layer, and categorical inputs through embedding matrices (see Guo & Berkhahn, 2016) and an optional dropout layer.
    - Matrix or tensor data can also be passed through appropriate head blocks, e.g. RNNs, CNNs, and GNNs.
    - Data containing both matrix/tensor data and flat data (continuous+categorical) can be passed through a MultiHead block, which in turn sends data through the appropriate head and concatenates the outputs.
    - The output of the head is a flat vector (batch, head width).
- The Body, which is where the majority of the computation occurs (at least in the case of a flat FCNN). The default body is FullyConnected, consisting of multiple hidden layers.
    - MultiBlock can be used to split features across separate body blocks, e.g. for wide-deep networks.
    - The output of the body is also a flat vector (batch, body width).
- The Tail, which is designed to alter the body width to match the target width, as well as apply any output activation function and output rescaling.
    - The default tail is ClassRegMulti, which can handle single- & multi-target regression, and binary, multi-label, and multi-class classification (it configures itself using the objective attribute of the ModelBuilder).
The ModelBuilder has arguments to change the blocks from their default values. Custom blocks should be passed as classes, rather than instantiated objects (i.e. use partial to configure their arguments). Some block arguments will be set automatically by the ModelBuilder: for heads, these are cont_feats, cat_embedder, lookup_init, and freeze; for bodies, n_in, feat_map, lookup_init, lookup_act, and freeze; and for tails, n_in, n_out, objective, lookup_init, and freeze. model_args can also be used to set arguments via a dictionary, e.g. {'head': {'depth': 3}}.
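A sketch of configuring a ModelBuilder (the objective string and import paths are assumptions; note the body is passed as a partial-wrapped class, with n_in etc. left for the ModelBuilder to fill in):

```python
from functools import partial
from lumin.nn.models.model_builder import ModelBuilder
from lumin.nn.models.blocks.body import FullyConnected

# Sketch: a binary classifier with a customised body; the ModelBuilder
# will automatically supply n_in, lookup_init, etc. to the body block
model_builder = ModelBuilder(objective='classification',
                             cont_feats=['feat_0', 'feat_1'],
                             n_out=1,
                             body=partial(FullyConnected, depth=3, width=100),
                             opt_args={'lr': 2e-3})
```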
The ModelBuilder also returns an optimiser set to update the parameters of the model. This can be configured via opt_args, and custom optimisers can be passed as classes to 'opt', e.g. opt_args={'opt':AdamW, 'lr':3e-2}. The loss function is controlled by the loss argument, and can either be left as auto and set via the objective, or explicitly set to a class by the user. Pretrained models can be used by setting the pretrain_file argument to a previously trained model, which will then be loaded when a new model is built.
Model wrapper¶
The models built by the ModelBuilder are torch.nn.Module objects, and so to provide high-level functionality, LUMIN wraps these objects with a Model class. This provides a range of methods for e.g. training, saving, loading, and predicting with DNNs. The torch.nn.Module is set as the Model.model attribute.
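For example, wrapping the builder from the sketch above:

```python
from lumin.nn.models.model import Model

# Wrap the definitions captured by the ModelBuilder in the high-level
# interface; the underlying torch.nn.Module lives at model.model
model = Model(model_builder)
print(model.model)
```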
A similar high-level wrapper class exists for ensembles (Ensemble), in which the methods extend over a range of Model objects.
Model training¶
Model.fit will train Model.model using data provided via a FoldYielder. A specific fold index can be set to be used as validation data, and the rest will be used as training data (or the user can specify explicitly which fold indices to use for training). Callbacks can be used to augment the training, as described later on. Training is ‘stateful’, with a Model.fit_params object having various attributes such as the data, current state (training or validation), and callbacks. Since each callback has the model as an attribute, they can access all aspects of the training via fit_params.
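A hypothetical call, continuing the sketches above (argument names are assumed from typical usage; consult the Model.fit signature):

```python
from pathlib import Path

# Sketch: train on the remaining folds of the file, validating on fold 0
model.fit(n_epochs=10, fy=fy, bs=256, val_idx=0,
          cb_savepath=Path('train_weights'))
```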
Training proceeds as follows:

- For epoch in epochs:
    - Training epoch begins:
        - Training-fold indices are shuffled
        - For fold in training folds (referred to as a sub-epoch):
            - Load the fold data into a BatchYielder, a class that yields batches of input, target, and weight data
            - For batch in batches:
                - Pass inputs x through the network to get predictions y_pred
                - Compute the loss based on y_pred and targets y
                - Back-propagate the loss through the network
                - Update the network parameters using the optimiser
    - Validation epoch begins:
        - Load the validation-fold data into a BatchYielder
        - For batch in batches:
            - Pass inputs x through the network to get predictions y_pred
            - Compute the loss based on y_pred and targets y
Training method¶
Whilst Model.fit can be used directly, there is still a lot of boilerplate code that must be written to support convenient training and monitoring of models; moreover, one of the distinguishing characteristics of LUMIN is that training many models should be as easy as training one. To this end, the recommended training function is train_models. This function will train a specified number of models and save them to a specified directory. It doesn’t return the trained models, but rather a dictionary of results containing information about the training and the paths to the models. This can then be used to instantiate an Ensemble via the from_results class-method.
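A sketch of this workflow (import paths and some argument names are assumptions; check the train_models docs):

```python
from pathlib import Path
from lumin.nn.training.train import train_models
from lumin.nn.ensemble.ensemble import Ensemble

# Sketch: train 10 models, saving them and their training information,
# then build an ensemble from the returned results dictionary
results, histories, cycle_losses = train_models(
    fy, n_models=10, bs=256, model_builder=model_builder,
    n_epochs=30, patience=5, savepath=Path('train_weights'))
ensemble = Ensemble.from_results(results, size=10, model_builder=model_builder)
```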
Callbacks¶
Just like in Keras and FastAI, callbacks are a powerful and flexible way to augment the general training loop outlined above, by offering a series of fine-grained interjection points:
- on_train_begin: after all preparations are made and the first epoch is about to begin; allows callbacks to initialise and prepare for the training
- on_epoch_begin: when a new training or validation epoch is about to begin
- on_fold_begin: when a new training or validation fold is about to begin and after the batch yielder has been instantiated; allows callbacks to modify the entirety of the data for the fold via fit_params.by
- on_batch_begin: when a new batch of data is about to be processed and inputs, targets, and weights have been set to fit_params.x, fit_params.y, and fit_params.w; allows callbacks to modify the batch before it is passed through the network
- on_forwards_end: after the inputs have been passed through the network and the predictions fit_params.y_pred and the loss value fit_params.loss_val computed; allows callbacks to modify the loss before it is back-propagated (e.g. adversarial training), or to compute a new loss value and set fit_params.loss_val manually
- on_backwards_begin: after the optimiser gradients have been zeroed and before the loss value has been back-propagated
- on_backwards_end: after the loss value has been back-propagated but before the optimiser update has been made; allows callbacks to modify the parameter gradients
- on_batch_end: after the batch has been processed, the loss computed, and any parameter updates made
- on_fold_end: after a training or validation fold has finished
- on_epoch_end: after a training or validation epoch has finished
- on_train_end: after the training has finished; allows callbacks to clean up and compute final results
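As an illustration, a toy callback using one of these hooks might look like the following (the base-class import path is assumed, and the loss scaling itself is purely illustrative):

```python
from lumin.nn.callbacks.callback import Callback

# Sketch of a custom callback: rescale the loss before back-propagation
class LossScaler(Callback):
    def __init__(self, scale: float = 2.0):
        super().__init__()
        self.scale = scale

    def on_forwards_end(self) -> None:
        # fit_params is reachable through the wrapped model
        self.model.fit_params.loss_val = self.model.fit_params.loss_val * self.scale
```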
In addition to callbacks during training, LUMIN offers callbacks during prediction, which can interject at:
- on_pred_begin: after all preparations are made and the prediction data has been loaded into a BatchYielder
- on_batch_begin
- on_forwards_end
- on_batch_end
- on_pred_end: after predictions have been made for all the data
Callbacks passed to the Model prediction methods come in two varieties: normal callbacks can be passed to cbs, and a special prediction callback can be passed to pred_cb. The prediction callback is responsible for storing and processing model predictions, and then returning them via a get_preds method. The default prediction callback simply returns predictions in the same order they were generated, however users may wish to e.g. rescale or bin predictions for convenience. An example use for other callbacks during prediction would be ParameterisedPrediction, for inference with parameterised-training models (Baldi et al., 2016).
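For instance, saving named predictions back into the fold file might look like this (argument names assumed):

```python
# Sketch: compute predictions for every fold and save them into the
# fold file under the name 'pred'
model.predict(fy, pred_name='pred')

# The named predictions then appear in DataFrames built by fy.get_df
df = fy.get_df(pred_name='pred')
```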
Callbacks in LUMIN¶
A range of common or useful callbacks is provided in LUMIN:
- Optimiser and Cyclic callbacks are designed to modify optimiser hyper-parameters during training, e.g. OneCycle (Smith, 2018). Classes inheriting from AbsCyclicCallback can signal to other callbacks to only act when a cycle has finished (e.g. stop training after no improvement).
- Data callbacks modify aspects of the data, e.g. for label smoothing, resampling, and replacing/removing values and data.
- Loss callbacks adjust the loss values and gradients, or even manually compute losses themselves.
- Model callbacks are a special type of callback that trains alternative models and can be polled for loss values, have their performance tracked, and have their models saved instead of the main model, e.g. SWA (Izmailov et al., 2018).
- Monitor callbacks keep track of performance during training and provide a realtime report of metrics. Additionally, they can be used to save models when performance improves and to stop training after improvements cease.
- Prediction handler callbacks are responsible for storing and adjusting the network outputs when predicting on new data.
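Since callbacks are instantiated per model when using train_models, they would typically be passed as partial-configured classes. A sketch (the import path and argument names for OneCycle, and the cb_partials argument, are assumptions):

```python
from functools import partial
from lumin.nn.callbacks.cyclic_callbacks import OneCycle

# Sketch: a one-cycle LR schedule, to be passed to train_models via
# its cb_partials argument rather than as an instantiated object
cb_partials = [partial(OneCycle, lengths=(15, 30), lr_range=[1e-4, 1e-2])]
```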