lumin.nn.data package

Submodules

lumin.nn.data.batch_yielder module

class lumin.nn.data.batch_yielder.BatchYielder(inputs, bs, objective, targets=None, weights=None, shuffle=True, use_weights=True, bulk_move=True, input_mask=None, drop_last=True)[source]

Bases: object

Yields minibatches to the model during training. Each iteration provides one minibatch as a tuple of tensors: inputs, targets, and weights.

TODO: Improve this/change to dataloader

Parameters:
  • inputs (Union[ndarray, Tuple[ndarray, ndarray]]) – input array for (sub-)epoch

  • targets (Optional[ndarray]) – target array for (sub-)epoch

  • bs (int) – batch size, i.e. the number of data points to include per minibatch

  • objective (str) – ‘classification’, ‘multiclass classification’, or ‘regression’. Used for casting target dtype.

  • weights (Optional[ndarray]) – Optional weight array for (sub-)epoch

  • shuffle (bool) – whether to shuffle the data at the beginning of an iteration

  • use_weights (bool) – if passed weights, whether to actually pass them to the model

  • bulk_move (bool) – whether to move all data to the device at once. Default is True (saves time); set to False if the device has limited memory.

  • input_mask (Optional[ndarray]) – optionally only use Boolean-masked inputs

  • drop_last (bool) – whether to drop the last batch if it does not contain bs elements
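
To illustrate how bs, shuffle, and drop_last interact, here is a minimal pure-Python sketch of minibatch iteration. It is not LUMIN's implementation: iter_minibatches and its seed argument are hypothetical names introduced for illustration, and plain lists stand in for the arrays and tensors.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def iter_minibatches(inputs: Sequence, targets: Sequence, bs: int,
                     shuffle: bool = True, drop_last: bool = True,
                     seed: int = 0) -> Iterator[Tuple[List, List]]:
    """Yield (inputs, targets) minibatches of size bs (hypothetical helper, not LUMIN code)."""
    idxs = list(range(len(inputs)))
    if shuffle:
        # Reorder the data once, at the start of the iteration
        random.Random(seed).shuffle(idxs)
    for start in range(0, len(idxs), bs):
        batch = idxs[start:start + bs]
        if drop_last and len(batch) < bs:
            break  # skip the final batch if it holds fewer than bs elements
        yield [inputs[i] for i in batch], [targets[i] for i in batch]


x = list(range(10))
y = [v % 2 for v in x]
n_dropped = sum(1 for _ in iter_minibatches(x, y, bs=4, drop_last=True))
n_kept = sum(1 for _ in iter_minibatches(x, y, bs=4, drop_last=False))
print(n_dropped, n_kept)  # two full batches vs. three batches including the short one
```

With ten data points and bs=4, drop_last=True discards the final two-element batch, so every batch passed on has exactly bs elements.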

get_inputs(on_device=False)[source]

Returns all data.

Parameters:

on_device (bool) – whether to place tensor on device

Return type:

Union[Tensor, Tuple[Tensor, Tensor]]

Returns:

tuple of inputs, targets, and weights as tensors on device

class lumin.nn.data.batch_yielder.TorchGeometricBatchYielder(inputs, bs, shuffle=True, exclude_keys=None, use_weights=True, **kwargs)[source]

Bases: BatchYielder

BatchYielder for PyTorch Geometric data. kwargs for compatibility only.

Parameters:
  • inputs (Dataset) – PyTorch Geometric Dataset containing inputs, weights, and targets

  • bs (int) – batch size, i.e. the number of data points to include per minibatch

  • shuffle (bool) – whether to shuffle the data at the beginning of an iteration

  • exclude_keys (Optional[List[str]]) – data keys to exclude from inputs

get_inputs(on_device=False)[source]

Returns all data.

Parameters:

on_device (bool) – whether to place tensor on device

Return type:

Union[Tensor, Tuple[Tensor, Tensor]]

Returns:

tuple of inputs, targets, and weights as dictionaries of tensors on device

lumin.nn.data.fold_yielder module

class lumin.nn.data.fold_yielder.FoldYielder(foldfile, cont_feats=None, cat_feats=None, ignore_feats=None, input_pipe=None, output_pipe=None, yield_matrix=True, matrix_pipe=None, batch_yielder_type=<class 'lumin.nn.data.batch_yielder.BatchYielder'>)[source]

Bases: object

Interface class for accessing data from foldfiles created by df2foldfile()

Parameters:
  • foldfile (Union[str, Path, File]) – filename of hdf5 file or opened hdf5 file

  • cont_feats (Optional[List[str]]) – list of names of continuous features present in input data, not required if foldfile contains meta data already

  • cat_feats (Optional[List[str]]) – list of names of categorical features present in input data, not required if foldfile contains meta data already

  • ignore_feats (Optional[List[str]]) – optional list of input features which should be ignored

  • input_pipe (Union[str, Pipeline, Path, None]) – optional Pipeline, or filename for pickled Pipeline, which was used for processing the inputs

  • output_pipe (Union[str, Pipeline, Path, None]) – optional Pipeline, or filename for pickled Pipeline, which was used for processing the targets

  • yield_matrix (bool) – whether to actually yield matrix data if present

  • matrix_pipe (Union[str, Pipeline, Path, None]) – preprocessing pipe for matrix data

  • batch_yielder_type (Type[BatchYielder]) – Class of BatchYielder to instantiate to yield inputs

Examples:
>>> fy = FoldYielder('train.h5')
>>>
>>> fy = FoldYielder('train.h5', ignore_feats=['phi'], input_pipe='input_pipe.pkl')
>>>
>>> fy = FoldYielder('train.h5', input_pipe=input_pipe, matrix_pipe=matrix_pipe)
>>>
>>> fy = FoldYielder('train.h5', input_pipe=input_pipe, yield_matrix=False)
add_ignore(feats)[source]

Add features to ignored features.

Parameters:

feats (Union[str, List[str]]) – list of feature names to ignore

Return type:

None

add_input_pipe(input_pipe)[source]

Adds an input pipe to the FoldYielder for use when deprocessing data

Parameters:

input_pipe (Union[str, Pipeline]) – Pipeline which was used for preprocessing the input data or name of pkl file containing Pipeline

Return type:

None

add_input_pipe_from_file(name)[source]

Adds an input pipe from a pkl file to the FoldYielder for use when deprocessing data

Parameters:

name (Union[str, Path]) – name of pkl file containing Pipeline which was used for preprocessing the input data

Return type:

None

add_matrix_pipe(matrix_pipe)[source]

Adds a matrix pipe to the FoldYielder for use when deprocessing data

Warning

Deprocessing matrix data is not yet implemented

Parameters:

matrix_pipe (Union[str, Pipeline]) – Pipeline which was used for preprocessing the matrix data or name of pkl file containing Pipeline

Return type:

None

add_matrix_pipe_from_file(name)[source]

Adds a matrix pipe from a pkl file to the FoldYielder for use when deprocessing data

Parameters:

name (str) – name of pkl file containing Pipeline which was used for preprocessing the matrix data

Return type:

None

add_output_pipe(output_pipe)[source]

Adds an output pipe to the FoldYielder for use when deprocessing data

Parameters:

output_pipe (Union[str, Pipeline]) – Pipeline which was used for preprocessing the target data or name of pkl file containing Pipeline

Return type:

None

add_output_pipe_from_file(name)[source]

Adds an output pipe from a pkl file to the FoldYielder for use when deprocessing data

Parameters:

name (Union[str, Path]) – name of pkl file containing Pipeline which was used for preprocessing the target data

Return type:

None

close()[source]

Closes the foldfile

Return type:

None

columns()[source]

Returns list of columns present in foldfile

Return type:

List[str]

Returns:

list of columns present in foldfile

get_column(column, n_folds=None, fold_idx=None, add_newaxis=False)[source]

Load column (h5py group) from foldfile. Used for getting arbitrary data which isn’t automatically grabbed by other methods.

Parameters:
  • column (str) – name of h5py group to get

  • n_folds (Optional[int]) – number of folds to get data from. Default all folds. Not compatible with fold_idx

  • fold_idx (Optional[int]) – Only load group from a single, specified fold. Not compatible with n_folds

  • add_newaxis (bool) – whether to expand the shape of the returned data if its shape is ()

Return type:

Optional[ndarray]

Returns:

Numpy array of column data

get_data(n_folds=None, fold_idx=None)[source]

Get data for a single, specified fold or for several folds. Data consists of a dictionary of inputs, targets, and weights. Does not account for ignored features. Inputs are passed through np.nan_to_num to deal with NaNs and infs.

Parameters:
  • n_folds (Optional[int]) – number of folds to get data from. Default all folds. Not compatible with fold_idx

  • fold_idx (Optional[int]) – Only load group from a single, specified fold. Not compatible with n_folds

Return type:

Dict[str, ndarray]

Returns:

dictionary of inputs, targets, and weights as Numpy arrays
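
As a rough illustration of the n_folds and fold_idx arguments, the sketch below picks which fold indices would be read. select_folds is a hypothetical helper, not part of LUMIN. It assumes that n_folds selects the first n_folds folds and that supplying both arguments raises an error; neither behaviour is confirmed by the description above.

```python
from typing import List, Optional


def select_folds(total_folds: int, n_folds: Optional[int] = None,
                 fold_idx: Optional[int] = None) -> List[int]:
    """Return the fold indices to load (hypothetical helper, not LUMIN code)."""
    if n_folds is not None and fold_idx is not None:
        # Assumption: the two arguments are mutually exclusive
        raise ValueError("n_folds and fold_idx cannot both be set")
    if fold_idx is not None:
        return [fold_idx]  # a single, specified fold
    if n_folds is None:
        return list(range(total_folds))  # default: all folds
    # Assumption: take the first n_folds folds, capped at the number available
    return list(range(min(n_folds, total_folds)))


all_folds = select_folds(total_folds=5)
first_two = select_folds(total_folds=5, n_folds=2)
only_third = select_folds(total_folds=5, fold_idx=3)
```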

get_data_count(idxs=None)[source]

Returns total number of data entries in requested folds

Parameters:

idxs (Union[int, List[int], None]) – list of indices to check

Return type:

int

Returns:

Total number of entries in the folds

get_df(pred_name='pred', targ_name='targets', wgt_name='weights', n_folds=None, fold_idx=None, inc_inputs=False, inc_ignore=False, deprocess=False, verbose=True, suppress_warn=False, nan_to_num=False, inc_matrix=False)[source]

Get a Pandas DataFrame of the data in the foldfile. Will add columns for inputs (if requested), targets, weights, and predictions (if present)

Parameters:
  • pred_name (str) – name of prediction group

  • targ_name (str) – name of target group

  • wgt_name (str) – name of weight group

  • n_folds (Optional[int]) – number of folds to get data from. Default all folds. Not compatible with fold_idx

  • fold_idx (Optional[int]) – Only load group from a single, specified fold. Not compatible with n_folds

  • inc_inputs (bool) – whether to include input data

  • inc_ignore (bool) – whether to include ignored features

  • deprocess (bool) – whether to deprocess inputs and targets if pipelines have been added

  • verbose (bool) – whether to print the number of datapoints loaded

  • suppress_warn (bool) – whether to suppress the warning about missing columns

  • nan_to_num (bool) – whether to pass input data through np.nan_to_num

  • inc_matrix (bool) – whether to include flattened matrix data in output, if present

Return type:

DataFrame

Returns:

Pandas DataFrame with requested data

get_fold(idx)[source]

Get data for single fold. Data consists of dictionary of inputs, targets, and weights. Accounts for ignored features. Inputs, except for matrix data, are passed through np.nan_to_num to deal with nans and infs.

Parameters:

idx (int) – fold index to load

Return type:

Dict[str, ndarray]

Returns:

dictionary of inputs, targets, and weights as Numpy arrays

get_ignore()[source]

Returns list of ignored features

Return type:

List[str]

Returns:

Features removed from training data

get_use_cat_feats()[source]

Returns list of categorical features which will be present in training data, accounting for ignored features.

Return type:

List[str]

Returns:

List of categorical features

get_use_cont_feats()[source]

Returns list of continuous features which will be present in training data, accounting for ignored features.

Return type:

List[str]

Returns:

List of continuous features

save_fold_pred(pred, fold_idx, pred_name='pred')[source]

Save predictions for given fold as a new column in the foldfile

Parameters:
  • pred (ndarray) – array of predictions in the same order as data appears in the file

  • fold_idx (int) – index for fold

  • pred_name (str) – name of column to save predictions under

Return type:

None

class lumin.nn.data.fold_yielder.HEPAugFoldYielder(foldfile, cont_feats=None, cat_feats=None, ignore_feats=None, aug_targ_feats=None, rot_mult=2, random_rot=False, reflect_x=False, reflect_y=True, reflect_z=True, train_time_aug=True, test_time_aug=True, input_pipe=None, output_pipe=None, yield_matrix=True, matrix_pipe=None)[source]

Bases: FoldYielder

Specialised version of FoldYielder providing HEP-specific data augmentation at train and test time.

Parameters:
  • foldfile (Union[str, Path, File]) – filename of hdf5 file or opened hdf5 file

  • cont_feats (Optional[List[str]]) – list of names of continuous features present in input data, not required if foldfile contains meta data already

  • cat_feats (Optional[List[str]]) – list of names of categorical features present in input data, not required if foldfile contains meta data already

  • ignore_feats (Optional[List[str]]) – optional list of input features which should be ignored

  • aug_targ_feats (Optional[List[str]]) – optional list of target vectors to also be transformed; leave as None for no augmentation of target vectors

  • rot_mult (int) – number of rotations of the event in phi to make at test time (currently must be even). A value greater than zero also applies random rotations at train time

  • random_rot (bool) – whether test-time rotation angles should be random or in steps of 2pi/rot_mult

  • reflect_x (bool) – whether to reflect events in x axis at train and test time

  • reflect_y (bool) – whether to reflect events in y axis at train and test time

  • reflect_z (bool) – whether to reflect events in z axis at train and test time

  • train_time_aug (bool) – whether to apply augmentations at train time

  • test_time_aug (bool) – whether to apply augmentations at test time

  • input_pipe (Optional[Pipeline]) – optional Pipeline, or filename for pickled Pipeline, which was used for processing the inputs

  • output_pipe (Optional[Pipeline]) – optional Pipeline, or filename for pickled Pipeline, which was used for processing the targets

  • yield_matrix (bool) – whether to actually yield matrix data if present

  • matrix_pipe (Union[str, Pipeline, None]) – preprocessing pipe for matrix data
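
The fixed test-time rotations described for rot_mult and random_rot can be pictured with a short sketch. It assumes evenly spaced angles starting at zero and a standard rotation of (px, py) about the z axis; fixed_rotation_angles and rotate_xy are hypothetical helpers, and LUMIN's own augmentation code may differ in its details.

```python
import math
from typing import List, Tuple


def fixed_rotation_angles(rot_mult: int) -> List[float]:
    """Angles in steps of 2*pi/rot_mult (hypothetical helper, not LUMIN code)."""
    return [i * 2 * math.pi / rot_mult for i in range(rot_mult)]


def rotate_xy(px: float, py: float, angle: float) -> Tuple[float, float]:
    """Rotate a transverse vector about the z axis, i.e. shift its phi by angle."""
    c, s = math.cos(angle), math.sin(angle)
    return px * c - py * s, px * s + py * c


angles = fixed_rotation_angles(rot_mult=4)
# One rotated copy of the same vector per test-time augmentation index
rotated = [rotate_xy(1.0, 0.0, a) for a in angles]
```

Each entry of rotated corresponds to one augmentation index, so predictions made on the copies can be combined afterwards.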

Examples:
>>> fy = HEPAugFoldYielder('train.h5',
...                        cont_feats=['pT','eta','phi','mass'],
...                        rot_mult=2, reflect_y=True, reflect_z=True,
...                        input_pipe='input_pipe.pkl')
get_fold(idx)[source]

Get data for single fold applying random train-time data augmentation. Data consists of dictionary of inputs, targets, and weights. Accounts for ignored features. Inputs, except for matrix data, are passed through np.nan_to_num to deal with nans and infs.

Parameters:

idx (int) – fold index to load

Return type:

Dict[str, ndarray]

Returns:

dictionary of inputs, targets, and weights as Numpy arrays

get_test_fold(idx, aug_idx)[source]

Get test data for single fold applying test-time data augmentation. Data consists of dictionary of inputs, targets, and weights. Accounts for ignored features. Inputs, except for matrix data, are passed through np.nan_to_num to deal with nans and infs.

Parameters:
  • idx (int) – fold index to load

  • aug_idx (int) – index for the test-time augmentation (ignored if random test-time augmentation requested)

Return type:

Dict[str, ndarray]

Returns:

dictionary of inputs, targets, and weights as Numpy arrays

class lumin.nn.data.fold_yielder.TorchGeometricFoldYielder(dataset, n_folds, fold_indices=None, shuffle=True, seed=None, batch_yielder_type=<class 'lumin.nn.data.batch_yielder.TorchGeometricBatchYielder'>)[source]

Bases: FoldYielder

Interface class for accessing data from PyTorch Geometric datasets. Dataset will be split into sub-folds; either provide a value for the fold_indices argument with your own split as a list of lists of indices, or specify the number of folds for a random split (n_folds)

Warning

Much functionality has yet to be implemented for this class

Parameters:
  • dataset (Dataset) – PyTorch Geometric Dataset containing inputs, weights, and targets

  • n_folds (Optional[int]) – number of folds in which to randomly split the dataset. Must provide either this or fold_indices

  • fold_indices (Optional[List[List[int]]]) – list of lists of indices; each list of indices is a fold. Must provide either this or n_folds

  • shuffle (bool) – if no fold_indices are provided, data will be split into the specified number of folds. This controls whether the indices will be shuffled beforehand or not.

  • seed (Optional[int]) – if no fold_indices are provided, data will be split into the specified number of folds. This sets the random seed used for shuffling, if requested.

  • batch_yielder_type (Type[BatchYielder]) – Class of BatchYielder to instantiate to yield inputs
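
A random split of the kind described by n_folds, shuffle, and seed might look like the sketch below, whose output has the list-of-lists shape expected by fold_indices. split_into_folds is a hypothetical helper, and the round-robin assignment is an assumption; the class may partition the indices differently.

```python
import random
from typing import List, Optional


def split_into_folds(n_items: int, n_folds: int, shuffle: bool = True,
                     seed: Optional[int] = None) -> List[List[int]]:
    """Split dataset indices into n_folds folds (hypothetical helper, not LUMIN code)."""
    idxs = list(range(n_items))
    if shuffle:
        # A fixed seed makes the split reproducible
        random.Random(seed).shuffle(idxs)
    # Deal indices round-robin so that fold sizes differ by at most one
    return [idxs[i::n_folds] for i in range(n_folds)]


fold_indices = split_into_folds(n_items=10, n_folds=3, seed=0)
ordered = split_into_folds(n_items=10, n_folds=3, shuffle=False)
```

Every index appears in exactly one fold, so a list built this way could in principle be passed as fold_indices in place of n_folds.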

add_ignore(feats)[source]

Add features to ignored features.

Parameters:

feats (Union[str, List[str]]) – list of feature names to ignore

Return type:

None

close()[source]

Closes the foldfile

Return type:

None

columns()[source]

Returns list of columns present in foldfile

Return type:

List[str]

Returns:

list of columns present in foldfile

get_column(column, n_folds=None, fold_idx=None, add_newaxis=False)[source]

Load column (h5py group) from foldfile. Used for getting arbitrary data which isn’t automatically grabbed by other methods.

Parameters:
  • column (str) – name of h5py group to get

  • n_folds (Optional[int]) – number of folds to get data from. Default all folds. Not compatible with fold_idx

  • fold_idx (Optional[int]) – Only load group from a single, specified fold. Not compatible with n_folds

  • add_newaxis (bool) – whether to expand the shape of the returned data if its shape is ()

Return type:

Optional[ndarray]

Returns:

Numpy array of column data

get_data(n_folds=None, fold_idx=None)[source]

Get data for a single, specified fold or for several folds. Data consists of a dictionary of inputs, targets, and weights. Does not account for ignored features. Inputs are passed through np.nan_to_num to deal with NaNs and infs.

Parameters:
  • n_folds (Optional[int]) – number of folds to get data from. Default all folds. Not compatible with fold_idx

  • fold_idx (Optional[int]) – Only load group from a single, specified fold. Not compatible with n_folds

Return type:

Dict[str, ndarray]

Returns:

dictionary of inputs, targets, and weights as Numpy arrays

get_df(pred_name='pred', targ_name='targets', wgt_name='weights', n_folds=None, fold_idx=None, inc_inputs=False, inc_ignore=False, deprocess=False, verbose=True, suppress_warn=False, nan_to_num=False, inc_matrix=False)[source]

Get a Pandas DataFrame of the data in the foldfile. Will add columns for inputs (if requested), targets, weights, and predictions (if present)

Parameters:
  • pred_name (str) – name of prediction group

  • targ_name (str) – name of target group

  • wgt_name (str) – name of weight group

  • n_folds (Optional[int]) – number of folds to get data from. Default all folds. Not compatible with fold_idx

  • fold_idx (Optional[int]) – Only load group from a single, specified fold. Not compatible with n_folds

  • inc_inputs (bool) – whether to include input data

  • inc_ignore (bool) – whether to include ignored features

  • deprocess (bool) – whether to deprocess inputs and targets if pipelines have been added

  • verbose (bool) – whether to print the number of datapoints loaded

  • suppress_warn (bool) – whether to suppress the warning about missing columns

  • nan_to_num (bool) – whether to pass input data through np.nan_to_num

  • inc_matrix (bool) – whether to include flattened matrix data in output, if present

Return type:

DataFrame

Returns:

Pandas DataFrame with requested data

get_fold(idx)[source]

Get data for single fold. Data consists of a slice of a PyTorch Geometric Dataset.

Parameters:

idx (int) – fold index to load

Return type:

Dict[str, ndarray]

Returns:

PyTorch Geometric Dataset slice

save_fold_pred(pred, fold_idx, pred_name='pred')[source]

Save predictions for given fold as a new column in the foldfile

Parameters:
  • pred (ndarray) – array of predictions in the same order as data appears in the file

  • fold_idx (int) – index for fold

  • pred_name (str) – name of column to save predictions under

Return type:

None

Module contents
