lumin.nn.data package¶

Submodules¶

lumin.nn.data.batch_yielder module¶

class lumin.nn.data.batch_yielder.BatchYielder(inputs, bs, objective, targets=None, weights=None, shuffle=True, use_weights=True, bulk_move=True, input_mask=None, drop_last=True)[source]¶

Bases: object

Yields minibatches to model during training. Iteration provides one minibatch as tuple of tensors of inputs, targets, and weights.

TODO: Improve this/change to dataloader

Parameters:

inputs (Union[ndarray, Tuple[ndarray, ndarray]]) – input array for (sub-)epoch
targets (Optional[ndarray]) – target array for (sub-)epoch
bs (int) – batchsize, number of data to include per minibatch
objective (str) – ‘classification’, ‘multiclass classification’, or ‘regression’. Used for casting target dtype.
weights (Optional[ndarray]) – Optional weight array for (sub-)epoch
shuffle (bool) – whether to shuffle the data at the beginning of an iteration
use_weights (bool) – if passed weights, whether to actually pass them to the model
bulk_move (bool) – whether to move all data to device at once. Default is True (saves time); set to False if the device has limited memory.
input_mask (Optional[ndarray]) – optionally only use Boolean-masked inputs
drop_last (bool) – whether to drop the last batch if it does not contain bs elements
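
As a rough illustration of how bs, shuffle, and drop_last interact, the sketch below generates minibatch index lists in plain Python. The helper name iter_minibatches and its seed argument are hypothetical and not part of LUMIN; the real BatchYielder operates on tensors and may differ in detail.

```python
import random
from typing import Iterator, List, Optional


def iter_minibatches(n_data: int, bs: int, shuffle: bool = True,
                     drop_last: bool = True,
                     seed: Optional[int] = None) -> Iterator[List[int]]:
    """Yield one list of data indices per minibatch (hypothetical helper)."""
    idxs = list(range(n_data))
    if shuffle:
        # Shuffle once at the beginning of the iteration
        random.Random(seed).shuffle(idxs)
    for start in range(0, n_data, bs):
        batch = idxs[start:start + bs]
        if drop_last and len(batch) < bs:
            break  # discard the final, incomplete minibatch
        yield batch


full = list(iter_minibatches(10, bs=4, shuffle=False, drop_last=True))
ragged = list(iter_minibatches(10, bs=4, shuffle=False, drop_last=False))
```

With 10 data points and bs=4, drop_last=True gives two full minibatches, whereas drop_last=False adds a third containing the remaining two points.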

get_inputs(on_device=False)[source]¶

Returns all data.

Parameters:: on_device (bool) – whether to place tensor on device
Return type:: Union[Tensor, Tuple[Tensor, Tensor]]
Returns:: tuple of inputs, targets, and weights as tensors on device

class lumin.nn.data.batch_yielder.TorchGeometricBatchYielder(inputs, bs, shuffle=True, exclude_keys=None, use_weights=True, **kwargs)[source]¶

Bases: BatchYielder

BatchYielder for PyTorch Geometric data. kwargs for compatibility only.

Parameters:

inputs (Dataset) – PyTorch Geometric Dataset containing inputs, weights, and targets
bs (int) – batchsize, number of data to include per minibatch
shuffle (bool) – whether to shuffle the data at the beginning of an iteration
exclude_keys (Optional[List[str]]) – data keys to exclude from inputs

get_inputs(on_device=False)[source]¶

Returns all data.

Parameters:: on_device (bool) – whether to place tensor on device
Return type:: Union[Tensor, Tuple[Tensor, Tensor]]
Returns:: tuple of inputs, targets, and weights as dictionaries of tensors on device

lumin.nn.data.fold_yielder module¶

class lumin.nn.data.fold_yielder.FoldYielder(foldfile, cont_feats=None, cat_feats=None, ignore_feats=None, input_pipe=None, output_pipe=None, yield_matrix=True, matrix_pipe=None, batch_yielder_type=<class 'lumin.nn.data.batch_yielder.BatchYielder'>)[source]¶

Bases: object

Interface class for accessing data from foldfiles created by df2foldfile()

Parameters:

foldfile (Union[str, Path, File]) – filename of hdf5 file or opened hdf5 file
cont_feats (Optional[List[str]]) – list of names of continuous features present in input data, not required if foldfile contains meta data already
cat_feats (Optional[List[str]]) – list of names of categorical features present in input data, not required if foldfile contains meta data already
ignore_feats (Optional[List[str]]) – optional list of input features which should be ignored
input_pipe (Union[str, Pipeline, Path, None]) – optional Pipeline, or filename for pickled Pipeline, which was used for processing the inputs
output_pipe (Union[str, Pipeline, Path, None]) – optional Pipeline, or filename for pickled Pipeline, which was used for processing the targets
yield_matrix (bool) – whether to actually yield matrix data if present
matrix_pipe (Union[str, Pipeline, Path, None]) – preprocessing pipe for matrix data
batch_yielder_type (Type[BatchYielder]) – Class of BatchYielder to instantiate to yield inputs

Examples::

>>> fy = FoldYielder('train.h5')
>>>
>>> fy = FoldYielder('train.h5', ignore_feats=['phi'], input_pipe='input_pipe.pkl')
>>>
>>> fy = FoldYielder('train.h5', input_pipe=input_pipe, matrix_pipe=matrix_pipe)
>>>
>>> fy = FoldYielder('train.h5', input_pipe=input_pipe, yield_matrix=False)

add_ignore(feats)[source]¶

Add features to ignored features.

Parameters:: feats (Union[str, List[str]]) – list of feature names to ignore
Return type:: None

add_input_pipe(input_pipe)[source]¶

Adds an input pipe to the FoldYielder for use when deprocessing data

Parameters:: input_pipe (Union[str, Pipeline]) – Pipeline which was used for preprocessing the input data or name of pkl file containing Pipeline
Return type:: None

add_input_pipe_from_file(name)[source]¶

Adds an input pipe from a pkl file to the FoldYielder for use when deprocessing data

Parameters:: name (Union[str, Path]) – name of pkl file containing Pipeline which was used for preprocessing the input data
Return type:: None

add_matrix_pipe(matrix_pipe)[source]¶

Adds a matrix pipe to the FoldYielder for use when deprocessing data

Warning

Deprocessing matrix data is not yet implemented

Parameters:: matrix_pipe (Union[str, Pipeline]) – Pipeline which was used for preprocessing the matrix data or name of pkl file containing Pipeline
Return type:: None

add_matrix_pipe_from_file(name)[source]¶

Adds a matrix pipe from a pkl file to the FoldYielder for use when deprocessing data

Parameters:: name (str) – name of pkl file containing Pipeline which was used for preprocessing the matrix data
Return type:: None

add_output_pipe(output_pipe)[source]¶

Adds an output pipe to the FoldYielder for use when deprocessing data

Parameters:: output_pipe (Union[str, Pipeline]) – Pipeline which was used for preprocessing the target data or name of pkl file containing Pipeline
Return type:: None

add_output_pipe_from_file(name)[source]¶

Adds an output pipe from a pkl file to the FoldYielder for use when deprocessing data

Parameters:: name (Union[str, Path]) – name of pkl file containing Pipeline which was used for preprocessing the target data
Return type:: None

close()[source]¶

Closes the foldfile

Return type:: None

columns()[source]¶

Returns list of columns present in foldfile

Return type:: List[str]
Returns:: list of columns present in foldfile

get_column(column, n_folds=None, fold_idx=None, add_newaxis=False)[source]¶

Load column (h5py group) from foldfile. Used for getting arbitrary data which isn’t automatically grabbed by other methods.

Parameters:

column (str) – name of h5py group to get
n_folds (Optional[int]) – number of folds to get data from. Default all folds. Not compatible with fold_idx
fold_idx (Optional[int]) – Only load group from a single, specified fold. Not compatible with n_folds
add_newaxis (bool) – whether to expand the shape of the returned data if the data shape is ()

Return type:

Optional[ndarray]

Returns:

Numpy array of column data
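
The n_folds and fold_idx arguments are mutually exclusive. A minimal sketch of that selection logic follows. The helper select_folds is hypothetical and not part of LUMIN, and it assumes that requesting n_folds means taking the first n_folds folds, which the documentation does not confirm.

```python
from typing import List, Optional


def select_folds(n_total: int, n_folds: Optional[int] = None,
                 fold_idx: Optional[int] = None) -> List[int]:
    """Return the fold indices to read (hypothetical helper, not LUMIN API)."""
    if n_folds is not None and fold_idx is not None:
        raise ValueError("n_folds and fold_idx are mutually exclusive")
    if fold_idx is not None:
        if not 0 <= fold_idx < n_total:
            raise IndexError(f"fold_idx {fold_idx} out of range")
        return [fold_idx]  # a single, specified fold
    if n_folds is None:
        return list(range(n_total))  # default: all folds
    # Assumption: a requested number of folds means the first n_folds folds
    return list(range(min(n_folds, n_total)))


all_folds = select_folds(5)
first_two = select_folds(5, n_folds=2)
single = select_folds(5, fold_idx=3)
```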

get_data(n_folds=None, fold_idx=None)[source]¶

Get data for a single, specified fold or for several folds. Data consists of a dictionary of inputs, targets, and weights. Does not account for ignored features. Inputs are passed through np.nan_to_num to deal with NaNs and infs.

Parameters:

n_folds (Optional[int]) – number of folds to get data from. Default all folds. Not compatible with fold_idx
fold_idx (Optional[int]) – Only load group from a single, specified fold. Not compatible with n_folds

Return type:

Dict[str, ndarray]

Returns:

dictionary of inputs, targets, and weights as Numpy arrays
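
For readers unfamiliar with np.nan_to_num, the sketch below mimics its default behaviour for float64 values in pure Python: NaN becomes 0.0 and infinities become the largest finite value of matching sign. The stand-in function is for illustration only; LUMIN calls the NumPy function itself.

```python
import math
import sys
from typing import List


def nan_to_num(values: List[float]) -> List[float]:
    """Pure-Python stand-in for np.nan_to_num's default float64 behaviour."""
    out = []
    for v in values:
        if math.isnan(v):
            out.append(0.0)  # NaN is replaced by zero
        elif math.isinf(v):
            # +inf / -inf are replaced by the largest finite float of that sign
            out.append(math.copysign(sys.float_info.max, v))
        else:
            out.append(v)
    return out


cleaned = nan_to_num([1.5, float("nan"), float("inf"), float("-inf")])
```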

get_data_count(idxs=None)[source]¶

Returns total number of data entries in requested folds

Parameters:: idxs (Union[int, List[int], None]) – list of indices to check
Return type:: int
Returns:: Total number of entries in the folds
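
A small sketch of how the idxs argument might be normalised before counting. The helper count_entries and the fold_sizes mapping are hypothetical, and the assumption that None means "all folds" is not stated in the documentation.

```python
from typing import Dict, List, Optional, Union


def count_entries(fold_sizes: Dict[int, int],
                  idxs: Optional[Union[int, List[int]]] = None) -> int:
    """Sum the number of entries over the requested folds (hypothetical helper)."""
    if idxs is None:
        idxs = list(fold_sizes)  # assumption: None means all folds
    elif isinstance(idxs, int):
        idxs = [idxs]  # accept a single index as well as a list
    return sum(fold_sizes[i] for i in idxs)


sizes = {0: 100, 1: 100, 2: 95}  # made-up fold sizes for illustration
total = count_entries(sizes)
one = count_entries(sizes, 2)
some = count_entries(sizes, [0, 2])
```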

get_df(pred_name='pred', targ_name='targets', wgt_name='weights', n_folds=None, fold_idx=None, inc_inputs=False, inc_ignore=False, deprocess=False, verbose=True, suppress_warn=False, nan_to_num=False, inc_matrix=False)[source]¶

Get a Pandas DataFrame of the data in the foldfile. Will add columns for inputs (if requested), targets, weights, and predictions (if present)

Parameters:

pred_name (str) – name of prediction group
targ_name (str) – name of target group
wgt_name (str) – name of weight group
n_folds (Optional[int]) – number of folds to get data from. Default all folds. Not compatible with fold_idx
fold_idx (Optional[int]) – Only load group from a single, specified fold. Not compatible with n_folds
inc_inputs (bool) – whether to include input data
inc_ignore (bool) – whether to include ignored features
deprocess (bool) – whether to deprocess inputs and targets if pipelines have been provided
verbose (bool) – whether to print the number of datapoints loaded
suppress_warn (bool) – whether to suppress the warning about missing columns
nan_to_num (bool) – whether to pass input data through np.nan_to_num
inc_matrix (bool) – whether to include flattened matrix data in output, if present

Return type:

DataFrame

Returns:

Pandas DataFrame with requested data

get_fold(idx)[source]¶

Get data for single fold. Data consists of dictionary of inputs, targets, and weights. Accounts for ignored features. Inputs, except for matrix data, are passed through np.nan_to_num to deal with nans and infs.

Parameters:: idx (int) – fold index to load
Return type:: Dict[str, ndarray]
Returns:: dictionary of inputs, targets, and weights as Numpy arrays

get_ignore()[source]¶

Returns list of ignored features

Return type:: List[str]
Returns:: Features removed from training data

get_use_cat_feats()[source]¶

Returns list of categorical features which will be present in training data, accounting for ignored features.

Return type:: List[str]
Returns:: List of categorical features

get_use_cont_feats()[source]¶

Returns list of continuous features which will be present in training data, accounting for ignored features.

Return type:: List[str]
Returns:: List of continuous features
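
The relationship between add_ignore, get_ignore, get_use_cont_feats, and get_use_cat_feats can be sketched with a small stand-in class. FeatureSelection is hypothetical and only mirrors the documented behaviour (ignored features are filtered out of the lists used for training); it is not LUMIN's implementation.

```python
from typing import List, Union


class FeatureSelection:
    """Hypothetical stand-in tracking feature lists and an ignore list."""

    def __init__(self, cont_feats: List[str], cat_feats: List[str]) -> None:
        self.cont_feats = list(cont_feats)
        self.cat_feats = list(cat_feats)
        self._ignore: List[str] = []

    def add_ignore(self, feats: Union[str, List[str]]) -> None:
        if isinstance(feats, str):
            feats = [feats]  # accept a single name as well as a list
        self._ignore += [f for f in feats if f not in self._ignore]

    def get_ignore(self) -> List[str]:
        return list(self._ignore)

    def get_use_cont_feats(self) -> List[str]:
        return [f for f in self.cont_feats if f not in self._ignore]

    def get_use_cat_feats(self) -> List[str]:
        return [f for f in self.cat_feats if f not in self._ignore]


sel = FeatureSelection(cont_feats=["pT", "eta", "phi"], cat_feats=["channel"])
sel.add_ignore("phi")
sel.add_ignore(["channel", "phi"])  # repeated names are not added twice
```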

save_fold_pred(pred, fold_idx, pred_name='pred')[source]¶

Save predictions for given fold as a new column in the foldfile

Parameters:

pred (ndarray) – array of predictions in the same order as data appears in the file
fold_idx (int) – index for fold
pred_name (str) – name of column to save predictions under

Return type:

None

class lumin.nn.data.fold_yielder.HEPAugFoldYielder(foldfile, cont_feats=None, cat_feats=None, ignore_feats=None, aug_targ_feats=None, rot_mult=2, random_rot=False, reflect_x=False, reflect_y=True, reflect_z=True, train_time_aug=True, test_time_aug=True, input_pipe=None, output_pipe=None, yield_matrix=True, matrix_pipe=None)[source]¶

Bases: FoldYielder

Specialised version of FoldYielder providing HEP-specific data augmentation at train and test time.

Parameters:

foldfile (Union[str, Path, File]) – filename of hdf5 file or opened hdf5 file
cont_feats (Optional[List[str]]) – list of names of continuous features present in input data, not required if foldfile contains meta data already
cat_feats (Optional[List[str]]) – list of names of categorical features present in input data, not required if foldfile contains meta data already
ignore_feats (Optional[List[str]]) – optional list of input features which should be ignored
aug_targ_feats (Optional[List[str]]) – optional list of target vectors to also be transformed, leave as None for no augmentation of target vectors
rot_mult (int) – number of rotations of event in phi to make at test-time (currently must be even). Greater than zero will also apply random rotations during train-time
random_rot (bool) – whether test-time rotation angles should be random or in steps of 2pi/rot_mult
reflect_x (bool) – whether to reflect events in x axis at train and test time
reflect_y (bool) – whether to reflect events in y axis at train and test time
reflect_z (bool) – whether to reflect events in z axis at train and test time
train_time_aug (bool) – whether to apply augmentations at train time
test_time_aug (bool) – whether to apply augmentations at test time
input_pipe (Optional[Pipeline]) – optional Pipeline, or filename for pickled Pipeline, which was used for processing the inputs
output_pipe (Optional[Pipeline]) – optional Pipeline, or filename for pickled Pipeline, which was used for processing the targets
yield_matrix (bool) – whether to actually yield matrix data if present
matrix_pipe (Union[str, Pipeline, None]) – preprocessing pipe for matrix data

Examples::

>>> fy = HEPAugFoldYielder('train.h5',
...                        cont_feats=['pT','eta','phi','mass'],
...                        rot_mult=2, reflect_y=True, reflect_z=True,
...                        input_pipe='input_pipe.pkl')
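
With random_rot=False, test-time rotation angles are described as steps of 2pi/rot_mult. The sketch below computes those angles and applies one to a transverse momentum vector. It assumes Cartesian px and py components, and the helper names are hypothetical. How LUMIN combines rotations with reflections, or maps aug_idx onto them, is not shown here.

```python
import math
from typing import List, Tuple


def rotation_angles(rot_mult: int) -> List[float]:
    """Evenly spaced angles in phi, in steps of 2*pi/rot_mult."""
    return [i * 2 * math.pi / rot_mult for i in range(rot_mult)]


def rotate_xy(px: float, py: float, angle: float) -> Tuple[float, float]:
    """Rotate a transverse vector (px, py) by the given angle in phi."""
    c, s = math.cos(angle), math.sin(angle)
    return px * c - py * s, px * s + py * c


angles = rotation_angles(4)
rx, ry = rotate_xy(1.0, 0.0, angles[1])  # rotate by pi/2
```

A rotation in phi leaves the transverse momentum magnitude unchanged, which is why it is a physically sensible augmentation for events with azimuthal symmetry.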

get_fold(idx)[source]¶

Get data for single fold applying random train-time data augmentation. Data consists of dictionary of inputs, targets, and weights. Accounts for ignored features. Inputs, except for matrix data, are passed through np.nan_to_num to deal with nans and infs.

Parameters:: idx (int) – fold index to load
Return type:: Dict[str, ndarray]
Returns:: dictionary of inputs, targets, and weights as Numpy arrays

get_test_fold(idx, aug_idx)[source]¶

Get test data for single fold applying test-time data augmentation. Data consists of dictionary of inputs, targets, and weights. Accounts for ignored features. Inputs, except for matrix data, are passed through np.nan_to_num to deal with nans and infs.

Parameters:

idx (int) – fold index to load
aug_idx (int) – index for the test-time augmentation (ignored if random test-time augmentation requested)

Return type:

Dict[str, ndarray]

Returns:

dictionary of inputs, targets, and weights as Numpy arrays

class lumin.nn.data.fold_yielder.TorchGeometricFoldYielder(dataset, n_folds, fold_indices=None, shuffle=True, seed=None, batch_yielder_type=<class 'lumin.nn.data.batch_yielder.TorchGeometricBatchYielder'>)[source]¶

Bases: FoldYielder

Interface class for accessing data from PyTorch Geometric datasets. Dataset will be split into sub-folds; either provide a value for the fold_indices argument with your own split as a list of lists of indices, or specify the number of folds for a random split (n_folds)

Warning

Much functionality has yet to be implemented for this class

Parameters:

dataset (Dataset) – PyTorch Geometric Dataset containing inputs, weights, and targets
n_folds (Optional[int]) – number of folds in which to randomly split the dataset. Must provide either this or fold_indices
fold_indices (Optional[List[List[int]]]) – list of lists of indices; each list of indices is a fold. Must provide either this or n_folds
shuffle (bool) – if no fold_indices are provided, data will be split into the specified number of folds. This controls whether the indices will be shuffled beforehand or not.
seed (Optional[int]) – if no fold_indices are provided, data will be split into the specified number of folds. This sets the random seed used for shuffling, if requested.
batch_yielder_type (Type[BatchYielder]) – Class of BatchYielder to instantiate to yield inputs
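
The interplay of n_folds, fold_indices, shuffle, and seed can be sketched as follows. The helper make_fold_indices is hypothetical, and the strided split used to keep fold sizes balanced is an assumption; the class may divide the indices differently.

```python
import random
from typing import List, Optional


def make_fold_indices(n_data: int, n_folds: Optional[int] = None,
                      fold_indices: Optional[List[List[int]]] = None,
                      shuffle: bool = True,
                      seed: Optional[int] = None) -> List[List[int]]:
    """Return one list of data indices per fold (hypothetical helper)."""
    if fold_indices is not None:
        return fold_indices  # a user-supplied split is used as given
    if n_folds is None:
        raise ValueError("Provide either n_folds or fold_indices")
    idxs = list(range(n_data))
    if shuffle:
        random.Random(seed).shuffle(idxs)  # seeded for reproducibility
    # Assumption: strided slicing keeps fold sizes as equal as possible
    return [idxs[i::n_folds] for i in range(n_folds)]


folds = make_fold_indices(10, n_folds=3, seed=1)
custom = make_fold_indices(4, fold_indices=[[0, 1], [2, 3]])
```

Fixing the seed makes the random split repeatable, which matters when predictions saved per fold need to be matched back to the same data later.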

add_ignore(feats)[source]¶

Add features to ignored features.

Parameters:: feats (Union[str, List[str]]) – list of feature names to ignore
Return type:: None

close()[source]¶

Closes the foldfile

Return type:: None

columns()[source]¶

Returns list of columns present in foldfile

Return type:: List[str]
Returns:: list of columns present in foldfile

get_column(column, n_folds=None, fold_idx=None, add_newaxis=False)[source]¶

Load column (h5py group) from foldfile. Used for getting arbitrary data which isn’t automatically grabbed by other methods.

Parameters:

column (str) – name of h5py group to get
n_folds (Optional[int]) – number of folds to get data from. Default all folds. Not compatible with fold_idx
fold_idx (Optional[int]) – Only load group from a single, specified fold. Not compatible with n_folds
add_newaxis (bool) – whether to expand the shape of the returned data if the data shape is ()

Return type:

Optional[ndarray]

Returns:

Numpy array of column data

get_data(n_folds=None, fold_idx=None)[source]¶

Get data for a single, specified fold or for several folds. Data consists of a dictionary of inputs, targets, and weights. Does not account for ignored features. Inputs are passed through np.nan_to_num to deal with NaNs and infs.

Parameters:

n_folds (Optional[int]) – number of folds to get data from. Default all folds. Not compatible with fold_idx
fold_idx (Optional[int]) – Only load group from a single, specified fold. Not compatible with n_folds

Return type:

Dict[str, ndarray]

Returns:

dictionary of inputs, targets, and weights as Numpy arrays

get_df(pred_name='pred', targ_name='targets', wgt_name='weights', n_folds=None, fold_idx=None, inc_inputs=False, inc_ignore=False, deprocess=False, verbose=True, suppress_warn=False, nan_to_num=False, inc_matrix=False)[source]¶

Get a Pandas DataFrame of the data in the foldfile. Will add columns for inputs (if requested), targets, weights, and predictions (if present)

Parameters:

pred_name (str) – name of prediction group
targ_name (str) – name of target group
wgt_name (str) – name of weight group
n_folds (Optional[int]) – number of folds to get data from. Default all folds. Not compatible with fold_idx
fold_idx (Optional[int]) – Only load group from a single, specified fold. Not compatible with n_folds
inc_inputs (bool) – whether to include input data
inc_ignore (bool) – whether to include ignored features
deprocess (bool) – whether to deprocess inputs and targets if pipelines have been provided
verbose (bool) – whether to print the number of datapoints loaded
suppress_warn (bool) – whether to suppress the warning about missing columns
nan_to_num (bool) – whether to pass input data through np.nan_to_num
inc_matrix (bool) – whether to include flattened matrix data in output, if present

Return type:

DataFrame

Returns:

Pandas DataFrame with requested data

get_fold(idx)[source]¶

Get data for single fold. Data consists of a slice of a PyTorch Geometric Dataset.

Parameters:: idx (int) – fold index to load
Return type:: Dict[str, ndarray]
Returns:: PyTorch Geometric Dataset slice

save_fold_pred(pred, fold_idx, pred_name='pred')[source]¶

Save predictions for given fold as a new column in the foldfile

Parameters:

pred (ndarray) – array of predictions in the same order as data appears in the file
fold_idx (int) – index for fold
pred_name (str) – name of column to save predictions under

Return type:

None
