lumin.nn.data package
Submodules
lumin.nn.data.batch_yielder module
- class lumin.nn.data.batch_yielder.BatchYielder(inputs, bs, objective, targets=None, weights=None, shuffle=True, use_weights=True, bulk_move=True, input_mask=None, drop_last=True)
Bases: object
Yields minibatches to the model during training. Each iteration provides one minibatch as a tuple of tensors: inputs, targets, and weights.
TODO: Improve this/change to dataloader
- Parameters:
inputs (Union[ndarray,Tuple[ndarray,ndarray]]) – input array for (sub-)epoch
targets (Optional[ndarray]) – target array for (sub-)epoch
bs (int) – batch size, number of data to include per minibatch
objective (str) – ‘classification’, ‘multiclass classification’, or ‘regression’. Used for casting target dtype.
weights (Optional[ndarray]) – optional weight array for (sub-)epoch
shuffle (bool) – whether to shuffle the data at the beginning of an iteration
use_weights (bool) – if passed weights, whether to actually pass them to the model
bulk_move (bool) – whether to move all data to device at once. Default is True (saves time), but if the device has low memory you can set this to False.
input_mask (Optional[ndarray]) – optionally only use Boolean-masked inputs
drop_last (bool) – whether to drop the last batch if it does not contain bs elements
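The interplay of bs, shuffle, and drop_last can be illustrated with a minimal, pure-Python sketch. The helper iter_minibatches and its seed argument are hypothetical and are not part of LUMIN; plain lists stand in for arrays and tensors, and no dtype casting or device transfer is modelled.

```python
import random
from typing import Iterator, List, Optional, Sequence, Tuple


def iter_minibatches(
    inputs: Sequence,
    targets: Sequence,
    bs: int,
    shuffle: bool = True,
    drop_last: bool = True,
    seed: Optional[int] = None,
) -> Iterator[Tuple[List, List]]:
    """Yield (inputs, targets) minibatches of size bs.

    Hypothetical helper for illustration only; not LUMIN code.
    """
    idxs = list(range(len(inputs)))
    if shuffle:
        # Reorder the data once, at the beginning of the iteration
        random.Random(seed).shuffle(idxs)
    for start in range(0, len(idxs), bs):
        batch = idxs[start:start + bs]
        if drop_last and len(batch) < bs:
            break  # the final batch is incomplete, so it is skipped
        yield [inputs[i] for i in batch], [targets[i] for i in batch]


x = list(range(10))
y = [v % 2 for v in x]
kept = list(iter_minibatches(x, y, bs=4, shuffle=False, drop_last=True))
full = list(iter_minibatches(x, y, bs=4, shuffle=False, drop_last=False))
print(len(kept), len(full))  # 2 3
```

With ten data points and bs=4, drop_last=True discards the trailing batch of two, so every yielded batch has exactly bs elements.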
- class lumin.nn.data.batch_yielder.TorchGeometricBatchYielder(inputs, bs, shuffle=True, exclude_keys=None, use_weights=True, **kwargs)
Bases: BatchYielder
BatchYielder for PyTorch Geometric data. kwargs for compatibility only.
- Parameters:
inputs (Dataset) – PyTorch Geometric Dataset containing inputs, weights, and targets
bs (int) – batch size, number of data to include per minibatch
shuffle (bool) – whether to shuffle the data at the beginning of an iteration
exclude_keys (Optional[List[str]]) – data keys to exclude from inputs
lumin.nn.data.fold_yielder module
- class lumin.nn.data.fold_yielder.FoldYielder(foldfile, cont_feats=None, cat_feats=None, ignore_feats=None, input_pipe=None, output_pipe=None, yield_matrix=True, matrix_pipe=None, batch_yielder_type=<class 'lumin.nn.data.batch_yielder.BatchYielder'>)
Bases: object
Interface class for accessing data from foldfiles created by df2foldfile()
- Parameters:
foldfile (Union[str,Path,File]) – filename of hdf5 file or opened hdf5 file
cont_feats (Optional[List[str]]) – list of names of continuous features present in input data, not required if foldfile contains meta data already
cat_feats (Optional[List[str]]) – list of names of categorical features present in input data, not required if foldfile contains meta data already
ignore_feats (Optional[List[str]]) – optional list of input features which should be ignored
input_pipe (Union[str,Pipeline,Path,None]) – optional Pipeline, or filename for pickled Pipeline, which was used for processing the inputs
output_pipe (Union[str,Pipeline,Path,None]) – optional Pipeline, or filename for pickled Pipeline, which was used for processing the targets
yield_matrix (bool) – whether to actually yield matrix data if present
matrix_pipe (Union[str,Pipeline,Path,None]) – preprocessing pipe for matrix data
batch_yielder_type (Type[BatchYielder]) – Class of BatchYielder to instantiate to yield inputs
- Examples:
>>> fy = FoldYielder('train.h5')
>>>
>>> fy = FoldYielder('train.h5', ignore_feats=['phi'], input_pipe='input_pipe.pkl')
>>>
>>> fy = FoldYielder('train.h5', input_pipe=input_pipe, matrix_pipe=matrix_pipe)
>>>
>>> fy = FoldYielder('train.h5', input_pipe=input_pipe, yield_matrix=False)
- add_ignore(feats)
Add features to ignored features.
- Parameters:
feats (Union[str,List[str]]) – list of feature names to ignore
- Return type:
None
- add_input_pipe(input_pipe)
Adds an input pipe to the FoldYielder for use when deprocessing data
- Parameters:
input_pipe (Union[str,Pipeline]) – Pipeline which was used for preprocessing the input data or name of pkl file containing Pipeline
- Return type:
None
- add_input_pipe_from_file(name)
Adds an input pipe from a pkl file to the FoldYielder for use when deprocessing data
- Parameters:
name (Union[str,Path]) – name of pkl file containing Pipeline which was used for preprocessing the input data
- Return type:
None
- add_matrix_pipe(matrix_pipe)
Adds a matrix pipe to the FoldYielder for use when deprocessing data
Warning
Deprocessing matrix data is not yet implemented
- Parameters:
matrix_pipe (Union[str,Pipeline]) – Pipeline which was used for preprocessing the matrix data or name of pkl file containing Pipeline
- Return type:
None
- add_matrix_pipe_from_file(name)
Adds a matrix pipe from a pkl file to the FoldYielder for use when deprocessing data
- Parameters:
name (str) – name of pkl file containing Pipeline which was used for preprocessing the matrix data
- Return type:
None
- add_output_pipe(output_pipe)
Adds an output pipe to the FoldYielder for use when deprocessing data
- Parameters:
output_pipe (Union[str,Pipeline]) – Pipeline which was used for preprocessing the target data or name of pkl file containing Pipeline
- Return type:
None
- add_output_pipe_from_file(name)
Adds an output pipe from a pkl file to the FoldYielder for use when deprocessing data
- Parameters:
name (Union[str,Path]) – name of pkl file containing Pipeline which was used for preprocessing the target data
- Return type:
None
- columns()
Returns list of columns present in foldfile
- Return type:
List[str]
- Returns:
list of columns present in foldfile
- get_column(column, n_folds=None, fold_idx=None, add_newaxis=False)
Load column (h5py group) from foldfile. Used for getting arbitrary data which isn’t automatically grabbed by other methods.
- Parameters:
column (str) – name of h5py group to get
n_folds (Optional[int]) – number of folds to get data from. Default all folds. Not compatible with fold_idx
fold_idx (Optional[int]) – Only load group from a single, specified fold. Not compatible with n_folds
add_newaxis (bool) – whether to expand the shape of the returned data if the data shape is ()
- Return type:
Optional[ndarray]
- Returns:
Numpy array of column data
- get_data(n_folds=None, fold_idx=None)
Get data for a single, specified fold or for several folds. Data consists of a dictionary of inputs, targets, and weights. Does not account for ignored features. Inputs are passed through np.nan_to_num to deal with nans and infs.
- Parameters:
n_folds (Optional[int]) – number of folds to get data from. Default all folds. Not compatible with fold_idx
fold_idx (Optional[int]) – Only load group from a single, specified fold. Not compatible with n_folds
- Return type:
Dict[str,ndarray]
- Returns:
dictionary of inputs, targets, and weights as Numpy arrays
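The relationship between n_folds and fold_idx can be sketched as a small selection routine. The helper select_folds is hypothetical and not part of LUMIN. Two points are assumptions made for illustration: that passing both arguments is treated as an error, and that n_folds selects the first n folds.

```python
from typing import List, Optional


def select_folds(total_folds: int, n_folds: Optional[int] = None,
                 fold_idx: Optional[int] = None) -> List[int]:
    """Return the fold indices to read.

    Hypothetical helper for illustration only; not LUMIN code.
    """
    if n_folds is not None and fold_idx is not None:
        # Assumption: the documented incompatibility is surfaced as an error
        raise ValueError("pass either n_folds or fold_idx, not both")
    if fold_idx is not None:
        if not 0 <= fold_idx < total_folds:
            raise IndexError(f"fold_idx {fold_idx} out of range")
        return [fold_idx]  # a single, specified fold
    if n_folds is None:
        return list(range(total_folds))  # default: all folds
    # Assumption: n_folds means the first n folds in file order
    return list(range(min(n_folds, total_folds)))


print(select_folds(5))              # [0, 1, 2, 3, 4]
print(select_folds(5, n_folds=2))   # [0, 1]
print(select_folds(5, fold_idx=3))  # [3]
```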
- get_data_count(idxs=None)
Returns total number of data entries in requested folds
- Parameters:
idxs (Union[int,List[int],None]) – list of indices to check
- Return type:
int
- Returns:
Total number of entries in the folds
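Because idxs may be an int, a list of ints, or None, counting entries involves normalising the argument before summing. The sketch below uses a hypothetical helper, count_entries, over a plain dictionary of fold sizes; it is not LUMIN code, and treating None as "all folds" is an assumption.

```python
from typing import Dict, List, Optional, Union


def count_entries(fold_sizes: Dict[int, int],
                  idxs: Optional[Union[int, List[int]]] = None) -> int:
    """Sum the number of entries over the requested folds.

    Hypothetical helper for illustration only; not LUMIN code.
    """
    if idxs is None:
        idxs = list(fold_sizes)  # Assumption: None means every fold
    elif isinstance(idxs, int):
        idxs = [idxs]  # promote a single index to a list
    return sum(fold_sizes[i] for i in idxs)


sizes = {0: 100, 1: 100, 2: 97}
print(count_entries(sizes))          # 297
print(count_entries(sizes, 2))       # 97
print(count_entries(sizes, [0, 2]))  # 197
```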
- get_df(pred_name='pred', targ_name='targets', wgt_name='weights', n_folds=None, fold_idx=None, inc_inputs=False, inc_ignore=False, deprocess=False, verbose=True, suppress_warn=False, nan_to_num=False, inc_matrix=False)
Get a Pandas DataFrame of the data in the foldfile. Will add columns for inputs (if requested), targets, weights, and predictions (if present)
- Parameters:
pred_name (str) – name of prediction group
targ_name (str) – name of target group
wgt_name (str) – name of weight group
n_folds (Optional[int]) – number of folds to get data from. Default all folds. Not compatible with fold_idx
fold_idx (Optional[int]) – Only load group from a single, specified fold. Not compatible with n_folds
inc_inputs (bool) – whether to include input data
inc_ignore (bool) – whether to include ignored features
deprocess (bool) – whether to deprocess inputs and targets if pipelines have been added
verbose (bool) – whether to print the number of datapoints loaded
suppress_warn (bool) – whether to suppress the warning about missing columns
nan_to_num (bool) – whether to pass input data through np.nan_to_num
inc_matrix (bool) – whether to include flattened matrix data in output, if present
- Return type:
DataFrame
- Returns:
Pandas DataFrame with requested data
- get_fold(idx)
Get data for a single fold. Data consists of a dictionary of inputs, targets, and weights. Accounts for ignored features. Inputs, except for matrix data, are passed through np.nan_to_num to deal with nans and infs.
- Parameters:
idx (int) – fold index to load
- Return type:
Dict[str,ndarray]
- Returns:
dictionary of inputs, targets, and weights as Numpy arrays
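For readers unfamiliar with np.nan_to_num, its default behaviour on float64 data is to replace NaN with 0.0 and positive or negative infinity with the largest or most negative finite float. The pure-Python stand-in below mirrors that behaviour on a list of floats; the function is written for illustration and is not how LUMIN or NumPy implements it.

```python
import math
import sys
from typing import List


def nan_to_num(values: List[float]) -> List[float]:
    """Pure-Python stand-in for np.nan_to_num with default arguments."""
    out = []
    for v in values:
        if math.isnan(v):
            out.append(0.0)  # NaN becomes zero
        elif math.isinf(v):
            # +inf / -inf become the largest / most negative finite float
            out.append(math.copysign(sys.float_info.max, v))
        else:
            out.append(v)  # finite values pass through unchanged
    return out


cleaned = nan_to_num([1.5, float("nan"), float("inf"), float("-inf")])
print(all(math.isfinite(v) for v in cleaned))  # True
```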
- get_ignore()
Returns list of ignored features
- Return type:
List[str]
- Returns:
Features removed from training data
- get_use_cat_feats()
Returns list of categorical features which will be present in training data, accounting for ignored features.
- Return type:
List[str]
- Returns:
List of categorical features
- get_use_cont_feats()
Returns list of continuous features which will be present in training data, accounting for ignored features.
- Return type:
List[str]
- Returns:
List of continuous features
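"Accounting for ignored features" amounts to filtering the feature lists against the ignore list while preserving order. A minimal sketch follows; the helper used_feats and the feature names are hypothetical and chosen only for illustration.

```python
from typing import List


def used_feats(feats: List[str], ignore: List[str]) -> List[str]:
    """Return feats with ignored names removed, preserving order.

    Hypothetical helper for illustration only; not LUMIN code.
    """
    ignored = set(ignore)  # set lookup keeps the filter linear in len(feats)
    return [f for f in feats if f not in ignored]


cont_feats = ["pT", "eta", "phi", "mass"]
cat_feats = ["channel"]
ignore_feats = ["phi"]

print(used_feats(cont_feats, ignore_feats))  # ['pT', 'eta', 'mass']
print(used_feats(cat_feats, ignore_feats))   # ['channel']
```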
- save_fold_pred(pred, fold_idx, pred_name='pred')
Save predictions for given fold as a new column in the foldfile
- Parameters:
pred (ndarray) – array of predictions in the same order as data appears in the file
fold_idx (int) – index for fold
pred_name (str) – name of column to save predictions under
- Return type:
None
- class lumin.nn.data.fold_yielder.HEPAugFoldYielder(foldfile, cont_feats=None, cat_feats=None, ignore_feats=None, aug_targ_feats=None, rot_mult=2, random_rot=False, reflect_x=False, reflect_y=True, reflect_z=True, train_time_aug=True, test_time_aug=True, input_pipe=None, output_pipe=None, yield_matrix=True, matrix_pipe=None)
Bases: FoldYielder
Specialised version of FoldYielder providing HEP-specific data augmentation at train and test time.
- Parameters:
foldfile (Union[str,Path,File]) – filename of hdf5 file or opened hdf5 file
cont_feats (Optional[List[str]]) – list of names of continuous features present in input data, not required if foldfile contains meta data already
cat_feats (Optional[List[str]]) – list of names of categorical features present in input data, not required if foldfile contains meta data already
ignore_feats (Optional[List[str]]) – optional list of input features which should be ignored
aug_targ_feats (Optional[List[str]]) – optional list of target vectors to also be transformed, leave as None for no augmentation of target vectors
rot_mult (int) – number of rotations of event in phi to make at test time (currently must be even). A value greater than zero will also apply random rotations at train time
random_rot (bool) – whether test-time rotation angles should be random or in steps of 2pi/rot_mult
reflect_x (bool) – whether to reflect events in x axis at train and test time
reflect_y (bool) – whether to reflect events in y axis at train and test time
reflect_z (bool) – whether to reflect events in z axis at train and test time
train_time_aug (bool) – whether to apply augmentations at train time
test_time_aug (bool) – whether to apply augmentations at test time
input_pipe (Optional[Pipeline]) – optional Pipeline, or filename for pickled Pipeline, which was used for processing the inputs
output_pipe (Optional[Pipeline]) – optional Pipeline, or filename for pickled Pipeline, which was used for processing the targets
yield_matrix (bool) – whether to actually yield matrix data if present
matrix_pipe (Union[str,Pipeline,None]) – preprocessing pipe for matrix data
- Examples:
>>> fy = HEPAugFoldYielder('train.h5',
...                        cont_feats=['pT','eta','phi','mass'],
...                        rot_mult=2, reflect_y=True, reflect_z=True,
...                        input_pipe='input_pipe.pkl')
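The geometry behind rot_mult and the reflect_* options can be sketched on a single momentum vector. This is an illustration under stated assumptions, not LUMIN's implementation: vectors are taken to be in Cartesian coordinates (px, py, pz), a reflection in an axis is taken to mean flipping the sign of that component, and the helper names are hypothetical.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def rotation_angles(rot_mult: int) -> List[float]:
    """Evenly spaced rotation angles in steps of 2*pi/rot_mult."""
    return [i * 2 * math.pi / rot_mult for i in range(rot_mult)]


def rotate_in_phi(vec: Vec3, angle: float) -> Vec3:
    """Rotate a vector about the z axis, leaving pz unchanged."""
    px, py, pz = vec
    c, s = math.cos(angle), math.sin(angle)
    return (px * c - py * s, px * s + py * c, pz)


def reflect(vec: Vec3, reflect_x: bool = False, reflect_y: bool = False,
            reflect_z: bool = False) -> Vec3:
    """Flip the sign of each selected component (assumed meaning of 'reflect')."""
    px, py, pz = vec
    return (-px if reflect_x else px,
            -py if reflect_y else py,
            -pz if reflect_z else pz)


angles = rotation_angles(rot_mult=2)        # two steps: 0 and pi
rotated = rotate_in_phi((1.0, 0.0, 2.0), angles[1])
mirrored = reflect((1.0, 3.0, 2.0), reflect_y=True, reflect_z=True)
print(mirrored)  # (1.0, -3.0, -2.0)
```

Rotations about z preserve the transverse magnitude sqrt(px^2 + py^2), which is why such augmentations leave the physics content of an event unchanged while presenting the model with a different orientation.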
- get_fold(idx)
Get data for a single fold, applying random train-time data augmentation. Data consists of a dictionary of inputs, targets, and weights. Accounts for ignored features. Inputs, except for matrix data, are passed through np.nan_to_num to deal with nans and infs.
- Parameters:
idx (int) – fold index to load
- Return type:
Dict[str,ndarray]
- Returns:
dictionary of inputs, targets, and weights as Numpy arrays
- get_test_fold(idx, aug_idx)
Get test data for a single fold, applying test-time data augmentation. Data consists of a dictionary of inputs, targets, and weights. Accounts for ignored features. Inputs, except for matrix data, are passed through np.nan_to_num to deal with nans and infs.
- Parameters:
idx (int) – fold index to load
aug_idx (int) – index for the test-time augmentation (ignored if random test-time augmentation requested)
- Return type:
Dict[str,ndarray]
- Returns:
dictionary of inputs, targets, and weights as Numpy arrays
- class lumin.nn.data.fold_yielder.TorchGeometricFoldYielder(dataset, n_folds, fold_indices=None, shuffle=True, seed=None, batch_yielder_type=<class 'lumin.nn.data.batch_yielder.TorchGeometricBatchYielder'>)
Bases: FoldYielder
Interface class for accessing data from PyTorch Geometric datasets. The dataset will be split into sub-folds; either provide a value for the fold_indices argument with your own split as a list of lists of indices, or specify the number of folds for a random split (n_folds).
Warning
Much functionality has yet to be implemented for this class
- Parameters:
dataset (Dataset) – PyTorch Geometric Dataset containing inputs, weights, and targets
n_folds (Optional[int]) – number of folds in which to randomly split the dataset. Must provide either this or fold_indices
fold_indices (Optional[List[List[int]]]) – list of lists of indices; each list of indices is a fold. Must provide either this or n_folds
shuffle (bool) – if no fold_indices are provided, data will be split into the specified number of folds. This controls whether the indices will be shuffled beforehand or not.
seed (Optional[int]) – if no fold_indices are provided, data will be split into the specified number of folds. This sets the random seed used for shuffling, if requested.
batch_yielder_type (Type[BatchYielder]) – Class of BatchYielder to instantiate to yield inputs
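The kind of index structure expected by fold_indices, and the roles of n_folds, shuffle, and seed, can be illustrated with a small pure-Python sketch. The helper split_into_folds is hypothetical, and the striding scheme it uses is an assumption; how LUMIN actually partitions the indices may differ.

```python
import random
from typing import List, Optional


def split_into_folds(n_data: int, n_folds: int, shuffle: bool = True,
                     seed: Optional[int] = None) -> List[List[int]]:
    """Partition range(n_data) into n_folds lists of indices.

    Hypothetical helper for illustration only; not LUMIN code.
    """
    idxs = list(range(n_data))
    if shuffle:
        # A fixed seed makes the split reproducible
        random.Random(seed).shuffle(idxs)
    # Striding gives folds whose sizes differ by at most one
    return [idxs[i::n_folds] for i in range(n_folds)]


folds = split_into_folds(10, 3, shuffle=True, seed=0)
ordered = split_into_folds(10, 3, shuffle=False)
print([len(f) for f in folds])  # [4, 3, 3]
print(ordered[0])               # [0, 3, 6, 9]
```

Every index appears in exactly one fold, so a list built this way has the list-of-lists shape described for fold_indices.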
- add_ignore(feats)
Add features to ignored features.
- Parameters:
feats (Union[str,List[str]]) – list of feature names to ignore
- Return type:
None
- columns()
Returns list of columns present in foldfile
- Return type:
List[str]
- Returns:
list of columns present in foldfile
- get_column(column, n_folds=None, fold_idx=None, add_newaxis=False)
Load column (h5py group) from foldfile. Used for getting arbitrary data which isn’t automatically grabbed by other methods.
- Parameters:
column (str) – name of h5py group to get
n_folds (Optional[int]) – number of folds to get data from. Default all folds. Not compatible with fold_idx
fold_idx (Optional[int]) – Only load group from a single, specified fold. Not compatible with n_folds
add_newaxis (bool) – whether to expand the shape of the returned data if the data shape is ()
- Return type:
Optional[ndarray]
- Returns:
Numpy array of column data
- get_data(n_folds=None, fold_idx=None)
Get data for a single, specified fold or for several folds. Data consists of a dictionary of inputs, targets, and weights. Does not account for ignored features. Inputs are passed through np.nan_to_num to deal with nans and infs.
- Parameters:
n_folds (Optional[int]) – number of folds to get data from. Default all folds. Not compatible with fold_idx
fold_idx (Optional[int]) – Only load group from a single, specified fold. Not compatible with n_folds
- Return type:
Dict[str,ndarray]
- Returns:
dictionary of inputs, targets, and weights as Numpy arrays
- get_df(pred_name='pred', targ_name='targets', wgt_name='weights', n_folds=None, fold_idx=None, inc_inputs=False, inc_ignore=False, deprocess=False, verbose=True, suppress_warn=False, nan_to_num=False, inc_matrix=False)
Get a Pandas DataFrame of the data in the foldfile. Will add columns for inputs (if requested), targets, weights, and predictions (if present)
- Parameters:
pred_name (str) – name of prediction group
targ_name (str) – name of target group
wgt_name (str) – name of weight group
n_folds (Optional[int]) – number of folds to get data from. Default all folds. Not compatible with fold_idx
fold_idx (Optional[int]) – Only load group from a single, specified fold. Not compatible with n_folds
inc_inputs (bool) – whether to include input data
inc_ignore (bool) – whether to include ignored features
deprocess (bool) – whether to deprocess inputs and targets if pipelines have been added
verbose (bool) – whether to print the number of datapoints loaded
suppress_warn (bool) – whether to suppress the warning about missing columns
nan_to_num (bool) – whether to pass input data through np.nan_to_num
inc_matrix (bool) – whether to include flattened matrix data in output, if present
- Return type:
DataFrame
- Returns:
Pandas DataFrame with requested data
- get_fold(idx)
Get data for a single fold. Data consists of a slice of a PyTorch Geometric Dataset.
- Parameters:
idx (int) – fold index to load
- Return type:
Dict[str,ndarray]
- Returns:
PyTorch Geometric Dataset slice
- save_fold_pred(pred, fold_idx, pred_name='pred')
Save predictions for given fold as a new column in the foldfile
- Parameters:
pred (ndarray) – array of predictions in the same order as data appears in the file
fold_idx (int) – index for fold
pred_name (str) – name of column to save predictions under
- Return type:
None