lumin.nn.data package
Submodules
lumin.nn.data.batch_yielder module
- class lumin.nn.data.batch_yielder.BatchYielder(inputs, bs, objective, targets=None, weights=None, shuffle=True, use_weights=True, bulk_move=True, input_mask=None, drop_last=True)
Bases: object
Yields minibatches to the model during training. Each iteration provides one minibatch as a tuple of tensors: inputs, targets, and weights.
- Parameters
inputs (Union[ndarray, Tuple[ndarray, ndarray]]) – input array for (sub-)epoch
targets (Optional[ndarray]) – target array for (sub-)epoch
bs (int) – batchsize, number of data to include per minibatch
objective (str) – ‘classification’, ‘multiclass classification’, or ‘regression’. Used for casting target dtype.
weights (Optional[ndarray]) – Optional weight array for (sub-)epoch
shuffle (bool) – whether to shuffle the data at the beginning of an iteration
use_weights (bool) – if passed weights, whether to actually pass them to the model
bulk_move (bool) – whether to move all data to device at once. Default is true (saves time), but if device has low memory you can set to False.
input_mask (Optional[ndarray]) – optionally only use Boolean-masked inputs
drop_last (bool) – whether to drop the last batch if it does not contain bs elements
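The interplay of bs, shuffle, and drop_last can be illustrated with a minimal pure-Python sketch. This is not LUMIN's implementation: the helper name iter_minibatches and its seed argument are hypothetical, and plain lists stand in for the tensors the real class yields.

```python
import random
from typing import Iterator, List, Optional, Sequence, Tuple


def iter_minibatches(
    inputs: Sequence[float],
    targets: Sequence[float],
    bs: int,
    shuffle: bool = True,
    drop_last: bool = True,
    seed: Optional[int] = None,
) -> Iterator[Tuple[List[float], List[float]]]:
    """Hypothetical helper: yield (inputs, targets) minibatches of size bs."""
    idxs = list(range(len(inputs)))
    if shuffle:
        # Shuffle once, at the beginning of the iteration
        random.Random(seed).shuffle(idxs)
    for start in range(0, len(idxs), bs):
        batch = idxs[start:start + bs]
        if drop_last and len(batch) < bs:
            # Only the final batch can be short, so stop here
            break
        yield [inputs[i] for i in batch], [targets[i] for i in batch]


xs = list(range(10))
ys = [2 * x for x in xs]
kept = list(iter_minibatches(xs, ys, bs=4, shuffle=False, drop_last=True))
full = list(iter_minibatches(xs, ys, bs=4, shuffle=False, drop_last=False))
```

With ten entries and bs=4, drop_last=True gives two full batches, while drop_last=False adds a third batch holding the two leftover entries.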
lumin.nn.data.fold_yielder module
- class lumin.nn.data.fold_yielder.FoldYielder(foldfile, cont_feats=None, cat_feats=None, ignore_feats=None, input_pipe=None, output_pipe=None, yield_matrix=True, matrix_pipe=None)
Bases: object
Interface class for accessing data from foldfiles created by df2foldfile()
- Parameters
foldfile (Union[str, Path, File]) – filename of hdf5 file or opened hdf5 file
cont_feats (Optional[List[str]]) – list of names of continuous features present in input data, not required if foldfile contains meta data already
cat_feats (Optional[List[str]]) – list of names of categorical features present in input data, not required if foldfile contains meta data already
ignore_feats (Optional[List[str]]) – optional list of input features which should be ignored
input_pipe (Union[str, Pipeline, Path, None]) – optional Pipeline, or filename for pickled Pipeline, which was used for processing the inputs
output_pipe (Union[str, Pipeline, Path, None]) – optional Pipeline, or filename for pickled Pipeline, which was used for processing the targets
yield_matrix (bool) – whether to actually yield matrix data if present
matrix_pipe (Union[str, Pipeline, Path, None]) – preprocessing pipe for matrix data
- Examples::
>>> fy = FoldYielder('train.h5')
>>>
>>> fy = FoldYielder('train.h5', ignore_feats=['phi'], input_pipe='input_pipe.pkl')
>>>
>>> fy = FoldYielder('train.h5', input_pipe=input_pipe, matrix_pipe=matrix_pipe)
>>>
>>> fy = FoldYielder('train.h5', input_pipe=input_pipe, yield_matrix=False)
- add_ignore(feats)
Add features to ignored features.
- Parameters
feats (Union[str, List[str]]) – list of feature names to ignore
- Return type
None
- add_input_pipe(input_pipe)
Adds an input pipe to the FoldYielder for use when deprocessing data
- Parameters
input_pipe (Union[str, Pipeline]) – Pipeline which was used for preprocessing the input data or name of pkl file containing Pipeline
- Return type
None
- add_input_pipe_from_file(name)
Adds an input pipe from a pkl file to the FoldYielder for use when deprocessing data
- Parameters
name (Union[str, Path]) – name of pkl file containing Pipeline which was used for preprocessing the input data
- Return type
None
- add_matrix_pipe(matrix_pipe)
Adds a matrix pipe to the FoldYielder for use when deprocessing data
Warning
Deprocessing matrix data is not yet implemented
- Parameters
matrix_pipe (Union[str, Pipeline]) – Pipeline which was used for preprocessing the matrix data or name of pkl file containing Pipeline
- Return type
None
- add_matrix_pipe_from_file(name)
Adds a matrix pipe from a pkl file to the FoldYielder for use when deprocessing data
- Parameters
name (str) – name of pkl file containing Pipeline which was used for preprocessing the matrix data
- Return type
None
- add_output_pipe(output_pipe)
Adds an output pipe to the FoldYielder for use when deprocessing data
- Parameters
output_pipe (Union[str, Pipeline]) – Pipeline which was used for preprocessing the target data or name of pkl file containing Pipeline
- Return type
None
- add_output_pipe_from_file(name)
Adds an output pipe from a pkl file to the FoldYielder for use when deprocessing data
- Parameters
name (Union[str, Path]) – name of pkl file containing Pipeline which was used for preprocessing the target data
- Return type
None
- columns()
Returns list of columns present in foldfile
- Return type
List[str]
- Returns
list of columns present in foldfile
- get_column(column, n_folds=None, fold_idx=None, add_newaxis=False)
Load column (h5py group) from foldfile. Used for getting arbitrary data which isn’t automatically grabbed by other methods.
- Parameters
column (str) – name of h5py group to get
n_folds (Optional[int]) – number of folds to get data from. Default all folds. Not compatible with fold_idx
fold_idx (Optional[int]) – Only load group from a single, specified fold. Not compatible with n_folds
add_newaxis (bool) – whether to expand the shape of the returned data if the data shape is ()
- Return type
Optional[ndarray]
- Returns
Numpy array of column data
- get_data(n_folds=None, fold_idx=None)
Get data for a single, specified fold or for several folds. Data consists of dictionary of inputs, targets, and weights. Does not account for ignored features. Inputs are passed through np.nan_to_num to deal with nans and infs.
- Parameters
n_folds (Optional[int]) – number of folds to get data from. Default all folds. Not compatible with fold_idx
fold_idx (Optional[int]) – Only load group from a single, specified fold. Not compatible with n_folds
- Return type
Dict[str, ndarray]
- Returns
dictionary of inputs, targets, and weights as Numpy arrays
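What passing inputs through np.nan_to_num means in practice can be sketched without NumPy. The function below is a hypothetical pure-Python imitation of NumPy's documented default behaviour for 64-bit floats (NaN becomes 0.0, infinities become the largest finite value of matching sign); it is not part of LUMIN.

```python
import math
import sys
from typing import List


def nan_to_num_sketch(values: List[float]) -> List[float]:
    """Hypothetical stand-in for np.nan_to_num with its default arguments."""
    out = []
    for v in values:
        if math.isnan(v):
            # NaN is replaced by zero
            out.append(0.0)
        elif math.isinf(v):
            # +inf / -inf are replaced by the largest finite float of the same sign
            out.append(math.copysign(sys.float_info.max, v))
        else:
            out.append(v)
    return out


cleaned = nan_to_num_sketch([1.5, float("nan"), float("inf"), float("-inf")])
```

Note that a replaced NaN cannot be told apart from a genuine zero afterwards, which is worth bearing in mind when inspecting loaded inputs.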
- get_data_count(idxs)
Returns total number of data entries in requested folds
- Parameters
idxs (Union[int, List[int]]) – list of indices to check
- Return type
int
- Returns
Total number of entries in the folds
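Since idxs may be either a single index or a list of indices, the counting logic can be sketched as below. This is an illustration only: count_entries and the fold_sizes mapping are hypothetical stand-ins for whatever the foldfile stores, not LUMIN code.

```python
from typing import Dict, List, Union


def count_entries(fold_sizes: Dict[int, int], idxs: Union[int, List[int]]) -> int:
    """Hypothetical helper: total number of entries across the requested folds."""
    if isinstance(idxs, int):
        # Promote a single fold index to a one-element list
        idxs = [idxs]
    return sum(fold_sizes[i] for i in idxs)


# Assumed example: three folds, the last one slightly smaller
fold_sizes = {0: 100, 1: 100, 2: 95}
single = count_entries(fold_sizes, 2)
several = count_entries(fold_sizes, [0, 2])
```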
- get_df(pred_name='pred', targ_name='targets', wgt_name='weights', n_folds=None, fold_idx=None, inc_inputs=False, inc_ignore=False, deprocess=False, verbose=True, suppress_warn=False, nan_to_num=False, inc_matrix=False)
Get a Pandas DataFrame of the data in the foldfile. Adds columns for inputs (if requested), targets, weights, and predictions (if present).
- Parameters
pred_name (str) – name of prediction group
targ_name (str) – name of target group
wgt_name (str) – name of weight group
n_folds (Optional[int]) – number of folds to get data from. Default all folds. Not compatible with fold_idx
fold_idx (Optional[int]) – Only load group from a single, specified fold. Not compatible with n_folds
inc_inputs (bool) – whether to include input data
inc_ignore (bool) – whether to include ignored features
deprocess (bool) – whether to deprocess inputs and targets if pipelines have been
verbose (bool) – whether to print the number of datapoints loaded
suppress_warn (bool) – whether to suppress the warning about missing columns
nan_to_num (bool) – whether to pass input data through np.nan_to_num
inc_matrix (bool) – whether to include flattened matrix data in output, if present
- Return type
DataFrame
- Returns
Pandas DataFrame with requested data
- get_fold(idx)
Get data for a single fold. Data consists of dictionary of inputs, targets, and weights. Accounts for ignored features. Inputs, except for matrix data, are passed through np.nan_to_num to deal with nans and infs.
- Parameters
idx (int) – fold index to load
- Return type
Dict[str, ndarray]
- Returns
dictionary of inputs, targets, and weights as Numpy arrays
- get_ignore()
Returns list of ignored features
- Return type
List[str]
- Returns
Features removed from training data
- get_use_cat_feats()
Returns list of categorical features which will be present in training data, accounting for ignored features.
- Return type
List[str]
- Returns
List of categorical features
- get_use_cont_feats()
Returns list of continuous features which will be present in training data, accounting for ignored features.
- Return type
List[str]
- Returns
List of continuous features
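"Accounting for ignored features" amounts to filtering the full feature list against the ignore list. A minimal sketch of that idea follows; the function used_feats and the feature names are hypothetical and chosen only for illustration, not taken from LUMIN.

```python
from typing import List


def used_feats(all_feats: List[str], ignore_feats: List[str]) -> List[str]:
    """Hypothetical helper: features that remain after removing ignored ones."""
    ignored = set(ignore_feats)
    # Preserve the original ordering of the features
    return [f for f in all_feats if f not in ignored]


cont_feats = ["pT", "eta", "phi", "mass"]
cat_feats = ["charge", "n_jets"]
ignore = ["phi", "n_jets"]

use_cont = used_feats(cont_feats, ignore)
use_cat = used_feats(cat_feats, ignore)
```

A single ignore list is applied to both the continuous and the categorical features here, mirroring the single ignore_feats argument of the constructor.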
- save_fold_pred(pred, fold_idx, pred_name='pred')
Save predictions for given fold as a new column in the foldfile
- Parameters
pred (ndarray) – array of predictions in the same order as data appears in the file
fold_idx (int) – index for fold
pred_name (str) – name of column to save predictions under
- Return type
None
- class lumin.nn.data.fold_yielder.HEPAugFoldYielder(foldfile, cont_feats=None, cat_feats=None, ignore_feats=None, targ_feats=None, rot_mult=2, random_rot=False, reflect_x=False, reflect_y=True, reflect_z=True, train_time_aug=True, test_time_aug=True, input_pipe=None, output_pipe=None, yield_matrix=True, matrix_pipe=None)
Bases: lumin.nn.data.fold_yielder.FoldYielder
Specialised version of FoldYielder providing HEP-specific data augmentation at train and test time.
- Parameters
foldfile (Union[str, Path, File]) – filename of hdf5 file or opened hdf5 file
cont_feats (Optional[List[str]]) – list of names of continuous features present in input data, not required if foldfile contains meta data already
cat_feats (Optional[List[str]]) – list of names of categorical features present in input data, not required if foldfile contains meta data already
ignore_feats (Optional[List[str]]) – optional list of input features which should be ignored
targ_feats (Optional[List[str]]) – optional list of target features to also be transformed
rot_mult (int) – number of rotations of event in phi to make at test-time (currently must be even). Greater than zero will also apply random rotations during train-time
random_rot (bool) – whether test-time rotation angles should be random or in steps of 2pi/rot_mult
reflect_x (bool) – whether to reflect events in x axis at train and test time
reflect_y (bool) – whether to reflect events in y axis at train and test time
reflect_z (bool) – whether to reflect events in z axis at train and test time
train_time_aug (bool) – whether to apply augmentations at train time
test_time_aug (bool) – whether to apply augmentations at test time
input_pipe (Optional[Pipeline]) – optional Pipeline, or filename for pickled Pipeline, which was used for processing the inputs
output_pipe (Optional[Pipeline]) – optional Pipeline, or filename for pickled Pipeline, which was used for processing the targets
yield_matrix (bool) – whether to actually yield matrix data if present
matrix_pipe (Union[str, Pipeline, None]) – preprocessing pipe for matrix data
- Examples::
>>> fy = HEPAugFoldYielder('train.h5',
...                        cont_feats=['pT','eta','phi','mass'],
...                        rot_mult=2, reflect_y=True, reflect_z=True,
...                        input_pipe='input_pipe.pkl')
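The difference between random test-time rotations and rotations in steps of 2pi/rot_mult can be sketched as follows. This is an assumption-laden illustration rather than LUMIN's code: rotation_angles, rotate_xy, and the seed argument are hypothetical, and the sketch says nothing about how the library combines rotations with reflections or maps aug_idx to a particular augmentation.

```python
import math
import random
from typing import List, Optional, Tuple


def rotation_angles(rot_mult: int, random_rot: bool = False,
                    seed: Optional[int] = None) -> List[float]:
    """Hypothetical helper: one rotation angle in phi per test-time rotation."""
    if random_rot:
        rng = random.Random(seed)
        # Random angles drawn uniformly from [0, 2pi]
        return [rng.uniform(0.0, 2.0 * math.pi) for _ in range(rot_mult)]
    # Fixed angles in steps of 2pi/rot_mult
    return [i * 2.0 * math.pi / rot_mult for i in range(rot_mult)]


def rotate_xy(px: float, py: float, angle: float) -> Tuple[float, float]:
    """Rotate a transverse vector (px, py) about the z axis by angle."""
    c, s = math.cos(angle), math.sin(angle)
    return px * c - py * s, px * s + py * c


angles = rotation_angles(rot_mult=4)
px, py = rotate_xy(1.0, 0.0, angles[1])  # quarter turn of a vector along x
```

A rotation in phi leaves the transverse magnitude sqrt(px^2 + py^2) unchanged, which is why it is a natural augmentation for data with azimuthal symmetry.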
- get_fold(idx)
Get data for a single fold, applying random train-time data augmentation. Data consists of dictionary of inputs, targets, and weights. Accounts for ignored features. Inputs, except for matrix data, are passed through np.nan_to_num to deal with nans and infs.
- Parameters
idx (int) – fold index to load
- Return type
Dict[str, ndarray]
- Returns
dictionary of inputs, targets, and weights as Numpy arrays
- get_test_fold(idx, aug_idx)
Get test data for a single fold, applying test-time data augmentation. Data consists of dictionary of inputs, targets, and weights. Accounts for ignored features. Inputs, except for matrix data, are passed through np.nan_to_num to deal with nans and infs.
- Parameters
idx (int) – fold index to load
aug_idx (int) – index for the test-time augmentation (ignored if random test-time augmentation requested)
- Return type
Dict[str, ndarray]
- Returns
dictionary of inputs, targets, and weights as Numpy arrays