lumin.nn.data package
Submodules
lumin.nn.data.batch_yielder module
class lumin.nn.data.batch_yielder.BatchYielder(inputs, targets, bs, objective, weights=None, shuffle=True, use_weights=True, bulk_move=True)
Bases: object
Yields minibatches to the model during training. Each iteration provides one minibatch as a tuple of tensors: inputs, targets, and weights.
- Parameters
inputs (ndarray) – input array for the (sub-)epoch
targets (ndarray) – target array for the (sub-)epoch
bs (int) – batch size: number of data points to include per minibatch
objective (str) – ‘classification’, ‘multiclass classification’, or ‘regression’. Used for casting the target dtype.
weights (Optional[ndarray]) – optional weight array for the (sub-)epoch
shuffle – whether to shuffle the data at the beginning of an iteration
use_weights (bool) – if weights are passed, whether to actually pass them to the model
bulk_move – whether to move all data to the device at once. Default is True (saves time), but it can be set to False if the device has low memory.
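The iteration described above can be sketched in plain Python. This is an illustration under stated assumptions, not LUMIN's implementation: `iter_minibatches` and its `seed` argument are hypothetical, it works on lists rather than tensors, it skips the dtype casting and device transfer, and it keeps the final partial batch, which the library may handle differently.

```python
import random
from typing import Any, Iterator, List, Optional, Sequence, Tuple

Batch = Tuple[List[Any], List[Any], Optional[List[Any]]]


def iter_minibatches(inputs: Sequence[Any], targets: Sequence[Any], bs: int,
                     weights: Optional[Sequence[Any]] = None, shuffle: bool = True,
                     use_weights: bool = True, seed: Optional[int] = None) -> Iterator[Batch]:
    """Hypothetical stand-in for BatchYielder iteration (not LUMIN code)."""
    order = list(range(len(inputs)))
    if shuffle:
        # Shuffle once at the start of each pass over the data
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), bs):
        idxs = order[start:start + bs]
        x = [inputs[i] for i in idxs]
        y = [targets[i] for i in idxs]
        # Weights are only yielded if they were supplied and requested
        w = [weights[i] for i in idxs] if (weights is not None and use_weights) else None
        yield x, y, w


batches = list(iter_minibatches([0.1, 0.2, 0.3, 0.4, 0.5], [0, 1, 0, 1, 1], bs=2,
                                weights=[1.0, 1.0, 2.0, 2.0, 1.0], shuffle=False))
```

With `shuffle=False` the batches follow the input order, which makes the slicing easy to inspect; setting `use_weights=False` yields `None` in the weight slot even when weights were supplied.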
lumin.nn.data.fold_yielder module
class lumin.nn.data.fold_yielder.FoldYielder(foldfile, cont_feats, cat_feats, ignore_feats=None, input_pipe=None, output_pipe=None)
Bases: object
Interface class for accessing data from foldfiles created by df2foldfile().
- Parameters
foldfile (File) – filename of hdf5 file
cont_feats (List[str]) – list of names of continuous features present in input data
cat_feats (List[str]) – list of names of categorical features present in input data
ignore_feats (Optional[List[str]]) – optional list of input features which should be ignored
input_pipe (Union[str, Pipeline, None]) – optional Pipeline, or filename for pickled Pipeline, which was used for processing the inputs
output_pipe (Union[str, Pipeline, None]) – optional Pipeline, or filename for pickled Pipeline, which was used for processing the targets
- Examples::
>>> fy = FoldYielder('train.h5', cont_feats=['pT','eta','phi','mass'],
...                  cat_feats=['channel'], ignore_feats=['phi'],
...                  input_pipe='input_pipe.pkl')
add_ignore(feats)
Add features to the list of ignored features.
- Parameters
feats (List[str]) – list of feature names to ignore
- Return type
None
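As an illustration of what ignoring features means for the input columns, the sketch below filters a list of feature names and records which column indices survive. `kept_columns` is a hypothetical helper, and the column ordering (continuous features followed by categorical ones) is an assumption; FoldYielder's internal bookkeeping is not described on this page.

```python
from typing import List, Sequence, Tuple


def kept_columns(cont_feats: Sequence[str], cat_feats: Sequence[str],
                 ignore_feats: Sequence[str]) -> Tuple[List[str], List[int]]:
    """Hypothetical helper: names and column indices left after ignoring features."""
    # Assumed layout: continuous columns first, then categorical columns
    all_feats = list(cont_feats) + list(cat_feats)
    ignore = set(ignore_feats)
    idxs = [i for i, feat in enumerate(all_feats) if feat not in ignore]
    return [all_feats[i] for i in idxs], idxs


names, idxs = kept_columns(['pT', 'eta', 'phi', 'mass'], ['channel'], ['phi'])
```

Here `names` no longer contains 'phi', and `idxs` could be used to select the remaining columns from an input matrix.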
add_input_pipe(input_pipe)
Adds an input pipe to the FoldYielder for use when deprocessing data.
- Parameters
input_pipe (Pipeline) – Pipeline which was used for preprocessing the input data
- Return type
None
add_input_pipe_from_file(name)
Adds an input pipe from a pkl file to the FoldYielder for use when deprocessing data.
- Parameters
name (str) – name of pkl file containing the Pipeline which was used for preprocessing the input data
- Return type
None
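Loading a pickled object by filename can be sketched with the standard library alone. `load_pipe` is a hypothetical helper, and a plain dict stands in for the Pipeline so that the example runs without extra dependencies; it assumes the file was written with `pickle.dump`.

```python
import os
import pickle
import tempfile
from typing import Any


def load_pipe(name: str) -> Any:
    """Hypothetical helper: read a pickled preprocessing object from disk."""
    with open(name, 'rb') as fin:
        return pickle.load(fin)


# A dict stands in for a fitted Pipeline in this sketch
stand_in = {'steps': ['scale', 'pca']}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'input_pipe.pkl')
    with open(path, 'wb') as fout:
        pickle.dump(stand_in, fout)
    loaded = load_pipe(path)
```

Unpickling executes code stored in the file, so only load pipeline files from a trusted source.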
add_output_pipe(output_pipe)
Adds an output pipe to the FoldYielder for use when deprocessing data.
- Parameters
output_pipe (Pipeline) – Pipeline which was used for preprocessing the target data
- Return type
None
add_output_pipe_from_file(name)
Adds an output pipe from a pkl file to the FoldYielder for use when deprocessing data.
- Parameters
name (str) – name of pkl file containing the Pipeline which was used for preprocessing the target data
- Return type
None
get_column(column, n_folds=None, fold_idx=None, add_newaxis=False)
Load a column (h5py group) from the foldfile. Used for getting arbitrary data which isn’t automatically grabbed by other methods.
- Parameters
column (str) – name of h5py group to get
n_folds (Optional[int]) – number of folds to get data from. Default all folds. Not compatible with fold_idx
fold_idx (Optional[int]) – only load the group from a single, specified fold. Not compatible with n_folds
add_newaxis (bool) – whether to expand the shape of the returned data if the data shape is ()
- Return type
Optional[ndarray]
- Returns
Numpy array of column data
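The interplay between n_folds and fold_idx can be sketched as a small selection routine. `resolve_folds` is a hypothetical helper; raising an error when both arguments are set and taking the first n_folds folds in order are assumptions of this sketch, not documented behaviour of the library.

```python
from typing import List, Optional


def resolve_folds(total_folds: int, n_folds: Optional[int] = None,
                  fold_idx: Optional[int] = None) -> List[int]:
    """Hypothetical helper: decide which fold indices to read."""
    if n_folds is not None and fold_idx is not None:
        # Assumption: treat the incompatible combination as an error
        raise ValueError('n_folds and fold_idx cannot both be set')
    if fold_idx is not None:
        return [fold_idx]  # a single, specified fold
    if n_folds is None:
        return list(range(total_folds))  # default: all folds
    # Assumption: take the first n_folds folds, capped at the number available
    return list(range(min(n_folds, total_folds)))


all_folds = resolve_folds(10)
first_three = resolve_folds(10, n_folds=3)
single = resolve_folds(10, fold_idx=4)
```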
get_data(n_folds=None, fold_idx=None)
Get data for a single, specified fold or for several folds. Data consists of a dictionary of inputs, targets, and weights. Does not account for ignored features. Inputs are passed through np.nan_to_num to deal with nans and infs.
- Parameters
n_folds (Optional[int]) – number of folds to get data from. Default all folds. Not compatible with fold_idx
fold_idx (Optional[int]) – only load the group from a single, specified fold. Not compatible with n_folds
- Return type
Dict[str, ndarray]
- Returns
dictionary of inputs, targets, and weights as Numpy arrays
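For reference, the snippet below shows what np.nan_to_num does with its default arguments: NaN becomes 0.0, and positive and negative infinity become the largest and smallest finite values of the array's dtype. The input array is made up for illustration, and the example requires NumPy.

```python
import numpy as np

# Made-up inputs containing a NaN and both infinities
inputs = np.array([[1.5, np.nan], [np.inf, -np.inf]])

# Default behaviour: nan -> 0.0, +/-inf -> largest/smallest finite float64
cleaned = np.nan_to_num(inputs)
```

Because infinities map to very large finite numbers rather than zero, downstream scaling may still need to handle those entries.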
get_df(pred_name='pred', targ_name='targets', wgt_name='weights', n_folds=None, fold_idx=None, inc_inputs=False, inc_ignore=False, deprocess=False, verbose=True, suppress_warn=False)
Get a Pandas DataFrame of the data in the foldfile. Will add columns for inputs (if requested), targets, weights, and predictions (if present).
- Parameters
pred_name (str) – name of prediction group
targ_name (str) – name of target group
wgt_name (str) – name of weight group
n_folds (Optional[int]) – number of folds to get data from. Default all folds. Not compatible with fold_idx
fold_idx (Optional[int]) – only load the group from a single, specified fold. Not compatible with n_folds
inc_inputs (bool) – whether to include input data
inc_ignore (bool) – whether to include ignored features
deprocess (bool) – whether to deprocess inputs and targets if pipelines have been added
verbose (bool) – whether to print the number of datapoints loaded
suppress_warn (bool) – whether to suppress the warning about missing columns
- Return type
DataFrame
- Returns
Pandas DataFrame with requested data
get_fold(idx)
Get data for a single fold. Data consists of a dictionary of inputs, targets, and weights. Accounts for ignored features. Inputs are passed through np.nan_to_num to deal with nans and infs.
- Parameters
idx (int) – fold index to load
- Return type
Dict[str, ndarray]
- Returns
dictionary of inputs, targets, and weights as Numpy arrays
get_ignore()
Returns the list of ignored features.
- Return type
List[str]
- Returns
Features removed from training data
save_fold_pred(pred, fold_idx, pred_name='pred')
Save predictions for a given fold as a new column in the foldfile.
- Parameters
pred (ndarray) – array of predictions in the same order as the data appears in the file
fold_idx (int) – index of the fold
pred_name (str) – name of column to save predictions under
- Return type
None
class lumin.nn.data.fold_yielder.HEPAugFoldYielder(foldfile, cont_feats, cat_feats, ignore_feats=None, targ_feats=None, rot_mult=2, random_rot=False, reflect_x=False, reflect_y=True, reflect_z=True, train_time_aug=True, test_time_aug=True, input_pipe=None, output_pipe=None)
Bases: lumin.nn.data.fold_yielder.FoldYielder
Specialised version of FoldYielder providing HEP-specific data augmentation at train and test time.
- Parameters
foldfile (File) – filename of hdf5 file
cont_feats (List[str]) – list of names of continuous features present in input data
cat_feats (List[str]) – list of names of categorical features present in input data
ignore_feats (Optional[List[str]]) – optional list of input features which should be ignored
targ_feats (Optional[List[str]]) – optional list of target features to also be transformed
rot_mult (int) – number of rotations of the event in phi to make at test time (currently must be even). A value greater than zero will also apply random rotations at train time
random_rot (bool) – whether test-time rotation angles should be random or in steps of 2pi/rot_mult
reflect_x (bool) – whether to reflect events in the x axis at train and test time
reflect_y (bool) – whether to reflect events in the y axis at train and test time
reflect_z (bool) – whether to reflect events in the z axis at train and test time
train_time_aug (bool) – whether to apply augmentations at train time
test_time_aug (bool) – whether to apply augmentations at test time
input_pipe (Optional[Pipeline]) – optional Pipeline, or filename for pickled Pipeline, which was used for processing the inputs
output_pipe (Optional[Pipeline]) – optional Pipeline, or filename for pickled Pipeline, which was used for processing the targets
- Examples::
>>> fy = HEPAugFoldYielder('train.h5',
...                        cont_feats=['pT','eta','phi','mass'],
...                        cat_feats=['channel'],
...                        rot_mult=2, reflect_y=True, reflect_z=True,
...                        input_pipe='input_pipe.pkl')
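The fixed test-time rotations "in steps of 2pi/rot_mult" can be worked through with a short sketch. Both functions are hypothetical, and the sketch assumes the first angle is zero and that the rotation acts on Cartesian transverse components (px, py); how the library combines rotations with reflections, or maps aug_idx onto them, is not specified on this page.

```python
import math
from typing import List, Tuple


def fixed_rotation_angles(rot_mult: int) -> List[float]:
    """Hypothetical helper: evenly spaced angles in steps of 2*pi/rot_mult."""
    # Assumption: the sequence starts at an angle of zero
    return [i * 2 * math.pi / rot_mult for i in range(rot_mult)]


def rotate_in_phi(px: float, py: float, angle: float) -> Tuple[float, float]:
    """Rotate transverse components (px, py) by `angle` about the z axis."""
    return (px * math.cos(angle) - py * math.sin(angle),
            px * math.sin(angle) + py * math.cos(angle))


angles = fixed_rotation_angles(2)  # rot_mult=2 gives angles of 0 and pi
px, py = rotate_in_phi(1.0, 0.0, angles[1])
```

With rot_mult=2 the second angle is pi, so a vector along +x is mapped to one along -x while its magnitude is unchanged.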
get_fold(idx)
Get data for a single fold, applying random train-time data augmentation. Data consists of a dictionary of inputs, targets, and weights. Accounts for ignored features. Inputs are passed through np.nan_to_num to deal with nans and infs.
- Parameters
idx (int) – fold index to load
- Return type
Dict[str, ndarray]
- Returns
dictionary of inputs, targets, and weights as Numpy arrays
get_test_fold(idx, aug_idx)
Get test data for a single fold, applying test-time data augmentation. Data consists of a dictionary of inputs, targets, and weights. Accounts for ignored features. Inputs are passed through np.nan_to_num to deal with nans and infs.
- Parameters
idx (int) – fold index to load
aug_idx (int) – index for the test-time augmentation (ignored if random test-time augmentation is requested)
- Return type
Dict[str, ndarray]
- Returns
dictionary of inputs, targets, and weights as Numpy arrays