lumin.nn.data package

Submodules

lumin.nn.data.batch_yielder module

class lumin.nn.data.batch_yielder.BatchYielder(inputs, targets, bs, objective, weights=None, shuffle=True, use_weights=True, bulk_move=True)[source]

Bases: object

Yields minibatches to model during training. Iteration provides one minibatch as tuple of tensors of inputs, targets, and weights.

Parameters
  • inputs (Union[ndarray, Tuple[ndarray, ndarray]]) – input array for (sub-)epoch

  • targets (ndarray) – target array for (sub-)epoch

  • bs (int) – batch size, number of data points to include per minibatch

  • objective (str) – ‘classification’, ‘multiclass classification’, or ‘regression’. Used for casting target dtype.

  • weights (Optional[ndarray]) – Optional weight array for (sub-)epoch

  • shuffle (bool) – whether to shuffle the data at the beginning of an iteration

  • use_weights (bool) – if passed weights, whether to actually pass them to the model

  • bulk_move (bool) – whether to move all data to device at once. Default is True (saves time); set to False if the device has limited memory.
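
The iteration described above can be pictured with a minimal, pure-Python sketch. This is not LUMIN's implementation: the function name iterate_minibatches and the seed argument are hypothetical, plain lists stand in for tensors, and keeping the final partial batch is an assumption of the sketch that may differ from what BatchYielder does.

```python
import random
from typing import Iterator, List, Optional, Sequence, Tuple


def iterate_minibatches(inputs: Sequence, targets: Sequence, bs: int,
                        weights: Optional[Sequence] = None,
                        shuffle: bool = True,
                        seed: Optional[int] = None
                        ) -> Iterator[Tuple[List, List, Optional[List]]]:
    """Yield (inputs, targets, weights) minibatches of at most `bs` items.

    Hypothetical helper for illustration only; not part of LUMIN.
    """
    idxs = list(range(len(inputs)))
    if shuffle:
        # Shuffle once at the beginning of the iteration
        random.Random(seed).shuffle(idxs)
    for start in range(0, len(idxs), bs):
        batch = idxs[start:start + bs]
        x = [inputs[i] for i in batch]
        y = [targets[i] for i in batch]
        # Weights are optional; yield None when no weights were passed
        w = [weights[i] for i in batch] if weights is not None else None
        yield x, y, w


batches = list(iterate_minibatches([[0.1], [0.2], [0.3], [0.4], [0.5]],
                                   [0, 1, 0, 1, 1], bs=2, shuffle=False))
```

With five data points and bs=2, this sketch yields three minibatches, the last of which holds a single data point.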

get_inputs(on_device=False)[source]

Return type

Union[Tensor, Tuple[Tensor, Tensor]]

lumin.nn.data.fold_yielder module

class lumin.nn.data.fold_yielder.FoldYielder(foldfile, cont_feats=None, cat_feats=None, ignore_feats=None, input_pipe=None, output_pipe=None, yield_matrix=True, matrix_pipe=None)[source]

Bases: object

Interface class for accessing data from foldfiles created by df2foldfile()

Parameters
  • foldfile (Union[str, Path, File]) – filename of hdf5 file or opened hdf5 file

  • cont_feats (Optional[List[str]]) – list of names of continuous features present in input data, not required if foldfile contains meta data already

  • cat_feats (Optional[List[str]]) – list of names of categorical features present in input data, not required if foldfile contains meta data already

  • ignore_feats (Optional[List[str]]) – optional list of input features which should be ignored

  • input_pipe (Union[str, Pipeline, Path, None]) – optional Pipeline, or filename for pickled Pipeline, which was used for processing the inputs

  • output_pipe (Union[str, Pipeline, Path, None]) – optional Pipeline, or filename for pickled Pipeline, which was used for processing the targets

  • yield_matrix (bool) – whether to actually yield matrix data if present

  • matrix_pipe (Union[str, Pipeline, Path, None]) – preprocessing pipe for matrix data

Examples:
>>> fy = FoldYielder('train.h5')
>>>
>>> fy = FoldYielder('train.h5', ignore_feats=['phi'], input_pipe='input_pipe.pkl')
>>>
>>> fy = FoldYielder('train.h5', input_pipe=input_pipe, matrix_pipe=matrix_pipe)
>>>
>>> fy = FoldYielder('train.h5', input_pipe=input_pipe, yield_matrix=False)

add_ignore(feats)[source]

Add features to ignored features.

Parameters

feats (List[str]) – list of feature names to ignore

Return type

None

add_input_pipe(input_pipe)[source]

Adds an input pipe to the FoldYielder for use when deprocessing data

Parameters

input_pipe (Union[str, Pipeline]) – Pipeline which was used for preprocessing the input data or name of pkl file containing Pipeline

Return type

None

add_input_pipe_from_file(name)[source]

Adds an input pipe from a pkl file to the FoldYielder for use when deprocessing data

Parameters

name (str) – name of pkl file containing Pipeline which was used for preprocessing the input data

Return type

None

add_matrix_pipe(matrix_pipe)[source]

Adds a matrix pipe to the FoldYielder for use when deprocessing data

Warning

Deprocessing matrix data is not yet implemented

Parameters

matrix_pipe (Union[str, Pipeline]) – Pipeline which was used for preprocessing the matrix data or name of pkl file containing Pipeline

Return type

None

add_matrix_pipe_from_file(name)[source]

Adds a matrix pipe from a pkl file to the FoldYielder for use when deprocessing data

Parameters

name (str) – name of pkl file containing Pipeline which was used for preprocessing the matrix data

Return type

None

add_output_pipe(output_pipe)[source]

Adds an output pipe to the FoldYielder for use when deprocessing data

Parameters

output_pipe (Union[str, Pipeline]) – Pipeline which was used for preprocessing the target data or name of pkl file containing Pipeline

Return type

None

add_output_pipe_from_file(name)[source]

Adds an output pipe from a pkl file to the FoldYielder for use when deprocessing data

Parameters

name (str) – name of pkl file containing Pipeline which was used for preprocessing the target data

Return type

None

close()[source]

Closes the foldfile

Return type

None

columns()[source]

Returns list of columns present in foldfile

Return type

List[str]

Returns

list of columns present in foldfile

get_column(column, n_folds=None, fold_idx=None, add_newaxis=False)[source]

Load column (h5py group) from foldfile. Used for getting arbitrary data which isn’t automatically grabbed by other methods.

Parameters
  • column (str) – name of h5py group to get

  • n_folds (Optional[int]) – number of folds to get data from. Default all folds. Not compatible with fold_idx

  • fold_idx (Optional[int]) – Only load group from a single, specified fold. Not compatible with n_folds

  • add_newaxis (bool) – whether to expand the shape of the returned data if the data shape is ()

Return type

Optional[ndarray]

Returns

Numpy array of column data

get_data(n_folds=None, fold_idx=None)[source]

Get data for a single, specified fold or for several folds. Data consists of a dictionary of inputs, targets, and weights. Does not account for ignored features. Inputs are passed through np.nan_to_num to deal with nans and infs.

Parameters
  • n_folds (Optional[int]) – number of folds to get data from. Default all folds. Not compatible with fold_idx

  • fold_idx (Optional[int]) – Only load group from a single, specified fold. Not compatible with n_folds

Return type

Dict[str, ndarray]

Returns

dictionary of inputs, targets, and weights as NumPy arrays
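
The relationship between n_folds and fold_idx can be sketched as below. This is an illustrative helper, not LUMIN code: the name select_folds and the total_folds argument are hypothetical, and both raising an error when the two arguments are combined and taking the first n_folds folds are assumptions of the sketch. The documentation only states that the two arguments are not compatible.

```python
from typing import List, Optional


def select_folds(total_folds: int, n_folds: Optional[int] = None,
                 fold_idx: Optional[int] = None) -> List[int]:
    """Return the fold indices to load (hypothetical helper, not part of LUMIN)."""
    if n_folds is not None and fold_idx is not None:
        # Assumption: treat the incompatible combination as an error
        raise ValueError("n_folds and fold_idx are mutually exclusive")
    if fold_idx is not None:
        if not 0 <= fold_idx < total_folds:
            raise IndexError(f"fold_idx {fold_idx} out of range")
        return [fold_idx]  # a single, specified fold
    if n_folds is None:
        return list(range(total_folds))  # default: all folds
    # Assumption: take the first n_folds folds, capped at the number available
    return list(range(min(n_folds, total_folds)))


all_folds = select_folds(10)
first_three = select_folds(10, n_folds=3)
single = select_folds(10, fold_idx=7)
```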

get_df(pred_name='pred', targ_name='targets', wgt_name='weights', n_folds=None, fold_idx=None, inc_inputs=False, inc_ignore=False, deprocess=False, verbose=True, suppress_warn=False, nan_to_num=False, inc_matrix=False)[source]

Get a Pandas DataFrame of the data in the foldfile. Will add columns for inputs (if requested), targets, weights, and predictions (if present).

Parameters
  • pred_name (str) – name of prediction group

  • targ_name (str) – name of target group

  • wgt_name (str) – name of weight group

  • n_folds (Optional[int]) – number of folds to get data from. Default all folds. Not compatible with fold_idx

  • fold_idx (Optional[int]) – Only load group from a single, specified fold. Not compatible with n_folds

  • inc_inputs (bool) – whether to include input data

  • inc_ignore (bool) – whether to include ignored features

  • deprocess (bool) – whether to deprocess inputs and targets if pipelines have been added

  • verbose (bool) – whether to print the number of datapoints loaded

  • suppress_warn (bool) – whether to suppress the warning about missing columns

  • nan_to_num (bool) – whether to pass input data through np.nan_to_num

  • inc_matrix (bool) – whether to include flattened matrix data in output, if present

Return type

DataFrame

Returns

Pandas DataFrame with requested data

get_fold(idx)[source]

Get data for a single fold. Data consists of a dictionary of inputs, targets, and weights. Accounts for ignored features. Inputs, except for matrix data, are passed through np.nan_to_num to deal with nans and infs.

Parameters

idx (int) – fold index to load

Return type

Dict[str, ndarray]

Returns

dictionary of inputs, targets, and weights as NumPy arrays
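
Both get_data and get_fold pass inputs through np.nan_to_num. As a rough guide to what that does with its default arguments on 64-bit floats, the pure-Python sketch below mimics the behaviour: NaN becomes 0.0 and infinities become the largest finite float of matching sign. The function here is a stand-in written for illustration and is not NumPy's implementation.

```python
import math
import sys
from typing import Iterable, List


def nan_to_num(values: Iterable[float]) -> List[float]:
    """Pure-Python stand-in for np.nan_to_num defaults on float64 data."""
    out = []
    for v in values:
        if math.isnan(v):
            out.append(0.0)  # NaN is replaced by zero
        elif math.isinf(v):
            # +/-inf is replaced by the largest finite float with the same sign
            out.append(math.copysign(sys.float_info.max, v))
        else:
            out.append(float(v))
    return out


cleaned = nan_to_num([1.5, float('nan'), float('inf'), float('-inf')])
```

Since missing values end up as ordinary numbers, downstream code can no longer distinguish them from genuine zeros unless that information is stored separately.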

get_ignore()[source]

Returns list of ignored features

Return type

List[str]

Returns

Features removed from training data

get_use_cat_feats()[source]

Returns list of categorical features which will be present in training data, accounting for ignored features.

Return type

List[str]

Returns

List of categorical features

get_use_cont_feats()[source]

Returns list of continuous features which will be present in training data, accounting for ignored features.

Return type

List[str]

Returns

List of continuous features
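
Conceptually, get_use_cat_feats and get_use_cont_feats return the stored feature lists with ignored features removed. A minimal sketch of that filtering follows, with a hypothetical helper name and example feature names; it is not taken from LUMIN's source.

```python
from typing import Iterable, List


def used_feats(feats: Iterable[str], ignore_feats: Iterable[str]) -> List[str]:
    """Return `feats` in their original order, minus any ignored features.

    Hypothetical helper for illustration only; not part of LUMIN.
    """
    ignored = set(ignore_feats)
    return [f for f in feats if f not in ignored]


cont_feats = ['pT', 'eta', 'phi', 'mass']
use_cont_feats = used_feats(cont_feats, ignore_feats=['phi'])
```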

save_fold_pred(pred, fold_idx, pred_name='pred')[source]

Save predictions for given fold as a new column in the foldfile

Parameters
  • pred (ndarray) – array of predictions in the same order as data appears in the file

  • fold_idx (int) – index for fold

  • pred_name (str) – name of column to save predictions under

Return type

None

class lumin.nn.data.fold_yielder.HEPAugFoldYielder(foldfile, cont_feats=None, cat_feats=None, ignore_feats=None, targ_feats=None, rot_mult=2, random_rot=False, reflect_x=False, reflect_y=True, reflect_z=True, train_time_aug=True, test_time_aug=True, input_pipe=None, output_pipe=None, yield_matrix=True, matrix_pipe=None)[source]

Bases: lumin.nn.data.fold_yielder.FoldYielder

Specialised version of FoldYielder providing HEP-specific data augmentation at train and test time.

Parameters
  • foldfile (Union[str, Path, File]) – filename of hdf5 file or opened hdf5 file

  • cont_feats (Optional[List[str]]) – list of names of continuous features present in input data, not required if foldfile contains meta data already

  • cat_feats (Optional[List[str]]) – list of names of categorical features present in input data, not required if foldfile contains meta data already

  • ignore_feats (Optional[List[str]]) – optional list of input features which should be ignored

  • targ_feats (Optional[List[str]]) – optional list of target features to also be transformed

  • rot_mult (int) – number of rotations of event in phi to make at test-time (currently must be even). Greater than zero will also apply random rotations during train-time

  • random_rot (bool) – whether test-time rotation angles should be random or in steps of 2pi/rot_mult

  • reflect_x (bool) – whether to reflect events in x axis at train and test time

  • reflect_y (bool) – whether to reflect events in y axis at train and test time

  • reflect_z (bool) – whether to reflect events in z axis at train and test time

  • train_time_aug (bool) – whether to apply augmentations at train time

  • test_time_aug (bool) – whether to apply augmentations at test time

  • input_pipe (Optional[Pipeline]) – optional Pipeline, or filename for pickled Pipeline, which was used for processing the inputs

  • output_pipe (Optional[Pipeline]) – optional Pipeline, or filename for pickled Pipeline, which was used for processing the targets

  • yield_matrix (bool) – whether to actually yield matrix data if present

  • matrix_pipe (Union[str, Pipeline, None]) – preprocessing pipe for matrix data

Examples:
>>> fy = HEPAugFoldYielder('train.h5',
...                        cont_feats=['pT','eta','phi','mass'],
...                        rot_mult=2, reflect_y=True, reflect_z=True,
...                        input_pipe='input_pipe.pkl')

get_fold(idx)[source]

Get data for a single fold, applying random train-time data augmentation. Data consists of a dictionary of inputs, targets, and weights. Accounts for ignored features. Inputs, except for matrix data, are passed through np.nan_to_num to deal with nans and infs.

Parameters

idx (int) – fold index to load

Return type

Dict[str, ndarray]

Returns

dictionary of inputs, targets, and weights as NumPy arrays

get_test_fold(idx, aug_idx)[source]

Get test data for a single fold, applying test-time data augmentation. Data consists of a dictionary of inputs, targets, and weights. Accounts for ignored features. Inputs, except for matrix data, are passed through np.nan_to_num to deal with nans and infs.

Parameters
  • idx (int) – fold index to load

  • aug_idx (int) – index for the test-time augmentation (ignored if random test-time augmentation requested)

Return type

Dict[str, ndarray]

Returns

dictionary of inputs, targets, and weights as NumPy arrays
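
When random_rot is False, test-time rotation angles are taken in steps of 2pi/rot_mult. The sketch below illustrates that spacing and a rotation in phi applied to Cartesian transverse components. It assumes the angles start at zero and that features are stored as (px, py) pairs; the helper names are hypothetical and the mapping from aug_idx to a particular combination of rotation and reflections inside HEPAugFoldYielder is not shown here.

```python
import math
from typing import List, Tuple


def rotation_angles(rot_mult: int) -> List[float]:
    """Evenly spaced angles in steps of 2*pi/rot_mult (assumed to start at 0)."""
    return [2 * math.pi * i / rot_mult for i in range(rot_mult)]


def rotate_xy(px: float, py: float, angle: float) -> Tuple[float, float]:
    """Rotate a transverse vector (px, py) by `angle` radians about the z axis."""
    return (px * math.cos(angle) - py * math.sin(angle),
            px * math.sin(angle) + py * math.cos(angle))


angles = rotation_angles(2)                     # two test-time rotations
rotated = rotate_xy(1.0, 0.0, angles[1])        # rotate by half a turn
```

For rot_mult=2 this gives angles of 0 and pi, so the second augmentation flips the sign of both transverse components while leaving the vector's magnitude unchanged.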

Module contents
