lumin.data_processing package¶

Submodules¶

lumin.data_processing.file_proc module¶

lumin.data_processing.file_proc.save_to_grp(arr, grp, name, compression=None)[source]¶

Save Numpy array as a dataset in an h5py Group

Parameters

arr (ndarray) – array to be saved
grp (Group) – group in which to save arr
name (str) – name of dataset to create
compression (Optional[str]) – optional compression argument for h5py, e.g. ‘lzf’

Return type

None

lumin.data_processing.file_proc.fold2foldfile(df, out_file, fold_idx, cont_feats, cat_feats, targ_feats, targ_type, misc_feats=None, wgt_feat=None, matrix_lookup=None, matrix_missing=None, matrix_shape=None, tensor_data=None, compression=None)[source]¶

Save fold of data into an h5py Group

Parameters

df (DataFrame) – Dataframe from which to save data
out_file (File) – h5py file to save data in
fold_idx (int) – ID for the fold; used to name the h5py group according to ‘fold_{fold_idx}’
cont_feats (List[str]) – list of columns in df to save as continuous variables
cat_feats (List[str]) – list of columns in df to save as discrete variables
targ_feats (Union[str, List[str]]) – (list of) column(s) in df to save as target feature(s)
targ_type (Any) – type of target feature, e.g. int, ‘float32’
misc_feats (Optional[List[str]]) – any extra columns to save
wgt_feat (Optional[str]) – column to save as data weights
matrix_vecs – list of objects for matrix encoding, i.e. feature prefixes
matrix_feats_per_vec – list of features per vector for matrix encoding, i.e. feature suffixes. Features listed but not present in df will be replaced with NaN.
matrix_row_wise – whether objects encoded as a matrix should be encoded row-wise (i.e. all the features associated with an object are in their own row), or column-wise (i.e. all the features associated with an object are in their own column)
tensor_data (Optional[ndarray]) – data of higher order than a matrix can be passed directly as a numpy array, rather than being extracted and reshaped from the DataFrame. The array will be saved under matrix data, and this is incompatible with also setting matrix_lookup, matrix_missing, and matrix_shape. The first dimension of the array must be compatible with the length of the data frame.
compression (Optional[str]) – optional compression argument for h5py, e.g. ‘lzf’

Return type

None

lumin.data_processing.file_proc.df2foldfile(df, n_folds, cont_feats, cat_feats, targ_feats, savename, targ_type, strat_key=None, misc_feats=None, wgt_feat=None, cat_maps=None, matrix_vecs=None, matrix_feats_per_vec=None, matrix_row_wise=None, tensor_data=None, tensor_name=None, tensor_is_sparse=False, compression=None)[source]¶

Convert dataframe into h5py file by splitting data into sub-folds to be accessed by a FoldYielder

Parameters

df (DataFrame) – Dataframe from which to save data
n_folds (int) – number of folds to split df into
cont_feats (List[str]) – list of columns in df to save as continuous variables
cat_feats (List[str]) – list of columns in df to save as discrete variables
targ_feats (Union[str, List[str]]) – (list of) column(s) in df to save as target feature(s)
savename (Union[Path, str]) – name of h5py file to create (.h5py extension not required)
targ_type (str) – type of target feature, e.g. int, ‘float32’
strat_key (Optional[str]) – column to use for stratified splitting
misc_feats (Optional[List[str]]) – any extra columns to save
wgt_feat (Optional[str]) – column to save as data weights
cat_maps (Optional[Dict[str, Dict[int, Any]]]) – Dictionary mapping categorical features to dictionary mapping codes to categories
matrix_vecs (Optional[List[str]]) – list of objects for matrix encoding, i.e. feature prefixes
matrix_feats_per_vec (Optional[List[str]]) – list of features per vector for matrix encoding, i.e. feature suffixes. Features listed but not present in df will be replaced with NaN.
matrix_row_wise (Optional[bool]) – whether objects encoded as a matrix should be encoded row-wise (i.e. all the features associated with an object are in their own row), or column-wise (i.e. all the features associated with an object are in their own column)
tensor_data (Optional[ndarray]) – data of higher order than a matrix can be passed directly as a numpy array, rather than being extracted and reshaped from the DataFrame. The array will be saved under matrix data, and this is incompatible with also setting matrix_vecs, matrix_feats_per_vec, and matrix_row_wise. The first dimension of the array must be compatible with the length of the data frame.
tensor_name (Optional[str]) – if tensor_data is set, then this is the name that will be saved to the foldfile’s metadata.
tensor_is_sparse (bool) – Set to True if the matrix is in sparse COO format and should be densified later on. The format expected is coo_x = sparse.as_coo(x); m = np.vstack((coo_x.data, coo_x.coords)), where m is the tensor passed to tensor_data.
compression (Optional[str]) – optional compression argument for h5py, e.g. ‘lzf’

Return type

None
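
The stacked layout described for tensor_is_sparse can be illustrated with plain NumPy. This is only a sketch of the array layout: the helper names to_coo_stack and densify are hypothetical, np.nonzero stands in for the sparse library's COO conversion, and how LUMIN itself densifies the data is not shown here.

```python
import numpy as np

def to_coo_stack(x: np.ndarray) -> np.ndarray:
    """Stack non-zero values on top of their coordinates.

    Result has shape (1 + x.ndim, n_nonzero), i.e. the same layout as
    np.vstack((coo_x.data, coo_x.coords)).
    """
    coords = np.array(np.nonzero(x))   # (ndim, n_nonzero) integer indices
    data = x[tuple(coords)]            # (n_nonzero,) values at those indices
    return np.vstack((data, coords))

def densify(m: np.ndarray, shape: tuple) -> np.ndarray:
    """Rebuild the dense array from the stacked representation."""
    out = np.zeros(shape, dtype=m.dtype)
    out[tuple(m[1:].astype(int))] = m[0]  # row 0 holds values, the rest hold coordinates
    return out

# Small (2, 3, 4) tensor with two non-zero entries
x = np.zeros((2, 3, 4))
x[0, 1, 2] = 1.5
x[1, 0, 3] = -2.0
m = to_coo_stack(x)
```

Note that stacking values and coordinates in one array casts the integer coordinates to the value dtype, so they need casting back before indexing.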

lumin.data_processing.file_proc.add_meta_data(out_file, feats, cont_feats, cat_feats, cat_maps, targ_feats, wgt_feat=None, matrix_vecs=None, matrix_feats_per_vec=None, matrix_row_wise=None, tensor_name=None, tensor_shp=None, tensor_is_sparse=False)[source]¶

Adds meta data to foldfile containing information about the data: feature names, matrix information, etc. FoldYielder objects will access this and automatically extract it to save the user from having to manually pass lists of features.

Parameters

out_file (File) – h5py file to save data in
feats (List[str]) – list of all features in data
cont_feats (List[str]) – list of continuous features
cat_feats (List[str]) – list of categorical features
cat_maps (Optional[Dict[str, Dict[int, Any]]]) – Dictionary mapping categorical features to dictionary mapping codes to categories
targ_feats (Union[str, List[str]]) – (list of) target feature(s)
wgt_feat (Optional[str]) – name of weight feature
matrix_vecs (Optional[List[str]]) – list of objects for matrix encoding, i.e. feature prefixes
matrix_feats_per_vec (Optional[List[str]]) – list of features per vector for matrix encoding, i.e. feature suffixes. Features listed but not present in df will be replaced with NaN.
matrix_row_wise (Optional[bool]) – whether objects encoded as a matrix should be encoded row-wise (i.e. all the features associated with an object are in their own row), or column-wise (i.e. all the features associated with an object are in their own column)
tensor_name (Optional[str]) – Name used to refer to the tensor when displaying model information
tensor_shp (Optional[Tuple[int]]) – The shape of the tensor data (excluding batch dimension)
tensor_is_sparse (bool) – Whether the tensor is sparse (COO format) and should be densified prior to use

Return type

None

lumin.data_processing.hep_proc module¶

lumin.data_processing.hep_proc.to_cartesian(df, vec, drop=False)[source]¶

Vectorised conversion of 3-momenta to Cartesian coordinates inplace, optionally dropping old pT,eta,phi features

Parameters

df (DataFrame) – DataFrame to alter
vec (str) – column prefix of vector components to alter, e.g. ‘muon’ for columns [‘muon_pT’, ‘muon_phi’, ‘muon_eta’]
drop (bool) – Whether to remove original columns and just keep the new ones

Return type

None
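
The coordinate change itself is the standard collider-physics relation between (pT, eta, phi) and (px, py, pz). A minimal sketch on plain NumPy arrays follows; the function name is hypothetical and this is not LUMIN's implementation, which works on DataFrame columns in place.

```python
import numpy as np

def pt_eta_phi_to_cartesian(pt, eta, phi):
    """Convert transverse momentum, pseudorapidity and azimuth to px, py, pz."""
    px = pt * np.cos(phi)
    py = pt * np.sin(phi)
    pz = pt * np.sinh(eta)
    return px, py, pz

px, py, pz = pt_eta_phi_to_cartesian(
    np.array([10.0, 5.0]),        # pT
    np.array([0.0, 1.0]),         # eta
    np.array([0.0, np.pi / 2]),   # phi
)
```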

lumin.data_processing.hep_proc.to_pt_eta_phi(df, vec, drop=False)[source]¶

Vectorised conversion of 3-momenta to pT,eta,phi coordinates inplace, optionally dropping old px,py,pz features

Parameters

df (DataFrame) – DataFrame to alter
vec (str) – column prefix of vector components to alter, e.g. ‘muon’ for columns [‘muon_px’, ‘muon_py’, ‘muon_pz’]
drop (bool) – Whether to remove original columns and just keep the new ones

Return type

None
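
The inverse mapping can be sketched the same way. The function name is hypothetical, and the sketch assumes pT > 0 for every entry so that eta is finite.

```python
import numpy as np

def cartesian_to_pt_eta_phi(px, py, pz):
    """Convert px, py, pz to transverse momentum, pseudorapidity and azimuth."""
    pt = np.hypot(px, py)        # sqrt(px^2 + py^2)
    phi = np.arctan2(py, px)     # azimuth in (-pi, pi]
    eta = np.arcsinh(pz / pt)    # assumes pt > 0
    return pt, eta, phi

pt, eta, phi = cartesian_to_pt_eta_phi(np.array([3.0]), np.array([4.0]), np.array([0.0]))
```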

lumin.data_processing.hep_proc.delta_phi(arr_a, arr_b)[source]¶

Vectorised computation of modulo 2pi angular separation of array of angles b from array of angles a, in range [-pi,pi]

Parameters

arr_a (Union[float, ndarray]) – reference angles
arr_b (Union[float, ndarray]) – final angles

Return type

Union[float, ndarray]

Returns

angular separation as float or np.ndarray
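
One common way to wrap an angular difference into this range uses the modulo operator. The sketch below is an illustration of the idea under that assumption, with a hypothetical function name; it returns values in the half-open interval [-pi, pi).

```python
import numpy as np

def wrapped_delta_phi(phi_a, phi_b):
    """Separation of phi_b from phi_a, wrapped into [-pi, pi)."""
    diff = np.asarray(phi_b) - np.asarray(phi_a)
    return (diff + np.pi) % (2 * np.pi) - np.pi

# Angles on either side of the +/-pi boundary are close together
small = wrapped_delta_phi(3.0, -3.0)
```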

lumin.data_processing.hep_proc.twist(dphi, deta)[source]¶

Vectorised computation of twist between vectors (https://arxiv.org/abs/1010.3698)

Parameters

dphi (Union[float, ndarray]) – delta phi separations
deta (Union[float, ndarray]) – delta eta separations

Return type

Union[float, ndarray]

Returns

twist as float or np.ndarray

lumin.data_processing.hep_proc.add_abs_mom(df, vec, z=True)[source]¶

Vectorised computation of 3-momenta magnitude, adding new column in place. Currently only works for Cartesian vectors

Parameters

df (DataFrame) – DataFrame to alter
vec (str) – column prefix of vector components, e.g. ‘muon’ for columns [‘muon_px’, ‘muon_py’, ‘muon_pz’]
z (bool) – whether to consider the z-component of the momenta

Return type

None

lumin.data_processing.hep_proc.add_mass(df, vec)[source]¶

Vectorised computation of mass of 4-vector, adding new column in place.

Parameters

df (DataFrame) – DataFrame to alter
vec (str) – column prefix of vector components, e.g. ‘muon’ for columns [‘muon_px’, ‘muon_py’, ‘muon_pz’]

Return type

None

lumin.data_processing.hep_proc.add_energy(df, vec)[source]¶

Vectorised computation of energy of 4-vector, adding new column in place.

Parameters

df (DataFrame) – DataFrame to alter
vec (str) – column prefix of vector components, e.g. ‘muon’ for columns [‘muon_px’, ‘muon_py’, ‘muon_pz’]

Return type

None
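
The quantities computed by add_abs_mom, add_mass, and add_energy are linked by the relativistic relation E² = m² + |p|² (natural units, c = 1). A minimal sketch on NumPy arrays, with hypothetical function names and not taken from LUMIN's source:

```python
import numpy as np

def abs_mom(px, py, pz=0.0):
    """Magnitude of the momentum; leave pz at 0 for the transverse part only."""
    return np.sqrt(px**2 + py**2 + pz**2)

def energy_from_mass(px, py, pz, mass):
    """E = sqrt(m^2 + |p|^2)."""
    return np.sqrt(mass**2 + abs_mom(px, py, pz)**2)

def mass_from_energy(px, py, pz, energy):
    """m = sqrt(E^2 - |p|^2), clipping small negative values from rounding."""
    m2 = energy**2 - abs_mom(px, py, pz)**2
    return np.sqrt(np.clip(m2, 0.0, None))

px, py, pz = np.array([3.0]), np.array([4.0]), np.array([12.0])
p = abs_mom(px, py, pz)
e = energy_from_mass(px, py, pz, 5.0)
```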

lumin.data_processing.hep_proc.add_mt(df, vec, mpt_name='mpt')[source]¶

Vectorised computation of transverse mass of 4-vector with respect to missing transverse momenta, adding new column in place. Currently only works for pT, eta, phi vectors

Parameters

df (DataFrame) – DataFrame to alter
vec (str) – column prefix of vector components, e.g. ‘muon’ for columns [‘muon_pT’, ‘muon_eta’, ‘muon_phi’]
mpt_name (str) – column prefix of vector of missing transverse momenta components, e.g. ‘mpt’ for columns [‘mpt_pT’, ‘mpt_phi’]
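
A commonly used definition of transverse mass, in the approximation where both objects are treated as massless, is mT = sqrt(2 · pT · pT_miss · (1 − cos Δφ)). The sketch below assumes that definition; whether add_mt uses exactly this form is not confirmed by the text above, and the function name is hypothetical.

```python
import numpy as np

def transverse_mass(pt, phi, mpt, mpt_phi):
    """Transverse mass of an object and the missing transverse momentum (massless approximation)."""
    dphi = phi - mpt_phi  # cos() is periodic, so no wrapping is needed
    return np.sqrt(2.0 * pt * mpt * (1.0 - np.cos(dphi)))

# Back-to-back and collinear configurations
mt = transverse_mass(np.array([40.0, 40.0]), np.array([0.0, 0.0]),
                     np.array([40.0, 40.0]), np.array([np.pi, 0.0]))
```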

lumin.data_processing.hep_proc.get_vecs(feats, strict=True)[source]¶

Filter list of features to get list of 3-momenta defined in the list. Works for both pT, eta, phi and Cartesian coordinates. If strict, return only vectors with all coordinates present in feature list.

Parameters

feats (List[str]) – list of features to filter
strict (bool) – whether to require all 3-momenta components to be present in the list

Return type

Set[str]

Returns

set of unique 3-momenta prefixes
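
The filtering described above can be sketched in pure Python. The helper name find_vector_prefixes and the suffix convention (px, py, pz or pT, eta, phi, joined to the prefix with an underscore) are assumptions made for illustration, based on the column names used elsewhere on this page.

```python
from typing import List, Set

# Assumed coordinate suffixes for the two supported systems
COORD_SETS = (("px", "py", "pz"), ("pT", "eta", "phi"))

def find_vector_prefixes(feats: List[str], strict: bool = True) -> Set[str]:
    """Return prefixes of features that look like 3-momentum components."""
    names = set(feats)
    found = set()
    for feat in feats:
        prefix, sep, suffix = feat.rpartition("_")
        if not sep:
            continue  # no underscore, so not of the form prefix_suffix
        for coords in COORD_SETS:
            if suffix not in coords:
                continue
            # In strict mode every component of the coordinate set must be present
            if not strict or all(f"{prefix}_{c}" in names for c in coords):
                found.add(prefix)
    return found

feats = ["muon_pT", "muon_eta", "muon_phi", "jet_px", "jet_py", "n_jets"]
```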

lumin.data_processing.hep_proc.fix_event_phi(df, ref_vec)[source]¶

Rotate event in phi such that ref_vec is at phi == 0. Performed inplace. Currently only works on vectors defined in pT, eta, phi

Parameters

df (DataFrame) – DataFrame to alter
ref_vec (str) – column prefix of vector components to use as reference, e.g. ‘muon’ for columns [‘muon_pT’, ‘muon_eta’, ‘muon_phi’]

Return type

None
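
Rotating an event so that a reference vector sits at phi == 0 amounts to subtracting the reference azimuth from every vector and wrapping the result. A sketch on a NumPy array of shape (n_events, n_vectors), using a hypothetical function name and a column index rather than a column prefix:

```python
import numpy as np

def rotate_event_phi(phis, ref_idx):
    """Subtract the reference vector's phi from all vectors, wrapped into [-pi, pi)."""
    ref = phis[:, [ref_idx]]  # keep as a column so it broadcasts per event
    return (phis - ref + np.pi) % (2 * np.pi) - np.pi

phis = np.array([[3.0, -3.0, 0.5]])
rotated = rotate_event_phi(phis, ref_idx=0)
```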

lumin.data_processing.hep_proc.fix_event_z(df, ref_vec)[source]¶

Flip event in z-axis such that ref_vec is in positive z-direction. Performed inplace. Works for both pT, eta, phi and Cartesian coordinates.

Parameters

df (DataFrame) – DataFrame to alter
ref_vec (str) – column prefix of vector components to use as reference, e.g. ‘muon’ for columns [‘muon_pT’, ‘muon_eta’, ‘muon_phi’]

Return type

None

lumin.data_processing.hep_proc.fix_event_y(df, ref_vec_0, ref_vec_1)[source]¶

Flip event in y-axis such that ref_vec_1 has a higher py than ref_vec_0. Performed in place. Works for both pT, eta, phi and Cartesian coordinates.

Parameters

df (DataFrame) – DataFrame to alter
ref_vec_0 (str) – column prefix of vector components to use as reference 0, e.g. ‘muon’ for columns [‘muon_pT’, ‘muon_eta’, ‘muon_phi’]
ref_vec_1 (str) – column prefix of vector components to use as reference 1, e.g. ‘muon’ for columns [‘muon_pT’, ‘muon_eta’, ‘muon_phi’]

Return type

None
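
The two flips can be sketched in Cartesian coordinates as per-event sign changes; in pT, eta, phi the same operations correspond to eta → −eta and phi → −phi. The function names and the use of column indices are assumptions for illustration only.

```python
import numpy as np

def flip_z(pz, ref_idx):
    """Negate pz for every vector in events where the reference has pz < 0."""
    sign = np.where(pz[:, [ref_idx]] < 0, -1.0, 1.0)
    return pz * sign

def flip_y(py, ref_idx_0, ref_idx_1):
    """Negate py in events where reference 1 has a lower py than reference 0."""
    sign = np.where(py[:, [ref_idx_1]] < py[:, [ref_idx_0]], -1.0, 1.0)
    return py * sign

pz = np.array([[-1.0, 2.0], [3.0, -4.0]])
py = np.array([[1.0, -2.0], [0.0, 5.0]])
pz_fixed = flip_z(pz, ref_idx=0)
py_fixed = flip_y(py, ref_idx_0=0, ref_idx_1=1)
```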

lumin.data_processing.hep_proc.event_to_cartesian(df, drop=False, ignore=None)[source]¶

Convert entire event to Cartesian coordinates, except vectors listed in ignore. Optionally, drop old pT,eta,phi features. Performed inplace.

Parameters

df (DataFrame) – DataFrame to alter
drop (bool) – whether to drop old coordinates
ignore (Optional[List[str]]) – vectors to ignore when converting

Return type

None

lumin.data_processing.hep_proc.proc_event(df, fix_phi=False, fix_y=False, fix_z=False, use_cartesian=False, ref_vec_0=None, ref_vec_1=None, keep_feats=None, default_vals=None)[source]¶

Process event: Pass data through various inplace conversions and drop unneeded columns. Data expected to consist of vectors defined in pT, eta, phi.

Parameters

df (DataFrame) – DataFrame to alter
fix_phi (bool) – whether to rotate events using fix_event_phi()
fix_y – whether to flip events using fix_event_y()
fix_z – whether to flip events using fix_event_z()
use_cartesian – whether to convert vectors to Cartesian coordinates
ref_vec_0 (Optional[str]) – column prefix of vector components to use as reference (0) for fix_event_phi(), fix_event_y(), and fix_event_z(), e.g. ‘muon’ for columns [‘muon_pT’, ‘muon_eta’, ‘muon_phi’]
ref_vec_1 (Optional[str]) – column prefix of vector components to use as reference (1) for fix_event_y(), e.g. ‘muon’ for columns [‘muon_pT’, ‘muon_eta’, ‘muon_phi’]
keep_feats (Optional[List[str]]) – columns to keep which would otherwise be dropped
default_vals (Optional[List[str]]) – list of default values which might be used to represent missing vector components. These will be replaced with np.nan.

Return type

None

lumin.data_processing.hep_proc.calc_pair_mass(df, masses, feat_map)[source]¶

Vectorised computation of invariant mass of pair of particles with given masses, using 3-momenta. Only works for vectors defined in Cartesian coordinates.

Parameters

df (DataFrame) – DataFrame vector components
masses (Union[Tuple[float, float], Tuple[ndarray, ndarray]]) – tuple of masses of particles (either constant or different pair of masses per pair of particles)
feat_map (Dict[str, str]) – dictionary mapping of requested momentum components to the features in df

Return type

ndarray

Returns

np.ndarray of invariant masses
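
With 3-momenta and particle masses known, the pair's invariant mass follows from M² = (E₀ + E₁)² − |p₀ + p₁|², where Eᵢ = sqrt(mᵢ² + |pᵢ|²). A sketch on (N,3) NumPy arrays, with a hypothetical function name and without the feat_map lookup that calc_pair_mass performs:

```python
import numpy as np

def pair_mass(p0, p1, m0, m1):
    """Invariant mass of particle pairs from (N,3) Cartesian 3-momenta and masses."""
    e0 = np.sqrt(m0**2 + np.sum(p0**2, axis=1))
    e1 = np.sqrt(m1**2 + np.sum(p1**2, axis=1))
    p_sum = p0 + p1
    m2 = (e0 + e1)**2 - np.sum(p_sum**2, axis=1)
    return np.sqrt(np.clip(m2, 0.0, None))  # clip guards against rounding below zero

# Two massless particles back to back, 45 units of momentum each
p0 = np.array([[45.0, 0.0, 0.0]])
p1 = np.array([[-45.0, 0.0, 0.0]])
m_pair = pair_mass(p0, p1, 0.0, 0.0)
```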

lumin.data_processing.hep_proc.boost(ref_vec, boost_vec, df=None, rescale_boost=False)[source]¶

Vectorised boosting of reference vectors along boosting vectors. N.B. Implementation adapted from ROOT (https://root.cern/)

Parameters

ref_vec – either (N,4) array of 4-momenta coordinates for starting vector, or prefix name for starting vector, i.e. columns should have names of the form [ref_vec]_px, etc.
boost_vec – either (N,4) array of 4-momenta coordinates for boosting vector, or prefix name for boosting vector, i.e. columns should have names of the form [boost_vec]_px, etc.
df (Optional[DataFrame]) – DataFrame with data
rescale_boost (bool) – whether to divide the boost vector by its energy

Return type

ndarray

Returns

(N,4) array of boosted vector in Cartesian coordinates

lumin.data_processing.hep_proc.boost2cm(vec, df=None)[source]¶

Vectorised computation of boosting vector required to boost a vector to its centre-of-mass frame

Parameters

vec (Union[ndarray, str]) – either (N,4) array of 4-momenta coordinates for starting vector, or prefix name for starting vector, i.e. columns should have names of the form [vec]_px, etc.
df (Optional[DataFrame]) – DataFrame with data, if supplying a string vec

Return type

ndarray

Returns

(N,3) array of boosting vector in Cartesian coordinates
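
The boost to a vector's centre-of-mass frame and the boost itself can be sketched together. The sketch assumes the column order (px, py, pz, E) and follows the standard Lorentz-boost formula of the kind used in ROOT's TLorentzVector; the function names are hypothetical and this is not LUMIN's code.

```python
import numpy as np

def boost_vector_to_cm(vec):
    """(N,4) array of (px, py, pz, E) -> (N,3) velocity -p/E that brings each vector to rest."""
    return -vec[:, :3] / vec[:, 3:4]

def lorentz_boost(vec, beta):
    """Boost (N,4) 4-momenta (px, py, pz, E) by (N,3) velocities beta (units of c)."""
    p, e = vec[:, :3], vec[:, 3]
    b2 = np.sum(beta**2, axis=1)
    gamma = 1.0 / np.sqrt(1.0 - b2)
    bp = np.sum(beta * p, axis=1)  # beta . p
    # (gamma - 1) / beta^2, defined as 0 where beta is zero
    gamma2 = np.where(b2 > 0, (gamma - 1.0) / np.where(b2 > 0, b2, 1.0), 0.0)
    p_new = p + (gamma2 * bp + gamma * e)[:, None] * beta
    e_new = gamma * (e + bp)
    return np.column_stack((p_new, e_new))

# A particle of mass 12 with momentum (3, 0, 4) and energy 13
vec = np.array([[3.0, 0.0, 4.0, 13.0]])
at_rest = lorentz_boost(vec, boost_vector_to_cm(vec))
```

Boosting a vector by its own centre-of-mass boost leaves zero 3-momentum and an energy equal to its mass, which gives a quick consistency check.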

lumin.data_processing.hep_proc.get_momentum(df, vec, include_E=False, as_cart=False)[source]¶

Extracts array of 3- or 4-momenta coordinates from DataFrame columns

Parameters

df (DataFrame) – DataFrame with data
vec (str) – prefix name for vector, i.e. columns should have names of the form [vec]_px, etc.
include_E (bool) – if True will include the energy, giving 4-momenta rather than 3-momenta
as_cart (bool) – if True will return momenta in Cartesian coordinates

Returns

(N, 3|4) array with columns (px, py, pz, (E)) or (pT, phi, eta, (E))

lumin.data_processing.hep_proc.cos_delta(vec_0, vec_1, df=None, name=None, inplace=False)[source]¶

Vectorised computation of the cosine of the angular separation of vec_1 from vec_0. If vec_* are strings, then columns are extracted from DataFrame df. If inplace is True, the cosine is added as a new column to the DataFrame with name cosdelta_[vec_0]_[vec_1] or cosdelta, unless name is set.

Parameters

vec_0 (Union[ndarray, str]) – either (N,3) array of 3-momenta coordinates for vector 0, or prefix name for vector zero, i.e. columns should have names of the form [vec_0]_px, etc.
vec_1 (Union[ndarray, str]) – either (N,3) array of 3-momenta coordinates for vector 1, or prefix name for vector one, i.e. columns should have names of the form [vec_1]_px, etc.
df (Optional[DataFrame]) – DataFrame with data
name (Optional[str]) – if set, will create a new column in df for cosdelta with given name, otherwise will generate a name
inplace (bool) – if True will add new column to df, otherwise will return array of cos_deltas

Return type

Union[None, ndarray]

Returns

array of cos deltas if not inplace
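
For (N,3) arrays of Cartesian 3-momenta, the cosine of the opening angle is the normalised dot product. A minimal sketch with a hypothetical function name, assuming no vector has zero length:

```python
import numpy as np

def cos_angle(a, b):
    """Cosine of the angle between corresponding rows of two (N,3) arrays."""
    dot = np.sum(a * b, axis=1)
    return dot / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

a = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
b = np.array([[0.0, 1.0, 0.0], [-2.0, 0.0, 0.0], [2.0, 2.0, 0.0]])
cosines = cos_angle(a, b)  # perpendicular, anti-parallel, parallel
```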

lumin.data_processing.hep_proc.delta_r(dphi, deta)[source]¶

Vectorised computation of delta R separation for arrays of delta phi and delta eta (rapidity or pseudorapidity)

Parameters

dphi (Union[float, ndarray]) – delta phi separations
deta (Union[float, ndarray]) – delta eta separations

Return type

Union[float, ndarray]

Returns

delta R separation as float or np.ndarray

lumin.data_processing.hep_proc.delta_r_boosted(vec_0, vec_1, ref_vec, df=None, name=None, inplace=False)[source]¶

Vectorised computation of the deltaR separation of vec_1 from vec_0 in the rest-frame of another vector. If vec_* are strings, then columns are extracted from DataFrame df. If inplace is True, deltaR is added as a new column to the DataFrame with name dR_[vec_0]_[vec_1]_boosted_[ref_vec] or dR_boosted, unless name is set.

Parameters

vec_0 (Union[ndarray, str]) – either (N,4) array of 4-momenta coordinates for vector 0, in Cartesian coordinates or prefix name for vector zero, i.e. columns should have names of the form [vec_0]_px, etc.
vec_1 (Union[ndarray, str]) – either (N,4) array of 4-momenta coordinates for vector 1, in Cartesian coordinates or prefix name for vector one, i.e. columns should have names of the form [vec_1]_px, etc.
ref_vec (Union[ndarray, str]) – either (N,4) array of 4-momenta coordinates for the vector in whose rest-frame deltaR should be computed, in Cartesian coordinates or prefix name for reference vector, i.e. columns should have names of the form [ref_vec]_px, etc.
df (Optional[DataFrame]) – DataFrame with data
name (Optional[str]) – if set, will create a new column in df for the boosted deltaR with given name, otherwise will generate a name
inplace (bool) – if True will add new column to df, otherwise will return array of boosted deltaR

Return type

Union[None, ndarray]

Returns

array of boosted deltaR if not inplace

lumin.data_processing.pre_proc module¶

lumin.data_processing.pre_proc.get_pre_proc_pipes(norm_in=True, norm_out=False, pca=False, whiten=False, with_mean=True, with_std=True, n_components=None)[source]¶

Configure SKLearn Pipelines for processing inputs and targets with the requested transformations.

Parameters

norm_in (bool) – whether to apply StandardScaler to inputs
norm_out (bool) – whether to apply StandardScaler to outputs
pca (bool) – whether to apply PCA to inputs. Performed prior to StandardScaler. No dimensionality reduction is applied, purely rotation.
whiten (bool) – whether PCA should whiten inputs.
with_mean (bool) – whether StandardScalers should shift means to 0
with_std (bool) – whether StandardScalers should scale standard deviations to 1
n_components (Optional[int]) – if set, causes PCA to reduce the dimensionality of the input data

Return type

Tuple[Pipeline, Pipeline]

Returns

Pipeline for input data, Pipeline for target data
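
The effect of the with_mean and with_std options can be shown with a NumPy stand-in for the standardisation step. This is not scikit-learn's StandardScaler and not LUMIN's pipeline, only a sketch of the arithmetic, with hypothetical function names: the shift and scale are fitted on one dataset and can then be applied to others.

```python
import numpy as np

def fit_scaler(x, with_mean=True, with_std=True):
    """Return per-column (shift, scale) computed from a (n_samples, n_features) array."""
    shift = x.mean(axis=0) if with_mean else np.zeros(x.shape[1])
    scale = x.std(axis=0) if with_std else np.ones(x.shape[1])
    scale = np.where(scale == 0, 1.0, scale)  # leave constant columns unscaled
    return shift, scale

def apply_scaler(x, shift, scale):
    """Apply a previously fitted shift and scale."""
    return (x - shift) / scale

train = np.array([[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]])
shift, scale = fit_scaler(train)
train_scaled = apply_scaler(train, shift, scale)
```

Fitting on the training data only, then reusing the same shift and scale for validation and testing data, keeps all datasets on a common footing.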

lumin.data_processing.pre_proc.fit_input_pipe(df, cont_feats, savename=None, input_pipe=None, norm_in=True, pca=False, whiten=False, with_mean=True, with_std=True, n_components=None)[source]¶

Fit input pipeline to continuous features and optionally save.

Parameters

df (DataFrame) – DataFrame with data to fit pipeline
cont_feats (Union[str, List[str]]) – (list of) column(s) to use as input data for fitting
savename (Optional[str]) – if set will save the fitted Pipeline with that name as a Pickle (.pkl extension added automatically)
input_pipe (Optional[Pipeline]) – if set will fit, otherwise will instantiate a new Pipeline
norm_in (bool) – whether to apply StandardScaler to inputs. Only used if input_pipe is not set.
pca (bool) – whether to apply PCA to inputs. Performed prior to StandardScaler. No dimensionality reduction is applied, purely rotation. Only used if input_pipe is not set.
whiten (bool) – whether PCA should whiten inputs. Only used if input_pipe is not set.
with_mean (bool) – whether StandardScalers should shift means to 0. Only used if input_pipe is not set.
with_std (bool) – whether StandardScalers should scale standard deviations to 1. Only used if input_pipe is not set.
n_components (Optional[int]) – if set, causes PCA to reduce the dimensionality of the input data. Only used if input_pipe is not set.

Return type

Pipeline

Returns

Fitted Pipeline

lumin.data_processing.pre_proc.fit_output_pipe(df, targ_feats, savename=None, output_pipe=None, norm_out=True)[source]¶

Fit output pipeline to target features and optionally save. Have you thought about using a y_range for regression instead?

Parameters

df (DataFrame) – DataFrame with data to fit pipeline
targ_feats (Union[str, List[str]]) – (list of) column(s) to use as input data for fitting
savename (Optional[str]) – if set will save the fitted Pipeline with that name as a Pickle (.pkl extension added automatically)
output_pipe (Optional[Pipeline]) – if set will fit, otherwise will instantiate a new Pipeline
norm_out (bool) – whether to apply StandardScaler to outputs. Only used if output_pipe is not set.

Return type

Pipeline

Returns

Fitted Pipeline

lumin.data_processing.pre_proc.proc_cats(train_df, cat_feats, val_df=None, test_df=None)[source]¶

Process categorical features in train_df to be valued 0->cardinality-1. Applied inplace. Applies the same transformation to validation and testing data if passed. Will complain if validation or testing sets contain categories which are not present in the training data.

Parameters

train_df (DataFrame) – DataFrame with the training data, which will also be used to specify all the categories to consider
cat_feats (List[str]) – list of columns to use as categorical features
val_df (Optional[DataFrame]) – if set will apply the same category to code mapping to the validation data as was performed on the training data
test_df (Optional[DataFrame]) – if set will apply the same category to code mapping to the testing data as was performed on the training data

Return type

Tuple[OrderedDict, OrderedDict]

Returns

ordered dictionary mapping categorical features to dictionaries mapping categories to codes, ordered dictionary mapping categorical features to their cardinalities
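
The encoding idea for a single categorical feature can be sketched in pure Python. The function name, the sorted ordering of categories, and the use of ValueError for unseen categories are all assumptions for illustration; proc_cats itself operates on DataFrame columns in place.

```python
from collections import OrderedDict

def encode_categories(train_values, other_values=None):
    """Map categories seen in training data to codes 0..cardinality-1."""
    cats = sorted(set(train_values))  # assumed ordering: sorted categories
    cat_to_code = OrderedDict((c, i) for i, c in enumerate(cats))
    if other_values is not None:
        unseen = set(other_values) - set(cats)
        if unseen:
            raise ValueError(f"Categories not present in training data: {sorted(unseen)}")
    train_codes = [cat_to_code[v] for v in train_values]
    other_codes = None if other_values is None else [cat_to_code[v] for v in other_values]
    return cat_to_code, len(cats), train_codes, other_codes

cat_map, cardinality, train_codes, val_codes = encode_categories(
    ["mu", "e", "tau", "e"], other_values=["tau", "mu"]
)
```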
