lumin.data_processing package¶

Submodules¶

lumin.data_processing.file_proc module¶

lumin.data_processing.file_proc.add_meta_data(out_file, feats, cont_feats, cat_feats, cat_maps, targ_feats, wgt_feat=None, matrix_vecs=None, matrix_feats_per_vec=None, matrix_row_wise=None, tensor_name=None, tensor_shp=None, tensor_is_sparse=False, target_tensor_shp=None, tensor_target_is_sparse=False)[source]¶

Adds meta data to foldfile containing information about the data: feature names, matrix information, etc. FoldYielder objects will access this and automatically extract it to save the user from having to manually pass lists of features.

Parameters:

out_file (File) – h5py file to save data in
feats (List[str]) – list of all features in data
cont_feats (List[str]) – list of continuous features
cat_feats (List[str]) – list of categorical features
cat_maps (Optional[Dict[str, Dict[int, Any]]]) – Dictionary mapping categorical features to dictionary mapping codes to categories
targ_feats (Union[str, List[str]]) – (list of) target feature(s)
wgt_feat (Optional[str]) – name of weight feature
matrix_vecs (Optional[List[str]]) – list of objects for matrix encoding, i.e. feature prefixes
matrix_feats_per_vec (Optional[List[str]]) – list of features per vector for matrix encoding, i.e. feature suffixes. Features listed but not present in df will be replaced with NaN.
matrix_row_wise (Optional[bool]) – whether objects encoded as a matrix should be encoded row-wise (i.e. all the features associated with an object are in their own row), or column-wise (i.e. all the features associated with an object are in their own column)
tensor_name (Optional[str]) – Name used to refer to the tensor when displaying model information
tensor_shp (Optional[Tuple[int]]) – The shape of the tensor data (excluding batch dimension)
tensor_is_sparse (bool) – Whether the tensor is sparse (COO format) and should be densified prior to use
target_tensor_shp (Optional[Tuple[int]]) – The shape of the target tensor data (excluding batch dimension)
tensor_target_is_sparse (bool) – Whether the target tensor is sparse (COO format) and should be densified prior to use

Return type:

None

lumin.data_processing.file_proc.df2foldfile(df, n_folds, cont_feats, cat_feats, targ_feats, savename, targ_type, shuffle=True, strat_key=None, misc_feats=None, wgt_feat=None, cat_maps=None, matrix_vecs=None, matrix_feats_per_vec=None, matrix_row_wise=None, tensor_data=None, tensor_name=None, tensor_as_sparse=False, compression=None, tensor_target=None, tensor_target_as_sparse=False)[source]¶

Convert dataframe into h5py file by splitting data into sub-folds to be accessed by a FoldYielder

Parameters:

df (Optional[DataFrame]) – Dataframe from which to save data, can contain flat input features, weights, targets, but is entirely optional if targets and inputs are passed as tensors
n_folds (int) – number of folds to split df into
cont_feats (List[str]) – list of columns in df to save as continuous variables
cat_feats (List[str]) – list of columns in df to save as discrete variables
targ_feats (Union[str, List[str]]) – (list of) column(s) in df to save as target feature(s)
savename (Union[Path, str]) – name of h5py file to create (.h5py extension not required)
targ_type (str) – type of target feature, e.g. int, ‘float32’
shuffle (bool) – if true will shuffle data prior to splitting into folds, otherwise folds will be contiguous splits of the unshuffled data, useful e.g. for testing datasets
strat_key (Optional[str]) – column to use for stratified splitting
misc_feats (Optional[List[str]]) – any extra columns to save
wgt_feat (Optional[str]) – column to save as data weights
cat_maps (Optional[Dict[str, Dict[int, Any]]]) – Dictionary mapping categorical features to dictionary mapping codes to categories
matrix_vecs (Optional[List[str]]) – list of objects for matrix encoding, i.e. feature prefixes
matrix_feats_per_vec (Optional[List[str]]) – list of features per vector for matrix encoding, i.e. feature suffixes. Features listed but not present in df will be replaced with NaN.
matrix_row_wise (Optional[bool]) – whether objects encoded as a matrix should be encoded row-wise (i.e. all the features associated with an object are in their own row), or column-wise (i.e. all the features associated with an object are in their own column)
tensor_data (Optional[ndarray]) – data of higher order than a matrix can be passed directly as a numpy array, rather than being extracted and reshaped from the DataFrame. The array will be saved under matrix data, and this is incompatible with also setting matrix_vecs, matrix_feats_per_vec, and matrix_row_wise. The first dimension of the array must be compatible with the length of the data frame.
tensor_name (Optional[str]) – if tensor_data is set, then this is the name that will be written to the foldfile’s metadata.
tensor_as_sparse (bool) – Set to True to store the matrix in sparse COO format. The format is coo_x = sparse.as_coo(x); m = np.vstack((coo_x.data, coo_x.coords)), where m is the tensor passed to tensor_data.
compression (Optional[str]) – optional compression argument for h5py, e.g. ‘lzf’
tensor_target (Optional[ndarray]) – optional encoding of multi-dimensional targets as a numpy array
tensor_target_as_sparse (bool) – Set to True to store the matrix in sparse COO format. The format is coo_x = sparse.as_coo(x); m = np.vstack((coo_x.data, coo_x.coords)), where m is the tensor passed to tensor_target.

Return type:

None
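
The sparse layout described for tensor_as_sparse and tensor_target_as_sparse stacks the non-zero values on top of their coordinates. The following is a minimal pure-Python sketch of that layout for a 2D matrix; it does not use the sparse package, and the helper names to_coo_stack and from_coo_stack are hypothetical, not part of lumin.

```python
from typing import List, Sequence, Tuple


def to_coo_stack(x: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return [data, row_coords, col_coords] for the non-zero entries of a 2D matrix.

    This mirrors the row order of np.vstack((coo_x.data, coo_x.coords)):
    values first, then one row of coordinates per dimension.
    """
    data, rows, cols = [], [], []
    for i, row in enumerate(x):
        for j, val in enumerate(row):
            if val != 0:
                data.append(val)
                rows.append(i)
                cols.append(j)
    return [data, rows, cols]


def from_coo_stack(m: List[List[float]], shape: Tuple[int, int]) -> List[List[float]]:
    """Densify the stacked representation back into a 2D list of the given shape."""
    dense = [[0.0] * shape[1] for _ in range(shape[0])]
    for val, i, j in zip(*m):
        dense[int(i)][int(j)] = val
    return dense
```

Note that the dense shape is not recoverable from the stacked array alone, which is why it must be stored separately.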

lumin.data_processing.file_proc.fold2foldfile(df, out_file, fold_idx, cont_feats, cat_feats, targ_feats, targ_type, misc_feats=None, wgt_feat=None, matrix_lookup=None, matrix_missing=None, matrix_shape=None, tensor_data=None, tensor_target=None, compression=None, n_samples=None)[source]¶

Save fold of data into an h5py Group

Parameters:

df (Optional[DataFrame]) – Dataframe from which to save data, can contain flat input features, weights, targets, but is entirely optional if targets and inputs are passed as tensors
out_file (File) – h5py file to save data in
fold_idx (int) – ID for the fold; used to name the h5py group according to ‘fold_{fold_idx}’
cont_feats (List[str]) – list of columns in df to save as continuous variables
cat_feats (List[str]) – list of columns in df to save as discrete variables
targ_feats (Union[str, List[str]]) – (list of) column(s) in df to save as target feature(s)
targ_type (Any) – type of target feature, e.g. int, ‘float32’
misc_feats (Optional[List[str]]) – any extra columns to save
wgt_feat (Optional[str]) – column to save as data weights
matrix_vecs – list of objects for matrix encoding, i.e. feature prefixes
matrix_feats_per_vec – list of features per vector for matrix encoding, i.e. feature suffixes. Features listed but not present in df will be replaced with NaN.
matrix_row_wise – whether objects encoded as a matrix should be encoded row-wise (i.e. all the features associated with an object are in their own row), or column-wise (i.e. all the features associated with an object are in their own column)
tensor_data (Optional[ndarray]) – data of higher order than a matrix can be passed directly as a numpy array, rather than being extracted and reshaped from the DataFrame. The array will be saved under matrix data, and this is incompatible with also setting matrix_lookup, matrix_missing, and matrix_shape. The first dimension of the array must be compatible with the length of the data frame.
tensor_target (Optional[ndarray]) – optional encoding of multi-dimensional targets as a numpy array
compression (Optional[str]) – optional compression argument for h5py, e.g. ‘lzf’
n_samples (Optional[int]) – in case df is None, please supply the number of samples in the fold: this cannot be determined otherwise, since tensor_data and tensor_target may be sparse

Return type:

None

lumin.data_processing.file_proc.save_to_grp(arr, grp, name, compression=None)[source]¶

Save Numpy array as a dataset in an h5py Group

Parameters:

arr (ndarray) – array to be saved
grp (Group) – group in which to save arr
name (str) – name of dataset to create
compression (Optional[str]) – optional compression argument for h5py, e.g. ‘lzf’

Return type:

None

lumin.data_processing.hep_proc module¶

lumin.data_processing.hep_proc.add_abs_mom(df, vec, z=True)[source]¶

Vectorised computation of 3-momenta magnitude, adding new column in place. Currently only works for Cartesian vectors.

Parameters:

df (DataFrame) – DataFrame to alter
vec (str) – column prefix of vector components, e.g. ‘muon’ for columns [‘muon_px’, ‘muon_py’, ‘muon_pz’]
z (bool) – whether to consider the z-component of the momenta

Return type:

None
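
As a scalar illustration of the quantity computed here, the sketch below evaluates the momentum magnitude for one vector, with the z flag switching between the full and the transverse magnitude. The function name abs_mom is hypothetical; the library works on whole DataFrame columns instead.

```python
import math


def abs_mom(px: float, py: float, pz: float, z: bool = True) -> float:
    """Magnitude of a Cartesian momentum; the z-component is ignored when z is False."""
    pz2 = pz ** 2 if z else 0.0
    return math.sqrt(px ** 2 + py ** 2 + pz2)
```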

lumin.data_processing.hep_proc.add_energy(df, vec)[source]¶

Vectorised computation of energy of 4-vector, adding new column in place.

Parameters:

df (DataFrame) – DataFrame to alter
vec (str) – column prefix of vector components, e.g. ‘muon’ for columns [‘muon_px’, ‘muon_py’, ‘muon_pz’]

Return type:

None

lumin.data_processing.hep_proc.add_mass(df, vec)[source]¶

Vectorised computation of mass of 4-vector, adding new column in place.

Parameters:

df (DataFrame) – DataFrame to alter
vec (str) – column prefix of vector components, e.g. ‘muon’ for columns [‘muon_px’, ‘muon_py’, ‘muon_pz’]

Return type:

None
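
Energy and mass are linked through the relation E² = m² + |p|² (natural units). The sketch below shows both directions for a single Cartesian vector. The function names are hypothetical, and clipping a negative m² to zero is an assumption of this sketch, not documented library behaviour.

```python
import math


def energy_from_mass(px: float, py: float, pz: float, mass: float) -> float:
    """E = sqrt(m^2 + |p|^2)."""
    return math.sqrt(mass ** 2 + px ** 2 + py ** 2 + pz ** 2)


def mass_from_energy(px: float, py: float, pz: float, energy: float) -> float:
    """m = sqrt(E^2 - |p|^2); a negative m^2 from rounding is clipped to zero (assumption)."""
    m2 = energy ** 2 - (px ** 2 + py ** 2 + pz ** 2)
    return math.sqrt(max(m2, 0.0))
```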

lumin.data_processing.hep_proc.add_mt(df, vec, mpt_name='mpt')[source]¶

Vectorised computation of transverse mass of 4-vector with respect to missing transverse momenta, adding new column in place. Currently only works for pT, eta, phi vectors

Parameters:

df (DataFrame) – DataFrame to alter
vec (str) – column prefix of vector components, e.g. ‘muon’ for columns [‘muon_px’, ‘muon_py’, ‘muon_pz’]
mpt_name (str) – column prefix of vector of missing transverse momenta components, e.g. ‘mpt’ for columns [‘mpt_pT’, ‘mpt_phi’]
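
A common definition of transverse mass for a (near-)massless object and the missing transverse momentum is m_T = sqrt(2 · pT · mpT · (1 − cos Δφ)). The sketch below assumes that definition; whether the library uses exactly this form is not stated above, and the function name is hypothetical.

```python
import math


def transverse_mass(pt: float, phi: float, mpt: float, mpt_phi: float) -> float:
    """m_T = sqrt(2 * pT * mpT * (1 - cos(dphi))); cos is periodic, so no wrapping is needed."""
    return math.sqrt(2.0 * pt * mpt * (1.0 - math.cos(phi - mpt_phi)))
```

Back-to-back objects maximise m_T, while collinear objects give zero.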

lumin.data_processing.hep_proc.boost(ref_vec, boost_vec, df=None, rescale_boost=False)[source]¶

Vectorised boosting of reference vectors along boosting vectors. N.B. Implementation adapted from ROOT (https://root.cern/)

Parameters:

ref_vec – either (N,4) array of 4-momenta coordinates for starting vector, or prefix name for starting vector, i.e. columns should have names of the form [ref_vec]_px, etc.
boost_vec – either (N,4) array of 4-momenta coordinates for boosting vector, or prefix name for boosting vector, i.e. columns should have names of the form [boost_vec]_px, etc.
df (Optional[DataFrame]) – DataFrame with data
rescale_boost (bool) – whether to divide the boost vector by its energy

Return type:

ndarray

Returns:

(N,4) array of boosted vector in Cartesian coordinates
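
For orientation, a general Lorentz boost of a single 4-vector can be sketched as below, following the same arithmetic as ROOT's TLorentzVector::Boost. This is a scalar illustration, not the library's vectorised code; the function name boost_vector is hypothetical, and the boost is passed directly as a velocity (βx, βy, βz) with |β| < 1.

```python
import math
from typing import Tuple

Vec4 = Tuple[float, float, float, float]  # (px, py, pz, E)


def boost_vector(vec: Vec4, beta: Tuple[float, float, float]) -> Vec4:
    """Boost a 4-momentum by velocity beta (units of c); requires |beta| < 1."""
    px, py, pz, e = vec
    bx, by, bz = beta
    b2 = bx * bx + by * by + bz * bz
    gamma = 1.0 / math.sqrt(1.0 - b2)
    bp = bx * px + by * py + bz * pz  # beta . p
    gamma2 = (gamma - 1.0) / b2 if b2 > 0 else 0.0
    k = gamma2 * bp + gamma * e  # common factor multiplying each beta component
    return (px + k * bx, py + k * by, pz + k * bz, gamma * (e + bp))
```

A unit-mass particle at rest boosted with β = 0.6 along z ends up with pz = γβm = 0.75 and E = γm = 1.25.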

lumin.data_processing.hep_proc.boost2cm(vec, df=None)[source]¶

Vectorised computation of boosting vector required to boost a vector to its centre-of-mass frame

Parameters:

vec (Union[ndarray, str]) – either (N,4) array of 4-momenta coordinates for starting vector, or prefix name for starting vector, i.e. columns should have names of the form [vec]_px, etc.
df (Optional[DataFrame]) – DataFrame with data if supplying a string vec

Return type:

ndarray

Returns:

(N,3) array of boosting vector in Cartesian coordinates
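
The boost that takes a vector to its own centre-of-mass frame is conventionally β = −p/E. The sketch below assumes that convention for a single 4-vector; the function name is hypothetical.

```python
from typing import Tuple


def boost_to_cm(vec: Tuple[float, float, float, float]) -> Tuple[float, float, float]:
    """Return the velocity (-px/E, -py/E, -pz/E) that brings (px, py, pz, E) to rest."""
    px, py, pz, e = vec
    return (-px / e, -py / e, -pz / e)
```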

lumin.data_processing.hep_proc.calc_pair_mass(df, masses, feat_map)[source]¶

Vectorised computation of invariant mass of pair of particles with given masses, using 3-momenta. Only works for vectors defined in Cartesian coordinates.

Parameters:

df – DataFrame with vector components
masses – tuple of masses of particles (either constant or different pair of masses per pair of particles)
feat_map – dictionary mapping of requested momentum components to the features in df

Return type:

ndarray

Returns:

np.ndarray of invariant masses
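
Given two 3-momenta and the particle masses, the invariant mass follows from E_i = sqrt(m_i² + |p_i|²) and M² = (E_0 + E_1)² − |p_0 + p_1|². The sketch below applies this to one pair; the function name pair_mass is hypothetical, and clipping negative M² to zero is an assumption of the sketch.

```python
import math
from typing import Sequence


def pair_mass(p0: Sequence[float], p1: Sequence[float], m0: float, m1: float) -> float:
    """Invariant mass of two particles from Cartesian 3-momenta and their masses."""
    e0 = math.sqrt(m0 ** 2 + sum(c * c for c in p0))
    e1 = math.sqrt(m1 ** 2 + sum(c * c for c in p1))
    psum2 = sum((a + b) ** 2 for a, b in zip(p0, p1))
    return math.sqrt(max((e0 + e1) ** 2 - psum2, 0.0))
```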

lumin.data_processing.hep_proc.cos_delta(vec_0, vec_1, df=None, name=None, inplace=False)[source]¶

Vectorised computation of the cosine of the angular separation of vec_1 from vec_0. If vec_* are strings, then columns are extracted from DataFrame df. If inplace is True, the cosine is added as a new column to the DataFrame with name cosdelta_[vec_0]_[vec_1] or cosdelta, unless name is set.

Parameters:

vec_0 (Union[ndarray, str]) – either (N,3) array of 3-momenta coordinates for vector 0, or prefix name for vector zero, i.e. columns should have names of the form [vec_0]_px, etc.
vec_1 (Union[ndarray, str]) – either (N,3) array of 3-momenta coordinates for vector 1, or prefix name for vector one, i.e. columns should have names of the form [vec_1]_px, etc.
df (Optional[DataFrame]) – DataFrame with data
name (Optional[str]) – if set, will create a new column in df for cosdelta with given name, otherwise will generate a name
inplace (bool) – if True will add new column to df, otherwise will return array of cos_deltas

Return type:

Optional[ndarray]

Returns:

array of cos deltas if not inplace
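
The cosine of the opening angle between two 3-vectors is the normalised dot product. A scalar sketch, with a hypothetical function name and assuming both vectors have non-zero length:

```python
import math
from typing import Sequence


def cos_angle(a: Sequence[float], b: Sequence[float]) -> float:
    """cos(theta) = a.b / (|a| |b|) for two non-zero 3-vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```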

lumin.data_processing.hep_proc.delta_phi(arr_a, arr_b)[source]¶

Vectorised computation of modulo 2pi angular separation of array of angles b from array of angles a, in range [-pi,pi]

Parameters:

arr_a (Union[float, ndarray]) – reference angles
arr_b (Union[float, ndarray]) – final angles

Return type:

Union[float, ndarray]

Returns:

angular separation as float or np.ndarray
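
One way to wrap an angular difference into the stated range is with a modulo operation, as sketched below for scalars. The function name is hypothetical, and the treatment of the exact boundary (this sketch maps +π to −π) may differ from the library's.

```python
import math


def wrapped_delta_phi(a: float, b: float) -> float:
    """Separation of angle b from angle a, wrapped into [-pi, pi)."""
    return (b - a + math.pi) % (2.0 * math.pi) - math.pi
```

For example, going from 3.0 rad to −3.0 rad is a short step of 2π − 6 ≈ 0.283 rad across the ±π boundary, not −6 rad.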

lumin.data_processing.hep_proc.delta_r(dphi, deta)[source]¶

Vectorised computation of delta R separation for arrays of delta phi and delta eta (rapidity or pseudorapidity)

Parameters:

dphi (Union[float, ndarray]) – delta phi separations
deta (Union[float, ndarray]) – delta eta separations

Return type:

Union[float, ndarray]

Returns:

delta R separation as float or np.ndarray
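
Delta R is the Euclidean distance in the (phi, eta) plane, ΔR = sqrt(Δφ² + Δη²). A scalar sketch with a hypothetical function name:

```python
import math


def delta_r_scalar(dphi: float, deta: float) -> float:
    """Delta R = sqrt(dphi^2 + deta^2)."""
    return math.hypot(dphi, deta)
```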

lumin.data_processing.hep_proc.delta_r_boosted(vec_0, vec_1, ref_vec, df=None, name=None, inplace=False)[source]¶

Vectorised computation of the deltaR separation of vec_1 from vec_0 in the rest-frame of another vector. If vec_* are strings, then columns are extracted from DataFrame df. If inplace is True, deltaR is added as a new column to the DataFrame with name dR_[vec_0]_[vec_1]_boosted_[ref_vec] or dR_boosted, unless name is set.

Parameters:

vec_0 (Union[ndarray, str]) – either (N,4) array of 4-momenta coordinates for vector 0, in Cartesian coordinates or prefix name for vector zero, i.e. columns should have names of the form [vec_0]_px, etc.
vec_1 (Union[ndarray, str]) – either (N,4) array of 4-momenta coordinates for vector 1, in Cartesian coordinates or prefix name for vector one, i.e. columns should have names of the form [vec_1]_px, etc.
ref_vec (Union[ndarray, str]) – either (N,4) array of 4-momenta coordinates for the vector in whose rest-frame deltaR should be computed, in Cartesian coordinates or prefix name for reference vector, i.e. columns should have names of the form [ref_vec]_px, etc.
df (Optional[DataFrame]) – DataFrame with data
name (Optional[str]) – if set, will create a new column in df for deltaR with given name, otherwise will generate a name
inplace (bool) – if True will add new column to df, otherwise will return array of boosted deltaR values

Return type:

Optional[ndarray]

Returns:

array of boosted deltaR if not inplace

lumin.data_processing.hep_proc.event_to_cartesian(df, drop=False, ignore=None)[source]¶

Convert entire event to Cartesian coordinates, except vectors listed in ignore. Optionally, drop old pT,eta,phi features. Performed inplace.

Parameters:

df (DataFrame) – DataFrame to alter
drop (bool) – whether to drop old coordinates
ignore (Optional[List[str]]) – vectors to ignore when converting

Return type:

None

lumin.data_processing.hep_proc.fix_event_phi(df, ref_vec)[source]¶

Rotate event in phi such that ref_vec is at phi == 0. Performed inplace. Currently only works on vectors defined in pT, eta, phi

Parameters:

df (DataFrame) – DataFrame to alter
ref_vec (str) – column prefix of vector components to use as reference, e.g. ‘muon’ for columns [‘muon_pT’, ‘muon_eta’, ‘muon_phi’]

Return type:

None
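
The rotation amounts to subtracting the reference vector's phi from every vector's phi and wrapping the result. The sketch below shows this on a plain dictionary of prefix → phi and returns a new dictionary, whereas the library alters the DataFrame in place. The function name and data layout are hypothetical.

```python
import math
from typing import Dict


def rotate_event_phi(phis: Dict[str, float], ref_vec: str) -> Dict[str, float]:
    """Rotate all phi angles so that ref_vec sits at phi == 0, wrapping into [-pi, pi)."""
    ref_phi = phis[ref_vec]
    return {
        name: (phi - ref_phi + math.pi) % (2.0 * math.pi) - math.pi
        for name, phi in phis.items()
    }
```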

lumin.data_processing.hep_proc.fix_event_y(df, ref_vec_0, ref_vec_1)[source]¶

Flip event in y-axis such that ref_vec_1 has a higher py than ref_vec_0. Performed in place. Works for both pT, eta, phi and Cartesian coordinates.

Parameters:

df (DataFrame) – DataFrame to alter
ref_vec_0 (str) – column prefix of vector components to use as reference 0, e.g. ‘muon’ for columns [‘muon_pT’, ‘muon_eta’, ‘muon_phi’]
ref_vec_1 (str) – column prefix of vector components to use as reference 1, e.g. ‘muon’ for columns [‘muon_pT’, ‘muon_eta’, ‘muon_phi’]

Return type:

None

lumin.data_processing.hep_proc.fix_event_z(df, ref_vec)[source]¶

Flip event in z-axis such that ref_vec is in positive z-direction. Performed inplace. Works for both pT, eta, phi and Cartesian coordinates.

Parameters:

df (DataFrame) – DataFrame to alter
ref_vec (str) – column prefix of vector components to use as reference, e.g. ‘muon’ for columns [‘muon_pT’, ‘muon_eta’, ‘muon_phi’]

Return type:

None
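
In pT, eta, phi coordinates, reflecting an event in z corresponds to flipping the sign of every eta. The sketch below applies that flip only when the reference vector points in the negative z-direction. It works on a plain dictionary of prefix → eta and returns a copy; the function name and data layout are hypothetical.

```python
from typing import Dict


def flip_event_z(etas: Dict[str, float], ref_vec: str) -> Dict[str, float]:
    """Flip the sign of all eta values if ref_vec has negative eta, else return them unchanged."""
    sign = -1.0 if etas[ref_vec] < 0 else 1.0
    return {name: sign * eta for name, eta in etas.items()}
```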

lumin.data_processing.hep_proc.get_momentum(df, vec, include_E=False, as_cart=False)[source]¶

Extracts array of 3- or 4-momenta coordinates from DataFrame columns

Parameters:

df (DataFrame) – DataFrame with data
vec (str) – prefix name for vector, i.e. columns should have names of the form [vec]_px, etc.
as_cart (bool) – if True will return momenta in Cartesian coordinates

Returns:

(N, 3|4) array with columns (px, py, pz, (E)) or (pT, phi, eta, (E))

lumin.data_processing.hep_proc.get_vecs(feats, strict=True)[source]¶

Filter list of features to get list of 3-momenta defined in the list. Works for both pT, eta, phi and Cartesian coordinates. If strict, return only vectors with all coordinates present in feature list.

Parameters:

feats (List[str]) – list of features to filter
strict (bool) – whether to require all 3-momenta components to be present in the list

Return type:

Set[str]

Returns:

set of unique 3-momenta prefixes
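
The filtering idea can be sketched as follows: collect the prefix of every feature whose suffix is a momentum component, and under strict keep only prefixes with a complete coordinate triplet. The suffix spellings (px, py, pz, pT, eta, phi), the underscore separator, and the function name are assumptions of this sketch; the library's matching rules may differ.

```python
from typing import List, Set

CARTESIAN = ("px", "py", "pz")
POLAR = ("pT", "eta", "phi")


def find_vec_prefixes(feats: List[str], strict: bool = True) -> Set[str]:
    """Return prefixes of features named like '<prefix>_<component>'."""
    feat_set = set(feats)
    prefixes = set()
    for feat in feats:
        prefix, sep, suffix = feat.rpartition("_")
        if sep and suffix in CARTESIAN + POLAR:
            prefixes.add(prefix)
    if not strict:
        return prefixes

    def complete(prefix: str) -> bool:
        # A vector is complete if all three components exist in one coordinate system
        return any(
            all(f"{prefix}_{c}" in feat_set for c in system)
            for system in (CARTESIAN, POLAR)
        )

    return {p for p in prefixes if complete(p)}
```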

lumin.data_processing.hep_proc.proc_event(df, fix_phi=False, fix_y=False, fix_z=False, use_cartesian=False, ref_vec_0=None, ref_vec_1=None, keep_feats=None, default_vals=None)[source]¶

Process event: Pass data through various inplace conversions and drop unneeded columns. Data expected to consist of vectors defined in pT, eta, phi.

Parameters:

df (DataFrame) – DataFrame to alter
fix_phi (bool) – whether to rotate events using fix_event_phi()
fix_y – whether to flip events using fix_event_y()
fix_z – whether to flip events using fix_event_z()
use_cartesian – whether to convert vectors to Cartesian coordinates
ref_vec_0 (Optional[str]) – column prefix of vector components to use as reference (0) for fix_event_phi(), fix_event_y(), and fix_event_z(), e.g. ‘muon’ for columns [‘muon_pT’, ‘muon_eta’, ‘muon_phi’]
ref_vec_1 (Optional[str]) – column prefix of vector components to use as reference (1) for fix_event_y(), e.g. ‘muon’ for columns [‘muon_pT’, ‘muon_eta’, ‘muon_phi’]
keep_feats (Optional[List[str]]) – columns to keep which would otherwise be dropped
default_vals (Optional[List[str]]) – list of default values which might be used to represent missing vector components. These will be replaced with np.nan.

Return type:

None

lumin.data_processing.hep_proc.to_cartesian(df, vec, drop=False)[source]¶

Vectorised conversion of 3-momenta to Cartesian coordinates inplace, optionally dropping old pT,eta,phi features

Parameters:

df (DataFrame) – DataFrame to alter
vec (str) – column prefix of vector components to alter, e.g. ‘muon’ for columns [‘muon_pt’, ‘muon_phi’, ‘muon_eta’]
drop (bool) – Whether to remove original columns and just keep the new ones

Return type:

None
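
The standard conversion is px = pT cos φ, py = pT sin φ, pz = pT sinh η. A scalar sketch of that mapping follows; the function name is hypothetical and this is not the library's column-wise implementation.

```python
import math
from typing import Tuple


def pt_eta_phi_to_cartesian(pt: float, eta: float, phi: float) -> Tuple[float, float, float]:
    """Return (px, py, pz) for a vector given as (pT, eta, phi)."""
    return (pt * math.cos(phi), pt * math.sin(phi), pt * math.sinh(eta))
```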

lumin.data_processing.hep_proc.to_pt_eta_phi(df, vec, drop=False)[source]¶

Vectorised conversion of 3-momenta to pT,eta,phi coordinates inplace, optionally dropping old px,py,pz features

Parameters:

df (DataFrame) – DataFrame to alter
vec (str) – column prefix of vector components to alter, e.g. ‘muon’ for columns [‘muon_px’, ‘muon_py’, ‘muon_pz’]
drop (bool) – Whether to remove original columns and just keep the new ones

Return type:

None
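
The inverse mapping is pT = sqrt(px² + py²), η = asinh(pz / pT), φ = atan2(py, px). A scalar sketch, assuming pT > 0 and using a hypothetical function name:

```python
import math
from typing import Tuple


def cartesian_to_pt_eta_phi(px: float, py: float, pz: float) -> Tuple[float, float, float]:
    """Return (pT, eta, phi) for a Cartesian 3-momentum with non-zero transverse component."""
    pt = math.hypot(px, py)
    return (pt, math.asinh(pz / pt), math.atan2(py, px))
```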

lumin.data_processing.hep_proc.twist(dphi, deta)[source]¶

Vectorised computation of twist between vectors (https://arxiv.org/abs/1010.3698)

Parameters:

dphi (Union[float, ndarray]) – delta phi separations
deta (Union[float, ndarray]) – delta eta separations

Return type:

Union[float, ndarray]

Returns:

angular separation as float or np.ndarray
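
Assuming twist is defined as arctan(|Δφ / Δη|), which is an assumption of this sketch and not stated above, a scalar version can be written with atan2 so that Δη = 0 yields π/2 and avoids a division by zero. The function name is hypothetical.

```python
import math


def twist_scalar(dphi: float, deta: float) -> float:
    """atan(|dphi| / |deta|), computed via atan2 so that deta == 0 maps to pi/2."""
    return math.atan2(abs(dphi), abs(deta))
```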

lumin.data_processing.pre_proc module¶

lumin.data_processing.pre_proc.fit_input_pipe(df, cont_feats, savename=None, input_pipe=None, norm_in=True, pca=False, whiten=False, with_mean=True, with_std=True, n_components=None)[source]¶

Fit input pipeline to continuous features and optionally save.

Parameters:

df (DataFrame) – DataFrame with data to fit pipeline
cont_feats (Union[str, List[str]]) – (list of) column(s) to use as input data for fitting
savename (Optional[str]) – if set will save the fitted Pipeline with that name as a Pickle (.pkl extension added automatically)
input_pipe (Optional[Pipeline]) – if set will fit, otherwise will instantiate a new Pipeline
norm_in (bool) – whether to apply StandardScaler to inputs. Only used if input_pipe is not set.
pca (bool) – whether to apply PCA to inputs. Performed prior to StandardScaler. No dimensionality reduction is applied, purely rotation. Only used if input_pipe is not set.
whiten (bool) – whether PCA should whiten inputs. Only used if input_pipe is not set.
with_mean (bool) – whether StandardScalers should shift means to 0. Only used if input_pipe is not set.
with_std (bool) – whether StandardScalers should scale standard deviations to 1. Only used if input_pipe is not set.
n_components (Optional[int]) – if set, causes PCA to reduce the dimensionality of the input data. Only used if input_pipe is not set.

Return type:

Pipeline

Returns:

Fitted Pipeline
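
To make the effect of with_mean and with_std concrete, the sketch below standardises a single column in pure Python, using the population standard deviation as StandardScaler does. It illustrates the transformation only; it does not build an SKLearn Pipeline, and the function name is hypothetical.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def standardise(col: Sequence[float], with_mean: bool = True, with_std: bool = True) -> List[float]:
    """Shift a column to zero mean and/or scale it to unit standard deviation."""
    mu = fmean(col) if with_mean else 0.0
    # A constant column has zero spread; leave its scale untouched in that case
    sd = (pstdev(col) or 1.0) if with_std else 1.0
    return [(x - mu) / sd for x in col]
```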

lumin.data_processing.pre_proc.fit_output_pipe(df, targ_feats, savename=None, output_pipe=None, norm_out=True)[source]¶

Fit output pipeline to target features and optionally save. Have you thought about using a y_range for regression instead?

Parameters:

df (DataFrame) – DataFrame with data to fit pipeline
targ_feats (Union[str, List[str]]) – (list of) column(s) to use as input data for fitting
savename (Optional[str]) – if set will save the fitted Pipeline with that name as a Pickle (.pkl extension added automatically)
output_pipe (Optional[Pipeline]) – if set will fit, otherwise will instantiate a new Pipeline
norm_out (bool) – whether to apply StandardScaler to outputs. Only used if output_pipe is not set.

Return type:

Pipeline

Returns:

Fitted Pipeline

lumin.data_processing.pre_proc.get_pre_proc_pipes(norm_in=True, norm_out=False, pca=False, whiten=False, with_mean=True, with_std=True, n_components=None)[source]¶

Configure SKLearn Pipelines for processing inputs and targets with the requested transformations.

Parameters:

norm_in (bool) – whether to apply StandardScaler to inputs
norm_out (bool) – whether to apply StandardScaler to outputs
pca (bool) – whether to apply PCA to inputs. Performed prior to StandardScaler. No dimensionality reduction is applied, purely rotation.
whiten (bool) – whether PCA should whiten inputs.
with_mean (bool) – whether StandardScalers should shift means to 0
with_std (bool) – whether StandardScalers should scale standard deviations to 1
n_components (Optional[int]) – if set, causes PCA to reduce the dimensionality of the input data

Return type:

Tuple[Pipeline, Pipeline]

Returns:

Pipeline for input data, Pipeline for target data

lumin.data_processing.pre_proc.proc_cats(train_df, cat_feats, val_df=None, test_df=None)[source]¶

Process categorical features in train_df to be valued 0->cardinality-1. Applied inplace. Applies same transformation to validation and testing data if passed. Will complain if validation or testing sets contain categories which are not present in the training data.

Parameters:

train_df (DataFrame) – DataFrame with the training data, which will also be used to specify all the categories to consider
cat_feats (List[str]) – list of columns to use as categorical features
val_df (Optional[DataFrame]) – if set will apply the same category to code mapping to the validation data as was performed on the training data
test_df (Optional[DataFrame]) – if set will apply the same category to code mapping to the testing data as was performed on the training data

Return type:

Tuple[OrderedDict, OrderedDict]

Returns:

ordered dictionary mapping categorical features to dictionaries mapping codes to categories, ordered dictionary mapping categorical features to their cardinalities
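
The encoding described above can be sketched on plain dictionaries of column name → list of values. The function name encode_categories, the data layout, the choice of sorted order for assigning codes, and raising ValueError on unseen categories are all assumptions of this sketch, not the library's confirmed behaviour.

```python
from collections import OrderedDict
from typing import Any, Dict, List, Optional, Tuple


def encode_categories(
    train: Dict[str, List[Any]],
    cat_feats: List[str],
    other: Optional[Dict[str, List[Any]]] = None,
) -> Tuple[OrderedDict, OrderedDict]:
    """Replace categories by integer codes 0..cardinality-1, in place, using train to define them."""
    cat_maps, cat_szs = OrderedDict(), OrderedDict()
    for feat in cat_feats:
        cats = sorted(set(train[feat]))
        code_of = {cat: code for code, cat in enumerate(cats)}
        cat_maps[feat] = {code: cat for cat, code in code_of.items()}
        cat_szs[feat] = len(cats)
        if other is not None:
            unseen = set(other[feat]) - set(cats)
            if unseen:
                raise ValueError(f"{feat}: categories not in training data: {unseen}")
            other[feat] = [code_of[cat] for cat in other[feat]]
        train[feat] = [code_of[cat] for cat in train[feat]]
    return cat_maps, cat_szs
```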
