lumin.data_processing package
Submodules
lumin.data_processing.file_proc module
-
lumin.data_processing.file_proc.save_to_grp(arr, grp, name, compression=None)
Save a NumPy array as a dataset in an h5py Group
- Parameters
arr (ndarray) – array to be saved
grp (Group) – group in which to save arr
name (str) – name of dataset to create
compression (Optional[str]) – optional compression argument for h5py, e.g. ‘lzf’
- Return type
None
-
lumin.data_processing.file_proc.fold2foldfile(df, out_file, fold_idx, cont_feats, cat_feats, targ_feats, targ_type, misc_feats=None, wgt_feat=None, matrix_lookup=None, matrix_missing=None, matrix_shape=None, tensor_data=None, compression=None)
Save a fold of data into an h5py Group
- Parameters
df (DataFrame) – DataFrame from which to save data
out_file (File) – h5py file to save data in
fold_idx (int) – ID for the fold; used to name the h5py group according to ‘fold_{fold_idx}’
cont_feats (List[str]) – list of columns in df to save as continuous variables
cat_feats (List[str]) – list of columns in df to save as discrete variables
targ_feats (Union[str, List[str]]) – (list of) column(s) in df to save as target feature(s)
targ_type (Any) – type of target feature, e.g. int, ‘float32’
misc_feats (Optional[List[str]]) – any extra columns to save
wgt_feat (Optional[str]) – column to save as data weights
matrix_vecs – list of objects for matrix encoding, i.e. feature prefixes
matrix_feats_per_vec – list of features per vector for matrix encoding, i.e. feature suffixes. Features listed but not present in df will be replaced with NaN.
matrix_row_wise – whether objects encoded as a matrix should be encoded row-wise (i.e. all the features associated with an object are in their own row), or column-wise (i.e. all the features associated with an object are in their own column)
tensor_data (Optional[ndarray]) – data of higher order than a matrix can be passed directly as a NumPy array, rather than being extracted and reshaped from the DataFrame. The array will be saved under matrix data, and this is incompatible with also setting matrix_lookup, matrix_missing, and matrix_shape. The first dimension of the array must be compatible with the length of the DataFrame.
compression (Optional[str]) – optional compression argument for h5py, e.g. ‘lzf’
- Return type
None
-
lumin.data_processing.file_proc.df2foldfile(df, n_folds, cont_feats, cat_feats, targ_feats, savename, targ_type, strat_key=None, misc_feats=None, wgt_feat=None, cat_maps=None, matrix_vecs=None, matrix_feats_per_vec=None, matrix_row_wise=None, tensor_data=None, tensor_name=None, tensor_is_sparse=False, compression=None)
Convert a DataFrame into an h5py file by splitting the data into sub-folds to be accessed by a FoldYielder
- Parameters
df (DataFrame) – DataFrame from which to save data
n_folds (int) – number of folds to split df into
cont_feats (List[str]) – list of columns in df to save as continuous variables
cat_feats (List[str]) – list of columns in df to save as discrete variables
targ_feats (Union[str, List[str]]) – (list of) column(s) in df to save as target feature(s)
savename (Union[Path, str]) – name of h5py file to create (.h5py extension not required)
targ_type (str) – type of target feature, e.g. int, ‘float32’
strat_key (Optional[str]) – column to use for stratified splitting
misc_feats (Optional[List[str]]) – any extra columns to save
wgt_feat (Optional[str]) – column to save as data weights
cat_maps (Optional[Dict[str, Dict[int, Any]]]) – dictionary mapping categorical features to dictionaries mapping codes to categories
matrix_vecs (Optional[List[str]]) – list of objects for matrix encoding, i.e. feature prefixes
matrix_feats_per_vec (Optional[List[str]]) – list of features per vector for matrix encoding, i.e. feature suffixes. Features listed but not present in df will be replaced with NaN.
matrix_row_wise (Optional[bool]) – whether objects encoded as a matrix should be encoded row-wise (i.e. all the features associated with an object are in their own row), or column-wise (i.e. all the features associated with an object are in their own column)
tensor_data (Optional[ndarray]) – data of higher order than a matrix can be passed directly as a NumPy array, rather than being extracted and reshaped from the DataFrame. The array will be saved under matrix data, and this is incompatible with also setting matrix_vecs, matrix_feats_per_vec, and matrix_row_wise. The first dimension of the array must be compatible with the length of the DataFrame.
tensor_name (Optional[str]) – if tensor_data is set, then this is the name that will be written to the foldfile’s metadata.
tensor_is_sparse (bool) – set to True if the matrix is in sparse COO format and should be densified later on. The format expected is coo_x = sparse.as_coo(x); m = np.vstack((coo_x.data, coo_x.coords)), where m is the tensor passed to tensor_data.
compression (Optional[str]) – optional compression argument for h5py, e.g. ‘lzf’
- Return type
None
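The sparse layout described for tensor_is_sparse can be reproduced with NumPy alone. The sketch below is an illustration, not LUMIN's own code: it stacks the non-zero values on top of their coordinates, which is assumed to give the same layout as np.vstack((coo_x.data, coo_x.coords)) from the sparse package, and then densifies the result again. The helper names to_coo_matrix and densify are hypothetical.

```python
import numpy as np


def to_coo_matrix(x: np.ndarray) -> np.ndarray:
    """Stack non-zero values (row 0) on top of their coordinates (one row per axis)."""
    coords = np.array(np.nonzero(x))  # shape (x.ndim, n_nonzero)
    data = x[tuple(coords)]           # shape (n_nonzero,)
    return np.vstack((data, coords))  # shape (1 + x.ndim, n_nonzero)


def densify(m: np.ndarray, shape: tuple) -> np.ndarray:
    """Rebuild the dense array from the stacked (data, coords) representation."""
    dense = np.zeros(shape, dtype=m.dtype)
    coords = tuple(m[1:].astype(int))  # one index array per axis
    dense[coords] = m[0]
    return dense


x = np.array([[0.0, 1.5, 0.0],
              [2.0, 0.0, 0.0]])
m = to_coo_matrix(x)          # hypothetical stand-in for the array passed as tensor_data
x_back = densify(m, x.shape)  # round trip back to the dense array
```

Note that stacking values and integer coordinates into one array casts the coordinates to the value dtype, so they need converting back to integers before indexing.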
-
lumin.data_processing.file_proc.add_meta_data(out_file, feats, cont_feats, cat_feats, cat_maps, targ_feats, wgt_feat=None, matrix_vecs=None, matrix_feats_per_vec=None, matrix_row_wise=None, tensor_name=None, tensor_shp=None, tensor_is_sparse=False)
Adds meta data to a foldfile containing information about the data: feature names, matrix information, etc. FoldYielder objects will access this and automatically extract it to save the user from having to manually pass lists of features.
- Parameters
out_file (File) – h5py file to save data in
feats (List[str]) – list of all features in data
cont_feats (List[str]) – list of continuous features
cat_feats (List[str]) – list of categorical features
cat_maps (Optional[Dict[str, Dict[int, Any]]]) – dictionary mapping categorical features to dictionaries mapping codes to categories
targ_feats (Union[str, List[str]]) – (list of) target feature(s)
wgt_feat (Optional[str]) – name of weight feature
matrix_vecs (Optional[List[str]]) – list of objects for matrix encoding, i.e. feature prefixes
matrix_feats_per_vec (Optional[List[str]]) – list of features per vector for matrix encoding, i.e. feature suffixes. Features listed but not present in df will be replaced with NaN.
matrix_row_wise (Optional[bool]) – whether objects encoded as a matrix should be encoded row-wise (i.e. all the features associated with an object are in their own row), or column-wise (i.e. all the features associated with an object are in their own column)
tensor_name (Optional[str]) – name used to refer to the tensor when displaying model information
tensor_shp (Optional[Tuple[int]]) – the shape of the tensor data (excluding batch dimension)
tensor_is_sparse (bool) – whether the tensor is sparse (COO format) and should be densified prior to use
- Return type
None
lumin.data_processing.hep_proc module
-
lumin.data_processing.hep_proc.to_cartesian(df, vec, drop=False)
Vectorised conversion of 3-momenta to Cartesian coordinates in place, optionally dropping old pT, eta, phi features
- Parameters
df (DataFrame) – DataFrame to alter
vec (str) – column prefix of vector components to alter, e.g. ‘muon’ for columns [‘muon_pt’, ‘muon_phi’, ‘muon_eta’]
drop (bool) – whether to remove original columns and just keep the new ones
- Return type
None
-
lumin.data_processing.hep_proc.to_pt_eta_phi(df, vec, drop=False)
Vectorised conversion of 3-momenta to pT, eta, phi coordinates in place, optionally dropping old px, py, pz features
- Parameters
df (DataFrame) – DataFrame to alter
vec (str) – column prefix of vector components to alter, e.g. ‘muon’ for columns [‘muon_px’, ‘muon_py’, ‘muon_pz’]
drop (bool) – whether to remove original columns and just keep the new ones
- Return type
None
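The two coordinate systems are related by the standard collider-physics identities px = pT cos(phi), py = pT sin(phi), pz = pT sinh(eta). The sketch below applies them to plain NumPy arrays rather than DataFrame columns; the function names are hypothetical and this is not LUMIN's implementation.

```python
import numpy as np


def pt_eta_phi_to_xyz(pt, eta, phi):
    """(pT, eta, phi) -> (px, py, pz)."""
    return pt * np.cos(phi), pt * np.sin(phi), pt * np.sinh(eta)


def xyz_to_pt_eta_phi(px, py, pz):
    """(px, py, pz) -> (pT, eta, phi), with phi in [-pi, pi]."""
    pt = np.hypot(px, py)
    eta = np.arcsinh(pz / pt)  # undefined for pT == 0
    phi = np.arctan2(py, px)
    return pt, eta, phi


pt = np.array([10.0, 25.0])
eta = np.array([0.5, -1.2])
phi = np.array([0.3, -2.0])

px, py, pz = pt_eta_phi_to_xyz(pt, eta, phi)
pt_back, eta_back, phi_back = xyz_to_pt_eta_phi(px, py, pz)  # round trip
```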
-
lumin.data_processing.hep_proc.delta_phi(arr_a, arr_b)
Vectorised computation of modulo 2pi angular separation of array of angles b from array of angles a, in range [-pi, pi]
- Parameters
arr_a (Union[float, ndarray]) – reference angles
arr_b (Union[float, ndarray]) – final angles
- Return type
Union[float, ndarray]
- Returns
angular separation as float or np.ndarray
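One common way to wrap an angular difference is modular arithmetic, sketched below. This is an assumption about how such a function could be written, not the library's code, and it returns values in [-pi, pi), so the treatment of the exact +pi boundary may differ from LUMIN's. The name wrapped_delta_phi is hypothetical.

```python
import numpy as np


def wrapped_delta_phi(arr_a, arr_b):
    """Separation of arr_b from arr_a, wrapped into [-pi, pi)."""
    diff = np.asarray(arr_b) - np.asarray(arr_a)
    return (diff + np.pi) % (2 * np.pi) - np.pi


# The first pair straddles the +/-pi boundary, so the naive difference of -6
# wraps around to a small positive angle.
dphi = wrapped_delta_phi(np.array([3.0, 0.0]), np.array([-3.0, 1.0]))
```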
-
lumin.data_processing.hep_proc.twist(dphi, deta)
Vectorised computation of twist between vectors (https://arxiv.org/abs/1010.3698)
- Parameters
dphi (Union[float, ndarray]) – delta phi separations
deta (Union[float, ndarray]) – delta eta separations
- Return type
Union[float, ndarray]
- Returns
angular separation as float or np.ndarray
-
lumin.data_processing.hep_proc.add_abs_mom(df, vec, z=True)
Vectorised computation of 3-momenta magnitude, adding new column in place. Currently only works for Cartesian vectors
- Parameters
df (DataFrame) – DataFrame to alter
vec (str) – column prefix of vector components, e.g. ‘muon’ for columns [‘muon_px’, ‘muon_py’, ‘muon_pz’]
z (bool) – whether to consider the z-component of the momenta
- Return type
None
-
lumin.data_processing.hep_proc.add_mass(df, vec)
Vectorised computation of mass of 4-vector, adding new column in place.
- Parameters
df (DataFrame) – DataFrame to alter
vec (str) – column prefix of vector components, e.g. ‘muon’ for columns [‘muon_px’, ‘muon_py’, ‘muon_pz’]
- Return type
None
-
lumin.data_processing.hep_proc.add_energy(df, vec)
Vectorised computation of energy of 4-vector, adding new column in place.
- Parameters
df (DataFrame) – DataFrame to alter
vec (str) – column prefix of vector components, e.g. ‘muon’ for columns [‘muon_px’, ‘muon_py’, ‘muon_pz’]
- Return type
None
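Momentum magnitude, energy, and mass are tied together by E² = m² + |p|² in natural units (c = 1). The following sketch works on NumPy arrays instead of DataFrame columns; the function names are hypothetical, and clipping a slightly negative m² to zero is a choice made for this example rather than documented LUMIN behaviour.

```python
import numpy as np


def abs_mom(px, py, pz, z=True):
    """Magnitude of the 3-momentum, or of its transverse part if z is False."""
    p2 = px**2 + py**2
    if z:
        p2 = p2 + pz**2
    return np.sqrt(p2)


def energy_from_mass(px, py, pz, mass):
    """E = sqrt(m^2 + |p|^2)."""
    return np.sqrt(mass**2 + abs_mom(px, py, pz) ** 2)


def mass_from_energy(px, py, pz, energy):
    """m = sqrt(E^2 - |p|^2), with negative rounding errors clipped to zero."""
    m2 = energy**2 - abs_mom(px, py, pz) ** 2
    return np.sqrt(np.clip(m2, 0.0, None))


px, py, pz = np.array([3.0]), np.array([0.0]), np.array([4.0])
e = energy_from_mass(px, py, pz, mass=12.0)
m = mass_from_energy(px, py, pz, e)
p_transverse = abs_mom(px, py, pz, z=False)
```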
-
lumin.data_processing.hep_proc.add_mt(df, vec, mpt_name='mpt')
Vectorised computation of transverse mass of 4-vector with respect to missing transverse momenta, adding new column in place. Currently only works for pT, eta, phi vectors
- Parameters
df (DataFrame) – DataFrame to alter
vec (str) – column prefix of vector components, e.g. ‘muon’ for columns [‘muon_px’, ‘muon_py’, ‘muon_pz’]
mpt_name (str) – column prefix of vector of missing transverse momenta components, e.g. ‘mpt’ for columns [‘mpt_pT’, ‘mpt_phi’]
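The source does not state which definition of transverse mass is used. A widely used one, valid when masses are negligible, is mT = sqrt(2 · pT · pT_miss · (1 − cos Δphi)); the sketch below assumes that definition and uses a hypothetical function name.

```python
import numpy as np


def transverse_mass(pt, phi, mpt_pt, mpt_phi):
    """Transverse mass in the massless approximation (an assumed definition)."""
    # cos is periodic, so the angular difference does not need wrapping first.
    return np.sqrt(2.0 * pt * mpt_pt * (1.0 - np.cos(phi - mpt_phi)))


pt = np.array([40.0, 40.0])
phi = np.array([0.0, 1.0])
mpt_pt = np.array([40.0, 40.0])
mpt_phi = np.array([np.pi, 1.0])  # back-to-back, then collinear

mt = transverse_mass(pt, phi, mpt_pt, mpt_phi)
```

A back-to-back configuration gives the maximum value of 2 · sqrt(pT · pT_miss), while a collinear one gives zero.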
-
lumin.data_processing.hep_proc.get_vecs(feats, strict=True)
Filter list of features to get list of 3-momenta defined in the list. Works for both pT, eta, phi and Cartesian coordinates. If strict, return only vectors with all coordinates present in feature list.
- Parameters
feats (List[str]) – list of features to filter
strict (bool) – whether to require all 3-momenta components to be present in the list
- Return type
Set[str]
- Returns
set of unique 3-momenta prefixes
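The filtering described above can be sketched as grouping feature names by prefix and checking which component suffixes each prefix has. Everything here is an assumption for illustration: the function name, the prefix_suffix naming rule, and the case-insensitive suffix matching are not taken from LUMIN's source.

```python
from typing import Dict, List, Set

PT_ETA_PHI = {"pt", "eta", "phi"}
CARTESIAN = {"px", "py", "pz"}


def find_vec_prefixes(feats: List[str], strict: bool = True) -> Set[str]:
    """Return prefixes of features that look like 3-momentum components."""
    found: Dict[str, Set[str]] = {}
    for feat in feats:
        prefix, sep, suffix = feat.rpartition("_")
        if sep and suffix.lower() in PT_ETA_PHI | CARTESIAN:
            found.setdefault(prefix, set()).add(suffix.lower())
    if not strict:
        return set(found)
    # Strict mode: keep only prefixes with a complete set of components.
    return {p for p, comps in found.items()
            if PT_ETA_PHI <= comps or CARTESIAN <= comps}


feats = ["muon_pT", "muon_eta", "muon_phi", "jet_px", "jet_py", "n_jets", "met_pT"]
strict_vecs = find_vec_prefixes(feats)
loose_vecs = find_vec_prefixes(feats, strict=False)
```

Here only muon has all three components, so it is the only prefix returned in strict mode; jet and met are added when strict is False.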
-
lumin.data_processing.hep_proc.fix_event_phi(df, ref_vec)
Rotate event in phi such that ref_vec is at phi == 0. Performed in place. Currently only works on vectors defined in pT, eta, phi
- Parameters
df (DataFrame) – DataFrame to alter
ref_vec (str) – column prefix of vector components to use as reference, e.g. ‘muon’ for columns [‘muon_pT’, ‘muon_eta’, ‘muon_phi’]
- Return type
None
-
lumin.data_processing.hep_proc.fix_event_z(df, ref_vec)
Flip event in z-axis such that ref_vec is in positive z-direction. Performed in place. Works for both pT, eta, phi and Cartesian coordinates.
- Parameters
df (DataFrame) – DataFrame to alter
ref_vec (str) – column prefix of vector components to use as reference, e.g. ‘muon’ for columns [‘muon_pT’, ‘muon_eta’, ‘muon_phi’]
- Return type
None
-
lumin.data_processing.hep_proc.fix_event_y(df, ref_vec_0, ref_vec_1)
Flip event in y-axis such that ref_vec_1 has a higher py than ref_vec_0. Performed in place. Works for both pT, eta, phi and Cartesian coordinates.
- Parameters
df (DataFrame) – DataFrame to alter
ref_vec_0 (str) – column prefix of vector components to use as reference 0, e.g. ‘muon’ for columns [‘muon_pT’, ‘muon_eta’, ‘muon_phi’]
ref_vec_1 (str) – column prefix of vector components to use as reference 1, e.g. ‘muon’ for columns [‘muon_pT’, ‘muon_eta’, ‘muon_phi’]
- Return type
None
-
lumin.data_processing.hep_proc.event_to_cartesian(df, drop=False, ignore=None)
Convert entire event to Cartesian coordinates, except vectors listed in ignore. Optionally, drop old pT, eta, phi features. Performed in place.
- Parameters
df (DataFrame) – DataFrame to alter
drop (bool) – whether to drop old coordinates
ignore (Optional[List[str]]) – vectors to ignore when converting
- Return type
None
-
lumin.data_processing.hep_proc.proc_event(df, fix_phi=False, fix_y=False, fix_z=False, use_cartesian=False, ref_vec_0=None, ref_vec_1=None, keep_feats=None, default_vals=None)
Process event: pass data through various in-place conversions and drop unneeded columns. Data are expected to consist of vectors defined in pT, eta, phi.
- Parameters
df (DataFrame) – DataFrame to alter
fix_phi (bool) – whether to rotate events using fix_event_phi()
fix_y – whether to flip events using fix_event_y()
fix_z – whether to flip events using fix_event_z()
use_cartesian – whether to convert vectors to Cartesian coordinates
ref_vec_0 (Optional[str]) – column prefix of vector components to use as reference (0) for fix_event_phi(), fix_event_y(), and fix_event_z(), e.g. ‘muon’ for columns [‘muon_pT’, ‘muon_eta’, ‘muon_phi’]
ref_vec_1 (Optional[str]) – column prefix of vector components to use as reference (1) for fix_event_y(), e.g. ‘muon’ for columns [‘muon_pT’, ‘muon_eta’, ‘muon_phi’]
keep_feats (Optional[List[str]]) – columns to keep which would otherwise be dropped
default_vals (Optional[List[str]]) – list of default values which might be used to represent missing vector components. These will be replaced with np.nan.
- Return type
None
-
lumin.data_processing.hep_proc.calc_pair_mass(df, masses, feat_map)
Vectorised computation of invariant mass of pair of particles with given masses, using 3-momenta. Only works for vectors defined in Cartesian coordinates.
- Parameters
df (DataFrame) – DataFrame with vector components
masses (Union[Tuple[float, float], Tuple[ndarray, ndarray]]) – tuple of masses of particles (either constant or different pair of masses per pair of particles)
feat_map (Dict[str, str]) – dictionary mapping of requested momentum components to the features in df
- Return type
ndarray
- Returns
np.ndarray of invariant masses
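The invariant mass of a pair follows from the particle masses and Cartesian 3-momenta: each energy is E_i = sqrt(m_i² + |p_i|²), and M² = (E_0 + E_1)² − |p_0 + p_1|². The sketch below takes (N,3) arrays directly instead of a DataFrame and feat_map; the function name is hypothetical, and clipping negative rounding errors is a choice made for this example.

```python
import numpy as np


def pair_mass(p0, p1, m0, m1):
    """Invariant mass of particle pairs.

    p0, p1: (N, 3) arrays of (px, py, pz); m0, m1: floats or (N,) arrays.
    """
    e0 = np.sqrt(m0**2 + np.sum(p0**2, axis=1))
    e1 = np.sqrt(m1**2 + np.sum(p1**2, axis=1))
    p_tot2 = np.sum((p0 + p1) ** 2, axis=1)
    return np.sqrt(np.clip((e0 + e1) ** 2 - p_tot2, 0.0, None))


# Two massless particles, back-to-back with |p| = 5 each.
p0 = np.array([[3.0, 0.0, 4.0]])
p1 = np.array([[-3.0, 0.0, -4.0]])
pair_masses = pair_mass(p0, p1, 0.0, 0.0)
```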
-
lumin.data_processing.hep_proc.boost(ref_vec, boost_vec, df=None, rescale_boost=False)
Vectorised boosting of reference vectors along boosting vectors. N.B. Implementation adapted from ROOT (https://root.cern/)
- Parameters
ref_vec – either (N,4) array of 4-momenta coordinates for starting vector, or prefix name for starting vector, i.e. columns should have names of the form [ref_vec]_px, etc.
boost_vec – either (N,4) array of 4-momenta coordinates for boosting vector, or prefix name for boosting vector, i.e. columns should have names of the form [boost_vec]_px, etc.
df (Optional[DataFrame]) – DataFrame with data
rescale_boost (bool) – whether to divide the boost vector by its energy
- Return type
ndarray
- Returns
(N,4) array of boosted vector in Cartesian coordinates
-
lumin.data_processing.hep_proc.boost2cm(vec, df=None)
Vectorised computation of boosting vector required to boost a vector to its centre-of-mass frame
- Parameters
vec (Union[ndarray, str]) – either (N,4) array of 4-momenta coordinates for starting vector, or prefix name for starting vector, i.e. columns should have names of the form [vec]_px, etc.
df (Optional[DataFrame]) – DataFrame with data, if supplying a string vec
- Return type
ndarray
- Returns
(N,3) array of boosting vector in Cartesian coordinates
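To see how a boosting vector and a boost fit together, the sketch below implements a generic Lorentz boost on (N,4) arrays ordered as (px, py, pz, E), following the same algebra as ROOT's TLorentzVector::Boost. It is not LUMIN's code: the function names are hypothetical, and taking the rest-frame boosting vector to be −p/E is an assumption about the sign convention.

```python
import numpy as np


def lorentz_boost(vec, beta):
    """Boost (N, 4) vectors (px, py, pz, E) by (N, 3) velocities beta (units of c)."""
    p, e = vec[:, :3], vec[:, 3]
    b2 = np.sum(beta**2, axis=1)
    gamma = 1.0 / np.sqrt(1.0 - b2)
    bp = np.sum(beta * p, axis=1)  # beta . p
    # (gamma - 1) / beta^2, defined as 0 where beta is exactly zero
    gamma2 = np.divide(gamma - 1.0, b2, out=np.zeros_like(b2), where=b2 > 0)
    p_new = p + (gamma2 * bp + gamma * e)[:, None] * beta
    e_new = gamma * (e + bp)
    return np.column_stack((p_new, e_new))


def rest_frame_beta(vec):
    """Velocity that takes each vector to its own rest frame (assumed sign: -p/E)."""
    return -vec[:, :3] / vec[:, 3:]


vec = np.array([[3.0, 0.0, 4.0, 13.0]])  # |p| = 5, E = 13, so mass = 12
boosted = lorentz_boost(vec, rest_frame_beta(vec))
```

In its own rest frame the vector has zero 3-momentum and an energy equal to its mass, which gives a quick sanity check for any boost implementation.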
-
lumin.data_processing.hep_proc.get_momentum(df, vec, include_E=False, as_cart=False)
Extracts array of 3- or 4-momenta coordinates from DataFrame columns
- Parameters
df (DataFrame) – DataFrame with data
vec (str) – prefix name for vector, i.e. columns should have names of the form [vec]_px, etc.
as_cart (bool) – if True will return momenta in Cartesian coordinates
- Returns
(N, 3|4) array with columns (px, py, pz, (E)) or (pT, phi, eta, (E))
-
lumin.data_processing.hep_proc.cos_delta(vec_0, vec_1, df=None, name=None, inplace=False)
Vectorised computation of the cosine of the angular separation of vec_1 from vec_0. If vec_* are strings, then columns are extracted from DataFrame df. If inplace is True, the cosine is added as a new column to the DataFrame with name cosdelta_[vec_0]_[vec_1] or cosdelta, unless name is set
- Parameters
vec_0 (Union[ndarray, str]) – either (N,3) array of 3-momenta coordinates for vector 0, or prefix name for vector zero, i.e. columns should have names of the form [vec_0]_px, etc.
vec_1 (Union[ndarray, str]) – either (N,3) array of 3-momenta coordinates for vector 1, or prefix name for vector one, i.e. columns should have names of the form [vec_1]_px, etc.
df (Optional[DataFrame]) – DataFrame with data
name (Optional[str]) – if set, will create a new column in df for cosdelta with given name, otherwise will generate a name
inplace (bool) – if True will add new column to df, otherwise will return array of cos_deltas
- Return type
Union[None, ndarray]
- Returns
array of cos deltas if not inplace
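For (N,3) arrays of Cartesian 3-momenta, the cosine of the opening angle is the row-wise dot product divided by the product of the magnitudes. The sketch below shows that calculation on plain arrays; the function name is hypothetical and the DataFrame handling of cos_delta is not reproduced.

```python
import numpy as np


def cos_opening_angle(v0, v1):
    """Row-wise cosine of the angle between two (N, 3) arrays of vectors."""
    dot = np.sum(v0 * v1, axis=1)
    norms = np.linalg.norm(v0, axis=1) * np.linalg.norm(v1, axis=1)
    return dot / norms  # undefined for zero-length vectors


v0 = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
v1 = np.array([[0.0, 2.0, 0.0], [-3.0, 0.0, 0.0], [2.0, 2.0, 0.0]])

# Perpendicular, anti-parallel, and parallel pairs respectively.
cosines = cos_opening_angle(v0, v1)
```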
-
lumin.data_processing.hep_proc.delta_r(dphi, deta)
Vectorised computation of delta R separation for arrays of delta phi and delta eta (rapidity or pseudorapidity)
- Parameters
dphi (Union[float, ndarray]) – delta phi separations
deta (Union[float, ndarray]) – delta eta separations
- Return type
Union[float, ndarray]
- Returns
delta R separation as float or np.ndarray
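Delta R is the quadrature sum sqrt(dphi² + deta²), and the dphi passed in should already be wrapped modulo 2pi. The sketch below, with hypothetical names, wraps the phi difference itself to show why that matters for objects on either side of the ±pi boundary.

```python
import numpy as np


def delta_r_from_coords(eta_0, phi_0, eta_1, phi_1):
    """Delta R between two objects given their (eta, phi) coordinates."""
    dphi = (phi_1 - phi_0 + np.pi) % (2 * np.pi) - np.pi  # wrap into [-pi, pi)
    deta = eta_1 - eta_0
    return np.hypot(dphi, deta)


# phi = 3.0 and phi = -3.0 are close in angle despite a raw difference of 6.
dr = delta_r_from_coords(eta_0=0.5, phi_0=3.0, eta_1=-0.5, phi_1=-3.0)
dr_unwrapped = np.hypot(-3.0 - 3.0, -0.5 - 0.5)  # what skipping the wrap would give
```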
-
lumin.data_processing.hep_proc.delta_r_boosted(vec_0, vec_1, ref_vec, df=None, name=None, inplace=False)
Vectorised computation of the deltaR separation of vec_1 from vec_0 in the rest-frame of another vector. If vec_* are strings, then columns are extracted from DataFrame df. If inplace is True, deltaR is added as a new column to the DataFrame with name dR_[vec_0]_[vec_1]_boosted_[ref_vec] or dR_boosted, unless name is set
- Parameters
vec_0 (Union[ndarray, str]) – either (N,4) array of 4-momenta coordinates for vector 0, in Cartesian coordinates, or prefix name for vector zero, i.e. columns should have names of the form [vec_0]_px, etc.
vec_1 (Union[ndarray, str]) – either (N,4) array of 4-momenta coordinates for vector 1, in Cartesian coordinates, or prefix name for vector one, i.e. columns should have names of the form [vec_1]_px, etc.
ref_vec (Union[ndarray, str]) – either (N,4) array of 4-momenta coordinates for the vector in whose rest-frame deltaR should be computed, in Cartesian coordinates, or prefix name for reference vector, i.e. columns should have names of the form [ref_vec]_px, etc.
df (Optional[DataFrame]) – DataFrame with data
name (Optional[str]) – if set, will create a new column in df for the boosted deltaR with given name, otherwise will generate a name
inplace (bool) – if True will add new column to df, otherwise will return array of boosted deltaR
- Return type
Union[None, ndarray]
- Returns
array of boosted deltaR if not inplace
lumin.data_processing.pre_proc module
-
lumin.data_processing.pre_proc.get_pre_proc_pipes(norm_in=True, norm_out=False, pca=False, whiten=False, with_mean=True, with_std=True, n_components=None)
Configure SKLearn Pipelines for processing inputs and targets with the requested transformations.
- Parameters
norm_in (bool) – whether to apply StandardScaler to inputs
norm_out (bool) – whether to apply StandardScaler to outputs
pca (bool) – whether to apply PCA to inputs. Performed prior to StandardScaler. No dimensionality reduction is applied, purely rotation.
whiten (bool) – whether PCA should whiten inputs.
with_mean (bool) – whether StandardScalers should shift means to 0
with_std (bool) – whether StandardScalers should scale standard deviations to 1
n_components (Optional[int]) – if set, causes PCA to reduce the dimensionality of the input data
- Return type
Tuple[Pipeline, Pipeline]
- Returns
Pipeline for input data
Pipeline for target data
-
lumin.data_processing.pre_proc.fit_input_pipe(df, cont_feats, savename=None, input_pipe=None, norm_in=True, pca=False, whiten=False, with_mean=True, with_std=True, n_components=None)
Fit input pipeline to continuous features and optionally save.
- Parameters
df (DataFrame) – DataFrame with data to fit pipeline
cont_feats (Union[str, List[str]]) – (list of) column(s) to use as input data for fitting
savename (Optional[str]) – if set, will save the fitted Pipeline under that name as a pickle (.pkl extension added automatically)
input_pipe (Optional[Pipeline]) – if set, this Pipeline is fitted; otherwise a new Pipeline is instantiated
norm_in (bool) – whether to apply StandardScaler to inputs. Only used if input_pipe is not set.
pca (bool) – whether to apply PCA to inputs. Performed prior to StandardScaler. No dimensionality reduction is applied, purely rotation. Only used if input_pipe is not set.
whiten (bool) – whether PCA should whiten inputs. Only used if input_pipe is not set.
with_mean (bool) – whether StandardScalers should shift means to 0. Only used if input_pipe is not set.
with_std (bool) – whether StandardScalers should scale standard deviations to 1. Only used if input_pipe is not set.
n_components (Optional[int]) – if set, causes PCA to reduce the dimensionality of the input data. Only used if input_pipe is not set.
- Return type
Pipeline
- Returns
Fitted Pipeline
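The options above map naturally onto scikit-learn components. The sketch below assembles a comparable input pipeline by hand, with PCA placed before StandardScaler as the parameter descriptions state. The helper build_input_pipe and the step names are hypothetical; this is not LUMIN's implementation, and saving to a pickle is left out.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def build_input_pipe(norm_in=True, pca=False, whiten=False,
                     with_mean=True, with_std=True, n_components=None):
    """Assemble PCA and/or StandardScaler steps; at least one must be requested."""
    steps = []
    if pca:
        steps.append(("pca", PCA(n_components=n_components, whiten=whiten)))
    if norm_in:
        steps.append(("norm_in", StandardScaler(with_mean=with_mean, with_std=with_std)))
    return Pipeline(steps)


rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(200, 2))  # stand-in for df[cont_feats].values

input_pipe = build_input_pipe(norm_in=True, pca=True)
input_pipe.fit(x)
x_proc = input_pipe.transform(x)
```

After fitting, each transformed column has zero mean and unit standard deviation on the training data, and the same fitted pipeline can then be applied to validation and test data.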
-
lumin.data_processing.pre_proc.fit_output_pipe(df, targ_feats, savename=None, output_pipe=None, norm_out=True)
Fit output pipeline to target features and optionally save. Have you thought about using a y_range for regression instead?
- Parameters
df (DataFrame) – DataFrame with data to fit pipeline
targ_feats (Union[str, List[str]]) – (list of) column(s) to use as input data for fitting
savename (Optional[str]) – if set, will save the fitted Pipeline under that name as a pickle (.pkl extension added automatically)
output_pipe (Optional[Pipeline]) – if set, this Pipeline is fitted; otherwise a new Pipeline is instantiated
norm_out (bool) – whether to apply StandardScaler to outputs. Only used if output_pipe is not set.
- Return type
Pipeline
- Returns
Fitted Pipeline
-
lumin.data_processing.pre_proc.proc_cats(train_df, cat_feats, val_df=None, test_df=None)
Process categorical features in train_df to be valued 0->cardinality-1. Applied in place. Applies the same transformation to validation and testing data if passed. Will complain if validation or testing sets contain categories which are not present in the training data.
- Parameters
train_df (DataFrame) – DataFrame with the training data, which will also be used to specify all the categories to consider
cat_feats (List[str]) – list of columns to use as categorical features
val_df (Optional[DataFrame]) – if set, will apply the same category-to-code mapping to the validation data as was performed on the training data
test_df (Optional[DataFrame]) – if set, will apply the same category-to-code mapping to the testing data as was performed on the training data
- Return type
Tuple[OrderedDict, OrderedDict]
- Returns
ordered dictionary mapping categorical features to dictionaries mapping categories to codes
ordered dictionary mapping categorical features to their cardinalities
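The encoding described above can be sketched in plain Python. In this illustration dicts of lists stand in for DataFrames, codes are assigned in sorted order of the training categories, and a ValueError is raised for unseen categories. All three choices, and the name encode_categories, are assumptions for the example rather than LUMIN's behaviour.

```python
from collections import OrderedDict


def encode_categories(train, cat_feats, val=None):
    """Replace categories with integer codes 0..cardinality-1, in place.

    train, val: dicts mapping column name -> list of values.
    Returns (category -> code maps, cardinalities), both keyed by feature.
    """
    cat_maps, cat_szs = OrderedDict(), OrderedDict()
    for feat in cat_feats:
        categories = sorted(set(train[feat]))
        mapping = OrderedDict((cat, code) for code, cat in enumerate(categories))
        cat_maps[feat] = mapping
        cat_szs[feat] = len(mapping)
        train[feat] = [mapping[v] for v in train[feat]]
        if val is not None:
            unseen = set(val[feat]) - set(mapping)
            if unseen:
                raise ValueError(f"{feat}: categories not in training data: {unseen}")
            val[feat] = [mapping[v] for v in val[feat]]
    return cat_maps, cat_szs


train = {"channel": ["mu", "e", "tau", "e"]}
val = {"channel": ["tau", "mu"]}
cat_maps, cat_szs = encode_categories(train, ["channel"], val=val)
```

Building the mapping from the training data alone is what makes unseen validation or test categories detectable, since they have no code to map to.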