lumin.data_processing package

Submodules

lumin.data_processing.file_proc module

lumin.data_processing.file_proc.save_to_grp(arr, grp, name)

Save Numpy array as a dataset in an h5py Group

Parameters

arr (ndarray) – array to be saved
grp (Group) – group in which to save arr
name (str) – name of dataset to create

Return type

None

lumin.data_processing.file_proc.fold2foldfile(df, out_file, fold_idx, cont_feats, cat_feats, targ_feats, targ_type, misc_feats=None, wgt_feat=None)

Save fold of data into an h5py Group

Parameters

df (DataFrame) – Dataframe from which to save data
out_file (File) – h5py file to save data in
fold_idx (int) – ID for the fold; used to name the h5py group according to ‘fold_{fold_idx}’
cont_feats (List[str]) – list of columns in df to save as continuous variables
cat_feats (List[str]) – list of columns in df to save as discrete variables
targ_feats – (list of) column(s) in df to save as target feature(s)
targ_type (Any) – type of target feature, e.g. int, ‘float32’
misc_feats (optional) – any extra columns to save
wgt_feat (optional) – column to save as data weights

Return type

None

lumin.data_processing.file_proc.df2foldfile(df, n_folds, cont_feats, cat_feats, targ_feats, savename, targ_type, strat_key=None, misc_feats=None, wgt_feat=None)

Convert dataframe into h5py file by splitting data into sub-folds to be accessed by a FoldYielder

Parameters

df (DataFrame) – Dataframe from which to save data
n_folds (int) – number of folds to split df into
cont_feats (List[str]) – list of columns in df to save as continuous variables
cat_feats (List[str]) – list of columns in df to save as discrete variables
targ_feats – (list of) column(s) in df to save as target feature(s)
savename (Union[Path, str]) – name of h5py file to create (.h5py extension not required)
targ_type (str) – type of target feature, e.g. int, ‘float32’
strat_key (optional) – column to use for stratified splitting
misc_feats (optional) – any extra columns to save
wgt_feat (optional) – column to save as data weights

lumin.data_processing.hep_proc module

lumin.data_processing.hep_proc.to_cartesian(df, vec, drop=False)

Vectorised conversion of 3-momenta to Cartesian coordinates in place, optionally dropping old pT, eta, phi features

Parameters

df (DataFrame) – DataFrame to alter
vec (str) – column prefix of vector components to alter, e.g. ‘muon’ for columns [‘muon_pt’, ‘muon_phi’, ‘muon_eta’]
drop (bool) – Whether to remove original columns and just keep the new ones

Return type

None
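As a rough illustration of the conversion described above, the following is a minimal NumPy sketch of the standard collider-coordinate relations (px = pT cos phi, py = pT sin phi, pz = pT sinh eta). The function name pt_eta_phi_to_cartesian is hypothetical and is not part of lumin; the library itself operates on DataFrame columns in place.

```python
import numpy as np


def pt_eta_phi_to_cartesian(pt, eta, phi):
    """Hypothetical helper: convert (pT, eta, phi) arrays to (px, py, pz)."""
    px = pt * np.cos(phi)   # transverse component along x
    py = pt * np.sin(phi)   # transverse component along y
    pz = pt * np.sinh(eta)  # longitudinal component from pseudorapidity
    return px, py, pz


pt = np.array([10.0, 20.0])
eta = np.array([0.0, 1.0])
phi = np.array([0.0, np.pi / 2])
px, py, pz = pt_eta_phi_to_cartesian(pt, eta, phi)
```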

lumin.data_processing.hep_proc.to_pt_eta_phi(df, vec, eta=None, drop=False)

Vectorised conversion of 3-momenta to pT, eta, phi coordinates in place, optionally dropping old px, py, pz features

Attention

eta is now deprecated, as it is inferred from df. It will be removed in V0.4.

Parameters

df (DataFrame) – DataFrame to alter
vec (str) – column prefix of vector components to alter, e.g. ‘muon’ for columns [‘muon_px’, ‘muon_py’, ‘muon_pz’]
eta (Optional[bool]) – deprecated, as it is now inferred from df
drop (bool) – Whether to remove original columns and just keep the new ones

Return type

None
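The inverse conversion can be sketched in the same way. This is an illustrative NumPy version of the usual relations (pT = sqrt(px² + py²), phi = atan2(py, px), eta = asinh(pz / pT)); cartesian_to_pt_eta_phi is a hypothetical name, and the sketch assumes pT is non-zero.

```python
import numpy as np


def cartesian_to_pt_eta_phi(px, py, pz):
    """Hypothetical helper: convert (px, py, pz) arrays to (pT, eta, phi)."""
    pt = np.hypot(px, py)      # transverse momentum
    eta = np.arcsinh(pz / pt)  # pseudorapidity; assumes pt > 0
    phi = np.arctan2(py, px)   # azimuthal angle in [-pi, pi]
    return pt, eta, phi


pt_out, eta_out, phi_out = cartesian_to_pt_eta_phi(
    np.array([3.0]), np.array([4.0]), np.array([0.0])
)
```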

lumin.data_processing.hep_proc.delta_phi(arr_a, arr_b)

Vectorised computation of modulo 2pi angular separation of array of angles b from array of angles a, in range [-pi,pi]

Parameters

arr_a (Union[float, ndarray]) – reference angles
arr_b (Union[float, ndarray]) – final angles

Return type

Union[float, ndarray]

Returns

angular separation as float or np.array
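One common way to wrap an angular difference is shown below as a sketch; it is not necessarily how lumin implements delta_phi. The name wrapped_delta_phi is hypothetical, and note that this particular formula returns values in the half-open interval [-pi, pi), so a separation of exactly +pi is reported as -pi.

```python
import numpy as np


def wrapped_delta_phi(arr_a, arr_b):
    """Hypothetical helper: separation of angles b from angles a, wrapped into [-pi, pi)."""
    d = np.asarray(arr_b) - np.asarray(arr_a)
    # Shift by pi, wrap into [0, 2pi), then shift back
    return (d + np.pi) % (2 * np.pi) - np.pi


# Going from phi = 3.0 to phi = -3.0 crosses the +/-pi boundary,
# so the wrapped separation is small and positive rather than -6.0
sep = wrapped_delta_phi(np.array([0.0, 3.0]), np.array([1.0, -3.0]))
```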

lumin.data_processing.hep_proc.twist(dphi, deta)

Vectorised computation of twist between vectors (https://arxiv.org/abs/1010.3698)

Parameters

dphi (Union[float, ndarray]) – delta phi separations
deta (Union[float, ndarray]) – delta eta separations

Return type

Union[float, ndarray]

Returns

twist as float or np.array
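For orientation, here is a small sketch that assumes the twist is defined as arctan(|dphi| / |deta|); that definition is an assumption here and should be checked against the cited paper and the library source. The name twist_sketch is hypothetical, and arctan2 is used only so that deta == 0 does not trigger a division by zero.

```python
import numpy as np


def twist_sketch(dphi, deta):
    """Hypothetical helper: arctan(|dphi| / |deta|), safe for deta == 0."""
    return np.arctan2(np.abs(dphi), np.abs(deta))


tw = twist_sketch(np.array([1.0, -1.0]), np.array([1.0, 0.0]))
```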

lumin.data_processing.hep_proc.add_abs_mom(df, vec, z=True)

Vectorised computation of 3-momenta magnitude, adding new column in place. Currently only works for Cartesian vectors

Parameters

df (DataFrame) – DataFrame to alter
vec (str) – column prefix of vector components, e.g. ‘muon’ for columns [‘muon_px’, ‘muon_py’, ‘muon_pz’]
z (bool) – whether to consider the z-component of the momenta

Return type

None
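A minimal pandas sketch of this kind of in-place column addition is given below. The function name add_abs_mom_sketch and the output column suffix _absp are assumptions made for illustration; the column name lumin actually writes may differ.

```python
import numpy as np
import pandas as pd


def add_abs_mom_sketch(df, vec, z=True):
    """Hypothetical helper: add |p| for the vector with prefix `vec` as a new column."""
    sq = df[f'{vec}_px'] ** 2 + df[f'{vec}_py'] ** 2
    if z:
        # Include the longitudinal component only when requested
        sq = sq + df[f'{vec}_pz'] ** 2
    df[f'{vec}_absp'] = np.sqrt(sq)  # '_absp' is an assumed column suffix


df = pd.DataFrame({'muon_px': [3.0], 'muon_py': [4.0], 'muon_pz': [12.0]})
add_abs_mom_sketch(df, 'muon')

df_t = pd.DataFrame({'muon_px': [3.0], 'muon_py': [4.0], 'muon_pz': [12.0]})
add_abs_mom_sketch(df_t, 'muon', z=False)
```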

lumin.data_processing.hep_proc.add_mass(df, vec)

Vectorised computation of mass of 4-vector, adding new column in place.

Parameters

df (DataFrame) – DataFrame to alter
vec (str) – column prefix of vector components, e.g. ‘muon’ for columns [‘muon_px’, ‘muon_py’, ‘muon_pz’]

Return type

None

lumin.data_processing.hep_proc.add_energy(df, vec)

Vectorised computation of energy of 4-vector, adding new column in place.

Parameters

df (DataFrame) – DataFrame to alter
vec (str) – column prefix of vector components, e.g. ‘muon’ for columns [‘muon_px’, ‘muon_py’, ‘muon_pz’]

Return type

None
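The mass and energy of a 4-vector are related through E² = m² + |p|² in natural units (c = 1). The sketch below illustrates both directions with NumPy; the helper names are hypothetical, and negative m² values arising from rounding are clipped to zero as a design choice of this sketch.

```python
import numpy as np


def energy_from_mass(px, py, pz, mass):
    """Hypothetical helper: E = sqrt(m^2 + |p|^2), natural units."""
    return np.sqrt(mass ** 2 + px ** 2 + py ** 2 + pz ** 2)


def mass_from_energy(px, py, pz, energy):
    """Hypothetical helper: m = sqrt(E^2 - |p|^2), clipping negative m^2 to zero."""
    m2 = energy ** 2 - (px ** 2 + py ** 2 + pz ** 2)
    return np.sqrt(np.maximum(m2, 0.0))


px, py, pz = np.array([3.0]), np.array([4.0]), np.array([0.0])
energy = energy_from_mass(px, py, pz, mass=np.array([12.0]))
mass = mass_from_energy(px, py, pz, energy)
```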

lumin.data_processing.hep_proc.add_mt(df, vec, mpt_name='mpt')

Vectorised computation of transverse mass of 4-vector with respect to missing transverse momenta, adding new column in place. Currently only works for pT, eta, phi vectors

Parameters

df (DataFrame) – DataFrame to alter
vec (str) – column prefix of vector components, e.g. ‘muon’ for columns [‘muon_pT’, ‘muon_eta’, ‘muon_phi’]
mpt_name (str) – column prefix of vector of missing transverse momenta components, e.g. ‘mpt’ for columns [‘mpt_pT’, ‘mpt_phi’]

lumin.data_processing.hep_proc.get_vecs(feats, strict=True)

Filter list of features to get list of 3-momenta defined in the list. Works for both pT, eta, phi and Cartesian coordinates. If strict, return only vectors with all coordinates present in feature list.

Parameters

feats (List[str]) – list of features to filter
strict (bool) – whether to require all 3-momenta components to be present in the list

Return type

Set[str]

Returns

set of unique 3-momenta prefixes
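The filtering described above might look roughly like the following pure-Python sketch. It assumes column names of the form prefix_pT, prefix_eta, prefix_phi or prefix_px, prefix_py, prefix_pz; get_vecs_sketch is a hypothetical name, and the library's actual matching rules may differ.

```python
from typing import List, Set


def get_vecs_sketch(feats: List[str], strict: bool = True) -> Set[str]:
    """Hypothetical helper: find 3-momentum prefixes in a list of feature names."""
    coord_sets = [('pT', 'eta', 'phi'), ('px', 'py', 'pz')]  # assumed suffix conventions
    feat_set = set(feats)
    vecs = set()
    for coords in coord_sets:
        # Candidate prefixes: whatever precedes a recognised '_<coord>' suffix
        prefixes = {f[:-len(c) - 1] for f in feats for c in coords if f.endswith(f'_{c}')}
        for p in prefixes:
            present = [f'{p}_{c}' in feat_set for c in coords]
            if all(present) or (not strict and any(present)):
                vecs.add(p)
    return vecs


feats = ['muon_pT', 'muon_eta', 'muon_phi', 'jet_px', 'jet_py', 'mpt_pT', 'mpt_phi', 'n_jets']
strict_vecs = get_vecs_sketch(feats, strict=True)
loose_vecs = get_vecs_sketch(feats, strict=False)
```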

lumin.data_processing.hep_proc.fix_event_phi(df, ref_vec)

Rotate event in phi such that ref_vec is at phi == 0. Performed in place. Currently only works on vectors defined in pT, eta, phi

Parameters

df (DataFrame) – DataFrame to alter
ref_vec (str) – column prefix of vector components to use as reference, e.g. ‘muon’ for columns [‘muon_pT’, ‘muon_eta’, ‘muon_phi’]

Return type

None

lumin.data_processing.hep_proc.fix_event_z(df, ref_vec)

Flip event in z-axis such that ref_vec is in positive z-direction. Performed in place. Works for both pT, eta, phi and Cartesian coordinates.

Parameters

df (DataFrame) – DataFrame to alter
ref_vec (str) – column prefix of vector components to use as reference, e.g. ‘muon’ for columns [‘muon_pT’, ‘muon_eta’, ‘muon_phi’]

Return type

None
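To make the z-flip concrete, here is a simplified pandas sketch restricted to pT, eta, phi coordinates: in every event where the reference vector has negative eta, the sign of all eta columns is flipped. The name flip_event_z_sketch and the rule of selecting columns by the _eta suffix are assumptions of this sketch, not the library's implementation.

```python
import pandas as pd


def flip_event_z_sketch(df, ref_vec):
    """Hypothetical helper: flip eta signs in place where the reference eta is negative."""
    flip = df[f'{ref_vec}_eta'] < 0  # boolean mask, computed once before any column changes
    for col in [c for c in df.columns if c.endswith('_eta')]:
        df.loc[flip, col] = -df.loc[flip, col]


df = pd.DataFrame({'muon_eta': [-1.0, 2.0], 'jet_eta': [0.5, -0.5], 'jet_pT': [30.0, 40.0]})
flip_event_z_sketch(df, 'muon')
```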

lumin.data_processing.hep_proc.fix_event_y(df, ref_vec_0, ref_vec_1)

Flip event in y-axis such that ref_vec_1 has a higher py than ref_vec_0. Performed in place. Works for both pT, eta, phi and Cartesian coordinates.

Parameters

df (DataFrame) – DataFrame to alter
ref_vec_0 (str) – column prefix of vector components to use as reference 0, e.g. ‘muon’ for columns [‘muon_pT’, ‘muon_eta’, ‘muon_phi’]
ref_vec_1 (str) – column prefix of vector components to use as reference 1, e.g. ‘muon’ for columns [‘muon_pT’, ‘muon_eta’, ‘muon_phi’]

Return type

None

lumin.data_processing.hep_proc.event_to_cartesian(df, drop=False, ignore=None)

Convert entire event to Cartesian coordinates, except vectors listed in ignore. Optionally, drop old pT, eta, phi features. Performed in place.

Parameters

df (DataFrame) – DataFrame to alter
drop (bool) – whether to drop old coordinates
ignore (Optional[List[str]]) – vectors to ignore when converting

Return type

None

lumin.data_processing.hep_proc.proc_event(df, fix_phi=False, fix_y=False, fix_z=False, use_cartesian=False, ref_vec_0=None, ref_vec_1=None, keep_feats=None, default_vals=None)

Process event: pass data through various in-place conversions and drop unneeded columns. Data are expected to consist of vectors defined in pT, eta, phi.

Parameters

df (DataFrame) – DataFrame to alter
fix_phi (bool) – whether to rotate events using fix_event_phi()
fix_y (bool) – whether to flip events using fix_event_y()
fix_z (bool) – whether to flip events using fix_event_z()
use_cartesian (bool) – whether to convert vectors to Cartesian coordinates
ref_vec_0 (Optional[str]) – column prefix of vector components to use as reference (0) for fix_event_phi(), fix_event_y(), and fix_event_z(), e.g. ‘muon’ for columns [‘muon_pT’, ‘muon_eta’, ‘muon_phi’]
ref_vec_1 (Optional[str]) – column prefix of vector components to use as reference 1 for fix_event_y(), e.g. ‘muon’ for columns [‘muon_pT’, ‘muon_eta’, ‘muon_phi’]
keep_feats (Optional[List[str]]) – columns to keep which would otherwise be dropped
default_vals (Optional[List[str]]) – list of default values which might be used to represent missing vector components. These will be replaced with np.nan.

Return type

None

lumin.data_processing.hep_proc.calc_pair_mass(df, masses, feat_map)

Vectorised computation of invariant mass of pair of particles with given masses, using 3-momenta. Only works for vectors defined in Cartesian coordinates.

Parameters

df (DataFrame) – DataFrame containing the vector components
masses (Union[Tuple[float, float], Tuple[ndarray, ndarray]]) – tuple of masses of particles (either constant or different pair of masses per pair of particles)
feat_map (Dict[str, str]) – dictionary mapping of requested momentum components to the features in df

Return type

ndarray

Returns

np.array of invariant masses
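The underlying calculation can be sketched with NumPy as follows: each particle's energy is recovered from its mass and 3-momentum, and the pair mass follows from M² = (E0 + E1)² - |p0 + p1|² in natural units. The function pair_mass and its array-based interface are hypothetical; the library function instead reads the components from df via feat_map.

```python
import numpy as np


def pair_mass(p0, p1, m0, m1):
    """Hypothetical helper: invariant mass of particle pairs.

    p0, p1: arrays of shape (n, 3) holding (px, py, pz);
    m0, m1: floats or arrays of shape (n,) holding the particle masses.
    """
    e0 = np.sqrt(m0 ** 2 + np.sum(p0 ** 2, axis=1))
    e1 = np.sqrt(m1 ** 2 + np.sum(p1 ** 2, axis=1))
    p_tot2 = np.sum((p0 + p1) ** 2, axis=1)
    # Clip tiny negative values caused by floating-point rounding
    return np.sqrt(np.maximum((e0 + e1) ** 2 - p_tot2, 0.0))


# Event 0: massless, back-to-back along z with |p| = 5 each -> M = 10
# Event 1: both particles at rest in a frame with zero momentum -> M = m0 + m1
p0 = np.array([[0.0, 0.0, 5.0], [0.0, 0.0, 0.0]])
p1 = np.array([[0.0, 0.0, -5.0], [0.0, 0.0, 0.0]])
m_pair = pair_mass(p0, p1, np.array([0.0, 1.0]), np.array([0.0, 2.0]))
```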

lumin.data_processing.pre_proc module

lumin.data_processing.pre_proc.get_pre_proc_pipes(norm_in=True, norm_out=False, pca=False, whiten=False, with_mean=True, with_std=True, n_components=None)

Configure SKLearn Pipelines for processing inputs and targets with the requested transformations.

Parameters

norm_in (bool) – whether to apply StandardScaler to inputs
norm_out (bool) – whether to apply StandardScaler to outputs
pca (bool) – whether to apply PCA to inputs. Performed prior to StandardScaler. No dimensionality reduction is applied, purely rotation.
whiten (bool) – whether PCA should whiten inputs.
with_mean (bool) – whether StandardScalers should shift means to 0
with_std (bool) – whether StandardScalers should scale standard deviations to 1
n_components (Optional[int]) – if set, causes PCA to reduce the dimensionality of the input data

Return type

Tuple[Pipeline, Pipeline]

Returns

Pipeline for input data, Pipeline for target data
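An input pipeline of the kind described above can be assembled directly from scikit-learn components, as in the sketch below. The function make_input_pipe and the step names 'pca' and 'norm_in' are assumptions for illustration and may not match what lumin builds internally; the ordering (PCA before StandardScaler) follows the parameter description.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def make_input_pipe(norm_in=True, pca=False, whiten=False,
                    with_mean=True, with_std=True, n_components=None):
    """Hypothetical helper: optional PCA followed by optional standardisation."""
    steps = []
    if pca:
        # n_components=None keeps every component, i.e. a pure rotation
        steps.append(('pca', PCA(n_components=n_components, whiten=whiten)))
    if norm_in:
        steps.append(('norm_in', StandardScaler(with_mean=with_mean, with_std=with_std)))
    return Pipeline(steps)  # this sketch expects at least one step to be enabled


rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(200, 2))
pipe = make_input_pipe(norm_in=True, pca=True)
x_proc = pipe.fit_transform(x)
```

After fitting, each column of x_proc has zero mean and unit standard deviation, and the same fitted pipe can be applied to new data with pipe.transform.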

lumin.data_processing.pre_proc.fit_input_pipe(df, cont_feats, savename=None, input_pipe=None, norm_in=True, pca=False, whiten=False, with_mean=True, with_std=True, n_components=None)

Fit input pipeline to continuous features and optionally save.

Parameters

df (DataFrame) – DataFrame with data to fit pipeline
cont_feats (Union[str, List[str]]) – (list of) column(s) to use as input data for fitting
savename (Optional[str]) – if set, will save the fitted Pipeline under that name as a Pickle (.pkl extension added automatically)
input_pipe (Optional[Pipeline]) – if set, will fit this Pipeline; otherwise will instantiate a new Pipeline
norm_in (bool) – whether to apply StandardScaler to inputs. Only used if input_pipe is not set.
pca (bool) – whether to apply PCA to inputs. Performed prior to StandardScaler. No dimensionality reduction is applied, purely rotation. Only used if input_pipe is not set.
whiten (bool) – whether PCA should whiten inputs. Only used if input_pipe is not set.
with_mean (bool) – whether StandardScalers should shift means to 0. Only used if input_pipe is not set.
with_std (bool) – whether StandardScalers should scale standard deviations to 1. Only used if input_pipe is not set.
n_components (Optional[int]) – if set, causes PCA to reduce the dimensionality of the input data. Only used if input_pipe is not set.

Return type

Pipeline

Returns

Fitted Pipeline

lumin.data_processing.pre_proc.fit_output_pipe(df, targ_feats, savename=None, output_pipe=None, norm_out=True)

Fit output pipeline to target features and optionally save. Have you thought about using a y_range for regression instead?

Parameters

df (DataFrame) – DataFrame with data to fit pipeline
targ_feats (Union[str, List[str]]) – (list of) column(s) to use as input data for fitting
savename (Optional[str]) – if set, will save the fitted Pipeline under that name as a Pickle (.pkl extension added automatically)
output_pipe (Optional[Pipeline]) – if set, will fit this Pipeline; otherwise will instantiate a new Pipeline
norm_out (bool) – whether to apply StandardScaler to outputs. Only used if output_pipe is not set.

Return type

Pipeline

Returns

Fitted Pipeline

lumin.data_processing.pre_proc.proc_cats(train_df, cat_feats, val_df=None, test_df=None)

Process categorical features in train_df to be valued 0->cardinality-1. Applied in place. Applies the same transformation to validation and testing data, if passed. Will complain if validation or testing sets contain categories which are not present in the training data.

Parameters

train_df (DataFrame) – DataFrame with the training data, which will also be used to specify all the categories to consider
cat_feats (List[str]) – list of columns to use as categorical features
val_df (Optional[DataFrame]) – if set, will apply the same category-to-code mapping to the validation data as was performed on the training data
test_df (Optional[DataFrame]) – if set, will apply the same category-to-code mapping to the testing data as was performed on the training data

Return type

Tuple[OrderedDict, OrderedDict]

Returns

ordered dictionary mapping categorical features to dictionaries mapping categories to codes, ordered dictionary mapping categorical features to their cardinalities
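The category-to-code mapping can be illustrated with the following pandas sketch. It is a simplified stand-in, not the library code: proc_cats_sketch is a hypothetical name, codes are assigned in sorted order of the training categories (an assumption), only a validation set is handled, and unseen categories raise a ValueError.

```python
from collections import OrderedDict

import pandas as pd


def proc_cats_sketch(train_df, cat_feats, val_df=None):
    """Hypothetical helper: encode categories as 0..cardinality-1, in place."""
    cat_maps, cat_szs = OrderedDict(), OrderedDict()
    for feat in cat_feats:
        cats = sorted(train_df[feat].unique())  # training data defines the categories
        cat_maps[feat] = {c: i for i, c in enumerate(cats)}
        cat_szs[feat] = len(cats)
        train_df[feat] = train_df[feat].map(cat_maps[feat])
        if val_df is not None:
            unseen = set(val_df[feat].unique()) - set(cats)
            if unseen:
                raise ValueError(f'{feat} has categories not in training data: {unseen}')
            val_df[feat] = val_df[feat].map(cat_maps[feat])
    return cat_maps, cat_szs


train_df = pd.DataFrame({'channel': ['mu', 'e', 'tau', 'e']})
val_df = pd.DataFrame({'channel': ['tau', 'e']})
cat_maps, cat_szs = proc_cats_sketch(train_df, ['channel'], val_df=val_df)
```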
