
lumin.data_processing package

Submodules

lumin.data_processing.file_proc module

lumin.data_processing.file_proc.save_to_grp(arr, grp, name)[source]

Save Numpy array as a dataset in an h5py Group

Parameters
  • arr (ndarray) – array to be saved

  • grp (Group) – group in which to save arr

  • name (str) – name of dataset to create

Return type

None

lumin.data_processing.file_proc.fold2foldfile(df, out_file, fold_idx, cont_feats, cat_feats, targ_feats, targ_type, misc_feats=None, wgt_feat=None)[source]

Save fold of data into an h5py Group

Parameters
  • df (DataFrame) – Dataframe from which to save data

  • out_file (File) – h5py file to save data in

  • fold_idx (int) – ID for the fold; used to name the h5py group according to ‘fold_{fold_idx}’

  • cont_feats (List[str]) – list of columns in df to save as continuous variables

  • cat_feats (List[str]) – list of columns in df to save as discrete variables

  • targ_feats – (list of) column(s) in df to save as target feature(s)

  • targ_type (Any) – type of target feature, e.g. int, ‘float32’

  • misc_feats (optional) – any extra columns to save

  • wgt_feat (optional) – column to save as data weights

Return type

None

lumin.data_processing.file_proc.df2foldfile(df, n_folds, cont_feats, cat_feats, targ_feats, savename, targ_type, strat_key=None, misc_feats=None, wgt_feat=None)[source]

Convert dataframe into h5py file by splitting data into sub-folds to be accessed by a FoldYielder

Parameters
  • df (DataFrame) – Dataframe from which to save data

  • n_folds (int) – number of folds to split df into

  • cont_feats (List[str]) – list of columns in df to save as continuous variables

  • cat_feats (List[str]) – list of columns in df to save as discrete variables

  • targ_feats – (list of) column(s) in df to save as target feature(s)

  • savename (Union[Path, str]) – name of h5py file to create (.h5py extension not required)

  • targ_type (str) – type of target feature, e.g. int, ‘float32’

  • strat_key (optional) – column to use for stratified splitting

  • misc_feats (optional) – any extra columns to save

  • wgt_feat (optional) – column to save as data weights
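
The fold-splitting step described for df2foldfile can be pictured with a minimal NumPy sketch. The helper name split_into_folds and its seed argument are hypothetical, and this plain shuffled split is only a stand-in: it does not reproduce LUMIN's actual splitting, nor the stratification enabled by strat_key.

```python
import numpy as np

def split_into_folds(n_rows, n_folds, seed=0):
    """Shuffle row indices and split them into n_folds near-equal groups.

    Hypothetical helper for illustration only; not part of LUMIN.
    """
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    return np.array_split(shuffled, n_folds)

folds = split_into_folds(n_rows=10, n_folds=3)
for fold_idx, idx in enumerate(folds):
    # Each group of rows would be written to its own 'fold_{fold_idx}' group
    print(f"fold_{fold_idx}: rows {sorted(idx.tolist())}")
```

Every row lands in exactly one fold, which is the property a fold file relies on when folds are later used for training and validation.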

lumin.data_processing.hep_proc module

lumin.data_processing.hep_proc.to_cartesian(df, vec, drop=False)[source]

Vectorised conversion of 3-momenta to Cartesian coordinates in place, optionally dropping old pT, eta, phi features

Parameters
  • df (DataFrame) – DataFrame to alter

  • vec (str) – column prefix of vector components to alter, e.g. ‘muon’ for columns [‘muon_pt’, ‘muon_phi’, ‘muon_eta’]

  • drop (bool) – Whether to remove original columns and just keep the new ones

Return type

None
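
The coordinate change behind to_cartesian follows the standard collider-physics relations. The sketch below works on plain NumPy arrays rather than DataFrame columns, and the function name pt_eta_phi_to_cartesian is hypothetical; it illustrates the maths, not LUMIN's implementation.

```python
import numpy as np

def pt_eta_phi_to_cartesian(pt, eta, phi):
    """Convert (pT, eta, phi) to (px, py, pz) using the standard relations."""
    px = pt * np.cos(phi)
    py = pt * np.sin(phi)
    pz = pt * np.sinh(eta)
    return px, py, pz

pt = np.array([10.0, 20.0])
eta = np.array([0.0, 1.0])
phi = np.array([0.0, np.pi / 2])
px, py, pz = pt_eta_phi_to_cartesian(pt, eta, phi)
```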

lumin.data_processing.hep_proc.to_pt_eta_phi(df, vec, eta=None, drop=False)[source]

Vectorised conversion of 3-momenta to pT, eta, phi coordinates in place, optionally dropping old px, py, pz features

Attention

eta is now deprecated, as it is inferred from df. It will be removed in V0.4.

Parameters
  • df (DataFrame) – DataFrame to alter

  • vec (str) – column prefix of vector components to alter, e.g. ‘muon’ for columns [‘muon_px’, ‘muon_py’, ‘muon_pz’]

  • eta (Optional[bool]) – deprecated, as it is now inferred from df

  • drop (bool) – Whether to remove original columns and just keep the new ones

Return type

None
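
The inverse conversion can be sketched in the same way. Again this operates on NumPy arrays, the name cartesian_to_pt_eta_phi is hypothetical, and the code only illustrates the standard relations rather than LUMIN's own code.

```python
import numpy as np

def cartesian_to_pt_eta_phi(px, py, pz):
    """Convert (px, py, pz) to (pT, eta, phi) using the standard relations."""
    pt = np.hypot(px, py)        # transverse momentum
    phi = np.arctan2(py, px)     # azimuthal angle in [-pi, pi]
    eta = np.arcsinh(pz / pt)    # pseudorapidity; undefined when pT == 0
    return pt, eta, phi

pt, eta, phi = cartesian_to_pt_eta_phi(
    np.array([3.0]), np.array([4.0]), np.array([0.0])
)
```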

lumin.data_processing.hep_proc.delta_phi(arr_a, arr_b)[source]

Vectorised computation of modulo 2pi angular separation of array of angles b from array of angles a, in range [-pi,pi]

Parameters
  • arr_a (Union[float, ndarray]) – reference angles

  • arr_b (Union[float, ndarray]) – final angles

Return type

Union[float, ndarray]

Returns

angular separation as float or np.array
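
One common way to wrap an angular difference into this range is with a modulo operation. The sketch below is an assumption about how such a wrap can be written, not a copy of delta_phi; the name wrapped_delta_phi is hypothetical, and this particular formula returns -pi rather than +pi at the boundary.

```python
import numpy as np

def wrapped_delta_phi(arr_a, arr_b):
    """Separation of angles b from angles a, wrapped into [-pi, pi)."""
    dphi = np.asarray(arr_b) - np.asarray(arr_a)
    # Shift by pi, wrap into [0, 2pi), then shift back
    return (dphi + np.pi) % (2 * np.pi) - np.pi

sep = wrapped_delta_phi(np.array([0.0, 3.0]), np.array([1.5 * np.pi, -3.0]))
```

In the second element, the raw difference of -6 radians wraps round to 2pi - 6, a small positive separation.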

lumin.data_processing.hep_proc.twist(dphi, deta)[source]

Vectorised computation of twist between vectors (https://arxiv.org/abs/1010.3698)

Parameters
  • dphi (Union[float, ndarray]) – delta phi separations

  • deta (Union[float, ndarray]) – delta eta separations

Return type

Union[float, ndarray]

Returns

twist as float or np.array
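
As an illustration, and assuming the twist is defined as the arctangent of |dphi/deta| as in the cited paper, the computation can be sketched as follows. The name twist_angle is hypothetical, and arctan2 is used here only to avoid a division by zero when deta is 0.

```python
import numpy as np

def twist_angle(dphi, deta):
    """Sketch of twist, assumed here to be arctan(|dphi| / |deta|)."""
    # arctan2 of two non-negative arguments lies in [0, pi/2]
    return np.arctan2(np.abs(dphi), np.abs(deta))

tw = twist_angle(np.array([1.0, -1.0, 0.0]), np.array([1.0, 0.0, 2.0]))
```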

lumin.data_processing.hep_proc.add_abs_mom(df, vec, z=True)[source]

Vectorised computation of 3-momenta magnitude, adding new column in place. Currently only works for Cartesian vectors

Parameters
  • df (DataFrame) – DataFrame to alter

  • vec (str) – column prefix of vector components, e.g. ‘muon’ for columns [‘muon_px’, ‘muon_py’, ‘muon_pz’]

  • z (bool) – whether to consider the z-component of the momenta

Return type

None
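
The effect of the z flag can be pictured with a short NumPy sketch. The function abs_momentum is hypothetical and takes arrays instead of a DataFrame; passing no pz plays the role of z=False.

```python
import numpy as np

def abs_momentum(px, py, pz=None):
    """Magnitude of the momentum; omit pz to ignore the z-component."""
    sq = px**2 + py**2
    if pz is not None:
        sq = sq + pz**2
    return np.sqrt(sq)

p_full = abs_momentum(np.array([1.0]), np.array([2.0]), np.array([2.0]))
p_trans = abs_momentum(np.array([3.0]), np.array([4.0]))
```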

lumin.data_processing.hep_proc.add_mass(df, vec)[source]

Vectorised computation of mass of 4-vector, adding new column in place.

Parameters
  • df (DataFrame) – DataFrame to alter

  • vec (str) – column prefix of vector components, e.g. ‘muon’ for columns [‘muon_px’, ‘muon_py’, ‘muon_pz’]

Return type

None

lumin.data_processing.hep_proc.add_energy(df, vec)[source]

Vectorised computation of energy of 4-vector, adding new column in place.

Parameters
  • df (DataFrame) – DataFrame to alter

  • vec (str) – column prefix of vector components, e.g. ‘muon’ for columns [‘muon_px’, ‘muon_py’, ‘muon_pz’]

Return type

None
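
The mass and energy computations rest on the relativistic relation E^2 = m^2 + |p|^2 (natural units). The sketch below assumes Cartesian components held in NumPy arrays; the function names are hypothetical and the code is not taken from LUMIN.

```python
import numpy as np

def energy_from_mass(px, py, pz, mass):
    """E = sqrt(m^2 + |p|^2)."""
    return np.sqrt(mass**2 + px**2 + py**2 + pz**2)

def mass_from_energy(px, py, pz, energy):
    """m = sqrt(E^2 - |p|^2), clipping small negative values from rounding."""
    m_sq = energy**2 - (px**2 + py**2 + pz**2)
    return np.sqrt(np.clip(m_sq, 0.0, None))

px, py, pz = np.array([3.0]), np.array([4.0]), np.array([0.0])
energy = energy_from_mass(px, py, pz, mass=np.array([12.0]))
mass = mass_from_energy(px, py, pz, energy)
```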

lumin.data_processing.hep_proc.add_mt(df, vec, mpt_name='mpt')[source]

Vectorised computation of transverse mass of 4-vector with respect to missing transverse momenta, adding new column in place. Currently only works for pT, eta, phi vectors

Parameters
  • df (DataFrame) – DataFrame to alter

  • vec (str) – column prefix of vector components, e.g. ‘muon’ for columns [‘muon_pT’, ‘muon_eta’, ‘muon_phi’]

  • mpt_name (str) – column prefix of vector of missing transverse momenta components, e.g. ‘mpt’ for columns [‘mpt_pT’, ‘mpt_phi’]
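
For orientation, a widely used expression for transverse mass in the massless approximation is mT = sqrt(2 pT mpT (1 - cos(dphi))). The sketch below assumes that formula; whether add_mt uses exactly this form is not stated here, and the name transverse_mass is hypothetical.

```python
import numpy as np

def transverse_mass(pt, phi, mpt, mpt_phi):
    """Transverse mass of a (massless) particle and the missing momentum."""
    dphi = phi - mpt_phi
    return np.sqrt(2.0 * pt * mpt * (1.0 - np.cos(dphi)))

mt = transverse_mass(
    pt=np.array([40.0, 40.0]),
    phi=np.array([0.0, 0.0]),
    mpt=np.array([40.0, 40.0]),
    mpt_phi=np.array([np.pi, 0.0]),
)
```

The first event is back-to-back in phi and gives the maximum value for these momenta; the second is collinear and gives zero.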

lumin.data_processing.hep_proc.get_vecs(feats, strict=True)[source]

Filter list of features to get list of 3-momenta defined in the list. Works for both pT, eta, phi and Cartesian coordinates. If strict, return only vectors with all coordinates present in feature list.

Parameters
  • feats (List[str]) – list of features to filter

  • strict (bool) – whether to require all 3-momenta components to be present in the list

Return type

Set[str]

Returns

set of unique 3-momenta prefixes
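
The filtering described above can be approximated in plain Python. This sketch assumes features are named '{prefix}_{component}' with components pT/eta/phi or px/py/pz, as in the examples in this module; find_vector_prefixes is a hypothetical stand-in, not the get_vecs implementation.

```python
def find_vector_prefixes(feats, strict=True):
    """Return prefixes that look like 3-momenta in a list of feature names."""
    coord_sets = [("pT", "eta", "phi"), ("px", "py", "pz")]
    feat_set = set(feats)
    prefixes = set()
    for feat in feats:
        prefix, sep, suffix = feat.rpartition("_")
        if not sep:
            continue  # no underscore, so not a vector component
        for coords in coord_sets:
            if suffix not in coords:
                continue
            # In strict mode, require every component of the coordinate set
            if not strict or all(f"{prefix}_{c}" in feat_set for c in coords):
                prefixes.add(prefix)
    return prefixes

feats = ["muon_pT", "muon_eta", "muon_phi", "jet_px", "jet_py", "mass"]
strict_vecs = find_vector_prefixes(feats)
loose_vecs = find_vector_prefixes(feats, strict=False)
```

Here 'jet' is only found in non-strict mode because 'jet_pz' is missing from the list.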

lumin.data_processing.hep_proc.fix_event_phi(df, ref_vec)[source]

Rotate event in phi such that ref_vec is at phi == 0. Performed in place. Currently only works on vectors defined in pT, eta, phi

Parameters
  • df (DataFrame) – DataFrame to alter

  • ref_vec (str) – column prefix of vector components to use as reference, e.g. ‘muon’ for columns [‘muon_pT’, ‘muon_eta’, ‘muon_phi’]

Return type

None

lumin.data_processing.hep_proc.fix_event_z(df, ref_vec)[source]

Flip event in z-axis such that ref_vec is in positive z-direction. Performed in place. Works for both pT, eta, phi and Cartesian coordinates.

Parameters
  • df (DataFrame) – DataFrame to alter

  • ref_vec (str) – column prefix of vector components to use as reference, e.g. ‘muon’ for columns [‘muon_pT’, ‘muon_eta’, ‘muon_phi’]

Return type

None

lumin.data_processing.hep_proc.fix_event_y(df, ref_vec_0, ref_vec_1)[source]

Flip event in y-axis such that ref_vec_1 has a higher py than ref_vec_0. Performed in place. Works for both pT, eta, phi and Cartesian coordinates.

Parameters
  • df (DataFrame) – DataFrame to alter

  • ref_vec_0 (str) – column prefix of vector components to use as reference 0, e.g. ‘muon’ for columns [‘muon_pT’, ‘muon_eta’, ‘muon_phi’]

  • ref_vec_1 (str) – column prefix of vector components to use as reference 1, e.g. ‘muon’ for columns [‘muon_pT’, ‘muon_eta’, ‘muon_phi’]

Return type

None
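
The three symmetry operations (rotation in phi, flip in z, flip in y) can be illustrated on NumPy arrays in pT, eta, phi coordinates. All helper names below are hypothetical, both vectors are taken to have unit pT for simplicity, and the code is a sketch of the geometry rather than LUMIN's implementation.

```python
import numpy as np

def wrap(phi):
    """Wrap angles into [-pi, pi)."""
    return (phi + np.pi) % (2 * np.pi) - np.pi

def rotate_to_ref(phi, ref_phi):
    """Rotate in phi so that the reference vector sits at phi == 0."""
    return wrap(phi - ref_phi)

def flip_z(eta, ref_eta):
    """Mirror in z (eta -> -eta) for events whose reference has eta < 0."""
    return np.where(ref_eta < 0, -eta, eta)

def flip_y(phi, py_0, py_1):
    """Mirror in y (phi -> -phi) for events where py_1 is below py_0."""
    return np.where(py_1 < py_0, -phi, phi)

# Two events: 'muon' is reference 0, 'jet' is reference 1
muon_phi, muon_eta = np.array([1.0, -2.0]), np.array([0.5, -1.5])
jet_phi, jet_eta = np.array([2.0, 3.0]), np.array([-0.2, 0.7])

jet_phi_rot = rotate_to_ref(jet_phi, muon_phi)
muon_phi_rot = rotate_to_ref(muon_phi, muon_phi)
jet_eta_fixed = flip_z(jet_eta, muon_eta)
muon_eta_fixed = flip_z(muon_eta, muon_eta)
# With unit pT, py is simply sin(phi)
jet_phi_fixed = flip_y(jet_phi_rot, np.sin(muon_phi_rot), np.sin(jet_phi_rot))
```

After these steps the reference muon points along phi == 0 with positive eta, and the jet has non-negative py in both events.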

lumin.data_processing.hep_proc.event_to_cartesian(df, drop=False, ignore=None)[source]

Convert entire event to Cartesian coordinates, except vectors listed in ignore. Optionally, drop old pT, eta, phi features. Performed in place.

Parameters
  • df (DataFrame) – DataFrame to alter

  • drop (bool) – whether to drop old coordinates

  • ignore (Optional[List[str]]) – vectors to ignore when converting

Return type

None

lumin.data_processing.hep_proc.proc_event(df, fix_phi=False, fix_y=False, fix_z=False, use_cartesian=False, ref_vec_0=None, ref_vec_1=None, keep_feats=None, default_vals=None)[source]

Process event: pass data through various in-place conversions and drop unneeded columns. Data is expected to consist of vectors defined in pT, eta, phi.

Parameters
  • df (DataFrame) – DataFrame to alter

  • fix_phi (bool) – whether to rotate events using fix_event_phi()

  • fix_y – whether to flip events using fix_event_y()

  • fix_z – whether to flip events using fix_event_z()

  • use_cartesian – whether to convert vectors to Cartesian coordinates

  • ref_vec_0 (Optional[str]) – column prefix of vector components to use as reference (0) for fix_event_phi(), fix_event_y(), and fix_event_z(), e.g. ‘muon’ for columns [‘muon_pT’, ‘muon_eta’, ‘muon_phi’]

  • ref_vec_1 (Optional[str]) – column prefix of vector components to use as reference 1 for fix_event_y(), e.g. ‘muon’ for columns [‘muon_pT’, ‘muon_eta’, ‘muon_phi’]

  • keep_feats (Optional[List[str]]) – columns to keep which would otherwise be dropped

  • default_vals (Optional[List[str]]) – list of default values which might be used to represent missing vector components. These will be replaced with np.nan.

Return type

None

lumin.data_processing.hep_proc.calc_pair_mass(df, masses, feat_map)[source]

Vectorised computation of invariant mass of pair of particles with given masses, using 3-momenta. Only works for vectors defined in Cartesian coordinates.

Parameters
  • df (DataFrame) – DataFrame containing the vector components

  • masses (Union[Tuple[float, float], Tuple[ndarray, ndarray]]) – tuple of masses of particles (either constant or different pair of masses per pair of particles)

  • feat_map (Dict[str, str]) – dictionary mapping of requested momentum components to the features in df

Return type

ndarray

Returns

np.array of invariant masses
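
The underlying kinematics can be sketched with NumPy: each particle's energy is rebuilt from its mass and 3-momentum, and the pair mass follows from M^2 = (E0 + E1)^2 - |p0 + p1|^2. The function pair_invariant_mass is hypothetical and takes (n, 3) momentum arrays directly, whereas calc_pair_mass reads the components from df via feat_map.

```python
import numpy as np

def pair_invariant_mass(p0, p1, m0, m1):
    """Invariant mass of particle pairs.

    p0, p1: arrays of shape (n, 3) holding (px, py, pz).
    m0, m1: masses, either scalars or arrays of length n.
    """
    e0 = np.sqrt(m0**2 + np.sum(p0**2, axis=1))
    e1 = np.sqrt(m1**2 + np.sum(p1**2, axis=1))
    p_sum_sq = np.sum((p0 + p1) ** 2, axis=1)
    m_sq = (e0 + e1) ** 2 - p_sum_sq
    # Clip tiny negative values caused by floating-point rounding
    return np.sqrt(np.clip(m_sq, 0.0, None))

p0 = np.array([[50.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
p1 = np.array([[-50.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
pair_mass = pair_invariant_mass(p0, p1, np.array([0.0, 1.0]), np.array([0.0, 2.0]))
```

The first pair is two massless back-to-back particles; the second is two massive particles at rest, where the pair mass is just the sum of the masses.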

lumin.data_processing.pre_proc module

lumin.data_processing.pre_proc.get_pre_proc_pipes(norm_in=True, norm_out=False, pca=False, whiten=False, with_mean=True, with_std=True, n_components=None)[source]

Configure SKLearn Pipelines for processing inputs and targets with the requested transformations.

Parameters
  • norm_in (bool) – whether to apply StandardScaler to inputs

  • norm_out (bool) – whether to apply StandardScaler to outputs

  • pca (bool) – whether to apply PCA to inputs. Performed prior to StandardScaler. No dimensionality reduction is applied, purely rotation.

  • whiten (bool) – whether PCA should whiten inputs.

  • with_mean (bool) – whether StandardScalers should shift means to 0

  • with_std (bool) – whether StandardScalers should scale standard deviations to 1

  • n_components (Optional[int]) – if set, causes PCA to reduce the dimensionality of the input data

Return type

Tuple[Pipeline, Pipeline]

Returns

Pipeline for input data, Pipeline for target data
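
To make the with_mean and with_std options concrete, the following NumPy sketch mimics what a standard scaling step does to each column. It is a plain stand-in written for illustration, not the scikit-learn StandardScaler itself, and the name standardise is hypothetical.

```python
import numpy as np

def standardise(x, with_mean=True, with_std=True):
    """Column-wise standardisation of a 2D array (rows are samples)."""
    x = np.asarray(x, dtype=float)
    if with_mean:
        x = x - x.mean(axis=0)      # shift each column to mean 0
    if with_std:
        std = x.std(axis=0)         # population standard deviation
        std[std == 0] = 1.0         # leave constant columns unscaled
        x = x / std
    return x

x = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
scaled = standardise(x)
centred = standardise(x, with_std=False)
```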

lumin.data_processing.pre_proc.fit_input_pipe(df, cont_feats, savename=None, input_pipe=None, norm_in=True, pca=False, whiten=False, with_mean=True, with_std=True, n_components=None)[source]

Fit input pipeline to continuous features and optionally save.

Parameters
  • df (DataFrame) – DataFrame with data to fit pipeline

  • cont_feats (Union[str, List[str]]) – (list of) column(s) to use as input data for fitting

  • savename (Optional[str]) – if set, will save the fitted Pipeline with that name as a Pickle file (.pkl extension added automatically)

  • input_pipe (Optional[Pipeline]) – if set, will fit this Pipeline; otherwise will instantiate a new Pipeline

  • norm_in (bool) – whether to apply StandardScaler to inputs. Only used if input_pipe is not set.

  • pca (bool) – whether to apply PCA to inputs. Performed prior to StandardScaler. No dimensionality reduction is applied, purely rotation. Only used if input_pipe is not set.

  • whiten (bool) – whether PCA should whiten inputs. Only used if input_pipe is not set.

  • with_mean (bool) – whether StandardScalers should shift means to 0. Only used if input_pipe is not set.

  • with_std (bool) – whether StandardScalers should scale standard deviations to 1. Only used if input_pipe is not set.

  • n_components (Optional[int]) – if set, causes PCA to reduce the dimensionality of the input data. Only used if input_pipe is not set.

Return type

Pipeline

Returns

Fitted Pipeline

lumin.data_processing.pre_proc.fit_output_pipe(df, targ_feats, savename=None, output_pipe=None, norm_out=True)[source]

Fit output pipeline to target features and optionally save. Have you thought about using a y_range for regression instead?

Parameters
  • df (DataFrame) – DataFrame with data to fit pipeline

  • targ_feats (Union[str, List[str]]) – (list of) column(s) to use as input data for fitting

  • savename (Optional[str]) – if set, will save the fitted Pipeline with that name as a Pickle file (.pkl extension added automatically)

  • output_pipe (Optional[Pipeline]) – if set, will fit this Pipeline; otherwise will instantiate a new Pipeline

  • norm_out (bool) – whether to apply StandardScaler to outputs. Only used if output_pipe is not set.

Return type

Pipeline

Returns

Fitted Pipeline

lumin.data_processing.pre_proc.proc_cats(train_df, cat_feats, val_df=None, test_df=None)[source]

Process categorical features in train_df to be valued 0->cardinality-1. Applied in place. Applies the same transformation to validation and testing data if passed. Will complain if validation or testing sets contain categories which are not present in the training data.

Parameters
  • train_df (DataFrame) – DataFrame with the training data, which will also be used to specify all the categories to consider

  • cat_feats (List[str]) – list of columns to use as categorical features

  • val_df (Optional[DataFrame]) – if set, will apply the same category-to-code mapping to the validation data as was performed on the training data

  • test_df (Optional[DataFrame]) – if set, will apply the same category-to-code mapping to the testing data as was performed on the training data

Return type

Tuple[OrderedDict, OrderedDict]

Returns

ordered dictionary mapping categorical features to dictionaries mapping categories to codes, and ordered dictionary mapping categorical features to their cardinalities
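
The category-to-code idea can be sketched in plain Python on lists of dictionaries instead of DataFrames. Both helpers below are hypothetical and only mirror the behaviour described above (codes fixed by the training data, an error for unseen categories); they are not LUMIN code, and the sorted ordering of categories is an assumption.

```python
from collections import OrderedDict

def build_cat_maps(train_rows, cat_feats):
    """Build category->code maps and cardinalities from training rows."""
    cat_maps, cat_szs = OrderedDict(), OrderedDict()
    for feat in cat_feats:
        categories = sorted({row[feat] for row in train_rows})
        cat_maps[feat] = {cat: code for code, cat in enumerate(categories)}
        cat_szs[feat] = len(categories)
    return cat_maps, cat_szs

def apply_cat_maps(rows, cat_maps):
    """Replace categories with codes in place; fail on unseen categories."""
    for row in rows:
        for feat, mapping in cat_maps.items():
            if row[feat] not in mapping:
                raise ValueError(f"Unseen category {row[feat]!r} in {feat}")
            row[feat] = mapping[row[feat]]

train_rows = [{"channel": "mu"}, {"channel": "e"}, {"channel": "mu"}]
cat_maps, cat_szs = build_cat_maps(train_rows, ["channel"])
apply_cat_maps(train_rows, cat_maps)
```

Validation or testing rows would be passed through apply_cat_maps with the same cat_maps, so a category absent from training raises an error instead of silently receiving a new code.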

Module contents
