lumin.data_processing package
Submodules
lumin.data_processing.file_proc module
- lumin.data_processing.file_proc.add_meta_data(out_file, feats, cont_feats, cat_feats, cat_maps, targ_feats, wgt_feat=None, matrix_vecs=None, matrix_feats_per_vec=None, matrix_row_wise=None, tensor_name=None, tensor_shp=None, tensor_is_sparse=False, target_tensor_shp=None, tensor_target_is_sparse=False)
Adds meta data to foldfile containing information about the data: feature names, matrix information, etc. FoldYielder objects will access this and automatically extract it to save the user from having to manually pass lists of features.
- Parameters:
out_file (File) – h5py file to save data in
feats (List[str]) – list of all features in data
cont_feats (List[str]) – list of continuous features
cat_feats (List[str]) – list of categorical features
cat_maps (Optional[Dict[str, Dict[int, Any]]]) – dictionary mapping categorical features to dictionaries mapping codes to categories
targ_feats (Union[str, List[str]]) – (list of) target feature(s)
wgt_feat (Optional[str]) – name of weight feature
matrix_vecs (Optional[List[str]]) – list of objects for matrix encoding, i.e. feature prefixes
matrix_feats_per_vec (Optional[List[str]]) – list of features per vector for matrix encoding, i.e. feature suffixes. Features listed but not present in df will be replaced with NaN.
matrix_row_wise (Optional[bool]) – whether objects encoded as a matrix should be encoded row-wise (i.e. all the features associated with an object are in their own row), or column-wise (i.e. all the features associated with an object are in their own column)
tensor_name (Optional[str]) – name used to refer to the tensor when displaying model information
tensor_shp (Optional[Tuple[int]]) – the shape of the tensor data (excluding batch dimension)
tensor_is_sparse (bool) – whether the tensor is sparse (COO format) and should be densified prior to use
target_tensor_shp (Optional[Tuple[int]]) – the shape of the target tensor data (excluding batch dimension)
tensor_target_is_sparse (bool) – whether the target tensor is sparse (COO format) and should be densified prior to use
- Return type:
None
- lumin.data_processing.file_proc.df2foldfile(df, n_folds, cont_feats, cat_feats, targ_feats, savename, targ_type, shuffle=True, strat_key=None, misc_feats=None, wgt_feat=None, cat_maps=None, matrix_vecs=None, matrix_feats_per_vec=None, matrix_row_wise=None, tensor_data=None, tensor_name=None, tensor_as_sparse=False, compression=None, tensor_target=None, tensor_target_as_sparse=False)
Convert dataframe into h5py file by splitting data into sub-folds to be accessed by a FoldYielder
- Parameters:
df (Optional[DataFrame]) – DataFrame from which to save data; can contain flat input features, weights, and targets, but is entirely optional if targets and inputs are passed as tensors
n_folds (int) – number of folds to split df into
cont_feats (List[str]) – list of columns in df to save as continuous variables
cat_feats (List[str]) – list of columns in df to save as discrete variables
targ_feats (Union[str, List[str]]) – (list of) column(s) in df to save as target feature(s)
savename (Union[Path, str]) – name of h5py file to create (.h5py extension not required)
targ_type (str) – type of target feature, e.g. int, ‘float32’
shuffle (bool) – if True, will shuffle data prior to splitting into folds; otherwise folds will be contiguous splits of the unshuffled data, useful e.g. for testing datasets
strat_key (Optional[str]) – column to use for stratified splitting
misc_feats (Optional[List[str]]) – any extra columns to save
wgt_feat (Optional[str]) – column to save as data weights
cat_maps (Optional[Dict[str, Dict[int, Any]]]) – dictionary mapping categorical features to dictionaries mapping codes to categories
matrix_vecs (Optional[List[str]]) – list of objects for matrix encoding, i.e. feature prefixes
matrix_feats_per_vec (Optional[List[str]]) – list of features per vector for matrix encoding, i.e. feature suffixes. Features listed but not present in df will be replaced with NaN.
matrix_row_wise (Optional[bool]) – whether objects encoded as a matrix should be encoded row-wise (i.e. all the features associated with an object are in their own row), or column-wise (i.e. all the features associated with an object are in their own column)
tensor_data (Optional[ndarray]) – data of higher order than a matrix can be passed directly as a numpy array, rather than being extracted and reshaped from the DataFrame. The array will be saved under matrix data, and this is incompatible with also setting matrix_vecs, matrix_feats_per_vec, and matrix_row_wise. The first dimension of the array must be compatible with the length of the data frame.
tensor_name (Optional[str]) – if tensor_data is set, then this is the name that will be added to the foldfile’s metadata.
tensor_as_sparse (bool) – set to True to store the matrix in sparse COO format. The format is coo_x = sparse.as_coo(x); m = np.vstack((coo_x.data, coo_x.coords)), where m is the tensor passed to tensor_data.
compression (Optional[str]) – optional compression argument for h5py, e.g. ‘lzf’
tensor_target (Optional[ndarray]) – optional encoding of multi-dimensional targets as a numpy array
tensor_target_as_sparse (bool) – set to True to store the matrix in sparse COO format. The format is coo_x = sparse.as_coo(x); m = np.vstack((coo_x.data, coo_x.coords)), where m is the tensor passed to tensor_target.
- Return type:
None
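The sparse layout quoted for tensor_as_sparse stacks the non-zero values on top of their coordinates. The sketch below reproduces that layout with NumPy alone, so it does not depend on the sparse package; the helper names to_coo_matrix and to_dense are hypothetical and are not part of lumin.

```python
import numpy as np

def to_coo_matrix(x: np.ndarray) -> np.ndarray:
    """Stack non-zero values above their coordinates: shape (1 + x.ndim, n_nonzero)."""
    coords = np.nonzero(x)   # tuple of x.ndim index arrays
    data = x[coords]         # non-zero values, shape (n_nonzero,)
    return np.vstack((data, np.array(coords)))

def to_dense(m: np.ndarray, shape: tuple) -> np.ndarray:
    """Invert to_coo_matrix: row 0 holds the values, the remaining rows the coordinates."""
    dense = np.zeros(shape, dtype=m.dtype)
    dense[tuple(m[1:].astype(int))] = m[0]
    return dense

# Hypothetical tensor with two non-zero entries
x = np.zeros((2, 3, 3))
x[0, 1, 2] = 1.5
x[1, 0, 0] = -2.0

m = to_coo_matrix(x)
x_rec = to_dense(m, x.shape)
```

Because the values and the integer coordinates share one array, the coordinates are stored as floats and must be cast back to integers before densifying.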
- lumin.data_processing.file_proc.fold2foldfile(df, out_file, fold_idx, cont_feats, cat_feats, targ_feats, targ_type, misc_feats=None, wgt_feat=None, matrix_lookup=None, matrix_missing=None, matrix_shape=None, tensor_data=None, tensor_target=None, compression=None, n_samples=None)
Save fold of data into an h5py Group
- Parameters:
df (Optional[DataFrame]) – DataFrame from which to save data; can contain flat input features, weights, and targets, but is entirely optional if targets and inputs are passed as tensors
out_file (File) – h5py file to save data in
fold_idx (int) – ID for the fold; used to name the h5py group according to ‘fold_{fold_idx}’
cont_feats (List[str]) – list of columns in df to save as continuous variables
cat_feats (List[str]) – list of columns in df to save as discrete variables
targ_feats (Union[str, List[str]]) – (list of) column(s) in df to save as target feature(s)
targ_type (Any) – type of target feature, e.g. int, ‘float32’
misc_feats (Optional[List[str]]) – any extra columns to save
wgt_feat (Optional[str]) – column to save as data weights
matrix_vecs – list of objects for matrix encoding, i.e. feature prefixes
matrix_feats_per_vec – list of features per vector for matrix encoding, i.e. feature suffixes. Features listed but not present in df will be replaced with NaN.
matrix_row_wise – whether objects encoded as a matrix should be encoded row-wise (i.e. all the features associated with an object are in their own row), or column-wise (i.e. all the features associated with an object are in their own column)
tensor_data (Optional[ndarray]) – data of higher order than a matrix can be passed directly as a numpy array, rather than being extracted and reshaped from the DataFrame. The array will be saved under matrix data, and this is incompatible with also setting matrix_lookup, matrix_missing, and matrix_shape. The first dimension of the array must be compatible with the length of the data frame.
tensor_target (Optional[ndarray]) – optional encoding of multi-dimensional targets as a numpy array
compression (Optional[str]) – optional compression argument for h5py, e.g. ‘lzf’
n_samples (Optional[int]) – if df is None, supply the number of samples in the fold; this cannot be determined otherwise, since tensor_data and tensor_target may be sparse
- Return type:
None
- lumin.data_processing.file_proc.save_to_grp(arr, grp, name, compression=None)
Save Numpy array as a dataset in an h5py Group
- Parameters:
arr (ndarray) – array to be saved
grp (Group) – group in which to save arr
name (str) – name of dataset to create
compression (Optional[str]) – optional compression argument for h5py, e.g. ‘lzf’
- Return type:
None
lumin.data_processing.hep_proc module
- lumin.data_processing.hep_proc.add_abs_mom(df, vec, z=True)
Vectorised computation of 3-momenta magnitude, adding new column in place. Currently only works for Cartesian vectors.
- Parameters:
df (DataFrame) – DataFrame to alter
vec (str) – column prefix of vector components, e.g. ‘muon’ for columns [‘muon_px’, ‘muon_py’, ‘muon_pz’]
z (bool) – whether to consider the z-component of the momenta
- Return type:
None
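As a minimal sketch of the quantity involved, the magnitude of a Cartesian 3-momentum is the Euclidean norm of its components, and leaving out the z-component gives the transverse magnitude. The function abs_mom below is a hypothetical stand-alone helper working on plain arrays, not the lumin function itself.

```python
import numpy as np

def abs_mom(px, py, pz=None):
    """|p| from Cartesian components; omit pz to get the transverse magnitude."""
    p2 = np.square(px) + np.square(py)
    if pz is not None:
        p2 = p2 + np.square(pz)
    return np.sqrt(p2)

# Hypothetical momenta for two particles
px = np.array([3.0, 1.0])
py = np.array([4.0, 2.0])
pz = np.array([12.0, 2.0])

p_full = abs_mom(px, py, pz)  # uses all three components
p_trans = abs_mom(px, py)     # ignores the z-component
```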
- lumin.data_processing.hep_proc.add_energy(df, vec)
Vectorised computation of energy of 4-vector, adding new column in place.
- Parameters:
df (DataFrame) – DataFrame to alter
vec (str) – column prefix of vector components, e.g. ‘muon’ for columns [‘muon_px’, ‘muon_py’, ‘muon_pz’]
- Return type:
None
- lumin.data_processing.hep_proc.add_mass(df, vec)
Vectorised computation of mass of 4-vector, adding new column in place.
- Parameters:
df (DataFrame) – DataFrame to alter
vec (str) – column prefix of vector components, e.g. ‘muon’ for columns [‘muon_px’, ‘muon_py’, ‘muon_pz’]
- Return type:
None
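Energy and mass of a 4-vector are linked by E² = m² + |p|² in natural units. The sketch below shows both directions of that relation on plain arrays; the helper names are hypothetical, and clipping negative m² to zero is an assumption made here to guard against rounding, not documented lumin behaviour.

```python
import numpy as np

def energy_from_mass(px, py, pz, mass):
    """E = sqrt(m^2 + |p|^2) in natural units."""
    return np.sqrt(mass**2 + px**2 + py**2 + pz**2)

def mass_from_energy(px, py, pz, energy):
    """m = sqrt(E^2 - |p|^2); slightly negative m^2 from rounding is clipped to 0 (assumption)."""
    m2 = energy**2 - (px**2 + py**2 + pz**2)
    return np.sqrt(np.clip(m2, 0.0, None))

# Hypothetical particles: one massive, one massless
px = np.array([3.0, 0.0])
py = np.array([0.0, 6.0])
pz = np.array([0.0, 8.0])
mass = np.array([4.0, 0.0])

e = energy_from_mass(px, py, pz, mass)
m_rec = mass_from_energy(px, py, pz, e)
```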
- lumin.data_processing.hep_proc.add_mt(df, vec, mpt_name='mpt')
Vectorised computation of transverse mass of 4-vector with respect to missing transverse momenta, adding new column in place. Currently only works for pT, eta, phi vectors.
- Parameters:
df (DataFrame) – DataFrame to alter
vec (str) – column prefix of vector components, e.g. ‘muon’ for columns [‘muon_px’, ‘muon_py’, ‘muon_pz’]
mpt_name (str) – column prefix of vector of missing transverse momenta components, e.g. ‘mpt’ for columns [‘mpt_pT’, ‘mpt_phi’]
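For orientation, a common definition of the transverse mass of a visible object and the missing transverse momentum, neglecting the object's mass, is mT = sqrt(2 · pT · mpT · (1 − cos Δφ)). The sketch below assumes that definition; it is not confirmed here to be the exact expression lumin uses, and transverse_mass is a hypothetical helper.

```python
import numpy as np

def transverse_mass(pt, phi, mpt, mpt_phi):
    """mT = sqrt(2 * pT * mpT * (1 - cos(dphi))), massless approximation (assumed definition)."""
    return np.sqrt(2.0 * pt * mpt * (1.0 - np.cos(phi - mpt_phi)))

# Hypothetical events: back-to-back, then collinear
pt = np.array([40.0, 40.0])
phi = np.array([0.0, 1.0])
mpt = np.array([40.0, 25.0])
mpt_phi = np.array([np.pi, 1.0])

mt = transverse_mass(pt, phi, mpt, mpt_phi)
```

The cosine is periodic, so the angular difference does not need to be wrapped into [-pi, pi] first.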
- lumin.data_processing.hep_proc.boost(ref_vec, boost_vec, df=None, rescale_boost=False)
Vectorised boosting of reference vectors along boosting vectors. N.B. Implementation adapted from ROOT (https://root.cern/)
- Parameters:
ref_vec – either (N,4) array of 4-momenta coordinates for starting vector, or prefix name for starting vector, i.e. columns should have names of the form [ref_vec]_px, etc.
boost_vec – either (N,4) array of 4-momenta coordinates for boosting vector, or prefix name for boosting vector, i.e. columns should have names of the form [boost_vec]_px, etc.
df (Optional[DataFrame]) – DataFrame with data
rescale_boost (bool) – whether to divide the boost vector by its energy
- Return type:
ndarray
- Returns:
(N,4) array of boosted vector in Cartesian coordinates
- lumin.data_processing.hep_proc.boost2cm(vec, df=None)
Vectorised computation of boosting vector required to boost a vector to its centre-of-mass frame
- Parameters:
vec (Union[ndarray, str]) – either (N,4) array of 4-momenta coordinates for starting vector, or prefix name for starting vector, i.e. columns should have names of the form [vec]_px, etc.
df (Optional[DataFrame]) – DataFrame with data, if supplying a string vec
- Return type:
ndarray
- Returns:
(N,3) array of boosting vector in Cartesian coordinates
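The two boosting functions above can be pictured with a small NumPy sketch. It assumes 4-momenta stored as (px, py, pz, E) rows, takes the centre-of-mass boost vector to be −p/E, and follows the standard Lorentz-boost formula in the form used by ROOT's TLorentzVector::Boost. The helper names are hypothetical and the code is an illustration, not lumin's implementation.

```python
import numpy as np

def boost_to_cm_vector(vec):
    """(N,4) array of (px, py, pz, E) -> (N,3) boost vector -p/E."""
    return -vec[:, :3] / vec[:, 3:]

def lorentz_boost(vec, beta):
    """Boost (N,4) 4-momenta by (N,3) velocity vectors beta (units of c)."""
    b2 = np.sum(beta**2, axis=1)
    gamma = 1.0 / np.sqrt(1.0 - b2)
    bp = np.sum(beta * vec[:, :3], axis=1)  # beta . p
    # (gamma - 1) / beta^2, defined as 0 where beta is zero
    gamma2 = np.where(b2 > 0, (gamma - 1.0) / np.where(b2 > 0, b2, 1.0), 0.0)
    p_new = vec[:, :3] + (gamma2 * bp + gamma * vec[:, 3])[:, None] * beta
    e_new = gamma * (vec[:, 3] + bp)
    return np.column_stack((p_new, e_new))

# Hypothetical particles: one moving along x with mass 4, one already at rest with mass 2
vec = np.array([[3.0, 0.0, 0.0, 5.0],
                [0.0, 0.0, 0.0, 2.0]])

beta = boost_to_cm_vector(vec)
boosted = lorentz_boost(vec, beta)  # each particle in its own rest frame
```

Boosting a particle by its own centre-of-mass vector leaves zero 3-momentum and an energy equal to its mass, which makes a convenient sanity check.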
- lumin.data_processing.hep_proc.calc_pair_mass(df, masses, feat_map)
Vectorised computation of invariant mass of a pair of particles with given masses, using 3-momenta. Only works for vectors defined in Cartesian coordinates.
- Arguments:
df – DataFrame with vector components
masses – tuple of masses of particles (either constant or different pair of masses per pair of particles)
feat_map – dictionary mapping of requested momentum components to the features in df
- Returns:
np.ndarray of invariant masses
- Return type:
ndarray
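The calculation behind a pair mass from 3-momenta can be sketched as follows: each energy is rebuilt from the assumed particle mass, and the invariant mass is M = sqrt((E0 + E1)² − |p0 + p1|²). The function pair_mass is a hypothetical array-based helper and does not use the feat_map interface described above; clipping negative M² to zero is an assumption added for numerical safety.

```python
import numpy as np

def pair_mass(p0, p1, m0, m1):
    """Invariant mass of particle pairs from (N,3) Cartesian momenta and masses."""
    e0 = np.sqrt(m0**2 + np.sum(p0**2, axis=1))
    e1 = np.sqrt(m1**2 + np.sum(p1**2, axis=1))
    m2 = (e0 + e1)**2 - np.sum((p0 + p1)**2, axis=1)
    return np.sqrt(np.clip(m2, 0.0, None))  # clip guards against rounding (assumption)

# Hypothetical massless pairs: back-to-back, then collinear
p0 = np.array([[10.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
p1 = np.array([[-10.0, 0.0, 0.0], [10.0, 0.0, 0.0]])

masses = pair_mass(p0, p1, 0.0, 0.0)
```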
- lumin.data_processing.hep_proc.cos_delta(vec_0, vec_1, df=None, name=None, inplace=False)
Vectorised computation of the cosine of the angular separation of vec_1 from vec_0. If vec_* are strings, then columns are extracted from DataFrame df. If inplace is True, the cosine of the angle is added as a new column to the DataFrame with name cosdelta_[vec_0]_[vec_1] or cosdelta, unless name is set.
- Parameters:
vec_0 (Union[ndarray, str]) – either (N,3) array of 3-momenta coordinates for vector 0, or prefix name for vector zero, i.e. columns should have names of the form [vec_0]_px, etc.
vec_1 (Union[ndarray, str]) – either (N,3) array of 3-momenta coordinates for vector 1, or prefix name for vector one, i.e. columns should have names of the form [vec_1]_px, etc.
df (Optional[DataFrame]) – DataFrame with data
name (Optional[str]) – if set, will create a new column in df for cosdelta with given name, otherwise will generate a name
inplace (bool) – if True will add new column to df, otherwise will return array of cos_deltas
- Return type:
Optional[ndarray]
- Returns:
array of cos deltas if not inplace
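The cosine of the opening angle between two 3-vectors is their dot product divided by the product of their magnitudes. A minimal array-based sketch, with a hypothetical helper name and no DataFrame handling, might look like this:

```python
import numpy as np

def cos_opening_angle(v0, v1):
    """cos(angle) between rows of two (N,3) arrays: v0.v1 / (|v0| |v1|)."""
    dot = np.sum(v0 * v1, axis=1)
    norms = np.linalg.norm(v0, axis=1) * np.linalg.norm(v1, axis=1)
    return dot / norms

# Hypothetical vectors: perpendicular, parallel, anti-parallel
v0 = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
v1 = np.array([[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [-3.0, 0.0, 0.0]])

cosines = cos_opening_angle(v0, v1)
```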
- lumin.data_processing.hep_proc.delta_phi(arr_a, arr_b)
Vectorised computation of modulo 2pi angular separation of array of angles b from array of angles a, in range [-pi,pi]
- Parameters:
arr_a (Union[float, ndarray]) – reference angles
arr_b (Union[float, ndarray]) – final angles
- Return type:
Union[float, ndarray]
- Returns:
angular separation as float or np.ndarray
- lumin.data_processing.hep_proc.delta_r(dphi, deta)
Vectorised computation of delta R separation for arrays of delta phi and delta eta (rapidity or pseudorapidity)
- Parameters:
dphi (Union[float, ndarray]) – delta phi separations
deta (Union[float, ndarray]) – delta eta separations
- Return type:
Union[float, ndarray]
- Returns:
delta R separation as float or np.ndarray
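The two separations above can be sketched together: the azimuthal difference is wrapped back into one period, and delta R combines it in quadrature with the eta difference. The helper names are hypothetical, and the modulo trick used here returns values in [-pi, pi), which differs from a closed interval only at the endpoint.

```python
import numpy as np

def wrapped_delta_phi(phi_a, phi_b):
    """Angular separation of phi_b from phi_a, wrapped into [-pi, pi)."""
    return (np.asarray(phi_b) - np.asarray(phi_a) + np.pi) % (2 * np.pi) - np.pi

def delta_r_from(dphi, deta):
    """Delta R = sqrt(dphi^2 + deta^2)."""
    return np.sqrt(np.square(dphi) + np.square(deta))

dphi_wrapped = wrapped_delta_phi(3.0, -3.0)  # crosses the +-pi boundary
dphi_plain = wrapped_delta_phi(0.0, 1.0)     # no wrapping needed
dr = delta_r_from(3.0, 4.0)
```

Without the wrapping, the first case would give −6 rad rather than the short way round of roughly 0.28 rad.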
- lumin.data_processing.hep_proc.delta_r_boosted(vec_0, vec_1, ref_vec, df=None, name=None, inplace=False)
Vectorised computation of the deltaR separation of vec_1 from vec_0 in the rest-frame of another vector. If vec_* are strings, then columns are extracted from DataFrame df. If inplace is True, deltaR is added as a new column to the DataFrame with name dR_[vec_0]_[vec_1]_boosted_[ref_vec] or dR_boosted, unless name is set.
- Parameters:
vec_0 (Union[ndarray, str]) – either (N,4) array of 4-momenta coordinates for vector 0, in Cartesian coordinates, or prefix name for vector zero, i.e. columns should have names of the form [vec_0]_px, etc.
vec_1 (Union[ndarray, str]) – either (N,4) array of 4-momenta coordinates for vector 1, in Cartesian coordinates, or prefix name for vector one, i.e. columns should have names of the form [vec_1]_px, etc.
ref_vec (Union[ndarray, str]) – either (N,4) array of 4-momenta coordinates for the vector in whose rest-frame deltaR should be computed, in Cartesian coordinates, or prefix name for reference vector, i.e. columns should have names of the form [ref_vec]_px, etc.
df (Optional[DataFrame]) – DataFrame with data
name (Optional[str]) – if set, will create a new column in df for the boosted deltaR with given name, otherwise will generate a name
inplace (bool) – if True will add new column to df, otherwise will return array of boosted deltaR values
- Return type:
Optional[ndarray]
- Returns:
array of boosted deltaR if not inplace
- lumin.data_processing.hep_proc.event_to_cartesian(df, drop=False, ignore=None)
Convert entire event to Cartesian coordinates, except vectors listed in ignore. Optionally, drop old pT,eta,phi features. Performed inplace.
- Parameters:
df (DataFrame) – DataFrame to alter
drop (bool) – whether to drop old coordinates
ignore (Optional[List[str]]) – vectors to ignore when converting
- Return type:
None
- lumin.data_processing.hep_proc.fix_event_phi(df, ref_vec)
Rotate event in phi such that ref_vec is at phi == 0. Performed inplace. Currently only works on vectors defined in pT, eta, phi
- Parameters:
df (DataFrame) – DataFrame to alter
ref_vec (str) – column prefix of vector components to use as reference, e.g. ‘muon’ for columns [‘muon_pT’, ‘muon_eta’, ‘muon_phi’]
- Return type:
None
- lumin.data_processing.hep_proc.fix_event_y(df, ref_vec_0, ref_vec_1)
Flip event in y-axis such that ref_vec_1 has a higher py than ref_vec_0. Performed in place. Works for both pT, eta, phi and Cartesian coordinates.
- Parameters:
df (DataFrame) – DataFrame to alter
ref_vec_0 (str) – column prefix of vector components to use as reference 0, e.g. ‘muon’ for columns [‘muon_pT’, ‘muon_eta’, ‘muon_phi’]
ref_vec_1 (str) – column prefix of vector components to use as reference 1, e.g. ‘muon’ for columns [‘muon_pT’, ‘muon_eta’, ‘muon_phi’]
- Return type:
None
- lumin.data_processing.hep_proc.fix_event_z(df, ref_vec)
Flip event in z-axis such that ref_vec is in positive z-direction. Performed inplace. Works for both pT, eta, phi and Cartesian coordinates.
- Parameters:
df (DataFrame) – DataFrame to alter
ref_vec (str) – column prefix of vector components to use as reference, e.g. ‘muon’ for columns [‘muon_pT’, ‘muon_eta’, ‘muon_phi’]
- Return type:
None
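The three symmetry operations above can be illustrated on a toy event stored as a dictionary of arrays in (eta, phi). This is a sketch under simplifying assumptions: the helper names are hypothetical, no DataFrame is involved, and the y-flip assumes the phi rotation has already placed reference vector 0 at phi == 0, so that its py is zero and the flip reduces to requiring phi >= 0 for reference vector 1.

```python
import numpy as np

def wrap(phi):
    """Wrap angles into [-pi, pi)."""
    return (phi + np.pi) % (2 * np.pi) - np.pi

def fix_phi(event, ref):
    """Rotate every vector so that the reference sits at phi == 0."""
    ref_phi = event[ref]['phi'].copy()
    for v in event.values():
        v['phi'] = wrap(v['phi'] - ref_phi)

def fix_y(event, ref_1):
    """Mirror phi -> -phi in events where ref_1 points to negative y (assumes fix_phi ran first)."""
    flip = event[ref_1]['phi'] < 0
    for v in event.values():
        v['phi'] = np.where(flip, -v['phi'], v['phi'])

def fix_z(event, ref):
    """Mirror eta -> -eta in events where the reference points to negative z."""
    flip = event[ref]['eta'] < 0
    for v in event.values():
        v['eta'] = np.where(flip, -v['eta'], v['eta'])

# Hypothetical two-event sample with a muon and a jet
event = {'muon': {'eta': np.array([-1.2, 0.5]), 'phi': np.array([2.0, -1.0])},
         'jet':  {'eta': np.array([0.3, -0.7]), 'phi': np.array([-2.5, -2.0])}}

fix_phi(event, 'muon')
fix_y(event, 'jet')
fix_z(event, 'muon')
```

Each flip is decided per event from the reference vector and then applied to all vectors in that event, so relative orientations within an event are preserved.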
- lumin.data_processing.hep_proc.get_momentum(df, vec, include_E=False, as_cart=False)
Extracts array of 3- or 4-momenta coordinates from DataFrame columns
- Parameters:
df (DataFrame) – DataFrame with data
vec (str) – prefix name for vector, i.e. columns should have names of the form [vec]_px, etc.
as_cart (bool) – if True will return momenta in Cartesian coordinates
- Returns:
(N, 3|4) array with columns (px, py, pz, (E)) or (pT, phi, eta, (E))
- lumin.data_processing.hep_proc.get_vecs(feats, strict=True)
Filter list of features to get list of 3-momenta defined in the list. Works for both pT, eta, phi and Cartesian coordinates. If strict, return only vectors with all coordinates present in feature list.
- Parameters:
feats (List[str]) – list of features to filter
strict (bool) – whether to require all 3-momenta components to be present in the list
- Return type:
Set[str]
- Returns:
set of unique 3-momenta prefixes
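One way such a filter could work is sketched below in plain Python: each feature name is split at its last underscore, and prefixes are collected according to which momentum suffixes they carry. The function get_vec_prefixes and its case-insensitive suffix matching are assumptions for illustration, not lumin's actual logic.

```python
def get_vec_prefixes(feats, strict=True):
    """Return prefixes that carry Cartesian or (pT, eta, phi) momentum suffixes."""
    suffix_sets = [{'px', 'py', 'pz'}, {'pt', 'eta', 'phi'}]
    known = set().union(*suffix_sets)
    found = {}
    for feat in feats:
        prefix, sep, suffix = feat.rpartition('_')
        if sep and suffix.lower() in known:
            found.setdefault(prefix, set()).add(suffix.lower())
    if not strict:
        return set(found)
    # strict: keep only prefixes with one complete coordinate set
    return {p for p, s in found.items() if any(ss <= s for ss in suffix_sets)}

# Hypothetical feature list
feats = ['muon_px', 'muon_py', 'muon_pz', 'jet_pT', 'jet_eta',
         'met_pT', 'met_phi', 'n_jets']

strict_vecs = get_vec_prefixes(feats)
loose_vecs = get_vec_prefixes(feats, strict=False)
```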
- lumin.data_processing.hep_proc.proc_event(df, fix_phi=False, fix_y=False, fix_z=False, use_cartesian=False, ref_vec_0=None, ref_vec_1=None, keep_feats=None, default_vals=None)
Process event: pass data through various in-place conversions and drop unneeded columns. Data expected to consist of vectors defined in pT, eta, phi.
- Parameters:
df (DataFrame) – DataFrame to alter
fix_phi (bool) – whether to rotate events using fix_event_phi()
fix_y – whether to flip events using fix_event_y()
fix_z – whether to flip events using fix_event_z()
use_cartesian – whether to convert vectors to Cartesian coordinates
ref_vec_0 (Optional[str]) – column prefix of vector components to use as reference (0) for fix_event_phi(), fix_event_y(), and fix_event_z(), e.g. ‘muon’ for columns [‘muon_pT’, ‘muon_eta’, ‘muon_phi’]
ref_vec_1 (Optional[str]) – column prefix of vector components to use as reference (1) for fix_event_y(), e.g. ‘muon’ for columns [‘muon_pT’, ‘muon_eta’, ‘muon_phi’]
keep_feats (Optional[List[str]]) – columns to keep which would otherwise be dropped
default_vals (Optional[List[str]]) – list of default values which might be used to represent missing vector components. These will be replaced with np.nan.
- Return type:
None
- lumin.data_processing.hep_proc.to_cartesian(df, vec, drop=False)
Vectorised conversion of 3-momenta to Cartesian coordinates inplace, optionally dropping old pT,eta,phi features
- Parameters:
df (DataFrame) – DataFrame to alter
vec (str) – column prefix of vector components to alter, e.g. ‘muon’ for columns [‘muon_pt’, ‘muon_phi’, ‘muon_eta’]
drop (bool) – whether to remove original columns and just keep the new ones
- Return type:
None
- lumin.data_processing.hep_proc.to_pt_eta_phi(df, vec, drop=False)
Vectorised conversion of 3-momenta to pT,eta,phi coordinates inplace, optionally dropping old px,py,pz features
- Parameters:
df (DataFrame) – DataFrame to alter
vec (str) – column prefix of vector components to alter, e.g. ‘muon’ for columns [‘muon_px’, ‘muon_py’, ‘muon_pz’]
drop (bool) – whether to remove original columns and just keep the new ones
- Return type:
None
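The two coordinate conversions above rest on the standard collider relations px = pT·cos(phi), py = pT·sin(phi), pz = pT·sinh(eta) and their inverses. The sketch below applies them to plain arrays; the function names are hypothetical and nothing is written into a DataFrame.

```python
import numpy as np

def cartesian_from_pt_eta_phi(pt, eta, phi):
    """(pT, eta, phi) -> (px, py, pz)."""
    return pt * np.cos(phi), pt * np.sin(phi), pt * np.sinh(eta)

def pt_eta_phi_from_cartesian(px, py, pz):
    """(px, py, pz) -> (pT, eta, phi); assumes pT > 0."""
    pt = np.hypot(px, py)
    return pt, np.arcsinh(pz / pt), np.arctan2(py, px)

# Hypothetical vectors
pt = np.array([10.0, 20.0])
eta = np.array([0.5, -1.2])
phi = np.array([1.0, -2.5])

px, py, pz = cartesian_from_pt_eta_phi(pt, eta, phi)
pt_rec, eta_rec, phi_rec = pt_eta_phi_from_cartesian(px, py, pz)
```

A round trip recovers the inputs as long as phi lies in (-pi, pi], which is the range returned by arctan2.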
- lumin.data_processing.hep_proc.twist(dphi, deta)
Vectorised computation of twist between vectors (https://arxiv.org/abs/1010.3698)
- Parameters:
dphi (Union[float, ndarray]) – delta phi separations
deta (Union[float, ndarray]) – delta eta separations
- Return type:
Union[float, ndarray]
- Returns:
angular separation as float or np.ndarray
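As a rough illustration, and assuming the twist is defined as arctan(|Δφ/Δη|), the quantity can be computed as below. Both the definition and the helper name twist_angle are assumptions here; the linked paper should be consulted for the authoritative definition.

```python
import numpy as np

def twist_angle(dphi, deta):
    """Twist as arctan(|dphi / deta|) (assumed definition); ranges from 0 to pi/2."""
    return np.arctan(np.abs(np.asarray(dphi) / np.asarray(deta)))

# Hypothetical separations
dphi = np.array([1.0, -2.0, 0.0])
deta = np.array([1.0, 2.0, 0.5])

tw = twist_angle(dphi, deta)
```

Under this definition a pair separated purely in eta has a twist of 0, while a pair separated mostly in phi approaches pi/2.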
lumin.data_processing.pre_proc module
- lumin.data_processing.pre_proc.fit_input_pipe(df, cont_feats, savename=None, input_pipe=None, norm_in=True, pca=False, whiten=False, with_mean=True, with_std=True, n_components=None)
Fit input pipeline to continuous features and optionally save.
- Parameters:
df (DataFrame) – DataFrame with data to fit pipeline
cont_feats (Union[str, List[str]]) – (list of) column(s) to use as input data for fitting
savename (Optional[str]) – if set, will save the fitted Pipeline under that name as a Pickle (.pkl extension added automatically)
input_pipe (Optional[Pipeline]) – if set, this Pipeline will be fitted; otherwise a new Pipeline will be instantiated
norm_in (bool) – whether to apply StandardScaler to inputs. Only used if input_pipe is not set.
pca (bool) – whether to apply PCA to inputs. Performed prior to StandardScaler. No dimensionality reduction is applied, purely rotation. Only used if input_pipe is not set.
whiten (bool) – whether PCA should whiten inputs. Only used if input_pipe is not set.
with_mean (bool) – whether StandardScalers should shift means to 0. Only used if input_pipe is not set.
with_std (bool) – whether StandardScalers should scale standard deviations to 1. Only used if input_pipe is not set.
n_components (Optional[int]) – if set, causes PCA to reduce the dimensionality of the input data. Only used if input_pipe is not set.
- Return type:
Pipeline
- Returns:
Fitted Pipeline
- lumin.data_processing.pre_proc.fit_output_pipe(df, targ_feats, savename=None, output_pipe=None, norm_out=True)
Fit output pipeline to target features and optionally save. Have you thought about using a y_range for regression instead?
- Parameters:
df (DataFrame) – DataFrame with data to fit pipeline
targ_feats (Union[str, List[str]]) – (list of) column(s) to use as input data for fitting
savename (Optional[str]) – if set, will save the fitted Pipeline under that name as a Pickle (.pkl extension added automatically)
output_pipe (Optional[Pipeline]) – if set, this Pipeline will be fitted; otherwise a new Pipeline will be instantiated
norm_out (bool) – whether to apply StandardScaler to outputs. Only used if output_pipe is not set.
- Return type:
Pipeline
- Returns:
Fitted Pipeline
- lumin.data_processing.pre_proc.get_pre_proc_pipes(norm_in=True, norm_out=False, pca=False, whiten=False, with_mean=True, with_std=True, n_components=None)
Configure SKLearn Pipelines for processing inputs and targets with the requested transformations.
- Parameters:
norm_in (bool) – whether to apply StandardScaler to inputs
norm_out (bool) – whether to apply StandardScaler to outputs
pca (bool) – whether to apply PCA to inputs. Performed prior to StandardScaler. No dimensionality reduction is applied, purely rotation.
whiten (bool) – whether PCA should whiten inputs.
with_mean (bool) – whether StandardScalers should shift means to 0
with_std (bool) – whether StandardScalers should scale standard deviations to 1
n_components (Optional[int]) – if set, causes PCA to reduce the dimensionality of the input data
- Return type:
Tuple[Pipeline, Pipeline]
- Returns:
Pipeline for input data
Pipeline for target data
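To make the with_mean and with_std options concrete, the sketch below reproduces the arithmetic of a standard-scaling step with NumPy alone: each column is optionally shifted to zero mean and optionally scaled to unit standard deviation. The function standardise is a hypothetical stand-in for illustration; it is not the scikit-learn StandardScaler, and it does not handle constant columns.

```python
import numpy as np

def standardise(x, with_mean=True, with_std=True):
    """Column-wise (x - mean) / std, with either step optional."""
    mean = x.mean(axis=0) if with_mean else np.zeros(x.shape[1])
    std = x.std(axis=0) if with_std else np.ones(x.shape[1])  # population std (ddof=0)
    return (x - mean) / std, mean, std

# Hypothetical continuous features on very different scales
x = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

x_std, mean, std = standardise(x)
x_scaled_only, _, _ = standardise(x, with_mean=False)
```

The fitted mean and std are returned alongside the transformed data because the same values must be reused when transforming validation and testing data.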
- lumin.data_processing.pre_proc.proc_cats(train_df, cat_feats, val_df=None, test_df=None)
Process categorical features in train_df to be valued 0->cardinality-1. Applied inplace. Applies the same transformation to validation and testing data if passed. Will complain if validation or testing sets contain categories which are not present in the training data.
- Parameters:
train_df (DataFrame) – DataFrame with the training data, which will also be used to specify all the categories to consider
cat_feats (List[str]) – list of columns to use as categorical features
val_df (Optional[DataFrame]) – if set will apply the same category to code mapping to the validation data as was performed on the training data
test_df (Optional[DataFrame]) – if set will apply the same category to code mapping to the testing data as was performed on the training data
- Return type:
Tuple[OrderedDict, OrderedDict]
- Returns:
ordered dictionary mapping categorical features to dictionaries mapping codes to categories
ordered dictionary mapping categorical features to their cardinalities
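The encoding described above can be sketched in plain Python, using dictionaries of lists in place of DataFrames. The function encode_categories is hypothetical, and assigning codes in sorted category order and raising ValueError on unseen categories are assumptions made for this illustration.

```python
from collections import OrderedDict

def encode_categories(train, cat_feats, others=()):
    """Replace categories by integer codes in place; return (code->category maps, cardinalities)."""
    cat_maps, cat_szs = OrderedDict(), OrderedDict()
    for feat in cat_feats:
        categories = sorted(set(train[feat]))  # training data defines the categories
        code_of = {c: i for i, c in enumerate(categories)}
        cat_maps[feat] = {i: c for c, i in code_of.items()}
        cat_szs[feat] = len(categories)
        for data in (train, *others):
            unseen = set(data[feat]) - set(code_of)
            if unseen:
                raise ValueError(f'{feat}: categories {unseen} not present in training data')
            data[feat] = [code_of[v] for v in data[feat]]
    return cat_maps, cat_szs

# Hypothetical training and validation data
train = {'channel': ['mu', 'e', 'tau', 'e']}
val = {'channel': ['e', 'mu']}

cat_maps, cat_szs = encode_categories(train, ['channel'], others=(val,))
```

Keeping the code-to-category map makes the encoding reversible, and the cardinalities are the sizes an embedding layer would need for each categorical feature.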