analyze_electrophero

Main functions for electropherogram analysis.

Author: Anja Hess

Date: 2025-AUG-06

Attributes

`script_path`	Local directory of DNAvi analyze_electrophero module
`maindir`	Local directory of DNAvi (MAIN)

Functions

`peak2basepairs`(df, qc_save_dir[, y_label, x_label, ...])	Function to infer ladder peaks from the signal table and annotate those to base pair positions with the user-provided ladder-file.
`split_and_long_by_ladder`(df)	Handles multiple ladder types in one input dataframe while converting the data into the long format required for plotting. The base pair position for each set of DNA samples is assigned as defined by the previous marker interpolation.
`parse_meta_to_long`(df, metafile[, sample_col, ...])	Function to parse the user-provided metadata and transfer to long format
`remove_marker_from_df`(df[, peak_dict, on, ...])	Function to remove marker from dataframe including a halo, meaning a defined number of base pairs around the marker band specified in the constants module
`nuc_fractions`(df[, unit, size_unit, nuc_dict])	Estimate nucleosomal fractions (percentages) of a sample's cfDNA based on pre-defined base pair ranges.
`run_stats`(df[, variable, category, paired, alpha, ...])	Function to perform statistical tests (parametric or non-parametric).
`marker_and_normalize`(df[, peak_dict, include_marker, ...])	Function to normalize the raw DNA fluorescence intensity to a value between 0 and 1.
`epg_stats`(df[, save_dir, unit, size_unit, ...])	Compute and output basic statistics for DNA size distributions
`epg_analysis`(path_to_file, path_to_ladder, path_to_meta)	Core function to analyze DNA distribution from a signal table.

Module Contents

analyze_electrophero.script_path: Local directory of DNAvi analyze_electrophero module

analyze_electrophero.maindir: Local directory of DNAvi (MAIN)

analyze_electrophero.peak2basepairs(df, qc_save_dir, y_label=YLABEL, x_label=XLABEL, ladder_dir='', ladder_type='custom', marker_lane=0)

Function to infer ladder peaks from the signal table and annotate those to base pair positions with the user-provided ladder-file.

Parameters:

df – pandas dataframe
qc_save_dir – directory to save qc results
y_label – str, new name for the signal intensity values
x_label – str, new name for the position values
ladder_dir – str, path to where the ladder is located
ladder_type – str, ladder type; when set to “custom”, the minimum peak height can be adjusted in the constants module.

Returns:

a dictionary annotating each peak to a base pair position
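The idea of matching ladder peaks to known fragment sizes can be illustrated with a minimal sketch. It does not reproduce DNAvi's implementation: the helper `find_local_maxima`, the synthetic signal, the ladder sizes, and the use of linear interpolation between ladder peaks are all assumptions made for illustration.

```python
import numpy as np

def find_local_maxima(signal, min_height):
    """Return indices that reach min_height and exceed both neighbours."""
    signal = np.asarray(signal, dtype=float)
    indices = []
    for i in range(1, len(signal) - 1):
        is_peak = signal[i] > signal[i - 1] and signal[i] > signal[i + 1]
        if is_peak and signal[i] >= min_height:
            indices.append(i)
    return indices

# Synthetic ladder lane with three sharp peaks (hypothetical data)
ladder_signal = np.zeros(100)
ladder_signal[[10, 40, 80]] = [5.0, 8.0, 6.0]

# Hypothetical fragment sizes, listed in the order the peaks appear
ladder_bp = [100, 500, 1000]

peak_positions = find_local_maxima(ladder_signal, min_height=1.0)

# Annotate each detected peak position with its base pair size
peak_dict = dict(zip(peak_positions, ladder_bp))

# Assumed linear interpolation for a position between two ladder peaks
bp_at_25 = np.interp(25, peak_positions, ladder_bp)
```

With a mapping of this kind, any position in a sample lane can be converted to an approximate fragment size, which is what the later long-format and statistics steps rely on.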

analyze_electrophero.split_and_long_by_ladder(df)

Handles multiple ladder types in one input dataframe while converting the data into the long format required for plotting. The base pair position for each set of DNA samples is assigned as defined by the previous marker interpolation.

Parameters:

df – pandas.DataFrame (wide)

Returns:

pandas.DataFrame (long)
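As a rough illustration of the wide-to-long conversion, the sketch below uses `pandas.DataFrame.melt`. The table layout (one `bp_pos` column plus one column per sample) and the column names are assumptions, not the layout DNAvi necessarily uses internally.

```python
import pandas as pd

# Hypothetical wide table: one row per position, one column per sample
wide = pd.DataFrame({
    "bp_pos": [100, 200, 300],
    "sample_A": [0.1, 0.9, 0.2],
    "sample_B": [0.3, 0.5, 0.7],
})

# Long format: one row per (position, sample) pair
long = wide.melt(id_vars="bp_pos", var_name="sample", value_name="value")
```

The long table has one signal value per row, which is the shape most plotting libraries expect for grouped line plots.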

analyze_electrophero.parse_meta_to_long(df, metafile, sample_col='sample', source_file='', image_input=False)

Function to parse the user-provided metadata and transfer to long format

Parameters:

df – pandas.DataFrame (wide)
metafile – str, csv path
sample_col – str, column name
source_file – str, csv path to where the source file shall be located
image_input – bool, whether this dataframe was previously generated from an image file

Returns:

the source data file is written to disk (.csv)
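Attaching metadata to a long-format table can be sketched as a left merge on the sample column. The column names and the `condition` field below are hypothetical, and the real function additionally writes the result to disk.

```python
import pandas as pd

# Hypothetical long-format signal table
long = pd.DataFrame({
    "sample": ["s1", "s1", "s2", "s2"],
    "bp_pos": [100, 200, 100, 200],
    "value": [0.2, 0.8, 0.6, 0.4],
})

# Hypothetical metadata: one row per sample
meta = pd.DataFrame({
    "sample": ["s1", "s2"],
    "condition": ["control", "treated"],
})

# A left merge keeps every signal row and adds the metadata columns
annotated = long.merge(meta, on="sample", how="left")
```

A left merge is chosen here so that samples missing from the metadata stay in the table (with empty metadata fields) instead of being dropped silently.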

analyze_electrophero.remove_marker_from_df(df, peak_dict='', on='', correct_for_variant_samples=False)

Function to remove marker from dataframe including a halo, meaning a defined number of base pairs around the marker band specified in the constants module

Parameters:

df – pandas.DataFrame
peak_dict – dict, previously generated with peak2basepairs
on – str denoting column based on which dataframe will be cut
correct_for_variant_samples – bool; if this option is chosen, each sample will be checked individually for the end of the marker peaks and cropped based on this information. Defaults to False, meaning that the marker halo is estimated from the first sample.

Returns:

pd.DataFrame, cleared from marker-associated data points
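The halo concept can be sketched as dropping every row within a fixed distance of the marker position. The helper `drop_marker_region`, the marker position, and the halo width are hypothetical; in DNAvi the halo is defined in the constants module and the cropping logic may differ.

```python
import pandas as pd

def drop_marker_region(df, marker_bp, halo, on="bp_pos"):
    """Keep only rows farther than `halo` base pairs from the marker."""
    keep = (df[on] - marker_bp).abs() > halo
    return df[keep].reset_index(drop=True)

# Hypothetical signal with a marker band at 35 bp
df = pd.DataFrame({
    "bp_pos": [20, 35, 50, 150, 300],
    "value": [0.4, 1.0, 0.5, 0.7, 0.2],
})

cleaned = drop_marker_region(df, marker_bp=35, halo=15)
```

Here the rows at 20, 35, and 50 bp fall inside the halo and are removed, leaving only the sample signal at 150 and 300 bp.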

analyze_electrophero.nuc_fractions(df, unit='', size_unit='', nuc_dict=NUC_DICT)

Estimate nucleosomal fractions (percentages) of a sample’s cfDNA based on pre-defined base pair ranges.

Parameters:

df – pandas.DataFrame
unit – str, usually normalized fluorescence unit
size_unit – str, fragment size unit (base pairs)

Returns:

pd.DataFrame of nucleosomal fractions
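One way to picture the fraction estimate is as the share of total signal that falls inside each size window. The ranges below are placeholders and are not the contents of `NUC_DICT`; the helper `range_fractions` is likewise hypothetical.

```python
import pandas as pd

# Placeholder size windows in base pairs (NOT the actual NUC_DICT values)
ranges = {"mono": (100, 200), "di": (201, 400)}

# Hypothetical single-sample signal
df = pd.DataFrame({
    "bp_pos": [150, 180, 300, 350, 600],
    "value": [2.0, 2.0, 1.0, 1.0, 2.0],
})

def range_fractions(df, ranges, unit="value", size_unit="bp_pos"):
    """Percentage of total signal inside each inclusive size window."""
    total = df[unit].sum()
    fractions = {}
    for name, (low, high) in ranges.items():
        in_range = df[size_unit].between(low, high)
        fractions[name] = 100.0 * df.loc[in_range, unit].sum() / total
    return fractions

fractions = range_fractions(df, ranges)
```

Signal outside every window (the 600 bp point in this example) still counts toward the total, so the percentages do not have to add up to 100.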

analyze_electrophero.run_stats(df, variable='', category='', paired=False, alpha=0.05, region_id='region_id')

Function to perform statistical tests (parametric or non-parametric) to infer the significance of differences in mean base pair fragment size between patients/samples from different groups.

Parameters:

df – pandas.DataFrame
variable – continuous variable
category – categorical variable
paired – boolean

Returns:

statistics per group in a dataframe
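The choice between a parametric and a non-parametric test can be sketched with SciPy as follows. Which tests DNAvi actually runs, and how it decides between them, is not stated here; the Shapiro-Wilk check, the specific tests, and the helper `compare_two_groups` are assumptions.

```python
from scipy import stats

def compare_two_groups(a, b, paired=False, alpha=0.05):
    """Pick a test family from a normality check, then compare two groups."""
    # Shapiro-Wilk on each group; both must look normal for a t-test
    normal = all(stats.shapiro(group)[1] > alpha for group in (a, b))
    if paired:
        if normal:
            name, result = "paired t-test", stats.ttest_rel(a, b)
        else:
            name, result = "Wilcoxon signed-rank", stats.wilcoxon(a, b)
    else:
        if normal:
            name, result = "independent t-test", stats.ttest_ind(a, b)
        else:
            name, result = "Mann-Whitney U", stats.mannwhitneyu(
                a, b, alternative="two-sided")
    return name, result.pvalue

# Hypothetical mean fragment sizes (bp) for two groups of samples
group_a = [160, 165, 170, 168, 172, 166]
group_b = [180, 185, 190, 188, 183, 186]

test_name, p_value = compare_two_groups(group_a, group_b)
significant = p_value < 0.05
```

The `paired` flag only makes sense when both groups hold measurements of the same subjects in the same order, which mirrors the `paired` parameter documented above.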

analyze_electrophero.marker_and_normalize(df, peak_dict='', include_marker=False, normalize=True, normalize_to=False, correct=False)

Function to normalize the raw DNA fluorescence intensity to a value between 0 and 1.

Parameters:

df – pandas.DataFrame
peak_dict – dict, previously generated with peak2basepairs
include_marker – bool, whether to include markers

Returns:

pd.DataFrame, now with normalized DNA fluorescence intensity
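Scaling intensities to the range 0 to 1 is commonly done with min-max normalization, sketched below per sample. Normalizing each sample independently and the column names are assumptions; the function's `normalize_to` option suggests other reference schemes exist that this sketch does not cover.

```python
import pandas as pd

# Hypothetical long-format raw intensities for two samples
df = pd.DataFrame({
    "sample": ["s1", "s1", "s1", "s2", "s2", "s2"],
    "value": [10.0, 20.0, 30.0, 0.0, 5.0, 10.0],
})

def min_max(series):
    """Rescale a series so its minimum maps to 0 and its maximum to 1."""
    return (series - series.min()) / (series.max() - series.min())

# Apply the rescaling separately within each sample
df["normalized_fluorescent_units"] = (
    df.groupby("sample")["value"].transform(min_max)
)
```

A completely flat signal would make the denominator zero, so a real pipeline needs to handle that case explicitly.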

analyze_electrophero.epg_stats(df, save_dir='', unit='normalized_fluorescent_units', size_unit='bp_pos', metric_unit='value', nuc_dict=NUC_DICT, paired=False, region_id='region_id', cut=False)

Compute and output basic statistics for DNA size distributions

Parameters:

df – pandas.DataFrame
save_dir – string, where to save the statistics to
unit – string (y-variable)
size_unit – string (x-variable)
paired – bool, whether measurements were paired

Returns:

saves three dataframes as .csv files in the stats directory: basic_statistics.csv, peak_statistics.csv, and group_statistics_by_CATEGORICAL-VAR.csv
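To give a sense of what basic size statistics can look like, the sketch below computes a signal-weighted mean fragment size and the size at maximum signal for each sample. Which statistics `epg_stats` actually reports is not specified here, so the metrics and the helper `size_summary` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical normalized signal for two samples
df = pd.DataFrame({
    "sample": ["s1", "s1", "s1", "s2", "s2", "s2"],
    "bp_pos": [100, 200, 300, 100, 200, 300],
    "normalized_fluorescent_units": [0.0, 1.0, 1.0, 1.0, 0.0, 0.0],
})

def size_summary(group, unit="normalized_fluorescent_units",
                 size_unit="bp_pos"):
    """Signal-weighted mean size and size at the (first) signal maximum."""
    weights = group[unit].to_numpy()
    sizes = group[size_unit].to_numpy()
    return {
        "mean_bp": float(np.average(sizes, weights=weights)),
        "mode_bp": int(sizes[np.argmax(weights)]),
    }

summary = {name: size_summary(group) for name, group in df.groupby("sample")}
```

Weighting by signal matters because the table holds one row per position rather than one row per DNA fragment, so an unweighted mean of `bp_pos` would ignore the intensities entirely.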

analyze_electrophero.epg_analysis(path_to_file, path_to_ladder, path_to_meta, run_id=None, include_marker=False, image_input=False, save_dir=False, marker_lane=0, nuc_dict=NUC_DICT, paired=False, normalize=True, normalize_to=False, correct=False, cut=False)

Core function to analyze DNA distribution from a signal table.

Parameters:

path_to_file – str, path where the signal table is stored
path_to_ladder – str, path to where the ladder file is stored
path_to_meta – str, path to metadata file
run_id – str, name for the analysis, based on user input or name of the signal table file
include_marker – bool, whether to include the marker in the analysis
image_input – bool, whether the signal table was generated from an image
save_dir – bool or str, where to save the statistics to. Default: False
paired – bool, whether to perform a paired statistical analysis
normalize – bool, whether to perform min-max normalization
normalize_to – str or False, name of the sample to which all other samples are normalized

Returns:

runs the analysis and plotting functions and creates multiple outputs in the result folder