analyze_electrophero

Main functions for electropherogram analysis.

Author: Anja Hess

Date: 2025-AUG-06

Attributes

script_path

Local directory of DNAvi analyze_electrophero module

maindir

Local directory of DNAvi (MAIN)

Functions

peak2basepairs(df, qc_save_dir[, y_label, x_label, ...])

Infer ladder peaks from the signal table and map them to base pair positions using the user-provided ladder file.

split_and_long_by_ladder(df)

Handles multiple ladder types in one input dataframe while converting the data into the long format required for plotting. The base pair position for each set of DNA samples is assigned as defined by the previous marker interpolation.

parse_meta_to_long(df, metafile[, sample_col, ...])

Parse the user-provided metadata and convert the data to long format.

remove_marker_from_df(df[, peak_dict, on, ...])

Remove the marker from the dataframe, including a halo: a defined number of base pairs around the marker band, as specified in the constants module.

nuc_fractions(df[, unit, size_unit, nuc_dict])

Estimate nucleosomal fractions (percentages) of a sample's cfDNA based on pre-defined base pair ranges.

run_stats(df[, variable, category, paired, alpha, ...])

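Perform statistical tests (parametric or non-parametric) to infer the significance of differences in mean base pair fragment size between patients/samples from different groups.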

marker_and_normalize(df[, peak_dict, include_marker, ...])

Normalize the raw DNA fluorescence intensity to a value between 0 and 1.

epg_stats(df[, save_dir, unit, size_unit, ...])

Compute and output basic statistics for DNA size distributions

epg_analysis(path_to_file, path_to_ladder, path_to_meta)

Core function to analyze DNA distribution from a signal table.

Module Contents

analyze_electrophero.script_path

Local directory of DNAvi analyze_electrophero module

analyze_electrophero.maindir

Local directory of DNAvi (MAIN)

analyze_electrophero.peak2basepairs(df, qc_save_dir, y_label=YLABEL, x_label=XLABEL, ladder_dir='', ladder_type='custom', marker_lane=0)

Infer ladder peaks from the signal table and map them to base pair positions using the user-provided ladder file.

Parameters:
  • df – pandas dataframe

  • qc_save_dir – directory to save qc results

  • y_label – str, new name for the signal intensity values

  • x_label – str, new name for the position values

  • ladder_dir – str, path to where the ladder is located

  • ladder_type – str, if set to “custom” (the default), the minimum peak height can be adjusted in the constants module.

Returns:

a dictionary annotating each peak to a base pair position
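The idea of mapping ladder peaks to base pair sizes can be illustrated with a minimal sketch. This is not the DNAvi implementation: the helpers `find_peaks_simple` and `annotate_peaks`, the plain-list input, and the shape of the returned dictionary (peak index to base pairs) are all assumptions made for illustration. It also assumes that the ladder sizes are listed in the order in which the peaks appear along the signal.

```python
def find_peaks_simple(signal, min_height):
    """Return indices of local maxima whose value is at least min_height."""
    peaks = []
    for i in range(1, len(signal) - 1):
        is_local_max = signal[i - 1] < signal[i] >= signal[i + 1]
        if is_local_max and signal[i] >= min_height:
            peaks.append(i)
    return peaks


def annotate_peaks(signal, ladder_sizes_bp, min_height):
    """Map each detected peak index to a ladder size (hypothetical helper)."""
    peaks = find_peaks_simple(signal, min_height)
    if len(peaks) != len(ladder_sizes_bp):
        # A mismatch means the ladder file and the detected peaks disagree.
        raise ValueError(
            f"found {len(peaks)} peaks but the ladder lists "
            f"{len(ladder_sizes_bp)} sizes"
        )
    return dict(zip(peaks, ladder_sizes_bp))


# Toy ladder lane with three clear peaks and one small bump below min_height.
ladder_signal = [0, 1, 8, 1, 0, 2, 9, 2, 0, 1, 7, 1, 0]
peak_dict = annotate_peaks(ladder_signal, [35, 150, 500], min_height=5)
print(peak_dict)
```

Raising an error on a peak-count mismatch is a design choice of this sketch; it makes a wrongly chosen minimum peak height visible immediately instead of silently shifting the size annotation.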

analyze_electrophero.split_and_long_by_ladder(df)

Handles multiple ladder types in one input dataframe while converting the data into the long format required for plotting. The base pair position for each set of DNA samples is assigned as defined by the previous marker interpolation.

Parameters:

df – pandas.DataFrame (wide)

Returns:

pandas.DataFrame (long)
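The wide-to-long conversion can be sketched as follows. The example uses plain dictionaries as a stand-in for the pandas dataframes the module works with, and the helper `wide_to_long` is hypothetical. The column names `bp_pos`, `sample` and `value` are borrowed from default arguments elsewhere in this module, but whether this function emits exactly those names is an assumption.

```python
def wide_to_long(wide, id_col="bp_pos", var_name="sample", value_name="value"):
    """Turn one column per sample into one row per (position, sample) pair."""
    long_rows = []
    for column, values in wide.items():
        if column == id_col:
            continue  # the position column is repeated on every row instead
        for position, value in zip(wide[id_col], values):
            long_rows.append(
                {id_col: position, var_name: column, value_name: value}
            )
    return long_rows


# Wide table: one shared position column plus one signal column per sample.
wide_table = {
    "bp_pos": [100, 200],
    "sample_A": [0.1, 0.9],
    "sample_B": [0.4, 0.6],
}
long_table = wide_to_long(wide_table)
for row in long_table:
    print(row)
```

With pandas, the same reshaping is typically done with `DataFrame.melt`.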

analyze_electrophero.parse_meta_to_long(df, metafile, sample_col='sample', source_file='', image_input=False)

Parse the user-provided metadata and convert the data to long format.

Parameters:
  • df – pandas.DataFrame (wide)

  • metafile – str, csv path

  • sample_col – str, column name

  • source_file – str, csv path to where the source file shall be located

  • image_input – bool, whether this dataframe was previously generated from an image file

Returns:

the source data file is written to disk (.csv)

analyze_electrophero.remove_marker_from_df(df, peak_dict='', on='', correct_for_variant_samples=False)

Remove the marker from the dataframe, including a halo: a defined number of base pairs around the marker band, as specified in the constants module.

Parameters:
  • df – pandas.DataFrame

  • peak_dict – dict, previously generated with peak2basepairs

  • on – str denoting column based on which dataframe will be cut

  • correct_for_variant_samples – bool, if enabled, each sample is checked individually for the end of the marker peaks and cropped based on this information. Defaults to False, meaning that the marker halo is estimated from the first sample.

Returns:

pd.DataFrame, cleared from marker-associated data points
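A minimal sketch of marker removal with a halo is shown below. It is an illustration only: the helper `remove_marker`, the list-of-dicts input, and the rule of dropping every data point within `halo` base pairs of a marker position are assumptions, and the actual cropping logic and halo size in DNAvi may differ.

```python
def remove_marker(rows, marker_positions, halo, on="bp_pos"):
    """Keep only rows lying more than `halo` base pairs from every marker."""
    return [
        row
        for row in rows
        if all(abs(row[on] - marker) > halo for marker in marker_positions)
    ]


# Toy data: a marker band at 35 bp with signal bleeding into its neighbours.
rows = [
    {"bp_pos": 30, "value": 0.7},
    {"bp_pos": 35, "value": 1.0},
    {"bp_pos": 40, "value": 0.6},
    {"bp_pos": 100, "value": 0.2},
    {"bp_pos": 200, "value": 0.4},
]
cleaned = remove_marker(rows, marker_positions=[35], halo=10)
print([row["bp_pos"] for row in cleaned])
```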

analyze_electrophero.nuc_fractions(df, unit='', size_unit='', nuc_dict=NUC_DICT)

Estimate nucleosomal fractions (percentages) of a sample’s cfDNA based on pre-defined base pair ranges.

Parameters:
  • df – pandas.DataFrame

  • unit – str, usually normalized fluorescence unit

  • size_unit – str, fragment size unit (base pairs)

Returns:

pd.DataFrame of nucleosomal fractions
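The fraction estimate can be illustrated with a short sketch, assuming that a fraction is the signal summed inside a base pair range, expressed as a percentage of the total signal. The helper `fractions_by_range` and the ranges in `example_ranges` are hypothetical and are not the values stored in `NUC_DICT`.

```python
def fractions_by_range(sizes, signal, ranges):
    """Return the percentage of total signal falling into each size range."""
    total = sum(signal)
    if total == 0:
        raise ValueError("total signal is zero; fractions are undefined")
    result = {}
    for name, (low, high) in ranges.items():
        in_range = sum(s for bp, s in zip(sizes, signal) if low <= bp <= high)
        result[name] = 100 * in_range / total
    return result


# Hypothetical ranges for illustration only (inclusive bounds, in base pairs).
example_ranges = {"mono": (100, 200), "di": (201, 400)}
sizes = [150, 180, 300, 600]
signal = [2.0, 2.0, 4.0, 2.0]
fractions = fractions_by_range(sizes, signal, example_ranges)
print(fractions)
```

In this sketch the fractions need not add up to 100, because signal outside all defined ranges (here the 600 bp point) still counts towards the total.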

analyze_electrophero.run_stats(df, variable='', category='', paired=False, alpha=0.05, region_id='region_id')

Perform statistical tests (parametric or non-parametric) to infer the significance of differences in mean base pair fragment size between patients/samples from different groups.

Parameters:
  • df – pandas.DataFrame

  • variable – continuous variable

  • category – categorical variable

  • paired – boolean

Returns:

statistics per group in a dataframe

analyze_electrophero.marker_and_normalize(df, peak_dict='', include_marker=False, normalize=True, normalize_to=False, correct=False)

Normalize the raw DNA fluorescence intensity to a value between 0 and 1.

Parameters:
  • df – pandas.DataFrame

  • peak_dict – dict, previously generated with peak2basepairs

  • include_marker – bool, whether to include markers

Returns:

pd.DataFrame, now with normalized DNA fluorescence intensity
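Scaling intensities to the range 0 to 1 is commonly done with min-max normalization, which the `normalize` option of `epg_analysis` also refers to. The sketch below shows the arithmetic on a plain list. The helper `min_max_normalize` is hypothetical, and its handling of a flat signal (returning zeros) is an assumption, not documented DNAvi behavior.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    if high == low:
        # A flat signal has no range to scale; return zeros by convention.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


raw_intensity = [10.0, 20.0, 30.0]
normalized = min_max_normalize(raw_intensity)
print(normalized)  # [0.0, 0.5, 1.0]
```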

analyze_electrophero.epg_stats(df, save_dir='', unit='normalized_fluorescent_units', size_unit='bp_pos', metric_unit='value', nuc_dict=NUC_DICT, paired=False, region_id='region_id', cut=False)

Compute and output basic statistics for DNA size distributions

Parameters:
  • df – pandas.DataFrame

  • save_dir – string, where to save the statistics to

  • unit – string (y-variable)

  • size_unit – string (x-variable)

  • paired – bool, whether measurements were paired

Returns:

Saves three dataframes as .csv files in the stats directory: basic_statistics.csv, peak_statistics.csv, and group_statistics_by_CATEGORICAL-VAR.csv

analyze_electrophero.epg_analysis(path_to_file, path_to_ladder, path_to_meta, run_id=None, include_marker=False, image_input=False, save_dir=False, marker_lane=0, nuc_dict=NUC_DICT, paired=False, normalize=True, normalize_to=False, correct=False, cut=False)

Core function to analyze DNA distribution from a signal table.

Parameters:
  • path_to_file – str, path where the signal table is stored

  • path_to_ladder – str, path to where the ladder file is stored

  • path_to_meta – str, path to metadata file

  • run_id – str, name for the analysis, based on user input or name of the signal table file

  • include_marker – bool, whether to include the marker in the analysis

  • image_input – bool, whether the signal table was generated based on an image

  • save_dir – bool or str, where to save the statistics to. Default: False

  • paired – bool, whether to perform a paired statistical analysis

  • normalize – bool, whether to perform min-max normalization

  • normalize_to – str or False, name of the sample to which all other samples are normalized

Returns:

Runs the analysis and plotting functions and creates multiple outputs in the result folder
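A call to `epg_analysis` might look like the sketch below. The keyword arguments are taken from the signature above. The wrapper `run_epg_example`, the run name, and the assumption that the module is importable as `analyze_electrophero` are illustrative; the import path depends on how DNAvi is installed.

```python
def run_epg_example(signal_csv, ladder_csv, meta_csv, results_dir):
    """Run one analysis with explicit options (hypothetical wrapper)."""
    # Deferred import: assumes the module is importable under this name.
    from analyze_electrophero import epg_analysis

    epg_analysis(
        signal_csv,              # path_to_file: the signal table
        ladder_csv,              # path_to_ladder: the ladder file
        meta_csv,                # path_to_meta: the metadata file
        run_id="example_run",    # hypothetical name for this analysis
        save_dir=results_dir,    # where results are written
        include_marker=False,    # leave the marker out of the analysis
        normalize=True,          # apply min-max normalization
        paired=False,            # unpaired statistical analysis
    )
```

The wrapper only defines the call; invoking it requires DNAvi and real input files.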