analyze_electrophero
====================

.. py:module:: analyze_electrophero

.. autoapi-nested-parse::

   Main functions for electropherogram analysis.

   Author: Anja Hess

   Date: 2025-AUG-06


Attributes
----------

.. autoapisummary::

   analyze_electrophero.script_path
   analyze_electrophero.maindir


Functions
---------

.. autoapisummary::

   analyze_electrophero.peak2basepairs
   analyze_electrophero.split_and_long_by_ladder
   analyze_electrophero.parse_meta_to_long
   analyze_electrophero.remove_marker_from_df
   analyze_electrophero.nuc_fractions
   analyze_electrophero.run_stats
   analyze_electrophero.marker_and_normalize
   analyze_electrophero.epg_stats
   analyze_electrophero.epg_analysis


Module Contents
---------------

.. py:data:: script_path

   Local directory of the DNAvi analyze_electrophero module

.. py:data:: maindir

   Local directory of DNAvi (MAIN)

.. py:function:: peak2basepairs(df, qc_save_dir, y_label=YLABEL, x_label=XLABEL, ladder_dir='', ladder_type='custom', marker_lane=0)

   Infer ladder peaks from the signal table and annotate them with base pair
   positions using the user-provided ladder file.

   :param df: pandas.DataFrame
   :param qc_save_dir: directory to save QC results
   :param y_label: str, new name for the signal intensity values
   :param x_label: str, new name for the position values
   :param ladder_dir: str, path to where the ladder is located
   :param ladder_type: str, if set to "custom", the minimum peak height can be adjusted with the constants module.
   :return: a dictionary annotating each peak to a base pair position


.. py:function:: split_and_long_by_ladder(df)

   Handle multiple ladder types in one input dataframe while converting the
   data into the long format required for plotting. The base pair position
   for each set of DNA samples is assigned as defined by the previous marker
   interpolation.

   :param df: pandas.DataFrame (wide)
   :return: pandas.DataFrame (long)


.. py:function:: parse_meta_to_long(df, metafile, sample_col='sample', source_file='', image_input=False)

   Parse the user-provided metadata and convert the data to long format.

   :param df: pandas.DataFrame (wide)
   :param metafile: str, csv path
   :param sample_col: str, column name
   :param source_file: str, csv path where the source file will be located
   :param image_input: bool, whether this dataframe was previously generated from an image file
   :return: the source data file is written to disk (.csv)


.. py:function:: remove_marker_from_df(df, peak_dict='', on='', correct_for_variant_samples=False)

   Remove the marker from the dataframe, including a halo: a defined number
   of base pairs around the marker band, as specified in the constants module.

   :param df: pandas.DataFrame
   :param peak_dict: dict, previously generated with peak2basepairs
   :param on: str, column on which the dataframe will be cut
   :param correct_for_variant_samples: bool - if this option is chosen, each sample will be checked individually for the end of the marker peaks and cropped based on this information. Defaults to False, meaning that the marker halo is estimated from the first sample.
   :return: pd.DataFrame, cleared of marker-associated data points


.. py:function:: nuc_fractions(df, unit='', size_unit='', nuc_dict=NUC_DICT)

   Estimate nucleosomal fractions (percentages) of a sample's cfDNA based on
   pre-defined base pair ranges.

   :param df: pandas.DataFrame
   :param unit: str, usually normalized fluorescence unit
   :param size_unit: str, fragment size unit (base pairs)
   :return: pd.DataFrame of nucleosomal fractions


.. py:function:: run_stats(df, variable='', category='', paired=False, alpha=0.05, region_id='region_id')

   Perform statistical tests (parametric or non-parametric) to infer the
   significance of the difference in mean base pair fragment size between
   patients/samples from different groups.

   :param df: pandas.DataFrame
   :param variable: continuous variable
   :param category: categorical variable
   :param paired: boolean
   :return: statistics per group in a dataframe


.. py:function:: marker_and_normalize(df, peak_dict='', include_marker=False, normalize=True, normalize_to=False, correct=False)

   Normalize the raw DNA fluorescence intensity to a value between 0 and 1.

   :param df: pandas.DataFrame
   :param peak_dict: dict, previously generated with peak2basepairs
   :param include_marker: bool, whether to include markers
   :return: pd.DataFrame, now with normalized DNA fluorescence intensity


.. py:function:: epg_stats(df, save_dir='', unit='normalized_fluorescent_units', size_unit='bp_pos', metric_unit='value', nuc_dict=NUC_DICT, paired=False, region_id='region_id', cut=False)

   Compute and output basic statistics for DNA size distributions.

   :param df: pandas.DataFrame
   :param save_dir: string, where to save the statistics to
   :param unit: string (y-variable)
   :param size_unit: string (x-variable)
   :param paired: bool, whether measurements were paired
   :return: saves three dataframes as .csv files in the stats directory (basic_statistics.csv, peak_statistics.csv, group_statistics_by_CATEGORICAL-VAR.csv)


.. py:function:: epg_analysis(path_to_file, path_to_ladder, path_to_meta, run_id=None, include_marker=False, image_input=False, save_dir=False, marker_lane=0, nuc_dict=NUC_DICT, paired=False, normalize=True, normalize_to=False, correct=False, cut=False)

   Core function to analyze DNA distribution from a signal table.

   :param path_to_file: str, path where the signal table is stored
   :param path_to_ladder: str, path to where the ladder file is stored
   :param path_to_meta: str, path to metadata file
   :param run_id: str, name for the analysis, based on user input or name of the signal table file
   :param include_marker: bool, whether to include the marker in the analysis
   :param image_input: bool, whether the signal table was generated based on an image
   :param save_dir: bool or str, where to save the statistics to. Default: False
   :param paired: bool, whether to perform a paired statistical analysis
   :param normalize: bool, whether to perform min-max normalization
   :param normalize_to: str or False, name of the sample to which all other samples are normalized
   :return: runs the analysis and plotting functions and creates multiple outputs in the result folder
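
The min-max normalization mentioned for ``normalize`` and the range-based percentages described for ``nuc_fractions`` can be illustrated with a small example. This is a minimal sketch in plain Python and not DNAvi's implementation: the helper names ``min_max_normalize`` and ``range_fractions``, the example data, and the base pair ranges in ``example_ranges`` are all hypothetical and chosen only for illustration.

```python
def min_max_normalize(values):
    """Rescale a sequence of numbers linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A flat signal has no range; return zeros to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def range_fractions(sizes, signal, ranges):
    """Return the percentage of the total signal that falls into each range.

    ``ranges`` maps a label to an inclusive (start, end) tuple in base pairs.
    """
    total = sum(signal)
    fractions = {}
    for label, (start, end) in ranges.items():
        in_range = sum(s for bp, s in zip(sizes, signal) if start <= bp <= end)
        fractions[label] = 100.0 * in_range / total if total else 0.0
    return fractions


# Hypothetical example data: fragment sizes (bp) and raw fluorescence values.
sizes = [100, 150, 200, 300, 400]
raw = [10.0, 50.0, 30.0, 20.0, 10.0]

normalized = min_max_normalize(raw)

# Illustrative ranges only (assumption); the module receives its ranges
# through the ``nuc_dict`` argument.
example_ranges = {"short": (100, 200), "long": (201, 400)}
fractions = range_fractions(sizes, normalized, example_ranges)
```

With this example data, ``normalized`` spans 0 to 1, and because the two illustrative ranges cover every data point, the values in ``fractions`` sum to 100.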