corradjust.corradjust module

class corradjust.corradjust.CorrAdjust(df_feature_ann, ref_feature_colls, out_dir, winsorize=0.05, shuffle_feature_names=False, metric='enrichment-based', min_pairs_to_score=1000, random_seed=17, title=None, verbose=True)

Bases: object

The main CorrAdjust class.

Parameters:

df_feature_ann : pandas.DataFrame

A data frame with an index and two columns that annotate the features.

Index: unique feature ids that are also used in the data table. For example, ENSEMBL gene ids or small RNA license plates.
Column feature_name: user-friendly names of the features, which are also used in the reference .gmt files. For example, gene symbols or "hsa-miR-xx-5p" notation for miRNAs.
Column feature_type: discrete set of feature types. For example, if you are analyzing only mRNA-seq data, put "mRNA" for all features; if you are integrating miRNA, mRNA, or any other data types, you can use more than one type (e.g., "miRNA" and "mRNA"). Feature types should not contain the "-" character.

ref_feature_colls : dict

A dictionary describing the reference feature set collections. Each item of ref_feature_colls is one collection: the key is the collection name (string) and the value is another dict with the following structure.

{

"path" : str

Relative or absolute path to the .gmt file with reference feature sets.

"sign" : {"positive", "negative", "absolute"}

Determines how to rank feature-feature correlations.

"absolute": absolute values of correlations (highest absolute correlations go first).
"positive": descending sorting of correlations (most positive correlations go first).
"negative": ascending sorting of correlations (most negative correlations go first).

"feature_pair_types" : list

List of strings showing types of correlations you are interested in. For example, ["mRNA-mRNA"] or ["miRNA-mRNA"].

"high_corr_frac" : float

Fraction of all feature pairs with most highly ranked correlations that will be used for scoring (alpha in the paper’s notation).

}

out_dir : str

Path to the directory where the output files will be stored. The directory will be created if it doesn’t exist.

winsorize : float or None, optional, default=0.05

If float, specifies the fraction of the lowest and the highest data values that will be winsorized. If None, no winsorizing is applied. Winsorizing is train/test-set friendly, i.e., test data has no influence on training data.

shuffle_feature_names : bool or list, optional, default=False

If True, the feature_name column of df_feature_ann will be randomly permuted. If a list of feature types, e.g., ["mRNA"], the permutation will be conducted only within the corresponding feature types. If False, no permutation is conducted.

metric : {"enrichment-based", "BP@K"}, optional, default="enrichment-based"

Metric for evaluating feature correlations (see the CorrAdjust paper for details). Please use the default option unless you are certain that you need balanced precision at K (BP@K).

min_pairs_to_score : int, optional, default=1000

Minimum number of total pairs containing a feature (N_j in the paper’s notation) required to include the feature in the score computations. The inequality must hold in both the training and validation pair sets.

random_seed : int, optional, default=17

Random seed used by PCA and other non-deterministic procedures.

title : str or None, optional, default=None

If string, it will be shown in the top-left corner of all generated plots.

verbose : bool, optional, default=True

Whether to print progress messages to stderr.
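
To make the two structured inputs above concrete, here is a minimal sketch that builds a small df_feature_ann and a ref_feature_colls dictionary. The feature ids, feature names, the collection name "GO", the .gmt path, and the high_corr_frac value are all made-up placeholders. The constructor call appears only in a comment because it needs the installed package and a real .gmt file.

```python
import pandas as pd

# Hypothetical feature annotation: the index holds the feature ids used in the data table.
df_feature_ann = pd.DataFrame(
    {
        "feature_name": ["GENE_A", "GENE_B", "hsa-miR-xx-5p"],
        "feature_type": ["mRNA", "mRNA", "miRNA"],
    },
    index=["FEATURE_ID_1", "FEATURE_ID_2", "FEATURE_ID_3"],
)

# Hypothetical reference collection; the dict key is the collection name.
ref_feature_colls = {
    "GO": {
        "path": "path/to/reference_sets.gmt",  # placeholder path
        "sign": "absolute",
        "feature_pair_types": ["mRNA-mRNA"],
        "high_corr_frac": 0.01,  # illustrative value of alpha
    },
}

# Sanity checks mirroring the documented requirements.
assert df_feature_ann.index.is_unique
assert not df_feature_ann["feature_type"].str.contains("-", regex=False).any()

# With the package installed and a real .gmt file, the object would be created as:
# from corradjust.corradjust import CorrAdjust
# model = CorrAdjust(df_feature_ann, ref_feature_colls, "out_dir")
```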

Attributes:

confounder_PCs : list: The list of confounder PC indices (0-based). Populated by calling the fit method.
PCA_model : sklearn.decomposition.PCA: PCA instance fit on training data. Populated by calling the fit or fit_PCA method.
winsorizer : Winsorizer: Winsorizer instance fit on training data. Populated by calling the fit or fit_PCA method.
mean_centerizer : MeanCenterizer: MeanCenterizer instance fit on training data. Populated by calling the fit or fit_PCA method.
corr_scorer : CorrScorer: Instance of CorrScorer. Populated by the constructor.
out_dir : str: Output directory. Populated by the constructor.
title : str: Title added to the top-left corner of the plots. Populated by the constructor.
metric : {"enrichment-based", "BP@K"}: Metric for evaluating feature correlations. Populated by the constructor.
random_seed : int: Random seed. Populated by the constructor.
verbose : bool: Whether to print progress messages. Populated by the constructor.
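
The winsorizer and mean_centerizer attributes are fit on training data only, which is what makes preprocessing train/test friendly. The sketch below illustrates that idea with plain NumPy. It assumes per-feature quantile clipping, which may differ from what the actual Winsorizer class does; the helper names are hypothetical.

```python
import numpy as np

def fit_winsor_limits(X_train, frac=0.05):
    """Per-feature lower/upper quantiles computed from training data only."""
    lower = np.quantile(X_train, frac, axis=0)
    upper = np.quantile(X_train, 1.0 - frac, axis=0)
    return lower, upper

def apply_winsor(X, lower, upper):
    """Clip each feature to the limits learned from the training data."""
    return np.clip(X, lower, upper)

rng = np.random.default_rng(17)
X_train = rng.normal(size=(100, 3))
X_test = rng.normal(size=(20, 3)) * 5.0

lower, upper = fit_winsor_limits(X_train)
X_train_w = apply_winsor(X_train, lower, upper)
X_test_w = apply_winsor(X_test, lower, upper)  # test data never changes the limits
```

Because the limits come from X_train alone, adding or removing test samples cannot alter the transformed training data.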

fit(df_data, df_samp_ann=None, method='greedy', n_PCs=None, n_iters=None)

Find the optimal set of PCs to regress out. Calling this method sets the confounder_PCs attribute.

It also generates several files in out_dir:

fit.tsv : file with the scores across the optimization trajectory.
fit.{sample_group}.png : visualization of fit.tsv for each sample group.
fit.log : detailed log file updated at each iteration.

Parameters:

df_data : pandas.DataFrame

Training data. Index = samples, columns = feature ids. Data should be normalized so that feature values are comparable between samples. Feature ids should perfectly match the ones used in df_feature_ann while initializing the object.

df_samp_ann : pandas.DataFrame or None, optional, default=None

A data frame with an index and one column providing group annotation of samples.

Index: sample names (should match index of df_data).
Column group: discrete set of group names (e.g., "normal" and "tumor"). Group name cannot be "mean".

If df_samp_ann is None, the method automatically generates it with one sample group called all_samp.

method : {"greedy", "first-n"}, optional, default="greedy"

Optimization method.

"greedy" (recommended default): searches for the best PC to remove on each iteration (quadratic complexity).
"first-n": iteratively removes PCs in order of the % variance they explain (linear complexity). This is the concept used by the sva_network function of the sva package.

n_PCs : int or None, optional, default=None

Number of PCs to choose from during the optimization. By default (None), the number of PCs is estimated as the knee of the cumulative % variance plot.

n_iters : int or None, optional, default=None

How many optimization iterations to perform. If None, n_iters is set to n_PCs.
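
The difference between the two optimization methods can be sketched with a toy search loop. This is not the library's implementation: score_fn is a hypothetical stand-in for the correlation scoring step, the toy weights are invented, and the selection of the peak iteration is omitted. The sketch only shows why "greedy" needs a quadratic number of score evaluations while "first-n" needs a linear number.

```python
def search_pcs(score_fn, n_pcs, n_iters, method="greedy"):
    """Return the ordered list of removed PCs and the number of score evaluations."""
    removed, n_evals = [], 0
    for _ in range(n_iters):
        candidates = [pc for pc in range(n_pcs) if pc not in removed]
        if not candidates:
            break
        if method == "first-n":
            # PCs are assumed to be ordered by explained variance already.
            best_pc = candidates[0]
            score_fn(removed + [best_pc])
            n_evals += 1
        else:
            # Greedy: score every remaining PC and keep the best one.
            scores = {pc: score_fn(removed + [pc]) for pc in candidates}
            n_evals += len(candidates)
            best_pc = max(scores, key=scores.get)
        removed.append(best_pc)
    return removed, n_evals

# Toy score: pretend that removing PCs 2 and 0 helps and removing others hurts.
def toy_score(pcs):
    weights = {2: 3.0, 0: 1.0}
    return sum(weights.get(pc, -0.5) for pc in pcs)

greedy_order, greedy_evals = search_pcs(toy_score, n_pcs=4, n_iters=4, method="greedy")
first_n_order, first_n_evals = search_pcs(toy_score, n_pcs=4, n_iters=4, method="first-n")
```

With four PCs and four iterations, the greedy search evaluates the score 4 + 3 + 2 + 1 times, whereas the first-n search evaluates it once per iteration.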

fit_PCA(df_data, df_samp_ann=None, plot_var=False)

Compute principal components of training data. Before PCA, the data are centered and optionally winsorized in a train/test-friendly way.

Calling this method sets the PCA_model attribute.

Parameters:

df_data : pandas.DataFrame: Training data. Index = samples, columns = feature ids. See fit method.
df_samp_ann : pandas.DataFrame, optional, default=None: A data frame providing group annotation of samples. See fit method.
plot_var : bool, optional, default=False: Whether to make plots with % variance explained. The plots will be saved as "PCA_var.png" in out_dir.

Returns:

PCs : numpy.ndarray: Array of principal components with shape (n_samples, n_components).
PC_coefs : numpy.ndarray: Array of PC coefficients with shape (n_components, n_features).
n_PCs_knee : int: Number of PCs estimated using the knee (a.k.a. elbow) method.
df_data : pandas.DataFrame: Centered and (optionally) winsorized training data.
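
The shapes of the returned arrays can be illustrated with a plain NumPy SVD. The class itself stores an sklearn.decomposition.PCA instance, so this is only an equivalent sketch on random data for showing how PCs and PC_coefs relate. Winsorizing and the knee estimation are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(17)
X = rng.normal(size=(30, 8))          # rows = samples, columns = features
X_centered = X - X.mean(axis=0)       # mean-centering of each feature

# Thin SVD: X_centered = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
n_components = 5
PCs = U[:, :n_components] * S[:n_components]   # (n_samples, n_components)
PC_coefs = Vt[:n_components]                   # (n_components, n_features)

# Percent of total variance explained by each retained component.
var_pct = 100.0 * S[:n_components] ** 2 / np.sum(S ** 2)
```

Projecting the centered data onto the coefficient rows (X_centered @ PC_coefs.T) gives the same PCs array.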

transform(df_data, df_samp_ann=None)

Residualize identified confounder PCs from input data. This method should be called after fit, or after manually calling fit_PCA and setting the confounder_PCs attribute.

Parameters:

df_data : pandas.DataFrame: Input data. Index = samples, columns = feature ids. Feature ids should perfectly match the ones used in df_feature_ann while initializing the object.
df_samp_ann : pandas.DataFrame or None, optional, default=None: A data frame providing group annotation of samples. This argument is mandatory if it was used during fit call.

Returns:

df_data_clean : pandas.DataFrame: Cleaned data.
df_rsquareds : pandas.DataFrame: Data frame with R^2 values for each feature.
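
Residualization and the per-feature R^2 values can be sketched as an ordinary least-squares regression of each feature on the confounder values. This sketch assumes OLS with an intercept on simulated data; the source does not spell out the exact procedure used by transform.

```python
import numpy as np

rng = np.random.default_rng(17)
n_samples, n_features = 50, 4
confounders = rng.normal(size=(n_samples, 2))   # values of the selected PCs
noise = 0.1 * rng.normal(size=(n_samples, n_features))
X = confounders @ rng.normal(size=(2, n_features)) + noise
X = X - X.mean(axis=0)

# Design matrix: an intercept column followed by the confounders.
design = np.column_stack([np.ones(n_samples), confounders])
coefs, *_ = np.linalg.lstsq(design, X, rcond=None)
X_clean = X - design @ coefs                    # residuals = cleaned data

# Per-feature R^2: share of variance explained by the confounders.
ss_res = np.sum(X_clean ** 2, axis=0)
ss_tot = np.sum((X - X.mean(axis=0)) ** 2, axis=0)
r_squared = 1.0 - ss_res / ss_tot
```

A feature with R^2 close to 1 is almost entirely explained by the confounders, so little of its original variation remains after cleaning.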

compute_confounders(df_data, df_samp_ann=None)

Compute the data frame with confounders from input data. This method should be called after fit, or after manually calling fit_PCA and setting the confounder_PCs attribute.

Parameters:

df_data : pandas.DataFrame: Input data. Index = samples, columns = feature ids. Feature ids should perfectly match the ones used in df_feature_ann while initializing the object.
df_samp_ann : pandas.DataFrame or None, optional, default=None: A data frame providing group annotation of samples. This argument is mandatory if it was used during fit call.

Returns:

pandas.DataFrame: Data frame with values of confounder PCs. Note that PC indices are 0-based.

compute_feature_scores(df_data, df_samp_ann=None, samp_group=None, pairs_subset='all')

Compute enrichment scores and p-values for feature-feature correlations in given data.

Parameters:

df_data : pandas.DataFrame: Input data. Index = samples, columns = feature ids. Feature ids should perfectly match the ones used in df_feature_ann while initializing the object.
df_samp_ann : pandas.DataFrame or None, optional, default=None: A data frame providing group annotation of samples. This argument is mandatory if it was used during fit call.
samp_group : str or None, default=None: If df_samp_ann is passed, this argument is mandatory and should refer to one of the sample groups in df_samp_ann.
pairs_subset : {"all", "training", "validation"}, optional, default="all": Which set of feature pairs to use for computing scores.

Returns:

dict: Dict with data frames containing feature-wise enrichment scores and p-values. It has the following structure:

{

"Clean" : {

ref_feature_coll : pandas.DataFrame

},

"Raw": {…}

}.

make_volcano_plot(feature_scores, plot_filename, **plot_kwargs)

Make volcano plot.

Parameters:

feature_scores : dict: Dict produced by compute_feature_scores method.
plot_filename : str: Path to the figure (with extension, e.g., .png).
**plot_kwargs: Other keyword arguments controlling plot aesthetics, passed to VolcanoPlotter.

Returns:

VolcanoPlotter: Instance of VolcanoPlotter. At this point, the plot is already saved in the file, so you can change it through fig/ax and re-save.

make_corr_distr_plot(df_data, plot_filename, df_samp_ann=None, samp_group=None, pairs_subset='all', **plot_kwargs)

Visualize distribution of feature-feature correlations before and after PC correction.

Parameters:

df_data : pandas.DataFrame: Input data. Index = samples, columns = feature ids. Feature ids should perfectly match the ones used in df_feature_ann while initializing the object.
plot_filename : str: Path to the figure (with extension, e.g., .png).
df_samp_ann : pandas.DataFrame or None, optional, default=None: A data frame providing group annotation of samples. This argument is mandatory if it was used during fit call.
samp_group : str or None, default=None: If df_samp_ann is passed, this argument is mandatory and should refer to one of the sample groups in df_samp_ann.
pairs_subset : {"all", "training", "validation"}, optional, default="all": Which set of feature pairs to use for computing scores.
**plot_kwargs: Other keyword arguments controlling plot aesthetics, passed to CorrDistrPlotter.

Returns:

CorrDistrPlotter: Instance of CorrDistrPlotter. At this point, the plot is already saved in the file, so you can change it through fig/ax and re-save.

export_corrs(df_data, filename, df_samp_ann=None, samp_group=None, chunk_size=1000000)

Export feature-feature correlations to file.

Parameters:

df_data : pandas.DataFrame: Input data. Index = samples, columns = feature ids. Feature ids should perfectly match the ones used in df_feature_ann while initializing the object.
filename : str: Path to the export table with .tsv extension.
df_samp_ann : pandas.DataFrame or None, optional, default=None: A data frame providing group annotation of samples. This argument is mandatory if it was used during fit call.
samp_group : str or None, default=None: If df_samp_ann is passed, this argument is mandatory and should refer to one of the sample groups in df_samp_ann.
chunk_size : int, optional, default=1000000: How many rows are exported in a single chunk.
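
The purpose of chunk_size can be illustrated with a small sketch that writes pairwise correlations to a tab-separated table in blocks of rows, so the full table never has to be held in memory as text. The columns actually written by export_corrs are not described in this section; the three-column layout, the column names, and the export_pairs helper below are hypothetical.

```python
import io
import numpy as np

def export_pairs(corr, feature_ids, handle, chunk_size=1000000):
    """Write upper-triangle correlations to a TSV handle in chunks of rows."""
    rows, cols = np.triu_indices(len(feature_ids), k=1)
    handle.write("feature1\tfeature2\tcorr\n")  # hypothetical column names
    for start in range(0, len(rows), chunk_size):
        stop = start + chunk_size
        lines = [
            f"{feature_ids[i]}\t{feature_ids[j]}\t{corr[i, j]:.6f}\n"
            for i, j in zip(rows[start:stop], cols[start:stop])
        ]
        handle.writelines(lines)

rng = np.random.default_rng(17)
X = rng.normal(size=(20, 5))                    # rows = samples, columns = features
corr = np.corrcoef(X, rowvar=False)
buffer = io.StringIO()
export_pairs(corr, ["f1", "f2", "f3", "f4", "f5"], buffer, chunk_size=4)
```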

_search_confounder_PCs(df_data, df_samp_ann, PCs, PC_coefs, method, n_iters)

Iteratively residualize PCs from training data to maximize the number of reference pairs among highly correlated feature pairs.

The parent fit method describes the files generated by this method.

Parameters:

df_data : pandas.DataFrame: See fit method.
df_samp_ann : pandas.DataFrame or None: See fit method.
PCs : numpy.ndarray: 2D array of PCs (returned by fit_PCA).
PC_coefs : numpy.ndarray: 2D array of PC coefficients (returned by fit_PCA).
method : str: See fit method.
n_iters: See fit method.
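
The notion of "reference pairs among highly correlated feature pairs" combines the sign and high_corr_frac settings of a reference collection. The sketch below ranks a toy vector of correlations under each sign rule, keeps the top fraction, and counts reference pairs among them. The helper name, the rounding of the cut-off, and the toy data are assumptions; the actual scoring metrics are defined in the CorrAdjust paper and are not reproduced here.

```python
import numpy as np

def top_pair_indices(corrs, sign, high_corr_frac):
    """Indices of the most highly ranked pairs under the chosen sign rule."""
    if sign == "absolute":
        keys = -np.abs(corrs)     # highest absolute correlations first
    elif sign == "positive":
        keys = -corrs             # most positive correlations first
    elif sign == "negative":
        keys = corrs              # most negative correlations first
    else:
        raise ValueError(f"Unknown sign: {sign}")
    # Rounding rule is an assumption made for this sketch.
    n_top = max(1, int(round(high_corr_frac * len(corrs))))
    return np.argsort(keys, kind="stable")[:n_top]

# Toy correlations for 10 feature pairs and a mask marking reference pairs.
corrs = np.array([0.9, -0.8, 0.1, -0.95, 0.5, 0.0, 0.3, -0.2, 0.7, -0.6])
is_reference = np.array([1, 0, 0, 1, 0, 0, 0, 0, 1, 0], dtype=bool)

top = top_pair_indices(corrs, sign="absolute", high_corr_frac=0.3)
n_reference_in_top = int(is_reference[top].sum())
```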

_compute_corrs_and_scores(X, df_samp_ann, log_file, prog_bar, log_message)

Conduct one iteration of PC optimization by computing and scoring correlations. Scores are computed separately for each sample group; mean scores are also returned.

Parameters:

X : numpy.ndarray: Training data, rows = samples, columns = features.
df_samp_ann : pandas.DataFrame: See fit method.
log_file : file: Log file opened for writing.
prog_bar : tqdm.std.tqdm or None: Progress bar (will be updated) or None.
log_message : str or None: Title of the log message. If None, nothing will be printed to log.

Returns:

iter_scores : dict: A dict of the following structure:

{

sample_group_name : {

ref_feature_coll : {

"score training" : float,

"score validation" : float,

"score all" : float,

}

}

}.
main_score : float: Mean of training scores across all sample groups and reference collections, i.e., iter_scores["mean"]["mean"]["score training"].
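
As a small worked example of the averaging behind main_score, the sketch below takes a hypothetical score dict with two sample groups and two reference collections and computes the plain mean of the training scores. The group names, collection names, and values are invented. In the documented structure this value is stored under the "mean" keys, while here it is computed by an illustrative helper.

```python
# Hypothetical scores for two sample groups and two reference collections.
iter_scores = {
    "normal": {
        "GO": {"score training": 1.0, "score validation": 0.8, "score all": 0.9},
        "TF": {"score training": 3.0, "score validation": 2.0, "score all": 2.5},
    },
    "tumor": {
        "GO": {"score training": 2.0, "score validation": 1.2, "score all": 1.6},
        "TF": {"score training": 4.0, "score validation": 3.0, "score all": 3.5},
    },
}

def mean_score(scores, key="score training"):
    """Average one score type over every sample group and reference collection."""
    values = [
        coll_scores[key]
        for group_scores in scores.values()
        for coll_scores in group_scores.values()
    ]
    return sum(values) / len(values)

main_score = mean_score(iter_scores)  # (1.0 + 3.0 + 2.0 + 4.0) / 4
```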

_best_iter_scores_to_df(best_iter_scores, used_PCs)

Convert dict of scores into data frame.

Parameters:

best_iter_scores : list: List of dicts with scores corresponding to the iterations of optimization. Each element of the list should have the format specified in the return dict of the _compute_corrs_and_scores method.
used_PCs : list: List of PC indices corresponding to best_iter_scores.

Returns:

pandas.DataFrame

Pandas data frame with scores in columns and iterations on rows.

Index: iteration number. Starts with 0 (raw dataset before correction).
Column PC: index of the PC being regressed out on each iteration. The first row always has NaN (no correction).
Score columns: these columns have the following format: "{sample_group_name};{ref_feature_coll};{training/validation/all}".
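
The flattening described above can be sketched as follows. This is an illustration of the documented column format only, not the method's actual code: the scores_to_df helper, the collection name "GO", and the numbers are hypothetical, and the group name all_samp is the one the fit method generates when no sample annotation is given.

```python
import numpy as np
import pandas as pd

def scores_to_df(best_iter_scores, used_pcs):
    """Flatten per-iteration nested score dicts into one row per iteration."""
    records = []
    for pc, iter_scores in zip(used_pcs, best_iter_scores):
        row = {"PC": pc}
        for group, colls in iter_scores.items():
            for coll, scores in colls.items():
                for score_name, value in scores.items():
                    subset = score_name.replace("score ", "")
                    row[f"{group};{coll};{subset}"] = value
        records.append(row)
    return pd.DataFrame(records)

# Iteration 0 is the raw data, so no PC has been regressed out yet (NaN).
best_iter_scores = [
    {"all_samp": {"GO": {"score training": 1.0, "score validation": 0.9, "score all": 1.0}}},
    {"all_samp": {"GO": {"score training": 1.5, "score validation": 1.2, "score all": 1.4}}},
]
df_scores = scores_to_df(best_iter_scores, used_pcs=[np.nan, 3])
```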

_make_best_iter_scores_plot(df_best_iter_scores, peak_iter, df_samp_ann)

Make optimization score line plots. If there is one sample group, only one plot named "fit.png" will be produced. If there are multiple groups, one plot per group will be made: "fit.{group_name}.png", including a group with mean values.

Parameters:

df_best_iter_scores : pandas.DataFrame: Data frame generated using the _best_iter_scores_to_df method.
peak_iter : int: Iteration at which a vertical dashed red line is drawn to mark the selected early stopping point.
df_samp_ann : pandas.DataFrame: See fit method.

_compute_raw_and_clean_corrs(df_data, df_samp_ann, samp_group)

Compute correlations before and after residualizing PC confounders for a given group of samples.

Parameters:

df_data : pandas.DataFrame: Input data. Index = samples, columns = feature ids. See transform for details.
df_samp_ann : pandas.DataFrame: A data frame providing group annotation of samples. See transform for details.
samp_group : str: One of the sample groups in df_samp_ann.

Returns:

corrs_raw : numpy.ndarray: 1D array of correlations (diagonal 1s are not included).
corrs_clean : numpy.ndarray: Same but with cleaned correlations.

_check_df_data(df_data)

Check validity of input data.

Parameters:

df_data : pandas.DataFrame: Input data. Index = samples, columns = feature ids. See fit method.

Raises:

AssertionError

Several possible reasons:

Feature ids do not match ones in feature annotation.
There is a gene with constant expression.
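
The two documented failure reasons can be mirrored in a short validation sketch. The check_df_data function below is a hypothetical stand-in rather than the library's code. It assumes that "match" means the same ids in the same order, which the source does not specify.

```python
import pandas as pd

def check_df_data(df_data, feature_ids):
    """Assert the two documented conditions on the data table."""
    assert list(df_data.columns) == list(feature_ids), \
        "Feature ids do not match the feature annotation."
    n_unique = df_data.nunique(axis=0)
    assert (n_unique > 1).all(), "At least one feature is constant."

df_ok = pd.DataFrame({"f1": [1.0, 2.0, 3.0], "f2": [0.5, 0.1, 0.9]})
check_df_data(df_ok, ["f1", "f2"])   # passes silently

df_bad = df_ok.assign(f2=1.0)        # f2 is now constant across samples
try:
    check_df_data(df_bad, ["f1", "f2"])
    raised = False
except AssertionError:
    raised = True
```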

_check_df_samp_ann(df_samp_ann, df_data, after_training=False)

Check validity of sample annotation.

Parameters:

df_samp_ann : pandas.DataFrame or None: Sample annotation data. See fit method.
df_data : pandas.DataFrame: Input data. Index = samples, columns = feature ids. See fit method.
after_training : bool: Whether this method is called after training.

Raises:

AssertionError

Several possible reasons:

Sample names in df_samp_ann and df_data do not match.
df_samp_ann lacks group column.
If after_training == True, method will also check that group names of df_samp_ann match those in mean_centerizer.