corradjust.corradjust module

class corradjust.corradjust.CorrAdjust(df_feature_ann, ref_feature_colls, out_dir, winsorize=0.05, shuffle_feature_names=False, metric='enrichment-based', min_pairs_to_score=1000, random_seed=17, title=None, verbose=True)

Bases: object

The main CorrAdjust class.

Parameters:
df_feature_ann : pandas.DataFrame

A data frame with index and 2 columns providing annotation for features.

  • Index: unique feature ids that are also used in the data table. For example, ENSEMBL gene ids or small RNA license plates.

  • Column feature_name: user-friendly names of the features, which are also used in the reference .gmt files. For example, gene symbols or "hsa-miR-xx-5p" notation for miRNAs.

  • Column feature_type: discrete set of feature types. E.g., if you are analyzing only mRNA-seq data, put "mRNA" for all features; if you are integrating miRNA, mRNA, or any other data type, you can use more than one type (e.g., "miRNA" and "mRNA"). Feature types must not contain the "-" character.

ref_feature_colls : dict

A dictionary with information about reference feature set collections. Each collection is an item of ref_feature_colls, with the key being the collection name (string) and the value being another dict with the following structure:

{
    "path" : str
        Relative or absolute path to the .gmt file with reference feature sets.
    "sign" : {"positive", "negative", "absolute"}
        Determines how to rank feature-feature correlations.
        "absolute": absolute values of correlations (highest absolute correlations go first).
        "positive": descending sorting of correlations (most positive correlations go first).
        "negative": ascending sorting of correlations (most negative correlations go first).
    "feature_pair_types" : list
        List of strings showing types of correlations you are interested in. For example, ["mRNA-mRNA"] or ["miRNA-mRNA"].
    "high_corr_frac" : float
        Fraction of all feature pairs with most highly ranked correlations that will be used for scoring (alpha in the paper's notation).
}

out_dir : str

Path to the directory where the output files will be stored. The directory will be created if it doesn’t exist.

winsorize : float or None, optional, default=0.05

If float, specifies fraction of the lowest and the highest data values that will be winsorized. If None, no winsorizing will be applied. Winsorizing is train/test-set friendly, i.e., test data has no influence on training data.

shuffle_feature_names : bool or list, optional, default=False

If True, feature_name column of df_feature_ann will be randomly permuted. If list of feature types, e.g., ["mRNA"], permutation will be conducted only within the corresponding feature types. If False, no permutations will be conducted.

metric : {“enrichment-based”, “BP@K”}, optional, default=”enrichment-based”

Metric for evaluating feature correlations (see the CorrAdjust paper for details). Please use the default option unless you are absolutely sure why you need balanced precision at K (BP@K).

min_pairs_to_score : int, optional, default=1000

Minimum number of total pairs containing a feature (N_j in the paper notation) to include the feature in the score computations. The inequality must hold in both the training and validation pair sets.

random_seed : int, optional, default=17

Random seed used by PCA and other non-deterministic procedures.

title : str or None, optional, default=None

If string, it will be shown in top-left corner of all generated plots.

verbose : bool, optional, default=True

Whether to print progress messages to stderr.

Attributes:
confounder_PCs : list

The list of confounder PC indices (0-based). Populated by calling fit method.

PCA_model : sklearn.decomposition.PCA

PCA instance fit on training data. Populated by calling fit and fit_PCA methods.

winsorizer : Winsorizer

Winsorizer instance fit on training data. Populated by calling fit and fit_PCA methods.

mean_centerizer : MeanCenterizer

MeanCenterizer instance fit on training data. Populated by calling fit and fit_PCA methods.

corr_scorer : CorrScorer

Instance of CorrScorer. Populated by constructor.

out_dir : str

Output directory. Populated by constructor.

title : str

Title added to the top-left corner of the plots. Populated by constructor.

metric : {“enrichment-based”, “BP@K”}

Metric for evaluating feature correlations. Populated by constructor.

random_seed : int

Random seed. Populated by constructor.

verbose : bool

Whether to print progress messages. Populated by constructor.
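Example (a minimal construction sketch; the feature ids, feature names, and the "pathways.gmt" path below are hypothetical placeholders):

import pandas as pd
from corradjust.corradjust import CorrAdjust

# Feature annotation: index = feature ids, columns = feature_name and feature_type
df_feature_ann = pd.DataFrame(
    {"feature_name": ["TP53", "EGFR"], "feature_type": ["mRNA", "mRNA"]},
    index=["ENSG00000141510", "ENSG00000146648"],
)

# One reference collection defined by a .gmt file
ref_feature_colls = {
    "Pathways": {
        "path": "pathways.gmt",
        "sign": "absolute",
        "feature_pair_types": ["mRNA-mRNA"],
        "high_corr_frac": 0.01,
    }
}

corradjust_obj = CorrAdjust(df_feature_ann, ref_feature_colls, out_dir="corradjust_out")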

fit(df_data, df_samp_ann=None, method='greedy', n_PCs=None, n_iters=None)

Find the optimal set of PCs to regress out. Calling this method sets the confounder_PCs attribute.

It also generates several files in out_dir:

  • fit.tsv : file with the scores across the optimization trajectory.

  • fit.{sample_group}.png : visualization of fit.tsv for each sample group.

  • fit.log : detailed log file updated at each iteration.

Parameters:
df_data : pandas.DataFrame

Training data. Index = samples, columns = feature ids. Data should be normalized so that feature values are comparable between samples. Feature ids should perfectly match the ones used in df_feature_ann while initializing the object.

df_samp_ann : pandas.DataFrame or None, optional, default=None

A data frame with index and one column providing group annotation of samples.

  • Index: sample names (should match index of df_data).

  • Column group: discrete set of group names (e.g., "normal" and "tumor"). Group name cannot be "mean".

If df_samp_ann is None, the method automatically generates it with one sample group called all_samp.

method : {“greedy”, “first-n”}, optional, default=”greedy”

Optimization method.

  • "greedy" (recommended default): searches for the best PC to remove on each iteration (quadratic complexity).

  • "first-n": iteratively removes PCs based on % variance they explain (linear complexity). This is the concept used by sva_network function of the sva package.

n_PCs : int or None, optional, default=None

Number of PCs to choose from during the optimization. By default (None), the number of PCs is estimated as the knee of the % cumulative variance plot.

n_iters : int or None, optional, default=None

How many optimization iterations to perform. If None, n_iters is set to n_PCs.
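Example (a sketch continuing the construction example above; "train_data.tsv" is a hypothetical normalized expression table with samples in rows and feature ids in columns):

df_data_train = pd.read_csv("train_data.tsv", sep="\t", index_col=0)

# Greedy search over confounder PCs; fit.tsv, fit.log, and the fit plots go to out_dir
corradjust_obj.fit(df_data_train, method="greedy")
print(corradjust_obj.confounder_PCs)  # 0-based indices of PCs identified as confounders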

fit_PCA(df_data, df_samp_ann=None, plot_var=False)

Compute principal components of the training data. The data are centered and optionally winsorized before computing PCA in a train/test-friendly way.

Calling this method sets PCA_model attribute.

Parameters:
df_data : pandas.DataFrame

Training data. Index = samples, columns = feature ids. See fit method.

df_samp_ann : pandas.DataFrame, optional, default=None

A data frame providing group annotation of samples. See fit method.

plot_var : bool, optional, default=False

Whether to make plots with % variance explained. The plots will be saved as "PCA_var.png" in out_dir.

Returns:
PCs : numpy.ndarray

Array of principal components with shape (n_samples, n_components).

PC_coefs : numpy.ndarray

Array of PC coefficients with shape (n_components, n_features).

n_PCs_knee : int

Number of PCs estimated using the knee (a.k.a. elbow) method.

df_data : pandas.DataFrame

Centered and (optionally) winsorized training data.
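Example (a sketch continuing the examples above; it assumes the four documented return values come back as a tuple):

PCs, PC_coefs, n_PCs_knee, df_data_prep = corradjust_obj.fit_PCA(df_data_train, plot_var=True)
print(PCs.shape, PC_coefs.shape, n_PCs_knee)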

transform(df_data, df_samp_ann=None)

Residualize the identified confounder PCs from the input data. This method should be called after fit, or after manually calling fit_PCA and setting the confounder_PCs attribute.

Parameters:
df_data : pandas.DataFrame

Input data. Index = samples, columns = feature ids. Feature ids should perfectly match the ones used in df_feature_ann while initializing the object.

df_samp_ann : pandas.DataFrame or None, optional, default=None

A data frame providing group annotation of samples. This argument is mandatory if it was used during fit call.

Returns:
df_data_clean : pandas.DataFrame

Cleaned data.

df_rsquareds : pandas.DataFrame

Data frame with R^2 values for each feature.
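Example (a sketch continuing the examples above; "test_data.tsv" is a hypothetical held-out table with the same feature ids as the training data):

df_data_test = pd.read_csv("test_data.tsv", sep="\t", index_col=0)
df_data_clean, df_rsquareds = corradjust_obj.transform(df_data_test)
df_data_clean.to_csv("data_clean.tsv", sep="\t")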

compute_confounders(df_data, df_samp_ann=None)

Compute the data frame with confounders from the input data. This method should be called after fit, or after manually calling fit_PCA and setting the confounder_PCs attribute.

Parameters:
df_data : pandas.DataFrame

Input data. Index = samples, columns = feature ids. Feature ids should perfectly match the ones used in df_feature_ann while initializing the object.

df_samp_ann : pandas.DataFrame or None, optional, default=None

A data frame providing group annotation of samples. This argument is mandatory if it was used during fit call.

Returns:
pandas.DataFrame

Data frame with values of confounder PCs. Note that PC indices are 0-based.
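Example (a sketch continuing the examples above; the confounder PC values could be used, e.g., as covariates in a downstream model):

df_confounders = corradjust_obj.compute_confounders(df_data_test)
print(df_confounders.head())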

compute_feature_scores(df_data, df_samp_ann=None, samp_group=None, pairs_subset='all')

Compute enrichment scores and p-values for feature-feature correlations in given data.

Parameters:
df_data : pandas.DataFrame

Input data. Index = samples, columns = feature ids. Feature ids should perfectly match the ones used in df_feature_ann while initializing the object.

df_samp_ann : pandas.DataFrame or None, optional, default=None

A data frame providing group annotation of samples. This argument is mandatory if it was used during fit call.

samp_group : str or None, default=None

If df_samp_ann is passed, this argument is mandatory and should refer to one of the sample groups in df_samp_ann.

pairs_subset : {“all”, “training”, “validation”}, optional, default=”all”

Which set of feature pairs to use for computing scores.

Returns:
dict

Dict with data frames containing feature-wise enrichment scores and p-values. It has the following structure:

{
    "Clean" : {
        ref_feature_coll : pandas.DataFrame
    },
    "Raw" : {…}
}
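
Example (a sketch continuing the examples above; "Pathways" refers to the collection name used when constructing the object):

feature_scores = corradjust_obj.compute_feature_scores(df_data_test, pairs_subset="validation")
df_clean_scores = feature_scores["Clean"]["Pathways"]
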
make_volcano_plot(feature_scores, plot_filename, **plot_kwargs)

Make volcano plot.

Parameters:
feature_scores : dict

Dict produced by compute_feature_scores method.

plot_filename : str

Path to the figure (with extension, e.g., .png).

**plot_kwargs

Other keyword arguments controlling plot aesthetics passed to VolcanoPlotter.

Returns:
VolcanoPlotter

Instance of VolcanoPlotter. At this point, the plot is already saved in the file, so you can change it through fig/ax and re-save.
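Example (a sketch reusing the feature_scores dict from the compute_feature_scores example; the post-hoc tweak assumes the plotter exposes a Matplotlib fig attribute, as described above):

plotter = corradjust_obj.make_volcano_plot(feature_scores, "volcano.png")
plotter.fig.suptitle("My dataset")  # optional post-hoc adjustment
plotter.fig.savefig("volcano.png")  # re-save the modified figure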

make_corr_distr_plot(df_data, plot_filename, df_samp_ann=None, samp_group=None, pairs_subset='all', **plot_kwargs)

Visualize distribution of feature-feature correlations before and after PC correction.

Parameters:
df_data : pandas.DataFrame

Input data. Index = samples, columns = feature ids. Feature ids should perfectly match the ones used in df_feature_ann while initializing the object.

plot_filename : str

Path to the figure (with extension, e.g., .png).

df_samp_ann : pandas.DataFrame or None, optional, default=None

A data frame providing group annotation of samples. This argument is mandatory if it was used during fit call.

samp_group : str or None, default=None

If df_samp_ann is passed, this argument is mandatory and should refer to one of the sample groups in df_samp_ann.

pairs_subset : {“all”, “training”, “validation”}, optional, default=”all”

Which set of feature pairs to use for computing scores.

**plot_kwargs

Other keyword arguments controlling plot aesthetics passed to CorrDistrPlotter.

Returns:
CorrDistrPlotter

Instance of CorrDistrPlotter. At this point, the plot is already saved in the file, so you can change it through fig/ax and re-save.
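Example (a sketch continuing the examples above; the output file name is a placeholder):

corradjust_obj.make_corr_distr_plot(df_data_test, "corr_distr.png", pairs_subset="validation")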

export_corrs(df_data, filename, df_samp_ann=None, samp_group=None, chunk_size=1000000)

Export feature-feature correlations to file.

Parameters:
df_data : pandas.DataFrame

Input data. Index = samples, columns = feature ids. Feature ids should perfectly match the ones used in df_feature_ann while initializing the object.

filename : str

Path to the export table with .tsv extension.

df_samp_ann : pandas.DataFrame or None, optional, default=None

A data frame providing group annotation of samples. This argument is mandatory if it was used during fit call.

samp_group : str or None, default=None

If df_samp_ann is passed, this argument is mandatory and should refer to one of the sample groups in df_samp_ann.

chunk_size : int, optional, default=1000000

How many rows are exported in a single chunk.
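Example (a sketch continuing the examples above; the output file name is a placeholder):

corradjust_obj.export_corrs(df_data_test, "correlations.tsv")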

_search_confounder_PCs(df_data, df_samp_ann, PCs, PC_coefs, method, n_iters)

Iteratively residualize PCs from the training data to maximize the number of reference pairs among highly correlated feature pairs.

The parent fit method describes the files generated by this method.

Parameters:
df_data : pandas.DataFrame

See fit method.

df_samp_ann : pandas.DataFrame or None

See fit method.

PCs : numpy.ndarray

2D array of PCs (returned by fit_PCA).

PC_coefs : numpy.ndarray

2D array of PC coefficients (returned by fit_PCA).

method : str

See fit method.

n_iters : int or None

See fit method.

_compute_corrs_and_scores(X, df_samp_ann, log_file, prog_bar, log_message)

Conduct one iteration of PC optimization by computing and scoring correlations. Scores are computed separately for each sample group, and mean scores are also returned.

Parameters:
X : numpy.ndarray

Training data, rows = samples, columns = features.

df_samp_ann : pandas.DataFrame

See fit method.

log_file : file

Log file opened for writing.

prog_bar : tqdm.std.tqdm or None

Progress bar (will be updated) or None.

log_message : str or None

Title of the log message. If None, nothing will be printed to log.

Returns:
iter_scores : dict

A dict of the following structure:

{
    sample_group_name : {
        ref_feature_coll : {
            "score training" : float,
            "score validation" : float,
            "score all" : float,
        }
    }
}

main_score : float

Mean of training scores across all sample groups and reference collections, i.e., iter_scores["mean"]["mean"]["score training"].

_best_iter_scores_to_df(best_iter_scores, used_PCs)

Convert dict of scores into data frame.

Parameters:
best_iter_scores : list

List of dicts with scores corresponding to the iterations of optimization. Each element of the list should have the format specified by the return dict of the _compute_corrs_and_scores method.

used_PCs : list

List of PC indices corresponding to best_iter_scores.

Returns:
pandas.DataFrame

Pandas data frame with scores in columns and iterations on rows.

  • Index: iteration number. Starts with 0 (raw dataset before correction).

  • Column PC: index of the PC being regressed out at each iteration. The first row always has NaN (no correction).

  • Score columns: these columns have the following format: "{sample_group_name};{ref_feature_coll};{training/validation/all}".

_make_best_iter_scores_plot(df_best_iter_scores, peak_iter, df_samp_ann)

Make optimization score line plots. If there is one sample group, a single plot named "fit.png" will be produced. If there are multiple groups, one plot per group will be made ("fit.{group_name}.png"), including the group with mean values.

Parameters:
df_best_iter_scores : pandas.DataFrame

Data frame generated using _best_iter_scores_to_df method.

peak_iter : int

This number is used to draw a vertical dashed red line at the selected early-stopping iteration.

df_samp_ann : pandas.DataFrame

See fit method.

_compute_raw_and_clean_corrs(df_data, df_samp_ann, samp_group)

Compute correlations before and after residualizing PC confounders for a given group of samples.

Parameters:
df_data : pandas.DataFrame

Input data. Index = samples, columns = feature ids. See transform for details.

df_samp_ann : pandas.DataFrame

A data frame providing group annotation of samples. See transform for details.

samp_group : str

One of the sample groups in df_samp_ann.

Returns:
corrs_raw : numpy.ndarray

1D array of correlations (diagonal 1s are not included).

corrs_clean : numpy.ndarray

Same as corrs_raw, but with cleaned correlations.

_check_df_data(df_data)

Check validity of input data.

Parameters:
df_data : pandas.DataFrame

Input data. Index = samples, columns = feature ids. See fit method.

Raises:
AssertionError

Several possible reasons:

  • Feature ids do not match ones in feature annotation.

  • There is a gene with constant expression.

_check_df_samp_ann(df_samp_ann, df_data, after_training=False)

Check validity of sample annotation.

Parameters:
df_samp_ann : pandas.DataFrame or None

Sample annotation data. See fit method.

df_data : pandas.DataFrame

Input data. Index = samples, columns = feature ids. See fit method.

after_training : bool

Whether this method is called after training.

Raises:
AssertionError

Several possible reasons:

  • Sample names in df_samp_ann and df_data do not match.

  • df_samp_ann lacks group column.

  • If after_training == True, the method will also check that the group names of df_samp_ann match those in mean_centerizer.