corradjust.corradjust module
- class corradjust.corradjust.CorrAdjust(df_feature_ann, ref_feature_colls, out_dir, winsorize=0.05, shuffle_feature_names=False, metric='enrichment-based', min_pairs_to_score=1000, random_seed=17, title=None, verbose=True)
Bases: object
The main CorrAdjust class.
- Parameters:
- df_feature_annpandas.DataFrame
A data frame with index and 2 columns providing annotation for features.
Index: unique feature ids that are also used in the data table. For example, ENSEMBL gene ids or small RNA license plates.
Column feature_name: user-friendly names of the features, which are also used in the reference .gmt files. For example, gene symbols or "hsa-miR-xx-5p" notation for miRNAs.
Column feature_type: discrete set of feature types. E.g., if you are analyzing only mRNA-seq data, put "mRNA" for all features; if you are integrating miRNA, mRNA, or any other data type, you could use more than one type (e.g., "miRNA" and "mRNA"). Feature types shouldn’t contain the "-" character.
- ref_feature_collsdict
A dictionary with information about reference feature set collections. Each collection is an item of ref_feature_colls with key being collection name (string) and value being another dict with the following structure.
{
"path": str
Relative or absolute path to the .gmt file with reference feature sets.
"sign": {"positive", "negative", "absolute"}
Determines how to rank feature-feature correlations.
"absolute": absolute values of correlations (highest absolute correlations go first).
"positive": descending sorting of correlations (most positive correlations go first).
"negative": ascending sorting of correlations (most negative correlations go first).
"feature_pair_types": list
List of strings showing types of correlations you are interested in. For example, ["mRNA-mRNA"] or ["miRNA-mRNA"].
"high_corr_frac": float
Fraction of all feature pairs with most highly ranked correlations that will be used for scoring (alpha in the paper’s notation).
}
- out_dirstr
Path to the directory where the output files will be stored. The directory will be created if it doesn’t exist.
- winsorizefloat or None, optional, default=0.05
If float, specifies the fraction of the lowest and the highest data values that will be winsorized. If None, no winsorizing will be applied. Winsorizing is train/test-set friendly, i.e., test data has no influence on training data.
- shuffle_feature_namesbool or list, optional, default=False
If True, the feature_name column of df_feature_ann will be randomly permuted. If a list of feature types, e.g., ["mRNA"], permutation will be conducted only within the corresponding feature types. If False, no permutations will be conducted.
- metric{"enrichment-based", "BP@K"}, optional, default="enrichment-based"
Metric for evaluating feature correlations (see CorrAdjust paper for details). Please use the default option unless you are certain that you need balanced precision at K (BP@K).
- min_pairs_to_scoreint, optional, default=1000
Minimum number of total pairs containing a feature (N_j in the paper notation) to include the feature in the score computations. The inequality must hold in both the training and validation pair sets.
- random_seedint, optional, default=17
Random seed used by PCA and other non-deterministic procedures.
- titlestr or None, optional, default=None
If string, it will be shown in top-left corner of all generated plots.
- verbosebool, optional, default=True
Whether to print progress messages to stderr.
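To make the ref_feature_colls structure described above concrete, here is a minimal sketch that builds a dictionary with two collections and checks it with a small helper. The collection names, .gmt paths, high_corr_frac values, and the validate_ref_feature_colls helper are all hypothetical and are not part of the package; the constructor may perform different or additional checks.

```python
# Hypothetical collection names, paths, and fractions, for illustration only.
ref_feature_colls = {
    "GO Biological Process": {
        "path": "ref/go_bp.gmt",
        "sign": "absolute",
        "feature_pair_types": ["mRNA-mRNA"],
        "high_corr_frac": 0.01,
    },
    "miRNA targets": {
        "path": "ref/mirna_targets.gmt",
        "sign": "negative",
        "feature_pair_types": ["miRNA-mRNA"],
        "high_corr_frac": 0.05,
    },
}

REQUIRED_KEYS = {"path", "sign", "feature_pair_types", "high_corr_frac"}
ALLOWED_SIGNS = {"positive", "negative", "absolute"}


def validate_ref_feature_colls(colls, feature_types):
    """Illustrative check (not part of CorrAdjust); returns a list of problems."""
    problems = []
    for name, coll in colls.items():
        missing = REQUIRED_KEYS - set(coll)
        if missing:
            problems.append(f"{name}: missing keys {sorted(missing)}")
            continue
        if coll["sign"] not in ALLOWED_SIGNS:
            problems.append(f"{name}: unknown sign {coll['sign']!r}")
        if not 0 < coll["high_corr_frac"] <= 1:
            problems.append(f"{name}: high_corr_frac must be in (0, 1]")
        for pair_type in coll["feature_pair_types"]:
            # Feature types contain no "-", so splitting on it is unambiguous.
            parts = pair_type.split("-")
            if len(parts) != 2 or not set(parts) <= set(feature_types):
                problems.append(f"{name}: bad pair type {pair_type!r}")
    return problems


problems = validate_ref_feature_colls(ref_feature_colls, {"mRNA", "miRNA"})
print(problems)
```

The pair-type check also shows why feature types shouldn’t contain the "-" character: a pair type such as "miRNA-mRNA" can then be split into its two feature types without ambiguity.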
- Attributes:
- confounder_PCslist
The list of confounder PC indices (0-based). Populated by calling fit method.
- PCA_modelsklearn.decomposition.PCA
PCA instance fit on training data. Populated by calling fit and fit_PCA methods.
- winsorizerWinsorizer
Winsorizer instance fit on training data. Populated by calling fit and fit_PCA methods.
- mean_centerizerMeanCenterizer
MeanCenterizer instance fit on training data. Populated by calling fit and fit_PCA methods.
- corr_scorerCorrScorer
Instance of CorrScorer. Populated by constructor.
- out_dirstr
Output directory. Populated by constructor.
- titlestr
Title added to the top-left corner of the plots. Populated by constructor.
- metric{“enrichment-based”, “BP@K”}
Metric for evaluating feature correlations. Populated by constructor.
- random_seedint
Random seed. Populated by constructor.
- verbosebool
Whether to print progress messages. Populated by constructor.
- fit(df_data, df_samp_ann=None, method='greedy', n_PCs=None, n_iters=None)
Find optimal set of PCs to regress out. Calling this method sets confounder_PCs attribute.
It also generates several files in out_dir:
fit.tsv: file with the scores across the optimization trajectory.
fit.{sample_group}.png: visualization of fit.tsv for each sample group.
fit.log: detailed log file updated at each iteration.
- Parameters:
- df_datapandas.DataFrame
Training data. Index = samples, columns = feature ids. Data should be normalized so that feature values are comparable between samples. Feature ids should perfectly match the ones used in df_feature_ann while initializing the object.
- df_samp_annpandas.DataFrame or None, optional, default=None
A data frame with index and one column providing group annotation of samples.
Index: sample names (should match index of df_data).
Column group: discrete set of group names (e.g., "normal" and "tumor"). Group name cannot be "mean".
If df_samp_ann is None, the method automatically generates it with one sample group called all_samp.
- method{"greedy", "first-n"}, optional, default="greedy"
Optimization method.
"greedy" (recommended default): searches for the best PC to remove on each iteration (quadratic complexity).
"first-n": iteratively removes PCs based on % variance they explain (linear complexity). This is the concept used by the sva_network function of the sva package.
- n_PCsint or None, optional, default=None
Number of PCs to choose from during the optimization. By default (None), the number of PCs is estimated as the knee of the % cumulative variance plot.
- n_itersint or None, optional, default=None
How many optimization iterations to perform. If None, n_iters is set to n_PCs.
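The difference between the two optimization methods can be sketched with a toy scoring function. This is not the package's implementation: greedy_search, first_n_search, toy_score, and the GAIN table are hypothetical, and taking the best-scoring prefix of the trajectory as the final confounder set is a simplification made for this sketch only.

```python
def greedy_search(n_pcs, n_iters, score_fn):
    """At each iteration, try every unused PC and keep the best-scoring one."""
    chosen, trajectory, n_evals = [], [], 0
    for _ in range(n_iters):
        best_pc, best_score = None, None
        for pc in range(n_pcs):
            if pc in chosen:
                continue
            score = score_fn(chosen + [pc])
            n_evals += 1
            if best_score is None or score > best_score:
                best_pc, best_score = pc, score
        chosen.append(best_pc)
        trajectory.append(best_score)
    return chosen, trajectory, n_evals


def first_n_search(n_pcs, n_iters, score_fn):
    """Add PCs in order of explained variance (PC 0, then PC 1, ...)."""
    chosen, trajectory, n_evals = [], [], 0
    for pc in range(min(n_pcs, n_iters)):
        chosen.append(pc)
        trajectory.append(score_fn(chosen))
        n_evals += 1
    return chosen, trajectory, n_evals


# Hypothetical toy score: removing PCs 0 and 2 helps, removing 1 and 3 hurts.
GAIN = {0: 1.0, 1: -0.5, 2: 3.0, 3: -1.0}


def toy_score(pcs):
    return sum(GAIN[pc] for pc in pcs)


greedy_pcs, greedy_traj, greedy_evals = greedy_search(4, 4, toy_score)
first_pcs, first_traj, first_evals = first_n_search(4, 4, toy_score)

# Simplification of this sketch: keep PCs up to the best-scoring iteration.
peak = greedy_traj.index(max(greedy_traj))
confounder_pcs = greedy_pcs[: peak + 1]

print(greedy_pcs, greedy_evals)  # 10 score evaluations for 4 PCs
print(first_pcs, first_evals)    # 4 score evaluations for 4 PCs
```

With 4 candidate PCs the greedy search evaluates the score 4 + 3 + 2 + 1 = 10 times, while the first-n search evaluates it 4 times, which is the quadratic versus linear behavior noted above.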
- fit_PCA(df_data, df_samp_ann=None, plot_var=False)
Compute principal components of training data. Before computing PCA, the data are centered and optionally winsorized in a train/test-friendly way.
Calling this method sets PCA_model attribute.
- Parameters:
- df_datapandas.DataFrame
Training data. Index = samples, columns = feature ids. See fit method.
- df_samp_annpandas.DataFrame, optional, default=None
A data frame providing group annotation of samples. See fit method.
- plot_varbool, optional, default=False
Whether to make plots with % variance explained. The plots will be saved as "PCA_var.png" in out_dir.
- Returns:
- PCsnumpy.ndarray
Array of principal components with shape (n_samples, n_components).
- PC_coefsnumpy.ndarray
Array of PC coefficients with shape (n_components, n_features).
- n_PCs_kneeint
Number of PCs estimated using knee (a.k.a. elbow) method.
- df_datapandas.DataFrame
Centered and (optionally) winsorized training data.
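The idea of train/test-friendly preprocessing can be illustrated for a single feature: clipping limits and the mean are learned from training values only and then reused unchanged on test values. This is a sketch under assumptions. The helpers fit_winsor_limits and apply_winsor are hypothetical, the exact quantile rule of the package's Winsorizer may differ, and sample groups are ignored here.

```python
def fit_winsor_limits(train_column, frac):
    """Clipping limits for one feature, learned from training values only."""
    ordered = sorted(train_column)
    k = int(frac * len(ordered))  # number of values clipped at each tail
    return ordered[k], ordered[len(ordered) - 1 - k]


def apply_winsor(column, limits):
    """Clip values to the previously fitted limits."""
    low, high = limits
    return [min(max(value, low), high) for value in column]


train = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0]
test = [-50.0, 5.0, 500.0]

# Fit on training data only.
limits = fit_winsor_limits(train, frac=0.1)
train_w = apply_winsor(train, limits)
train_mean = sum(train_w) / len(train_w)

# Reuse the fitted limits and mean on test data; test values never change them.
test_w = apply_winsor(test, limits)
train_centered = [value - train_mean for value in train_w]
test_centered = [value - train_mean for value in test_w]

print(limits, train_mean, test_centered)
```

Because the test values never enter fit_winsor_limits or the mean, the extreme test value 500.0 is clipped to the training limit instead of shifting it.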
- transform(df_data, df_samp_ann=None)
Residualize identified confounder PCs from input data. This method should be called after fit, or after manually calling fit_PCA and setting confounder_PCs attribute.
- Parameters:
- df_datapandas.DataFrame
Input data. Index = samples, columns = feature ids. Feature ids should perfectly match the ones used in df_feature_ann while initializing the object.
- df_samp_annpandas.DataFrame or None, optional, default=None
A data frame providing group annotation of samples. This argument is mandatory if it was used during fit call.
- Returns:
- df_data_cleanpandas.DataFrame
Cleaned data.
- df_rsquaredspandas.DataFrame
Data frame with R^2 values for each feature.
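Residualization and the per-feature R^2 can be sketched for one feature and one confounder using ordinary least squares with an intercept. This is only an illustration: the residualize helper and the toy values are hypothetical, and the package regresses out all selected confounder PCs at once rather than a single one.

```python
def residualize(y, x):
    """Regress y on one confounder x (with intercept); return residuals and R^2."""
    n = len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    residuals = [(yi - mean_y) - slope * (xi - mean_x) for xi, yi in zip(x, y)]
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_resid = sum(r ** 2 for r in residuals)
    return residuals, 1.0 - ss_resid / ss_total


# Hypothetical values of one confounder PC and one feature across 5 samples.
pc = [-2.0, -1.0, 0.0, 1.0, 2.0]
feature = [-3.0, -3.0, 0.0, 3.0, 3.0]

clean, r_squared = residualize(feature, pc)
print([round(value, 3) for value in clean], round(r_squared, 3))
```

The residuals are uncorrelated with the confounder by construction, and R^2 is the fraction of the feature's variance that the confounder explained.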
- compute_confounders(df_data, df_samp_ann=None)
Compute the data frame with confounders from input data. This method should be called after fit, or after manually calling fit_PCA and setting confounder_PCs attribute.
- Parameters:
- df_datapandas.DataFrame
Input data. Index = samples, columns = feature ids. Feature ids should perfectly match the ones used in df_feature_ann while initializing the object.
- df_samp_annpandas.DataFrame or None, optional, default=None
A data frame providing group annotation of samples. This argument is mandatory if it was used during fit call.
- Returns:
- pandas.DataFrame
Data frame with values of confounder PCs. Note that PC indices are 0-based.
- compute_feature_scores(df_data, df_samp_ann=None, samp_group=None, pairs_subset='all')
Compute enrichment scores and p-values for feature-feature correlations in given data.
- Parameters:
- df_datapandas.DataFrame
Input data. Index = samples, columns = feature ids. Feature ids should perfectly match the ones used in df_feature_ann while initializing the object.
- df_samp_annpandas.DataFrame or None, optional, default=None
A data frame providing group annotation of samples. This argument is mandatory if it was used during fit call.
- samp_groupstr or None, default=None
If df_samp_ann is passed, this argument is mandatory and should refer to one of the sample groups in df_samp_ann.
- pairs_subset{“all”, “training”, “validation”}, optional, default=”all”
Which set of feature pairs to use for computing scores.
- Returns:
- dict
Dict with data frames containing feature-wise enrichment scores and p-values. It has the following structure:
{"Clean": {ref_feature_coll: pandas.DataFrame}, "Raw": {…}}.
- make_volcano_plot(feature_scores, plot_filename, **plot_kwargs)
Make volcano plot.
- Parameters:
- feature_scoresdict
Dict produced by compute_feature_scores method.
- plot_filenamestr
Path to the figure (with extension, e.g., .png).
- **plot_kwargs
Other keyword arguments controlling plot aesthetics passed to VolcanoPlotter.
- Returns:
- VolcanoPlotter
Instance of VolcanoPlotter. At this point, the plot is already saved in the file, so you can change it through fig/ax and re-save.
- make_corr_distr_plot(df_data, plot_filename, df_samp_ann=None, samp_group=None, pairs_subset='all', **plot_kwargs)
Visualize distribution of feature-feature correlations before and after PC correction.
- Parameters:
- df_datapandas.DataFrame
Input data. Index = samples, columns = feature ids. Feature ids should perfectly match the ones used in df_feature_ann while initializing the object.
- plot_filenamestr
Path to the figure (with extension, e.g., .png).
- df_samp_annpandas.DataFrame or None, optional, default=None
A data frame providing group annotation of samples. This argument is mandatory if it was used during fit call.
- samp_groupstr or None, default=None
If df_samp_ann is passed, this argument is mandatory and should refer to one of the sample groups in df_samp_ann.
- pairs_subset{“all”, “training”, “validation”}, optional, default=”all”
Which set of feature pairs to use for computing scores.
- **plot_kwargs
Other keyword arguments controlling plot aesthetics passed to CorrDistrPlotter.
- Returns:
- CorrDistrPlotter
Instance of CorrDistrPlotter. At this point, the plot is already saved in the file, so you can change it through fig/ax and re-save.
- export_corrs(df_data, filename, df_samp_ann=None, samp_group=None, chunk_size=1000000)
Export feature-feature correlations to file.
- Parameters:
- df_datapandas.DataFrame
Input data. Index = samples, columns = feature ids. Feature ids should perfectly match the ones used in df_feature_ann while initializing the object.
- filenamestr
Path to the export table with .tsv extension.
- df_samp_annpandas.DataFrame or None, optional, default=None
A data frame providing group annotation of samples. This argument is mandatory if it was used during fit call.
- samp_groupstr or None, default=None
If df_samp_ann is passed, this argument is mandatory and should refer to one of the sample groups in df_samp_ann.
- chunk_sizeint, optional, default=1000000
How many rows are exported in a single chunk.
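The role of chunk_size can be sketched as follows: feature pairs are generated lazily and written in batches, so only one chunk of rows is held in memory at a time. The write_pairs_in_chunks helper, the column names, and the toy correlation values are hypothetical; the real export file may have a different layout and more columns.

```python
import io
from itertools import combinations, islice


def write_pairs_in_chunks(feature_ids, corr_of, handle, chunk_size):
    """Write one row per feature pair, buffering at most chunk_size rows."""
    handle.write("feature_id1\tfeature_id2\tcorr\n")
    pairs = combinations(feature_ids, 2)  # lazy iterator over unique pairs
    n_chunks = 0
    while True:
        chunk = list(islice(pairs, chunk_size))
        if not chunk:
            break
        handle.writelines(
            f"{a}\t{b}\t{corr_of(a, b):.3f}\n" for a, b in chunk
        )
        n_chunks += 1
    return n_chunks


# Hypothetical correlations for 4 features (6 unique pairs).
toy_corrs = {
    frozenset(("g1", "g2")): 0.9,
    frozenset(("g1", "g3")): -0.2,
    frozenset(("g1", "g4")): 0.1,
    frozenset(("g2", "g3")): 0.4,
    frozenset(("g2", "g4")): -0.7,
    frozenset(("g3", "g4")): 0.0,
}

buffer = io.StringIO()  # stands in for an open .tsv file
n_chunks = write_pairs_in_chunks(
    ["g1", "g2", "g3", "g4"],
    lambda a, b: toy_corrs[frozenset((a, b))],
    buffer,
    chunk_size=4,
)
exported_lines = buffer.getvalue().splitlines()
print(n_chunks, len(exported_lines))
```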
- _search_confounder_PCs(df_data, df_samp_ann, PCs, PC_coefs, method, n_iters)
Iteratively residualize PCs from training data to maximize number of reference pairs among highly correlated feature pairs.
The files generated by this method are described in the documentation of the fit method, which calls it.
- Parameters:
- df_datapandas.DataFrame
See fit method.
- df_samp_annpandas.DataFrame or None
See fit method.
- PCsnumpy.ndarray
2D array of PCs (returned by fit_PCA).
- PC_coefsnumpy.ndarray
2D array of PC coefficients (returned by fit_PCA).
- methodstr
See fit method.
- n_itersint or None
See fit method.
- _compute_corrs_and_scores(X, df_samp_ann, log_file, prog_bar, log_message)
Conduct one iteration of PC optimization by computing and scoring correlations. Scores are computed separately for each sample group; mean scores are also returned.
- Parameters:
- Xnumpy.ndarray
Training data, rows = samples, columns = features.
- df_samp_annpandas.DataFrame
See fit method.
- log_filefile
Writing-open log file.
- prog_bartqdm.std.tqdm or None
Progress bar (will be updated) or None.
- log_messagestr or None
Title of the log message. If None, nothing will be printed to log.
- Returns:
- iter_scoresdict
A dict of the following structure:
{sample_group_name: {ref_feature_coll: {"score training": float, "score validation": float, "score all": float}}}.
- main_scorefloat
Mean of training scores across all sample groups and reference collections, i.e., iter_scores["mean"]["mean"]["score training"].
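The nested structure of iter_scores and the "mean" entries can be sketched with a toy dictionary. The add_mean_entries helper, the group and collection names, and the score values are hypothetical, and averaging first over collections and then over groups is an assumption of this sketch; the package may aggregate differently.

```python
SCORE_KEYS = ("score training", "score validation", "score all")


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def add_mean_entries(scores):
    """Return a copy of scores with "mean" collections and a "mean" group added."""
    groups = list(scores)
    colls = list(scores[groups[0]])
    out = {g: {c: dict(scores[g][c]) for c in colls} for g in groups}
    # Mean over reference collections within each sample group.
    for g in groups:
        out[g]["mean"] = {
            k: mean(scores[g][c][k] for c in colls) for k in SCORE_KEYS
        }
    # Mean over sample groups for each collection, including "mean".
    out["mean"] = {
        c: {k: mean(out[g][c][k] for g in groups) for k in SCORE_KEYS}
        for c in colls + ["mean"]
    }
    return out


# Hypothetical scores for two sample groups and two reference collections.
raw_scores = {
    "normal": {
        "GO": {"score training": 2.0, "score validation": 1.0, "score all": 1.5},
        "miRNA": {"score training": 4.0, "score validation": 3.0, "score all": 3.5},
    },
    "tumor": {
        "GO": {"score training": 6.0, "score validation": 5.0, "score all": 5.5},
        "miRNA": {"score training": 8.0, "score validation": 7.0, "score all": 7.5},
    },
}

iter_scores = add_mean_entries(raw_scores)
main_score = iter_scores["mean"]["mean"]["score training"]
print(main_score)
```

Since "mean" is used as a key at the sample-group level, a real sample group with that name would collide with it, which is consistent with the rule that a group name cannot be "mean".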
- _best_iter_scores_to_df(best_iter_scores, used_PCs)
Convert dict of scores into data frame.
- Parameters:
- best_iter_scoreslist
List of dicts with scores corresponding to the iterations of optimization. Each element of the list should have the format of the dict returned by the _compute_corrs_and_scores method.
- used_PCslist
List of PC indices corresponding to best_iter_scores.
- Returns:
- pandas.DataFrame
Pandas data frame with scores in columns and iterations in rows.
Index: iteration number. Starts with 0 (raw dataset before correction).
Column PC: index of the PC being regressed out on each iteration. The first row always has NaN (no correction).
Score columns: these columns have the following format: "{sample_group_name};{ref_feature_coll};{training/validation/all}".
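The score column naming scheme can be unpacked with a simple split on ";". The sketch below uses hypothetical column names and a hypothetical parse_score_column helper, and it assumes that neither group names nor collection names contain ";".

```python
def parse_score_column(column):
    """Split a score column name into (sample group, collection, pairs subset)."""
    sample_group, ref_feature_coll, pairs_subset = column.split(";")
    return sample_group, ref_feature_coll, pairs_subset


# Hypothetical column names following the format described above.
columns = ["PC", "tumor;GO;training", "tumor;GO;validation", "mean;mean;all"]

score_columns = [c for c in columns if c != "PC"]
parsed = [parse_score_column(c) for c in score_columns]
training_columns = [
    c for c in score_columns if parse_score_column(c)[2] == "training"
]
print(parsed)
print(training_columns)
```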
- _make_best_iter_scores_plot(df_best_iter_scores, peak_iter, df_samp_ann)
Make optimization score line plots. If there is one sample group, only one plot with the name "fit.png" will be produced. If there are multiple groups, one plot per group will be made: "fit.{group_name}.png", including a group with mean values.
- Parameters:
- df_best_iter_scorespandas.DataFrame
Data frame generated using _best_iter_scores_to_df method.
- peak_iterint
This number is used to draw vertical dashed red line at the selected early stopping iteration.
- df_samp_annpandas.DataFrame
See fit method.
- _compute_raw_and_clean_corrs(df_data, df_samp_ann, samp_group)
Compute correlations before and after residualizing PC confounders for a given group of samples.
- Parameters:
- df_datapandas.DataFrame
Input data. Index = samples, columns = feature ids. See transform for details.
- df_samp_annpandas.DataFrame
A data frame providing group annotation of samples. See transform for details.
- samp_groupstr
One of the sample groups in df_samp_ann.
- Returns:
- corrs_rawnumpy.ndarray
1D array of correlations (diagonal 1s are not included).
- corrs_cleannumpy.ndarray
Same but with cleaned correlations.
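A 1D vector of feature-feature correlations without the diagonal holds one value per unordered feature pair, i.e., n(n-1)/2 values for n features. The sketch below illustrates this with plain Python; the pearson and flat_corrs helpers and the toy data are hypothetical, and both the use of Pearson correlation and the pair ordering are assumptions rather than documented behavior of the package.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def flat_corrs(columns):
    """One correlation per unordered feature pair; self-pairs are skipped."""
    feature_ids = list(columns)
    return [
        (a, b, pearson(columns[a], columns[b]))
        for a, b in combinations(feature_ids, 2)
    ]


# Hypothetical data: feature id -> values across 4 samples.
data = {
    "f1": [1.0, 2.0, 3.0, 4.0],
    "f2": [2.0, 4.0, 6.0, 8.0],
    "f3": [4.0, 3.0, 2.0, 1.0],
}

corrs = flat_corrs(data)
print([(a, b, round(r, 3)) for a, b, r in corrs])
```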
- _check_df_data(df_data)
Check validity of input data.
- Parameters:
- df_datapandas.DataFrame
Input data. Index = samples, columns = feature ids. See fit method.
- Raises:
- AssertionError
Several possible reasons:
Feature ids do not match ones in feature annotation.
There is a gene with constant expression.
- _check_df_samp_ann(df_samp_ann, df_data, after_training=False)
Check validity of sample annotation.
- Parameters:
- df_samp_annpandas.DataFrame or None
Sample annotation data. See fit method.
- df_datapandas.DataFrame
Input data. Index = samples, columns = feature ids. See fit method.
- after_trainingbool
Whether this method is called after training.
- Raises:
- AssertionError
Several possible reasons:
Sample names in df_samp_ann and df_data do not match.
df_samp_ann lacks the group column.
If after_training == True, the method will also check that group names of df_samp_ann match those in mean_centerizer.