2. Running CorrAdjust#

2.1. Training the model#

After the input data are prepared, we are ready to initialize the main CorrAdjust object. See the full list of accepted parameters in the API reference.

[4]:
from corradjust import CorrAdjust

model = CorrAdjust(
    df_feature_ann,
    ref_feature_colls,
    "out_data/GTEx_Whole_Blood"
)
2025-02-28 19:45:06.114508 | Loading Canonical Pathways reference collection...
2025-02-28 19:45:11.391956 | Loading Gene Ontology reference collection...

The fit method trains the CorrAdjust instance. It identifies a subset of confounder PCs that need to be regressed out from the data to maximize the enrichment of the provided reference feature pairs among the highly correlated pairs. See the full list of accepted parameters in the API reference.

[5]:
# We set n_PCs to 19 to reproduce the plot from the paper
# The default value works fine as well
model.fit(df_data_train, n_PCs=19)
2025-02-28 19:45:13.517653 | Computing PCA...
2025-02-28 19:45:15.113416 | Starting PC optimization...
100%|██████████████████████████████████████████████████████████████| 191/191 [12:40<00:00,  3.98s/it]
../_images/tutorial_standard_run_5_1.png

The fit method automatically splits the feature pairs into 50% training and 50% validation sets. The training set of feature pairs is used to select a PC at each iteration of the algorithm (the one maximizing the score), while the validation set is used to determine the final number of PCs by early stopping. Specifically, the dashed red line is drawn at the iteration after which there is little or no improvement in the validation set score.

The above optimization curve is automatically saved to fit.all_samp.png; all_samp refers to the fact that all training samples were used (see Advanced example with several sample groups). The file fit.tsv contains the data from the plot in tabular format, and fit.log contains the full optimization log.
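The exact early-stopping rule is internal to fit, but conceptually it resembles the following sketch, which scans a list of per-iteration validation scores and returns the last iteration that still yields a meaningful improvement (the function name and tolerance value here are illustrative, not part of the CorrAdjust API):

```python
def early_stop_iteration(val_scores, tol=1e-3):
    """Return the last iteration index after which the validation
    score never improves by more than `tol`.
    Conceptual sketch only, not CorrAdjust's internal implementation."""
    best_iter = 0
    best_score = val_scores[0]
    for i, score in enumerate(val_scores):
        if score > best_score + tol:
            best_score = score
            best_iter = i
    return best_iter

# Validation scores that rise and then plateau, as in the plot above
scores = [0.14, 0.25, 0.52, 0.58, 0.57, 0.58, 0.58]
early_stop_iteration(scores)  # -> 3
```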

The early-stopping procedure might terminate too early or too late. You can always manually refine the list of selected confounder PCs:

[6]:
# Open the table representation of the above plot (PC numbers are 0-based)
# 0th row corresponds to the raw data
fit_data = pd.read_csv(
    "out_data/GTEx_Whole_Blood/fit.tsv",
    sep="\t", index_col=0,
    dtype={"PC": "Int64"}
)
display(fit_data)
# Uncomment below to set the subset of confounder PCs manually
#model.confounder_PCs = [0, 1, 2, 17, 12]
PC all_samp;Canonical Pathways;training all_samp;Canonical Pathways;validation all_samp;Canonical Pathways;all all_samp;Gene Ontology;training all_samp;Gene Ontology;validation all_samp;Gene Ontology;all all_samp;mean;training all_samp;mean;validation all_samp;mean;all mean;Canonical Pathways;training mean;Canonical Pathways;validation mean;Canonical Pathways;all mean;Gene Ontology;training mean;Gene Ontology;validation mean;Gene Ontology;all mean;mean;training mean;mean;validation mean;mean;all
Iteration
0 <NA> 0.187871 0.179282 0.484495 0.096320 0.097821 0.295405 0.142095 0.138552 0.389950 0.187871 0.179282 0.484495 0.096320 0.097821 0.295405 0.142095 0.138552 0.389950
1 0 0.278803 0.297132 0.730810 0.208563 0.198734 0.562689 0.243683 0.247933 0.646749 0.278803 0.297132 0.730810 0.208563 0.198734 0.562689 0.243683 0.247933 0.646749
2 1 0.595418 0.596665 1.359490 0.434741 0.439244 1.037232 0.515079 0.517955 1.198361 0.595418 0.596665 1.359490 0.434741 0.439244 1.037232 0.515079 0.517955 1.198361
3 2 0.719489 0.705782 1.590421 0.477850 0.448394 1.090163 0.598670 0.577088 1.340292 0.719489 0.705782 1.590421 0.477850 0.448394 1.090163 0.598670 0.577088 1.340292
4 17 0.716559 0.702656 1.583160 0.477467 0.446032 1.086927 0.597013 0.574344 1.335043 0.716559 0.702656 1.583160 0.477467 0.446032 1.086927 0.597013 0.574344 1.335043
5 12 0.720507 0.713810 1.596511 0.467544 0.451943 1.079581 0.594025 0.582876 1.338046 0.720507 0.713810 1.596511 0.467544 0.451943 1.079581 0.594025 0.582876 1.338046
6 11 0.711772 0.714599 1.583469 0.456435 0.449269 1.065692 0.584103 0.581934 1.324581 0.711772 0.714599 1.583469 0.456435 0.449269 1.065692 0.584103 0.581934 1.324581
7 18 0.703086 0.708958 1.569668 0.448834 0.441979 1.047060 0.575960 0.575468 1.308364 0.703086 0.708958 1.569668 0.448834 0.441979 1.047060 0.575960 0.575468 1.308364
8 14 0.695889 0.694057 1.543607 0.440029 0.431120 1.024089 0.567959 0.562589 1.283848 0.695889 0.694057 1.543607 0.440029 0.431120 1.024089 0.567959 0.562589 1.283848
9 13 0.676388 0.682232 1.510955 0.434707 0.423115 1.009324 0.555548 0.552673 1.260140 0.676388 0.682232 1.510955 0.434707 0.423115 1.009324 0.555548 0.552673 1.260140
10 16 0.661971 0.668309 1.483583 0.422454 0.412582 0.981892 0.542213 0.540446 1.232737 0.661971 0.668309 1.483583 0.422454 0.412582 0.981892 0.542213 0.540446 1.232737
11 9 0.649132 0.650844 1.451423 0.412124 0.395790 0.951846 0.530628 0.523317 1.201635 0.649132 0.650844 1.451423 0.412124 0.395790 0.951846 0.530628 0.523317 1.201635
12 15 0.627831 0.636486 1.409620 0.393295 0.380669 0.911223 0.510563 0.508578 1.160422 0.627831 0.636486 1.409620 0.393295 0.380669 0.911223 0.510563 0.508578 1.160422
13 10 0.602309 0.618502 1.363589 0.382822 0.365800 0.884008 0.492565 0.492151 1.123799 0.602309 0.618502 1.363589 0.382822 0.365800 0.884008 0.492565 0.492151 1.123799
14 7 0.569647 0.572702 1.276895 0.373294 0.357365 0.860319 0.471470 0.465034 1.068607 0.569647 0.572702 1.276895 0.373294 0.357365 0.860319 0.471470 0.465034 1.068607
15 8 0.539522 0.544349 1.210512 0.348080 0.341005 0.808156 0.443801 0.442677 1.009334 0.539522 0.544349 1.210512 0.348080 0.341005 0.808156 0.443801 0.442677 1.009334
16 4 0.479903 0.491263 1.090110 0.307960 0.305060 0.729236 0.393931 0.398161 0.909673 0.479903 0.491263 1.090110 0.307960 0.305060 0.729236 0.393931 0.398161 0.909673
17 3 0.475440 0.500710 1.105622 0.208260 0.210637 0.542258 0.341850 0.355674 0.823940 0.475440 0.500710 1.105622 0.208260 0.210637 0.542258 0.341850 0.355674 0.823940
18 6 0.354778 0.364382 0.853504 0.150702 0.154227 0.423012 0.252740 0.259304 0.638258 0.354778 0.364382 0.853504 0.150702 0.154227 0.423012 0.252740 0.259304 0.638258
19 5 0.490470 0.492034 1.143655 0.384109 0.374775 0.891538 0.437290 0.433405 1.017596 0.490470 0.492034 1.143655 0.384109 0.374775 0.891538 0.437290 0.433405 1.017596

2.2. Computing and visualizing enrichments#

Once the confounder_PCs attribute is set, you can compute feature-wise enrichment tables for each of the input reference feature collections. See the full list of accepted parameters in the API reference.

This and the methods below work with arbitrary data that has the same features (columns) as the training set. All these methods use the PCA, mean-centering, and winsorization parameters learned from the training data (i.e., the data on which you called fit).

[7]:
feature_scores_train = model.compute_feature_scores(df_data_train)
feature_scores_test = model.compute_feature_scores(df_data_test)

# Each result is a dict, where the first key is "Raw" or "Clean"
# and the second key is the reference collection name
display(feature_scores_test["Clean"]["Canonical Pathways"])
2025-02-28 19:57:57.477039 | Computing raw correlations for all_samp...
2025-02-28 19:57:58.332963 | Computing corrected correlations for all_samp...
2025-02-28 19:58:06.111344 | Computing raw correlations for all_samp...
2025-02-28 19:58:06.942633 | Computing corrected correlations for all_samp...
feature_name ref_pairs@K ref_pairs@total enrichment balanced_precision pvalue padj
feature_id
ENSG00000114942.13 EEF1B2 117/332 280/7213 8.958467 0.929586 6.666928e-87 4.809522e-83
ENSG00000266472.5 MRPS21 114/377 267/7213 8.075151 0.917299 9.381586e-79 3.383938e-75
ENSG00000090263.15 MRPS33 121/478 267/7213 6.778080 0.897049 2.561231e-74 6.158907e-71
ENSG00000104408.9 EIF3E 84/177 306/7213 10.906809 0.951072 2.633657e-69 4.749801e-66
ENSG00000254772.9 EEF1G 91/130 599/7213 8.154059 0.958608 1.657549e-67 2.391511e-64
... ... ... ... ... ... ... ...
ENSG00000278828.1 HIST1H3H 0/0 0/0 NaN NaN NaN NaN
ENSG00000280071.3 CH507-9B2.3 0/0 0/0 NaN NaN NaN NaN
ENSG00000281106.2 LINC00282 0/0 0/0 NaN NaN NaN NaN
ENSG00000281991.1 TMEM265 0/0 0/0 NaN NaN NaN NaN
ENSG00000283787.1 PRR33 0/0 0/0 NaN NaN NaN NaN

10021 rows × 7 columns

Columns of the above table have the following meaning:

  • ref_pairs@K: the second number is the number of feature pairs involving the corresponding feature among the most highly correlated pairs (\(K_j\) in the paper’s notation). The first number is how many of these pairs are reference (\(k_j\) in the paper’s notation).

  • ref_pairs@total: the second number is the number of feature pairs involving the corresponding feature among all pairs (\(N_j\) in the paper’s notation). The first number is how many of these pairs are reference (\(n_j\) in the paper’s notation).

  • enrichment: essentially, the ratio of the ref_pairs@K and ref_pairs@total fractions. However, the value differs slightly from this simple ratio because we apply Bayesian regularization that shrinks the enrichment towards 1.0 to avoid problems with zero counts.

  • balanced_precision: a non-linearly scaled version of enrichment. Enrichment = 1 corresponds to a balanced precision of 0.5 (baseline), while the maximum possible enrichment corresponds to a balanced precision of 1.

  • pvalue: hypergeometric test p-value (as with enrichment, Bayesian regularization shrinks the p-value towards 1.0 to avoid problems with zero counts).

  • padj: p-value adjusted for multiple testing by the Benjamini-Hochberg procedure.

Features whose total number of pairs is less than the min_pairs_to_score parameter of CorrAdjust (default is 1000) are ignored during the enrichment/p-value computations.
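The unregularized enrichment ratio and the hypergeometric p-value can be reproduced by hand. Using the EEF1B2 row above (117 reference pairs among the top \(K_j = 332\) pairs, and 280 reference pairs among all \(N_j = 7213\) pairs), a stdlib-only sketch looks like this; the Bayesian regularization step is omitted, so the numbers differ slightly from the table:

```python
from math import comb

k_j, K_j = 117, 332    # reference pairs among the top-correlated pairs
n_j, N_j = 280, 7213   # reference pairs among all pairs for this feature

# Unregularized enrichment: ratio of the two fractions
enrichment = (k_j / K_j) / (n_j / N_j)   # ~9.08; the table shows 8.96
                                         # after shrinkage towards 1.0

# Exact hypergeometric tail P(X >= k_j): drawing K_j pairs out of N_j,
# n_j of which are reference
numer = sum(comb(n_j, i) * comb(N_j - n_j, K_j - i)
            for i in range(k_j, min(n_j, K_j) + 1))
pvalue = numer / comb(N_j, K_j)          # vanishingly small; the table's
                                         # 6.7e-87 includes regularization
```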

A convenient way to visualize this table is a volcano plot. The size of each point on the plot is the second number in the ref_pairs@K column. See the full list of accepted parameters in the API reference.

[8]:
model.make_volcano_plot(
    feature_scores_test,
    "volcano.test_samples.png",
    # Number of features to annotate (default is None)
    annotate_features=10
)
[8]:
<corradjust.plots.VolcanoPlotter at 0x7f5e83705ca0>
../_images/tutorial_standard_run_12_1.png

2.3. Visualizing distribution of correlations#

While the above plot shows a feature-centric view of the correction results, we can also look at the impact of CorrAdjust on the distribution of all feature-feature correlations. See the full list of accepted parameters in the API reference.

[9]:
model.make_corr_distr_plot(df_data_test, "corr_distr.test_samples.png")
2025-02-28 19:58:22.923334 | Computing raw correlations for all_samp...
2025-02-28 19:58:23.764165 | Computing corrected correlations for all_samp...
[9]:
<corradjust.plots.CorrDistrPlotter at 0x7f5e7969c590>
../_images/tutorial_standard_run_15_2.png

The left panels show histograms of correlations before and after correction – often, CorrAdjust shrinks the correlations towards 0.

The right panels show a more granular view of the distribution tail (which tail is shown depends on the sign parameter of the config). The dotted gray line corresponds to the high_corr_frac value. Before the correction, the threshold corresponding to 1% of the highest absolute correlations was 0.8; after cleaning, it dropped to 0.55.
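The threshold marked by the dotted line is simply the quantile of absolute pairwise correlations corresponding to high_corr_frac. On synthetic data it can be reproduced along these lines (the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 50))        # samples x features

corr = np.corrcoef(data, rowvar=False)   # feature-feature correlation matrix
iu = np.triu_indices_from(corr, k=1)     # unique pairs only
abs_corrs = np.abs(corr[iu])

high_corr_frac = 0.01                    # top 1% of |correlation|
threshold = np.quantile(abs_corrs, 1 - high_corr_frac)
```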

The above plots also distinguish between reference pairs (gene pairs with at least one shared pathway) and non-reference pairs. Correlations among reference pairs are, on average, higher than among non-reference ones, and this effect becomes stronger after the correction.

2.4. Obtaining corrected data and confounder matrices#

You can use the transform method to residualize the identified confounder PCs from any data. The resulting data frame can be exported to a file and distributed alongside the original raw data to facilitate reproducibility of your analysis.
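Conceptually, transform fits a linear regression of each feature on the selected confounder PCs and keeps the residuals; the per-feature R² of those regressions is what the second return value reports. A minimal numpy sketch of that idea (not CorrAdjust's exact implementation, which also applies the centering and winsorization learned during fit):

```python
import numpy as np

def residualize(data, pcs):
    """Regress each column of `data` on the confounder PCs and
    return (residuals, per-feature R^2). Illustrative sketch only."""
    X = np.column_stack([np.ones(len(pcs)), pcs])   # intercept + PCs
    beta, *_ = np.linalg.lstsq(X, data, rcond=None)
    residuals = data - X @ beta
    centered = data - data.mean(axis=0)
    r2 = 1 - (residuals**2).sum(axis=0) / (centered**2).sum(axis=0)
    return residuals, r2

# Toy data where features are driven by 3 confounder PCs plus noise
rng = np.random.default_rng(0)
pcs = rng.normal(size=(100, 3))                    # samples x PCs
data = pcs @ rng.normal(size=(3, 5)) + 0.1 * rng.normal(size=(100, 5))
clean, r2 = residualize(data, pcs)
```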

[10]:
# df_data_test_clean contains the clean data (residuals)
# df_rsquareds contains feature-wise R^2 scores of the PC regression models
df_data_test_clean, df_rsquareds = model.transform(df_data_test)
display(df_data_test_clean)
ENSG00000188976.10 ENSG00000187961.13 ENSG00000188290.10 ENSG00000187608.8 ENSG00000131591.17 ENSG00000186891.13 ENSG00000186827.10 ENSG00000078808.16 ENSG00000176022.4 ENSG00000160087.20 ... ENSG00000198712.1 ENSG00000228253.1 ENSG00000198899.2 ENSG00000198938.2 ENSG00000198840.2 ENSG00000212907.2 ENSG00000198886.2 ENSG00000198786.2 ENSG00000198695.2 ENSG00000198727.2
GTEX-1122O-0005-SM-5O99J 0.050231 -0.369322 2.239115 0.678527 0.026114 -0.571762 0.601355 0.148707 -0.002400 -0.026402 ... 0.130188 0.172971 0.062483 0.184332 -0.189881 0.314373 0.192234 0.562027 0.274883 0.408772
GTEX-113IC-0006-SM-5NQ9C -0.304865 -0.147708 2.680885 2.763680 0.272341 -0.803344 0.144900 -0.513377 -0.401348 -0.041847 ... -0.423008 -0.878231 -0.610733 -0.400349 -0.411132 -0.868399 -0.618605 -1.073067 -0.863422 -0.680554
GTEX-117XS-0005-SM-5PNU6 -0.037980 -0.642927 -1.242912 -0.240673 -0.277833 -0.398900 -0.717077 0.209971 0.076258 0.096425 ... 0.109895 0.374746 0.154929 0.141935 0.145463 0.469610 0.128637 0.370306 0.612194 0.231607
GTEX-1192W-0005-SM-5NQBQ 0.125747 -0.006353 1.015262 0.450212 0.198476 0.383210 -0.177790 -0.064182 0.235887 -0.382492 ... 0.441488 0.599074 0.401713 0.523564 0.317172 0.457247 0.484326 0.661974 0.730998 0.022344
GTEX-11DXW-0006-SM-5NQ7Y 0.076788 -0.138612 0.168263 -0.610861 -0.025928 -0.313764 0.198317 -0.157295 -0.295122 -0.245366 ... -0.005570 0.391351 0.283745 -0.048960 0.246081 0.109192 0.219166 -0.053995 -0.136555 0.340095
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
GTEX-ZVE2-0006-SM-51MRW -0.039524 1.113171 0.700570 -0.147131 0.108795 1.572345 2.492765 0.047323 0.178910 0.034573 ... -0.508031 -0.512369 -0.466586 -0.422382 -0.484379 -0.998519 -0.806124 -1.024816 -1.005599 -0.554363
GTEX-ZVT2-0005-SM-57WBW 0.092948 -0.089754 0.161043 -0.131826 0.113745 0.399181 -0.245921 0.174028 0.380953 -0.148566 ... 0.140526 0.523956 0.367210 0.122278 0.114591 0.268310 0.224413 0.169460 0.170323 0.477202
GTEX-ZVT4-0006-SM-57WB8 -0.173546 0.118193 -0.243303 0.724267 0.144558 0.941209 1.597995 -0.109322 -0.385729 -0.038125 ... 0.932081 1.095124 1.196443 0.656256 0.543247 0.555277 0.943353 0.787113 0.565947 0.784144
GTEX-ZVZP-0006-SM-51MSW 0.097312 0.029589 -0.303182 -0.378001 0.461521 1.080009 0.108689 0.251166 0.018420 0.072854 ... 0.780985 0.767920 0.952384 1.110271 0.386526 0.365610 0.964042 0.715266 0.994353 1.261183
GTEX-ZXES-0005-SM-57WCB 0.032520 -0.280395 -1.092568 0.130744 -0.247445 -0.475019 -0.650201 0.305872 0.204207 0.032983 ... -0.304109 0.067784 -0.034188 -0.246612 -0.468249 -0.163455 -0.290596 -0.192652 -0.351980 -0.359862

377 rows × 10021 columns

You can also obtain a data frame with per-sample confounder PC values (note that PC indices are 0-based). If your samples have known metadata variables, you can correlate them with the confounder PC values to gain insight into the confounding signal (see Figures 2F and 3F in our paper).

[11]:
df_confounders = model.compute_confounders(df_data_test)
display(df_confounders)
PC_0 PC_1 PC_2
GTEX-1122O-0005-SM-5O99J -77.795440 30.096142 2.478385
GTEX-113IC-0006-SM-5NQ9C 5.578602 39.257367 -58.918273
GTEX-117XS-0005-SM-5PNU6 72.421864 -46.543171 -6.697147
GTEX-1192W-0005-SM-5NQBQ 61.764435 -17.245763 -7.978997
GTEX-11DXW-0006-SM-5NQ7Y 61.639184 -18.790421 -25.956728
... ... ... ...
GTEX-ZVE2-0006-SM-51MRW 14.852568 7.832459 40.815036
GTEX-ZVT2-0005-SM-57WBW -66.880424 14.664277 6.053250
GTEX-ZVT4-0006-SM-57WB8 24.650767 7.150354 28.726063
GTEX-ZVZP-0006-SM-51MSW -79.053762 -5.825063 5.588776
GTEX-ZXES-0005-SM-57WCB -49.671241 10.018337 -14.464119

377 rows × 3 columns
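For instance, given a metadata table indexed by the same sample IDs, one could compute the Spearman correlation of each confounder PC with a numeric covariate. The metadata column name below is made up for illustration:

```python
import pandas as pd

# Toy stand-in for sample metadata; in practice, load your own table
df_metadata = pd.DataFrame(
    {"age": [55, 62, 48]},
    index=["GTEX-1122O-0005-SM-5O99J",
           "GTEX-113IC-0006-SM-5NQ9C",
           "GTEX-117XS-0005-SM-5PNU6"],
)
df_confounders = pd.DataFrame(
    {"PC_0": [-77.8, 5.6, 72.4], "PC_1": [30.1, 39.3, -46.5]},
    index=df_metadata.index,
)

# Spearman correlation of each PC with the covariate
joined = df_confounders.join(df_metadata)
pc_corrs = joined.corr(method="spearman")["age"].drop("age")
```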

2.5. Exporting correlations to file#

The export_corrs method generates a tab-separated table with all feature pairs, their clean and raw correlations and p-values, and reference set information. The table is always sorted from the lowest to the highest correlation. The table can be huge if you have many features, so we write it directly to disk in chunks instead of building a Pandas data frame.

[12]:
model.export_corrs(df_data_test, "corrs.test_samples.tsv")

# These lines just show a representative chunk of the file
corrs = pd.read_csv(
    "out_data/GTEx_Whole_Blood/corrs.test_samples.tsv",
    sep="\t", nrows=100000
)
corrs.loc[corrs["Gene Ontology_flag"] != -1].head(10)
2025-02-28 19:58:41.119579 | Computing raw correlations for all_samp...
2025-02-28 19:58:41.968045 | Computing corrected correlations for all_samp...
2025-02-28 19:58:43.861963 | Starting export to file...
100%|████████████████████████████████████████████████████████████████| 51/51 [02:03<00:00,  2.42s/it]
[12]:
feature_id1 feature_id2 feature_name1 feature_name2 feature_type1 feature_type2 corr_clean pvalue_clean corr_raw pvalue_raw Canonical Pathways_flag Canonical Pathways_trainval Gene Ontology_flag Gene Ontology_trainval
0 ENSG00000134001.12 ENSG00000109118.13 EIF2S1 PHF12 mRNA mRNA -0.797694 1.020175e-83 -0.766172 5.202540e-74 -1 -1 0 0
2 ENSG00000173120.14 ENSG00000100387.8 KDM2A RBX1 mRNA mRNA -0.790738 2.692452e-81 -0.480879 3.266728e-23 0 0 1 0
3 ENSG00000143742.12 ENSG00000173120.14 SRP9 KDM2A mRNA mRNA -0.785374 1.718902e-79 -0.135602 8.379965e-03 0 0 0 0
4 ENSG00000265354.3 ENSG00000109118.13 TIMM23 PHF12 mRNA mRNA -0.784154 4.350409e-79 -0.666457 9.263482e-50 -1 -1 0 1
5 ENSG00000014641.17 ENSG00000074755.14 MDH1 ZZEF1 mRNA mRNA -0.783222 8.809436e-79 -0.753369 2.963888e-70 -1 -1 0 0
6 ENSG00000014641.17 ENSG00000109118.13 MDH1 PHF12 mRNA mRNA -0.780033 9.586080e-78 -0.776452 3.326642e-77 -1 -1 0 0
7 ENSG00000079785.14 ENSG00000109118.13 DDX1 PHF12 mRNA mRNA -0.778276 3.510557e-77 -0.630010 4.314617e-43 -1 -1 1 1
8 ENSG00000109118.13 ENSG00000181610.12 PHF12 MRPS23 mRNA mRNA -0.777906 4.607714e-77 -0.525653 3.605585e-28 -1 -1 0 1
9 ENSG00000014641.17 ENSG00000177463.15 MDH1 NR2C2 mRNA mRNA -0.777216 7.641771e-77 -0.614369 1.706607e-40 0 1 0 1
10 ENSG00000177463.15 ENSG00000138663.8 NR2C2 COPS4 mRNA mRNA -0.777164 7.935790e-77 -0.387611 5.791705e-15 0 1 0 1

After the columns with correlation p-values, the table has a block of columns describing the reference sets. Each reference collection is represented by exactly two columns:

  1. flag:

    • == 1: the feature pair is reference.

    • == 0: the feature pair is non-reference, though both features are present in some sets of the collection.

    • == -1: one of the features of the pair is absent from all sets of the collection.

  2. trainval:

    • If flag != -1, this column shows whether the feature pair is in the training (0) or validation (1) subset.

    • If flag == -1, this column is also -1.
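These flag columns make it easy to slice the exported table. For example, to keep only reference pairs from the validation subset of the Gene Ontology collection (a tiny inline stand-in is used here; real runs would read corrs.test_samples.tsv with pd.read_csv as shown above):

```python
import pandas as pd

# Tiny stand-in for a chunk of the exported table
corrs = pd.DataFrame({
    "feature_name1": ["EIF2S1", "KDM2A", "DDX1"],
    "feature_name2": ["PHF12", "RBX1", "PHF12"],
    "Gene Ontology_flag": [0, 1, 1],
    "Gene Ontology_trainval": [0, 0, 1],
})

# Reference pairs (flag == 1) from the validation subset (trainval == 1)
go_ref_val = corrs.loc[
    (corrs["Gene Ontology_flag"] == 1)
    & (corrs["Gene Ontology_trainval"] == 1)
]
```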