1. Preparing input data#
To use the CorrAdjust, you will need to prepare the following input data:
Data table and additional tables with feature/sample annotations.
One or more GMT files listing which features (e.g., genes) belong to the same reference sets (e.g., pathways).
Configuration dict.
1.1. Data and annotation tables#
Main data table’s rows and columns should represent samples and features (e.g., genes), respectively. CorrAdjust operates with Pandas data frames. Data should be normalized in a way that allows between-sample comparisons for each feature. Make sure you don’t have constant features (they cannot be used for correlation analysis).
If you have more than ~100 samples, you could split the samples into training and test sets, and the module will provide methods to process them without any training/test leaks with sklearn-style interface.
Below, we use the GTEx whole blood RNA-seq data (this tutorial will fully reproduce Case 1 from the CorrAdjust paper). We import read counts data (pre-filtered to exclude low-expressed genes), normalize it with median-of-ratios method, and then log-transform.
[1]:
import pandas as pd
import numpy as np
from corradjust.utils import MedianOfRatios
# Raw read counts data
df_counts = pd.read_csv(
"input_data/GTEx_Whole_Blood/raw_counts.tsv",
sep="\t", index_col=0
)
display(df_counts)
# Split samples into 50%/50% training and test sets
df_counts_train = df_counts.iloc[::2]
df_counts_test = df_counts.iloc[1::2]
# Normalize data using DESeq2 median of ratios algorithm
# This interface is train/test-set friendly, i.e., test data
# has no influence on training data
normalizer = MedianOfRatios()
normalizer.fit(df_counts_train)
df_norm_counts_train = normalizer.transform(df_counts_train)
df_norm_counts_test = normalizer.transform(df_counts_test)
# Log2-transform
df_data_train = np.log2(df_norm_counts_train + 1)
df_data_test = np.log2(df_norm_counts_test + 1)
ENSG00000188976.10 | ENSG00000187961.13 | ENSG00000188290.10 | ENSG00000187608.8 | ENSG00000131591.17 | ENSG00000186891.13 | ENSG00000186827.10 | ENSG00000078808.16 | ENSG00000176022.4 | ENSG00000160087.20 | ... | ENSG00000198712.1 | ENSG00000228253.1 | ENSG00000198899.2 | ENSG00000198938.2 | ENSG00000198840.2 | ENSG00000212907.2 | ENSG00000198886.2 | ENSG00000198786.2 | ENSG00000198695.2 | ENSG00000198727.2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
GTEX-111YS-0006-SM-5NQBE | 1200 | 184 | 10 | 447 | 303 | 60 | 32 | 5023 | 656 | 2291 | ... | 135021 | 11606 | 85543 | 100328 | 19752 | 10251 | 159959 | 51494 | 27576 | 93271 |
GTEX-1122O-0005-SM-5O99J | 674 | 117 | 36 | 482 | 167 | 31 | 56 | 3917 | 482 | 2251 | ... | 65060 | 6219 | 40271 | 58629 | 8921 | 7193 | 89646 | 31324 | 11562 | 54229 |
GTEX-1128S-0005-SM-5P9HI | 1498 | 742 | 60 | 392 | 710 | 1141 | 466 | 5158 | 101 | 1638 | ... | 106654 | 15681 | 110716 | 85599 | 11111 | 14739 | 247762 | 70150 | 38511 | 67878 |
GTEX-113IC-0006-SM-5NQ9C | 1869 | 584 | 456 | 10134 | 891 | 202 | 429 | 4743 | 694 | 2396 | ... | 49286 | 5252 | 24159 | 39851 | 9008 | 5020 | 64451 | 23966 | 12086 | 36639 |
GTEX-113JC-0006-SM-5O997 | 652 | 226 | 210 | 7584 | 275 | 204 | 184 | 1854 | 84 | 611 | ... | 71888 | 9828 | 67927 | 56464 | 11920 | 9963 | 120662 | 31464 | 10105 | 54708 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
GTEX-ZVTK-0006-SM-57WBK | 1288 | 495 | 85 | 579 | 320 | 599 | 920 | 3557 | 179 | 772 | ... | 135880 | 16912 | 113231 | 108538 | 16336 | 21829 | 251772 | 106566 | 39759 | 103755 |
GTEX-ZVZP-0006-SM-51MSW | 763 | 152 | 5 | 319 | 260 | 126 | 49 | 4631 | 552 | 2912 | ... | 153946 | 11599 | 104121 | 162880 | 19291 | 8908 | 220827 | 47923 | 29585 | 145346 |
GTEX-ZVZQ-0006-SM-51MR8 | 1998 | 481 | 159 | 1391 | 1034 | 1016 | 785 | 4662 | 257 | 1095 | ... | 194031 | 24313 | 184296 | 186144 | 36242 | 24964 | 387990 | 88907 | 25830 | 175377 |
GTEX-ZXES-0005-SM-57WCB | 1181 | 221 | 7 | 759 | 274 | 86 | 65 | 6305 | 794 | 2939 | ... | 83592 | 9832 | 65882 | 73655 | 12275 | 8488 | 114308 | 33482 | 14236 | 57067 |
GTEX-ZXG5-0005-SM-57WCN | 1528 | 766 | 124 | 536 | 583 | 289 | 157 | 5351 | 236 | 1369 | ... | 142742 | 20875 | 130945 | 118088 | 22318 | 17722 | 229771 | 52406 | 18488 | 117515 |
755 rows × 10021 columns
Feature annotation table is mandatory and should have 3 columns:
feature_id: should match with columns of data and needs to be unique. For example, ENSEMBL gene IDs.
feature_name: should match feature names in the reference GMT files, allows duplicates. For example, gene symbols.
feature_type: discrete set of feature types. E.g., if you are analyzing only mRNA-seq data, put
mRNA
for all genes; if you are integrating miRNA, mRNA, or any other data type, you could use more than one type (e.g.,miRNA
andmRNA
), see Advanced example.
Rows of feature annotation table should be identical to the columns of df_data
.
[2]:
df_feature_ann = pd.read_csv(
"input_data/GTEx_Whole_Blood/gene_annotation.tsv",
sep="\t", index_col=0
)
display(df_feature_ann)
feature_name | feature_type | |
---|---|---|
feature_id | ||
ENSG00000188976.10 | NOC2L | mRNA |
ENSG00000187961.13 | KLHL17 | mRNA |
ENSG00000188290.10 | HES4 | mRNA |
ENSG00000187608.8 | ISG15 | mRNA |
ENSG00000131591.17 | C1orf159 | mRNA |
... | ... | ... |
ENSG00000212907.2 | MT-ND4L | mRNA |
ENSG00000198886.2 | MT-ND4 | mRNA |
ENSG00000198786.2 | MT-ND5 | mRNA |
ENSG00000198695.2 | MT-ND6 | mRNA |
ENSG00000198727.2 | MT-CYB | mRNA |
10021 rows × 2 columns
Finally, if you have two or more distinct sample groups (e.g., normal and tumor samples), you might provide sample annotation table, see Advanced example.
1.2. Reference GMT files#
Each reference collection of feature pairs should be represented as a separate GMT file. GMT file is tab-separated file with the following structure:
Column 1: reference feature set name.
Column 2: ignored, you can put any string there (e.g.,
NA
).Columns 3-…: feature names (number of features can differ between rows).
You can find a toy GMT file in the CorrAdjust GitHub repository. For the current tutorial, we downloaded Canonical Pathways and Gene Ontology databases from MSigDB.
The package will process the file line-by-line, and label all possible feature pairs from each line as reference pairs. Thus, \(n\) features on a line will generate \(n*(n-1)/2\) reference pairs.
One reference set might be represented by several lines. This can be useful, e.g., for miRNA-target gene pairs:
Column 1 |
Column 2 |
Column 3 |
Column 4 |
---|---|---|---|
… |
… |
… |
… |
miR-X-targets |
NA |
miR-X |
target-gene-1 |
miR-X-targets |
NA |
miR-X |
target-gene-2 |
miR-X-targets |
NA |
miR-X |
target-gene-3 |
… |
… |
… |
… |
If we instead put everything in one line like this,
Column 1 |
Column 2 |
Column 3 |
Column 4 |
Column 5 |
Column 6 |
---|---|---|---|---|---|
… |
… |
… |
… |
… |
… |
miR-X-targets |
NA |
miR-X |
target-gene-1 |
target-gene-2 |
target-gene-3 |
… |
… |
… |
… |
… |
… |
then pairs composed of two target genes will also be labeled as miR-X-targets
, which will be incorrect for the analysis of miRNA-mRNA targeting interactions. See Advanced example for a miRNA-mRNA run.
1.3. Reference feature collections configuration#
Configuration dict specifies how reference collections should be handled by CorrAdjust. Below is the example of such a config for two mRNA-mRNA pathway databases. One config can contain an arbitrary number of different collections (GMT files). See more detailed documentation in API reference.
[3]:
ref_feature_colls = {
"Canonical Pathways": {
# Relative or absolute path to the GMT file.
"path": "input_data/GMT_files/c2.cp.v2023.2.Hs.symbols.gmt",
# Rank feature pairs by absolute correlations.
"sign": "absolute",
# Allowed feature pair types.
# Should match annotation's "feature_type" column.
"feature_pair_types": ["mRNA-mRNA"],
# Fraction of all feature pairs to define highly ranked correlations.
# This is paramter alpha in the CorrAdjust paper notation.
# 0.01 is a good default value for mRNA-mRNA analysis.
"high_corr_frac": 0.01
},
"Gene Ontology": {
"path": "input_data/GMT_files/c5.go.v2023.2.Hs.symbols.gmt",
"sign": "absolute",
"feature_pair_types": ["mRNA-mRNA"],
"high_corr_frac": 0.01
}
}