1. Preparing input data#

To use the CorrAdjust, you will need to prepare the following input data:

Data table and additional tables with feature/sample annotations.
One or more GMT files listing which features (e.g., genes) belong to the same reference sets (e.g., pathways).
Configuration dict.

1.1. Data and annotation tables#

Main data table’s rows and columns should represent samples and features (e.g., genes), respectively. CorrAdjust operates with Pandas data frames. Data should be normalized in a way that allows between-sample comparisons for each feature. Make sure you don’t have constant features (they cannot be used for correlation analysis).

If you have more than ~100 samples, you could split the samples into training and test sets, and the module will provide methods to process them without any training/test leaks with sklearn-style interface.

Below, we use the GTEx whole blood RNA-seq data (this tutorial will fully reproduce Case 1 from the CorrAdjust paper). We import read counts data (pre-filtered to exclude low-expressed genes), normalize it with median-of-ratios method, and then log-transform.

[1]:

import pandas as pd
import numpy as np
from corradjust.utils import MedianOfRatios

# Raw read counts data
df_counts = pd.read_csv(
    "input_data/GTEx_Whole_Blood/raw_counts.tsv",
    sep="\t", index_col=0
)
display(df_counts)

# Split samples into 50%/50% training and test sets
df_counts_train = df_counts.iloc[::2]
df_counts_test = df_counts.iloc[1::2]

# Normalize data using DESeq2 median of ratios algorithm
# This interface is train/test-set friendly, i.e., test data
# has no influence on training data
normalizer = MedianOfRatios()
normalizer.fit(df_counts_train)
df_norm_counts_train = normalizer.transform(df_counts_train)
df_norm_counts_test = normalizer.transform(df_counts_test)

# Log2-transform
df_data_train = np.log2(df_norm_counts_train + 1)
df_data_test = np.log2(df_norm_counts_test + 1)

	ENSG00000188976.10	ENSG00000187961.13	ENSG00000188290.10	ENSG00000187608.8	ENSG00000131591.17	ENSG00000186891.13	ENSG00000186827.10	ENSG00000078808.16	ENSG00000176022.4	ENSG00000160087.20	...	ENSG00000198712.1	ENSG00000228253.1	ENSG00000198899.2	ENSG00000198938.2	ENSG00000198840.2	ENSG00000212907.2	ENSG00000198886.2	ENSG00000198786.2	ENSG00000198695.2	ENSG00000198727.2
GTEX-111YS-0006-SM-5NQBE	1200	184	10	447	303	60	32	5023	656	2291	...	135021	11606	85543	100328	19752	10251	159959	51494	27576	93271
GTEX-1122O-0005-SM-5O99J	674	117	36	482	167	31	56	3917	482	2251	...	65060	6219	40271	58629	8921	7193	89646	31324	11562	54229
GTEX-1128S-0005-SM-5P9HI	1498	742	60	392	710	1141	466	5158	101	1638	...	106654	15681	110716	85599	11111	14739	247762	70150	38511	67878
GTEX-113IC-0006-SM-5NQ9C	1869	584	456	10134	891	202	429	4743	694	2396	...	49286	5252	24159	39851	9008	5020	64451	23966	12086	36639
GTEX-113JC-0006-SM-5O997	652	226	210	7584	275	204	184	1854	84	611	...	71888	9828	67927	56464	11920	9963	120662	31464	10105	54708
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
GTEX-ZVTK-0006-SM-57WBK	1288	495	85	579	320	599	920	3557	179	772	...	135880	16912	113231	108538	16336	21829	251772	106566	39759	103755
GTEX-ZVZP-0006-SM-51MSW	763	152	5	319	260	126	49	4631	552	2912	...	153946	11599	104121	162880	19291	8908	220827	47923	29585	145346
GTEX-ZVZQ-0006-SM-51MR8	1998	481	159	1391	1034	1016	785	4662	257	1095	...	194031	24313	184296	186144	36242	24964	387990	88907	25830	175377
GTEX-ZXES-0005-SM-57WCB	1181	221	7	759	274	86	65	6305	794	2939	...	83592	9832	65882	73655	12275	8488	114308	33482	14236	57067
GTEX-ZXG5-0005-SM-57WCN	1528	766	124	536	583	289	157	5351	236	1369	...	142742	20875	130945	118088	22318	17722	229771	52406	18488	117515

755 rows × 10021 columns

Feature annotation table is mandatory and should have 3 columns:

feature_id: should match with columns of data and needs to be unique. For example, ENSEMBL gene IDs.
feature_name: should match feature names in the reference GMT files, allows duplicates. For example, gene symbols.
feature_type: discrete set of feature types. E.g., if you are analyzing only mRNA-seq data, put mRNA for all genes; if you are integrating miRNA, mRNA, or any other data type, you could use more than one type (e.g., miRNA and mRNA), see Advanced example.

Rows of feature annotation table should be identical to the columns of df_data.

[2]:

df_feature_ann = pd.read_csv(
    "input_data/GTEx_Whole_Blood/gene_annotation.tsv",
    sep="\t", index_col=0
)
display(df_feature_ann)

	feature_name	feature_type
feature_id
ENSG00000188976.10	NOC2L	mRNA
ENSG00000187961.13	KLHL17	mRNA
ENSG00000188290.10	HES4	mRNA
ENSG00000187608.8	ISG15	mRNA
ENSG00000131591.17	C1orf159	mRNA
...	...	...
ENSG00000212907.2	MT-ND4L	mRNA
ENSG00000198886.2	MT-ND4	mRNA
ENSG00000198786.2	MT-ND5	mRNA
ENSG00000198695.2	MT-ND6	mRNA
ENSG00000198727.2	MT-CYB	mRNA

10021 rows × 2 columns

Finally, if you have two or more distinct sample groups (e.g., normal and tumor samples), you might provide sample annotation table, see Advanced example.

1.2. Reference GMT files#

Each reference collection of feature pairs should be represented as a separate GMT file. GMT file is tab-separated file with the following structure:

Column 1: reference feature set name.
Column 2: ignored, you can put any string there (e.g., NA).
Columns 3-…: feature names (number of features can differ between rows).

You can find a toy GMT file in the CorrAdjust GitHub repository. For the current tutorial, we downloaded Canonical Pathways and Gene Ontology databases from MSigDB.

The package will process the file line-by-line, and label all possible feature pairs from each line as reference pairs. Thus, \(n\) features on a line will generate \(n*(n-1)/2\) reference pairs.

One reference set might be represented by several lines. This can be useful, e.g., for miRNA-target gene pairs:

Column 1	Column 2	Column 3	Column 4
…	…	…	…
miR-X-targets	NA	miR-X	target-gene-1
miR-X-targets	NA	miR-X	target-gene-2
miR-X-targets	NA	miR-X	target-gene-3
…	…	…	…

If we instead put everything in one line like this,

Column 1	Column 2	Column 3	Column 4	Column 5	Column 6
…	…	…	…	…	…
miR-X-targets	NA	miR-X	target-gene-1	target-gene-2	target-gene-3
…	…	…	…	…	…

then pairs composed of two target genes will also be labeled as miR-X-targets, which will be incorrect for the analysis of miRNA-mRNA targeting interactions. See Advanced example for a miRNA-mRNA run.

1.3. Reference feature collections configuration#

Configuration dict specifies how reference collections should be handled by CorrAdjust. Below is the example of such a config for two mRNA-mRNA pathway databases. One config can contain an arbitrary number of different collections (GMT files). See more detailed documentation in API reference.

[3]:

ref_feature_colls = {
    "Canonical Pathways": {
        # Relative or absolute path to the GMT file.
        "path": "input_data/GMT_files/c2.cp.v2023.2.Hs.symbols.gmt",
        # Rank feature pairs by absolute correlations.
        "sign": "absolute",
        # Allowed feature pair types.
        # Should match annotation's "feature_type" column.
        "feature_pair_types": ["mRNA-mRNA"],
        # Fraction of all feature pairs to define highly ranked correlations.
        # This is paramter alpha in the CorrAdjust paper notation.
        # 0.01 is a good default value for mRNA-mRNA analysis.
        "high_corr_frac": 0.01
    },
    "Gene Ontology": {
        "path": "input_data/GMT_files/c5.go.v2023.2.Hs.symbols.gmt",
        "sign": "absolute",
        "feature_pair_types": ["mRNA-mRNA"],
        "high_corr_frac": 0.01
    }
}