1. Preparing input data#

To use the CorrAdjust, you will need to prepare the following input data:

  • Data table and additional tables with feature/sample annotations.

  • One or more GMT files listing which features (e.g., genes) belong to the same reference sets (e.g., pathways).

  • Configuration dict.

1.1. Data and annotation tables#

Main data table’s rows and columns should represent samples and features (e.g., genes), respectively. CorrAdjust operates with Pandas data frames. Data should be normalized in a way that allows between-sample comparisons for each feature. Make sure you don’t have constant features (they cannot be used for correlation analysis).

If you have more than ~100 samples, you could split the samples into training and test sets, and the module will provide methods to process them without any training/test leaks with sklearn-style interface.

Below, we use the GTEx whole blood RNA-seq data (this tutorial will fully reproduce Case 1 from the CorrAdjust paper). We import read counts data (pre-filtered to exclude low-expressed genes), normalize it with median-of-ratios method, and then log-transform.

[1]:
import pandas as pd
import numpy as np
from corradjust.utils import MedianOfRatios

# Raw read counts data
df_counts = pd.read_csv(
    "input_data/GTEx_Whole_Blood/raw_counts.tsv",
    sep="\t", index_col=0
)
display(df_counts)

# Split samples into 50%/50% training and test sets
df_counts_train = df_counts.iloc[::2]
df_counts_test = df_counts.iloc[1::2]

# Normalize data using DESeq2 median of ratios algorithm
# This interface is train/test-set friendly, i.e., test data
# has no influence on training data
normalizer = MedianOfRatios()
normalizer.fit(df_counts_train)
df_norm_counts_train = normalizer.transform(df_counts_train)
df_norm_counts_test = normalizer.transform(df_counts_test)

# Log2-transform
df_data_train = np.log2(df_norm_counts_train + 1)
df_data_test = np.log2(df_norm_counts_test + 1)
ENSG00000188976.10 ENSG00000187961.13 ENSG00000188290.10 ENSG00000187608.8 ENSG00000131591.17 ENSG00000186891.13 ENSG00000186827.10 ENSG00000078808.16 ENSG00000176022.4 ENSG00000160087.20 ... ENSG00000198712.1 ENSG00000228253.1 ENSG00000198899.2 ENSG00000198938.2 ENSG00000198840.2 ENSG00000212907.2 ENSG00000198886.2 ENSG00000198786.2 ENSG00000198695.2 ENSG00000198727.2
GTEX-111YS-0006-SM-5NQBE 1200 184 10 447 303 60 32 5023 656 2291 ... 135021 11606 85543 100328 19752 10251 159959 51494 27576 93271
GTEX-1122O-0005-SM-5O99J 674 117 36 482 167 31 56 3917 482 2251 ... 65060 6219 40271 58629 8921 7193 89646 31324 11562 54229
GTEX-1128S-0005-SM-5P9HI 1498 742 60 392 710 1141 466 5158 101 1638 ... 106654 15681 110716 85599 11111 14739 247762 70150 38511 67878
GTEX-113IC-0006-SM-5NQ9C 1869 584 456 10134 891 202 429 4743 694 2396 ... 49286 5252 24159 39851 9008 5020 64451 23966 12086 36639
GTEX-113JC-0006-SM-5O997 652 226 210 7584 275 204 184 1854 84 611 ... 71888 9828 67927 56464 11920 9963 120662 31464 10105 54708
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
GTEX-ZVTK-0006-SM-57WBK 1288 495 85 579 320 599 920 3557 179 772 ... 135880 16912 113231 108538 16336 21829 251772 106566 39759 103755
GTEX-ZVZP-0006-SM-51MSW 763 152 5 319 260 126 49 4631 552 2912 ... 153946 11599 104121 162880 19291 8908 220827 47923 29585 145346
GTEX-ZVZQ-0006-SM-51MR8 1998 481 159 1391 1034 1016 785 4662 257 1095 ... 194031 24313 184296 186144 36242 24964 387990 88907 25830 175377
GTEX-ZXES-0005-SM-57WCB 1181 221 7 759 274 86 65 6305 794 2939 ... 83592 9832 65882 73655 12275 8488 114308 33482 14236 57067
GTEX-ZXG5-0005-SM-57WCN 1528 766 124 536 583 289 157 5351 236 1369 ... 142742 20875 130945 118088 22318 17722 229771 52406 18488 117515

755 rows × 10021 columns

Feature annotation table is mandatory and should have 3 columns:

  1. feature_id: should match with columns of data and needs to be unique. For example, ENSEMBL gene IDs.

  2. feature_name: should match feature names in the reference GMT files, allows duplicates. For example, gene symbols.

  3. feature_type: discrete set of feature types. E.g., if you are analyzing only mRNA-seq data, put mRNA for all genes; if you are integrating miRNA, mRNA, or any other data type, you could use more than one type (e.g., miRNA and mRNA), see Advanced example.

Rows of feature annotation table should be identical to the columns of df_data.

[2]:
df_feature_ann = pd.read_csv(
    "input_data/GTEx_Whole_Blood/gene_annotation.tsv",
    sep="\t", index_col=0
)
display(df_feature_ann)
feature_name feature_type
feature_id
ENSG00000188976.10 NOC2L mRNA
ENSG00000187961.13 KLHL17 mRNA
ENSG00000188290.10 HES4 mRNA
ENSG00000187608.8 ISG15 mRNA
ENSG00000131591.17 C1orf159 mRNA
... ... ...
ENSG00000212907.2 MT-ND4L mRNA
ENSG00000198886.2 MT-ND4 mRNA
ENSG00000198786.2 MT-ND5 mRNA
ENSG00000198695.2 MT-ND6 mRNA
ENSG00000198727.2 MT-CYB mRNA

10021 rows × 2 columns

Finally, if you have two or more distinct sample groups (e.g., normal and tumor samples), you might provide sample annotation table, see Advanced example.

1.2. Reference GMT files#

Each reference collection of feature pairs should be represented as a separate GMT file. GMT file is tab-separated file with the following structure:

  • Column 1: reference feature set name.

  • Column 2: ignored, you can put any string there (e.g., NA).

  • Columns 3-…: feature names (number of features can differ between rows).

You can find a toy GMT file in the CorrAdjust GitHub repository. For the current tutorial, we downloaded Canonical Pathways and Gene Ontology databases from MSigDB.

The package will process the file line-by-line, and label all possible feature pairs from each line as reference pairs. Thus, \(n\) features on a line will generate \(n*(n-1)/2\) reference pairs.

One reference set might be represented by several lines. This can be useful, e.g., for miRNA-target gene pairs:

Column 1

Column 2

Column 3

Column 4

miR-X-targets

NA

miR-X

target-gene-1

miR-X-targets

NA

miR-X

target-gene-2

miR-X-targets

NA

miR-X

target-gene-3

If we instead put everything in one line like this,

Column 1

Column 2

Column 3

Column 4

Column 5

Column 6

miR-X-targets

NA

miR-X

target-gene-1

target-gene-2

target-gene-3

then pairs composed of two target genes will also be labeled as miR-X-targets, which will be incorrect for the analysis of miRNA-mRNA targeting interactions. See Advanced example for a miRNA-mRNA run.

1.3. Reference feature collections configuration#

Configuration dict specifies how reference collections should be handled by CorrAdjust. Below is the example of such a config for two mRNA-mRNA pathway databases. One config can contain an arbitrary number of different collections (GMT files). See more detailed documentation in API reference.

[3]:
ref_feature_colls = {
    "Canonical Pathways": {
        # Relative or absolute path to the GMT file.
        "path": "input_data/GMT_files/c2.cp.v2023.2.Hs.symbols.gmt",
        # Rank feature pairs by absolute correlations.
        "sign": "absolute",
        # Allowed feature pair types.
        # Should match annotation's "feature_type" column.
        "feature_pair_types": ["mRNA-mRNA"],
        # Fraction of all feature pairs to define highly ranked correlations.
        # This is paramter alpha in the CorrAdjust paper notation.
        # 0.01 is a good default value for mRNA-mRNA analysis.
        "high_corr_frac": 0.01
    },
    "Gene Ontology": {
        "path": "input_data/GMT_files/c5.go.v2023.2.Hs.symbols.gmt",
        "sign": "absolute",
        "feature_pair_types": ["mRNA-mRNA"],
        "high_corr_frac": 0.01
    }
}