4. Advanced example#

In this example, we will show how to run CorrAdjust with several feature types (miRNA-mRNA correlations) and several sample groups (male and female samples). The RNA-seq datasets from lymphoblastoid cell lines (LCLs) for the example are taken from the collection generated by the Geuvadis consortium.

4.1. Loading and normalizing the data#

The below code for reading and normalizing expression data is similar to the previous one with one exception. Specifically, we normalize mRNA and small non-coding RNA (sncRNA) matrices separately, since the datasets come from different experiments and should not be normalized together. After normalization, we concatenate the matrices.

[16]:

# Read the data; both tables have identical index
df_counts_mRNA = pd.read_csv(
    "input_data/Geuvadis/raw_counts_mRNA.tsv",
    sep="\t", index_col=0
)
df_counts_sncRNA = pd.read_csv(
    "input_data/Geuvadis/raw_counts_sncRNA.tsv",
    sep="\t", index_col=0
)

# Split into training and test samples
df_counts_mRNA_train = df_counts_mRNA.iloc[::2]
df_counts_mRNA_test = df_counts_mRNA.iloc[1::2]
df_counts_sncRNA_train = df_counts_sncRNA.iloc[::2]
df_counts_sncRNA_test = df_counts_sncRNA.iloc[1::2]

# Normalize mRNA and sncRNA data separately
normalizer = MedianOfRatios()
normalizer.fit(df_counts_mRNA_train)
df_norm_counts_mRNA_train = normalizer.transform(df_counts_mRNA_train)
df_norm_counts_mRNA_test = normalizer.transform(df_counts_mRNA_test)

normalizer = MedianOfRatios()
normalizer.fit(df_counts_sncRNA_train)
df_norm_counts_sncRNA_train = normalizer.transform(df_counts_sncRNA_train)
df_norm_counts_sncRNA_test = normalizer.transform(df_counts_sncRNA_test)

# Concatenate mRNA and sncRNA data
df_norm_counts_train = pd.concat(
    [df_norm_counts_sncRNA_train, df_norm_counts_mRNA_train],
    axis=1
)
df_norm_counts_test = pd.concat(
    [df_norm_counts_sncRNA_test, df_norm_counts_mRNA_test],
    axis=1
)

# Log-transform
df_data_train = np.log2(df_norm_counts_train + 1)
df_data_test = np.log2(df_norm_counts_test + 1)

display(df_data_train)

	tRF-33-79MP9P9NH57SD3	iso-23-2HOMKBFPDP	rRF-18-WLLINPD4	tRF-33-Z3M8ZLSSXUOLD2	iso-22-X2EUIRIKQ	iso-22-BQ8DQWM4H	iso-24-YEQOW4HK2E	yRF-27-KR2Y4BZUNJQ	tRF-18-HR6HFRD2	iso-24-VY2ZSR672F	...	ENSG00000198712.1	ENSG00000228253.1	ENSG00000198899.2	ENSG00000198938.2	ENSG00000198840.2	ENSG00000212907.2	ENSG00000198886.2	ENSG00000198786.2	ENSG00000198695.2	ENSG00000198727.2
TSI_NA20508.1.MI	15.375467	15.768876	14.386797	11.509986	17.077093	16.831354	16.979418	15.446145	17.407458	15.867370	...	16.957235	12.100007	16.197138	17.012233	13.547551	13.994002	17.536592	16.437674	14.407458	15.823431
FIN_HG00315.2.MI	17.004050	14.824433	15.759253	13.413163	14.623674	14.236777	15.006757	14.406483	15.180716	14.415009	...	16.971874	12.258090	15.962215	17.203759	13.432169	13.696331	17.444533	16.435032	15.052151	16.405424
FIN_HG00377.2.MI	17.363959	16.194561	15.104459	11.785276	15.627775	15.123750	15.992820	15.160283	14.282834	15.429691	...	17.195135	12.745818	16.419661	17.434837	13.754321	13.820417	17.623701	16.587501	15.150230	16.501734
CEU_NA07056.1.MI	17.097118	17.129173	17.445657	14.503363	16.753292	14.451858	15.225481	14.832857	15.499062	15.173789	...	18.067899	12.790246	17.218922	18.242186	14.478591	14.349128	18.692826	17.028353	15.858036	16.719261
YRI_NA19119.3.MI	16.795022	17.570451	15.459582	11.381865	17.177996	15.552042	16.201064	15.789840	17.553967	16.495589	...	16.224141	12.177533	15.455781	16.560902	13.195824	13.352683	16.708912	15.811658	14.117363	15.174058
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
GBR_HG01791.2.MI	16.126779	15.715584	14.766975	15.139796	16.288074	16.552333	15.605472	14.096111	16.977521	15.361669	...	16.720027	12.270003	15.723737	16.764922	13.368546	13.347992	17.085581	16.192440	14.622754	15.805635
FIN_HG00362.3.MI	17.121267	16.418557	13.368475	14.779898	16.243674	15.027084	15.819105	15.377365	13.327610	15.522643	...	16.910549	12.079542	15.968046	17.013613	13.524998	13.623903	17.297789	16.082245	14.738878	15.962373
CEU_NA07000.1.MI	16.410787	17.711002	17.774399	12.880337	17.419655	16.219216	16.623148	15.430145	17.526397	16.129083	...	17.297790	11.366524	16.511681	17.707241	14.237225	14.045978	18.140950	16.570033	15.179156	16.700573
YRI_NA18861.4.MI	15.755783	16.457623	15.142608	11.822476	16.368148	16.651518	16.597095	15.635380	16.886426	16.344094	...	13.245564	9.010921	12.461696	13.331866	10.670205	10.269527	14.117214	14.357536	12.045723	13.455430
GBR_HG00237.4.MI	13.614071	14.934466	11.768121	13.726310	14.688703	16.593015	16.395290	15.619196	14.995440	15.242237	...	17.009939	12.551017	15.668342	17.099195	13.859134	13.359202	16.929224	15.904793	14.359246	15.802043

202 rows × 15770 columns

Fene annotation file also follows the same format but now we have two feature types (sncRNA and mRNA):

[17]:

df_feature_ann = pd.read_csv(
    "input_data/Geuvadis/gene_annotation.tsv",
    sep="\t", index_col=0
)
display(df_feature_ann)

	feature_name	feature_type
feature_id
tRF-33-79MP9P9NH57SD3	tRF-33-79MP9P9NH57SD3	sncRNA
iso-23-2HOMKBFPDP	iso-23-2HOMKBFPDP&hsa-miR-191-5p\|0\|0	sncRNA
rRF-18-WLLINPD4	rRF-18-WLLINPD4	sncRNA
tRF-33-Z3M8ZLSSXUOLD2	tRF-33-Z3M8ZLSSXUOLD2	sncRNA
iso-22-X2EUIRIKQ	iso-22-X2EUIRIKQ&hsa-miR-146a-5p\|0\|0	sncRNA
...	...	...
ENSG00000212907.2	MT-ND4L	mRNA
ENSG00000198886.2	MT-ND4	mRNA
ENSG00000198786.2	MT-ND5	mRNA
ENSG00000198695.2	MT-ND6	mRNA
ENSG00000198727.2	MT-CYB	mRNA

15770 rows × 2 columns

Finally, we read a table with annotation of samples (male and female). Note that the group column could have an arbitrary number of different values.

[18]:

df_samp_ann = pd.read_csv(
    "input_data/Geuvadis/sample_annotation.tsv",
    sep="\t", index_col=0
)
df_samp_ann_train = df_samp_ann.iloc[::2]
df_samp_ann_test = df_samp_ann.iloc[1::2]
display(df_samp_ann_train)

	group
sample
TSI_NA20508.1.MI	female
FIN_HG00315.2.MI	female
FIN_HG00377.2.MI	female
CEU_NA07056.1.MI	female
YRI_NA19119.3.MI	male
...	...
GBR_HG01791.2.MI	male
FIN_HG00362.3.MI	female
CEU_NA07000.1.MI	female
YRI_NA18861.4.MI	female
GBR_HG00237.4.MI	female

202 rows × 1 columns

As a reference feature pair sets, we will use experimentally validated targets from TarBase v9. Note that targets are available only for canonical isomiRs (isomiRs with |0|0 suffix), while expression data has both canonical and non-canonical ones, as well as other non-isomiR sncRNAs.

[19]:

ref_feature_colls = {
    "TarBase": {
        "path": "input_data/GMT_files/TarBase.gmt",
        # We expect targets among negative miRNA-mRNA correlations
        "sign": "negative",
        "feature_pair_types": ["sncRNA-mRNA"],
        # Good default value for miRNA-mRNA analysis
        "high_corr_frac": 0.05
    }
}

4.2. Training the model#

Now we create and fit the model, passing sample annotation as an additional argument to fit.

[20]:

model = CorrAdjust(
    df_feature_ann,
    ref_feature_colls,
    "out_data/Geuvadis"
)
model.fit(
    df_data_train,
    df_samp_ann=df_samp_ann_train,
    # Set n_PCs value from paper
    n_PCs=17
)

2025-02-28 20:01:28.968561 | Loading TarBase reference collection...
2025-02-28 20:01:31.982782 | Computing PCA...
2025-02-28 20:01:33.001837 | Starting PC optimization...
100%|██████████████████████████████████████████████████████████████| 154/154 [24:48<00:00,  9.67s/it]

../_images/tutorial_advanced_run_13_1.png

../_images/tutorial_advanced_run_13_2.png

../_images/tutorial_advanced_run_13_3.png

If sample annotation with more than one group is passed to fit, it will produce separate optimization plots for each group. The last plot stands for mean scores (in this case, mean of male and female scores). Mean training score is used for maximization, and mean validation score is used for early stopping.

4.3. Using the trained model#

Calls of all downstream methods should include sample annotation and sample group name. To avoid code duplication, this can be done in a loop.

[21]:

for samp_group in ["female", "male"]:
    feature_scores = model.compute_feature_scores(
        df_data_test,
        df_samp_ann=df_samp_ann_test,
        samp_group=samp_group
    )

    # This attribute adds title in top-left corners of plots
    model.title = f"Sex = {samp_group}"

    model.make_volcano_plot(
        feature_scores,
        f"volcano.test_samples.{samp_group}.png",
        annotate_features=10,
        # Use formatter function to shorten isomiR names
        feature_name_fmt=lambda name: name.split("hsa-")[1].split("|")[0]
    )

    model.make_corr_distr_plot(
        df_data_test,
        f"corr_distr.test_samples.{samp_group}.png",
        df_samp_ann=df_samp_ann_test,
        samp_group=samp_group
    )

    model.export_corrs(
        df_data_test,
        f"corrs.test_samples.{samp_group}.tsv",
        df_samp_ann=df_samp_ann_test,
        samp_group=samp_group
    )

    # These lines just show a representative chunk of the file
    corrs = pd.read_csv(
        f"out_data/Geuvadis/corrs.test_samples.{samp_group}.tsv",
        sep="\t", nrows=100000
    )
    print(f"Chunk of corrs.test_samples.{samp_group}.tsv:")
    display(corrs.loc[corrs["TarBase_flag"] != -1].head(10))

2025-02-28 20:26:24.436736 | Computing raw correlations for female...
2025-02-28 20:26:26.703805 | Computing corrected correlations for female...
2025-02-28 20:26:36.252894 | Computing raw correlations for female...
2025-02-28 20:26:38.543961 | Computing corrected correlations for female...
2025-02-28 20:26:48.916197 | Computing raw correlations for female...
2025-02-28 20:26:51.184528 | Computing corrected correlations for female...
2025-02-28 20:26:56.129525 | Starting export to file...
100%|██████████████████████████████████████████████████████████████| 125/125 [05:22<00:00,  2.58s/it]

Chunk of corrs.test_samples.female.tsv:

	feature_id1	feature_id2	feature_name1	feature_name2	feature_type1	feature_type2	corr_clean	pvalue_clean	corr_raw	pvalue_raw	TarBase_flag	TarBase_trainval
1587	iso-22-VFPFU47YP	ENSG00000082701.17	iso-22-VFPFU47YP&hsa-miR-15a-5p\|0\|0	GSK3B	sncRNA	mRNA	-0.715228	3.162188e-31	0.134124	0.057662	1	0
4837	iso-23-U0XZH3PK0H	ENSG00000082701.17	iso-23-U0XZH3PK0H&hsa-miR-20a-5p\|0\|0	GSK3B	sncRNA	mRNA	-0.683877	1.117633e-27	0.244307	0.000474	1	0
6494	iso-23-20XYYWPK0H	ENSG00000131023.13	iso-23-20XYYWPK0H&hsa-miR-93-5p\|0\|0	LATS1	sncRNA	mRNA	-0.675366	8.629332e-27	0.179440	0.010808	1	0
6997	iso-22-83PI02EZN	ENSG00000118513.19	iso-22-83PI02EZN&hsa-miR-148a-3p\|0\|0	MYB	sncRNA	mRNA	-0.673046	1.488833e-26	-0.276047	0.000073	1	1
7477	iso-23-9P9Z35ZQDX	ENSG00000082701.17	iso-23-9P9Z35ZQDX&hsa-miR-142-3p\|0\|0	GSK3B	sncRNA	mRNA	-0.670901	2.454460e-26	0.186854	0.007907	0	1
7712	iso-23-J42LSNSO9	ENSG00000109171.15	iso-23-J42LSNSO9&hsa-miR-301a-3p\|0\|0	SLAIN2	sncRNA	mRNA	-0.669953	3.056737e-26	0.061544	0.385438	1	0
9829	iso-22-URIVHM63P	ENSG00000060237.18	iso-22-URIVHM63P&hsa-miR-660-5p\|0\|0	WNK1	sncRNA	mRNA	-0.662119	1.820237e-25	0.173901	0.013552	1	1
10006	iso-23-U0XZH3PK0H	ENSG00000109171.15	iso-23-U0XZH3PK0H&hsa-miR-20a-5p\|0\|0	SLAIN2	sncRNA	mRNA	-0.661582	2.052673e-25	0.167919	0.017184	1	1
10875	iso-23-J42LSNSO9	ENSG00000082701.17	iso-23-J42LSNSO9&hsa-miR-301a-3p\|0\|0	GSK3B	sncRNA	mRNA	-0.658831	3.787756e-25	0.290319	0.000029	1	1
11858	iso-23-J424HNSO9	ENSG00000048405.10	iso-23-J424HNSO9&hsa-miR-301b-3p\|0\|0	ZNF800	sncRNA	mRNA	-0.655913	7.202982e-25	0.169388	0.016221	1	0

2025-02-28 20:32:19.835571 | Computing raw correlations for male...
2025-02-28 20:32:22.327870 | Computing corrected correlations for male...
2025-02-28 20:32:32.340275 | Computing raw correlations for male...
2025-02-28 20:32:34.848362 | Computing corrected correlations for male...
2025-02-28 20:32:45.141563 | Computing raw correlations for male...
2025-02-28 20:32:47.713209 | Computing corrected correlations for male...
2025-02-28 20:32:52.741026 | Starting export to file...
100%|███████████████████████████████████████████████████████████████| 125/125 [05:14<00:00,  2.52s/it]

Chunk of corrs.test_samples.male.tsv:

	feature_id1	feature_id2	feature_name1	feature_name2	feature_type1	feature_type2	corr_clean	pvalue_clean	corr_raw	pvalue_raw	TarBase_flag	TarBase_trainval
21390	iso-23-20XYYWPK0H	ENSG00000130338.13	iso-23-20XYYWPK0H&hsa-miR-93-5p\|0\|0	TULP4	sncRNA	mRNA	-0.655115	8.576050e-25	-0.066628	0.347335	1	1
21892	iso-21-42NBQRZ00	ENSG00000009307.16	iso-21-42NBQRZ00&hsa-miR-151a-3p\|0\|0	CSDE1	sncRNA	mRNA	-0.654322	1.019516e-24	-0.175774	0.012562	0	1
29488	iso-23-U0XZH3PK0H	ENSG00000082701.17	iso-23-U0XZH3PK0H&hsa-miR-20a-5p\|0\|0	GSK3B	sncRNA	mRNA	-0.643953	9.335504e-24	0.120305	0.088911	1	0
32337	iso-21-42NBQRZ00	ENSG00000138593.9	iso-21-42NBQRZ00&hsa-miR-151a-3p\|0\|0	SECISBP2L	sncRNA	mRNA	-0.640631	1.863912e-23	-0.189063	0.007188	0	1
34594	iso-22-J424IB1JI	ENSG00000118058.23	iso-22-J424IB1JI&hsa-miR-130b-3p\|0\|0	KMT2A	sncRNA	mRNA	-0.638107	3.134571e-23	0.211219	0.002613	1	1
41222	iso-23-20XZ03PK0H	ENSG00000102908.22	iso-23-20XZ03PK0H&hsa-miR-17-5p\|0\|0	NFAT5	sncRNA	mRNA	-0.631484	1.198429e-22	0.170516	0.015514	1	1
42711	iso-21-W05I2PWPE	ENSG00000009307.16	iso-21-W05I2PWPE&hsa-miR-151a-5p\|0\|0	CSDE1	sncRNA	mRNA	-0.630074	1.587678e-22	-0.113037	0.110109	1	1
44583	iso-23-20XZ03PK0H	ENSG00000110367.13	iso-23-20XZ03PK0H&hsa-miR-17-5p\|0\|0	DDX6	sncRNA	mRNA	-0.628444	2.194175e-22	0.084632	0.232272	1	0
46898	iso-22-XKVL7YXYQ	ENSG00000118058.23	iso-22-XKVL7YXYQ&hsa-let-7i-5p\|0\|0	KMT2A	sncRNA	mRNA	-0.626423	3.267926e-22	0.112801	0.110857	1	1
49526	iso-23-U0XZH3PK0H	ENSG00000110367.13	iso-23-U0XZH3PK0H&hsa-miR-20a-5p\|0\|0	DDX6	sncRNA	mRNA	-0.624238	5.011761e-22	0.033899	0.632831	1	1

../_images/tutorial_advanced_run_17_6.png

../_images/tutorial_advanced_run_17_7.png

../_images/tutorial_advanced_run_17_8.png

../_images/tutorial_advanced_run_17_9.png

The only method that doesn’t require group name is transform, since all samples are cleaned together.

[22]:

df_data_test_clean, df_rsquareds = model.transform(
    df_data_test,
    df_samp_ann_test
)
display(df_data_test_clean)

	tRF-33-79MP9P9NH57SD3	iso-23-2HOMKBFPDP	rRF-18-WLLINPD4	tRF-33-Z3M8ZLSSXUOLD2	iso-22-X2EUIRIKQ	iso-22-BQ8DQWM4H	iso-24-YEQOW4HK2E	yRF-27-KR2Y4BZUNJQ	tRF-18-HR6HFRD2	iso-24-VY2ZSR672F	...	ENSG00000198712.1	ENSG00000228253.1	ENSG00000198899.2	ENSG00000198938.2	ENSG00000198840.2	ENSG00000212907.2	ENSG00000198886.2	ENSG00000198786.2	ENSG00000198695.2	ENSG00000198727.2
CEU_NA12812.1.MI	-0.620065	-0.310726	-0.249793	1.290901	0.281442	-0.050428	0.527796	-0.401979	-0.772192	-0.512847	...	0.158439	0.173913	0.427129	0.534144	0.454318	0.439413	0.549146	0.316825	0.381061	0.078790
CEU_NA12749.1.MI	-1.106119	-0.628111	-1.249008	0.616152	0.206401	-0.108065	0.103019	0.038530	1.169227	-0.064189	...	-0.344318	-0.269753	-0.155804	-0.453415	-0.155392	-0.505204	-0.384939	-0.579743	-0.510531	-0.332724
TSI_NA20510.3.MI	0.195529	-0.151787	0.250982	-0.509563	0.062368	-0.322359	-0.387207	0.716692	0.794845	-0.685646	...	0.132211	0.255880	0.054310	-0.062379	0.110532	0.161819	0.200947	0.060922	0.095826	-0.142283
GBR_HG00155.1.MI	-0.053329	-0.300658	0.170739	-0.383377	-0.068466	-0.254343	-0.439967	-0.292031	1.943921	-0.449839	...	-0.092572	-0.054234	0.190476	0.035316	0.234884	0.145346	0.030836	-0.008509	-0.025680	0.144530
FIN_HG00350.3.MI	0.525551	-0.086287	0.157886	-0.400232	0.006228	-0.445866	-0.210717	0.212302	1.625556	-0.428203	...	-0.038814	-0.153922	-0.146719	0.017803	0.096110	-0.301353	0.002004	-0.402903	-0.169281	-0.109961
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
CEU_NA12778.1.MI	0.581613	-0.307854	1.827755	0.375388	0.559242	0.283521	-0.132107	0.957126	0.674212	-0.213702	...	0.122245	-0.301193	0.274027	0.291671	-0.014586	-0.008767	0.404175	0.418945	0.170228	0.276137
TSI_NA20761.1.MI	0.427250	-0.000609	1.854137	-0.967633	0.286078	-0.553838	-0.271268	0.738555	0.066126	-0.495017	...	-0.095754	-1.027643	-0.271515	-0.459521	-0.555034	-0.464335	0.236649	0.014550	0.192458	-0.273718
FIN_HG00308.3.MI	0.049832	0.064105	0.216882	0.504403	-0.077997	-0.153369	-0.179680	-0.236491	0.524470	0.310773	...	0.224791	0.431358	0.144727	0.234761	0.124680	0.232421	0.155974	-0.221077	-0.224894	-0.307741
CEU_NA12399.7.MI	0.996266	-0.470806	-0.691879	0.213296	-0.218509	0.302927	0.066839	0.112897	-1.323507	0.208681	...	-0.306433	0.255695	-0.451077	-0.298684	-0.363276	-0.000983	-0.401184	-0.278655	-0.341041	-0.359523
YRI_NA19144.4.MI	-0.631015	-0.555298	-2.123311	-0.045040	-0.326791	-0.092634	-0.241750	0.372753	0.076177	-0.711939	...	-0.074233	-1.253004	-0.406206	0.013982	0.171947	-0.449277	-0.364824	-0.354861	0.004093	-0.504313

201 rows × 15770 columns