Open Access Article
Nicolas
Borisov
ab,
Yaroslav
Ilnytsky
cd,
Boseon
Byeon
e,
Olga
Kovalchuk
*cd and
Igor
Kovalchuk
*cd
aArmenian Bioinformatics Institute, 7 Ezras Hasratyan str., 0014, Yerevan, Armenia. E-mail: Nicolas.borissoff@abi.am
bVivan Therapeutics, (My Personal Therapeutics Ltd.), The Westworks, White City Place 195 Wood Lane, London, W12 7FQ, England, UK. E-mail: nikolay@mypersonaltherapeutics.com
cDepartment of Biological Sciences, University of Lethbridge, Lethbridge, Alberta T1K 3M4, Canada. E-mail: igor.kovalchuk@uleth.ca; olga.kovalchuk@uleth.ca; slava.ilyntskyy@uleth.ca; bbyeon@gmail.com
dPathway Rx., 2 Fortress Rise SW, Alberta T3H 4Z2, Canada
eBiomedical and Health Informatics, Computer Science Department, State University of New York, 2 S Clinton St, Syracuse, NY 13202, USA
First published on 1st October 2025
Although multi-omics analysis is popular for revealing diverse physiological effects and biomarkers in many branches of state-of-the-art molecular and cell biology and bioinformatics, there is still no consensus on a gold standard protocol for the integration of various multi-omics profiles into a uniformly shaped system bioinformatics platform. In the current study, we performed the integration of data on DNA methylation, and the expression of coding RNA (mRNA), micro-RNA (miRNA), and long non-coding RNA into a joint platform for calculation of signaling pathway impact analysis (SPIA) and drug efficiency index (DEI). We found that the mirrored and balanced DEI values fitted the DNA methylome data better than the original DEI. Additionally, the protein-coding mRNA-based values correlated more strongly with antisense lncRNA-based values than with miRNA-based values. The whole correlation between the mRNA-based and antisense lncRNA-based values was generally positive. This platform allowed integrative analysis of several levels of gene expression regulation of protein-coding genes and their regulators, including methylation and noncoding RNAs.
Consideration of multi-omics events at the integrated level is important because it provides a comprehensive understanding of biological systems by combining data from various omics layers used.1,2 Its advantages include examining multiple molecular levels simultaneously, offering a more complete picture than single-omics approaches.3–5 The multi-omics approach also helps cross-validate the findings from different omics layers, increases the reliability and accuracy of the results, and, second, improves the identification of robust biomarkers for disease diagnosis, prognosis, and treatment monitoring by considering multiple types of molecular data.6,7
Statistical and enrichment approaches include simple enrichment analysis and quantitative statistical analysis. At the current moment, the mostly qualitative approach based on Gene Ontology classification has largely gone out of favor.8,9 In contrast, quantitative statistical analysis using tools such as Integrated Molecular Pathway-Level Analysis (IMPaLA),10 Pathway Multiomics,11 MultiGSEA,12 PaintOmics13 and ActivePathways14 allows for integration of multiple omics layers to compute pathway enrichment scores, which provide statistical significance and visual representations of pathway activities.
Machine learning approaches involve supervised and unsupervised learning. Supervised learning techniques, such as DIABLO,15 or OmicsAnalyst,16 which apply the LASSO regression,17 use annotated (phenotype groups are used as class labels) datasets to predict pathway activities based on integrated multi-omics data, enhancing predictive performance and accuracy. Unsupervised learning methods, like clustering,16,18 principal component analysis (PCA),18 and tensor decomposition,18 discover latent features and patterns in multi-omics data without predefined labels.
Network-based approaches construct interaction networks from multi-omics data, identifying key regulatory nodes and pathways. A realistic picture of pathway activation can only be revealed by topological network-based methods that consider the biological reality of pathways by incorporating data on the type and direction of protein interactions.19 Not surprisingly, topology-based methods have outperformed their counterparts in benchmarking tests.19 Different researchers suggested a wide repertoire of algorithms and toolkits for quantitative pathway topology-based assessment of pathway activation levels (PALs), like Oncobox,20 topology analysis of pathway phenotype association (TAPPA),21 topology-based score (TBScore),22 pathway-express (PE),23 signaling pathway impact analysis (SPIA),24in silico pathway activation network decomposition analysis (iPANDA),25 Drug Efficiency Index (DEI),26,27etc. Such pathway activation level calculations utilize high-throughput gene expression or mutation profiles. Diverse methods, algorithms, and software for automated curation of pathway topology databases and uniformly shaped annotations of their content have also been developed.28,29
It may seem like it is sufficient to obtain the data from whole transcriptome sequencing (WTS, RNA-seq), as it allows evaluating the level of activation/inactivation of various pathways. However, it is known that ncRNAs, especially miRNAs, are able to regulate mRNA expression negatively through translational inhibition, and thus mRNA sequencing does not fully represent changes in the pathways. Different ncRNAs interfere with the gene expression process at different stages and with different affinities for distinct mRNAs. For example, small interfering RNAs (siRNAs) are RNA duplexes with typically 21–23 nucleotides that bind to a strictly specific mRNA molecule and prevent their movement from the nucleus to the cytoplasm; thus, mRNA is quickly cleaved in the nucleus as well as the cytoplasm.30,31 Although micro-RNAs (miRNAs) have almost the same length (19–25 nucleotides), they are not so gene-specific32 and bind to the target mRNA molecules in the cytoplasm, preventing translation and accelerating mRNA degradation by RNAases.30
In contrast to miRNAs, most antisense RNAs (asRNAs) are longer than 200 nucleotides, although shorter asRNAs also exist.33 Like siRNAs, asRNAs are gene-specific; like miRNAs, they bind to mRNA molecules in the cytoplasm and prevent translation. The influence of asRNAs on the abundance of mRNAs is controversial: although asRNA may stimulate mRNA cleavage, the complexing of asRNA with mRNA can protect mRNA from RNAase and inhibit its degradation.34–36 Also, asRNAs can bind to the DNA template strand, preventing the transcription machinery from producing mRNA. Therefore, asRNAs can affect the splicing of pre-mRNA, leading to different mRNA isoforms.
There are no examples for the incorporation of the results of DNA methylome, siRNAs, dsRNAs, asRNAs or miRNAs profiling into the analysis of dysregulated pathways. Integration of mRNA-seq data with siRNA-seq data may help better understand the transcription and translation events. In the current study (see Fig. 1), we report on the systemic multi-omics integration of protein-coding mRNA expression profiles, and non-coding RNA expression profiles, including micro-RNA and long non-coding RNA/anti-sense RNA (antisense lncRNA/asRNA) profiles, into the SPIA/DEI-based computational platform26,27 for pathway activation assessment and drug efficiency scoring.
The first term here is the p-value for the probability to obtain the observed or a greater number Nd of differentially expressed genes (between the pools of case and normal samples) randomly, assuming a hypergeometrical distribution of Nd. The second term is a summation over the perturbation factors (PF) for all genes g of the pathway K,
To obtain an estimator for pathway perturbation that is positive for an up-regulated pathway and negative for a down-regulated pathway, use the second term in the formula for the perturbation factor (PF) from the precious paragraph, resulting in the accuracy value,
| Acc(g) = PF(g) − ΔE(g). |
It can be shown that this accuracy vector may be expressed as follows:24
| Acc = B·(I − B)−1·ΔE, |
I is the identity matrix, and
The resulting score for pathway perturbation is calculated as follows:
.
672 uniformly processed human molecular pathways extracted from different source databases. It is the largest knowledge base of human pathways with annotated gene functions, i.e. ready for the pathway activation calculations. Superposition of the enclosed pathways formed an interactome graph of protein–protein interactions and metabolic reactions totaling 361
654 interactions and 64
095 molecular participants. All pathways were functionally classified according to their main underlying biological processes using the Gene Ontology (GO) tree. Each pathway node was algorithmically functionally annotated by a specific activation/repressor role index. This enables direct calculation of pathway activation levels (PALs, i.e. using the SPIA method) using human RNA/protein expression profiles.
Using the Drug Efficiency Index, DEI, software,26,27 the user can analyze custom expression data to evaluate SPIA scores in samples of interest against a built-in or custom set of controls and statistically evaluate differentially regulated pathways.
Considering the fact that small RNAs typically direct the methylation of specific loci, and that both non-coding RNA (ncRNA) and DNA methylation downregulate gene expression (Fig. 2), we suggested calculating the methylation-based and ncRNA-based SPIA values with the negative sign compared to standard, transcriptome/mRNA-based values, using the same pathway topology graphs: SPIAmethyl,ncRNA = −SPIAmRNA.
1. Calculate the pathway activation level (PAL) values for all molecular pathways (e.g. SPIA24).
2. Calculate the values of the pathway weight (wp) factor as follows. For pathways with a positive mean PAL score of the case samples, wp = ((number of case samples with a positive PAL score)/(total number of case samples)). For pathways with a negative mean PAL score of the case samples, wp = ((number of case samples with a negative PAL score)/(total number of case samples)).
3. Adjust the mean PAL score of each pathway by the weight factor,
| PALμ = mean(PAL)·wp. |
4. Perform the Student's t-test if the values of PALμ for the pool of case samples are different from 0 (for the pool of control samples, the values of PALμ are clearly equal to 0). During the Student's t-test, the following case classes are considered: (a) untreated case (U), e.g. the pathological state before drug application, should be far from the control (C); (b) treated case (T), e.g. the pathological state after drug application, should be close to the control.
The following output values result from such calculations:
(a) |tU| = absolute t-value for the Student's t-test for U-vs.-C profiles; (b) |tT| = absolute t-value for the Student's test for T-vs.-C profiles.
5. In addition to the first-generation DEI metric26 for individual drug activities,
, which is equal to 1 when tT = 0.
An alternative metric is called the mirrored DEI:27
The DEIM metric is equal to 1 when tT = −tU; this is the maximum possible value of this metric.
Similar to the previous DEI metric, DEIM = 0 when tT = tU, and DEIM = −1 when |tT| ≫ |tU|.
The third metric,27 balanced DEI,
is the mean value of the DEI and mirrored DEI. In our previous work,27 we validated the DEI, DEIM, and DEBB, methods, including their ability to distinguish clinically effective and ineffective treatments, the pathological and healthy samples, and treated and untreated patients.
641 distinct genes affected by the miRNAs.
The use of external normal references, or even synthetic controls, may require special cross-platform normalization methods, like those in our work,38 for a direct comparison. Although we made a lot of efforts in developing these normalization methods, they may, however, significantly impact the case-to normal log-fold-changes (LFC).38 This may introduce some unforeseen artifacts; therefore, we decided to utilize samples from the same cohort as a reference in the current study.
To make the first test for the methylation module, we used the data on the antiproliferative activity of the DNA hypomethylating agent 5-aza-2′-deoxycytidine (DAC)40 (see GEO dataset GSE198673). Li et al. (2023) tested whether DAC can inhibit the growth of clear cell renal cell carcinoma (ccRCC), both for the wild-type (WT) and knock-out (KO, SETD2−/−) variants, since the SETD2 (Su(var)3-9, Enhancer of Zeste, and Trithorax Domain Containing 2) gene is one of the major histone methyltransferases.40 Both WT and KO ccRCC cells were treated with 300 nM of DAC. Then the cell growth rate and DNA methylation profile were monitored for 40 days; DNA methylation and the expression of protein-coding mRNA were profiled on days 0, 5, 15, and 40 after the DAC treatment.
Additionally, we curated four other recently published cohorts of DNA methylation profiles, collected for myelodysplastic syndrome (MDS) (GSE119617), type 2 diabetes (GSE145746),41 multiple sclerosis (MS) (GSE151017),42 and chronic myelomonocytic leukemia (CMML) (GSE221269) – see Table 1.
| Paper reference | — | (Bansal et al., 2020)41 | (Bansal et al., 2020)42 | (Bansal et al., 2020)42 | (Ringh et al., 2021)42 | (Ringh et al., 2021)42 | (Ringh et al., 2021)42 | (Ringh et al., 2021)42 | — |
|---|---|---|---|---|---|---|---|---|---|
| GSE ID | GSE119617 | GSE145745 | GSE145745 | GSE145745 | GSE151017 | GSE151017 | GSE151017 | GSE151017 | GSE221269 |
| Disease | Myelo-dysplastic syndrome | Type 2 diabetes | Type 2 diabetes | Type 2 diabetes | Multiple sclerosis | Multiple sclerosis | Multiple sclerosis | Multiple sclerosis | Chronic myelo-monocytic leukemia |
| Drug | 5-Azacitidine | TGFB1 24 h | TGFB1 72 h | TGFB1 96 h | IFNb; relapse | IFNb; remission | Tysabr; remission | Other drugs; remission | Azathio-prine (AZA) |
| Untreated cases | 8 | 4 | 4 | 4 | 2 | 4 | 6 | 6 | 10 |
| Untreated controls | 5 | 4 | 4 | 4 | 44 | 44 | 44 | 44 | 5 |
| Treated cases | 8 | 4 | 2 | 2 | 4 | 10 | 6 | 6 | 10 |
| Treated controls | 5 | 4 | 2 | 2 | 44 | 44 | 44 | 44 | 5 |
| Methylation sites | 34 669 |
825 425 |
825 425 |
825 425 |
734 078 |
734 078 |
734 078 |
734 078 |
719 859 |
| Affected genes | 1488 | 23 039 |
23 039 |
23 039 |
23 314 |
23 314 |
23 314 |
23 314 |
22 390 |
| Paper reference | (Ma et al., 2015; Ma and Hu, 2023)43,44 | (Yu et al., 2021)45 | (Zhao et al., 2021)46 | (He et al., 2024)47 | (Liao et al., 2023)48 | (Wang et al., 2022)49 |
|---|---|---|---|---|---|---|
| GSE ID | GSE127905 | GSE164595 | GSE168404 | GSE194299 | GSE197671 | GSE205661 |
| Profiling platform | Illumina HiSeq X Ten | Illumina HiSeq 4000 | Illumina HiSeq 2500 | Illumina HiSeq 2000 | Illumina NovaSeq 6000 | Agilent-046064 miRNA microarray Agilent-052909 antisense lncRNA/mRNA mircoarray |
| Disease/sample type | Colon cancer/HCT116 cells with P14AS over-expression or/and AUF1 knockdown vs. controls | B cells treated and untreated with methylation inhibitors | Polycystic ovary syndrome/granulosa cells | Lung cancer | Heart failure/cardiomyocytes | Temporal lobe epilepsy with hippocampal sclerosis/brain tissue |
| mRNA profiles | Protein-coding mRNA, antisense lncRNA | Protein-coding mRNA, antisense lncRNA | Protein-coding mRNA, antisense lncRNA, miRNA | Protein-coding mRNA, antisense lncRNA | Protein-coding mRNA, antisense lncRNA | Protein-coding mRNA, miRNA |
| Disease cases | 4 | 8 | 5 | 3 | 8 | 6 |
| Controls | 4 | 12 | 5 | 3 | 8 | 9 |
| Number of gene targets of ncRNA | 1141 | 1079 | 5775 | 1381 | 1381 | 10 227 |
We then performed Gene Ontology (GO) enrichment analysis of antisense lncRNA molecular targets from these cohorts using the enrichGO software tool50 and Metascape online service.51
![]() | ||
Fig. 3 Methylated-vs.-unmethylated log 2-fold change (LFC) in DNA reads for WT and SETD2−/− KO ccRCC samples40 (GEO reference GSE198673). | ||
The DEI calculations require three types of profiles:
(a) Control samples (C), used as a reference for LFC computations for every gene expression.
(b) Untreated case samples (U) for the U-vs.-C comparison.
(c) Treated case samples (T) for the T-vs.-C comparison.
The study by Li et al. (2023) did not include any normal or healthy samples as a control.40 That is why we used the sample, which showed the slowest proliferation rate (SETD2−/−, exposed with 300 nM of DAC, five days after treatment), as a quasi-normal control reference (Table 3).
| Panel of Fig. 3 | Control samples (C) | Untreated case samples (U) | Treated case samples (T) |
|---|---|---|---|
| (A) | KO (SETD2−/−), treated with 300 nM of DAC, five days after treatment (the sample, which showed the slowest proliferation rate) | WT and KO, with no DAC 5, 15, and 40 days after DAC treatment of DAC-treated samples | WT and KO, with 300 nM of DAC 5, 15, and 40 days after DAC treatment |
| (B) | WT, with 300 nM of DAC 0, 5, 15, and 40 days after DAC addition | KO, with 300 nM of DAC 0, 5, 15, and 40 days after DAC addition |
We calculated the DEI values for the following combinations of untreated (U) and treated (T) case samples (Table 3). For the (A) experiment, the treatment procedure was DAC addition, and the T-vs.-U comparison implied juxtaposition of samples which had received DAC, and those which had not. For the (B) experiment, the treatment procedure was the knockout (KO) of SETD2, and the T-vs.-U comparison used juxtaposition of SETD2−/− (KO) and WT samples.
Based on our group comparisons, we demonstrated the beneficial role of both DAC (Fig. 4) and SETD2 knockout (Fig. 4(B)) for inhibition of cell proliferation. We observed this effect in terms of the drug efficiency index (DEI), as well as of the mirrored (DEIm) and balanced (DEIb) modifications of DEI.26
![]() | ||
| Fig. 4 Drug efficiency index (DEI), mirrored DEI (DEIm), and balanced DEI (DEIb) for two comparisons of ccRCC methylome profiles (Table 3, GSE198673). Panel (A): DAC-treated vs. untreated samples. Panel (B): SETD2−/− KO vs. WT samples. | ||
Other four DNA methylation case-vs.-control cohorts (Tables 4 and 5, totaling 82 case samples and 76 control samples) confirm the more adequate role of DEIm and DEIb values (compared to the old DEI metric) for the assessment of drug activity in such different diseases as MDS, type 2 diabetes, MS, and CMML. In particular, the DEIm and DEIb metrics were always positive for all these cohorts except in two cases: (1) relapsed MS and (2) type 2 diabetes at the longest time after drug administration (Table 5).
| Reference | — | (Bansal et al., 2020)41 | (Bansal et al., 2020)41 | (Bansal et al., 2020)41 | (Ringh et al., 2021)42 | (Ringh et al., 2021)42 | (Ringh et al., 2021)42 | (Ringh et al., 2021)42 | — |
|---|---|---|---|---|---|---|---|---|---|
| GSE ID | GSE119617 | GSE145745 | GSE145745 | GSE145745 | GSE151017 | GSE151017 | GSE151017 | GSE151017 | GSE221269 |
| Disease | Myelo-dysplastic syndrome | Type 2 diabetes | Type 2 diabetes | Type 2 diabetes | Multiple sclerosis | Multiple sclerosis | Multiple sclerosis | Multiple sclerosis | Chronic myelo-monocytic leukemia |
| Drug | 5-Azaci-tidine | TGFB1 24 h | TGFB1 72 h | TGFB1 96 h | IFNb; relapse | IFNb; remission | Tysabr; remission | Other drugs; remission | Azathio-prine (AZA) |
| tU | 2.35 | 3.05 | 3.05 | 3.05 | −0.57 | −2.19 | −2.85 | −2.85 | 2.91 |
| tT | −3.80 | 2.74 | −5.22 | 5.59 | −1.63 | 2.93 | −0.62 | 3.66 | −5.52 |
| DEI | −0.24 | 0.05 | −0.26 | −0.29 | −0.48 | −0.15 | 0.64 | −0.13 | −0.31 |
| DEIm | 0.53 | 0.03 | 0.48 | −0.17 | −0.32 | 0.71 | 0.24 | 0.75 | 0.38 |
| DEIb | 0.14 | 0.04 | 0.11 | −0.23 | −0.40 | 0.28 | 0.44 | 0.31 | 0.04 |
We analyzed the correlation between mRNA-based vs. miRNA-based, as well as between mRNA-based and antisense lncRNA-based values, at different levels of data aggregation (case-to-control LFC for each gene, and SPIA for pathways) in the six multi-omics cohorts listed in Table 2. For these cohorts, the correlation between the antisense lncRNA s-based and protein-coding mRNA-based values was generally higher than between the miRNA-based and protein-coding mRNA-based values (Fig. 4). Note that no false discovery rate (FDR) correction is required for p-values shown in Fig. 5. These correction methods, like the Benjamini–Hochberg one, provide more reliable marker sets for high-throughput profiles when multiple features, like distinct genes, are tested. However, in Fig. 5, we compare correlation coefficients between gene expression/pathway activation profiles, rather than the expression/pathway activation levels distinctly. Consequently, for single-value statistical tests, the BH correction is trivial: p_adj = p_raw, and no adjustment is needed. We added the corresponding explanation to the paper text, preventing the question about the BH adjustment from the readers.
![]() | ||
| Fig. 5 Spearman correlation between protein-coding mRNA values and ncRNA values (red – antisense lncRNA, blue – miRNA) for six GEO cohorts (see Table 2). (A) Gene expression levels; (B) case-to-control log-FC (LFC); (C) SPIA. The p-value is shown for two-sided Student's test between red and blue groups of correlation coefficients. | ||
Although it may seem counter-intuitive, the overall correlation between antisense lncRNA- and mRNA-based case-to-control LFCs was positive (Fig. 5(A)). Indeed, many authors found that antisense lncRNAs may increase the abundance of sequestered (and, therefore, inactivated) mRNA in the cytoplasm, not only in bacteria but also in mammals.34–36
Hence, we found that antisense lncRNA values correlated better with mRNA values than miRNA values correlated with mRNA values. Therefore, antisense lncRNAs may be more informative than miRNAs in the analysis of interference in signaling pathway activation caused by the non-coding transcriptome.
To reveal the gene expression modulating effect of antisense lncRNAs and miRNAs involved in the current study, we applied gene ontology enrichment analysis according to the enrichGO method to the targets of antisense lncRNAs and miRNAs in six multi-omics cohorts listed in Table 2. For the GO analysis, we embraced different sets of genes: (A) all antisense lncRNA; (B) those genes, which have high correlation (the top 25% quantile) in the expression level between the corresponding protein-coding mRNA and miRNA values; and (C) those genes, which have high correlation (the top 25% quantile) in the expression level between the corresponding protein-coding mRNA and antisense lncRNA values (Fig. 6). We showed that the overall GO terms for the options A, B, and C overlap significantly (Fig. 6(D)). Note also that the most explicitly manifested GO terms for all these options are related to the developmental processes.
![]() | ||
| Fig. 6 GO enrichment analysis of antisense lncRNA molecular targets combined from six multi-omics cohorts (Table 2). (A) All antisense lncRNA targets; (B) top quartile of positively correlated genes between protein-coding mRNA and miRNA; (C) top quartile of positively correlated genes between protein-coding mRNA and antisense lncRNA; (D) intersection of GO terms shown in panels A–C. | ||
Table S1 contains results for GO annotation of target genes for antisense lncRNA from these six multi-omics cohorts (Table 2), which we obtained using the Metascape online service for GO analysis.51 We provided (see Table S2, Fig. 7) the target gene statistics for antisense lncRNA and KEGG pathways that we curated in our pathway database. This analysis reveals the high enrichment levels for cancer-related pathways, which comprise 11 out of 20 top enriched signalling cascades (Fig. 7).
The integration of these diverse data types is crucial for understanding complex biological processes and identifying the molecular pathways involved in various diseases. By combining omics profiles, researchers can gain a comprehensive understanding of pathway activations and the complex molecular mechanisms underlying various diseases.7 This holistic approach enhances the accuracy and robustness of pathway activation assessments, providing critical insights for personalized medicine and therapeutic development.55,56
In the current work, we investigated the integration of multi-omics profiles into the calculation of pathway activation levels according to the SPIA method.24 Specifically, we used our multi-omics SPIA platform26,27 to integrate mRNA, ncRNA and antisense lncRNA data. All these additional (beyond the standard protein-coding mRNA) regulatory processes i.e., DNA methylome, miRNAs, and antisense lncRNAs, theoretically inhibit gene expression, that is why we calculated the SPIA values for these additional profiles with the sign opposite to SPIA values based on protein-coding mRNA data.
For the multi-omics-based SPIA assessments, we obtained the following results. First, similar to the post-traumatic stress disorder (PTSD) protein-coding mRNA-based data,26 the mirrored and balanced DEI values proved more adequate than the original DEI values.27 Second, the protein-coding mRNA-based values had better correlation with antisense lncRNA -based values than with miRNA-based values. Also, the correlation between the mRNA-based and antisense lncRNA -based values was mostly positive. This surprising effect may be caused by the antisense lncRNA-dependent sequestration of inactivated mRNA in the cytoplasm, which may artifactually inflate mRNA abundance estimates.34–36
In contrast, when correlation is attempted at the level of the whole genome, it is more difficult to demonstrate. Analysis of gene expressions and methylation pattern in horse sarcoids showed significant negative correlations between DNA methylation at the promoter regions and mRNA levels, with the R of ∼−0.23.61 Similar correlation in DNA methylation in the introns showed a much weaker negative correlation with gene expression (∼−0.1), while no significant correlation was found between DNA methylation in exons and gene expression. The authors used MethGET (Methylation and Gene Expression Teller) software.62 This software appeared to be superior to several other previously published tools for DNA methylation analysis such as COHCAP,63 PiiL,64 and ViewBS.65
Integration of DNA methylation and gene expression data is not a simple task and to date has been attempted with various degrees of success. Sajedi et al. (2023) developed the iNETgrate package that allows to integrate data from all genes, simultaneously building a comprehensive gene-level network. However, they do not rely on complex pathway topology graphs, preferring mostly statistical analysis such as correlations and principal components.66 They utilized data from five independent human cohorts (cancer- and Alzheimer-related datasets) to understand the contribution of epigenome to the survival outcomes. When they analyzed the modalities individually based either on gene expression or DNA methylation, they achieved the p-value of 10−4, while when they utilized both modalities (DNA methylation and gene expression), they were able to increase the significance to the p-value of 10−7. This work demonstrated the power multi-omics data integration for the prognostic prediction capabilities of the survival model.
Several other multi-omics data integrations were proposed. Zachariou et al. (2018) developed a “super network”, attempting integration of six different types of interactions to identify significant pathways related to a disease.67 Their method allows pathway analysis on top genes based on the quantity of shared information between gene pairs utilizing gene expression analysis. It was not demonstrated, however, whether this super network can integrate methylation data.
Ma et al. (2017) developed Edge-Based Module Detection Network (EMDN) for the analysis of differentially co-methylated and co-expressed networks.68 After constructing multiple networks, the standard modules within these networks are defined as epigenetic modules. The authors compared the EMDN performance with Consensus clustering (CSC),69 the multiple-modularity method (MolTi)70 and spectral clustering (SPEC)71 modules. EMDN outperformed other artificial networks, demonstrating higher accuracy.68 EMDN, as many other similar algorithms, relies on the establishment of differentially methylated or expressed genes, and thus requires paired comparison, such as normal versus disease samples or case versus control, or untreated versus treated. Our approach, unless we need to calculate the DEI values, as well as iNETgrate approach does not have this limitation.
Another fairly advanced model, INTEND (Integration of Transcriptomic and Epigenomic Data), which addresses the integration of disjointed methylation and gene expression data, was recently published.72 While INTEND integrates data from the same individual for multiple data sets, it does not use any information matching methylation and gene expression data to the same individual. Instead, INTEND learns a predictive model between the two by training on data sets having a large number of gene expression and methylome data sets from the same analyzed cohorts. At the first step, INTEND is trained to predict gene expression data based on methylation data located close to genes. Then, it compares predicted expression to the expression of the same set of genes stemming from transcriptome analysis. The authors evaluated INTEND performance on cancer datasets spanning 4329 patients by comparing it with four other integration methods: LIGER, Seurat v3, JLMA and MMD-MA, and demonstrated INTEND to be superior to all four.72 However, the INTEND utility relies on ML procedures, such as LASSO regression rather than following the signal propagation along multiple highly branched pathways and networks.
miARma-Seq integrates the results of interaction between mRNA and miRNA based on the information stored in miRGate database;77 it relies on established negative correlation. miRGate includes information on miRNA sequences from mRBase78 and 3-UTR sequences from EnsEMBL,79 as well as the information about experimentally validated targets stored in miRTarbase,80 Tarbase81 and OncomiDB.82
The authors profiled samples of colorectal cancer and identified 29 differently expressed miRNAs and 368 mRNA-encoding genes; they found that out of total possible 10
672 correlations, ∼60% were statistically significant, with many of them having a positive correlation, rather than the expected negative correlation.76
More direct integration of mRNA and miRNA data could be possible if all associations between miRNA and mRNA are known. Since miRNA can target hundreds of mRNAs, and many different miRNAs can target the same mRNA, and also, since many miRNAs positively correlate with the expression of some genes (perhaps by targeting the miRNAs that inhibit those miRNAs), direct estimates of the effects of miRNAs on mRNA expression are hard to calculate.
While substantial effort was made to integrate data from miRNA and mRNA sequencing, almost no effort was made to do the same for lncRNAs and mRNA sequencing. This could be due to the fact that there is no clear relationship between lncRNAs and mRNA expression similar to miRNA/mRNA pairs. However, we found that protein-coding mRNA-based values are better correlated with the long antisense non-coding RNA-based ones rather than with micro-RNA-based. This effect was revealed because of the use of our bioinformatics integrated platform, which allows multi-omics analysis in terms of DEI and SPIA values.
Supplementary information is available. See DOI: https://doi.org/10.1039/d5mo00151j.
| This journal is © The Royal Society of Chemistry 2025 |