Le
Ou-Yang
^{ab},
Hong
Yan
^{b} and
Xiao-Fei
Zhang
*^{c}
^{a}College of Information Engineering, Shenzhen University, Shenzhen, China
^{b}Department of Electronic and Engineering, City University of Hong Kong, Hong Kong, China
^{c}School of Mathematics and Statistics & Hubei Key Laboratory of Mathematical Sciences, Central China Normal University, Wuhan, China. E-mail: zhangxf@mail.ccnu.edu.cn
First published on 11th November 2016
Exploring how the structure of a gene regulatory network differs between two different disease states is fundamental for understanding the biological mechanisms behind disease development and progression. Recently, with rapid advances in microarray technologies, gene expression profiles of the same patients can be collected from multiple microarray platforms. However, previous differential network analysis methods were usually developed based on a single type of platform, which could not utilize the common information shared across different platforms. In this study, we introduce a multi-view differential network analysis model to infer the differential network between two different patient groups based on gene expression profiles collected from multiple platforms. Unlike previous differential network analysis models that need to analyze each platform separately, our model can draw support from multiple data platforms to jointly estimate the differential networks and produce more accurate and reliable results. Our simulation studies demonstrate that our method consistently outperforms other available differential network analysis methods. We also applied our method to identify network rewiring associated with platinum resistance using TCGA ovarian cancer samples. The experimental results demonstrate that the hub genes in our identified differential networks on the PI3K/AKT/mTOR pathway play an important role in drug resistance.
In recent years, the accumulation of gene expression profiles has provided rich functional information on gene regulatory network analysis. Gaussian graphical models, which assume the available gene expression data are generated from a multivariate Gaussian distribution, have been widely used for estimating gene regulatory networks from gene expression data.^{14,15} Based on Gaussian graphical models, two genes are conditionally independent given the other genes if and only if the corresponding entry of the inverse covariance matrix, or precision matrix, is zero. Therefore, we can infer gene regulatory networks by estimating the corresponding precision matrices.^{16} Gaussian graphical models have the advantage of inferring direct dependencies between genes. However, most existing methods estimate a gene regulatory network for a specific condition, and do not consider the network rewiring between different conditions or disease states. Differential network analysis, which aims to gain insights into altered regulatory mechanisms between two different conditions or disease states, has recently emerged as an important complement to differential expression analysis.^{17} Several computational approaches have been proposed for differential network analysis.^{9,13,18–21} Based on Gaussian graphical models, the differential network between two conditions can be formulated as the difference between the two corresponding precision matrices. Thus, a straightforward strategy is to separately estimate the precision matrix for each group and then calculate their difference.^{13,18} To exploit the similarity between gene networks corresponding to distinct but related conditions, Danaher et al. introduced a joint graphical lasso model to estimate multiple graphical models that share certain characteristics, such as the locations or weights of non-zero edges.^{22} However, most of the above methods need to infer the network for individual conditions and assume that all the gene networks are sparse. But real gene networks often contain genes that interact with many other genes, which may violate the sparsity condition. Furthermore, there are many interactions that are not strong in a single static condition, but the changes in these interactions are large.^{9} Direct application of these sparse regularized network inference methods may not be able to detect these changes.
The direct estimation of differential networks has emerged as an alternative approach for differential network analysis. Zhao et al.^{20} proposed a differential network estimation method to directly estimate the difference between precision matrices. Their method does not require separate estimation of each precision matrix and does not require each precision matrix to be sparse. However, this estimator is computationally expensive, especially for large scale data sets.^{9,20} Recently, Yuan et al.^{9} proposed a new convex function to directly estimate the precision matrix difference based on a D-trace loss function.^{23} Their model can be solved efficiently using the alternating direction method of multiplier.^{9}
Recently, advances in biotechnology have allowed biomedical researchers to collect a wide variety of gene expression data on a common set of patients from different platforms.^{24} For example, The Cancer Genome Atlas (TCGA) has provided gene expression data collected from multiple platforms (e.g., Agilent 244K Custom Gene Expression G450, Affymetrix HT Human Genome U133 Array Plate Set and Affymetrix Human Exon 1.0 ST Array), which enables the reconstruction of gene regulatory networks and differential networks between two patient groups from different data platforms.^{25,26} Gene expression data collected from different platforms may share some common characteristics; integrating multiple gene expression profiles could help to identify differential networks more accurately. Although some previous methods can be applied to infer gene regulatory networks based on gene expression profiles collected from multiple data platforms,^{22,25} few methods are designed for estimating differential networks from multiple data platforms. We need to develop new computational methods that can integrate information on multiple gene expression profiles to estimate the differential networks between two patient groups.
To address the above problems and make effective use of multi-platform gene expression data, in this study, we propose a novel multi-view differential network analysis model (MEDIA) to directly infer the differential network between two patient groups based on multiple gene expression data sets collected from different data platforms. Instead of assuming that each individual group-specific network is sparse, we only assume that the differential network between two patient groups is sparse. This assumption is reasonable since for different disease states or patient groups, the corresponding gene regulatory networks are very similar to each other.^{9,22} Based on the D-trace loss function,^{9,23} and the group lasso penalty,^{27} our model can exploit the characteristics shared by gene expression data collected from different types of platforms. We utilize the alternating direction method of multiplier (ADMM) to solve the optimization problem. To illustrate the effectiveness of MEDIA, we first perform simulation studies, and then apply MEDIA to TCGA ovarian cancer samples to identify network rewiring associated with platinum resistance. Through the simulation studies and real data analysis, MEDIA shows a superior performance compared to other state-of-the-art differential network inference methods, which demonstrates the effectiveness and power of our model. Moreover, we find that the hub genes in our identified differential networks on the PI3K/AKT/mTOR pathway play an important role in drug resistance.
Following ref. 9, we use a symmetric matrix Δ_{k} to denote the precision matrix difference (Σ^{k}_{Y})^{−1} − (Σ^{k}_{X})^{−1} and introduce a D-trace loss function L_{D}(Δ_{k},Σ^{k}_{X},Σ^{k}_{Y}), as follows:
L_{D}(Δ_{k},Σ^{k}_{X},Σ^{k}_{Y}) = ½〈Δ_{k}^{2},(Σ^{k}_{X}Σ^{k}_{Y} + Σ^{k}_{Y}Σ^{k}_{X})/2〉 − 〈Δ_{k},Σ^{k}_{X} − Σ^{k}_{Y}〉. | (1) |
Moreover, based on the assumption that the differential network between two patient groups is sparse, we need to introduce a lasso-type penalty on Δ_{k} for k = 1,…,K to encourage sparsity in each Δ_{k}. If we just apply l_{1} norm penalty on each Δ_{k} (i.e., ∥Δ_{k}∥_{1}) in addition to the above D-trace loss function, it is equivalent to applying the lasso penalized D-trace loss method on each platform separately. However, the inference of the individual differential network Δ_{k} does not well utilize the cross-platform information, which is crucial to produce accurate and reliable results. Although different data platforms might reflect the dependencies between genes in different scales, the differential networks inferred from K different data platforms should have the same sparse structure with different edge weights. Therefore, to effectively fuse gene expression data collected from multiple platforms, we might wish to jointly estimate a collection of differential networks Δ_{1},Δ_{2},…,Δ_{K}, where each Δ_{k} corresponds to the k-th platform. Here, to make effective use of the cross-platform information, the differential networks may be enforced to be sparsity-consistent. Based on this assumption, we formulated the following multi-view differential network analysis model (MEDIA). For the sake of convenience, we denote {^{k}_{X}}^{K}_{k=1}, {^{k}_{Y}}^{K}_{k=1} and {Δ_{k}}^{K}_{k=1} as {_{X}}, {_{Y}} and {Δ} respectively.
(2) |
(3) |
(4) |
(5) |
(6) |
(7) |
(8) |
(9) |
(10) |
(11) |
The D-trace loss estimate method is designed to directly estimate the difference between two precision matrices, but this method can only analyze each data platform separately. The comparison between the MEDIA and D-trace loss estimate method will show the gain of performance when MEDIA utilizes group sparse penalty to borrow strength across multiple data platforms. GGL is designed to jointly estimate multiple precision matrices that share certain characteristics, while MEDIA directly estimates multiple differential networks from multiple data platforms. Comparing MEDIA with GGL would show the benefit of directly inferring multiple differential networks without estimating the individual precision matrices. We do not compare MEDIA with fused graphical lasso^{22} and the method of Zhao et al.^{20} as a previous study has shown that the D-trace loss estimate method outperforms these two approaches.^{9} In the following experiments, when applying the GGL, precision matrices corresponding to multiple platforms are fitted for each group separately, and then we estimate that the differential network corresponding to each platform by calculating the difference between the inferred group-specific precision matrices. When applying the D-trace loss estimate method, the differential network is inferred for each platform separately. For MEDIA, we can directly infer multiple differential networks from multiple data platforms simultaneously. For GGL, we use the JGL function with ‘penalty = group’ from the R package JGL. To ease interpretation, similar to ref. 22, we reparameterize the tuning parameters for GGL, where and .
Following previous studies,^{9} we use several metrics to evaluate the performance of various algorithms. If ^{k}_{ij} is the (i,j)th entry of an estimator and δ^{k}_{ij} is the (i,j)th entry of the true Δ_{k}, for k = 1,…,K, the true positive rate (TPR), false positive rate (FPR), precision (Pre) and recall (Rec) are defined as
2. For k = 1,…,K, we repeat steps 3–5 to generate data sets for each data platform.
3. We generate a p × p symmetric matrix S^{k}_{X} to denote the adjacency matrix of the network corresponding to the first group. In particular, (S^{k}_{X})_{ij} = 0 if there is no edge between feature i and j, and a uniform distribution with support [−1, −0.5] ∪ [0.5,1] is used to generate the nonzero entry of S^{k}_{X}. We duplicate S^{k}_{X} into S^{k}_{Y}. Then we set the entries of S^{k}_{Y} that correspond to differential edges to be zeros or change their signs. Therefore, 10% of the entries in S^{k}_{X} and S^{k}_{Y} have different values.
4. We let d = min(λ_{min}(S^{k}_{X}),λ_{min}(S^{k}_{Y})), where λ_{min}(·) denotes the smallest eigenvalue of the matrix. To ensure positive definiteness, we set (Σ^{k}_{X})^{−1} = S^{k}_{X} + (0.1 + |d|)I and (Σ^{k}_{Y})^{−1} = S^{k}_{Y} + (0.1 + |d|)I, where I is a p × p identity matrix.
5. We generate n_{x} = n_{y} = n independent subjects for each group, i.e., x^{k}_{1},…,x^{k}_{n} ∼ N(0,Σ^{k}_{X}) and y^{k}_{1},…,x^{k}_{n} ∼ N(0,Σ^{k}_{Y}).
Following the criterion used in ref. 33, we can define platinum-based chemotherapy response groups. In particular, patients are defined as platinum sensitive if there is no evidence of disease progression within 6 months from the end of the last primary treatment, and the follow-up interval is at least 6 months from the date of last primary treatment. Patients with evidence of disease progression within 6 months from the end of primary treatment are defined as platinum resistant. Among the 514 patients, 340 patients have explicit platinum status, and 242 are platinum sensitive and 98 are platinum resistant. For each platform, the gene expression data within each sub-group of patients are standardized to have a mean of zero and a standard deviation of one.
In this study, we take a pathway-based analysis. We focus our attention on a specific pathway, namely the PI3K/AKT/mTOR pathway, which is frequently mutated or altered in ovarian cancer^{26} and is frequently implicated in resistance to anticancer therapies,^{34} to investigate the change in gene interactions between platinum sensitive and platinum resistant ovarian cancer patients. We downloaded the PI3K/AKT signaling pathway and the mTOR signaling pathway from the Kyoto Encyclopedia of Genes and Genomes database.^{35} Out of the 362 genes in the PI3K/AKT/mTOR pathway, 301 genes are included in our considered gene expression data sets. The standardized gene expression data for each sample can be downloaded from http://https://github.com/Zhangxf-ccnu/MEDIA.
Within biological networks, hub genes always play important roles in biological processes. Thus, we are interested in the biological significance of hub genes in our identified differential networks. The list of hub genes that have degrees greater than 5 in the differential networks is presented in Table 1. As the estimated differential networks describe the change of conditional dependencies among 301 genes between the platinum sensitive patient group and the platinum resistant patient group, we first analyze whether the hub genes in the identified differential networks are associated with drug resistance and cancer development. We collected 161 genes related to cisplatin resistance and 758 genes related to drug resistance from the database of Genomic Elements Associated with drug Resistance (GEAR), and collected 572 genes for which mutations have been causally implicated in cancer from the Cancer Gene Census (CGC) database.^{36} Among the 301 genes in the PI3K/AKT signaling pathway, there are 26 genes, 74 genes and 60 genes associated with cisplatin resistance, drug resistance and cancer development respectively. We observe that out of the 36 hub genes in our identified differential networks, 7 of them are related to cisplatin resistance, 14 of them are related to drug resistance and 12 of them are related to cancer (the corresponding results are shown in Table 1). Based on Fisher's exact test, we find that our identified hub genes are significantly enriched with all three types of cancer-related genes (the p-values are 0.0234, 0.0401 and 0.0439 respectively).
Evidence | Genes (degree of connectivity) |
---|---|
Here “GEAR_{cisplatin}” denotes the hubs detected by MEDIA that are associated with resistance to cisplatin, “GEAR_{drug}” denotes the hubs detected by MEDIA that are associated with resistance to drugs and “CGC” denotes the hubs detected by MEDIA that are causally implicated in cancer according to the Cancer Gene Census (CGC) database. | |
Hub genes detected by MEDIA | CDKN1A (9), TP53 (9), IRS1 (8), KIT (8), MAP2K1 (8), MYC (8), PRLR (8), RPS6KB1 (8), THBS4 (8), CCND2 (7), CCNE1 (7), COL4A6 (7), FGF13 (7), GNG12 (7), ITGA9 (7), LAMC3 (7), MET (7), PDGFA (7), PPP2R2A (7), BCL2L1 (6), CCND3 (6), COL4A2 (6), FGF20 (6), FGFR3 (6), GNB1 (6), IL6 (6), IL6R (6), ITGA6 (6), KITLG (6), PDPK1 (6), PPP2R5B (6), PTEN (6), RAC1 (6), RXRA (6), TSC2 (6), VEGFA (6) |
GEAR_{cisplatin} | CDKN1A (9), TP53 (9), MYC (8), BCL2L1 (6), CCND3 (6), IL6 (6), PTEN (6) |
GEAR_{drug} | CDKN1A (9), TP53 (9), KIT (8), MYC (8), RPS6KB1 (8), CCND2 (7), FGF13 (7), BCL2L1 (6), CCND3 (6), FGFR3 (6), IL6 (6), PTEN (6), RAC1 (6), VEGFA (6) |
CGC | TP53 (9), KIT (8), MAP2K1 (8), MYC (8), CCND2 (7), CCNE1 (7), MET (7), CCND3 (6), FGFR3 (6), PTEN (6), RAC1 (6), TSC2 (6) |
To give an example of the differential networks inferred by MEDIA on the PI3K/AKT/mTOR pathway, we show the top three hub genes (CDKN1A, TP53, IRS1) and their neighbors in Fig. 3. Note that if these three hub genes connect with other hub genes, we also show the neighbors of those hub genes. Therefore, there are six hub genes shown in Fig. 3, where CCNE1, FGF13 and COL4A6 are hub genes that connect with CDKN1A, TP53 and IRS1 respectively. Among the six hub genes shown in Fig. 3, CDKN1A and TP53 are known to be associated with cisplatin drug resistance, while FGF13 is known to be associated with drug resistance (Table 1). The other three hub genes (i.e., IRS1, CCNE1 and COL4A6) may be potential genes that are related to platinum resistance. IRS1 plays a key role in transmitting signals from the IGF1 receptors to the PI3K-Akt signalling pathway and the Erk MAP kinase pathway.^{37} As a signalling adapter protein, IRS1 can integrate different signalling cascades and has a functional role in cancer progression and platinum resistance.^{38} Ravikumar et al. have reported that IRS1 is an important mediator of ovarian cancer cell growth suppression and could be a potential effective target for chemotherapeutic intervention.^{31} Amplification of CCNE1 (cyclin E1) is the most prominent structural variant associated with primary treatment failure in high-grade serous ovarian cancer.^{39} Etemadmoghadam et al. have revealed that amplification of CCNE1 is a preadaptation that confers primary platinum resistance.^{40} We find that the associations between BCL2 and other two hub genes (FGF13 and CCNE1) undergo changes between the two patient groups (Fig. 3). Researchers have found that overexpression of BCL2 plays a role in the development of drug resistance.^{41,42} Eliopoulos et al. have revealed that BCL2 is frequently expressed in fresh biopsies of primary ovarian carcinoma and that it is an important determinant of drug-induced apoptosis thereby modulating resistance to chemotherapy.^{43} Okada et al. have reported that FGF13 plays a pivotal role in platinum resistance and in reducing intracellular platinum concentrations in cancer cells.^{44} Thus, the change in associations between BCL2 and two hub genes (CCNE1, FGF13) may be related to drug resistance. COL4A6 is involved in ECM-receptor interaction pathways. ECM can affect drug resistance by preventing drug penetration into the cancer tissue and the expression of COL4A6 is downregulated in drug-resistant cell lines.^{45} Among the seven neighbors of COL4A6 in the differential network, five of them (CCND3, FGF13, SPP1, IRS1 and CCNE1) are associated with ovarian cancer and drug resistance (e.g., CCND3 and FGF13 have been reported to be associated with drug resistance,^{44,46} SPP1 has been reported to be associated with ovarian cancer and cancer metastasis,^{47} and both IRS1 and CCNE1 are closely related to ovarian cancer and drug resistance^{38,40}). Therefore, COL4A6 and its neighbors in the differential network may potentially be used as a molecular target to improve the treatment of platinum-resistant tumors.
Besides these hub genes in the differential network, genes whose associations with hub genes result in changes between two patient groups could also be possible therapeutic targets in ovarian serous carcinoma. For example, FGFR1, a neighbor of CDKN1A in the differential network, has been found to associate with resistance to endocrine therapy in ER+/FGFR1-amplified breast cancer.^{48} CCND1, another neighbor of CDKN1A in the differential network, has been previously reported in ovarian cancer for its role in cisplatin resistance.^{49} Therefore, it is of interest to study how the associations between hub genes and their neighbors in the inferred differential networks correlate with platinum response in ovarian cancer.
We compare MEDIA with D-trace loss and GGL on the ovarian cancer data. For GGL, we run it separately for each patient group and each time it is applied across all the three platforms, then we estimate the differential network corresponding to each platform by calculating the difference between the inferred group-specific precision matrices. For D-trace loss, we run it separately for each platform and each time it is applied across the two patient groups. As suggested by Yuan et al.,^{9} the tuning parameter of D-trace loss is tuned by ESCV. Following Danaher et al.,^{22} the tuning parameters of GGL are tuned using the Akaike information criterion (AIC).
A common challenge in evaluating the differential network inference method comprehensively using real data is the lack of a gold standard. As we cannot obtain the true gene networks in the platinum-sensitive and platinum-resistant tumors, it is difficult to compare different methods according to the accuracy of the identified differential networks. In this study, we adopt an alternative strategy to evaluate the performance of various differential network analysis methods. In particular, for each method, we count the number of hub nodes in its reconstructed differential networks that match with known drug resistance-related or cancer-related genes. Therefore, a method that could well capture known functionally important genes may have a better performance in differential network identification. For D-trace loss and GGL, we consider the first 36 genes that have the highest degrees of connectivity as hub genes. From Table 2, we find that by considering the information shared by different data platforms and estimating the differential network between two patient groups directly, MEDIA can detect more cisplatin resistance-related, drug resistance-related and cancer-related genes than D-trace loss and GGL.
Methods | GEAR_{cisplatin} | GEAR_{drug} | CGC |
---|---|---|---|
MEDIA | 7 | 14 | 12 |
D-trace loss | 6 | 11 | 7 |
GGL | 2 | 6 | 8 |
Besides graphical models, correlation-based approaches have also been developed to identify gene regulatory networks and differential networks. Although these approaches have addressed some biological problems, each group-specific network is inferred by computing the pairwise correlation (e.g., Pearson correlation coefficient or mutual information) between each pair of genes, which includes both direct and indirect dependencies.^{50} To address this problem, conditional mutual information (CMI) has been proposed to infer the nonlinear direct dependencies among genes.^{51} Recently, Zhang et al.^{52} and Zhao et al.^{53} proposed new concepts with new measures (conditional mutual inclusive information and part mutual information) of causal strength to overcome the underestimation problem of CMI. Compared with graphical models, correlation-based approaches are much easier to understand. To infer differential networks, a typical method uses some correlation-based approaches to compute the group-specific gene regulatory networks and estimates the differential networks by edge-wise subtraction of the group-specific gene networks.^{19,54,55} However, estimating each group-specific gene regulatory network separately will totally ignore the similarity between the two group-specific networks. On the other hand, based on graphical models, the differential network estimation problem can be formulated as a principle statistical learning problem. The similarity between the two groups and the consistency across multiple data platforms can be exploited by imposing a group sparsity penalty on the difference between group-specific networks. Moreover, other structures of interest can be easily imposed on the resulting differential networks by utilizing suitable penalty functions. Therefore, in this study, we focus on developing a graphical model that can directly estimate the differential network between two groups of samples and borrow strength across multiple data platforms to achieve more reliable results. As our model is based on the D-trace loss function and group lasso penalty, the comparison with D-trace loss and GGL is a compelling illustration of how graphical models improve their performance by imposing reasonable penalty functions.
Recently, Liu et al.^{21} developed a statistical method to construct sample-specific networks which reflect the variation between each single sample and the reference network. Personalized characterization of diseases is the key to achieving personalized medicine. However, similar to traditional differential network estimation methods and graphical models, our model can only infer an aggregated network for multiple samples. In the future, we will investigate how to utilize data collected from multiple platforms to identify sample-specific differential networks.
Footnote |
† Electronic supplementary information (ESI) available. See DOI: 10.1039/c6mb00619a |
This journal is © The Royal Society of Chemistry 2017 |