Mingyi Wang,a Vagner Augusto Benedito,ab Patrick Xuechun Zhaoa and Michael Udvardi*a
aPlant Biology Division, The Samuel Roberts Noble Foundation, Inc., 2510 Sam Noble Parkway, Ardmore, OK 73401, USA. E-mail: mudvardi@noble.org; Fax: +1 580 224 6692; Tel: +1 580 224 6655
bGenetics & Developmental Biology Program, Plant & Soil Sciences Division, West Virginia University, 2090 Agricultural Sciences Building, Morgantown, WV 26506, USA
First published on 19th February 2010
Recently, simplified graphical modeling approaches based on low-order conditional (in-)dependence calculations have received attention because of their potential to model gene regulatory networks. Such methods are able to reconstruct large-scale gene networks with a small number of experimental measurements, at minimal computational cost. However, unlike Bayesian networks, current low-order graphical models provide no means to distinguish between cause and effect in gene regulatory relationships. To address this problem, we developed a low-order constraint-based algorithm for gene regulatory network inference. The method is capable of inferring causal directions using limited-order conditional independence tests and provides a computationally-feasible way to analyze high-dimensional datasets while maintaining high reliability. To assess the performance of our algorithm, we compared it to several existing graphical models: relevance networks; graphical Gaussian models; ARACNE; Bayesian networks; and the classical constraint-based algorithm, using realistic synthetic datasets. Furthermore, we applied our algorithm to real microarray data from Escherichia coli Affymetrix arrays and validated the results by comparison to known regulatory interactions collected in RegulonDB. The algorithm was found to be both effective and efficient at reconstructing gene regulatory networks from microarray data.
To circumvent these problems, simplified graphical models, such as RNs, GGMs and low-order conditional (in-)dependence models, have received attention for practical use.12–14 RNs (or co-expression networks) are the simplest approach, and are constructed by computing a similarity score for each pair of genes, e.g., the correlation or mutual information between expression profiles. If the similarity is above a certain threshold, the pair of genes is connected in the graph; if not, it remains unconnected. RNs are relatively easy to calculate and reasonably accurate, even when the number of genes is much larger than the number of samples. The results from RNs agree well with functional similarity,15 and many co-expression relationships are conserved over evolution, supporting the conclusion that they represent biologically-meaningful networks.2 However, RNs contain only limited information about the underlying biological mechanisms, since the effect of other genes on the relationship between two genes is ignored. For example, from similarity of expression profiles alone, we cannot distinguish between direct and indirect relationships. In contrast, GGMs can identify a direct correlation between two genes after accounting for the impact of all other genes in the model. In this model, each gene pair is tested for conditional independence (CI) given the data from all other genes. From these tests, one can tell whether the correlation between two genes is direct or mediated through other genes. A problem with GGMs is that full conditional models are hard to estimate if the number of samples is small compared to the number of genes.3,16 Low-order conditional dependence models represent a compromise between RNs and GGMs and are capable of distinguishing direct from indirect correlations between any two genes after correcting for the influence of a third gene only. Thus, in contrast to GGMs, low-order conditional independence models do not consider the effects of all other genes on the correlation between any two genes. This facilitates the study of dependence patterns in a more complex and exhaustive way than with only pair-wise correlation-based relationships (i.e. RNs), while maintaining high accuracy even from few observations. In this approach, modeling is limited to 0–1 order conditional independencies (such models are thus also called 0–1 graphs5). This simplification avoids the need to carry out statistically unreliable and computationally costly searches for conditional independence over large subsets.
In contrast to BN models, the output from most simplified graphical models contains undirected edges between nodes/variables and provides no means to distinguish between response variables and covariates and, thus, between cause and effect. This makes it difficult for biologists to discern regulatory relationships between genes. To redress this difficulty, we revisited basic concepts used in constraint-based algorithms,17 an important offshoot of BN learning methods, in which dependencies and conditional dependencies are tested in the data and directed graphs are built accordingly. The PC-algorithm (after its authors Peter and Clark) proposed in ref. 17 is a well-known example. However, for the PC-algorithm, in the worst case, all possible combinations of the conditioning set need to be examined which would require an exponential number of tests. Consequently, it is hard to apply the PC-algorithm to large gene expression datasets. Therefore, we developed an algorithm that estimates causal relationships based on a low-order constraint-based approach, in which low-order CI tests rather than full-order CI tests are required. The algorithm has high computational efficiency but still finds most causal relationships.
Let V denote a non-empty finite set of random variables. A Bayesian network (BN) for V is defined by a pair 〈G,Θ〉. The structural model is a directed acyclic graph (DAG) G = (V,E), in which nodes represent variables in V (in a BN, variable and node can thus be used interchangeably) and E is the set of edges between nodes in V. We use the notation X → Y if and only if there is a directed edge between two nodes X and Y, and X–Y if and only if there is an undirected edge between X and Y. The parents of a node X (written Parents(G,X)) are the set of nodes that have directed edges to X. The adjacency set of a node X in graph G, denoted by Adjacencies(G,X), is the set of all nodes directly connected to X by an edge. The elements of Adjacencies(G,X) are also called neighbors of X or adjacent to X. A sequence of edges connecting k nodes X1,…,Xk is called a path from X1 to Xk. Y is called a descendant of X, and X an ancestor of Y, if there is a path from X to Y; Y is called a non-descendant of X if Y is not a descendant of X. For each node X there is a probability distribution at that node given the state of its parents in G, denoted by P(X|Parents(G,X)), and Θ is the set of parameters specifying all these probabilities. BNs follow the Markov condition, which states that, given its parents, each variable is independent of its non-descendants. Under the Markov assumption, each BN specifies a decomposition of the joint distribution into the local distributions of the nodes, in a unique way: P(V) = ΠX∈VP(X|Parents(G,X)).
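To make the factorization concrete, the following minimal Python sketch (with a hypothetical three-node network X → Z ← Y and made-up conditional probability tables) evaluates P(V) as the product of local conditional distributions and checks that it sums to one over all assignments.

```python
import itertools

# Hypothetical DAG X -> Z <- Y, given by the parent set of each node.
parents = {"X": (), "Y": (), "Z": ("X", "Y")}

# Illustrative conditional probability tables: P(node = 1 | parent values).
cpt = {
    "X": {(): 0.3},
    "Y": {(): 0.6},
    "Z": {(0, 0): 0.1, (0, 1): 0.7, (1, 0): 0.8, (1, 1): 0.95},
}

def joint(assign):
    """P(V = assign) as the product P(X|Parents(G,X)) over all nodes."""
    p = 1.0
    for node, pa in parents.items():
        p_one = cpt[node][tuple(assign[q] for q in pa)]
        p *= p_one if assign[node] == 1 else 1.0 - p_one
    return p

# The factorized joint distribution sums to 1 over all 2^3 assignments.
total = sum(joint(dict(zip("XYZ", v))) for v in itertools.product((0, 1), repeat=3))
print(round(total, 10))  # 1.0
```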
It is necessary to give a brief description of the conditional independence (CI) relation. X and Y are said to be conditionally independent given S (where X ∈ V, Y ∈ V and S ⊆ V\{X,Y}) if P(S) ≠ 0 and one of the following holds: (1) P(X|Y,S) = P(X|S) and P(X|S) ≠ 0, P(Y|S) ≠ 0; (2) P(X|S) = 0 or P(Y|S) = 0. This CI relation is denoted by I(X,Y|S). A CI relation is characterized by its order, which is simply the number of variables in the conditioning set S.
A criterion called d-separation captures exactly the CI relationships that are implied by the Markov condition. We say X and Y are d-separated by a node set S ⊆ V\{X,Y} in G if every path between X and Y is blocked by S. A path between X and Y is blocked by S if it contains a node W such that one of the following holds: (1) W ∈ S and W does not have converging arrows along the path between X and Y, or (2) W has converging arrows along the path and neither W nor any of its descendants is in S. Here, we say a node W has converging arrows along a path if two edges on the path point to W. A probability distribution Θ on V is said to be faithful with respect to a graph G if the conditional independencies of the distribution can be inferred from d-separation in the graph G and vice-versa. More precisely, faithfulness of Θ with respect to G means: for any X,Y ∈ V with X ≠ Y and any set S ⊆ V\{X,Y}, X and Y are conditionally independent given S if and only if node X and node Y are d-separated by the set S.
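As an illustration of this definition, the following Python sketch checks d-separation by brute force on a small hypothetical DAG: it enumerates every path between two nodes and tests whether each path is blocked by the conditioning set. The graph and node names are invented for the example, and the enumeration only mirrors the definition rather than being efficient.

```python
# Hypothetical DAG given as directed (parent, child) edges.
edges = {("X", "Z"), ("Y", "Z"), ("Z", "T"), ("X", "T"), ("Y", "W")}

def descendants(v):
    """All nodes reachable from v along directed edges."""
    out, stack = set(), [v]
    while stack:
        u = stack.pop()
        for a, b in edges:
            if a == u and b not in out:
                out.add(b)
                stack.append(b)
    return out

def paths(x, y, path=None):
    """All simple paths between x and y, ignoring edge direction."""
    path = path or [x]
    if path[-1] == y:
        yield path
        return
    for a, b in edges:
        for u, v in ((a, b), (b, a)):
            if u == path[-1] and v not in path:
                yield from paths(x, y, path + [v])

def blocked(path, s):
    """True if the path contains a node W satisfying condition (1) or (2)."""
    for i in range(1, len(path) - 1):
        w = path[i]
        collider = (path[i - 1], w) in edges and (path[i + 1], w) in edges
        if (not collider and w in s) or \
           (collider and w not in s and not (descendants(w) & s)):
            return True
    return False

def d_separated(x, y, s):
    return all(blocked(p, set(s)) for p in paths(x, y))

print(d_separated("X", "Y", set()))   # True: the collider at Z blocks X-Z-Y
print(d_separated("X", "Y", {"Z"}))   # False: conditioning on the collider opens the path
```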
The nodes W, X and Y form a v-structure in a DAG G when X → W ← Y is the subgraph of G induced by W, X and Y. Two DAGs are equivalent if and only if they represent the same d-separation statements. The equivalence class of a DAG G is the set of DAGs that are equivalent to G. Even given an infinite number of observations, we cannot distinguish among the different DAGs of an equivalence class. Using published results,19 we can characterize equivalence classes more precisely: two DAGs are equivalent if, and only if, they have the same skeleton and the same v-structures. The skeleton of any DAG is the undirected graph resulting from ignoring the directionality of every edge. A common tool for visualizing equivalence classes of DAGs is a partially directed acyclic graph (PDAG), which is a graph that contains both directed and undirected edges. More than one PDAG may correspond to the same equivalence class, because some undirected edges can sometimes be oriented further. Thus, the completed PDAG (CPDAG) was proposed to represent an equivalence class uniquely.20 The CPDAG corresponding to an equivalence class is the PDAG consisting of a directed edge for every compelled edge in the equivalence class, and an undirected edge for every reversible edge in the equivalence class. A directed edge X → Y is compelled in G if, for every DAG G′ equivalent to G, X → Y exists in G′. CPDAGs are also called maximally oriented graphs.21 Several orientation rules21 can be used to generate a CPDAG. The connections (edges) in a BN can be used to interpret causal relationships between nodes.8
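The skeleton/v-structure characterization of equivalence can be stated very compactly in code. The sketch below (Python, with invented toy graphs) tests whether two DAGs, each given as a set of directed (parent, child) edges, are equivalent by comparing skeletons and v-structures.

```python
def skeleton(edges):
    """Undirected version of the edge set."""
    return {frozenset(e) for e in edges}

def v_structures(edges):
    """Pairs (unordered parents, common child) with non-adjacent parents."""
    skel = skeleton(edges)
    return {(frozenset((a, b)), w)
            for (a, w) in edges for (b, w2) in edges
            if w == w2 and a != b and frozenset((a, b)) not in skel}

def equivalent(g1, g2):
    return skeleton(g1) == skeleton(g2) and v_structures(g1) == v_structures(g2)

g1 = {("X", "Y"), ("Y", "Z")}   # X -> Y -> Z
g2 = {("Y", "X"), ("Y", "Z")}   # X <- Y -> Z: same skeleton, no v-structure
g3 = {("X", "Y"), ("Z", "Y")}   # X -> Y <- Z: same skeleton, one v-structure
print(equivalent(g1, g2), equivalent(g1, g3))   # True False
```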
In the first phase, G is initiated as a fully connected undirected graph. For example, suppose we have a simple DAG with only five nodes (Fig. 1a); a complete graph is produced (Fig. 1b) in the first step of our algorithm. Then, iterative CI tests are performed for each connected node pair given a node subset S taken from the neighbor nodes of the connected pair. Under the DAG-faithfulness assumption, CI tests can distinguish correlated from uncorrelated node pairs, and direct from indirect correlations. In this procedure, we used depD(X,Y|S) as a measure of the strength of the conditional dependence between X and Y given S with respect to D. To decide whether I(X,Y|S) holds, depD(X,Y|S) is computed as a partial correlation coefficient when D is continuous and tested at the significance level ε. In our algorithm, the partial correlation coefficient calculation follows the method previously used17,22 and is described in Appendix C.
Fig. 1 Given relations in the original DAG (a), the LPC-algorithm starts with a complete graph (b), then creates undirected graphs after 0-order CI tests (c), 1st-order CI tests (d), and 2nd-order CI tests (e) in the first phase, and infers directions (f)–(h) using the orientation rules in the second phase. |
For a given connected node pair X, Y in Fig. 1b, the conditioning set is taken from the neighbors of X or Y, i.e., any subset S ⊆ Adjacencies(G,X)\{Y}. The size of S (|S|) is the order of the CI test and is controlled by l in our algorithm. Taking the pair T and Y as an example, the 0-order CI test depD(T,Y|Ø) is performed when l = 0. The 1st-order CI tests depD(T,Y|X), depD(T,Y|W) and depD(T,Y|Z) are performed when l = 1. For 2nd-order CI tests (l = 2), depD(T,Y|W,Z), depD(T,Y|W,X) and depD(T,Y|X,Z) are performed. All the connected node pairs in G are checked for a given l (lines 3–12 of Table 2). The order of the CI tests is iteratively increased from 0 to the maximal order k. In this procedure, if a connected node pair T and Y are conditionally independent given S, then the edge between T and Y is removed and S is recorded in Sepset, which is used to store the conditioning sets. For our example in Fig. 1, suppose we set k = 2. From the d-separations given in Fig. 1a, the edge X–W can be broken by the 0-order tests because I(X,W|Ø). No link can be broken by the 1st-order CI tests and thus Fig. 1d is unchanged. {X,Z} blocks each path between T and W and each path between T and Y, that is, I(T,W|{X,Z}) and I(T,Y|{X,Z}). Thus, T–W and T–Y are removed after the 2nd-order CI tests (Fig. 1e), and the conditioning sets are saved (i.e., Sepset(T,W) = {X,Z} and Sepset(T,Y) = {X,Z}). This procedure is repeated until the order of the CI tests l reaches k or no more conditioning sets can be found (lines 2–14 of Table 2). The highest order of CI tests is limited by the user-specified k, thus avoiding an exponential increase in the number of CI tests with the number of neighbors. After the first phase, the skeleton of a PDAG (Fig. 1e in our example) is generated, and each remaining edge is assigned a weight equal to the minimal conditional dependence value calculated for that node pair over all CI tests.
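The following Python sketch illustrates this skeleton phase under the assumptions above (Gaussian data, conditional independence judged by partial correlation and Fisher's Z as in Appendix C). It is a simplified illustration rather than the authors' implementation: partial correlations are taken from the inverse of a sub-correlation matrix instead of the recursion of eqn (1), edge weights are omitted, and all function and parameter names are invented.

```python
from itertools import combinations
import numpy as np
from scipy import stats

def ci_test(corr, n, i, j, S, alpha=0.01):
    """Accept H0: rho_{i,j|S} = 0 using Fisher's Z (True => treat as independent)."""
    if len(S) == 0:
        r = corr[i, j]
    else:
        idx = [i, j] + list(S)
        prec = np.linalg.pinv(corr[np.ix_(idx, idx)])   # partial corr. via inversion
        r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])
    r = np.clip(r, -0.9999, 0.9999)
    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - len(S) - 3)
    return abs(z) < stats.norm.ppf(1 - alpha / 2)

def lpc_skeleton(data, k=2, alpha=0.01):
    """First phase: remove edges using CI tests of order 0..k only."""
    n, p = data.shape
    corr = np.corrcoef(data, rowvar=False)
    adj = {i: set(range(p)) - {i} for i in range(p)}    # start from a complete graph
    sepset = {}
    for l in range(k + 1):                              # increase the order from 0 to k
        for i in range(p):
            for j in list(adj[i]):
                if j < i:
                    continue
                for S in combinations(adj[i] - {j}, l): # conditioning sets of size l
                    if ci_test(corr, n, i, j, S, alpha):
                        adj[i].discard(j); adj[j].discard(i)
                        sepset[(i, j)] = set(S)
                        break
    return adj, sepset

# Example use: adj, sepset = lpc_skeleton(np.random.randn(200, 20), k=2)
```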
In the second phase, orientation rules are applied to orient the graph skeleton. First, the v-structures are determined (lines 15–19 of Table 2): for a triple of nodes X, Y and Z, if X and Y are connected, Y and Z are connected, X and Z are not connected, and Y ∉ Sepset(X,Z), then we can infer the directionality X → Y ← Z (any one of the three alternatives X → Y → Z, X ← Y ← Z and X ← Y → Z would lead to I(X,Z|Y) and Y ∈ Sepset(X,Z), and thus cause a contradiction). In our example, we have I(X,W|Ø), Z ∉ Ø, min(|Adjacencies(G,X)\{Z}|, |Adjacencies(G,Z)\{X}|) = 2 and min(|Adjacencies(G,W)\{Z}|, |Adjacencies(G,Z)\{W}|) = 1, thus X → Z and W → Z can be inferred according to the v-structure rule in our algorithm. Similarly, X → Y and W → Y can be inferred. The other four orientation rules are given in lines 21–24 of the LPC-algorithm (Table 2). Next, the orientation rules (R1–R4) are applied repeatedly (lines 20–25 of Table 2) to determine the directions of undirected edges until no more edges can be oriented. The basic idea is to orient the remaining undirected edges as far as the DAG assumption allows. The details can be found in Appendix A (lines 21–24 of Table 2). According to orientation rule 1 (line 21 of Table 2), Z → T can be determined (Fig. 1g) because W → Z and Z–T, while W and T are not connected and |Adjacencies(G,T)\{Z}| = 1 (≤ k). According to orientation rule 2 (line 22 of Table 2), since X → Z and Z → T and min(|Adjacencies(G,X)\{T}|, |Adjacencies(G,T)\{X}|) = 1 (≤ k), we can infer X → T. No other rules can be applied in this case. Fig. 1h is the final structure inferred by the LPC-algorithm.
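A condensed Python sketch of this orientation phase is shown below. It takes the adjacency structure and separation sets produced by the skeleton sketch above, orients v-structures, and then applies one representative propagation rule (corresponding to rule R1); the neighbor-count check against k and the remaining rules R2–R4 of Table 2 are omitted for brevity, so this is only an illustration of the idea.

```python
def orient(adj, sepset):
    """Return a set of directed edges (a, b), meaning a -> b."""
    directed = set()
    # v-structures: X - Z - Y with X, Y non-adjacent and Z not in Sepset(X, Y).
    for z in adj:
        for x in adj[z]:
            for y in adj[z]:
                if x < y and y not in adj[x] and \
                   z not in sepset.get((min(x, y), max(x, y)), set()):
                    directed.add((x, z))
                    directed.add((y, z))
    # R1: if a -> b, b - c is still undirected and a, c are non-adjacent, orient b -> c.
    changed = True
    while changed:
        changed = False
        for (a, b) in list(directed):
            for c in adj[b]:
                if c != a and c not in adj[a] and \
                   (b, c) not in directed and (c, b) not in directed:
                    directed.add((b, c))
                    changed = True
    return directed

# Example use, continuing from lpc_skeleton: directed = orient(adj, sepset)
```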
In the second phase, the key point of the LPC-algorithm is that the neighbor numbers of each linked node pair are checked before applying each orientation rule, i.e., for the node pair X and Y, min(|Adjacencies(G,X)\{Y}|, |Adjacencies(G,Y)\{X}|) ≤ k must be satisfied. This is a non-trivial step that must be completed before applying the orientation rules in the second phase. The goal of this step is to ensure that the orientation rules in the second phase remain correct when only low-order (0–k) CI tests are performed in the first phase. Theoretical proofs of soundness are provided in Appendix B. Thus, we can extend the use of the classical constraint-based algorithm to datasets with large numbers of variables because only polynomial runtime is required. The number of CI tests in the first phase is bounded by O((n⁴ − 4n³ + 7n² − 4)/4) in the worst case when k = 2, and time efficiency is maintained. This is particularly important for microarray data, where the number of genes is much higher than the number of samples, because conditional dependencies are generally hard to estimate from limited samples at high orders.17,22
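For illustration, with n = 100 nodes and k = 2 the bound above evaluates to (100⁴ − 4·100³ + 7·100² − 4)/4 ≈ 2.4 × 10⁷ CI tests, which is easily tractable, whereas the number of tests in the full PC-algorithm can, in the worst case, grow exponentially with the size of the largest neighborhood.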
Because only low-order CI tests are performed, some false edges may be retained in the skeleton and some directions are missed. This is the price that the LPC-algorithm pays to reduce runtime. Note that if k is increased to n − 2, the LPC-algorithm is equivalent to the classical PC-algorithm. Thus, the LPC-algorithm can be viewed as a special case of the PC-algorithm.
For fairness of directed graph comparisons under BN frameworks, we used DAGs as the network structures. The conclusions for the evaluation of undirected graphs are not affected if other structures are chosen (such as scale-free networks—see results in the ESI†). We generated 10 DAGs G with 100 nodes, which have different topologies but a fixed number of 500 edges. The maximal degree of a node in these DAGs is 10. For each G, seven different datasets with different sample sizes (10, 20, 50, 100, 200, 500 and 1000) were generated separately using the above formula. To further evaluate the performance of our algorithm on other network structures, we also generated 10 scale-free networks with feedback loops. The test results for the scale-free networks are presented in the ESI†.
Fig. 2 Performance comparisons between the methods. For (a) and (b), the average AUC(PvsR)s and the numbers of true positives at a fixed 20 false positives are plotted for LPC, GGM, BN, RN, PC and ARACNE at 7 different sample sizes. For (c)–(f), the prediction errors of LPC, PC and BN are compared; shown are the average errors over 10 datasets at the 7 sample sizes: (c) total SHDs; (d) missing edges; (e) extra edges; (f) reversed edges or missing directions.
From the AUC(PvsR) evaluation (Fig. 2a), the performances of LPC and PC were better than those of the other methods at sample sizes of 100 and 1000, while GGM outperformed all other algorithms at the other sample sizes. The average AUC(PvsR)s were low when sample sizes were only 10, 20 and 50, meaning that very few TPs were captured in such cases. In terms of AUC(PvsR)s, the performances of LPC, PC and BN improved steadily with increasing sample size, whereas the AUC(PvsR)s of RN stabilized and those of GGM decreased at sample sizes above 200. This indicates that simple graphical models (such as RNs) can detect connections well, even better than the more sophisticated LPC and PC at very small sample sizes, but their performance improves relatively little as sample sizes increase. GGM achieved the best overall performance in the AUC(PvsR) evaluation. This can be attributed to the small-sample stable estimation procedure used in GGM for genomic data with small sample sizes.3
From the TP for fixed FP tests (Fig. 2b), the LPC-algorithm was superior to RN, GGM, ARACNE and BN at almost all sample sizes and was also slightly better than PC. Overall, LPC and PC outperformed RN, ARACNE and BN in this test. For ARACNE, the major reason for its poor performance was its use of non-linear similarity measures, which were less precise for these simulated data. This is also corroborated by a previous study.23
In terms of SHD metrics, the LPC-algorithm performed similarly to the PC-algorithm and much better than BN. LPC outperformed PC for sample sizes greater than or equal to 100, while it performed worse than PC for sample sizes smaller than 100 (Fig. 2c). The major prediction errors for both methods came from missing edges, the numbers of which varied from ∼480 to ∼220 across sample sizes. This indicates that some associations are difficult to detect by partial correlation calculations. The LPC-algorithm produced more spurious edges than the PC-algorithm (Fig. 2e) and fewer missing edges (Fig. 2d), presumably because higher-order CI tests were not performed in LPC. PC is slightly better than LPC in direction prediction (Fig. 2f), as more directions are missed when only limited-order CI tests are performed, as in LPC. Overall, however, the prediction accuracy of LPC was similar to that of PC (Fig. 2c).
In both tests described above, the BN learning method performed poorly compared to the other methods, as most edges were not recovered by this method (Fig. 2d and e). Apart from the reasons mentioned above, we presume that the implementation of the BN learning method is very restrictive for some data types because it is based on the multivariate Gaussian assumption.25,26 Scoring and searching methods for continuous data are further stumbling blocks for the BN community because they are computationally expensive and lack theoretical guarantees of causal correctness.27
Fig. 3 The average numbers of CI tests performed in the LPC-algorithm over 10 independent DAGs when the order k is varied from 0 to 9. The average numbers and standard deviations (error bars) are shown in (a). The average SHDs and their standard deviations from the LPC-algorithm versus the order k are plotted in (b). Each sample size (SS) is depicted by one color curve (see legend) in these two sub-figures. |
To confirm the time efficiency of our LPC-algorithm, runtime comparisons were performed on a 2.33 GHz Pentium Xeon machine with 4 GB RAM running Windows XP. The average processor times for estimating the CPDAG at different sample sizes are given in Table 1. For every sample size, the LPC-algorithm ran faster than PC on average, and the difference increased at higher sample sizes. The explanation for this is that, relative to PC, the number of CI tests in our algorithm is reduced at higher sample sizes (Fig. 3a and Table 1), whereas the PC-algorithm has no limit on k. We do not list the runtime for the BN learning method because it is much slower than LPC and PC; the BN method required more than 2 days to complete the same tests for each sample size.
Table 1 Average processor times (standard deviations in parentheses) for estimating the CPDAG at different sample sizes

Sample size | LPC (k = 2) | PC | Maximal k in PC
---|---|---|---
10 | 1.11(0.38) | 1.39(0.07) | 4 |
20 | 1.09(0.41) | 1.41(0.06) | 4 |
50 | 1.23(0.46) | 1.41(0.07) | 6 |
100 | 1.46(0.45) | 1.78(0.65) | 9 |
200 | 2.74(0.50) | 11.24(15.11) | 10 |
500 | 36.7(38.31) | 138.99(106.99) | 12 |
1000 | 641.10(757.97) | 1430.92(1380.84) | 14 |
Notice that this test was based on moderate-sized networks of 100 nodes each. Many more CI tests would be spared by our LPC-algorithm compared to the PC-algorithm when networks are scaled up to several thousand nodes. However, due to the computational complexity, it was not feasible to perform these tests using the PC-algorithm (or set k values high in the LPC-algorithm) for networks having over 1000 nodes.
We also examined the effect on performance of the limited (0–k) order CI tests in LPC. We compared average SHD measurements at different orders, i.e., k was varied from 0 to 9 (Fig. 3b). From Fig. 3b, we can see that the average SHDs decreased considerably after the initial low-order CI tests (as k was varied from 0 to at most 4). For SS = 10, 20, 50 and 100, the SHD remained constant for k > 2. Even for the sample size of 1000, the SHD did not improve further for k > 4. This shows that the major performance improvement of LPC comes from the low-order CI tests: SHD values decreased with increasing order, but most of the improvement occurred between k = 1 and 3, and little was gained by increasing to higher-order (k > 4) CI tests at any sample size. Therefore, limiting our algorithm to order 2 or 3 is justified on the grounds of greater efficiency.
We used the E. coli K12 transcriptional network compiled in the RegulonDB database version 6.3 (ref. 28) as the ‘true’ network, from which we derived a directed graph of 3071 interactions. We then compared our predicted gene interactions to the true network. All 16884 gene pair edges resulting from LPC analysis were oriented (see results in the ESI†). Importantly, only 2.9% of gene pairs were directed from non-TF to TF genes. In other words, 97.1% of all predicted relationships were directed from TF genes to other genes, including other TF genes, as expected. Among all the predicted gene interactions, 208 are validated by known regulatory relations listed in RegulonDB. Of these, the direction of regulation from TF to known target gene was predicted correctly in 199 cases and incorrectly in only 9 cases.
We compared the performance of LPC with CLR, a state-of-the-art method used for gene network reconstruction that performed well in a large-scale benchmark evaluation.29 When we selected the same number of gene pairs (16884) from CLR analysis, based on top-ranking, 209 true interactions were recovered. Therefore, the performance of LPC (208 true interactions in our case) was comparable to that of CLR in terms of undirected graph prediction. However, CLR can only predict an undirected network while our LPC-algorithm can also infer the direction of edges connecting nodes (genes). We did not compare our results to those of two recent studies30,31 using the same E. coli dataset because these previous studies used (semi-) supervised learning methods to infer TF–gene regulations. In other words, additional prior data were used to train the prediction models, making comparisons with our method of unsupervised learning unfair. It is important to note that the level of prior knowledge for E. coli is unmatched in most other species, which would severely handicap supervised approaches in more complex or less-studied species.
As part of our analysis of publicly available E. coli transcriptome data, a sub-network consisting of the 50 most significant gene interactions predicted by our algorithm was generated (Fig. 4). Importantly, the majority of predicted regulatory relationships have been demonstrated experimentally in the past, as documented in RegulonDB (blue arrows), while others are not included in the current version of RegulonDB (marked in grey in Fig. 4). However, additional published data are consistent with regulatory relationships between some of the latter: nac, glnK and amtB are involved in responses to nitrogen starvation;32–34 and FliA is a sigma factor that is necessary for transcription of class 3 flagellar genes, such as flgK.35 Furthermore, the sigma factor FecI and the membrane proteins ExbB and ExbD act in coordination for iron uptake.36,37 Thus, there is good agreement between this subset of our LPC-predicted GRN and previous experimental data that established gene interactions in E. coli. More importantly, novel gene regulatory relations predicted by our LPC algorithm should guide experimental work to confirm and expand GRNs, not only in E. coli, but in the many other organisms for which little experimental information beyond genome and transcriptome data is available.
Fig. 4 The top 50 transcriptional regulatory relationships inferred by LPC. Directed edges that have experimental support in RegulonDB release 6.3 are indicated by blue arrows, edges that are not yet supported by data in RegulonDB are depicted in grey, and known edges with incorrectly predicted directions are depicted as undirected blue lines. TFs are marked by red circles and regulated genes by green circles. Line width is proportional to the Fisher's Z-transformed partial correlation value returned by LPC. |
One reason for the apparent success of the LPC algorithm in reconstructing E. coli genetic networks is that GRNs are sparsely connected in most cases,38 i.e., there are far fewer directly connected genes for each gene than the total number of genes minus one. Therefore, the LPC-algorithm can infer most causal relationships in GRNs based on low-order CI tests.
All methods that attempt to reconstruct GRNs based on transcriptome data are subject to false positive and false negative errors arising from biological and computational sources. For example, the regulatory link between a TF and its target genes, whether it is positive or negative, will be missed (false negative) if transcriptional regulation of the TF is not the primary level of control for that gene/protein.39 Thus, if the TF gene is constitutively expressed and post-translational modification of the TF protein is the primary means by which TF activity is regulated, then a correlation between transcript levels of the TF gene and its target gene may not exist, and a regulatory link between the two genes will never be established, at least in wild-type cells. This may be less often the case in prokaryotes, where TF genes are often part of the operons that they control, than in eukaryotes, where they are not. Genetic approaches, especially the use of mutants defective in specific TF genes, could help to overcome this problem. Use of transcriptome data from defined TF mutants in our LPC approach should help to reveal regulatory relationships that are hidden in wild-type data.
False positive inferences of regulatory relationships can arise when co-expressed genes are confused as co-regulated. Consider, for instance, two TFs involved in distinct signaling pathways with different roles, different environmental stimulation, and different target genes. If, in a given set of experiments the target genes have similar expression profiles, they will be falsely considered as co-regulated (false positive). Moreover, either of the two TFs could be mistaken as the common regulator of all the target genes, based on the limited experimental data, leading to a false inference of transcriptional regulation.40 To avoid such problems, more informative experiments should be selected,40 or alternate data sources such as promoter sequence information or ChIP-on-chip data should be incorporated with microarray data to enhance GRN inferences.
False positive and false negative associations between genes can also arise from the computational methods and parameters employed. Obviously, the threshold that is set for CI tests between two genes will have a major impact on the predicted GRN, with a higher threshold reducing the number of false positives and increasing the number of false negatives. This is a problem for all algorithms that rely on statistical treatment of data, including the LPC-algorithm. As known regulatory relationships accumulate alongside gene expression data, it should become possible to determine optimal threshold values more systematically. Similar to the PC-algorithm, another limitation of the LPC-algorithm is that false negatives arising from the statistical tests cannot be corrected at a later stage and will lead to errors in the final results.
Another source of errors in predicted GRNs is the set of statistical assumptions that are made, which may not correctly reflect the nature of the regulatory relationship between some genes. For example, the DAG assumption for transcriptional regulation does not accommodate more complex relationships, such as feedback loops. This is, in fact, a shortcoming of all static BN learning methods. Nevertheless, of the two phases of the LPC-algorithm, the undirected graph returned from the first phase still makes sense even in the presence of feedback loops. Previous studies41 showed that conditional independencies remain sound even in directed cyclic graphs (DCGs) under linear models. Thus, the undirected graph from LPC would still identify regulatory connections between genes. The simulation tests over scale-free networks with feedback loops also support this notion (in the ESI†). The second phase of our algorithm may wrongly assign, or fail to assign, directions to edges when the DAG assumption is broken. However, due to the sparseness of connections in a GRN, our algorithm is still able to recover causal relationships when local network structures satisfy the assumptions of BNs. Furthermore, if sufficient time-series microarray data are available, the LPC-algorithm can be extended to dynamic BN models42 to overcome the feedback loop problems encountered under the DAG assumption. As GRN simulation technology continues to mature (such as inclusion of the interampatteness property to explain feedback loops43) and suitable experimental time-series microarray data become available, it will be interesting to test the ability of our approach to model synthetic and real GRNs containing feedback loops.
In the other direction, suppose X and Y are not adjacent. Either there is no path from X to Y or there is no path from Y to X, for otherwise we would have a cycle. Without loss of generality, assume there is no path from Y to X. We will show that X and Y are d-separated by the set Adjacencies(G,Y)\{X} consisting of the neighbors of Y. Consider any path π between X and Y that does not pass through any node in Parents(G,Y). There must be a node Z with converging arrows along π, for otherwise π would be a path from Y to X. Consider the node Z that is closest to X on π; Z cannot be an ancestor or parent of any node in Parents(G,Y), and Z ∉ Parents(G,Y), because otherwise we would have a cycle. Clearly, any path between X and Y is blocked by Parents(G,Y) ⊆ Adjacencies(G,Y)\{X} according to the d-separation definition, which completes the proof.∴
In the second phase of the LPC-algorithm, for each orientation rule, the numbers of adjacencies of the two nodes in each node pair (denoted by X and Y) are examined (in line 16 and lines 21–24 of Table 2). It follows from Lemma 2 and min(|Adjacencies(G,X)\{Y}|, |Adjacencies(G,Y)\{X}|) ≤ k that, without loss of generality, X and Y are also adjacent in the CPDAG, because otherwise the edge between X and Y would have been removed by the 0–k order CI tests in the first phase. The conditions of all the orientation rules in the second phase are therefore still satisfied, and it follows from Theorem 2 in ref. 21 that all the orientation rules remain sound. Hence, all the directed edges inferred from the orientation phase in LPC are also present in the CPDAG.∴
ρX,Y|S = (ρX,Y|S\{Z} − ρX,Z|S\{Z}·ρY,Z|S\{Z}) / √[(1 − ρ²X,Z|S\{Z})(1 − ρ²Y,Z|S\{Z})]   (1)
The zeroth-order partial correlation ρX,Y|Ø is defined to be the regular Pearson correlation coefficient ρX,Y. In fact, the k-th order partial correlation (i.e., with |S| = k) can easily be computed from three (k − 1)th order partial correlations. Thus the sample partial correlation ρX,Y|S can be calculated recursively by using eqn (1), for any Z ∈ S.
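A direct Python sketch of this recursion is given below; it computes ρx,y|S from a plain Pearson correlation matrix by repeatedly removing one conditioning variable, exactly as in eqn (1). The function name and interface are illustrative.

```python
import numpy as np

def pcor(corr, x, y, S):
    """Partial correlation rho_{x,y|S}, computed recursively as in eqn (1)."""
    if not S:
        return corr[x, y]                 # zeroth order: plain Pearson correlation
    z, rest = S[0], S[1:]                 # drop one conditioning variable Z
    r_xy = pcor(corr, x, y, rest)
    r_xz = pcor(corr, x, z, rest)
    r_yz = pcor(corr, y, z, rest)
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# Example: pcor(np.corrcoef(data, rowvar=False), 0, 1, (2, 3)) for a data matrix `data`.
```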
For testing whether a partial correlation is zero or not, we apply Fisher's Z-transformation of the partial correlation:
Z(X,Y|S) = ½ log[(1 + ρ̂X,Y|S)/(1 − ρ̂X,Y|S)]   (2)
Classical decision theory yields the following rule when using significance level α. Reject the null-hypothesis H0(X,Y|S): ρX,Y|S = 0 against the two-sided alternative HA(X,Y|S): ρX,Y|S ≠ 0 if
√(n − |S| − 3)·|Z(X,Y|S)| > Φ⁻¹(1 − α/2),   (3)

where n is the sample size and Φ(·) denotes the cumulative distribution function of the standard normal distribution N(0,1).
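For completeness, the test in eqns (2) and (3) can be written in a few lines of Python; this mirrors the CI test used in the sketches above (the function name and default α are illustrative).

```python
import numpy as np
from scipy.stats import norm

def reject_independence(r, n, s_size, alpha=0.01):
    """True if H0: rho_{X,Y|S} = 0 is rejected for sample size n and |S| = s_size."""
    z = 0.5 * np.log((1 + r) / (1 - r))                                  # eqn (2)
    return np.sqrt(n - s_size - 3) * abs(z) > norm.ppf(1 - alpha / 2)    # eqn (3)
```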
Footnote
† Electronic supplementary information (ESI) available: Additional information. The software, datasets and three supplementary files can be downloaded from http://bioinfo.noble.org/manuscript-support/lpc/. See DOI: 10.1039/b917571g