Inferring cellular regulatory networks with Bayesian model averaging for linear regression ( BMALR ) †

Bayesian network and linear regression methods have been widely applied to reconstruct cellular regulatory networks. In this work, we propose a Bayesian model averaging for linear regression (BMALR) method to infer molecular interactions in biological systems. This method uses a new closed form solution to compute the posterior probabilities of the edges from regulators to the target gene within a hybrid framework of Bayesian model averaging and linear regression methods. We have assessed the performance of BMALR by benchmarking on both in silico DREAM datasets and real experimental datasets. The results show that BMALR achieves both high prediction accuracy and high computational efficiency across different benchmarks. A pre-processing of the datasets with the log transformation can further improve the performance of BMALR, leading to a new top overall performance. In addition, BMALR can achieve robust high performance in community predictions when it is combined with other competing methods. The proposed method BMALR is competitive compared to the existing network inference methods. Therefore, BMALR will be useful to infer regulatory interactions in biological networks. A free open source software tool for the BMALR algorithm is available at https://sites.google.com/site/bmalr4netinfer/.


Introduction
With advances of high-throughput experimental technologies, plenty of network inference methods have been developed to identify regulatory interactions in cellular networks from quantitative experimental data.These network inference methods are becoming increasingly important in the field of systems biology to address many biological problems.Examples of network inference approaches include Bayesian networks, [1][2][3][4][5] mutual information, [6][7][8] linear regression, [9][10][11] ordinary differential equations 12,13 and the statistical test. 14,15Among these approaches, Bayesian networks have become popular due to the following reasons: (1) Bayesian networks use the probability theory, which is suitable for dealing with noise in biological data.(2) The prior knowledge of molecular interactions from the literature or curated databases can be well encoded in the prior distribution structure of Bayesian networks.In addition, linear regression based methods are also widely used in biological network inference due to their high computational efficiency.
A Bayesian network is a graphical model that describes probabilistic relationships between network variables.Such relationships are encoded within the structure of a directed acyclic graph.To infer the interactions of network variables, one strategy is to find a directed acyclic graph that most likely generates observed experimental data, which are assumed to be a steady data set for static Bayesian networks.This is performed by evaluating each possible graph with a score-based approach in the Bayesian context and subsequently search for the graph that maximizes the score. 16The score function is defined with two common probabilistic models: linear Gaussian models and multinomial models. 3However, it is a computationally laborious problem to evaluate all possible graphs that correspond to all possible interactions and choose the best scoring graph. 17,18To address this problem, heuristic search methods (e.g.: the greedyhill climbing approach) were proposed. 5On the other hand, given limited amounts of data, a variety of graph structures may describe the data similarly well.Therefore, a network-averaging strategy was proposed to find the consensus interactions present in most of the high-scoring graphs. 5,19o improve the computational efficiency of network reconstruction, a decomposition technique is applied in regressionbased methods.The inference of regulatory interactions targeting all N genes is decomposed into N independent sub-problems by inferring the interactions from regulators to a single target gene.Linear regression models have been proposed to solve these sub-problems.1][22] Another widely-used linear regression algorithm is Lasso (Least Absolute Shrinkage and Selection Operator), 23 which uses L1-norm regularization to efficiently select a parsimonious set of regulators for each target gene.Different Lasso derived methods have been developed, which include LARS, 24 GLASSO 10 and Inferelator. 25,26he combination of a Bayesian network and linear regression could be a good strategy for the inference of cellular networks as it can take advantage of the strengths from both methods.For example, Rogers et al. 27 proposed a Bayesian regression method to reconstruct regulatory networks from gene expression data with a fast sparse Bayesian regression algorithm of Tipping and Faul. 49This method does not require the discretization of data and requires no setting of the heuristic threshold for the predicted interactions.9][30] These studies have shown the advantages of Bayesian network methods in incorporating prior knowledge and the computational efficiency of regression approaches.
In this work, we developed a new method that uses Bayesian model averaging for linear regression (BMALR) to infer cellular regulatory networks.In this method, we applied a new closed form solution to calculate the posterior probabilities of the edges from putative regulators to the target gene, which leads to high computational efficiency and high prediction accuracy.We assessed this new method by benchmarking with in silico datasets from the DREAM (Dialogue on Reverse Engineering Assessment and Methods) project [31][32][33][34] and the datasets of real experiments.The results indicate that our method BMALR is competitive in terms of prediction accuracy and computational efficiency when it is compared to the best existing methods.The log transformation of datasets can further improve the performance of BMALR.In addition, we evaluated the performance of community predictions 34 with and without BMALR on DREAM benchmarks.The community prediction methods with BMALR can achieve robust high performance, which suggests that BMALR has a complementary advantage in community predictions.

Methods
The proposed method BMALR integrates a Bayesian model averaging method with a linear regression approach.In general, BMALR implements linear regression of the data of the target node on all combinations of other nodes.The final score of the edge from a parent node to the target node is the sum of the posterior probability of the linear regression models that contain this edge.
In the following sections, we introduce more technical details about Bayesian model averaging and a new close form solution for calculating the likelihood of each local structure in a Gaussian Bayesian network.At the end, the datasets, performance metrics and network inference methods used for comparison are described.

Bayesian model averaging
To infer regulatory interactions in cellular networks, one way is to find the Bayesian network structure G that best explains the data.This is normally achieved by maximizing the likelihood of the observed dataset generated from the network structure G (maximum likelihood) or the posterior probability of the structure G given the observed data (maximum a posteriori).However, given a limited number of observed datasets, many Bayesian network models may explain the data almost equally well.It would be risky to make an inference on the interactions of network variables depending on a single optimized Bayesian network structure. 5,19nstead of searching for the best Bayesian network structure, Bayesian model averaging was proposed to find the edge features ( f ) that are present in most high-scoring Bayesian network structures. 5,19An edge feature ( f ) is the edge relation feature between network variables X i and X j in a Bayesian network structure (G).Such a feature can be quantified with the posterior probability of f: This probability shows our confidence in edge feature f given the observed dataset (D).If the Bayesian network structure G contains f, f (G) equals to 1, otherwise 0.
As the number of candidate network structures increases exponentially with the number of node variables, it is not feasible to exactly compute eqn (1) from all possible candidate networks.

Local model averaging for linear regression
In order to have an efficient and accurate estimate for posterior probability of the edge feature, the global network is decomposed into a set of local networks. 27,28Each local network is composed of one target variable and its possible parents (regulators).The local network can be reconstructed by model averaging with the local structures composed of all the possible parent sets of a target variable, 19 which is similar to the local learning approach based on the Markov blanket. 35Here we model the local structures with a weighted linear combination of the values of their parents: where X i is the value of variable i. w ji is a weight constant representing the influence of variable j on variable i.If w ji is zero, there is no edge from j to i in the regulatory network.If w ji is non-zero, j is one of the i's regulators (parents).e i denotes the noise.
The posterior probability of each edge is the sum of posterior probabilities of all the putative local structures containing the edge.This leads to the following approximation (in comparison with eqn (1)) for posterior probability of an edge feature f: where the node X i is the target of the edge feature f. S i is the set of all possible parent sets of X i .G Pa is a local structure that is composed of the edges from the nodes in Pa, a parent set of node X i .D Pa,X i denotes the data restricted to X i and the node variables in Pa.If the local structure G Pa contains f, f (G Pa ) is equal to 1, otherwise 0.
Based on the Bayes theorem, eqn (3) can be written as: where P(G Pa ) is the prior probability of structure G Pa .For simplicity, we can assume that cardinalities of parent sets are uniformly distributed, meaning P(G Pa ) 1. 19 Here, we set a restriction on the maximum cardinalities of parent sets (denoted as maxFanIn) in order to reduce the computation cost.This strategy has been applied before in other Bayesian learning approaches. 3,36In this paper, we set the default value of maxFanIn as 2. It is worth noting that the inference results are not sensitive to the value of maxFanIn, 29 which is shown in the Results and discussion section.

Probability in Gaussian Bayesian networks
To calculate P(D Pa,X i |G Pa ), the likelihood that the local structure G Pa generates the data D Pa,X i , we assume that the noise term in the linear regression models (e i in eqn ( 2)) follows multivariate normal distribution.As a result, variable i and its regulators Pa will be in multivariate normal distribution, which corresponds to a Gaussian Bayesian network. 37,38In Gaussian Bayesian networks, we can compute the likelihood that the local structure G Pa generates the data D Pa,X i as: 37,38 Assuming a normal-Wishart prior in the Gaussian Bayesian network, the probability density of data, r(D Pa,X i ) and r(D Pa ) can be approximated with the following formula.
where W is a set of variables (e.g.: for the calculation of r(D Pa ),

Datasets
We evaluated the proposed method with a variety of datasets: (1) the in silico benchmark data from DREAM4 ''in silico network challenge'' and DREAM 5 ''network inference challenge'', which are available at http://wiki.c2b2.columbia.edu/dream/index.php/Challenges.(2) The real experimental data obtained in single cell flow cytometry experiments, 4 which measured the expression level of 11 signaling molecules in the T-cell signaling network upon various interventions.

Performance metrics
To assess the performance of BMALR and previous network inference methods, we used a number of performance metrics, which include the area under the precision-recall (AUPR) and receiver operating characteristic (AUROC) curves, the F-score, 39 empirical p-values of AUPR and AUROC for the DREAM project benchmarks, 34 as well as the p-value of the one-sided Fisher's exact test. 40o compute AUPR and AUROC metrics, we counted true positive (TP), false positive (FP), true negative (TN), and false negative (FN) by comparing the inferred network with gold standard networks at a certain threshold setting.Accordingly, true positive rate (TPR, or recall), positive predictive value (PPV, or precision), false positive rate (FPR) are calculated.Then the values of AUPR and AUROC were calculated by creating the precision-recall curve (PR curve) and the receiver operating characteristic curve (ROC curve) at various threshold settings.
The F-score is defined as: The p-value of the one-sided Fisher's exact test is defined as: where N p is the number of edges in the gold standard network.N t is the maximum number of possible edges in the network.N t = n(n À 1) for a directed network with n nodes (excluding selfloop edges).N TP and N FP are the number of true positive and false positive edges in the inferred network, respectively.This p-value represents the probability to obtain no less than N TP true positive edges by randomly selecting Np edges.The p-value of the one-sided Fisher's exact test is useful for the evaluation of inference methods on small networks.

Log transformation of data
To test whether the preprocessing of datasets can improve the performance of BMALR, we applied log transformation to reduce the positive skew of the data so that the distributions of transformed datasets are similar to normal distributions.We first check the skewness of the variable X.If the distribution of variable X is positively skewed, log transformation is applied as the following: 41 a constant C is added to X, then an optimization procedure is applied to obtain an optimal C that makes the nonparametric skew of log(X + C) as close to zero as possible.

Network inference methods used for comparison
We compared the proposed BMALR method with 5 popular network inference methods: (1) GENIE3 42 based on feature selection with decision tree.GENIE3 was the winner of both DREAM4 and DREAM5 network inference challenges.(2) CLR 7 This journal is © The Royal Society of Chemistry 2014 based on mutual information with background correction.(3) ARACNE 43,44 based on mutual information with data processing inequality.(4) The PC-algorithm in the PCALG package.We implemented pcSelect in the PCALG package, 45,46 a simplified PC-algorithm with variable selection.(5) The regression based method, LARS (Least Angle Regression). 24(6) Bayesian model averaging method using Markov Chain Monte Carlo sampling (denoted as MCMC). 19,47We ran these tools with their default settings.

Results and discussion
Performance on DREAM4 in silico size 100 multifactorial sub-challenge The aim of DREAM4 in silico size 100 multifactorial sub-challenge is to infer the structures of five gold standard networks from given in silico gene expression datasets that are simulated by 100 multifactorial perturbations of all the genes simultaneously. 32,33The five gold standard networks have 100 components (genes) and they have 176, 249, 195, 211 and 193 gold standard links, respectively.The performance metrics of our method (BMALR) and other applied inference methods for this sub-challenge is given in Table 1.The results show that BMALR obtained the highest overall score among all the applied network inference methods.More specifically, BMALR achieved the highest AUPR and AUROC scores in four networks and the second-highest scores in one network.
The individual PR and ROC curves of the inference methods on each network are available in Fig. S1 and S2 (ESI †).The overall performance of BMALR is slightly better than GENIE3, which was the winner of this sub-challenge.Gaussian Bayesian models assume that the datasets of variables have normal distributions.However, if the datasets of the variables have highly skewed distributions, the assumption of normal distribution of the data would be invalid.To deal with the skewed data, we applied the log transformation for the datasets 41 and tested whether such a strategy could improve the performance of BMALR.As is shown in Table 1, a preprocessing of the datasets with the log transformation improves the performance of BMALR (16% increase of the overall score in this benchmark).

Performance on DREAM5 network inference sub-challenge
To evaluate the performance of BMALR method on predicting the interactions of large networks, we compared BMALR with other methods by applying them to the DREAM5 network inference sub-challenge with the dataset simulated from an in silico network, which has 1643 genes, 195 transcription factors (TFs) and 805 chips.The top 100 000 edge predictions are used for the DREAM5 evaluation. 34The In silico DREAM5 network model assumes that the mRNAs are directly translated into proteins without any further regulations.The simulated data for TFs' protein abundances are highly correlated with corresponding mRNA abundance data.Therefore, we use TFs' mRNA abundances as proxies for their protein abundances.
Table 2 shows the performance metrics of BMALR and other inference methods on the DREAM5 network inference sub-challenge.The results indicate that BMALR obtained the best performance with the highest AUPR score and the second highest AUROC score, as well as the best optimal F-score.The log transformation of datasets can further increase the performance of BMALR, leading to a best overall performance.
According to the individual PR and ROC curves of the inference methods in this benchmark (Fig. 1), BMALR tends to have higher precision than other methods at low recalls in the PR curve, while the ROC curve profile of BMALR is almost the same as GENIE3.In addition, it is worth noting that the computational cost of BMALR in this sub-challenge is much less than other competitive methods such as GENIE3, PCALG and ARACNE (Table 2).under 9 perturbation conditions, which were simultaneously measured with single cell flow cytometry. 4We first applied z-score normalization for each dataset, so that all the proteins have a mean of 0 and a standard deviation of 1.

In this section, we evaluated the performance of BMALR and other methods on the benchmark of T-cell signaling network with 11 components and 20 interactions. The experimental data consist of 11 phosphorylated T-cell signaling molecules
The results shown in Table 3 suggest that BMALR is among the top performing approaches, with the highest AUROC score.The optimal F-score of BMLAR is the same as that obtained with a Bayesian model averaging method using Markov Chain Monte Carlo sampling (MCMC). 19,47The individual PR and ROC curves of each network inference method on this benchmark are shown in Fig. 2.
Fig. 3 shows the networks predicted by the top 2 performing methods (BMALR and MCMC), which have the same number of edges as the gold standard network derived by Sachs et al. 4 The two predicted networks are very similar.They have only 2 different edges: the MCMC predicted network has a false edge pJNK -PKC that is not shown in the predicted network by BMALR and a true positive edge MEK -ERK is missed in the MCMC predicted network, while it is present in the predicted network by BMALR.Interestingly, most of the false positive

Similarity of predictions from different inference methods
Recent work has shown that different network inference methods provide different predicted interactions on the same regulatory system.To address this problem, it was proposed to make community predictions by integrating the results from multiple inference methods because limitations of different methods tend to be cancelled out. 33,34Therefore, it is important to investigate the similarity and difference of the predictions from different network inference methods. 15,34n order to compare the overlap and the difference of predicted interactions between our method and others, we compared the binary networks from each method at a certain cutoff, which is set to ensure that the binary networks generated from each method have the same number of edges (4012 edges for the DREAM5 subchallenge) with the gold standard network.To quantify the similarity between the predictions of different methods, we performed single linkage cluster analysis 48 with a distance metric based on spearman correlations of the ranks of predicted edges from each method.As shown in Fig. S3 (ESI †), the interactions predicted by our method BMALR are most similar to those predicted by LARS and PCALG.This could be explained by the fact that linear regression is used in BMALR, PCALG and LARS.
To test whether BMALR could help to achieve higher performance by making community predictions with other individual network inference methods, we performed community predictions with combinations of every two individual methods for gold standard network 1 (NET1) of DREAM4 in silico size 100 multifactorial sub-challenge following the approach proposed by Marbach et al. 34 As shown in Fig. 4, community methods with BMALR can achieve robust high performance in the predicted networks.Similar high performance of community methods with BMALR is obtained with the application of other gold standard networks of DREAM4 sub-challenge (Fig. S4, ESI †).In all the cases, the highest performance predictions were obtained in the community predictions with the combination of BMALR plus other competition methods such as CLR, GENIE3 and ARACNE.These results suggest that BMALR is complementary to other inference methods and could be a good candidate method for community predictions.

Fan-in error analysis
We next checked whether BMALR is resistant to the fan-in error stemming from the difficulties in predicting combinatorial regulations. 33We performed indegree based network analysis on DREAM5 sub-challenge to study the robustness of BMALR and other methods to the fan-in error.Specifically, we investigated the overall performance of each inference method on predicting the regulatory input(s) of genes with one transcription factor (indegree = 1), two transcriptional factors (indegree = 2), etc.As is described in ref. 33, the prediction confidence for different regulatory inputs of genes is quantified with their ranks in the corresponding list of edge predictions (the first edge has a scaled prediction confidence of 100% and the last edge has a confidence of 0%).
The data shown in Fig. 5 indicate that most inference methods have high prediction confidence for the edge of genes with 1 transcription factor input.In general, the prediction confidence is reduced as the indegree of genes increases, but the prediction confidence of BMALR and GNEIE3 seems to be more robust on high indegrees.
Comparing BMALR with other Bayesian network based methods BMLAR uses the framework of Bayesian model averaging that integrates a linear regression approach.Therefore, we investigated  whether the combination of Bayesian model averaging with a linear regression method could improve prediction performance compared with other Bayesian network methods.We applied previously reported Bayesian network methods such as the PC-algorithm method (PCALG), 45,46 BNFinder 2 and the SBM method with the sparse Bayesian model 27,49 to the DREAM4 in silico size 100 multifactorial sub-challenge.The results shown in Table 4 indicate that BMLAR achieves the highest prediction accuracy over the other comparable Bayesian network methods.In addition, the computational efficiency of BMLAR is among the top method, with a similar magnitude of used computational time to the PC-algorithm method with pcSelect (feature selection version of PCALG).It is worth noting that both BMALR and pcSelect methods combine a Bayesian network with a linear regression method.The difference is that BMALR uses model averaging to calculate the posterior probability of each edge, while pcSelect uses the partial correlation based independence test to score each edge.
Similar to the method proposed by Geiger and Heckerman (referred as Geiger-Heckerman in the following), 37 BMALR calculates the probability of local linear regression models using Gaussian directed acyclic graphical (DAG) models.The major difference between BMALR and Geiger-Heckerman is that BMALR integrates the idea of model averaging to compute the edge posterior probabilities, which has been applied to infer the regulatory network from time series data by Hill et al. 29 However, Hill et al. computed the probability of local linear regression models using Bayesian linear models with interaction terms, which is different from the Gaussian DAG models used in BMALR.Therefore, BMALR is a hybrid method for network inference, which is based on the method frameworks developed by Hill et al. and Geiger-Heckerman.

The influence of the maximum number of parents (maxFanIn) on the performance of BMALR
To analyze whether the performance of BMALR depends on the setting of the maximum number of parents (maxFanIn), we applied BMALR to the benchmarks of DREAM4 and DREAM5 in silico sub-challenges, and the T-cell signaling network with different settings of the maximum number of parents (maxFanIn was set between 2-7).The results shown in Tables S1-S3 (ESI †) suggest that the performance metrics of BMALR are robust to the maxFanIn parameter (when maxFanIn Z 2).Biological networks are usually sparsely connected, in which most network nodes have a few upstream regulatory links. 50Therefore, the probability of local structures with large numbers of parents is relatively small, which might explain the observed insensitivity of the BMALR method to the maxFanIn parameter.
W)  is the l W -by-l W Pearson correlation matrix of D W . M is the sample size.The derivation of above equations and other technical details for calculating probability density of data are provided in the ESI † text.

Fig. 1
Fig. 1 PR and ROC curves of different methods for DREAM5 in silico sub-challenge.BMALR* denotes the results of BMALR with the log transformation of the datasets.(A) Precision-recall (PR) and (B) receiver operating characteristic (ROC) curves.

Fig. 2
Fig. 2 PR curve and ROC curve of the T-cell signaling network.(A) PR curves (B) ROC curves.

Fig. 3
Fig. 3 The predicted T-cell signaling networks with top 20 edges by (A) BMALR and (B) MCMC network inference method, respectively.

Fig. 4
Fig.4The performance (area under precision-recall curve, AUPR) of individual methods and community methods with combinations of every two individual methods for gold standard network 1 (NET1) of DREAM4 in silico size 100 multifactorial sub-challenge.The dashed horizontal line denotes the highest performance level in all the methods.

Table 1
Performance metrics of inference methods on DREAM4 in silico size 100 multifactorial sub-challenge.The bold numbers indicate the best value in each performance metric.BMALR* denotes the results of BMALR with the log transformation of the datasets

Table 2
The performance metrics of inference methods on DREAM5 network inference sub-challenge with the in silico dataset.The bold numbers indicate the best values among all methods.BMALR* denotes the results of BMALR with the log transformation of the datasets

Table 3
The performance metrics of inference methods on the benchmark of the T-cell signaling network.The bold numbers indicate the best values among all methods