Andreas Tjärnberg,‡ᵃᵇ Torbjörn E. M. Nordling,‡*ᵃᶜᵈ Matthew Studham,ᵃ Sven Nelanderᶜᵈ and Erik L. L. Sonnhammerᵃᵇᵉ
ᵃStockholm Bioinformatics Centre, Science for Life Laboratory, Box 1031, 17121 Solna, Sweden. E-mail: tn@nordron.com
ᵇDepartment of Biochemistry and Biophysics, Stockholm University, Sweden
ᶜDepartment of Immunology, Genetics and Pathology, Uppsala University, Rudbeck laboratory, 75185 Uppsala, Sweden
ᵈScience for Life Laboratory, Uppsala University, Sweden
ᵉSwedish eScience Research Center, Sweden
First published on 22nd October 2014
Statistical regularisation methods such as LASSO and related L1 regularised regression methods are commonly used to construct models of gene regulatory networks. Although they can theoretically infer the correct network structure, they have been shown in practice to make errors, i.e. to leave out existing links and include non-existing links. We show that L1 regularisation methods typically produce a poor network model when the analysed data are ill-conditioned, i.e. when the gene expression data matrix has a high condition number, even if it contains enough information for correct network inference. However, for informative data, i.e. data with a signal to noise ratio high enough that existing links can be proven to exist, the correct structure of network models can be obtained where these methods fail, either by using least-squares regression and setting small parameters to zero, or by using robust network inference, a recent method taking the intersection of all non-rejectable models. Since available experimental data sets are generally ill-conditioned, we recommend checking the condition number of the data matrix to avoid this pitfall of L1 regularised inference, and also considering alternative methods.
Theoretically, LASSO has been shown to be able to recover the correct network under certain conditions, such as the Strong Irrepresentable Condition (SIC) and the Restricted Isometry Property (RIP).16–18 In a network inference context, these conditions concern the relation among observed vectors of expression changes. However, even results based on SIC only ensure that the LASSO estimator is sign consistent with a probability that goes to one as the number of samples goes to infinity. Some of the inferred links may thus not exist in reality, in particular for the low number of samples seen in biological data sets. In real applications, SIC is of little use because it cannot be calculated without knowing the true network. Even though the performance of L1-regularisation methods has been analysed rather extensively, we have not seen any article reporting that they fail for sufficiently informative data, which we show here.
In a number of cases, when reverse engineering algorithms have been applied to biological networks believed to have a well understood connectivity, networks with a different connectivity have been obtained. For instance, Lorenz et al.2 reported a mere 62% sensitivity and 69% precision, with 24% of the predicted regulatory interactions having the opposite sign, in the model of the Snf1 network in S. cerevisiae. Moreover, benchmarking studies, such as the Dialogue for Reverse Engineering Assessments and Methods (DREAM), have shown that GRN inference usually results in a large fraction of false positives, i.e. inferred links absent in the true network, and false negatives, i.e. missed links present in the true network.19,20 In recent years this has led researchers to complement expression data with other data types, such as binding data, ChIP-seq, and a priori information.21,22 Note that we here speak about the addition of other data types to guide the inference method and not the integration of other data types in the model. In the former case the degree of freedom of the model is kept fixed and the data are intended to constrain model parameters, while the degree of freedom in the latter case is increased. The use of these so-called multi-data-type genomic data sets makes it harder to assess the performance of inference methods compared to expression data alone. In particular, it is harder to know to what degree a link is supported by expression data versus a priori information. Even if the complete topology of the network is provided, e.g. from ChIP binding data, the signs (activation/repression) of the links still need to be inferred. Addition of other data does not fix the method per se. We therefore think that awareness of the pitfall of L1-regularisation methods that we report here is more essential than before.
A number of GRN inference benchmark studies23–25 have been published, spanning a wide range of methods and data sets. In general, the conclusion is that although they tend to perform better than random, all inference methods produce models that are far from correct. The dependency on the nature of the data is strong, as a method may do well in one benchmark but poorly in another. Selection of the regularisation coefficient, which determines the sparsity of the estimate, is a major issue because it must be correct for the estimated network model to be correct.26 Vinh et al.27 detail the difficulties of benchmarking, especially on small networks, where sparsity cannot be achieved to any large degree due to the network's small size. They show that methods for inference of GRNs do not construct any good networks with sufficient confidence and that the parameter settings of the algorithms are crucial to find a good estimate of the structure of the network. However, no method for optimising these crucial parameters is given. Jörnsten et al.28 show that the structural agreement between network models, inferred for the same biological system using bootstrapping based on measurements obtained on two different platforms, is only good for a narrow range of the regularisation coefficient. This makes it important to assess how the accuracy of different inference methods depends on data and system properties, which we here do for five methods.
Data sets generated in vivo for gold standard networks are rare for benchmark purposes due to a lack of knowledge about the interactions among the genes. An attempt has been made to create such a gold standard for benchmarking by recording an in vivo data set from a synthetically engineered five gene network in yeast, called IRMA.3 Penfold and Wild24 benchmarked time series algorithms in addition to steady-state algorithms and evaluated their performance on IRMA. They found that no method could retrieve the designed structure of IRMA from the data. The IRMA network was perturbed by single gene over-expression to trigger the response of the network, and the change in mRNA abundance was then measured when the system had reached steady-state, as well as in a time series sampled either every 10 or 20 minutes for up to 5 hours. For single gene perturbations there is no guarantee that the gene space is sufficiently excited to give informative data, i.e. that a sufficient variation in the response of the genes over the experiments is achieved.29 Another issue with gold standard networks is the definition of a link. The inference method and model formalism have to yield the same type of links as recorded in the gold standard in order for a comparison to be meaningful and fair. The five methods employed here infer so-called influences, while gold standard networks typically contain links corresponding to physical binding between molecules.23 Simulated data sets are thus still necessary for benchmarking, due to the lack of “real” data sets that are informative enough for accurate GRN inference and differences in the definition of a link. It is thus not possible to exhaustively demonstrate the pitfalls of L1-regularisation methods on real data, despite the multitude of data that exist. However, we have applied the studied inference methods to the in vivo data collected by Lorenz et al.,2 compared the inferred networks to two reference networks that can be seen as gold standards, and related the accuracies to the expected performance in our simulations based on the properties of the data.
In this study, we focus on analysing network and data properties that are important for the accuracy of GRN inference. In particular, the condition number of the network and response matrices, as well as the Signal to Noise Ratio (SNR), are examined. To this end, we generated a set of linear networks with essential properties similar to real biological GRNs. These were then used to generate both gene expression data sets that have properties similar to published in vivo data and data sets that are informative enough for inference of the correct network. This was done to mimic real data sets, while varying the properties and utilising the advantage of knowing the true network. We restrict ourselves to linear models, because this is sufficient to demonstrate the presented pitfall of L1-regularisation methods. Considering that the class of linear models is a subset of the class of nonlinear models, awareness of this pitfall is essential also when inferring a nonlinear model. By identifying easily testable conditions that need to be satisfied for successful GRN inference, we provide guidelines useful for avoiding pitfalls that can cause poor network models.
AX = −P + F,  Y = X + E | (1)
Y = −A⁻¹P + A⁻¹F + E | (2)
By taking the transpose of the variables and the “true” network model, and introducing the notation used for regressors Φ ≜ [ϕ1,…,ϕj,…,ϕN] = Yᵀ, regressands Ξ ≜ [ξ1,…,ξi,…,ξN] = −Pᵀ, regressor errors ϒ ≜ [υ1,…,υj,…,υN] = Eᵀ, and regressand errors Π ≜ [ε1,…,εi,…,εN] = −Fᵀ, we obtain the matrix form of the standard linear data model used in errors-in-variables regression problems
Φ = (Ξ − Π)Ǎ⁻ᵀ + ϒ, | (3a)
ΦǍᵀ = Ξ + ϒǍᵀ − Π. | (3b)
LASSO estimates each row ai of the network by solving

âi ≜ arg minₐ ‖Φaᵀ − ξi‖₂² + ζ‖a‖₁, | (4)

where ζ ≥ 0 is the regularisation coefficient that controls the sparsity of the estimate.
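As a concrete illustration, the following sketch simulates steady-state data according to (2) and solves the row-wise problem (4) with scikit-learn. The network, perturbation design, noise level, and ζ value are illustrative placeholders, not values from this study, and scikit-learn's objective rescales the residual term by the number of samples.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, M = 10, 20                         # genes, experiments

# Illustrative sparse network A (a placeholder, not a network from this study)
A = -np.eye(N) + 0.5 * rng.standard_normal((N, N)) * (rng.random((N, N)) < 0.25)

P = np.zeros((N, M))                  # two perturbed genes per experiment
for k in range(M):
    P[rng.choice(N, 2, replace=False), k] = 1.0

E = 0.01 * rng.standard_normal((N, M))
Y = -np.linalg.solve(A, P) + E        # eq (2) with F = 0

Phi, Xi = Y.T, -P.T                   # regressors and regressands, eq (3)
zeta = 0.01                           # illustrative regularisation coefficient
A_hat = np.zeros((N, N))
for i in range(N):                    # row-wise LASSO, eq (4)
    A_hat[i] = Lasso(alpha=zeta, fit_intercept=False,
                     max_iter=10000).fit(Phi, Xi[:, i]).coef_
```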
The Elastic net14 is a method based on LASSO which combines the L1 penalty from LASSO and the L2 penalty from ridge regression. The influences of the penalties are then weighted by a parameter α such that,
âi ≜ arg minₐ ‖Φaᵀ − ξi‖₂² + ζ(α‖a‖₁ + (1 − α)‖a‖₂²). | (5)
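The same row-wise problem with the Elastic net penalty (5) can be solved with scikit-learn's ElasticNet, whose l1_ratio corresponds to the weighting α. Again, the data and parameter values below are placeholders, and scikit-learn's scaling conventions differ slightly from (5).

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(1)
M, N = 20, 10
Phi = rng.standard_normal((M, N))    # regressors (placeholder data)
xi_i = rng.standard_normal(M)        # regressand for row i (placeholder)

# l1_ratio plays the role of alpha in (5); alpha below plays the role of zeta
fit = ElasticNet(alpha=0.01, l1_ratio=0.5, fit_intercept=False,
                 max_iter=10000).fit(Phi, xi_i)
a_i = fit.coef_                      # estimate of the i-th row of the network
```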
Bolasso15 is a bootstrap approach to LASSO inference, where the statistical properties of bootstrapping are combined with the LASSO, see Algorithm 1 and the sketch below. We use a constant number of bootstraps, nBS = 100, for each data set, as the statistical confidence should increase with nBS. This is well above the minimum number of bootstraps needed,15 which depends on the number of variables N. We extend the bootstrap algorithm by requiring that the bootstrapped data set has the same rank as the original data. In practice this means putting a rank requirement on the P matrix so that it has full row rank. This improves the performance, because it ensures that all genes are perturbed in at least one experiment, which is a necessary condition for correct inference.32 Bolasso was not applied to the 10 gene data sets because the data matrix becomes rank deficient if a sample that is left out during the bootstrap procedure contains the only perturbation of a gene. This is often the case in the 10 gene data, and the consequence is that links cannot achieve 100% bootstrap support if they can only be inferred when that unique experiment is sampled. For the same reason, Bolasso was not applied to the data reported by Lorenz et al.,2 as it only consists of one set of single gene perturbations, leading to a rank deficient data matrix as soon as one of the experiments is excluded during the bootstrap procedure.
Algorithm 1 Plain bootstrap LASSO algorithm. B is the inferred model and A the logical intersection of inferred models. aij is a link from j to i. nBS is the number of bootstraps.

procedure BOOTSTRAP LASSO(data, nBS)
    aij = 1 ∀ i, j ∈ A
    for 1:nBS do
        dataBS = DRAW WITH REPLACEMENT(data)
        B = LASSO(dataBS, ζ)
        A = A ∧ LOGICAL(B)
    end for
    A = {aij ∈ A}
end procedure

function DRAW WITH REPLACEMENT(data)
    Draw samples with replacement
    s.t. |dataBS| = |data|
    return dataBS
end function
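A minimal Python sketch of Algorithm 1, including the rank requirement on the resampled data described above; the function name is ours, and the rank test on the regressand matrix (i.e. on P) is one way to realise that requirement.

```python
import numpy as np
from sklearn.linear_model import Lasso

def bolasso(Phi, Xi, zeta, n_bs=100, rng=None):
    """Plain bootstrap LASSO: logical intersection of supports over bootstraps.

    Phi: (M, N) regressors, Xi: (M, N) regressands, zeta: LASSO penalty.
    Returns a boolean (N, N) matrix A with A[i, j] = True for a link j -> i.
    """
    rng = rng or np.random.default_rng(0)
    M, N = Phi.shape
    A = np.ones((N, N), dtype=bool)              # aij = 1 for all i, j
    for _ in range(n_bs):
        idx = rng.integers(0, M, size=M)         # draw with replacement
        # Rank requirement: redraw until the resampled regressands keep the
        # rank of the original data, i.e. every gene is still perturbed.
        while np.linalg.matrix_rank(Xi[idx]) < np.linalg.matrix_rank(Xi):
            idx = rng.integers(0, M, size=M)
        B = np.zeros((N, N))
        for i in range(N):
            B[i] = Lasso(alpha=zeta, fit_intercept=False,
                         max_iter=10000).fit(Phi[idx], Xi[idx, i]).coef_
        A &= (B != 0)                            # A = A AND LOGICAL(B)
    return A
```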
Least-Squares Cut-Off (LSCO) is a simple inference algorithm based on ordinary least squares (OLS) followed by the removal of all weak links, i.e. small nonzero parameters,
Â ≜ arg min_A ‖ΦAᵀ − Ξ‖F², followed by âij ≔ 0 for all |âij| < ζ. | (6)
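Under this reconstruction of (6), LSCO is a few lines of code; the cut-off value is a tuning choice on the same footing as ζ in the L1 methods.

```python
import numpy as np

def lsco(Phi, Xi, cutoff):
    """Least-Squares Cut-Off, eq (6): OLS fit of all rows at once,
    followed by setting all parameters below the cut-off to zero."""
    A_T, *_ = np.linalg.lstsq(Phi, Xi, rcond=None)   # minimises ||Phi A^T - Xi||
    A = A_T.T
    A[np.abs(A) < cutoff] = 0.0                      # remove weak links
    return A
```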
Robust Network Inference (RNI) is achieved by implicitly checking all network models that cannot be rejected based on the assumed data model and the desired significance level and only including the links that are present in all of these models.32 This gives the intersection of all non-rejectable models. In practice, the network model is obtained by calculating Nordling's confidence score and only including links with a value above one. Nordling's confidence score for the existence of the link aij is defined as
γ(aij) ≜ σN(Ψ(χ)), | (7a)
Ψ(χ) ≜ χ⁻¹Ψ, with χ ≜ √(λχ²α(MN)), | (7b)
and Ψ ≜ [ϕ1,…,ϕj−1,ϕj+1,…,ϕN,ξi], | (7c)

where σN(·) denotes the smallest singular value and χ²α(MN) the quantile of the χ² distribution with MN degrees of freedom at the desired significance level α.
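Taking this reconstruction at face value, the score could be computed as follows; the scaling in (7b) by the radius of the noise confidence region is our assumption and should be verified against ref. 32 before use.

```python
import numpy as np
from scipy.stats import chi2

def confidence_score(Phi, Xi, i, j, lam, alpha=0.05):
    """Nordling's confidence score gamma(a_ij) under our reading of (7):
    the smallest singular value of Psi scaled by the chi-square based
    radius of the noise confidence region (assumed form of eq (7b))."""
    M, N = Phi.shape
    Psi = np.column_stack([np.delete(Phi, j, axis=1), Xi[:, i]])
    chi = np.sqrt(lam * chi2.ppf(1 - alpha, M * N))   # assumed scaling
    return np.linalg.svd(Psi / chi, compute_uv=False)[-1]
```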
Tables 1 and 2 give an overview of the N = 10 and N = 45 network properties, respectively. For a complete list of the networks and properties see Tables S3 and S4 (ESI†).
Network properties | Low κ(A) | High κ(A) |
---|---|---|
# Genes, N | 10 | 10 |
# Networks | 10 | 10 |
Structure | Random | Random |
Interampatteness degree, κ(A) | 6.9–10 | 91.6–108 |
Sparsity | 0.25 | 0.25 |
Network properties | Low κ(A) | High κ(A) |
---|---|---|
# Genes, N | 45 | 45 |
# Networks | 10 | 10 |
Structure | Random | Random |
Interampatteness degree, κ(A) | 25.4–41.3 | 411.5–492.8 |
Sparsity | ≈0.07 | ≈0.07 |
We followed three different perturbation approaches, two for N = 10: Naive Random Double Perturbation (NRDP) and Sparse Balanced Excitation Design (SBED), and one for N = 45: triple Single Sets and a Single Double set (SSSD).
NRDP was constructed by perturbing two randomly chosen genes for each sample while making sure that P had full rank and that each gene was perturbed at least once; a sketch of the construction is given below. By perturbing genes more than once we make sure that each sample has some dependency on the remaining data set, a requirement for using the sample in leave-one-out cross-optimisation of ζ.26 This design yields data sets where the condition number of Y is close to the interampatteness degree of the network. We therefore generate data sets with condition numbers similar to those reported in the literature: 5, 154, and 215 in Gardner et al.,1 Alter et al.34 and Lorenz et al.,2 respectively.
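A minimal sketch of this construction (the function name is ours):

```python
import numpy as np

def nrdp(N, M, rng=None):
    """Naive Random Double Perturbation: two random genes are perturbed
    per sample; redraw until P has full row rank, which also guarantees
    that every gene is perturbed at least once."""
    rng = rng or np.random.default_rng(0)
    while True:
        P = np.zeros((N, M))
        for k in range(M):
            P[rng.choice(N, 2, replace=False), k] = 1.0
        if np.linalg.matrix_rank(P) == N:
            return P

P = nrdp(10, 20)   # M = 2N samples, as in the 10 gene data sets
```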
The objective of SBED is to excite all directions of the gene space uniformly, i.e. to spread out the response equally in the gene space, and obtain a well conditioned Y matrix.29 We do this approximately by minimising κ(Y) and the number of perturbed genes. Achieving uniform excitation is simpler for a dense perturbation matrix P, as the different signal directions in Y can then be more easily tuned. However, as perturbing a majority of the genes at once is unrealistic, we keep P as sparse as possible, i.e. we trade off a sparse perturbation design against uniform excitation in all directions of the gene space.
The SSSD perturbation design consists of triplicates in which a single gene is perturbed in each sample, plus one extra set of double perturbations in which two random genes are perturbed in each sample. This setup simulates a plausible experimental design that naively tries to maximise the information in the data set while utilising the fact that there needs to be a dependence between samples to do some form of cross validation.
Tables 3 and 4 show an overview of the data set properties. For a complete list of the data sets and properties see Tables S1, S2 and S5 (ESI†).
Data set property | ||
---|---|---|
Perturbation design | SBED | NRDP |
Samples, M | 2N | 2N |
# Data sets | 20 | 20 |
Condition number, κ(Y) | 1.3–2.0 | 9.5–181.3 |
Max # perturbations per sample | 2–6 | 2 |
Min # perturbations per sample | 1–3 | 2 |
Data set property | ||
---|---|---|
Perturbation design | SSSD | SSSD |
Samples, M | 4N | 4N |
# Data sets | 10 | 10 |
Condition number, κ(Y) | 25.7–41.3 | 412.8–504.51 |
Max # perturbations per sample | 2 | 2 |
Min # perturbations per sample | 1 | 1 |
We applied noise to each data set with a variance λ selected to give the desired Signal to Noise Ratio (SNR)
SNR ≜ σN(X)/√(λχ²α(MN)), | (8)

where σN(X) is the smallest singular value of the noise-free response matrix X in (1).
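With this reconstruction of (8), the noise variance for a target SNR follows by inversion; the χ² degrees of freedom are part of our assumed form.

```python
import numpy as np
from scipy.stats import chi2

def noise_variance(X, snr, alpha=0.05):
    """Invert eq (8), as reconstructed above, for the noise variance lambda:
    SNR = sigma_N(X) / sqrt(lambda * chi2_alpha(M*N))  (assumed form)."""
    N, M = X.shape
    sigma_min = np.linalg.svd(X, compute_uv=False)[-1]
    return sigma_min**2 / (snr**2 * chi2.ppf(1 - alpha, M * N))
```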
The Fraction of Provably Existing Links (FPEL) is the fraction of links existing in the true network that can be proven to exist based on the observed data. They are proven to exist by rejecting all alternative network models lacking these links at a desired significance level based on the observed data and true data model, i.e. when the considered set of network models contains the true network and the measurement noise is described by the error model that was used to generate it. FPEL is calculated as the number of links with Nordling's confidence score (7) above one divided by the number of links in the true network.32 It is the sensitivity of RNI. If all existing links can be proven to exist, i.e. FPEL = 1, then the data set is said to be informative enough for network inference. Note that FPEL and MCC are not directly comparable, since only the latter accounts for FP and TN. MCC is relative to the number of possible links N², while FPEL is relative to the number of links L present in the true network. Nonetheless, MCC = 1 corresponds to FPEL = 1 for RNI. Only measurement data and an error model are needed to calculate the number of provably existing links, implying that it can be used for validation even when no gold standard or true network exists.
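For concreteness, MCC over all N² possible links can be computed from the binarised supports as below, while FPEL counts links with confidence score (7) above one relative to the number of true links L; the function names are ours.

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

def network_mcc(A_true, A_hat):
    """MCC between true and inferred supports over all N^2 link positions."""
    return matthews_corrcoef((A_true != 0).ravel(), (A_hat != 0).ravel())

def fpel(scores, n_true_links):
    """Fraction of Provably Existing Links: confidence scores above one
    (one score per true link) divided by the number of true links L."""
    return np.count_nonzero(np.asarray(scores) > 1.0) / n_true_links
```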
The irrepresentable conditions16 can be quantified for each row i of the network by

μi ≜ |Φ0iᵀΦ0ci(Φ0ciᵀΦ0ci)⁻¹sign(θi)|, | (9)

where Φ0i contains the regressors (columns of Φ) corresponding to zero elements in row i of the true network, Φ0ci those corresponding to the nonzero elements θi, and the absolute value is taken element-wise.
If all elements of μi are smaller than 1, then the Weak Irrepresentable Condition (WIC) is fulfilled, and if all elements are smaller than 1 minus a positive constant η, then the Strong Irrepresentable Condition (SIC) is fulfilled.16 The latter is used to show that LASSO is strongly sign consistent and the former that it is general sign consistent; both imply that the probability that all elements in the LASSO estimate of θi have the correct sign goes to one when the number of samples M goes to infinity. A few additional technical conditions are required in the theorems, but it is logical to expect a high accuracy of the network estimate produced by LASSO when
μ ≜ maxi ‖μi‖∞ < 1. | (10)
If all columns in Φ0i are orthogonal to all columns in Φ0ci, then μi = 0 and SIC is fulfilled. Assume for a moment that Φ0i = ϕ1 and Φ0ci = ϕ2; then μi = |ϕ1ᵀϕ2|‖ϕ2‖⁻². Now if ϕ1 = αϕ2, i.e. ϕ1 is parallel to ϕ2, then μi = |α|, which is greater than or equal to one unless |α| is smaller than one, i.e. unless ϕ1 is shorter than ϕ2. Hence the projection of any regressor corresponding to a zero element that is not orthogonal to the regressors corresponding to nonzero elements onto them must always be shorter than all of them to fulfil SIC. This would always hold if all regressors corresponding to a zero element were shorter than all regressors corresponding to a nonzero element. This makes it interesting to calculate the minimum ratio, over all rows, between the shortest regressor corresponding to a nonzero element and the longest corresponding to a zero
rmin ≜ mini ( min{j:ǎij≠0} ‖ϕj‖₂ / max{j:ǎij=0} ‖ϕj‖₂ ). | (11)
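Both diagnostics can be computed directly from Φ and the true network, as in this sketch (each row is assumed to contain both zero and nonzero elements, as in the sparse networks used here):

```python
import numpy as np

def mu_i(Phi, a_row):
    """Vector mu_i of eq (9) for one row a_row of the true network."""
    nz = a_row != 0
    Phi_nz, Phi_z = Phi[:, nz], Phi[:, ~nz]
    G = Phi_nz.T @ Phi_nz
    return np.abs(Phi_z.T @ Phi_nz @ np.linalg.solve(G, np.sign(a_row[nz])))

def mu(Phi, A_true):
    """mu of eq (10): the largest element of mu_i over all rows."""
    return max(mu_i(Phi, row).max() for row in A_true)

def r_min(Phi, A_true):
    """Minimum length ratio of eq (11) over all rows."""
    norms = np.linalg.norm(Phi, axis=0)
    return min(norms[row != 0].min() / norms[row == 0].max() for row in A_true)
```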
It is important to note that in each case, for each noise realisation, we selected the ζ value that yielded the LASSO estimate closest to the true network, i.e. the highest MCC, for each of the 100 noise realisations at each noise level, and similarly for Elastic Net, Bolasso, and LSCO. The former was done to avoid the influence of the rule used to select the regularisation coefficient ζ, which typically has a strong influence on the accuracy of the network estimate and is difficult to select correctly.26 Our network estimates are thus in general unrealistically accurate and require knowledge of the true network, which is only available for simulated data, yet they are still far from correct. The latter was done to decrease the impact of random effects of the noise realisations in favour of data properties by doing Monte Carlo simulations. For this reason we also used the same 100 noise realisations for all data sets and all SNRs. We varied the network and data properties within ranges deemed reasonable and relevant for network inference based on previous studies. In the literature a single gene is typically perturbed in each experiment, but we here used NRDP, i.e. perturbed two genes in each experiment, selected at random while ensuring that each gene is perturbed and that the perturbations constitute a linearly independent set. We also analysed data sets generated by the typical single gene design and observed the same failure of LASSO (data not shown). A total of 2N (N = 10) or 4N (N = 45) simulated perturbation experiments were used in all data sets, which is comparable to the 3N experiments performed in vivo by Lorenz et al.2 and Gardner et al.,1 respectively.
For the 20 SBED data sets with a response matrix having a low condition number, LASSO, Elastic Net, and LSCO performed equally well and recovered the true network in all cases when the data sets were informative enough for network inference, see Fig. 2. The SBED was in these cases used to balance the excitation of all directions in the space spanned by the 10 genes of the network, so that all singular values of the response matrix were of similar magnitude, while perturbing as few genes as possible in each experiment. These response matrices therefore have condition numbers in the range 1.3 to 2.0, which we have not yet seen for any published gene expression data set. It is worth remembering that we selected the optimal ζ value for each data set and noise realisation for LASSO, Elastic Net, and LSCO, so the performance is in general unrealistically good. For SNR 0.1, a weak indication of LASSO and Elastic Net outperforming LSCO is present, but we cannot say that one method is in practice preferable over the other because the accuracy is sensitive to the selection of the value of the regularisation coefficient ζ.26 The same networks and noise realisations as described above were used.
Fig. 2 GRN inference accuracy versus signal to noise ratio using LASSO, Elastic Net, LSCO, and RNI on SBED data sets with N = 10 and low condition number κ(Y). For an SNR of 10, LASSO, Elastic Net, and LSCO can infer the true network structure for some of the data sets even though all existing links cannot be proven to exist (RNI has an MCC < 1). For an SNR > 10 the median inference accuracy of all methods approaches 1 and is above 90% for all data sets. For a description of the plot see Fig. 1.
Fig. 3 GRN inference accuracy versus signal to noise ratio for LASSO, Elastic Net, Bolasso, LSCO, and RNI on SSSD data sets with N = 45 and high condition number κ(Y). All L1 regularised methods fail even when all existing links can be proven to exist, corresponding to MCC = 1 for RNI. For a description of the plot see Fig. 1.
For the 10 SSSD data sets with a response matrix having a low condition number, Bolasso and LSCO performed equally well and recovered the true network in all cases when the data sets were informative enough for network inference, see Fig. 4, while LASSO and Elastic Net performed worse. This is probably due to the condition number of the response matrix, now in the range 26 to 41, being significantly larger than for the 10 gene case in Fig. 2. We also observe that Elastic Net performs worse than all other methods for the 45 gene case, but we have not investigated why. The same networks and noise realisations as described above were used.
Fig. 4 GRN inference accuracy versus signal to noise ratio for LASSO, Elastic Net, Bolasso, LSCO, and RNI on SSSD data sets with N = 45 and low condition number κ(Y). For a description of the plot see Fig. 1.
For low SNRs, RNI seems to be outperformed by all other methods, partly because of a large number of false negatives, which are a consequence of ensuring that only true positives are included in the network model under the mild assumptions that are fulfilled here, and partly because the optimal regularisation coefficient ζ of the other methods is selected based on knowledge of the true network. RNI performs better than LASSO and Elastic Net from SNR 1, and better than Bolasso from SNR 100, for the data sets with a higher condition number, because the SNR is defined based on the weakest singular value and the total excitation hence in general is higher. RNI is mainly included in this study because it gives FPEL and can thereby be seen as a lower bound on the performance that should be required from every other inference method. No other inference method that we are aware of can be used to prove that a link must exist in order to explain the observed data when accounting for the error model of the noise. The ability to prove the existence of links under mild assumptions is in our opinion so valuable in knowledge generation that network models generated by other methods should only be used when those methods outperform RNI.
Another indicator is rmin, the minimum ratio between the shortest regressor corresponding to a nonzero element and the longest corresponding to a zero. All data sets with a low condition number have a considerably higher rmin than all data sets with a high condition number, see Fig. 6. The fact that rmin is below one for all data sets implies that the longest regressor corresponding to a non-existing link exceeds the length of the shortest regressor corresponding to an existing link. For the data sets with a low condition number, the longest regressor corresponding to a non-existing link is expected to be nearly orthogonal to all regressors corresponding to existing links, while in data sets with a high condition number they are not.
Fig. 6 GRN inference accuracy versus the minimum ratio rmin, taken over all rows, between the shortest regressor corresponding to a nonzero element and the longest corresponding to a zero. A clear separation between data sets where LASSO can infer the true structure (red) and where it cannot (blue) is seen at rmin = 0.5. Colours as in Fig. 5.
Evaluation of the irrepresentable conditions and the ratio between the shortest regressor corresponding to a nonzero element and the longest corresponding to a zero requires knowledge of the true network, so they cannot in practice be used to evaluate whether LASSO will produce an accurate estimate. The lack of a linear relation between MCC and μ or rmin indicates that neither of the measures captures all aspects that affect the performance of LASSO, so further studies are needed. Until a better testable criterion for failure of LASSO is presented, we recommend that all users check the condition number of the response matrix, as discussed above. The condition number has the advantage of being a classical tool in linear algebra that is easy to calculate.
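In practice the check is a one-liner; the threshold below for flagging ill-conditioned data is an illustrative choice, not a value derived in this work.

```python
import numpy as np

def check_conditioning(Y, threshold=100.0):
    """Condition number of the response matrix Y: the ratio of its largest
    to its smallest singular value. Large values flag ill-conditioned data;
    the threshold is an illustrative choice."""
    kappa = np.linalg.cond(Y)
    if kappa > threshold:
        print(f"warning: kappa(Y) = {kappa:.1f}, data are ill-conditioned")
    return kappa
```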
To avoid the influence of the selection of the regularisation parameter ζ, we varied it over the whole range from a full to an empty inferred network, see the ESI,† and report the largest MCC value of each method for each of the networks reported by Lorenz et al.:2 S10, S19, and S9. The first gold standard, S10, is a collection of links for which Lorenz et al.2 found experimental evidence in the literature. To get the second gold standard, S19, they complemented these links with links that they found in their validation experiments using ChIP and qPCR. S9 is their final network estimate using NIR39 followed by t-tests keeping only statistically significant interactions. It is not a gold standard, but we included it for comparison. Note that a link in these gold standards can mean very different things, anything from a binding observed in a ChIP experiment to an influence on the expression of the other gene. The applied inference methods can only pick up influences that led to expression changes present in the recorded data. It is therefore unlikely that any of these gold standards equals the “true” network that would be obtained if more data were collected until Nordling's confidence score for each possible link is either above one or approaches zero. We therefore refrain from making statements about which method performs best based on comparison to these gold standards. The MCC of LASSO, Elastic Net, LSCO, and RNI is below 0.27 for both gold standards and hence in agreement with our expectation based on our simulations, see Table 5. The MCC between S9 and the two gold standards is below 0.28, i.e. of the same magnitude. The network inferred by RNI is empty, indicating that the data contain so little information that no link can be proven to exist.
Table 5 MCC of the networks inferred from the data of Lorenz et al.2 compared to the reference networks S10, S19, and S9

Method | S10 | S19 | S9
---|---|---|---
LASSO | 0.18 | 0.22 | 0.36 |
LSCO | 0.21 | 0.20 | 0.32 |
Elastic net | 0.18 | 0.27 | 0.40 |
RNI | 0.00 | 0.00 | 0.00 |
NIR (S9) | 0.25 | 0.28 | 1.00 |
For both well-conditioned and ill-conditioned data, we found an SNR, as defined in (8), of 10 to be sufficient for LSCO and RNI to achieve maximum accuracies close to one. For data with an SNR below one the accuracy of all methods was in general low. This puts high demands on the quality of experimental data to be useful for GRN inference.
Footnotes
† Electronic supplementary information (ESI) available. See DOI: 10.1039/c4mb00419a |
‡ These authors contributed equally. |