Marie Schrynemackers,*a Louis Wehenkel,a M. Madan Babub and Pierre Geurtsa
aDepartment of EE and CS & GIGA-R, University of Liège, Belgium. E-mail: marie.schrynemackers@ulg.ac.be
bMRC Laboratory of Molecular Biology, Cambridge, UK
First published on 11th May 2015
Networks are ubiquitous in biology, and computational approaches for their inference have been widely investigated. In particular, supervised machine learning methods can be used to complete a partially known network by integrating various measurements. Two main supervised frameworks have been proposed: the local approach, which trains a separate model for each network node, and the global approach, which trains a single model over pairs of nodes. Here, we systematically investigate, theoretically and empirically, the exploitation of tree-based ensemble methods in the context of these two approaches for biological network inference. We first formalize the problem of network inference as a classification of pairs, unifying in the process homogeneous and bipartite graphs and discussing two main sampling schemes. We then present the global and the local approaches, extending the latter for the prediction of interactions between two unseen network nodes, and discuss their specializations to tree-based ensemble methods, highlighting their interpretability and drawing links with clustering techniques. Extensive computational experiments carried out with these methods on various biological networks clearly show that they are competitive with existing methods.
When formulated as a supervised learning problem, network inference consists of learning a classifier on pairs of nodes. Two main approaches have been investigated in the literature to adapt existing classification methods to this problem.1 The first one, which we call the global approach, considers this problem as a standard classification problem on an input feature vector obtained by concatenating the feature vectors of each node of the pair.1 The second approach, called local,2,3 trains a different classifier for each node separately, aiming to predict its direct neighbors in the graph. These two approaches have mainly been exploited with support vector machine (SVM) classifiers. In particular, several kernels have been proposed for comparing pairs of nodes in the global approach,4,5 and the global and local approaches can be related for specific choices of this kernel.6 A number of papers applied the global approach with tree-based ensemble methods, mainly Random Forests,7 for the prediction of protein–protein8–11 and drug–protein12 interactions, combining various feature sets. Besides the local and global methods, other approaches for supervised graph inference include, among others, matrix completion methods,13 methods based on output kernel regression,14,15 Random Forests-based similarity learning,16 and methods based on network properties.17
In this paper, we systematically investigate, theoretically and empirically, the exploitation of tree-based ensemble methods in the context of the local and global approaches for supervised biological network inference. We first formalize biological network inference as a problem of classification of pairs, considering in the same framework homogeneous graphs, defined on one kind of node, and bipartite graphs, linking nodes of two families. We then define the general local and global approaches in the context of this formalization, extending in the process the local approach for the prediction of interactions between two unseen network nodes. The paper discusses in detail the specialization of these approaches to tree-based ensemble methods. In particular, we highlight their high potential in terms of interpretability and draw connections between these methods and unsupervised (bi-)clustering methods. Experiments on several biological networks show the good predictive performance of the resulting family of methods. Both the local and the global approaches are competitive, with an advantage for the global approach in terms of predictive performance and for the local approach in terms of compactness of the inferred models.
The paper is structured as follows. Section 2 first defines the general problem of supervised network inference and casts it as a classification problem on pairs. It then presents two generic approaches to address it and their particularization to tree ensembles. Section 3 reports experiments with these methods on several homogeneous and bipartite biological networks. Section 4 concludes and discusses future work directions. Additional experimental results and implementation details can be found in the ESI.†
In this context, the problem of supervised network inference can be formulated as follows (see Fig. 1):
Given a partial knowledge of the adjacency matrix Y of the target network, find the best possible predictions of the missing or unknown entries of this matrix by exploiting the feature description of the network nodes.
In this paper, we address this problem as a supervised classification problem on pairs.18 A learning sample, denoted LSp, is constructed as the set of all pairs of nodes that are known to interact or not (i.e., the known entries in the adjacency matrix). The input variables used to describe these pairs are the feature vectors of the two nodes in the pair. A classification model f (i.e. a function associating a label in {0,1} to each combination of the input variables) can then be trained from LSp and used to predict the missing entries of the adjacency matrix.
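As an illustration, the following sketch trains a classifier of this kind on a toy dataset, using concatenated node feature vectors as inputs. The data, sizes, and variable names are ours, not from the paper:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(0)
n_r, n_c, d = 30, 40, 5
X_r = rng.normal(size=(n_r, d))          # feature vectors of the row nodes
X_c = rng.normal(size=(n_c, d))          # feature vectors of the column nodes
Y = rng.integers(0, 2, size=(n_r, n_c))  # toy adjacency labels (known entries)

# Learning sample on pairs: concatenate the two nodes' feature vectors.
pairs = [(i, j) for i in range(n_r) for j in range(n_c)]
X_pairs = np.array([np.concatenate([X_r[i], X_c[j]]) for i, j in pairs])
y_pairs = np.array([Y[i, j] for i, j in pairs])

model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X_pairs, y_pairs)
proba = model.predict_proba(X_pairs)[:, 1]  # class conditional probabilities
```

In practice the learning sample would of course contain only the known entries of the adjacency matrix, and the model would be queried on the unknown ones.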
The evaluation of the predictions of supervised network inference methods requires special care. Indeed, not all pairs are equally easy to predict: it is typically much more difficult to predict pairs involving nodes for which no examples of interactions are provided in the learning sample LSp. As a consequence, to get a complete assessment of a given method, one needs to partition the predictions into different families, depending on whether the nodes in the tested pair are represented or not in the learning set LSp, and then to perform a separate evaluation within each family.18
To formalize this, let us denote by LSc and LSr the nodes from the two sets that are present in LSp (i.e. which are involved in some pairs in LSp) and by TSc and TSr (where TS stands for the test set) the nodes that are unseen in LSp. The pairs of nodes to predict (i.e., outside LSp) can be divided into the following four families (where S1 × S2 denotes the Cartesian product between sets S1 and S2 and S1/S2 their difference):
• (LSr × LSc)/LSp: predictions of (unseen) pairs between two nodes which are represented in the learning sample.
• LSr × TSc or TSr × LSc: predictions of pairs between one node represented in the learning sample and one unseen node.
• TSr × TSc: predictions of pairs between two unseen nodes.
These families of pairs are represented in the adjacency matrix in Fig. 2A. Thereafter, to simplify the notations, we denote the four families as LS × LS, LS × TS, TS × LS and TS × TS. In the case of a homogeneous undirected graph, only three sets can be defined, as the two sets LS × TS and TS × LS coincide.18
Prediction performance is expected to differ between these four families. Typically, one expects TS × TS pairs to be the most difficult to predict, since less information about the corresponding nodes is available at training time. These predictions are therefore evaluated separately in this work, as suggested in several publications.18,19 They can be evaluated by performing two kinds of cross-validation (CV): a first CV procedure on pairs of nodes (denoted “CV on pairs”) to evaluate LS × LS predictions (see Fig. 2B) and a second CV procedure on nodes (denoted “CV on nodes”) to evaluate LS × TS, TS × LS and TS × TS predictions (see Fig. 2C).18
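The bookkeeping behind the two CV schemes can be sketched as follows: "CV on pairs" folds the set of known pairs directly, while "CV on nodes" folds the row and column node sets and partitions the pairs accordingly. Sizes and names are toy assumptions, not the paper's actual protocol:

```python
import numpy as np
from sklearn.model_selection import KFold

n_r, n_c = 12, 8
pairs = [(i, j) for i in range(n_r) for j in range(n_c)]

# CV on pairs: standard K-fold directly over the pair indices.
for tr, te in KFold(n_splits=4, shuffle=True, random_state=0).split(pairs):
    pass  # train on pairs[tr], evaluate LS x LS predictions on pairs[te]

# CV on nodes: K-fold over rows and columns separately, then split the pairs.
row_folds = list(KFold(4, shuffle=True, random_state=1).split(range(n_r)))
col_folds = list(KFold(4, shuffle=True, random_state=2).split(range(n_c)))
ls_r, ts_r = map(set, row_folds[0])   # seen / unseen row nodes
ls_c, ts_c = map(set, col_folds[0])   # seen / unseen column nodes
lsls = [(i, j) for i, j in pairs if i in ls_r and j in ls_c]  # training block
lsts = [(i, j) for i, j in pairs if i in ls_r and j in ts_c]  # LS x TS
tsls = [(i, j) for i, j in pairs if i in ts_r and j in ls_c]  # TS x LS
tsts = [(i, j) for i, j in pairs if i in ts_r and j in ts_c]  # TS x TS
assert len(lsls) + len(lsts) + len(tsls) + len(tsts) == len(pairs)
```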
In the case of a homogeneous graph, the adjacency matrix Y is a symmetric square matrix. We introduce two adaptations of the approach to handle such graphs. First, for each pair (nr,nc) in the learning sample, the pair (nc,nr) is also introduced in the learning sample. Without further constraints on the classification method, this does not however ensure that the learnt function fglob will be symmetric in its arguments. To make it symmetric, we compute a new class conditional probability model fpglob,sym from the learned one fpglob as follows:

fpglob,sym(nr,nc) = ½[fpglob(nr,nc) + fpglob(nc,nr)].
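In code, this symmetrization simply averages the learned model's class probability over both orderings of a pair. A minimal sketch (the helper name and toy data are ours):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

def predict_sym(model, x_r, x_c):
    # Average the class-1 probability over both orderings of the pair.
    a = model.predict_proba(np.concatenate([x_r, x_c])[None, :])[0, 1]
    b = model.predict_proba(np.concatenate([x_c, x_r])[None, :])[0, 1]
    return 0.5 * (a + b)

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 8))        # concatenated pair features (4 per node)
y = rng.integers(0, 2, size=100)     # toy interaction labels
clf = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X, y)

u, v = rng.normal(size=4), rng.normal(size=4)
assert predict_sym(clf, u, v) == predict_sym(clf, v, u)  # symmetric by construction
```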
These two sets of classifiers can then be exploited to make LS × TS and TS × LS types of predictions. For pairs (nr,nc) in LS × LS, two predictions can be obtained: fnc(nr) and fnr(nc). We propose to simply combine them by an arithmetic average of the corresponding class conditional probability estimates.
As such, the local approach is in principle not able to make direct predictions for pairs of nodes (nr,nc) ∈ TS × TS (because LS(nr) = LS(nc) = ∅ for nr ∈ TSr and nc ∈ TSc). We nevertheless propose the following two-step procedure to learn a classifier for a node nr ∈ TSr (see Fig. 4):
• First, learn all classifiers fnc for nodes nc ∈ LSc (equivalent to the completion of the columns in Fig. 4),
• Then, learn a classifier from the predictions given by the models fnc trained in the first step (equivalent to the completion of the rows in Fig. 4).
Again by symmetry, the same strategy can be applied to obtain models for the nodes nc ∈ TSc. A prediction is then obtained for a pair (nr,nc) in TS × TS by averaging the class conditional probability predictions of the two second-step models, one built for nr and one for nc. A related two-step procedure has been proposed by Pahikkala et al.20 for learning on pairs with kernel methods.
Note that to derive the learning samples needed to train the second-step models, one needs to choose a threshold on the predicted class conditional probability estimates (to turn these probabilities into binary classes). In our experiments, we set this threshold so that the proportion of edges versus non-edges in the predicted subnetworks in LS × TS and TS × LS equals the corresponding proportion within the original learning sample of pairs.
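The full two-step procedure can be sketched as follows on toy data. Variable names, sizes, and the exact thresholding details are illustrative assumptions, not the authors' implementation:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(1)
n_r, n_c, d = 40, 30, 4
X_r = rng.normal(size=(n_r, d))                 # row-node features
X_c = rng.normal(size=(n_c, d))                 # column-node features
Y = (rng.random((n_r, n_c)) < 0.3).astype(int)  # toy interaction labels

ls_r, ts_r = np.arange(30), np.arange(30, 40)   # seen / unseen row nodes
ls_c, ts_c = np.arange(20), np.arange(20, 30)   # seen / unseen column nodes

def proba_of_one(model, X):
    # Probability of the positive class, robust to single-class training sets.
    p = model.predict_proba(X)
    return p[:, list(model.classes_).index(1)] if 1 in model.classes_ else np.zeros(len(X))

# Step 1: one local model per column node in LSc, trained on the LS rows,
# used to fill in the TS x LS block of the adjacency matrix.
Y_hat = np.zeros((len(ts_r), len(ls_c)))
for k, c in enumerate(ls_c):
    m = ExtraTreesClassifier(n_estimators=50, random_state=int(c))
    m.fit(X_r[ls_r], Y[ls_r, c])
    Y_hat[:, k] = proba_of_one(m, X_r[ts_r])

# Threshold so the completed block keeps the training edge proportion.
edge_prop = Y[np.ix_(ls_r, ls_c)].mean()
Y_bin = (Y_hat > np.quantile(Y_hat, 1 - edge_prop)).astype(int)

# Step 2: one model per unseen row node, trained on the column features with
# the step-1 predictions as pseudo-labels; it then predicts TS x TS entries.
preds_tsts = np.zeros((len(ts_r), len(ts_c)))
for i in range(len(ts_r)):
    m2 = ExtraTreesClassifier(n_estimators=50, random_state=i)
    m2.fit(X_c[ls_c], Y_bin[i])
    preds_tsts[i] = proba_of_one(m2, X_c[ts_c])
```

By symmetry, the same loop over unseen column nodes would produce a second set of TS × TS scores, which the text proposes to average with the first.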
This strategy can be specialized to the case of a homogeneous graph in a straightforward way. Only one set of classifiers is trained for the nodes in LS and one for the nodes in TS (using the same two-step procedure as in the asymmetric case for the latter). LS × LS and TS × TS predictions are still obtained by averaging two predictions, one for each node of the pair.
Single decision trees typically suffer from high variance, which makes them uncompetitive in terms of accuracy. This problem is circumvented by using ensemble methods that generate several trees and then aggregate their predictions. In this paper, we exploit one particular ensemble method called extremely randomized trees (extra-trees22). This method grows each tree in the ensemble by selecting at each node the best among K randomly generated splits. In our experiments, we use the default setting of K, equal to the square root of the total number of candidate attributes.
One interesting feature of tree-based methods (single and ensemble) is that they can be extended to predict a vectorial output instead of a single scalar output.23 We will exploit this feature of the method in the context of the local approach below.
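For instance, a multi-output tree ensemble can map a node's feature vector directly to its whole row of the adjacency matrix. A sketch with toy data (we use a regressor on 0/1 targets so that the tree-averaged predictions act as scores; names and sizes are ours):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor  # supports vector outputs

rng = np.random.default_rng(2)
X_r = rng.normal(size=(50, 6))                  # features of 50 row nodes
Y = (rng.random((50, 20)) < 0.2).astype(float)  # 20 adjacency entries per node

# A single ensemble predicts the whole output vector (all columns) at once.
model = ExtraTreesRegressor(n_estimators=100, random_state=0)
model.fit(X_r[:40], Y[:40])
scores = model.predict(X_r[40:])  # shape (10, 20): scores for 10 unseen rows
```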
This approach has the advantage of requiring only four tree ensemble models in total, instead of one model per node as in the single output approach. It can however only be used when the complete submatrix is observed for pairs in LS × LS, since the tree-based ensemble method cannot cope with missing output values.
In the case of the global approach, as illustrated in Fig. 5A, the tree that is built partitions the adjacency matrix (more precisely, its LSr × LSc part) into rectangular regions. These regions are defined such that pairs in each region are either all connected or all disconnected. The region is furthermore characterized by a path in the tree (from the root to the leaf) corresponding to tests on the input features of both nodes of the pair.
In the case of the local multiple output approach, one of the two trees partitions the rows and the other tree partitions the columns of the adjacency matrix. Each partitioning is carried out in such a way that nodes in each subpartition have a similar connectivity profile. The resulting partition of the adjacency matrix thus follows a checkerboard structure in which, as far as possible, each submatrix contains only connected or only disconnected pairs (Fig. 5B). Each submatrix is furthermore characterized by two conjunctions of tests, one on the row inputs and one on the column inputs. These two methods can thus be interpreted as carrying out a biclustering25 of the adjacency matrix, where the biclustering is directed by the choice of tests on the input features. A concrete illustration can be found in Fig. 6 and in the ESI.†
Fig. 6 Illustration of the interpretability of a multiple-output decision tree on a drug–protein interaction network. We zoomed in on the rectangular subregion with the highest number of interactions and present a list of drug and protein features associated with this region. See the ESI† for more details about the procedures.
In the case of the local single output approach, the partitioning is more fine-grained as it can be different from one row or column to another. However in this case, each tree gives an interpretable characterization of the nodes which are connected to the node from which the tree was built.
When using ensembles, the global approach provides a global ranking of all features from most to least relevant. The local multiple output approach provides two separate rankings, one for the row features and one for the column features, and the local single output approach gives a separate ranking for each node. All variants are therefore complementary from an interpretability point of view.
| | Network | Network size | Number of edges | Number of features |
|---|---|---|---|---|
| Homogen. networks | PPI | 984 × 984 | 2438 | 325 |
| | EMAP | 353 × 353 | 1995 | 418 |
| | MN | 668 × 668 | 2782 | 325 |
| Bipartite networks | ERN | 154 × 1164 | 3293 | 445/445 |
| | SRN | 113 × 1821 | 3663 | 9884/1685 |
| | DPI | 1862 × 1554 | 4809 | 660/876 |
As highlighted by several studies,39 high-degree nodes in biological networks have a higher chance of being connected to any new node. In our context, this means that the degree of a node can be expected to be a good predictor for inferring new interactions involving this node. We want to assess the magnitude of this effect and provide a more realistic baseline than the usual random-guess performance. To this end, we evaluate the AUROC and AUPR scores obtained when ranking LS × LS pairs by the sum of the degrees of the two nodes of a pair, and when ranking TS × LS or LS × TS pairs by the degree of the node belonging to the LS. AUROC and AUPR scores are evaluated using the same protocol as above. As there is no information about the degrees of the nodes in TS × TS pairs, we use random guessing as a baseline for these predictions (corresponding to an AUROC of 0.5 and an AUPR equal to the proportion of interactions among all node pairs).
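A toy sketch of this degree baseline (the data and masking scheme are illustrative, not the experimental protocol):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(3)
Y = (rng.random((25, 25)) < 0.15).astype(int)  # toy adjacency block
mask = rng.random(Y.shape) < 0.2               # held-out LS x LS pairs
Y_train = np.where(mask, 0, Y)                 # hide the held-out entries

deg_r = Y_train.sum(axis=1)                    # row-node degrees (training only)
deg_c = Y_train.sum(axis=0)                    # column-node degrees

# Rank the held-out pairs by the sum of the two node degrees, then score.
scores = (deg_r[:, None] + deg_c[None, :])[mask]
labels = Y[mask]

auroc = roc_auc_score(labels, scores)
aupr = average_precision_score(labels, scores)
```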
| Network | Method | AUPR (LS × LS) | AUPR (LS × TS) | AUPR (TS × TS) | AUROC (LS × LS) | AUROC (LS × TS) | AUROC (TS × TS) |
|---|---|---|---|---|---|---|---|
| PPI | Global | 0.41 | 0.22 | 0.10 | 0.88 | 0.84 | 0.76 |
| | Local so | 0.28 | 0.21 | 0.11 | 0.85 | 0.82 | 0.73 |
| | Local mo | — | 0.22 | 0.11 | — | 0.83 | 0.72 |
| | Baseline | 0.13 | 0.02 | 0.00 | 0.73 | 0.74 | 0.50 |
| EMAP | Global | 0.49 | 0.36 | 0.23 | 0.90 | 0.85 | 0.78 |
| | Local so | 0.45 | 0.35 | 0.24 | 0.90 | 0.84 | 0.79 |
| | Local mo | — | 0.35 | 0.23 | — | 0.85 | 0.80 |
| | Baseline | 0.30 | 0.13 | 0.03 | 0.87 | 0.80 | 0.50 |
| MN | Global | 0.71 | 0.40 | 0.09 | 0.95 | 0.85 | 0.69 |
| | Local so | 0.57 | 0.38 | 0.09 | 0.92 | 0.83 | 0.68 |
| | Local mo | — | 0.45 | 0.14 | — | 0.85 | 0.71 |
| | Baseline | 0.05 | 0.04 | 0.01 | 0.75 | 0.70 | 0.50 |
Fig. 7 Precision–recall curves for the metabolic network: the more nodes of a pair that are present in the learning set, the better the prediction for this pair.
In terms of absolute AUPR and AUROC values, LS × LS pairs are clearly the easiest to predict, followed by LS × TS pairs and TS × TS pairs. This ranking was expected from the discussion above. Baseline results in the case of LS × LS and LS × TS predictions confirm that node degrees are very informative: baseline AUROC values are much greater than 0.5, and baseline AUPR values are also significantly higher than the proportion of interactions among all pairs (0.005, 0.03, and 0.01 respectively for PPI, EMAP, and MN), especially in the case of LS × LS predictions. Nevertheless, our methods are better than these baselines in all cases. On the EMAP network, the difference in terms of AUROC is very slight, but the difference in terms of AUPR is substantial. This is typical of highly skewed classification problems, where precision–recall curves are known to give a more informative picture of the performance of an algorithm than ROC curves.40
All tree-based approaches are very close on LS × TS and TS × TS pairs, but the global approach has an advantage over the local one on LS × LS pairs. The difference is pronounced on the PPI and MN networks. For the local approach, the performances of the single and multiple output variants are indistinguishable, except on the metabolic network, where the multiple output approach gives better results. This is in line with the better performance of the global versus the local approach on this problem, as both the global and the local multiple output approaches grow a single model that can potentially exploit correlations between the outputs. Notice that the multiple output approach is not feasible when predicting LS × LS pairs, as multiple output decision trees cannot deal with missing output values.
| Network | Method | AUPR (LS × LS) | AUPR (LS × TS) | AUPR (TS × LS) | AUPR (TS × TS) | AUROC (LS × LS) | AUROC (LS × TS) | AUROC (TS × LS) | AUROC (TS × TS) |
|---|---|---|---|---|---|---|---|---|---|
| ERN (TF–gene) | Global | 0.78 | 0.76 | 0.12 | 0.08 | 0.97 | 0.97 | 0.61 | 0.64 |
| | Local so | 0.76 | 0.76 | 0.11 | 0.10 | 0.96 | 0.97 | 0.61 | 0.66 |
| | Local mo | — | 0.75 | 0.09 | 0.09 | — | 0.97 | 0.61 | 0.65 |
| | Baseline | 0.31 | 0.30 | 0.02 | 0.02 | 0.86 | 0.87 | 0.52 | 0.50 |
| SRN (TF–gene) | Global | 0.23 | 0.27 | 0.03 | 0.03 | 0.84 | 0.84 | 0.54 | 0.57 |
| | Local so | 0.20 | 0.25 | 0.02 | 0.03 | 0.80 | 0.83 | 0.53 | 0.57 |
| | Local mo | — | 0.24 | 0.02 | 0.03 | — | 0.83 | 0.53 | 0.57 |
| | Baseline | 0.06 | 0.06 | 0.03 | 0.02 | 0.79 | 0.78 | 0.51 | 0.50 |
| DPI (drug–protein) | Global | 0.14 | 0.05 | 0.11 | 0.01 | 0.76 | 0.71 | 0.76 | 0.67 |
| | Local so | 0.21 | 0.11 | 0.08 | 0.01 | 0.85 | 0.72 | 0.72 | 0.57 |
| | Local mo | — | 0.10 | 0.08 | 0.01 | — | 0.72 | 0.71 | 0.60 |
| | Baseline | 0.02 | 0.01 | 0.01 | 0.01 | 0.82 | 0.63 | 0.68 | 0.50 |
Fig. 8 Precision–recall curves for the E. coli regulatory network (TF vs. genes): predictions are easier when the TF of a pair belongs to the learning set than when the gene does.
As for the homogeneous networks, the more nodes of a pair that are present in the learning set, the better the predictions, i.e., AUPR and AUROC values decrease significantly from LS × LS to TS × TS predictions. On the ERN and SRN networks, performances are very different for the two kinds of LS × TS predictions that can be defined, with much better results when generalizing over genes (i.e., when the TF of the pair is in the learning sample). On the other hand, on the DPI network, both kinds of LS × TS predictions are equally well predicted. These differences are probably due in part to the relative numbers of nodes of both kinds in the learning sample, as there are many more genes than TFs in ERN and SRN and a similar number of drugs and proteins in the DPI network. The differences are however probably also related to the intrinsic difficulty of generalizing over each node family, as on the four additional DPI networks (see the ESI†), generalization over drugs is most of the time better than generalization over proteins, irrespective of the relative numbers of drugs and proteins in the training network. Results are better than the baselines (based on node degrees for LS × LS and LS × TS predictions and on random guessing for TS × TS predictions) in most cases. The only exceptions are observed when generalizing over TFs on SRN and when predicting TS × TS pairs on SRN and DPI.
The three approaches are very close to each other. Unlike on homogeneous graphs, there is no strong difference between the global and the local approach on LS × LS predictions: the global approach is slightly better in terms of AUPR on ERN and SRN but worse on DPI. The single and multiple output approaches are also very close, both in terms of AUPR and AUROC. Similar results are observed on the four additional DPI networks.
| Publication | DB | Protocol | Measures | Their results | Our results |
|---|---|---|---|---|---|
| Ref. 2 | PPI | LS × TS, 5CV | AUPR | 0.25 | 0.21 |
| | MN | | | 0.41 | 0.43 |
| Ref. 14 | PPI | LS × TS, 10CV | AUPR/ROC | 0.18/0.91 | 0.22/0.84 |
| | | TS × TS | | 0.09/0.86 | 0.10/0.76 |
| | MN | LS × TS | | 0.18/0.85 | 0.45/0.85 |
| | | TS × TS | | 0.07/0.72 | 0.14/0.71 |
| Ref. 3 | ERN | LS × TS, 3CV | Recall 60/80 | 0.44/0.18 | 0.38/0.15 |
| Ref. 38 | DPI | LS × LS, 5CV | AUROC | 0.75 | 0.88 |
| Ref. 41 | DPI | LS × LS, 5CV | AUROC | 0.87 | 0.88 |
| | | LS × TS & TS × LS | | 0.74 | 0.74 |
Globally, these comparisons show that tree-based methods are competitive on all six networks. Moreover, note that (1) no other method has been tested over all these problems, and (2) we have not tuned any parameters of the Extra-trees method. Better performance could be achieved by changing, for example, the randomization scheme,7 the feature selection parameter K, or the number of trees.
The global and local approaches are close in terms of accuracy, except when predicting LS × LS interactions, where the global approach almost always gives better predictions. The local multiple output method has the advantage of providing less complex models and requiring less memory and training time. All approaches nevertheless remain interesting because of their complementarity in terms of interpretability.
As two side contributions, we extended the local approach for the prediction of edges between two unseen nodes and proposed the use of multiple output models in this context. The two-step procedure used to obtain this kind of prediction provides results similar to those of the global approach, although it trains the second model on the first model's predictions. It would be interesting to investigate other prediction schemes and to evaluate this approach in combination with other supervised learning methods such as SVMs.20 The main benefits of using multiple output models are to reduce model sizes and potentially computing times, as well as to reduce variance, thereby improving accuracy, by exploiting potential correlations between the outputs. It would be interesting to apply other multiple output or multi-label supervised learning methods42 within the local approach.
We focused on the evaluation and comparison of our methods on various biological networks. To the best of our knowledge, no other study has considered simultaneously as many of these networks. Our protocol defines an experimental testbed to evaluate new supervised network inference methods. Given our methodological focus, we have not tried to obtain the best possible predictions on each and every one of these networks. Obviously, better performances could be obtained in each case by using up-to-date training networks, by incorporating other feature sets, and by (cautiously) tuning the main parameters of tree-based ensemble methods. Such adaptation and tuning would not change however our main conclusions about relative comparisons between methods.
A limitation of our protocol is that it assumes the presence of known positive and negative interactions. Most often in biological networks, only positive interactions are recorded, and unlabeled pairs are not necessarily true negatives (a notable exception in our experiments is the EMAP dataset). In this work, we considered all unlabeled examples as negative examples. It has been shown empirically and theoretically that this approach is reasonable.43 It would nevertheless be interesting to design tree-based ensemble methods that explicitly take into account the absence of true negative examples.44
Footnotes
† Electronic supplementary information (ESI) available: Implementation and computational issues, supplementary performance curves, and illustration of interpretability of trees. See DOI: 10.1039/c5mb00174a
‡ In this paper, the terms network and graph will refer to the same thing.
This journal is © The Royal Society of Chemistry 2015