Cheng Liang, Jiawei Luo* and Dan Song
College of Information Science and Engineering, Hunan University, Changsha, Hunan, China. E-mail: alcs417@hnu.edu.cn; luojiawei@hnu.edu.cn; dansongph@hnu.edu.cn
First published on 25th June 2014
Advances in proteomic technologies combined with sophisticated computing and modeling methods have generated an unprecedented amount of high-throughput data for system-scale analysis. As a result, the study of protein–protein interaction (PPI) networks has garnered much attention in recent years. One of the most fundamental problems in studying PPI networks is to understand how their architecture originated and evolved to their current state. By investigating how proteins of different ages are connected in the yeast PPI networks, one can deduce their expansion procedure in evolution and how the ancient primitive network expanded and evolved. Studies have shown that proteins are often connected to other proteins of a similar age, suggesting a high degree of age preference between interacting proteins. Though several theories have been proposed to explain this phenomenon, none of them considered protein-clusters as a contributing factor. Here we first investigate the age-dependency of the proteins from the perspective of network motifs. Our analysis confirms that proteins of the same age groups tend to form interacting network motifs; furthermore, those proteins within motifs tend to be within protein complexes and the interactions among them largely contribute to the observed age preference in the yeast PPI networks. In light of these results, we describe a new modeling approach, based on “network motifs”, whereby topologically connected protein clusters in the network are treated as single evolutionary units. Instead of modeling single proteins, our approach models the connections and evolutionary relationships of multiple related protein clusters or “network motifs” that are collectively integrated into an existing PPI network. Through simulation studies, we found that the “network motif” modeling approach can capture yeast PPI network properties better than if individual proteins were considered to be the simplest evolutionary units. 
Our approach provides a fresh perspective on modeling the evolution of yeast PPI networks, specifically that PPI networks may have a much higher age-dependency of interaction density than had been previously envisioned.
Initially, Barabási and Albert proposed the preferential attachment (PA) model, in which a new node gains connections with probability proportional to the degree of pre-existing nodes.4 While the PA model can generate networks with scale-free topology, it cannot simultaneously recapitulate network modularity, i.e. the high clustering coefficients that are usually observed in real biological networks (for example, in protein complexes or network motifs). Shortly after, the gene duplication–divergence (DD) model emerged as an alternative and more realistic model for network evolution.5–7 This model was inspired by frequently occurring gene and genome duplication events and proceeds in two steps: (1) randomly duplicating a node and retaining all its connections; (2) rewiring the connections of the duplicated node or the original node according to a certain probability distribution.8,9 Finally, the crystal growth (CG) model was proposed by Kim and Marcotte; it takes into account interaction preferences between proteins of similar age and phylogenetic lineage in the yeast PPI networks, and it introduces module structures as a constraint on newly added nodes.10 In essence, new nodes are added into pre-divided communities with the possibility of having a fixed number of new connections, analogous to the natural growth of a crystal lattice. Interestingly, Kim and Marcotte found that proteins belonging to the same age group are more likely to interact with each other than those from different age groups. To calibrate the interaction preferences between nodes of different age groups, they introduced a new metric called “interaction density”, where a positive value indicates that nodes connect more frequently with nodes from the same age group and a negative value indicates that they connect more frequently with nodes from different age groups.
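As a rough illustration of the first two growth rules described above, the following Python sketch performs one preferential-attachment step and one duplication–divergence step on a graph stored as adjacency sets. The function names, the integer node labels and the parameters `m` and `p_loss` are assumptions made for this example; the cited models define their own rewiring rules and parameter values.

```python
import random


def preferential_attachment_step(adj, rng, m=2):
    """Add one node with up to m edges; targets are drawn proportionally to degree.

    adj maps an integer node label to the set of its neighbours (undirected).
    m is a hypothetical number of edges per new node.
    """
    nodes = sorted(adj)
    weights = [len(adj[v]) for v in nodes]
    # Only nodes with non-zero degree can ever be drawn.
    reachable = sum(1 for w in weights if w > 0)
    targets = set()
    while len(targets) < min(m, reachable):
        targets.add(rng.choices(nodes, weights=weights, k=1)[0])
    new = max(adj) + 1
    adj[new] = set()
    for t in targets:
        adj[new].add(t)
        adj[t].add(new)
    return new


def duplication_divergence_step(adj, rng, p_loss=0.5):
    """Duplicate a random node, then lose each copied edge with probability p_loss.

    p_loss is a hypothetical divergence parameter, not a value from the paper.
    """
    original = rng.choice(sorted(adj))
    duplicate = max(adj) + 1
    adj[duplicate] = set()
    for neighbour in sorted(adj[original]):
        # (1) the duplicate inherits the edge; (2) divergence may remove it.
        if rng.random() >= p_loss:
            adj[duplicate].add(neighbour)
            adj[neighbour].add(duplicate)
    return duplicate


if __name__ == "__main__":
    rng = random.Random(0)
    graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}  # triangle seed
    for _ in range(5):
        preferential_attachment_step(graph, rng)
        duplication_divergence_step(graph, rng)
    print(len(graph), "nodes after 5 PA and 5 DD steps")
```

The two functions are kept separate so that either rule can be iterated on its own; mixing them in one loop, as in the usage example, is only for demonstration.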
Generally, there are two criteria for evaluating the validity of these theoretical models: (1) whether the model can recapitulate the global or local properties observed in real networks, and (2) whether the underlying principles of these models are biologically plausible.11 The aforementioned models have been either investigated on purely theoretical grounds, or applied and tested on real interaction data sets, with varying degrees of success in recapitulating the observed network topological properties. One important issue with these previous modeling approaches is that they consider individual proteins as elementary evolutionary units, i.e. the network can only grow or contract by one protein at a time. However, it is often observed that biological networks can grow by the addition of several proteins simultaneously. Such groups of associated proteins are called “network motifs”, which are recurrent and statistically significant sub-graphs or patterns in complex networks.12 These motifs are often considered as recurring building blocks that make up bigger and more complex networks, and they often carry out specific biological functions as a coherent unit. For example, transcriptional regulatory networks can often be decomposed into motifs that have unique topology and carry out specific cellular functions, e.g. the single-input motif (SIM), multiple-input motif (MIM) and feed-forward loop (FFL).13 Many earlier studies have shown that gene and genome duplication events have contributed to the emergence of network motifs in transcriptional regulatory networks.14 In the context of protein–protein interaction (PPI) networks, due to the high frequency of these motifs, their composition and evolution can often be studied to offer clues on the evolution of the entire network.15,16 In addition, it has been suggested that the abundance and distribution of network motifs also contribute to the robustness of the network to small external perturbations.17 These observations suggest that network motifs can potentially function as independent evolutionary units and may be important in the growth of biological networks. Recently, Liu et al. reported that clustered interacting proteins, along with the interactions among them, were incorporated into the existing PPI networks during a relatively short period of time.18 They also concluded that proteins of the same age group tend to form network motifs while those of different age groups tend to avoid such motif structures. Such correlative observations are intriguing and have significant evolutionary implications, although they have not been rigorously modeled and simulated.
Considering the advantages of network motifs, here we attempt to model network evolution based on the notion that network motifs can be considered as cohesive and distinct evolutionary units. In our model the concept of “network motif” is similar to but distinct from “protein complexes” or “functional modules”. The main distinction is that network motifs are strictly bound by conserved topological structures. In essence, network motifs often reflect the functionality of protein complexes or functional modules without preconceived functional limitations. Compared with protein complexes and functional modules, network motifs have an explicit topological definition that ignores functionality, so they can be clearly identified and enumerated in biological networks. We show that this new modeling approach can not only capture the topological properties of the real yeast PPI network, but also generate comparative insight between the different modeling approaches that attempt to explain the underlying evolutionary mechanisms. We believe that such an approach can provide a different perspective on the evolution of the yeast PPI network.
Dataset | Average degree | Characteristic path | Clustering coefficient | Delta D |
---|---|---|---|---|
LC-Kim | 7.3 | 4.0 | 0.50 | 0.54 |
IDBOS-Gavin | 12.3 | 5.8 | 0.63 | 0.47 |
IDBOS-Krogan | 4.2 | 4.9 | 0.63 | 0.88 |
Y2H-Union | 2.7 | 5.6 | 0.046 | 0.70 |
In order to gain a deeper insight into the age-dependency of interaction density, we used a heat map to show the interaction preference between the four age groups classified by Kim and Marcotte, i.e. “ABE”, “AE/BE”, “E” or “Fu”, according to the taxonomic distribution of constituent domains among archaea (A), bacteria (B), eukaryotes (E) and fungi (Fu). Specifically, all four datasets not only show an overall high positive delta D value, but also exhibit dense interaction density tendencies within the same age group (diagonal) and sparse tendencies between different age groups (off-diagonal) in each column. This clearly demonstrates the existence of strong interaction preference between proteins (Fig. 1). The robustness of the positive delta D pattern regardless of the source of the data also suggests that the strong age preference is a genuine feature of yeast PPI networks, which is not caused by any artifacts in the data.
Since the networks constructed from all four PPI datasets share very similar topological properties and interaction preference between proteins, we considered these properties as standard features of yeast PPI networks, and used them as benchmark features in evaluating the prediction performance of various models. The definition of these parameters can be found in Materials and methods. Particularly, the age-dependency of the interaction density pattern (delta D) will be discussed in detail and considered as the main criterion for our evaluation.
Fig. 2 The age patterns for the 3-node and 4-node subgraphs. This figure is modified after Figure 2, “Network motifs and evolutionary motif modes”, in Liu et al., BMC Evol. Biol., 2011, 11, 133.
Subgraph typeᵃ | Motif or not | Empirical p-value, #3 | Empirical p-value, #2–1 | Empirical p-value, #1–1–1 | Dataset |
---|---|---|---|---|---|
A | No | 0.149 | 0.647 | 0.605 | LC-Kim |
B | Yes | <10⁻³ | (<10⁻³) | (<10⁻³) | LC-Kim |
A | No | 0.006 | 0.67 | 0.938 | IDBOS-Gavin |
B | Yes | <10⁻³ | 0.929 | 0.97 | IDBOS-Gavin |
A | No | 0.002 | 0.892 | 0.957 | IDBOS-Krogan |
B | Yes | <10⁻³ | (<10⁻³) | (<10⁻³) | IDBOS-Krogan |
A | No | 0.226 | 0.384 | 0.739 | Y2H-Union |
B | Yes | <10⁻³ | 0.956 | 0.994 | Y2H-Union |

ᵃ The subgraph types correspond to the labels in Fig. 2.
Subgraph typeᵃ | Motif or not | Empirical p-value, #4 | Empirical p-value, #3–1 | Empirical p-value, #2–2 | Empirical p-value, #2–1–1 | Empirical p-value, #1–1–1–1 | Dataset |
---|---|---|---|---|---|---|---|
C | No | 0.028 | 0.396 | 0.649 | 0.542 | 0.986 | LC-Kim |
D | No | 0.913 | 0.773 | 0.745 | 0.165 | 0.568 | LC-Kim |
E | Yes | <10⁻³ | <10⁻³ | 0.39 | (<10⁻³) | (<10⁻³) | LC-Kim |
F | No | 0.091 | 0.324 | 0.378 | 0.67 | 0.996 | LC-Kim |
G | Yes | 0.005 | 0.309 | 0.259 | 0.962 | (<10⁻³) | LC-Kim |
H | Yes | <10⁻³ | <10⁻³ | (<10⁻³) | (<10⁻³) | (<10⁻³) | LC-Kim |
C | No | 0.082 | 0.284 | 0.58 | 0.647 | 0.984 | IDBOS-Gavin |
D | No | 0.131 | 0.209 | 0.265 | 0.787 | 0.943 | IDBOS-Gavin |
E | Yes | 0.042 | 0.149 | 0.338 | 0.835 | 0.992 | IDBOS-Gavin |
F | No | 0.488 | 0.447 | 0.249 | 0.594 | 0.948 | IDBOS-Gavin |
G | Yes | 0.078 | 0.131 | 0.304 | 0.838 | 0.952 | IDBOS-Gavin |
H | Yes | 0.046 | 0.261 | 0.59 | 0.713 | (<10⁻³) | IDBOS-Gavin |
C | No | 0.028 | 0.59 | 0.867 | 0.473 | 0.761 | IDBOS-Krogan |
D | No | 0.16 | 0.541 | 0.783 | 0.391 | 0.724 | IDBOS-Krogan |
E | Yes | 0.004 | 0.381 | 0.719 | 0.765 | 0.88 | IDBOS-Krogan |
F | No | 0.233 | 0.238 | 0.076 | 0.842 | 0.956 | IDBOS-Krogan |
G | Yes | 0.037 | 0.674 | 0.631 | 0.506 | 0.913 | IDBOS-Krogan |
H | Yes | <10⁻³ | 0.83 | 0.613 | 0.834 | 0.989 | IDBOS-Krogan |
C | No | 0.317 | 0.274 | 0.174 | 0.751 | 0.735 | Y2H-Union |
D | No | 0.017 | 0.004 | 0.614 | 0.982 | 0.935 | Y2H-Union |
E | Yes | 0.366 | 0.162 | 0.216 | 0.748 | 0.981 | Y2H-Union |
F | No | 0.976 | 0.55 | 0.054 | 0.608 | 0.478 | Y2H-Union |
G | Yes | 0.022 | 0.173 | 0.534 | 0.86 | 0.899 | Y2H-Union |
H | Yes | <10⁻³ | 0.698 | 0.788 | 0.985 | 0.355 | Y2H-Union |
After confirming the age-dependency of network motifs, we next investigated the degree to which network motifs composed of proteins of similar age contribute to the observed interaction preference in the network. We removed the interactions within the age-homogeneous network motifs (i.e. age patterns #3 and #4) and recalculated the delta D value. To assess the significance of these interactions, we compared the result with that obtained by randomly removing the same number of interactions from the network; this random removal was repeated 1000 times to generate the empirical p-value. After the removal of the motif interactions, delta D drops significantly more than after random deletion of interactions (Table S2, ESI†), which demonstrates that motifs with proteins of the same age contribute substantially to the observed interaction preference in the network.
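The removal test described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function name, the edge representation and the `statistic` callable (which stands in for a delta D implementation) are hypothetical, and the p-value is taken here as the fraction of random deletions that lower the statistic at least as much as deleting the motif interactions, which may differ in detail from the authors' procedure.

```python
import random


def empirical_p_value(edges, motif_edges, statistic, n_rep=1000, seed=0):
    """Compare deleting motif_edges with deleting equally many random edges.

    edges, motif_edges: collections of frozenset({u, v}) pairs.
    statistic: hypothetical callable mapping a set of edges to a number,
               e.g. an implementation of delta D.
    """
    rng = random.Random(seed)
    edges = set(edges)
    motif_edges = set(motif_edges) & edges
    observed = statistic(edges - motif_edges)
    # Sort so that sampling is reproducible for a given seed.
    edge_list = sorted(edges, key=sorted)
    hits = 0
    for _ in range(n_rep):
        removed = set(rng.sample(edge_list, len(motif_edges)))
        if statistic(edges - removed) <= observed:
            hits += 1
    return hits / n_rep


if __name__ == "__main__":
    # Toy data: two age groups; the statistic counts same-age interactions.
    age = {1: "old", 2: "old", 3: "young", 4: "young"}

    def same_age_edges(edge_set):
        return sum(1 for e in edge_set if len({age[v] for v in e}) == 1)

    pairs = [(1, 2), (3, 4), (1, 3), (1, 4), (2, 3), (2, 4)]
    all_edges = {frozenset(p) for p in pairs}
    within = {frozenset((1, 2)), frozenset((3, 4))}
    print(empirical_p_value(all_edges, within, same_age_edges))
```

A small p-value means that random deletions rarely reduce the statistic as much as removing the motif interactions does.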
As Table 3 shows, networks generated by the PA model have the lowest clustering coefficient, indicating that the PA model cannot explain the modularity that is frequently observed in the real yeast PPI networks. The symmetric DD model has a larger characteristic path length than the real yeast PPI network, while the asymmetric DD model has a smaller clustering coefficient. Of these models, only the CG model appears to have network properties that closely resemble those of the real yeast PPI network. For the node degree distribution P(k), all the models can recapitulate the power-law distribution; for the clustering coefficient C(k), the PA model, the symmetric DD model and the CG model have distributions similar to those of the real PPI network. For the average degree of nearest neighbours 〈k_nn〉, all the models show a decreasing trend (Fig. 3). Similar to the observation in ref. 10, all the models except the CG model have a negative delta D value, which suggests that none of these models except the CG model is sufficient to recapitulate the interaction preference between proteins of real yeast PPI networks.
Model | Characteristic path length | Clustering coefficient | Average degree | Delta D |
---|---|---|---|---|
PA model | 3.5 ± 0.02 | 0.02 ± 0.001 | 8.0 ± 1 × 10⁻¹⁴ | −0.8 ± 0.03 |
Symmetric DD model | 7.6 ± 1.11 | 0.33 ± 0.02 | 4.1 ± 0.27 | −0.26 ± 0.04 |
Asymmetric DD model | 4.5 ± 0.27 | 0.26 ± 0.02 | 6.6 ± 0.49 | −0.71 ± 0.04 |
CG model | 4.3 ± 0.05 | 0.51 ± 0.01 | 8.0 ± 4 × 10⁻¹⁵ | 0.005 ± 0.09 |
Network motif model, n = 3 | 5.6 ± 0.23 | 0.77 ± 0.002 | 10.0 ± 1 × 10⁻¹⁴ | 0.38 ± 0.03 |
Network motif model, n = 3, 4 | 5.1 ± 0.22 | 0.77 ± 0.003 | 11.2 ± 0.04 | 0.29 ± 0.04 |
From our analysis, we found that only the CG model, the network motif model and the real yeast PPI networks have positive delta D values, whereas the rest of the models have negative ones. Though the PA model and the DD model (including other DD-based models) are commonly used in simulating the evolution of yeast PPI networks, neither of them could generate the observed delta D value of the real networks. The positive delta D values obtained by the CG model and the network motif model indicate that only these two models can generate networks whose interaction density resembles that of the real PPI network. Notably, our model has the largest delta D value, which most closely mimics that of the real yeast PPI networks. Because interaction preference is our main criterion, we also investigated the interaction preference between different node groups for each model. Our model demonstrates a distinct interaction preference between nodes of the same age group (diagonal), while the other models except CG show the opposite trend (Fig. 4). Although CG shows a tendency similar to that of our model, it is less pronounced in the younger age groups (such as G3–G3 and G4–G4). Thus the CG model generates only a moderate interaction preference between nodes, whereas our model could potentially reflect the tendency seen in the real yeast PPI networks.
In summary, the network motif model can capture the primary topological properties of the yeast PPI networks (Table 3 and Fig. 3 and 4). The superiority of our model in capturing age preference of interaction density reflects the importance of the motif structures in the evolutionary process of the real yeast PPI networks.
As introduced by Kim et al., the interaction density is highly protein age-dependent in yeast PPI networks. By analyzing four yeast PPI networks from three different sources, we confirmed the strong age-dependency of PPIs. We then examined the age preference within network motifs. These network motifs are recognized as closely associated proteins that either make up a protein complex (or sub-complex) or constitute part of cellular pathways. Our analysis demonstrates that proteins within network motifs indeed tend to exist in protein complexes. Another property of these network motifs is that they are often over-represented in the PPI network; thus they can be considered as building blocks of such networks. However, network motifs are not always protein complexes, because they are defined by conserved topological structures rather than by specific functions. The common network motifs found throughout the four PPI networks also showed a strong age preference and largely contributed to the observed age-dependency in the yeast PPI networks. To verify whether this age preference is caused by the duplication–divergence mechanism during the evolution of yeast PPI networks, we calculated the proportion of paralogs found in the common network motifs. Surprisingly, the paralog pairs make a very limited contribution to the age preference, which indicates that the age preference within motifs might be formed by other evolutionary mechanisms.
Motivated by these biological observations, in this paper we proposed a new “network motif” based modeling approach, which simulates the evolution and growth of yeast PPI networks by addition of “network motifs”. An important underlying assumption of our model is that all the members within the same network motif have the same age. Building on this, our model adopted a two-step modeling strategy, anchoring and extension, iteratively attaching known network motifs to the anchor node and its neighbors. Unlike other existing models such as duplication–divergence models, our network motif model does not require special parameters to be tuned except the predefined network size. As a result, our model is fairly robust and insensitive to the initial starting states of the network and the parameters. In contrast, most of the current modeling approaches require custom fitting of a significant number of parameters (Table S4, ESI†); the topology of the resulting networks is often very sensitive to these parameters, e.g. the number of connections of a newly added node in the PA model23 and the divergence probability in the DD model. We showed that the new network-motif based modeling approach can successfully capture the principal topological properties of the yeast PPI network, such as power-law degree distribution, high clustering coefficients, short characteristic path length, and the age-dependency indicator – delta D value. The results demonstrated that our model can get much larger delta D values compared to the CG model, which means the networks generated by our model have stronger interaction preference between PPIs and most resemble the previously observed age-dependency of the interaction density pattern in real yeast PPI networks. In addition, our model is robust to different strategies of choosing the anchor node as well as the seed graph.
The motivation of this study is to show that modeling the evolution of yeast PPI networks by the addition of “network motifs” consisting of multiple proteins, an event that has occurred frequently in the evolutionary process, is a realistic alternative approach. Such an integration process may take longer evolutionary time than the evolutionary step based on single genes as modeled in other methods. As demonstrated in this paper, this modeling approach can indeed capture the main characteristics of the yeast PPI networks that we see today. However, it is likely that the real yeast PPI network evolved through the addition of both single proteins and groups of proteins (i.e. network motifs). From the analysis of all the existing models we can see that a single evolutionary mechanism is not sufficient to explain the diversity demonstrated by the real yeast PPI network. Even though the network motif model can better characterize the interaction preference in the yeast PPI networks, a major shortcoming of our model as well as the CG model is that low-degree nodes are not captured in the generated networks. A mixed evolutionary mechanism that combines several of the existing plausible evolutionary scenarios would therefore be a reasonable conjecture. Moreover, to obtain more accurate and complete results, we recommend that users detect more reliable network motifs by removing noisy data from the datasets and use network motifs of different sizes to simulate their own networks. The work presented here may open new avenues for research on network evolution modeling.
Input: network motif file, specified network size that is reached at the end of simulation
Output: network of required size
Initialization: a seed network of four fully connected nodes
Step 1. Choose an anchor node in the network following the anti-preferential rule. This anchor node is to be connected to the network motifs selected later.
Step 2. Read the network motif storage file, choose a network motif and record its size n.
Step 3. Add the chosen motif into the network by connecting the network motif to the anchor node (selected in Step 1) and its neighbors. If the anchor node has fewer than n neighbors, connect all of its neighbors to the selected motif. Otherwise, randomly pick n of its neighbors and connect them to the selected motif.
Step 4. Repeat this process till the network reaches the pre-defined size.
The above steps are repeated 50 times and all the results shown in the next section are averaged based on these 50 replications.
Having described the basic procedures in our simulation, we next explain the motivation and rationale behind each of these steps. In Step 1, the anti-preferential rule specifies that a node is chosen as the anchor with a probability inversely proportional to its degree. We adopted this strategy based on the previously published observation that the number of interactions one protein can gain is restricted by its available surfaces.10 This is in contrast to the “preferential attachment” principle used in some earlier network modeling studies, in which a new protein is preferentially attached to a “hub” protein that already has many interacting partners, i.e. the probability of attachment is proportional to the node's degree.4 The “preferential attachment” model was successful in explaining the power-law distribution of node degrees, but it had difficulties in other aspects of biological network modeling; for example, it failed to explain the modularity of the real network.11 In Step 2, we randomly choose a motif from the storage file and add it into the network. We only consider motifs of size 3 or 4 since it is more difficult for a large number of proteins to be incorporated into an existing network (e.g. horizontal gene transfer). We note that a smaller motif can be implicitly included as part of a larger motif in our process. In addition, all the network motifs are chosen with equal probability, i.e. from a uniform distribution. We choose the motifs uniformly instead of taking their frequencies from the real yeast PPI data for two reasons: (1) motifs are only part of the entire set of subgraphs in the network and (2) a uniform distribution does not introduce biased constraints on our model. In Step 3, we connect the external network motif with the anchor node following separate rules depending on whether the anchor node already has n connections (n is the size of the motif). 
If the anchor has fewer than n neighbors, we connect the anchor node and all of its neighbors to all the nodes in the incoming motif; if the anchor node has n or more neighbors, we randomly select n of them and connect these neighbors and the anchor node to all the nodes of the imported network motif. The main idea of this strategy is that motifs are more likely to form or be integrated into bigger protein clusters, such as protein complexes. Because only a limited number of the anchor node's neighbors are connected to the imported motif, the remaining neighbors keep a relatively low degree, so the current cluster retains a chance of absorbing further motifs.
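The four steps and the connection rule explained above can be put together in a short Python sketch. It is an illustration under stated assumptions rather than the authors' implementation: the motif library `MOTIFS` (a triangle and a 4-clique), the function names and the per-motif age labels are hypothetical, and the motifs actually used in the paper are those read from the network motif file.

```python
import random

# Hypothetical motif library: (size, internal edges on local indices 0..size-1).
MOTIFS = [
    (3, [(0, 1), (1, 2), (0, 2)]),                          # triangle
    (4, [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]),  # 4-clique
]


def add_edge(adj, u, v):
    adj[u].add(v)
    adj[v].add(u)


def choose_anchor(adj, rng):
    """Anti-preferential rule: selection weight is 1 / degree."""
    nodes = sorted(adj)
    weights = [1.0 / len(adj[v]) for v in nodes]
    return rng.choices(nodes, weights=weights, k=1)[0]


def grow_network(target_size, motifs=MOTIFS, seed=0):
    """Grow a network motif by motif until it has at least target_size nodes."""
    rng = random.Random(seed)
    # Initialization: a seed network of four fully connected nodes (age 0).
    adj = {v: set() for v in range(4)}
    for u in range(4):
        for v in range(u + 1, 4):
            add_edge(adj, u, v)
    age = {v: 0 for v in adj}
    step = 0
    while len(adj) < target_size:  # Step 4; may overshoot by up to size - 1
        step += 1
        anchor = choose_anchor(adj, rng)            # Step 1
        size, motif_edges = rng.choice(motifs)      # Step 2, uniform choice
        neighbours = sorted(adj[anchor])
        if len(neighbours) > size:                  # Step 3: keep n neighbours
            neighbours = rng.sample(neighbours, size)
        first = len(adj)                            # labels are 0..len(adj)-1
        members = list(range(first, first + size))
        for m in members:
            adj[m] = set()
            age[m] = step                           # all members share one age
        for a, b in motif_edges:
            add_edge(adj, members[a], members[b])
        for host in [anchor] + neighbours:
            for m in members:
                add_edge(adj, host, m)
    return adj, age


if __name__ == "__main__":
    network, ages = grow_network(200, seed=1)
    mean_degree = sum(len(n) for n in network.values()) / len(network)
    print(len(network), "nodes, average degree", round(mean_degree, 2))
```

Recording one age label per motif reflects the model's assumption that all members of a motif have the same age; the labels can then be grouped into age classes when computing interaction preferences.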
As mentioned above, we choose an anchor node in the existing network to attach the selected motif at each step. There are three ways to choose an anchor node: anti-preferential attachment, preferential attachment and random selection. The preferential attachment mechanism is based on the rationale that a new node tends to connect preferentially with nodes of higher degree.4 It has the advantage of producing a power-law degree distribution but lacks modularity. Anti-preferential attachment is motivated by the rationale that a protein has only a limited amount of surface available to interact with other proteins; therefore, when a new protein comes into the network, it has less chance of establishing an interaction with a protein that already has many interacting partners.
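To make the difference between the three strategies concrete, the following Python sketch computes the probability of each node being picked as the anchor. The function name `anchor_probabilities` and the strategy labels are assumptions made for this example, and every node is assumed to have at least one neighbour.

```python
def anchor_probabilities(adj, strategy):
    """Return {node: probability of being chosen as anchor}.

    adj maps each node to the set of its neighbours; every node is assumed
    to have degree >= 1 so that 1 / degree is defined.
    """
    nodes = sorted(adj)
    if strategy == "preferential":
        weights = [float(len(adj[v])) for v in nodes]  # proportional to degree
    elif strategy == "anti-preferential":
        weights = [1.0 / len(adj[v]) for v in nodes]   # inversely proportional
    elif strategy == "random":
        weights = [1.0] * len(nodes)                   # uniform
    else:
        raise ValueError("unknown strategy: " + strategy)
    total = sum(weights)
    return {v: w / total for v, w in zip(nodes, weights)}


if __name__ == "__main__":
    # Star graph: node 0 is a hub connected to three leaves.
    star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
    for name in ("preferential", "anti-preferential", "random"):
        probs = anchor_probabilities(star, name)
        print(name, {v: round(p, 3) for v, p in probs.items()})
```

On the star graph the hub is the most likely anchor under preferential attachment and the least likely one under the anti-preferential rule, which is the behaviour the surface-limitation argument calls for.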
We tested both models with all three mechanisms for choosing the anchor node and found that our model yields a positive delta D value regardless of the strategy, whereas the CG model yields only a small positive delta D value with the anti-preferential mechanism (data not shown). We also found that our model yielded the largest delta D value with anti-preferential attachment, the second largest with random selection and the lowest with preferential attachment. Such observations suggest that our motif-based evolution model can capture the age preference of the yeast PPI network and is more robust than the CG model to the choice of anchor nodes.
Clustering coefficient is the measurement of the cohesiveness between the neighbors of a node. Suppose the degree of node v is k_v and the number of edges that actually exist between its neighbors is E_v; then the clustering coefficient C_v is

$$C_v = \frac{2E_v}{k_v (k_v - 1)}$$
The clustering coefficient of a network is the average of the clustering coefficient of all nodes in this network, and C(k) is the average clustering coefficient of nodes with degree k.
Average degree is the average degree of all the nodes in the network.
Degree distribution P(k) is the probability that a node in the network has degree k.
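The quantities defined so far can be computed with a few lines of Python. The sketch below uses the standard local clustering coefficient, 2E_v / (k_v (k_v − 1)); the function names, the dictionary keys and the convention of assigning a coefficient of 0 to nodes with fewer than two neighbours are assumptions made for this example.

```python
from collections import Counter


def clustering_coefficient(adj, v):
    """Local clustering coefficient of node v: 2 * E_v / (k_v * (k_v - 1))."""
    k = len(adj[v])
    if k < 2:
        return 0.0  # assumed convention for nodes with fewer than 2 neighbours
    nbrs = sorted(adj[v])
    links = sum(
        1
        for i, a in enumerate(nbrs)
        for b in nbrs[i + 1:]
        if b in adj[a]
    )
    return 2.0 * links / (k * (k - 1))


def network_summary(adj):
    """Average degree, network clustering coefficient, C(k) and P(k)."""
    n = len(adj)
    degree = {v: len(adj[v]) for v in adj}
    local_c = {v: clustering_coefficient(adj, v) for v in adj}
    by_degree = {}
    for v, k in degree.items():
        by_degree.setdefault(k, []).append(local_c[v])
    return {
        "average_degree": sum(degree.values()) / n,
        "clustering": sum(local_c.values()) / n,
        "clustering_by_degree": {k: sum(c) / len(c) for k, c in by_degree.items()},
        "degree_distribution": {k: c / n for k, c in Counter(degree.values()).items()},
    }


if __name__ == "__main__":
    # Triangle 0-1-2 with a pendant node 3 attached to node 0.
    graph = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
    for key, value in network_summary(graph).items():
        print(key, value)
```

For real PPI datasets a graph library would normally be preferable; the plain-dictionary version is used here only to keep each definition visible.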
The age-dependency of interaction density Dm,n is defined as
The average interaction density gradient ΔD is defined as
Footnote
† Electronic supplementary information (ESI) available. See DOI: 10.1039/c4mb00230j
This journal is © The Royal Society of Chemistry 2014