Francisco A. Rodrigues a and Luciano da Fontoura Costa *b
a Instituto de Física de São Carlos, Universidade de São Paulo, Av. Trabalhador São Carlense 400, Caixa Postal 369, CEP 13560-970, São Carlos, São Paulo, Brazil
b National Institute of Science and Technology for Complex Systems, USA. E-mail: luciano@if.sc.usp.br
First published on 29th September 2009
Protein–protein interaction networks were investigated in terms of outward accessibility, which quantifies the effectiveness of each protein in accessing other proteins and is related to the internality of nodes. By comparing the accessibility of 144 ortholog proteins in yeast and the fruit fly, we found that the accessibility tends to be higher among proteins in the fly than in yeast. In addition, z-scores of the accessibility calculated for different species revealed that the protein networks of less evolved species tend to be more random than those of more evolved species. The accessibility was also used to identify the border of the yeast protein interaction network, which was found to be mainly composed of viable proteins.
Another new area of science, systems biology, focuses on the systematic study of complex interactions in biological systems.6–9 Basically, cellular organization can be divided into three main levels of interaction: genes, proteins and metabolites.10 Genes are regulated by transcription factors, the proteome organizes itself into a protein interaction network, and metabolites are interconnected through an intricate network of metabolites. At the protein level, molecules bind to one another while respecting shape and affinity constraints in order to control biochemical reactions and provide the physical scaffolding for life. Therefore, the integrated understanding of such networks holds the key to seminal advances in biology.
The interactions between proteins are particularly important in defining biological functions (e.g. ref. 11 and 12). For example, signals coming from the exterior of a cell are mediated by protein–protein interactions. This process, called signal transduction, plays a fundamental role in many biological systems and diseases. Proteins might interact for a long time to form part of a protein complex. Alternatively, they can interact only briefly with another protein in order to modify it (e.g. a protein kinase will add a phosphate to a target protein). Many partial protein interaction maps for several eukaryotic species are now available,11,13–15 motivating several studies aimed at analyzing the structure and evolution of such networks.16–18 In contrast, the study of protein–protein interaction networks in terms of simulated dynamics has been addressed only more recently, e.g. through self-avoiding random walks.19
In the current work, in order to obtain further insights about protein interaction networks, we consider the dynamics of self-avoiding random walks, which involves agents moving through a network without visiting any vertex more than once. The choice of this particular type of non-linear dynamics in our analysis is justified biologically because it is naturally related to sequential protein activation, such as in signal transduction.20 The signals are carried by messenger proteins, which transmit a signal from one part of the cell to another, e.g. from the cytosol to the nucleus. In addition, insulin receptor substrate proteins define critical interactions for transmitting the signal downstream. These intracellular protein–protein interactions are essential in transmitting the signal from the receptor to the final cellular species, such as translocation of vesicles containing GLUT4 glucose transporters from the intracellular pool to the plasma membrane, activation of glycogen or protein synthesis, and initiation of specific gene transcription.21 Because of the purposeful nature of these interactions, such dynamics can be suitably modeled by self-avoiding random walks.
Self-avoiding walks are highly dependent on the network structure and are thus able to sense specific structural patterns. In addition, self-avoiding random walks necessarily generate paths of limited length in finite-sized biological networks, while traditional random walks would yield highly redundant paths of infinite length (implied by repetitions of the same interactions). The quantification of the properties of random walk dynamics can be done by considering different network dynamical measurements.22 The choice of such measurements is typically performed in terms of the properties one wants to analyze in the network. For instance, while the outward activation is related to the influence of proteins along the network path,19 the accessibility quantifies the effectiveness of proteins in interacting with all the other proteins in the network. Previous investigations19 identified important relationships between the outward activation and protein lethality, in the sense that lethal proteins tend to present higher outward activation than viable proteins.
The outward accessibility is also related to the internality of nodes in a network, since nodes with the smallest accessibility values tend to belong to the borders of networks,23 while those with large values define the interior of networks. For instance, in Fig. 1, the accessibility of the protein YFR021w (dark gray node) is higher than that obtained for protein YLR309c. Note that these proteins occupy different regions of the network. The estimation of accessibility also allowed us to identify, possibly for the first time, the borders of the protein–protein interaction networks. This is particularly useful because the borders of complex systems are known (e.g. ref. 23) to be capable of substantially biasing the characterization of real-world systems, as the components at the borders tend to exhibit structural and dynamical features which are distinct from those observed in the rest of the network. The identification of the borders of protein–protein interaction networks is also interesting in itself, as the border proteins are potentially less important for the overall biological processes. Since nodes with high accessibility for a given path length tend to visit, on average, all reachable nodes at that length in the shortest period of time during a random walk, the most critically important proteins are likely to present the highest outward accessibility values, i.e. to belong to the interior of the network.
Fig. 1 Illustration of the concepts of outward accessibility. While in (a) the accessibility of the protein YFR021w (dark gray node) is equal to 0.0022 for h = 2 because of the equal transition probabilities, in (b) the accessibility of protein YLR309c is smaller (equal to 0.0013) because of the rather different transition probabilities for h = 2.
The present work reports an investigation of the accessibility between proteins in protein–protein interaction networks. By comparing the accessibility of ortholog proteins (also called interlogs) in the yeast and fruit fly, we observed that the accessibility tends to be higher between the proteins in the fly than in the yeast. At a higher topological scale, calculating the average accessibility in the networks of four species, namely Saccharomyces cerevisiae, Drosophila melanogaster, Caenorhabditis elegans and Homo sapiens, and comparing them in terms of z-scores, we verified that higher z-scores are obtained for the most evolved species. So, while the protein network of H. sapiens tends to present the highest accessibility values compared to its randomized counterpart, the yeast presents the smallest z-score values. We also determined the border of the yeast protein–protein interaction network and found that most proteins in the border tend to be viable (i.e. non-essential).
The present article starts by presenting, in introductory and didactic fashion, the concepts related to network dynamical analysis, with special attention placed on the concept of accessibility, and proceeds by reporting and discussing the results with respect to several species.
The characterization of networks can be performed in terms of structural and dynamical measurements.4,5 A simple topological measurement is given by the number of connections of a given node i, called the degree, which can be computed from the adjacency matrix elements a_ij as k_i = ∑_j a_ij. Despite the intrinsic simplicity of this structural feature, some studies have revealed that the degree is highly associated with protein functionality, such as lethality (e.g. ref. 16 and 19). Another important structural measurement related to protein functions is the betweenness centrality of a node,24 which is defined as
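$$B_u = \sum_{i \neq u \neq j} \frac{\sigma(i,u,j)}{\sigma(i,j)} \qquad (1)$$

where σ(i,u,j) is the number of shortest paths between nodes i and j that pass through node u, and σ(i,j) is the total number of shortest paths between i and j.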
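The two structural measurements can be illustrated with a minimal Python sketch. It assumes an undirected, unweighted network stored as a dictionary of neighbor sets and sums the betweenness over unordered node pairs; the function names and the toy star graph are hypothetical and are not taken from the protein data sets.

```python
from collections import deque

def degrees(adj):
    """Degree k_i = sum_j a_ij for a graph given as {node: set(neighbors)}."""
    return {node: len(neighbors) for node, neighbors in adj.items()}

def bfs_counts(adj, source):
    """Return (dist, sigma): hop distances and shortest-path counts from source."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:          # first time v is reached
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:  # u precedes v on a shortest path
                sigma[v] += sigma[u]
    return dist, sigma

def betweenness(adj):
    """B_u = sum over unordered pairs {i, j} of sigma(i,u,j) / sigma(i,j)."""
    info = {s: bfs_counts(adj, s) for s in adj}
    nodes = sorted(adj)
    B = {u: 0.0 for u in nodes}
    for a, i in enumerate(nodes):
        dist_i, sigma_i = info[i]
        for j in nodes[a + 1:]:
            if j not in dist_i:        # i and j are disconnected
                continue
            for u in nodes:
                if u == i or u == j or u not in dist_i:
                    continue
                dist_u, sigma_u = info[u]
                # u lies on a shortest i-j path iff the distances add up
                if j in dist_u and dist_i[u] + dist_u[j] == dist_i[j]:
                    B[u] += sigma_i[u] * sigma_u[j] / sigma_i[j]
    return B

# Hypothetical toy network: a hub (0) connected to three leaves.
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
print(degrees(star))
print(betweenness(star))
```

The brute-force pair loop is only meant for small examples; for networks with thousands of proteins a dedicated algorithm would be preferable.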
In addition to structural measurements, networks can be characterized in terms of dynamical features.4 The choice of a particular dynamics should be compatible with the functions to be observed in each specific case. For instance, it is suitable to represent breakdowns in power distribution grids in terms of cascade failure dynamics.25 In the case of proteins, the dynamics of interactions between them, such as in signal transduction,20 can be approximated in terms of self-avoiding random walk dynamics,19 which involves agents moving through the network without visiting any vertex or edge more than once. This type of dynamics allows more purposeful spreading of activations, avoiding the many backward, repeated activations which would otherwise be obtained by using traditional random walks. Although biological networks do involve a relatively small number of backward activations, for simplicity's sake in the present work we focus attention on purely self-avoiding random walks. It would be possible, though, to modify the simulations to allow any desired level of backward protein interaction.
A non-preferential self-avoiding random walk is obtained by having a moving agent start a walk at a specific node and then proceed to other nodes by taking the outgoing edges, except those leading to already visited nodes, with uniform probability. The probability of arriving at a node i after the moving agent started at node j, h steps distant from i, is given by the respective probability of transition, henceforth expressed as P_h(j,i). These transition probabilities can be progressively calculated by dividing the probability P_{h−1}(i,k), where k is connected to i and j, by the number of neighbors of k that have not been visited yet. For instance, in Fig. 2, the probability to go from node 1 to 10 is given by the probability to go from 1 to 7 divided by the number of nodes to which the walk can propagate without repeating any edge, i.e. P(1,10) = P(1,7)/2 = (1/4)/2 = 0.125. For the first move, we have P_1(i,j) = 1/k_i. The probability of arriving at a node i after having started h steps before from node j is given by the sum of the probabilities of all paths of length h between those two nodes.
Fig. 2 Examples showing the calculation of the probability of transition with respect to the continuous and dashed arrows. The probability to go from node 1 to 2, P(1,2), is given by one divided by the number of neighbors of node 1. Similarly, to go from 1 to 4, the probability is given by P(1,2) divided by 5, i.e. the number of nodes connected to node 2 excluding node 1, which has already been visited. The other probabilities are calculated similarly.
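The progressive calculation of the transition probabilities can be sketched in Python as follows. This is a minimal illustration, assuming that the first argument of P_h is the starting node, that walks which reach a dead end before h steps simply contribute nothing, and that the network is stored as a dictionary of neighbor sets. The function name and the toy graph are hypothetical; the graph is not the one shown in Fig. 2, although it reproduces the same arithmetic, (1/4)/2 = 0.125.

```python
def saw_transition_probabilities(adj, start, h):
    """Return {j: P_h(start, j)} for self-avoiding walks of exactly h steps.

    At every step the agent chooses uniformly among the neighbors that have
    not been visited yet; the probabilities of all self-avoiding paths that
    end at the same node are summed.
    """
    probs = {}

    def walk(node, visited, steps_left, prob):
        if steps_left == 0:
            probs[node] = probs.get(node, 0.0) + prob
            return
        options = [n for n in adj[node] if n not in visited]
        if not options:
            return  # dead end: this walk cannot reach length h
        for nxt in options:
            walk(nxt, visited | {nxt}, steps_left - 1, prob / len(options))

    walk(start, frozenset([start]), h, 1.0)
    return probs

# Hypothetical toy network: node 0 has four neighbors, node 1 has two further ones.
toy = {0: {1, 2, 3, 4}, 1: {0, 5, 6}, 2: {0}, 3: {0}, 4: {0}, 5: {1}, 6: {1}}
print(saw_transition_probabilities(toy, 0, 1))  # each neighbor of 0 gets 1/4
print(saw_transition_probabilities(toy, 0, 2))  # nodes 5 and 6 get (1/4)/2
```

Exhaustive enumeration grows quickly with h, so a sketch like this is practical only for the small values of h considered here.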
The probability of transition provides an important resource for quantification of random walk dynamics. Different measurements, such as activation, diversity and accessibility,22 can be adopted, depending on the type of phenomena to be investigated. The activation quantifies the extent of the random walks initiating at each node, and has recently been considered in protein lethality analysis.19 On the other hand, the accessibility between nodes can be characterized by an outward accessibility measurement, which quantifies the effectiveness of a node i in accessing all the other nodes in the network under specific dynamics (in our case self-avoiding random walks). The outward accessibility of node i after h steps is defined as
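$$OA_h(i) = \frac{\exp\left(E_h(\Omega,i)\right)}{N-1} \qquad (2)$$

where N is the number of nodes in the network and E_h(Ω,i) is the diversity entropy of node i, given by

$$E_h(\Omega,i) = -\sum_{j} P_h(i,j)\,\log P_h(i,j) \qquad (3)$$

in which the sum runs over the nodes j that can be reached from i after h steps, so that terms with P_h(i,j) = 0 do not contribute.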
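A minimal Python sketch of the outward accessibility is given below. It assumes the entropy-based form commonly used for this measurement (the exponential of the diversity entropy of the h-step transition probabilities, divided by N − 1), together with two further assumptions that are not stated in the text: probabilities are not renormalized when some walks reach a dead end, and a node with no walk of length h is assigned accessibility zero. All function names and the toy network are hypothetical.

```python
import math

def saw_probs(adj, start, h):
    """{j: P_h(start, j)} for self-avoiding walks of exactly h steps."""
    probs = {}

    def walk(node, visited, steps_left, prob):
        if steps_left == 0:
            probs[node] = probs.get(node, 0.0) + prob
            return
        options = [n for n in adj[node] if n not in visited]
        for nxt in options:  # no options means a dead end; the walk is dropped
            walk(nxt, visited | {nxt}, steps_left - 1, prob / len(options))

    walk(start, frozenset([start]), h, 1.0)
    return probs

def outward_accessibility(adj, node, h):
    """OA_h(node) = exp(E_h) / (N - 1), with E_h the entropy of P_h(node, .)."""
    probs = saw_probs(adj, node, h)
    if not probs:
        return 0.0  # assumption: no reachable node means zero accessibility
    entropy = -sum(p * math.log(p) for p in probs.values() if p > 0)
    return math.exp(entropy) / (len(adj) - 1)

# Hypothetical toy network: a hub (0) with four leaves.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
print(outward_accessibility(star, 0, 1))  # hub reaches all four leaves equally
print(outward_accessibility(star, 1, 1))  # a leaf can only reach the hub
```

With equal transition probabilities the exponential of the entropy equals the number of reachable nodes, which is why balanced access yields higher accessibility than the same number of targets reached with unequal probabilities.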
Networks can be compared globally in terms of the outward accessibility of their nodes. Since the networks can present different numbers of nodes and edges, it is fundamental to resort to analyses which do not depend on network scale. A possible approach to perform this task is to compare the real networks with their randomized counterparts. To describe the deviations of the observed values from the random expectation, we can consider the z-score, which is calculated as28
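$$Z = \frac{X_{\mathrm{real}} - \langle X_{\mathrm{rand}} \rangle}{\sigma_{\mathrm{rand}}} \qquad (4)$$

where X_real is the average outward accessibility of the real network, and ⟨X_rand⟩ and σ_rand are the mean and standard deviation of the same quantity over the randomized counterparts.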
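As a small illustration, the z-score can be computed in Python from the value measured in the real network and the values measured in its randomized versions. The sketch assumes the usual form (real value minus the random mean, divided by the random standard deviation) and uses the sample standard deviation; the function name and the numbers are hypothetical.

```python
from statistics import mean, stdev

def z_score(real_value, random_values):
    """(real - mean of random ensemble) / sample standard deviation of ensemble."""
    return (real_value - mean(random_values)) / stdev(random_values)

# Hypothetical values: one real measurement and three randomizations.
print(z_score(5.0, [1.0, 2.0, 3.0]))  # 3.0
```

A positive result indicates that the real network has a larger average accessibility than its randomized counterparts, and a negative result indicates the opposite.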
Another important feature of the outward accessibility is related to its ability to detect the borders of networks.23 More specifically, peripheral nodes have been found to present low accessibility values since they do not have many options for the random walk other than to access the internal nodes of the network. In contrast, non-border nodes tend to have more effective and balanced access to most of the network, resulting in higher accessibility values.
After calculating the accessibility of all the 144 ortholog proteins, we obtained the cumulative distributions of accessibility depicted in Fig. 3, for h = 2,…,5. It is clear from these results that the accessibility tends to be larger for the proteins in the fruit fly network than for the ortholog proteins in the yeast network. Fig. 4 illustrates this trend by showing the local structure around a pair of ortholog proteins (in red) identified in these species. Since the fly presents a more complex biological network than the yeast, it is possible that more effective and balanced access to most of the network is required for its proper operation. Thus, the dynamical outward accessibility measurement can be used to obtain insight into the level of modifications undergone by protein networks along the evolutionary process.
Fig. 3 Comparison of the cumulative accessibility distribution between 114 putative identifiable orthologs in the yeast and fruit fly.
Fig. 4 Example of the variation in the structure of an interlog present in the yeast (a) and the fruit fly (b), indicated by the red nodes. The nodes distant one edge from the interlogs are shown in blue and those distant two edges, in yellow. While in (a) the protein YDR142C presents OA_2 = 0.009, in (b) the ortholog protein FBgn0035922 presents OA_2 = 0.059.
In order to compare the accessibility for ortholog and non-ortholog proteins, we obtained the distributions presented in Fig. 5. As observed before, the ortholog proteins in the fly tend to have higher outward accessibility than the ortholog proteins in the yeast. In addition, ortholog proteins present smaller accessibility than non-ortholog proteins. This tendency was observed for all values of h. Therefore, proteins with small accessibility values, which tend to be at the border of networks,23 seem to be more conserved than those at the centre. This effect could be related to the small influence that proteins at the border should suffer from other proteins, whereas proteins at the centre of networks tend to have many connections and paths to other proteins. However, although this finding is potentially interesting, it should be borne in mind that our results may change for more complete databases of ortholog proteins. Note that the fraction of identified proteins in our data is small, mainly due to the relatively small overlap between the yeast and fly databases.
In order to verify the variation of accessibility between different species, we analysed the following databases: (i) S. cerevisiae: composed of 2708 proteins and 7123 protein–protein interactions;13 (ii) D. melanogaster: formed by 1345 nodes and 3172 edges;15 (iii) C. elegans: composed of 2528 nodes and 3865 interactions;14 and (iv) Homo sapiens: with 1549 nodes and 2755 edges.11 Because each of these networks has a different number of nodes and edges, we performed the comparison in terms of the respective z-scores.28 We performed 100 randomizations of the networks and calculated the mean and standard deviations. z-Scores larger than zero indicate that the outward accessibility in the real network is larger than that observed in the respective random counterparts. On the other hand, z-scores smaller than zero imply that the network exhibits accessibility smaller than that observed in the random counterparts. Calculating the average accessibility for these networks and for their random counterparts, we obtained the z-scores presented in Fig. 6 as a function of h. Note that the order of the curves partially reflects the complexity of the organisms for h = 6, since H. sapiens presents the highest z-score and S. cerevisiae the smallest. The increase in the accessibility through evolution may be a consequence of the increasing biological complexity required by the respective species. The abrupt variation in the z-scores at h = 3, defining valleys except for S. cerevisiae, could be a consequence of the fact that the highly evolved species have more complex signalling circuitry, which results in higher accessibility at larger numbers of steps.
Fig. 6 z-Score values calculated for S. cerevisiae (black squares), D. melanogaster (red circles), C. elegans (blue diamonds) and H. sapiens (yellow squares).
Since protein interaction networks are known to present sampling bias,33 we performed a perturbation analysis in order to verify the stability of the accessibility measurement. To do so, we removed 10% of the proteins at random and compared the resulting accessibility with that obtained for the original networks. We verified that the variations are smaller than 10% for h = 2 and smaller than 5% for h > 2. Therefore, the sampling bias does not appear to substantially influence our results.
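The perturbation step can be sketched in Python as below. This is only an illustration of random node removal and of a relative-variation measure, under the assumption that removed proteins are chosen uniformly at random and that the variation is taken relative to the original value; the function names, the seed parameter and the ring network are hypothetical.

```python
import random

def remove_random_nodes(adj, fraction, seed=None):
    """Return a copy of the network with a random fraction of nodes removed."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    n_remove = int(round(fraction * len(nodes)))
    removed = set(rng.sample(nodes, n_remove))
    # Keep the surviving nodes and drop edges that pointed to removed nodes.
    return {u: {v for v in nbrs if v not in removed}
            for u, nbrs in adj.items() if u not in removed}

def relative_change(original, perturbed):
    """Relative variation of a measurement after the perturbation."""
    return abs(perturbed - original) / original

# Hypothetical toy network: a ring of 20 nodes.
ring = {i: {(i - 1) % 20, (i + 1) % 20} for i in range(20)}
perturbed = remove_random_nodes(ring, 0.10, seed=1)
print(len(ring), len(perturbed))
```

The accessibility would then be recomputed on the perturbed network and compared, node by node or on average, with the values from the original network.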
It is also interesting to analyse the overall distribution of protein accessibility in the interaction networks. In order to do so, we represented each network node i as a vector whose elements correspond to the outward accessibility at each distance h, i.e. {OA_1(i),…,OA_6(i)}. These vectors were then projected into the two-dimensional space by considering principal component analysis (PCA), which optimally reduces the dimensionality while completely removing the correlations between the data.34,35 We checked the percentage of dispersion along the first two axes by taking into account the coefficient d = (λ_1 + λ_2)/∑_i λ_i, where λ_i are the eigenvalues of the covariance matrix sorted in decreasing order. This coefficient allows quantification of the dispersion of the outward accessibility for each species along the first two axes. We obtained the following coefficients: (i) yeast, d = 0.87; (ii) fly, d = 0.71; (iii) worm, d = 0.65; and (iv) human, d = 0.73. Therefore, the simplest organism presents the largest dispersion.
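For concreteness, the dispersion coefficient can be computed from the PCA eigenvalues with a few lines of Python. The sketch assumes that the eigenvalues of the covariance matrix have already been obtained by some PCA routine; the function name and the eigenvalues are hypothetical and do not correspond to any of the four species.

```python
def dispersion_coefficient(eigenvalues, k=2):
    """Fraction of the total variance captured by the k largest eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)
    return sum(ordered[:k]) / sum(ordered)

# Hypothetical eigenvalues of a covariance matrix.
print(dispersion_coefficient([4.0, 3.0, 2.0, 1.0]))
```

A value of d close to 1 means that the two-dimensional projection preserves almost all of the variation present in the six accessibility values.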
The biological importance of a protein can be associated with its relative position in the respective network. For instance, more central proteins tend to propagate their influence more effectively along the network through shorter and more numerous paths than the peripheral proteins. So, it could be expected that the more important a protein is, the smaller the probability that it will be at the border of the network. Since protein–protein interaction networks tend to display properties which are typical of geographical networks (e.g. ref. 5 and 36), accessibility measurements are a particularly suitable approach to border detection in these networks.23,37 In order to verify the relationship between the accessibility and the degree or betweenness centrality, which can also be used for border definition, we calculated the Pearson correlation coefficient between such measurements for the adopted networks. Fig. 7 presents the Pearson correlations in terms of h, showing that the correlations tend to decrease substantially as the distance h is increased. This means that the accessibility can provide information complementary to that supplied by the degree or betweenness centrality.
Fig. 7 Pearson correlation coefficients between the accessibility and degree k, given by the number of connections of a node (black circles), and between the accessibility and betweenness centrality B, given by the fraction of shortest paths passing by a node (gray circles), for the yeast protein interaction network.
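The Pearson correlation coefficient between two node measurements, such as the accessibility and the degree, can be computed as in the following Python sketch. The function name and the example values are hypothetical; in practice the two lists would hold the measurements of the same nodes in the same order.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

# Hypothetical degree and accessibility values for five nodes.
degree = [1, 2, 3, 4, 5]
accessibility = [0.01, 0.02, 0.02, 0.05, 0.04]
print(pearson(degree, accessibility))
```

Values close to 1 would indicate that the accessibility carries little information beyond the degree, while lower values indicate complementary information.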
We obtained the borders of the adopted protein–protein interaction networks by applying the accessibility concepts and methods described by Travençolo et al.23 Thus, we determined that the border of the yeast S. cerevisiae network corresponds to those proteins with the smallest accessibility for each value of h. Note that for each value of h, we obtained a different threshold T_h, i.e. T_2 = 0.058, T_3 = 0.010, T_4 = 0.0039, T_5 = 0.002, and T_6 = 0.001. We verified that the number of border proteins corresponds to about 5% of the total number of proteins for all distances h.
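Once a threshold has been chosen, selecting the border reduces to a simple filter, sketched below in Python. This is a simplification that only illustrates the thresholding step, not the full border-detection method of ref. 23; it assumes that a node belongs to the border when its accessibility is strictly below the threshold. The function name, the protein labels and the accessibility values are hypothetical; only the threshold 0.058 is taken from the text.

```python
def border_nodes(accessibility, threshold):
    """Nodes whose outward accessibility is strictly below the threshold."""
    return {node for node, oa in accessibility.items() if oa < threshold}

# Hypothetical OA_2 values for four proteins, with the threshold T_2 = 0.058.
oa_2 = {"P1": 0.120, "P2": 0.030, "P3": 0.058, "P4": 0.001}
print(sorted(border_nodes(oa_2, 0.058)))  # ['P2', 'P4']
```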
We used the S. cerevisiae protein–protein interaction network from Krogan et al.,13 which is a highly reliable protein interaction map obtained by tandem affinity purification. We considered the core data set of the database,13 which comprises 2708 proteins and 7123 protein–protein interactions. Among these proteins, 648 are known to be lethal, 1918 are viable and 142 are unknown proteins. The protein lethality and viability were identified by using data from the Munich Information Center for Protein Sequences (MIPS).38 We calculated the fraction of lethal and viable proteins at the border for h = 1,…,6, as illustrated in Fig. 8. The percentage of lethal proteins at the border is approximately conserved, varying from 13% (for h = 1) to 19% (for h = 4). Note that the percentage of lethal proteins in the whole network is equal to 24%, i.e. the probability of finding a lethal protein at the border is almost half that of finding it in the whole network.
Fig. 8 The fraction of lethal (black circles) and viable (gray circles) proteins at the border of the network in terms of the distance h. Note that most viable proteins tend to belong to the border of the network whatever the value of h.
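The comparison between the border and the whole network can be illustrated with a short Python sketch. It assumes that the fraction is taken over all proteins in the given set, including those of unknown lethality; the function name and the protein labels are hypothetical, while the counts 648 and 2708 are those reported for the yeast network.

```python
def lethal_fraction(proteins, lethal):
    """Fraction of the given proteins that appear in the set of lethal proteins."""
    proteins = set(proteins)
    return len(proteins & set(lethal)) / len(proteins)

# Whole-network baseline from the reported counts: 648 lethal out of 2708 proteins.
print(round(648 / 2708, 2))  # 0.24

# Hypothetical border of five proteins, one of which is lethal.
border = {"P1", "P2", "P3", "P4", "P5"}
lethal = {"P3", "P9"}
print(lethal_fraction(border, lethal))  # 0.2
```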
In the current work, we investigated protein–protein interaction networks in terms of the outward accessibility exhibited by each node for several distance values h. Three related investigations were reported. Initially, we determined the interlog proteins in the fruit fly and yeast and verified that proteins in the fly tend to present higher accessibility values than those in the yeast. This result suggests that the conserved proteins are more likely to be internal to the fly network than to the yeast network. Next, by comparing four different species in terms of z-scores, which allow comparison between networks with different sizes and numbers of connections, we found that the z-scores tend to partially reproduce natural evolution, with the protein network of H. sapiens presenting the highest z-score while that of S. cerevisiae yielded the lowest one. By projecting the outward accessibilities for several values of h by principal component analysis, we verified that the obtained distributions tend to be similar for all species, suggesting a universal feature shared by many species. Finally, we investigated the essentiality of proteins at the border of the yeast protein interaction network. That border was found to be formed by the proteins presenting the lowest accessibility. We verified that the border proteins tend to be viable, with the probability of finding a lethal protein at the border being almost half that of the whole network.
Further investigations can be performed by considering other dynamical measurements. For instance, it is possible to consider random walks with different propagation probabilities depending on the functionality of each protein or group of proteins. In addition, it would be particularly interesting to repeat the described accessibility analysis for other biological networks, such as transcription-regulatory and metabolic networks.
This journal is © The Royal Society of Chemistry 2009