Gipsi Lima-Mendez* and Jacques van Helden*
Bioinformatique des Génomes et des Réseaux-BiGRe, Université Libre de Bruxelles, Campus Plaine, CP 263, Boulevard du Triomphe, B-1050 Bruxelles, Belgium. E-mail: gipsi@bigre.ulb.ac.be; jvhelden@ulb.ac.be
First published on 2nd October 2009
For almost 10 years, topological analysis of different large-scale biological networks (metabolic reactions, protein interactions, transcriptional regulation) has been highlighting some recurrent properties: power law distribution of degree, scale-freeness, small world, which have been proposed to confer functional advantages such as robustness to environmental changes and tolerance to random mutations. Stochastic generative models inspired different scenarios to explain the growth of interaction networks during evolution. The power law and the associated properties appeared so ubiquitous in complex networks that they were qualified as “universal laws”. However, these properties are no longer observed when the data are subjected to statistical tests: in most cases, the data do not fit the expected theoretical models, and the cases of good fitting merely result from sampling artefacts or improper data representation. The field of network biology seems to be founded on a series of myths, i.e. widely believed but false ideas. The weaknesses of these foundations should however not be considered as a failure for the entire domain. Network analysis provides a powerful frame for understanding the function and evolution of biological processes, provided it is brought to an appropriate level of description, by focussing on smaller functional modules and establishing the link between their topological properties and their dynamical behaviour.
Gipsi Lima-Mendez earned her BS degree in biochemistry at the University of Havana in Cuba. During those years she fell in love with the evolution of biological systems at the molecular level. With the advent of genomics, she decided bioinformatics would be her ‘lab tools’ to address general questions in biology. She gained her PhD in Bioinformatics at the Université Libre de Bruxelles in Belgium. As a graduate student, she joined the lab of Professor Jacques van Helden and dedicated her doctoral research to the study of bacteriophage evolution. Bacteriophages (phages) are genetic mosaics resulting from homologous and illegitimate recombination with other phages and with the bacterial genomes. Because classical phylogenetic approaches do not apply to these systems, she used graph-analysis to model the evolutionary relationships between bacteriophages and designed a reticulate system for their classification. She also developed an algorithm to predict phage sequences in bacterial genomes (prophages). Currently, she is interested in the impact of bacteriophages on bacterial genome evolution, at the protein and regulatory levels.
Jacques van Helden is currently Chargé de cours at the Université Libre de Bruxelles (Belgium) and head of the group “Genome and Network Bioinformatics”. He trained as a bioengineer with a PhD in developmental genetics. The main research activities of his group consist in implementing, evaluating and applying bioinformatics approaches to analyze regulatory sequences and molecular networks. They have developed software tools to detect cis-regulatory elements in genomic sequences (Regulatory Sequence Analysis Tools, http://rsat.ulb.ac.be/rsat/), to infer metabolic pathways using weighted path finding in metabolic networks, and to analyze molecular interaction networks (Network Analysis Tools, http://rsat.ulb.ac.be/neat/).
Box 1. Network topology semantics
Node degree: number of edges linked to a node. The count can be restricted to incoming edges (in-degree), outgoing edges (out-degree) or include both (total degree).
Hub: highly connected node.
Distance: the distance between two nodes is the number of edges in the shortest path between them.
Node eccentricity: length of the longest of all shortest paths between a given node and any other node.^{67}
Characteristic path length: number of edges of the shortest paths between two nodes averaged over all pairs of nodes.^{1}
Network diameter: length of the longest among all shortest paths between node pairs. This is equal to the maximal eccentricity over all nodes of the network. Note that the term “diameter” has mistakenly been used to denote the average length of the shortest paths between all the pairs of nodes,^{5} i.e. the characteristic path length.
Network radius: minimum value of eccentricity over all nodes.^{67}
Power law: a polynomial relationship between two quantities, of the form y = ax^{γ} (the function plotted in Fig. 1A and B).
Poisson distribution: discrete distribution defined by a single parameter λ (lambda), indicating its mean value.
Erdös–Renyi (ER): stochastic model generating graphs where each pair of nodes has the same probability of being linked by an edge. The degree distribution of ER graphs typically follows a Poisson distribution, as exemplified in Fig. 1C–F.
Scale-freeness: a probability function p(x) of a variable x is scale-free if, for any value of b, it satisfies the condition:
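In its usual formulation (assumed here to be the one intended, since it is the standard definition of scale invariance), the condition states that rescaling x by a factor b only multiplies the function by a quantity g(b) that does not depend on x. The second line is a short worked check, under the same assumption, that a power law p(x) = ax^{γ} satisfies it:

```latex
\begin{align}
p(bx) &= g(b)\,p(x) \\
p(x) = a\,x^{\gamma} \;\Rightarrow\; p(bx) &= a\,(bx)^{\gamma} = b^{\gamma}\,p(x),
\quad \text{so } g(b) = b^{\gamma}
\end{align}
```

No such factorisation exists for a Poisson distribution, whose shape changes with the scale at which it is observed.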
Clustering coefficient: the clustering coefficient of a node is the fraction of connections among all possible connections between its neighbours. In a non-directed graph without self-loops, if a node has N neighbours, the number of possible connections between them is N(N − 1)/2.
Small world network: the term, coined by Watts and Strogatz, refers to networks that are highly clustered (high average clustering coefficient), like regular lattices, yet with small average shortest path length, like random networks.^{1} The shortest distance between two vertices increases logarithmically with the number of nodes n (as for random graphs).^{69} Humphries defines a parameter S to measure the small-worldness of a network.^{70}
Interaction density and interaction density gradient: these measures were introduced recently^{19} to compare different proposed models of PPI network growth. Depending on the attachment rule of the model, a different pattern of connections will be observed between groups of nodes of different ages. For example, under the preferential attachment model, new nodes are more likely to connect to older nodes, since the latter have higher connectivity. The interaction density D_{m,n} between two (age) groups m and n is the ratio of observed interconnecting edges between the groups (l_{m,n}) out of all possible edges between them (E_{m,n}), normalized according to the total number of edges (L) and nodes (N) in the network.
Network motifs: “Patterns of interconnections that recur in many different parts of a network at frequencies much higher than those found in randomized networks”^{41}
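The distance-based and clustering definitions of Box 1 can be made concrete on a small example. The following is a minimal sketch, assuming a connected, non-directed graph without self-loops stored as a dictionary of neighbour sets; the function names and the toy graph are hypothetical and serve only as an illustration.

```python
from collections import deque

def bfs_distances(graph, source):
    """Distance (number of edges on the shortest path) from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def eccentricity(graph, node):
    """Longest of all shortest paths starting from node."""
    return max(bfs_distances(graph, node).values())

def diameter(graph):
    """Maximal eccentricity over all nodes."""
    return max(eccentricity(graph, n) for n in graph)

def radius(graph):
    """Minimal eccentricity over all nodes."""
    return min(eccentricity(graph, n) for n in graph)

def characteristic_path_length(graph):
    """Shortest-path length averaged over all ordered pairs of distinct nodes."""
    total, pairs = 0, 0
    for source in graph:
        for target, d in bfs_distances(graph, source).items():
            if target != source:
                total += d
                pairs += 1
    return total / pairs

def clustering_coefficient(graph, node):
    """Observed links between the neighbours of node, out of the N(N - 1)/2 possible ones."""
    neighbours = graph[node]
    n = len(neighbours)
    if n < 2:
        return 0.0  # convention chosen for this sketch when no pair of neighbours exists
    links = sum(1 for u in neighbours for v in neighbours if u < v and v in graph[u])
    return links / (n * (n - 1) / 2)

# Hypothetical toy graph: a triangle (a, b, c) with a tail (a - d - e).
toy = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a", "e"},
    "e": {"d"},
}

if __name__ == "__main__":
    print("diameter:", diameter(toy))
    print("radius:", radius(toy))
    print("characteristic path length:", characteristic_path_length(toy))
    print("clustering coefficient of a:", clustering_coefficient(toy, "a"))
```

On this toy graph the diameter (3) and the characteristic path length (1.7) clearly differ, which illustrates why the two terms should not be used interchangeably.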
Surprisingly, most initial claims about topological properties were proposed on the simple basis of graphical representations, but were contradicted as soon as the models were challenged by actual statistical tests.^{13,14} Would the “universal laws” merely be myths according to the sensu lato definition, i.e. “widely held but false beliefs”?^{15}
Furthermore, several hypotheses about the functional and evolutionary implications of those network properties are based on analyses conducted at a high abstraction level, but their relevance rapidly fades as soon as the nodes (genes, proteins, metabolites) and their interactions are inspected in more detail. Despite their elegance, the evolutionary scenarios derived by transposing theoretical generative models onto biological networks are reminiscent of the sensu stricto definition of myth, i.e. “a traditional story, esp. one concerning the early history of people or explaining some natural or social phenomenon”.^{15}
Despite the lack of consistency between theoretical models and data, new papers are steadily published, suffering from the same flaws, in apparent ignorance of the serious concerns raised by several authors.^{13,16–19} To justify the observed discrepancies between theoretical models and biological networks, some authors invoke the incompleteness of network annotations. When the “universal laws” are contradicted by the facts, the first reflex is to question the quality of the data rather than the validity of the models. We are thus in the typical situation of a dogma: “a principle or a set of principles laid down by an authority as incontrovertibly true”.^{15}
In this article, we review the main concepts that have emerged from topological analysis of biological networks, and discuss the controversial issues about their statistical validity, as well as their functional and evolutionary interpretation.
The illusion of the power law partly came from several representation issues. Firstly, in seminal articles,^{5,12} the power law is illustrated by plotting the degrees (k) and their probabilities P(k) on logarithmic scales (inset of Fig. 1B), whereas the Poisson is illustrated with linear scale (inset of Fig. 1C). This way of contrasting two models is obviously misleading: alternative distributions should be displayed consistently with the same scale, either linear (Fig. 1A versus C or E) or logarithmic (Fig. 1B versus D or F).
Fig. 1 Power law versus Poisson distributions. Random simulations based on various models (power law, Poisson) fitted with their respective theoretical distributions. A, B: power law function y = ax^{γ} with γ = −2.5 and a = 1, displayed with linear (A) and logarithmic (B) scales, respectively. C, D: Poisson fit on the degree distribution of a random ER graph with an average of 20 connections per node with linear (C) and logarithmic (D) scales. E, F: Poisson fit on a random ER graph with an average of 1 connection per node displayed with linear (E) and logarithmic (F) scales, respectively. On each graph, the dotted line represents the number N(k) of nodes having degree k, and the plain line the inverse cumulative distribution, i.e. number of nodes (iCumN) with degree greater than or equal to k. Insets B and C: in the seminal paper on the topology of metabolic networks,^{5} the power law was illustrated with logarithmic scales, whereas the Poisson law was depicted with linear scales, and with a high mean value.
Secondly, the illustration of the ER model is usually based on a Poisson distribution with a high expected mean (λ parameter), irrespective of the mean degree of the networks to be analyzed. However, this parameter has an important effect on the shape of the distribution: symmetrical bell shape for high λ values (Fig. 1C and D), but strongly asymmetrical for lower values (Fig. 1E and F). When contrasting the two a priori models, the λ parameter of the Poisson should thus be adapted to the mean degree of the observed network.
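The effect of λ on the shape of the Poisson distribution can be checked numerically. The sketch below is only an illustration: it evaluates a Poisson law for the two mean degrees used in Fig. 1 (λ = 1 and λ = 20) and the power law y = ax^{γ} with the parameters of Fig. 1A and B, all on the same logarithmic scale. The function names are hypothetical, and the values of k printed at the end are arbitrary.

```python
import math

def poisson_pmf(k, lam):
    """Probability of observing degree k under a Poisson law of mean lam."""
    # Computed in log space to stay numerically stable for large k.
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def power_law(k, a=1.0, gamma=-2.5):
    """Power law y = a * k**gamma, with the parameters used in Fig. 1A and B."""
    return a * k ** gamma

def poisson_skewness(lam):
    """Skewness of a Poisson law, 1 / sqrt(lam): the asymmetry grows as lam shrinks."""
    return 1.0 / math.sqrt(lam)

if __name__ == "__main__":
    for lam in (1, 20):
        print(f"lambda = {lam}, skewness = {poisson_skewness(lam):.3f}")
        for k in (1, 2, 5, 10, 20, 40):
            print(f"  k = {k:2d}"
                  f"  log10 Poisson = {math.log10(poisson_pmf(k, lam)):8.3f}"
                  f"  log10 power law = {math.log10(power_law(k)):8.3f}")
```

With λ = 1 the probabilities decrease monotonically from k = 1 onwards, whereas with λ = 20 they rise to a maximum around k = 20 before decreasing, which is the contrast between panels E and F and panels C and D.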
Another representation issue is that the degree values are sometimes regrouped by class intervals,^{5} thereby enforcing the apparent linearity on the log–log plot (inset of Fig. 2A), whereas displaying the raw distribution highlights the wide dispersion of the right tails (Fig. 2A), denoting the fact that the hubs are statistical outliers of the alleged power law. Indeed, when a straight line is fitted on the non-binned distribution (Fig. 2A, dotted line), the fit appears very poor, because its slope is strongly affected by the hubs. Strikingly, a better fit is obtained when we discard the 30 most connected nodes from the network (Fig. 2A, dashed line). The same trend is observed when the power law is fitted onto the inverse cumulative distribution rather than on the density function (Fig. 2B). Considering that the power law property of the metabolic networks has always been attributed to the presence of “hub compounds”, it is somewhat paradoxical that the fit looks valid only if those hubs are removed from the graph.
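The influence of a few hubs on the fitted slope can be reproduced on artificial data. The following sketch is not the procedure used for Fig. 2: it assumes an ordinary least-squares regression of log10 N(k) on log10 k without binning, and the degree list is entirely hypothetical (a core following roughly N(k) = 1000/k² plus four isolated hubs).

```python
import math
from collections import Counter

def loglog_fit(degrees):
    """Least-squares slope and intercept of log10 N(k) against log10 k, without binning."""
    counts = Counter(d for d in degrees if d > 0)
    xs = [math.log10(k) for k in counts]
    ys = [math.log10(n) for n in counts.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def without_top_hubs(degrees, n_hubs):
    """Degree list after discarding the n_hubs most connected nodes."""
    ordered = sorted(degrees)
    return ordered[:len(ordered) - n_hubs]

# Hypothetical data: the core of the distribution, then four hubs observed once each.
core = [k for k in range(1, 11) for _ in range(round(1000 / k ** 2))]
hubs = [200, 350, 500, 800]
degrees = core + hubs

slope_all, _ = loglog_fit(degrees)
slope_core, _ = loglog_fit(without_top_hubs(degrees, len(hubs)))

if __name__ == "__main__":
    print(f"slope with hubs:    {slope_all:.2f}")
    print(f"slope without hubs: {slope_core:.2f}")
```

In this toy case the slope estimated on the core alone is close to −2, whereas the four hubs, each counted once at very high degree, pull the regression line towards a much shallower slope.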
Fig. 2 Fitting of power law on the degree distributions of various biological networks. The abscissa represents node degrees (k), the ordinate the frequency of nodes having that degree (P(k)). Squares: density function; plain curve: inverted cumulative distribution function (iCDF); dotted: power law fitted onto the data; dashed: Poisson distribution fitted onto the data. A: metabolic network from the KEGG database, where nodes correspond to compounds, and their degree is the number of reactions in which they participate. Theoretical distributions fitted onto the density function. Note the discontinuity between the core of the distribution and its right tail, appearing as a bump on iCDF. Inset A: reproduction of the figure published in support of the power law character of metabolic networks.^{5} Note that regrouping degrees into classes (“binning”) masks the discontinuity between the core of the distribution and its right tail. B: the same metabolic network with theoretical distributions fitted onto the iCDF. C, D: distributions of outgoing (C) and incoming (D) degrees in the regulatory network built from RegulonDB. Outgoing degrees (C) indicate the number of target genes per transcription factor. Incoming degrees (D) indicate the number of regulators per regulated gene. E, F: protein interaction networks from the high-throughput experiment of Gavin et al. (E)^{73} and Krogan et al. (F),^{74} respectively.
The fitting of a power law onto other types of networks is even less convincing. The analysis of the regulatory network of Escherichia coli reflects the presence of many specific transcription factors (having between 1 and 15 target genes) and some global factors involved in the regulation of many genes (Fig. 2C). The incoming degree distribution (Fig. 2D), which indicates the number of regulating factors per gene, shows an asymmetric bell-shaped distribution (squares) which is reminiscent of a Poisson law. However, it fits the Poisson distribution poorly (dashes-dots). Protein interaction networks obtained from high-throughput experiments display a curved shape (Fig. 2E and F), which can hardly be confused with the straight line expected from a power law.
In summary, careful analyses fail to confirm the power law distribution of degrees in biological networks. Moreover, the variability between the degree distributions observed in different networks (metabolic, regulatory, protein interactions) rules out the hope of discovering any universal law that would describe them all.
A first remark is that scale-freeness does not apply to a network as a whole, but to some of its properties. In fact, to speak about “scale-free networks” is completely misleading since it would imply that a subset of the network would have the same structure as the whole network (fractal images are the typical illustration of this concept).
It is thus important to specify which property of a network is supposed to be scale-free, and this is frequently not clear in the papers speaking about scale-freeness. The topological property that is generally claimed to be scale-free is the power law character of the degree distribution, and, in some articles, the scaling exponent (which corresponds to the slope of the regression line on the log–log graph). The scale-freeness of the power law has been tested by selecting random sub-networks from artificial networks whose degree distribution follows a power law. It has been shown that the degree distribution of such sub-networks retains the power law shape, but not the scaling exponent.^{21}
Han and co-workers performed an extensive study of the effect of sampling on artificial networks generated with various degree distributions: Poisson (Erdös–Renyi model), exponential, power law, or truncated normal. Interestingly, they showed that sub-networks tend to exhibit a power law distribution, irrespective of the topological property of the larger network they were sampled from. They conclude that the apparent power law property observed in some biological networks might result from a sampling artefact, rather than reflecting some property of the complete network. The distribution of the complete network can thus not be estimated from the distribution of sub-networks, which prevents general conclusions from being drawn about parameters estimated from incomplete datasets. This confirms that the concepts of “power law” and “scale-freeness” should not be considered as synonymous.
Path number | Path |
---|---|
1 | D-Glucose → R04094 → H_{2}O → R02682 → ethanol |
2 | D-Glucose → R00300 → NADH → R00754 → ethanol |
3 | D-Glucose → R00534 → H_{2}O → R02359 → ethanol |
4 | D-Glucose → R02558 → H_{2}O → R02682 → ethanol |
5 | D-Glucose → R00304 → H_{2}O → R02359 → ethanol |
6 | D-Glucose → R02558 → H_{2}O → R02359 → ethanol |
7 | D-Glucose → R05142 → H_{2}O → R02682 → ethanol |
8 | D-Glucose → R00534 → H_{2}O → R02682 → ethanol |
9 | D-Glucose → R01444 → H_{2}O → R02682 → ethanol |
10 | D-Glucose → R04006 → H_{2}O → R02359 → ethanol |
In another study,^{4} the obvious trap of the pool metabolites was avoided by suppressing a selection of “hub compounds” from the network. Path finding in such a filtered graph returns slightly more relevant pathways, but only when they comprise a small number of steps.^{22,23}
Alternative methods were designed to increase the relevance of the pathways inferred by path finding, by tracing the transfers of atomic groups between reactions,^{18} by weighting the graphs in order to penalize highly connected compounds,^{23} or by restricting path finding to valid reactant pairs.^{24,25} When path finding is adapted in such ways to better correspond to biochemical pathways, distances between compounds show a significant increase, indicating that the metabolic world is not so small.^{18}
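The effect of penalizing highly connected compounds can be illustrated on a toy network. The sketch below is not the published weighting scheme: it assumes, for illustration only, that each compound receives a weight equal to its degree, that each reaction receives a weight of 1, and that the cost of a path is the sum of the weights of the nodes it enters. The network itself, including the reaction names (rA, r1, ...) and the intermediates M1 to M3, is hypothetical.

```python
import heapq

def lightest_path(graph, weights, source, target):
    """Dijkstra search; the cost of a path is the sum of the weights of the nodes it enters."""
    best = {source: 0}
    previous = {}
    heap = [(0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == target:
            break
        if cost > best[node]:
            continue  # outdated heap entry
        for neighbour in graph[node]:
            new_cost = cost + weights[neighbour]
            if new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                previous[neighbour] = node
                heapq.heappush(heap, (new_cost, neighbour))
    if target not in best:
        return None
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return path[::-1]

# Hypothetical network: a shortcut through a pool metabolite and a longer route
# through three intermediates. H2O also takes part in eight unrelated reactions.
edges = [("glucose", "rA"), ("rA", "H2O"), ("H2O", "rB"), ("rB", "ethanol"),
         ("glucose", "r1"), ("r1", "M1"), ("M1", "r2"), ("r2", "M2"),
         ("M2", "r3"), ("r3", "M3"), ("M3", "r4"), ("r4", "ethanol")]
edges += [("H2O", f"rH{i}") for i in range(8)]

graph = {}
for u, v in edges:
    graph.setdefault(u, []).append(v)
    graph.setdefault(v, []).append(u)

compounds = {"glucose", "ethanol", "H2O", "M1", "M2", "M3"}
unit_weights = {node: 1 for node in graph}
degree_weights = {node: len(graph[node]) if node in compounds else 1 for node in graph}

if __name__ == "__main__":
    print("unweighted:", " -> ".join(lightest_path(graph, unit_weights, "glucose", "ethanol")))
    print("weighted:  ", " -> ".join(lightest_path(graph, degree_weights, "glucose", "ethanol")))
```

Without weights, the search takes the two-reaction shortcut through H2O, as in the paths listed in the table above. With degree-based weights, it returns the four-reaction route through the poorly connected intermediates, so the distance between the two compounds increases.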
The small diameter of metabolic networks was proposed to reflect the capability of cells to convert compounds into each other within a few reactions, thereby ensuring their robust response to environment variations.^{5} Error tolerance was related to the capability of living cells to survive random deletions of metabolic enzymes, whereas “attacks” targeted towards the hub compounds would “disintegrate [the network] into isolated clusters that are no longer functional”. As soon as we consider the nature of the nodes in the metabolic network, this rough transposition of computer network-derived properties onto metabolic networks is devoid of sense. Firstly, the tolerance to random deletions is far from trivial: the classical approach used by biochemists to discover enzyme-coding genes was to perform a random mutagenesis and to select mutants showing an auxotrophic phenotype. Such mutants lose their ability to synthesize a given compound, because the only path leading to this compound has been disrupted by the deletion of a single enzyme. Although the missing compound has generally very few links in the metabolic network, auxotrophy often results in lethality, unless the missing compound is provided in the culture medium. Metabolic networks are thus not so robust to random deletions.
The concept of “attacks” targeted to the hubs is even more questionable, because mutations affect genes (and thus the enzymes they code for), but cannot directly target metabolites. Pool metabolites appear as “hubs” in the metabolic network because they can be produced and consumed by several hundreds of different reactions, which are catalyzed by distinct enzymes. The suppression of a single hub like H_{2}O from the metabolic network would thus require deleting or inactivating several hundred enzyme-coding genes. After a handful of such mutations, the cell would already suffer from the depletion of its main enzymatic products (which are generally poorly connected compounds) and die, so that it is inconceivable to suppress, by natural or even directed mutations, a pool metabolite from the network. Thus, the concepts of error tolerance and vulnerability to attacks simply do not apply to metabolic networks.
In protein networks, the correspondence between mutations and node deletions is more straightforward than in metabolic networks. Jeong and co-workers showed that the hubs of PPIs correspond to essential proteins.^{7} By combining an analysis of network topology and temporal profiles of gene expression, Han and co-workers distinguish two subtypes among the highly connected proteins:^{27} “party hubs interact with most of their partners simultaneously, whereas date hubs bind different partners at different times or locations”. The distinction between those subtypes is supported by an independent analysis of structural interfaces between proteins,^{28} revealing that the relation between high degree and essentiality is stronger for proteins having multiple interaction interfaces (consistent with the concept of party hubs) than for those with only one interface (consistent with date hubs). It is not surprising that deletions of proteins involved in many interactions, either because they form large protein complexes or are involved in multiple processes, are likely to be deleterious. The apparent vulnerability of PPI networks to hub removal obviously results from the particular functions of each of these proteins and the biological processes in which they participate rather than from some general small world character they would confer to the network. In particular, it has to be noted that PPI networks integrate various types of interactions, going from stable protein complexes to transient interactions intervening in signal transduction pathways. Distance-related concepts such as pathway distance and “small worldness” may be relevant for signal transduction pathways, but these only represent a subset of the data. A deeper insight into the mechanisms underlying the relationship between topology and essentiality will thus require a case-by-case analysis of protein functions in the context of the processes in which they participate.
Based on this generative model, several authors hypothesized that the power law structure of biological networks results from a tendency of new nodes (metabolites, proteins, genes) to establish interactions with more ancient nodes. Evelyn Fox Keller questions the general validity of this reasoning, since other models would equally generate networks with power law degree distributions, although their underlying topologies might be very different.^{16} There is thus a basic logical fallacy in the reasoning: the fact that preferential attachment generates a power law does not mean that a power law implies preferential attachment (A → B ≠ B → A). The claim that a given biological network evolves by preferential attachment must thus be supported by other arguments than simply the shape of the degree distribution.
If we examine the raw metabolic network, preferential attachment can certainly not be considered as a general explanation for the top-ranking metabolites. The identity of the “hubs” (Table 2) provides a direct explanation for their high degree: they are either inorganic compounds (e.g. water, oxygen, CO_{2}, H_{2}O_{2}), or cofactors (ATP, NAD, SAM). Each of these molecules is involved in a specific type of chemical modification applied to a large diversity of substrates: H_{2}O is involved in hydrolysis and (de)hydration, ATP is the main currency for energy transfer, SAM is the methyl carrier, etc. Fell and Wagner applied the preferential attachment model to metabolic networks from which pool metabolites had been filtered out: if, early in the evolution of life, metabolic networks grew by adding new metabolites, then the most highly connected metabolites should also be the phylogenetically oldest.^{4,30} Indeed, this scenario seems reasonable for some of the highly connected compounds involved in intermediary metabolism, e.g. oxaloacetate, pyruvate, glutamate, as well as some amino acids pointed out by the authors. A strict application of this model would however lead to impossibilities, since it would imply that ATP appeared before adenosine, S-adenosyl-L-homocysteine before cysteine, etc. The preferential attachment model may thus partly explain some relationships between central and peripheral metabolism, but should certainly not be considered as the reason for the topological properties of the network (hubs, degree distribution).
Rank | ID | Name | In-degree | Out-degree | Total degree | Metabolic function (from ref. 72) |
---|---|---|---|---|---|---|
1 | C00001 | H_{2}O | 769 | 1444 | 2213 | Hydrolysis, hydration |
2 | C00080 | H+ | 809 | 460 | 1269 | Proton pumps (e.g. respiratory chain, photosynthesis) and other redox reactions |
3 | C00007 | Oxygen | 43 | 817 | 860 | Electron acceptor |
4 | C00006 | NADP+ | 318 | 406 | 724 | Coenzyme: electron acceptor |
5 | C00005 | NADPH | 405 | 316 | 721 | Coenzyme: electron donor in anabolism |
6 | C00003 | NAD+ | 160 | 503 | 663 | Coenzyme: electron acceptor in catabolism |
7 | C00004 | NADH | 497 | 158 | 655 | Coenzyme: electron donor |
8 | C00002 | ATP | 17 | 449 | 466 | Coenzyme: energy donor |
9 | C00011 | CO_{2} | 378 | 49 | 427 | Last product of oxidation, precursor of photosynthesis |
10 | C00009 | Orthophosphate | 315 | 78 | 393 | Product of ATP, ADP and AMP hydrolysis. |
11 | C00010 | CoA | 242 | 127 | 369 | Coenzyme: universal acyl donor |
12 | C00008 | ADP | 313 | 20 | 333 | Product of ATP hydrolysis and substrate for ATP synthesis |
13 | C00014 | NH_{3} | 253 | 43 | 296 | Source of N for all organisms incapable of fixing N_{2}. Product of aa and nucleotide catabolism, urea cycle. |
14 | C00013 | Pyrophosphate | 256 | 30 | 286 | Product of ATP hydrolysis |
15 | C00019 | S-Adenosyl-L-methionine (SAM) | 6 | 239 | 245 | Coenzyme: methyl donor |
16 | C00021 | S-Adenosyl-L-homocysteine | 227 | 9 | 236 | Subproduct of methylation by SAM |
17 | C00015 | UDP | 216 | 6 | 222 | Coenzyme: carrier of hexose groups |
18 | C00027 | H_{2}O_{2} | 142 | 21 | 163 | Redox reactions |
19 | C00026 | 2-Oxoglutarate | 33 | 125 | 158 | Participates in the citric acid cycle. Transfer of amino groups in aa and nucleotide catabolism. |
20 | C00020 | AMP | 144 | 14 | 158 | Product of ATP/ADP hydrolysis and substrate for ATP/ADP synthesis |
21 | C00022 | Pyruvate | 101 | 50 | 151 | Final product of glycolysis and some aa metabolism, e.g., Ala, Cys, Ser. Gluconeogenesis. |
22 | C00024 | Acetyl-CoA | 35 | 101 | 136 | Coenzyme: acetyl donor |
23 | C00025 | L-Glutamate | 83 | 46 | 129 | Transfer of amino groups in reactions of aa and nucleotide metabolism, intermediate in Pro, Arg, Gln, His, degradation/biosynthesis, precursor of glutathione, ornithine, GABA, Ser and Gly biosynthesis (NH_{3} donor) |
24 | C00036 | Oxaloacetate | 29 | 14 | 43 | Participates in the citric acid cycle and gluconeogenesis. Precursor of Asp. Produced by several anaplerotic reactions. |
The preferential attachment model has also been proposed for protein interaction networks. Eisenberg and Levanon^{31} tested the validity of this model by partitioning all the proteins of the yeast Saccharomyces cerevisiae into 4 age groups, estimated from the taxonomical range in which they were found: Saccharomyces only, all fungi, fungi + plants, or fungi + plants + bacteria, respectively. Their study clearly shows that the average degree is higher for older than for newer proteins. A first concern should be raised about the design of this test. Even though the mean degree may differ between age classes, this is not a proof for the preferential attachment model. Indeed, since power law distributions are intrinsically characterized by the presence of statistical outliers (the “hubs”), the arithmetic mean is a poor estimator of the central tendency of the degree distribution. In other words, the fact that the mean degree is higher for proteins of older groups might result from the very high degree of a few ancient proteins (“hubs”) involved in primordial functions having evolved during early forms of life,^{32,33} and would thus not support a general rule of preferential attachment.
Rather than comparing the means, the test should thus rely on the medians (which are robust to outliers and thus better suited to highly skewed distributions), or, even better, on the whole distribution. Under the preferential attachment model, nodes would progressively acquire links during evolution, and the entire distribution would thus be shifted towards higher degrees for older proteins, as compared with newer proteins. By way of illustration, we analyzed the degree distributions per age group using a literature-curated (LC) and a high-throughput (HTP) PPI from a more recent study.^{19} The inverse cumulative distributions (Fig. 3) indeed reveal differences between age groups, but the relationship is not as simple as expected from a preferential attachment to the most ancient proteins. In the literature-curated network (Fig. 3B), the most recent proteins (found in fungi only) present the same distribution as the most ancient ones (those found in archaea, bacteria and eukaryotes), whereas a right-hand side shift is observed for proteins found in eukaryotes only, and in eukaryotes + archaea, respectively. The same trend is perceptible in the high-throughput network (Fig. 3C), even though the most ancient proteins show a slight increase in degree compared to the most recent ones. The fact that proteins found specifically in eukaryotes and/or archaea have more connections might result from an over-representation, in these datasets, of proteins involved in processes involving many protein interactions (e.g. cell cycle, transcription machinery, etc.).
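The difference between comparing means and comparing medians or whole distributions can be seen on a small artificial example. In the sketch below, the two degree lists are hypothetical and are not taken from ref. 19: they are identical except for a single hub in the "older" group.

```python
import statistics

def inverse_cumulative(degrees):
    """Fraction of nodes with degree greater than or equal to k, for each observed k."""
    n = len(degrees)
    return {k: sum(1 for d in degrees if d >= k) / n for k in sorted(set(degrees))}

# Hypothetical degree lists for two age groups.
older = [1, 2, 2, 3, 3, 4, 150]
newer = [1, 2, 2, 3, 3, 4, 5]

if __name__ == "__main__":
    for name, degrees in (("older", older), ("newer", newer)):
        print(f"{name}: mean = {statistics.mean(degrees):.2f}, "
              f"median = {statistics.median(degrees)}")
        print("   iCDF:", inverse_cumulative(degrees))
```

A single hub is enough to raise the mean of the older group several-fold, while the medians and the core of the two inverse cumulative distributions remain identical. A higher mean degree alone therefore does not show that the distribution as a whole has shifted.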
Fig. 3 Degree distributions in the interactome for proteins of different classes of age (data from ref. 19). A: degree per protein in the literature-curated (abscissa) versus high-throughput (ordinate) networks. Note that the hubs are completely different between these two networks. B, C: inverse cumulative distributions (iCDF) of degrees of proteins partitioned into different age groups for the literature-curated (B) and high-throughput (C) network, respectively. The horizontal bar (Freq = 0.5) indicates the median degree of each age class (the abscissa of its intersection with each iCDF). The horizontal dotted line indicates the third quartile, which separates the 25% most connected from the 75% less connected nodes.
The duplication–divergence model (or families of models) explains the topology of protein interaction networks based on genetic mechanisms underlying genome evolution.^{30,34–37} The hypothesis is that partial and/or whole genome duplications must have a direct impact on the evolution of protein interaction networks. Under this model, immediately after gene duplication, both duplicates interact with all the former neighbours of the parent gene. Later mutations in one of the redundant copies provoke a loss of some or all of its interactions. The model is supported by several observations: paralogous proteins are more likely to share partners than randomly chosen proteins,^{34} proteins sharing partners are more likely to be paralogs^{28} and a proportion of protein complexes have similarities to other complexes.^{38} However, the partners acquired by this mechanism alone would compete for the same (duplicated) interface.^{28} Network rewiring is necessary to introduce novel interactions (rather than merely duplicate existing ones)^{37} and is thought to occur mainly by exon shuffling of genes encoding multimeric proteins.^{34}
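The two steps of the model (duplication, then loss of interactions in one copy) can be sketched as a simple simulation. This is only a minimal illustration under stated assumptions: each inherited interaction is lost independently with a fixed probability, parents are chosen uniformly at random, and no rewiring is included. The parameter `loss_probability`, the function names and the seed network are hypothetical and do not correspond to any of the published variants of the model.

```python
import random

def duplicate_and_diverge(graph, parent, child, loss_probability, rng):
    """Add child as a copy of parent; the copy loses each inherited link with the given probability."""
    graph[child] = set()
    for partner in sorted(graph[parent]):
        if rng.random() >= loss_probability:  # the interaction is retained
            graph[child].add(partner)
            graph[partner].add(child)
    return graph

def grow(seed_graph, n_new, loss_probability, seed=0):
    """Grow a network by repeated duplication of a randomly chosen node."""
    rng = random.Random(seed)
    graph = {node: set(partners) for node, partners in seed_graph.items()}
    for i in range(n_new):
        parent = rng.choice(sorted(graph))
        duplicate_and_diverge(graph, parent, f"dup{i}", loss_probability, rng)
    return graph

# Hypothetical seed network: protein a interacts with b and c.
seed_network = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}

if __name__ == "__main__":
    network = grow(seed_network, n_new=200, loss_probability=0.5)
    degrees = sorted((len(partners) for partners in network.values()), reverse=True)
    print("nodes:", len(network), "highest degrees:", degrees[:5])
```

Because a duplicate can only inherit partners from its parent, every interaction in such a simulation is a copy of an earlier one. This makes explicit why additional rewiring is needed to introduce genuinely novel interactions.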
Despite the popularity of the duplication–divergence model, no consensus exists yet on how the protein network evolves. Recently, four alternative generative models (preferential attachment, duplication–divergence, anti-preferential attachment and crystal growth) were compared in their capability to reproduce the topology and age-dependence of interaction patterns observed in the yeast protein interaction network.^{19} Age-dependence of interaction patterns of the real and simulated networks was evaluated using a measure of the interaction density (D) between different age groups and the network-wise propensity for a new node to connect with older nodes (average interaction density gradient, ΔD) (see Box 1 for definitions). The duplication–divergence model seems to reproduce the topology of the yeast PPI network but not its age-dependent interaction pattern. In the yeast PPI network, most links are made between proteins belonging to close age groups (ΔD > 0). This feature is only observed in the network generated following the crystal growth model (the only other model that reproduces the PPI network topology), although the pattern of interaction density between the different age groups does not reproduce that of the yeast network.
In summary, it seems that each of the generative models proposed so far captures a subset of the topological properties of protein interaction networks, but none of them is able to account for all topological aspects.
This can for example be done by incorporating biochemical knowledge into metabolic networks: the relevant pathways can be inferred by tracing the exchanges of atom groups between compounds,^{39} or by decomposing reactions into reactant pairs.^{24,40} In PPI networks, the incorporation of structural analysis has already improved our understanding of the network evolution.^{28}
Ultimately, understanding the wiring of biochemical networks will require us to integrate the different layers of biological processes (genetic, protein–protein, metabolic), and to map them onto the specific cellular compartments and tissues where they take place.
Transcription regulatory networks were the first to be targeted from a module perspective. A systematic study of the transcription network of E. coli led to the identification of recurrent motifs^{41} (see Box 1 for the definition) that were further found in regulatory networks of other organisms (yeasts, plants and animals) and in other types of biological networks.^{42} The recurrent presence of these motifs in a variety of biological networks has been proposed to be due not only to conservation but also to convergent evolution under the effect of functional selection.^{42,43} The criterion for considering that a motif is over-represented or not is itself debatable, and the significance of some recurrent motifs may have been over-estimated due to inappropriate null models for network randomization.^{44,45} Nevertheless, such studies are of interest because they bring back the focus from global networks to local structures that can be related to specific information-processing units.
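The dependence of motif significance on the null model can be made concrete with a small sketch. It counts one example motif, the feed-forward loop (chosen here as an assumption; the paragraph above does not name specific motifs), and compares the count with networks randomized by edge swaps that preserve each node's in- and out-degree. This is only one possible null model, and the function names and toy network are hypothetical.

```python
import random


def count_feed_forward_loops(edges):
    """Count ordered triples (x, y, z) with edges x->y, y->z and x->z."""
    successors = {}
    for source, target in set(edges):
        successors.setdefault(source, set()).add(target)
    count = 0
    for x, targets_of_x in successors.items():
        for y in targets_of_x:
            if y == x:
                continue
            for z in successors.get(y, ()):
                if z != x and z != y and z in targets_of_x:
                    count += 1
    return count


def degree_preserving_shuffle(edges, n_swaps, rng):
    """Randomize a directed network by swapping the targets of edge pairs.

    Edges a->b and c->d become a->d and c->b, so every node keeps its
    in-degree and out-degree. Swaps that would create a self-loop or a
    duplicate edge are skipped.
    """
    edge_list = sorted(set(edges))
    edge_set = set(edge_list)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        if a == d or c == b or (a, d) in edge_set or (c, b) in edge_set:
            continue
        edge_set -= {(a, b), (c, d)}
        edge_set |= {(a, d), (c, b)}
        edge_list[i], edge_list[j] = (a, d), (c, b)
    return edge_list


# Hypothetical toy network, not real regulatory data.
toy_network = [("x", "y"), ("y", "z"), ("x", "z"), ("z", "w"), ("w", "v")]
observed = count_feed_forward_loops(toy_network)
rng = random.Random(0)
null_counts = [
    count_feed_forward_loops(degree_preserving_shuffle(toy_network, 100, rng))
    for _ in range(200)
]
# Fraction of randomized networks with at least as many loops as observed.
fraction = sum(c >= observed for c in null_counts) / len(null_counts)
print(observed, fraction)
```

Replacing `degree_preserving_shuffle` with a different randomization (for instance one that ignores node degrees) changes the distribution of `null_counts`, and hence the apparent over-representation of the same motif.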
Actually, the relationship between network motifs and their dynamical behaviour has been tackled by geneticists for half a century: the first network motifs to be discovered were the feedback loops, whose effect was characterized by experimental and theoretical analyses of small genetic networks. In their historical article on the Lac operon,^{47} Jacob and Monod not only demonstrated the existence of genetic regulation (repression), but also pointed out the essential role of positive feedback in ensuring multistationarity, i.e. the existence of two alternative cellular states (induced or repressed, respectively). In the early 1970s, Kauffman^{48,49} and Thomas^{50} modelled genetic networks with Boolean approaches. Thomas further defined a logical formalism based on multi-valued variables that allowed him to systematically analyze the role of feedback loops in regulatory networks,^{51,52} and demonstrated that the presence of positive feedback loops (i.e. loops containing an even number of negative interactions) is a necessary condition to generate multistationarity (differentiation, cell memory), whereas negative feedback loops (odd number of negative interactions) ensure sustained oscillations and homeostasis (see ref. 53 for a recent review). The respective roles of positive and negative feedback loops are confirmed by innumerable examples of regulatory circuits involved in controlling metabolism, development, the immune system, etc.
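The parity rule for loop signs, and its link with multistationarity, can be checked on two toy Boolean circuits. The sketch below assumes synchronous updating and two hypothetical circuits (a pair of mutually repressing genes and a ring of three repressors); it is a simplification and does not implement Thomas's asynchronous, multi-valued formalism.

```python
from itertools import product


def loop_sign(interaction_signs):
    """Return +1 for an even number of negative interactions, -1 otherwise."""
    negatives = sum(1 for sign in interaction_signs if sign < 0)
    return 1 if negatives % 2 == 0 else -1


def steady_states(update, n_genes):
    """List the Boolean states left unchanged by one synchronous update."""
    return [s for s in product((0, 1), repeat=n_genes) if update(s) == s]


def mutual_repression(state):
    """Two genes repressing each other: two negative interactions."""
    a, b = state
    return (1 - b, 1 - a)


def repressor_ring(state):
    """Three genes repressing each other in a ring: three negative interactions."""
    a, b, c = state
    return (1 - c, 1 - a, 1 - b)


print(loop_sign([-1, -1]), steady_states(mutual_repression, 2))
# positive loop, two alternative steady states: 1 [(0, 1), (1, 0)]
print(loop_sign([-1, -1, -1]), steady_states(repressor_ring, 3))
# negative loop, no steady state: -1 []
```

The positive loop has two alternative stable states, whereas the negative loop has none and every trajectory keeps cycling, in line with the rule stated above.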
On the way back from theory to wet biology, mathematical modelling can also be the starting point to pinpoint a set of molecules and interactions that will be further studied using classical molecular genetics methods. Synthetic biology applies the theoretical concepts to design artificial genetic systems that can be empirically tested in living cells. Small circuits that were designed following this approach include a positive loop acting as a genetic toggle between two alternative stable states,^{54} or a negative loop generating an oscillating behaviour.^{55} Artificial regulatory interactions can also be inserted into existing biological systems in order to decipher their function and evolution, by engineering small circuits^{56} or even rewiring the entire regulatory network.^{57}
Although the action of individual motifs on small genetic systems has been well described, much remains to be done before we understand the rules underlying the combination of multiple such motifs in large networks. A great challenge for the future will be to bridge the gap between mathematical modelling of small circuits and integrative analysis of large networks. Rather than treating network biology as a new and thus separate field, we should combine graph theory with other established approaches in mathematical biology and confront the results with prior biological knowledge; both are critical if we aim to fully understand, model and design biological systems.^{58}
ER | Erdős–Rényi |
PPIs | Protein–protein Interactions |
LC | Literature-curated PPI network |
HTP | High-throughput PPI network |
Footnote |
† This article is part of a Molecular BioSystems themed issue on Computational and Systems Biology. |
This journal is © The Royal Society of Chemistry 2009 |