Nicole
Pearcy
,
Jonathan J.
Crofts
* and
Nadia
Chuzhanova
School of Science and Technology, Nottingham Trent University, Nottingham, NG11 8NS, UK. E-mail: jonathan.crofts@ntu.ac.uk; Tel: +44 (0)115 848 8419
First published on 2nd October 2014
At the systems level many organisms of interest may be described by their patterns of interaction, and as such, are perhaps best characterised via network or graph models. Metabolic networks, in particular, are fundamental to the proper functioning of many important biological processes, and thus, have been widely studied over the past decade or so. Such investigations have revealed a number of shared topological features, such as a short characteristic path-length, large clustering coefficient and hierarchical modular structure. However, the extent to which evolutionary and functional properties of metabolism manifest via this underlying network architecture remains unclear. In this paper, we employ a novel graph embedding technique, based upon low-order network motifs, to compare metabolic network structure for 383 bacterial species categorised according to a number of biological features. In particular, we introduce a new global significance score which enables us to quantify important evolutionary relationships that exist between organisms and their physical environments. Using this new approach, we demonstrate a number of significant correlations between environmental factors, such as growth conditions and habitat variability, and network motif structure, providing evidence that organism adaptability leads to increased complexities in the resultant metabolic networks.
Metabolic networks have been the focus of a large number of studies (see, for example, the review by Lacroix et al.13 and references therein) and several important structural characteristics have been evidenced. For example, modularity, i.e. the propensity for a network to organise into nearly-independent structural units, has been shown to be a prevalent feature within metabolic networks,14,15 and has, for example, been related to important biological properties such as robustness16,17 and evolvability.10,11,16 However, metabolic networks are by no means perfectly modular; their inter-module connectivity is relatively high, leading some authors to conclude that these networks are better described as being hierarchically structured,14 that is metabolic networks may be considered to possess fractal-like properties, such as self-similarity. Indeed, the existence of many small highly integrated units, which then group together to form larger modules and so on, provides one possible explanation for the high level of inter-modular connectivity observed in these networks.
Another popular approach for analysing metabolic networks is provided by network motifs,18i.e. recurrent, statistically significant subgraphs. Motifs are of particular interest since they are typically associated with certain biological functions, and their relative over-abundance is considered to be an evolutionary result reflecting their “importance” to the organisms involved.19,20 Moreover, they constitute the basic structural units from which complex metabolic networks are formed, and thus provide a simplified framework for probing large-scale topologies. For example, in a recent study Shellman et al.21 successfully captured key evolutionary differences between metabolic networks from the six different kingdoms of life, employing network motif analysis. Another example, highlighting the considerable potential of such an approach, is provided by the work of Asgari and colleagues,22 in which the authors employ recent advances in the theory of network controllability as a method for improving drug–target discovery techniques. In particular, they suggest that network motifs provide ideal ‘driver’ candidates, which can be employed to manoeuvre the system of interest into certain desirable states.
Here, we introduce a novel technique for comparing biological networks of varying size based on local network structure. In particular, we propose a new embedding technique based on low order network motifs. In our approach, each network is mapped to a point in a high-dimensional vector space, the dimension of which depends on the number of motifs considered (n = 212 in our work as we consider all 3 and 4 node motifs23). By using a suitably defined low-rank approximation, we are able to combine the 212 motif frequency scores into a single network specific measurement, which allows us to compare and contrast networks in terms of just a single parameter. Using this new measure we investigated 383 bacterial metabolic networks with identified growth conditions, as well as a smaller subset of 115 networks classified according to the amount of variability present within their natural habitats, and found a number of significant correlations between network motif structure and fluctuations in environmental conditions.
and
denotes the mean and standard deviation of the rate of recurrence of the jth motif in an ensemble of randomised networks.3
In this way, for each network of interest we can compute a feature vector, zi, whose elements are the z-scores of each network motif. For example, if, as in this work, we consider all 3- and 4-node motifs then the result is a vector zi ∈
212 representing the ith network.
Note that it is typically the case that the networks we wish to compare are of varying order and as such we need to take care that network size does not bias any results. To handle this issue one can consider instead of the z-scores defined above, a so-called significance profile26 defined by
In the work presented here, we threshold the network significance profiles such that any entries si,j > 0 are set to zero as we are only interested in those motifs that are over represented. Motifs that are under represented are known as anti-significant motifs, or anti-motifs, and although we do not consider them in this study, the approach forwarded here can easily be extended to that case. This results in a matrix
| S = [s1,…,sm]T ≥ 0, |
To analyse the matrix S we use a matrix decomposition to compute a low-rank approximation of the data.25 Since our data is non-negative, it is natural to decompose it using a non-negative matrix factorisation27 (NNMF) (for algorithmic details see the Methods section). Such an approach is akin to a principal component analysis, that reduces the dimension of the problem, thus allowing us to detect important network features. Mathematically, we approximate S as follows:
| S ≈ WH, | (1) |
m×k and H ∈
k×212 are non-negative matrices. Here, k is the rank of the approximation and m the number of networks being considered. Importantly, both the columns of W and the rows of H can be used to reveal important network features.28,29 Note, that in all of our experiments, the factorisation in (1) was carried out using k = 3, and W, H were chosen so as to minimise the residual25| ||S − WH||F. |
The approach can be concisely summarised into the following three basic steps (see Fig. 1 for a schematic description):
Step 1: For each metabolic network compute the significance profile, si ∈
212, consisting of the normalised significance scores for each of the 212 three- and four-node motifs.
Step 2: Compute a low-dimensional (k ≪ 212) representation of the thresholded matrix of significance scores, S = [s1,…,sm]T, using a non-negative matrix factorisation.
Step 3: Use the columns/rows of W/H to determine important network features.
| P(i,j) = si,j · h1,j. | (2) |
m×212 (m = 115 or m = 383 here), whose rows encapsulate the network motif structure of each metabolic network, and whose columns provide information pertaining to the relative importance of specific motifs across the network ensemble.
In the experiments in the next section, we derive a global significance score for each network by summing the rows of P as follows
![]() | (3) |
| Environment | Number of nodes | Number of edges | ||||
|---|---|---|---|---|---|---|
| Min | Median | Max | Min | Median | Max | |
| Obligate (34) | 78 | 273 | 620 | 91 | 340 | 840 |
| Specialised (5) | 442 | 480 | 541 | 566 | 641 | 692 |
| Aquatic (4) | 541 | 580 | 647 | 700 | 751 | 868 |
| Facultative (41) | 90 | 652 | 809 | 101 | 890 | 1160 |
| Multiple (28) | 430 | 615 | 800 | 560 | 821 | 1119 |
| Terrestrial (3) | 557 | 689 | 693 | 779 | 944 | 966 |
| Total (115) | 78 | 541 | 809 | 91 | 730 | 1160 |
| Environment | Number of nodes | Number of edges | ||||
|---|---|---|---|---|---|---|
| Min | Median | Max | Min | Median | Max | |
| Aerobic (154) | 65 | 605 | 892 | 74 | 809 | 1210 |
| Facultative (180) | 78 | 602 | 816 | 91 | 825 | 1168 |
| Anaerobic (49) | 307 | 488 | 681 | 381 | 645 | 969 |
| Total (383) | 65 | 581 | 892 | 74 | 789 | 1210 |
Metabolic processes can be modelled using simple graphs in a number of ways33 and it is important to choose an appropriate representation. The most common representation is the substrate–product graph whereby nodes and edges correspond to metabolites and reactions, respectively. Note, that a potential caveat of such an approach is that it can lead to the detection of erroneous pathways (see, for example, the discussion in Montañez et al.34). However, since we are not considering a path analysis here and for the ease of comparability with related studies, we consider the substrate–product representation in all of our experiments. Moreover, ubiquitous metabolites such as H2O, ATP and NADH were removed from the analysis as they tend not to be involved in higher order functions, and if included, typically lead to physiologically meaningless pathways. Finally, to further simplify the analysis, we consider only the largest connected component for each network. This avoids, for example, issues that arise when constructing randomised networks through the rewiring of metabolic networks consisting of a number of disconnected components. For further details of the network construction the interested reader is referred to the papers of Takemoto and colleagues.11,30,35
It is worth noting that, in addition to the modelling issues touched upon above, there are general limitations to any complex network study of metabolism due to noisy and incomplete metabolic maps (e.g. missing/spurious links), the omission of reaction stoichiometry data and incomplete reaction reversibility data. Nevertheless, the approach taken here is standard within the field and provides a global picture of the biochemical systems under investigation.
As an illustration of the approach introduced in the previous section, we carried out two experiments with the aim of testing the hypothesis that organism adaptability is manifested via the network motif structure of the corresponding metabolic networks.
Fig. 2 shows a plot of the mean global significance score, 〈Pglobal〉, versus environmental variability for the 115 different bacterial networks. Note that the average here is taken over each of the six environmental classes: obligate, specialised, aquatic, facultative, multiple and terrestrial. Importantly, we found that motif frequency, and thus network complexity, increased significantly with environmental variability. The lowest motif frequency was found for the bacteria within the obligate class, followed by a relatively steep increase to the specialised and aquatic classes, then higher again for the facultative and multiple classes, and then highest for the terrestrial class. The group differences shown in Fig. 2 are statistically significant (Kruskal–Wallis test, p < 10−9).
This result provides evidence supporting the view that variability in an organisms habitat has important consequences for the topology of the resultant metabolic networks, and is consistent with previous studies10,12,36,37 that have demonstrated important links between the metabolic networks of organisms and their biochemical environments. In the current context, these results can be understood as an evolutionary effect due to the greater uncertainty that accompanies an increasingly fluctuating environment: greater numbers of 3- and 4-node motifs lead to larger numbers of cycles, i.e. closed paths, and thus to increased redundancy in the metabolic network, which in turn promotes greater adaptability and robustness.
Fig. 3 shows a plot of the mean global significance scores versus growth conditions for the 383 different bacterial species. Interestingly, we found that networks that have evolved in the presence of oxygen, i.e. aerobes and facultative aerobes, have a significantly larger number of network motifs. The group differences shown in Fig. 3 were found to be significant (Kruskal–Wallis test, p < 10−4).
![]() | ||
| Fig. 3 Relationship between growth conditions, in particular oxygen requirements, and mean global significance score 〈Pglobal〉. Vertical bars represent standard errors. | ||
The results shown in Fig. 3 are in agreement with recent studies (see, for example, the paper by Raymond and Segré38) demonstrating that bacterial networks that are exposed to oxygen are able to form additional pathways, compared to those that are oxygen deprived. In particular, the study by Raymond and Segrè38 found that aerobic bacteria have approximately a 1.5 fold increase in the number of reactions and metabolites relative to anaerobic bacteria, resulting in the expansion of metabolic networks evolving in the presence of oxygen, and thus supporting the view of oxygen induced network complexity.
, that is, the column sum of the matrix P defined in eqn (2) – recall that the columns of P contain information specific to individual motifs. Moreover, by restricting the sum above to a particular subgroup of interest (specialised, obligate, multiple, etc.), it is possible to detail the extent to which any particular motif featured within that group. In the following we consider a motif to be significant within a particular group, if the mean local significance score of that motif (restricted to the group of interest) is at least 2 standard deviations greater than the mean score across the entire network ensemble. Note that a list of all 3- and 4-node motifs is provided in the ESI.†
The increased numbers of network motifs present within the varied class indicates a potentially significant growth in network redundancy within those organisms inhabiting fluctuating environments, and can be considered as further evidence of so-called functional redundancy mediated robustness,40 that is, the observed perseverance of systems level redundancies prevalent in metabolic, as well as more general, cellular networks. More specifically, of the 4 additional significant motifs found in the varied class, motifs 14 and 15 may be considered variants of the single-input motif, motif 62 a bi-parallel fan, and motif 26 a multi-input motif, all of which have been implicated as potential indicators of network redundancy. For example, in the context of metabolism the single-input motif consists of a substrate X that is consumed in multiple reactions, the result of which are the products Y, Z,…; whilst the bi-parallel fan implies the presence of multiple, or compensatory, pathways whose efficiencies may vary according to alterations in environmental conditions. Indeed, these findings are in agreement with a number of recent studies relating genetic robustness and organism adaptability,40,41 and suggest that bacteria that live in more variable environments typically display a greater abundance of redundant metabolic reactions.
In addition to the topological differences observed between varied and specialised bacteria, we found that the distribution of those metabolites occurring within motif structures present across the entire network ensemble, i.e. motifs 5 and 9, also differed significantly. Fig. 4 and 5 show the mean frequency for metabolites occurring within motif 5 for the 115 metabolic networks, again grouped into the specialised (blue bars) and varied classes (red bars). Note that the frequencies plotted in Fig. 4 and 5 have been normalised to remove any bias due to network size (see Methods section for further details), and metabolites are displayed in decreasing order according to the varied class. Fig. 4 displays the distribution for those 263 metabolites that occurred at least once within motif 5 across the two classes under consideration. Interestingly, we see that the distribution for the varied class is relatively broad, with a large number of metabolites occurring with a relatively low frequency, whereas the distribution for the specialised class is more akin to a scale-free or power-law distribution, consisting of a small set of relatively high frequency metabolites. Note that similar results were found for motif 9 (see ESI†).
Next we used a Chi-square test (Fisher's Exact test, p < 0.01) to explore the differences in proportions of the individual metabolites between the two groups. Fig. 5 shows the 47/263 metabolites for which a significant difference in proportions was found, again displayed according to decreasing frequency of the varied class. Metabolites displaying the most significant differences (Fisher's Exact test, p < 10−5) included (2R)-2-hydroxy-3-(phosphonooxy)-propanal, tetrahydrofolate and isopentenyl diphosphate, all of which were overrepresented in the specialised group compared to the varied group. Note that the aforementioned overrepresented metabolites are required for biosynthesis of various amino acids, folates and terpenoids and are also responsible for the regulation of carbohydrate metabolism in many bacterial species.
Fig. 6 and 7 show the distribution of metabolites across motif 5 for the two groups, ordered according to decreasing metabolite frequency for the aerobic class (blue bars). Note, that the aerobic class exhibits a fairly broad distribution, whilst the anaerobic distribution tails off slightly quicker, in a similar but less pronounced manner to that displayed by the specialised bacteria in Fig. 4. Fig. 7 shows those metabolites that displayed a significant group difference. Interestingly, the majority of metabolites, some 37/52, were found to be overrepresented in the aerobic group compared to the anaerobic group, the most significant of which were isopentenyl diphosphate, fatty acid, trans,trans-farnesyl diphosphate, phosphatidylethanolamine, phosphatidylserine, L-threonine, L-2-amino-3-oxobutanoate, phosphatidylcholine, 2-acyl-sn-glycero-3-phosphocholine, L-2-lysophosphatidylethanolamine, 3′,5′-cyclic GMP (Fisher's Exact test, p < 10−5). These metabolites are known to be involved in the biosynthesis of a range of amino acids and secondary metabolites. Again, similar results where found for motif 9 (see ESI†).
n×m, W ∈
n×k and H ∈
k×m) via successive iterations of the equationsNote that due to the iterative nature of the scheme, the resulting decomposition can vary depending upon (a) the choice of initial matrices W0, H0; (b) the choice of the parameter k, which is usually not obvious a priori, and tends to be based on heuristics such as the number of expected clusters in the data, or a visual inspection of the scree plot. In our experiments, we ran the algorithm with 100 different random initial conditions, choosing as our factorisation the matrices W, H for which the residual ||A − WH||F was minimised, according to the Frobenius norm. Moreover, we repeated the experiments for a range of different k values (up to and including k = 25) to test the robustness of our results. We found the effects of varying k to be inappreciable, in the sense that the patterns reported were reproduced for most values of k.
Footnotes |
| † Electronic supplementary information (ESI) available. See DOI: 10.1039/c4mb00430b |
| ‡ http://www.mathworks.com/ |
| This journal is © The Royal Society of Chemistry 2015 |