David R. Johnson a, Damian E. Helbling b, Yujie Men c and Kathrin Fenner *cd
a Department of Environmental Microbiology, Eawag, Dübendorf, Switzerland
b School of Civil and Environmental Engineering, Cornell University, Ithaca, NY, USA
c Department of Environmental Chemistry, Eawag, Dübendorf, Switzerland. E-mail: kathrin.fenner@eawag.ch; Fax: +41 58 765 5802; Tel: +41 58 765 5085
d Department of Environmental Systems Science, ETH Zürich, Zürich, Switzerland
First published on 25th March 2015
There is increasing interest in using meta-omics association studies to investigate contaminant biotransformations. The general strategy is to characterize the complete set of genes, transcripts, or enzymes from in situ environmental communities and use the abundances of particular genes, transcripts, or enzymes to establish associations with the communities' potential to biotransform one or more contaminants. The associations can then be used to generate hypotheses about the underlying biological causes of particular biotransformations. While meta-omics association studies are undoubtedly powerful, they have a tendency to generate large numbers of non-causal associations, making it potentially difficult to identify the genes, transcripts, or enzymes that cause or promote a particular biotransformation. In this perspective, we describe general scenarios that could lead to pervasive non-causal associations or conceal causal associations. We next explore our own published data for evidence of pervasive non-causal associations. Finally, we evaluate whether causal associations could be identified despite the discussed limitations. Analysis of our own published data suggests that, despite their limitations, meta-omics association studies might still be useful for understanding and predicting the contaminant biotransformation capacities of microbial communities.
Water impact
One of the main challenges in contaminant biotransformation research is to identify the genes or gene products that cause or affect particular biotransformations. Meta-omics association studies are rapidly gaining attention as a possible approach to address this challenge, but they have inherent limitations of both technical and biological natures. While the technical limitations have been discussed in detail (e.g., accuracy of functional annotations, sequencing depth, etc.), the biological limitations remain largely unaddressed. In this perspective manuscript, we describe general biological scenarios that could prevent meta-omics association studies from identifying the genes or gene products that cause particular contaminant biotransformations. We next explore our own published data to test the relevance of the discussed biological scenarios. We finally synthesize our findings and present our perspective about the potential of meta-omics association studies to elucidate contaminant biotransformations in the face of their inherent biological limitations.
Conventionally, establishing causal relationships between contaminant biotransformations and genes or gene products has been achieved by characterizing microorganisms in pure or enrichment cultures where the contaminant of concern serves as a growth substrate and the responsible genes or gene products could be directly enriched and characterized.8–10 An important limitation of this approach is that it is susceptible to culturing biases and can lead to the enrichment of microorganisms, genes, or gene products that are environmentally irrelevant.11 A second limitation is that the approach is often not appropriate for co-metabolic contaminant biotransformations, which are likely important biotransformation mechanisms for trace organic contaminants.12,13 The main problem is that co-metabolic biotransformations do not support growth, thus making it challenging to directly enrich the responsible microorganisms, genes, and gene products. This problem also affects other recent methodological advances in the field of contaminant biotransformation research, such as stable isotope probing (SIP) or microautoradiography combined with fluorescence in situ hybridization (MAR-FISH). These methods rely on the incorporation of isotope-labelled compounds into new biomass14,15, and are therefore not likely to be helpful for identifying the biological determinants of co-metabolic biotransformations.
Given these limitations along with the increasing accessibility of high-throughput sequencing and mass spectrometry techniques, there is growing interest in using molecular data generated via meta-omics methodologies (i.e., methodologies that attempt to characterize the complete set of genes, transcripts, or enzymes of a community) to elucidate causal associations with biotransformations.16–20 The general strategy is to isolate and characterize aggregate DNA, RNA, or proteins from in situ environmental communities and use the abundances of genes or gene products to establish associations that reflect the communities' potential for biotransforming one or more contaminants (referred to here as a meta-omics association study). In this context, we use the term “association” to refer to a statistical relationship between two variables, which may be described quantitatively (e.g., a linear or monotonic relationship) or qualitatively (e.g., a co-occurrence relationship). The associations can then be used to generate hypotheses about possible causal relationships between contaminant biotransformations and particular genes or gene products. Important advantages of meta-omics association studies are that they avoid culturing biases, do not require that the contaminants of interest be used as growth substrates, and may help to identify the responsible organisms.
One general scenario is “intracellular hitchhiking” (Fig. 1A). Consider a microbial strain that carries a gene or gene product (designated as G1) that causes or promotes a particular contaminant biotransformation (Fig. 1A). Because G1 causes or promotes that biotransformation, we might expect a causal association between the abundance of G1 and the rate of that particular biotransformation (Fig. 1D; the relationship is depicted as linear for simplicity, but could be of any monotonic form). However, the same strain that carries G1 likely carries many other genes or gene products (designated as G2 to Gn) that cause or promote entirely unrelated functions. For example, G2 might be an enzyme that biotransforms a different substrate but continues to be synthesized even when that substrate is not present within the cell's local environment (i.e., the enzyme is constitutively expressed).24–26 The consequence is that, even though causal relationships do not exist between G2 to Gn and the biotransformation of interest, the co-occurrence of G2 to Gn and G1 within the same cell could generate large numbers of genuine but non-causal associations (Fig. 1D; the relationships are again depicted as linear for simplicity, but could be of any monotonic form). Considering that a single microbial strain typically carries several thousand genes and gene products, the size of G2 to Gn could be exceedingly large and “intracellular hitchhiking” could result in far more genuine but non-causal associations than causal associations.
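The hitchhiking argument can be illustrated with a toy calculation. The sketch below is hypothetical throughout: the strain abundances, gene copy numbers, and rate coefficient are invented for illustration and do not come from any dataset. It assumes that the biotransformation rate depends only on G1, and that G2 and G3 are unrelated genes carried by the same strain.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical abundance of one strain in five communities (arbitrary units).
strain_abundance = [1.0, 2.0, 4.0, 5.0, 8.0]

# Assumed copies per cell: G1 is causal, G2 and G3 are unrelated passengers.
copies_per_cell = {"G1": 1, "G2": 3, "G3": 2}
gene_abundance = {
    gene: [copies * s for s in strain_abundance]
    for gene, copies in copies_per_cell.items()
}

# Assumption: the rate is proportional to the abundance of G1 only.
rate = [0.2 * a for a in gene_abundance["G1"]]

# Every gene carried by the strain tracks the rate equally well.
correlations = {gene: pearson(a, rate) for gene, a in gene_abundance.items()}
for gene, r in correlations.items():
    print(gene, round(r, 3))
```

In this idealized case the passenger genes correlate with the rate exactly as well as G1 does, so the correlation coefficient alone cannot separate the causal gene from its hitchhikers.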
A second general scenario is “intercellular facilitation” (Fig. 1B). Consider again a microbial strain that carries G1 that causes or promotes a particular contaminant biotransformation (Fig. 1B). We might again expect an association between the abundance of G1 and the rate of that particular biotransformation (Fig. 1D). However, the same strain that carries G1 might perform another function that positively affects the growth of a second microbial strain. For example, the strain that carries G1 might secrete a metabolite that promotes the growth of the second strain.27,28 If the second strain carries other genes or gene products (designated as G2 to Gn) that do not affect the biotransformation of interest, the abundances of G2 to Gn might nevertheless associate with the rate of that biotransformation even though they do not cause or promote that biotransformation (Fig. 1D). The result is again a potentially large number of genuine but non-causal associations. Moreover, for every additional “intercellular facilitation”, there is a new set of genuine but non-causal associations that could emerge by “intracellular hitchhiking”, thus leading to potentially large numbers of genuine but non-causal associations.
A third general scenario is “habitat co-occurrence” (Fig. 1C). Consider two different microbial strains that co-occur together in a particular habitat but do not otherwise interact with each other. For example, the two strains might be particularly well adapted to a specific environment such as plant root surfaces, arctic lakes, or hot springs. One strain carries gene or gene product G1 that causes or promotes a particular contaminant biotransformation while the other strain carries genes or gene products G2 to Gn that do not cause or promote that biotransformation. The consequence of habitat co-occurrence is that, while only G1 causes or promotes that biotransformation, genuine but non-causal associations could occur between the abundances of G2 to Gn and the rate of that biotransformation. This scenario is especially likely when meta-omics association studies are conducted across one or more environmental gradients, which is often the case.29 Moreover, for every additional co-occurring strain there are again new sets of possible genuine but non-causal associations that could emerge by “intracellular hitchhiking” and “intercellular facilitation”, thus leading to even larger numbers of genuine but non-causal associations.
While the above arguments may appear pessimistic, we presented these arguments as if only one microbial strain carries G1, and therefore only one strain is responsible for a particular contaminant biotransformation. This may not be the typical case, and instead a large number of different strains might carry G1 and contribute to that particular contaminant biotransformation. If G1 were widely distributed among different strains (i.e., if there were many strains that carry G1), then this could prevent the emergence of some genuine but non-causal associations. For example, consider intracellular hitchhiking. If many strains carry G1, but carry somewhat different compositions of G2 to Gn, then this could weaken or prevent the emergence of genuine but non-causal associations with any particular member of G2 to Gn. Therefore, it remains unclear, and most likely depends on the functions examined, how pervasive genuine but non-causal associations may be when using meta-omics association studies.
To test for evidence of pervasive non-causal associations, we examined data from our own recent research on contaminant biotransformations by activated sludge communities. We performed a meta-transcriptome association study where we used readily available sequencing methodologies to quantify the associations between the abundances of 5200 different transcripts and the biotransformation rate constants for atenolol among ten different wastewater treatment plant (WWTP) communities. All of the original data have been published elsewhere29–31 and are publicly available (MG-RAST project number 6015 using the SEED subsystems database and an e-value cutoff of 10⁻⁵). We reasoned that, if the three general scenarios described for Limitation 1 are pervasive, then the distribution of significant associations should be skewed towards positive associations (i.e., all three of the general scenarios generate genuine but non-causal positive associations). In contrast, if the three general scenarios described for Limitation 1 are no more pervasive than scenarios that could generate negative associations, then the distribution of significant associations should be centered about zero (i.e., there should be an approximately equal number of positive and negative associations). Indeed, we observed data that are consistent with the former expectation. The distribution of correlation coefficients with the biotransformation rate constants for atenolol showed a clear bias towards positive values (Fig. 2A) and the mean value of 0.16 was significantly greater than zero (P < 10⁻¹⁶; one-tailed, one-sample Student's t-test). Moreover, when we randomized the biotransformation rate constants of atenolol across the ten WWTPs and re-calculated the correlation coefficients, the distribution of correlation coefficients was centered about zero (Fig. 2B) and the mean was not significantly different from zero (P > 0.05; two-tailed, one-sample Student's t-test).
These outcomes therefore provide support that the three general scenarios described for Limitation 1 are of potential concern and may indeed generate significant numbers of genuine but non-causal associations.
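The analysis described above can be sketched in a few lines. This is a minimal illustration on synthetic data, not the published workflow: the abundance matrix and rate constants are randomly generated, the choice of the Pearson coefficient is an assumption (the type of correlation coefficient is not specified here), and all function and variable names are hypothetical.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_distribution(abundances, rates):
    """One correlation coefficient per transcript (row) against the rates."""
    return [pearson(row, rates) for row in abundances]

def one_sample_t(values):
    """Mean and one-sample t statistic for the null hypothesis mean = 0."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, mean / math.sqrt(var / n)

# Synthetic stand-in for the real data: 10 communities, 500 transcripts.
rng = random.Random(0)
n_plants, n_transcripts = 10, 500
rates = [rng.uniform(0.1, 1.0) for _ in range(n_plants)]
# Assumption: every transcript carries a weak signal tied to the rate plus noise.
abundances = [[r + rng.gauss(0.0, 0.5) for r in rates] for _ in range(n_transcripts)]

observed = correlation_distribution(abundances, rates)
mean_obs, t_obs = one_sample_t(observed)

# Randomization control: shuffle the rate constants across communities.
shuffled_rates = rates[:]
rng.shuffle(shuffled_rates)
randomized = correlation_distribution(abundances, shuffled_rates)
mean_rand, t_rand = one_sample_t(randomized)

print("observed mean r:", round(mean_obs, 3), "t:", round(t_obs, 1))
print("randomized mean r:", round(mean_rand, 3), "t:", round(t_rand, 1))
```

Because the synthetic transcripts all share the same rate-linked signal, a single shuffle can still leave a nonzero mean by chance; repeating the shuffle many times gives a more stable null distribution.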
One general scenario is uncontrolled biological variation (Fig. 3A). As an illustrative example, consider a situation where there are two variants of the enzyme G1 (designated G1a and G1b) that catalyze a particular contaminant biotransformation, but each variant is expressed preferentially in different microbial communities (Fig. 3A). If the catalytic activities of G1a and G1b were identical, then we would expect an association between the total abundance of G1 (i.e., the sum of G1a and G1b) and the rate of that particular biotransformation among the different microbial communities (Fig. 3B; the relationships are depicted as linear for simplicity, but could be of any monotonic form). However, if the catalytic activity of G1a were greater than that of G1b, then the association between the total abundance of G1 and the rate of that particular biotransformation may weaken or, in an extreme case, disappear (Fig. 3B; although community B expresses large numbers of G1b, it has a low biotransformation rate because of the poor catalytic activity of G1b). Such a scenario is biologically plausible, as different variants of the same class of enzymes can have surprisingly different catalytic activities.32
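A small numerical example may make this scenario concrete. The sketch below uses invented abundances for G1a and G1b in five hypothetical communities and assumes that the biotransformation rate is a weighted sum of the two variants; none of the numbers come from measured data.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical abundances of the two enzyme variants in five communities.
g1a = [2.0, 8.0, 4.0, 6.0, 10.0]
g1b = [20.0, 2.0, 12.0, 4.0, 10.0]
g1_total = [a + b for a, b in zip(g1a, g1b)]

def rates(activity_a, activity_b):
    """Assumed rate model: each variant contributes abundance x activity."""
    return [activity_a * a + activity_b * b for a, b in zip(g1a, g1b)]

# Identical catalytic activities: total G1 abundance tracks the rate.
r_equal = pearson(g1_total, rates(1.0, 1.0))

# G1b is assumed to be 20-fold less active than G1a: the association collapses.
r_unequal = pearson(g1_total, rates(1.0, 0.05))

print("equal activities:", round(r_equal, 2))
print("unequal activities:", round(r_unequal, 2))
```

With these invented numbers the association between total G1 abundance and the rate is perfect when the variants are equally active and largely disappears when they are not, even though G1 causes the biotransformation in both cases.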
A second general scenario is that the abundance of the catalytic enzyme does not determine the rate of a particular contaminant biotransformation (Fig. 3C). Instead, other factors may determine the rate of that particular biotransformation. For example, the rate might be determined by the accumulation of metabolic intermediates within the cell that repress the activity of the catalytic enzyme (i.e., product inhibition).24 In this case, the rate might be determined by the abundance of downstream enzymes that consume the intermediates (Fig. 3C, enzyme G2). Alternatively, the rate might be determined by the availability of co-factors required for enzyme activity33 or by the transport of the contaminant into the cell.34 For all of these cases, the abundance of the genes or gene products for the catalytic enzyme may not associate with the rate of that particular biotransformation (Fig. 3D), regardless of the fact that the catalytic enzyme causes that particular biotransformation.
Finally, a third general scenario is that proportional relationships might not exist between different levels of genetic information processing, enzyme synthesis, and enzyme activity. A wide range of transcriptional, translational, and post-translational regulation mechanisms are known that may prevent the number of genes, transcripts, or enzymes from associating with enzyme activities.35 In other words, two communities with identical abundances of a particular gene or enzyme might nevertheless have substantially different enzyme activities. In extreme cases, these regulatory mechanisms could completely prevent an association from emerging between the abundances of genes or gene products and enzyme activities.
The conventional approach to address this problem is to adjust the required significance level for multiple hypothesis testing. The simplest (but among the least powerful) method is the Bonferroni correction, which controls the family-wise error rate.36 As an illustrative example, assume that we want to test each individual hypothesis at a significance level of 0.05. In order to maintain this individual significance level after multiple hypothesis testing, we would define an effective required significance level as the desired significance level for an individual hypothesis test divided by the number of hypotheses tested. Thus, if the desired significance level for an individual hypothesis test is 0.05, then the effective required significance level is 0.05/5200 or 9.6 × 10⁻⁶.
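The arithmetic of the Bonferroni correction is simple enough to express directly. In the sketch below, the function name is hypothetical; the significance level and the number of hypotheses are the values used in the text.

```python
def bonferroni_threshold(alpha, n_tests):
    """Effective per-test significance level under the Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests

# Values from the text: significance level 0.05 and 5200 transcripts tested.
threshold = bonferroni_threshold(0.05, 5200)
print(f"{threshold:.1e}")  # prints 9.6e-06
```

An individual association is then declared significant only if its P-value is at or below this threshold.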
Unfortunately, most meta-omics association studies with microbial communities do not analyze sufficient numbers of independent samples (designated as n) to obtain P-values that are equal to or smaller than this value. As a concrete example, we measured the correlation coefficients between the abundances of each of the 5200 transcripts from our previous study and the rate of ammonia removal (available for nine of the ten activated sludge communities37). In this case, we had prior knowledge that the abundance of ammonia monooxygenase transcripts causally associated with the rate of ammonia removal.30 Given this prior knowledge, we asked the following question: for the association between the number of ammonia monooxygenase transcripts and the rate of ammonia removal, how many independent activated sludge metatranscriptomes (n) would we have had to sequence in order for the correlation coefficient to be significant after accounting for multiple hypothesis testing? We can readily estimate this because the P-value depends solely on the magnitude of the correlation coefficient and n. We specified the desired P-value at 9.6 × 10⁻⁶ and measured the magnitude of the correlation coefficient (rho = 0.78, unpublished data), thus leaving n as the only unknown variable. We found that n = 24, which means that we would have had to sequence at least 24 activated sludge metatranscriptomes for the correlation coefficient, and thus the known causal association, to be statistically significant. While sequencing the metatranscriptomes of 24 activated sludge communities is within the capabilities of some environmental microbiology laboratories, it far exceeds the amount of sequencing that is typically generated for most studies in the field. If this level of sequencing is not accessible, then studies must rely more heavily on careful experimental design, sample selection, and data processing to maximize the accuracy of quantifications, and thus generate stronger associations.
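The sample-size estimate can be reproduced approximately as follows. This is a sketch under stated assumptions rather than the exact calculation behind the value above: it assumes the common t-approximation for the significance of a correlation coefficient, t = r·sqrt((n − 2)/(1 − r²)) with n − 2 degrees of freedom, and a two-tailed test. The function names are hypothetical, and the result may differ somewhat from the reported n if a different test or tail convention was used.

```python
import math

def t_pdf(x, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def two_tailed_p(r, n, steps=20000):
    """Two-tailed P-value for correlation r from n samples (t-approximation).

    The t density is integrated from 0 to t with Simpson's rule
    (steps must be even), and the tail area is obtained by subtraction.
    """
    df = n - 2
    t = abs(r) * math.sqrt(df / (1 - r * r))
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return max(0.0, 1 - 2 * area)

def min_samples(r, alpha, max_n=1000):
    """Smallest n at which correlation r reaches P <= alpha."""
    for n in range(3, max_n + 1):
        if two_tailed_p(r, n) <= alpha:
            return n
    raise ValueError("no n up to max_n reaches the requested significance")

# Values from the text: rho = 0.78 and a Bonferroni threshold of 0.05/5200.
alpha = 0.05 / 5200
n_required = min_samples(0.78, alpha)
print("samples required:", n_required)
```

The same function can be used in reverse during study design: given the number of communities that can realistically be sequenced, it indicates how strong a correlation must be to survive the correction.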
In summary, our own data indicate that, despite the above limitations, meta-omics association studies might indeed allow us to uncover candidate genes or gene products that are likely to cause or promote specific micropollutant biotransformations. If combined with rational approaches to limit the number of candidate genes, e.g., based on a comparison of reaction similarity with known enzymatic reactions38,39 to limit the number of hypotheses that are tested, we believe that meta-omics association studies are a promising approach to understand and predict variability in contaminant biotransformation performance among different microbial communities.
This journal is © The Royal Society of Chemistry 2015