Huange
Wang
*a,
Joao
Paulo
a,
Willem
Kruijer
a,
Martin
Boer
a,
Hans
Jansen
a,
Yury
Tikunov
b,
Björn
Usadel
c,
Sjaak
van Heusden
b,
Arnaud
Bovy
b and
Fred
van Eeuwijk
a
aBiometris, Wageningen University and Research Centre, PO Box 16, 6700AA Wageningen, The Netherlands. E-mail: hw428@cam.ac.uk
bPlant Research International, PO Box 386, 6700AJ Wageningen, The Netherlands
cInstitute for Biology I, RWTH Aachen University, Worringer Weg 3, 52074 Aachen, Germany
First published on 2nd September 2015
Modeling genotype–phenotype relationships is a central objective in plant genetics and breeding. Commonly, variations in phenotypic traits are modeled directly in relation to variations at the DNA level, regardless of intermediate levels of biological variation. Here we present an integrative method for the simultaneous modeling of a set of multilevel phenotypic responses to variations at the DNA level. More specifically, for ripe tomato fruits, we use Gaussian graphical models and causal inference techniques to learn the dependencies of 24 sensory traits on 29 metabolites and the dependencies of those sensory and metabolic traits on 21 QTLs. The inferred dependency network which, though not essentially representing biological pathways, suggests how the effects of allele substitutions propagate through multilevel phenotypes. Such simultaneous study of the underlying genetic architecture and multifactorial interactions is expected to enhance the prediction and manipulation of complex traits.
Interactions between and within multilevel phenotypic and omics traits can be learnt by probabilistic graphical models (PGMs), which typically unravel probabilistic conditional independence structures of multiple variables. A particular type of PGMs, namely Gaussian graphical models (GGMs, also known as “covariance selection” or “concentration graph” models),1 has become popular in computational systems biology. GGMs are claimed to be superior to the well-known correlation networks (also called “relevance networks”), as they are based on partial correlations and thereby distinguish between direct and indirect associations.2
The metabolome is of great importance in crop plants, as metabolite concentrations reflect the developmental stage of plants and determine to a great extent many quality traits such as nutritional value and sensory attributes. Recent advances in plant metabolite profiling, including gas chromatography-mass spectrometry (GC-MS), liquid chromatography-mass spectrometry (LC-MS) and nuclear magnetic resonance (NMR), have enabled large-scale analyses that reveal quantitative variation in the metabolic content of various species.3 Accordingly, it has become feasible to investigate associations between metabolites.
Beyond associations, dependencies among metabolites are of interest to plant biologists for understanding adaptation and survival in relation to primary and secondary metabolism. The metabolome is recognized as a highly interactive system, where a metabolite variation may lead to a chain reaction: changes in the concentration of a metabolite alter the concentrations of some other metabolites through specific regulatory pathways. A few methods have been presented to uncover dependencies among associated traits, using previously determined QTLs.4–9 All these approaches require at least one unique QTL for each trait studied. In practice, however, this prerequisite is often not satisfied. To cope with more general scenarios where some of the traits come without QTL or unique QTLs, a QTL + phenotype supervised orientation (QPSO) algorithm was recently proposed.10 This algorithm looks promising for learning dependencies between metabolites whose profiling is still expensive and time-consuming, so that sample sizes are typically small and the power to detect QTLs is subsequently limited.
In this paper, we combine three GGM approaches with the QPSO algorithm to model genotype–phenotype relationships with consideration for the intermediate metabolite variations. Our integrative method is demonstrated through a practical case study, in which we obtain a dependency network involving 24 sensory traits, 29 metabolites and 21 QTLs identified for those sensory traits and metabolites in ripe tomato fruits. In the first place, a high-confidence true positive undirected network, which represents direct associations within and between metabolites and sensory traits, is learnt by the three GGM approaches including: (i) lasso-based neighborhood selection11 (LBNS) in combination with a stability approach to regularization selection12 (StARS), (ii) the PC-skeleton algorithm13 and (iii) the lasso14 in combination with stability selection15 (SS). In the second place, given the undirected network and QTLs previously identified for the sensory traits and metabolites, edge directions (i.e., the directions of associations) are inferred by the QPSO algorithm. In the third place, each sensory trait and metabolite is regressed on its QTLs and inferred parent nodes (i.e., nodes with outgoing edges pointing to this sensory trait or metabolite). The fitted regression coefficients provide more details regarding the estimated dependencies: “+” – positive, “−” – negative, and their absolutes values – the strength of dependencies.
It is known that tomato sensory traits are co-determined by metabolites.16–18 A major concern of plant breeders and physiologists is, thus, how to first identify metabolites and corresponding genomic regions that are responsible for variations in sensory traits, and next develop a strategy for the genetic improvement of certain sensory traits jointly. Our proposed method provides a way to investigate the dependencies within and between metabolites and sensory traits. The estimated dependencies which, though not equal to biological pathways, suggest how the effects of allele substitutions propagate through metabolites to sensory traits. This information should help breeders and physiologists to predict and manipulate the target sensory traits.
On all plants, 29 metabolites and 24 sensory traits were scored on ripe fruits, which were harvested and prepared as described in ref. 19. Metabolic profiling was carried out in two ways: volatiles were measured using a head space Solid Phase Microextraction-Gas Chromatography-Mass Spectrometry (SPME-GC-MS);19 sugars and acids were quantified using the method of GC-MS of trimethylsilyl ester derivatives.20 All metabolites were identified at level 1 annotation21 using authentic chemical standards analyzed at identical experimental conditions, except beta-damascenone, which has a level 2 identity: NIST mass spectral library 2010 (Mainlib) match 911 (0-1000) and the library retention index deviation of 4 (http://www.nist.gov/srd/nist1a.cfm). All metabolites have corresponding CAS ID numbers (Sheet 1 of Supplementary Dataset S1, ESI†). Sensory profiles were obtained by a trained panel of judges for just 16 out of the 48 genotypes for each cherry × round population. The judges evaluated each genotype for a set of sensory traits including smell, taste, aftertaste, and mouthfeel experience. All sensory attributes were scored on a scale of 0 to 100. In addition to the metabolites and sensory traits, brix was measured for each genotype using a refractometer (GMK-701R; Nie-Co Products, Aalsmeer, NL). Metabolite abundances were transformed to log10 scale for statistical analysis. Prior to network reconstruction, genotypic means for the sensory traits, brix and metabolites were calculated using mixed models, which contained corrections for measurement time (for brix and metabolites), judge (for the sensory traits), population (for all traits) and the presence/absence of the Rin mutation (for all traits). Rin is the recessive ripening-inhibitor mutation that inhibits ripening,22 and was present in all crosses involving parent R075. Fruits from plants that are homozygous for Rin do not ripen and have lower concentrations of metabolites. The corrected genotypic means (Sheet 2–4 of Supplementary Dataset S1, ESI†) were used for further analysis.
A multi-trait QTL mapping strategy was implemented following the idea described in ref. 23 and 24. This strategy assumes that a single biparental offspring population was present. We turned the four cherry × round F2 populations into a single biparental F2 population by interpreting the two cherry parents to represent a first single parent and the two round tomato parents to represent a second single parent. Phenotypes were then regressed on genetic predictors, i.e. independent variables expressing molecular marker information. Genetic predictors were based on the expected number of alleles coming from the round parents, i.e. conditional QTL probabilities given flanking marker information using a Hidden Markov model.25 Parametrization was such that positive regression coefficients, QTL allele substitution effects, would point to the round allele as increasing the level of the trait, whereas negative QTL effects would imply that the cherry allele increased the trait. In comparison to ref. 23 and 24, for the current multi-trait QTL model we took care to allow for population specific intercepts for each trait. Another deviation from ref. 23 and 24 was that we included a trait specific correction for the presence/absence of the Rin mutation. Our multi-trait QTL model for a vector of trait responses was therefore as follows: traits = population specific trait intercepts + trait specific RIN corrections + trait specific QTLs + trait specific residuals. The trait specific residuals were modeled with trait specific variances and correlations. Multi-trait QTL models were fitted on each of three groups of traits: (1) volatiles; (2) sugars and acids; (3) sensory attributes. The multi-trait QTL modeling was done in GenStat 16 (http://www.vsni.co.uk/software/genstat/). Positions of QTLs identified for the traits studied are summarized in Table 1, and the QTL data are provided in the two spreadsheets of Supplementary Dataset S2 (ESI†).
SNP/QTL | Chromosome | Position | Metabolite | SNP/QTL | Chromosome | Position | Metabolites | SNP/QTL | Chromosome | Position | Sensory traits |
---|---|---|---|---|---|---|---|---|---|---|---|
rs8314 | 1 | 10.35 | cis-3-Hexenol | rs2050 | 5 | 39.26 | 6-Methyl-5-hepten-2-one | rs7448 | 2 | 63.5 | Scent_aromaintensity |
rs6454 | 1 | 128.97 | trans-2-Heptenal | rs2050 | 5 | 39.26 | Geranylacetone | rs7448 | 2 | 63.5 | Scent_tomato |
rs6454 | 1 | 128.97 | alpha-Terpineol | rs2050 | 5 | 39.26 | Aspartic acid | rs7448 | 2 | 63.5 | Taste_sweet |
rs6687 | 2 | 76.3 | cis-3-Hexenol | rs2050 | 5 | 39.26 | Citric acid | rs7448 | 2 | 63.5 | Taste_tomato |
rs6687 | 2 | 76.3 | myo-Inositol | rs2050 | 5 | 39.26 | Malic acid | rs7448 | 2 | 63.5 | Taste_unripe |
rs6687 | 2 | 76.3 | 3-Methylbutanal | rs2050 | 5 | 39.26 | Fructose | rs7448 | 2 | 63.5 | Taste_spicy |
rs6687 | 2 | 76.3 | Citric acid | rs2050 | 5 | 39.26 | Glucose | rs7448 | 2 | 63.5 | Aftertaste_sweet |
rs6687 | 2 | 76.3 | Glucose | rs4202 | 6 | 4.03 | Glutamic acid | rs8396 | 3 | 103.79 | Mouthfeel_moist |
rs6687 | 2 | 76.3 | Fructose | rs4202 | 6 | 4.03 | Sucrose | rs8396 | 3 | 103.79 | Mouthfeel_solid |
rs6687 | 2 | 76.3 | Sucrose | rs4202 | 6 | 4.03 | Brix | rs8396 | 3 | 103.79 | Aftertaste_rough |
rs6687 | 2 | 76.3 | Brix | rs3540 | 6 | 14.14 | cis-3-Hexenal | rs7775 | 7 | 56.23 | Scent_smoky |
rs6691 | 4 | 55.68 | 1-Penten-3-one | rs3540 | 6 | 14.14 | 1-Penten-3-one | rs7775 | 7 | 56.23 | Aftertaste_sweet |
rs6691 | 4 | 55.68 | beta-Damascenone | rs3540 | 6 | 14.14 | Hexanal | rs8016 | 8 | 11.04 | Scent_smoky |
rs7153 | 4 | 86.64 | Sucrose | rs3540 | 6 | 14.14 | Benzyl alcohol | rs8591 | 8 | 58.98 | Taste_pungent |
rs7153 | 4 | 86.64 | Malic acid | rs6254 | 6 | 47.72 | beta-Ionone | rs7089 | 10 | 7.73 | Scent_aromaintensity |
rs7153 | 4 | 86.64 | Aspartic acid | rs6254 | 6 | 47.72 | 6-Methyl-5-hepten-2-one | rs7089 | 10 | 7.73 | Scent_sweet |
rs7153 | 4 | 86.64 | Citric acid | rs6254 | 6 | 47.72 | Methyl_salicylate | rs7089 | 10 | 7.73 | Taste_sour |
rs7153 | 4 | 86.64 | Glutamic acid | rs6254 | 6 | 47.72 | Hexanal | rs7089 | 10 | 7.73 | Aftertaste_sour |
rs2050 | 5 | 39.26 | Methyl_salicylate | rs8092 | 6 | 56.64 | Citric acid | rs7089 | 10 | 7.73 | Aftertaste_salty |
rs2050 | 5 | 39.26 | Phenylethyl_alcohol | rs8092 | 6 | 56.64 | Malic acid | rs8434 | 11 | 25.94 | Scent_spicy |
rs2050 | 5 | 39.26 | cis-3-Hexenal | rs9061 | 9 | 87.48 | Methyl_salicylate | rs8434 | 11 | 25.94 | Taste_unripe |
rs2050 | 5 | 39.26 | 3-Methylbutanal | rs9061 | 9 | 87.48 | Guaiacol | rs8434 | 11 | 25.94 | Mouthfeel_solid |
rs2050 | 5 | 39.26 | 1-Penten-3-one | rs9084 | 9 | 94.11 | Citric acid | ||||
rs2050 | 5 | 39.26 | beta-Ionone | rs9084 | 9 | 94.11 | Malic acid | ||||
rs2050 | 5 | 39.26 | 3-Methyl-1-butanol | rs6903 | 11 | 86.04 | Phenylethyl_alcohol | ||||
rs2050 | 5 | 39.26 | 2-Methyl-1-butanol | rs6903 | 11 | 86.04 | Benzeneacetaldehyde | ||||
rs2050 | 5 | 39.26 | 2-Isobutylthiazole | rs6495 | 12 | 26.59 | beta-Damascenone |
![]() | ||
Fig. 1 A schematic diagram of the proposed integrative method for learning dependency networks from the sensory, metabolic and QTL data. |
![]() | ||
Fig. 3 A dependency network of 24 sensory traits and 7 QTLs detected in ripe tomatoes. Red edges connect QTLs to their target traits; green edges represent the dependencies between sensory traits. Line style and thickness are determined by the fitted coefficients of each sensory trait being regressed on its QTLs and inferred parent nodes. Solid and dashed linestyle schemes are identical to those in Fig. 2. |
![]() | ||
Fig. 4 A dependency network of brix, 29 metabolites, 24 sensory traits and 21 QTLs detected in ripe tomatoes. Black edges represent dependencies of sensory trait on metabolites (via brix). Red, blue and green edges, together with their linestyle and thickness are identical to those in Fig. 2 and 3. |
A major challenge when applying lasso-based approaches to graphical modeling is to specify the regularization parameter that controls the sparsity of the resulting graph: larger amounts yield sparser graphs whereas smaller amounts lead to denser graphs. To come up with a general solution that is especially suited to high-dimensional problems, Liu et al. proposed StARS: a stability approach to regularization selection.12 StARS implements subsampling26 to draw a finite number of subsamples (overlapping subsamples are allowed) and constructs a GGM for each subsample. StARS starts with a strong regularization and gradually reduces it until the resulting graphs are simultaneously sparse and replicable across all subsamples. An implementation of LBNS in combination with StARS is available in the R package ‘huge’, which involves a variability threshold with two alternatives 0.1 and 0.05.27 Application of the two thresholds to both metabolic and sensory data suggested that 0.1 would be a better choice in this study (see Section 5.2 for details).
Fig. 2 indicates a separation between primary and secondary metabolism, i.e., sugars and acids on the left whereas volatiles on the right. Further, (1) sugars and a sugar alcohol, myo-inositol, were grouped together; (2) acids were gathered and linked to sugars; (3) most volatiles interacted, and a few of them were connected with sugars/acids.
Metabolic profiling of ripe tomatoes was carried out at single time points after harvest, that is, it did not produce time series data. The dependency network (Fig. 2) learnt from non-sequential metabolic data cannot be interpreted as a metabolic pathway; instead, it represented directed associations at the level of mean metabolite abundances. These dependencies, though essentially different from pathways, still provide hints on how the effects of allele substitutions propagate through a set of metabolites. For example, genotypic changes at locus rs6691 shall alter the concentration of 1-penten-3-one. This will probably subsequently affect the concentrations of beta-ionone, cis-3-hexenal and aspartic acid. Conversely, variations in the concentration of 1-penten-3-one are unlikely to affect the concentration of trans-2-hexenal, since trans-2-hexenal was found a parent node of 1-penten-3-one in the dependence network.
A better understanding of the dependence structure underlying multiple traits contributes to a better manipulation of those traits. Assume we want to regulate the concentration of beta-ionone, we should control genotypes at loci rs6691 and rs3540 in addition to those at rs2050 and rs6254. The reason is that any allele substitution leading to a change in the concentration of 1-penten-3-one might then alter the concentration of beta-ionone.
Fig. 3 is helpful to predict the simultaneous influence of various allele substitutions on multiple sensory traits. Assume that a genotypic change at locus rs7448 raises the level of scent_tomato. Accordingly there might be a decrease in scent_smoky, and further, an increase in scent_sweet. However, to finely predict one or more phenotypes, a comprehensive consideration of multiple allele substitutions is usually required. For instance, an increase in scent_tomato is not necessarily coupled with a decrease in scent_smoky. This is because apart from QTL rs7448, which had direct negative effect on scent_tomato and, subsequently, indirect positive influence on scent_smoky, scent_smoky was found also being regulated by another two QTLs rs7775 and rs8016. Analogously, scent_sweet was directly or indirectly determined by 5 QTLs, including rs7089, rs8434, rs7448, rs7775 and rs8016.
In addition to the aforementioned positive directed associations between metabolites and sensory traits, three negative dependencies were respectively found between eugenol and aftertaste_fresh, 2-methyl-1-butanol and aftertaste_chemical, as well as 2-methyl-1-butanol and aftertaste_sweet. The latter two are in agreement with the fact that 2-methyl-1-butanol is often found in fruits (NCBI PubChem) and that it seems to improve or partially impart an Italico-cheese flavor (US 3978242 A), which would not be perceived as a chemical taste but rather associated with natural products.
By taking into account the directed associations from metabolites to sensory traits, we were able to get a more realistic estimation of the dependence structure underlying those sensory traits. An example is that in Fig. 3 aftertaste_sour is present as a parent node of taste_sour, while in Fig. 4 a reversed dependency, which seems more logical, is achieved simply because an additional determinant citric_acid has been introduced to taste_sour.
There was a separation between primary and secondary metabolites. This of course makes sense considering the structural function of primary metabolites and the auxiliary function of secondary metabolites. Interestingly, within the primary metabolites, sucrose was the parent of fructose which in turn was the parent of glucose. This may be due to the enzymatic action of invertase which splits sucrose into glucose and fructose. The direct link between sucrose and glucose was recovered as an indirect one, which is potentially due to the additional action of sucrose synthase utilizing fructose and UDP-glucose.
It is noteworthy that in Fig. 4 fructose and glutamic acid were present as parent nodes of myo-inositol which in turn was the parent of sucrose. Metabolically myo-inositol is synthesized from glucose-6-phosphate viaD-myo-inositol 3-phosphate.36 But since neither glucose-6-phosphate nor D-myo-inositol 3-phosphate were quantified in this study, the network reconstruction and orientation algorithms might have compacted the network. Whilst this leaves the link from glutamic acid unexplained, it seems like a good testable hypothesis for the sugars and the sugar alcohol myo-inositol, which could be further explored.
For glutamic acid a direct and strong influence was observed from aspartic acid. Metabolically this might be explained by the enzymatic action of aspartate aminotransferase that converts glutamic acid and oxaloacetate into 2-oxoglutarate and aspartate. Indeed, aspartate aminotransferase has already been implicated in glutamate content in red tomato fruits.37 Comparatively, the impact of malic acid on aspartic acid seems less obvious. That said, an RNAi approach against PEPCK revealed strongly increased aspartic acid levels coinciding with reduced malate levels. However, silencing of NADP-malic enzyme in the same study showed less aspartic acid and somewhat lower malic acid levels in one transgenic line.38
Turning to volatiles as flavor carrying compounds it is obvious that the most-likely carotenoid derived volatiles beta-damascenone39 and beta-ionone40 were linked because of the common precursor beta-carotene. However, the deduced influence of one on the other might only be explained by hidden variables such as the actual enzyme activities and actual carotenoid concentrations not measured here. Also it is intriguing that 6-methyl-5-hepten-2-one and geranylacetone, both being interconnected, were not linked to the former pair of volatiles despite them also being carotenoid volatiles. The different differential behaviors of these two pairs of volatiles were also observed in earlier studies,39 and it has been reported that 6-methyl-5-hepten-2-one likely stems from lycopene.41 We therefore suspect the difference is attributed to distinct precursors. Apart from these carotenoid derived metabolites, the synthesis of phenylethyl alcohol from benzeneacetaldehyde42 was recovered in our analysis.
Regarding the linked metabolites 3-methyl-1-butanol and 3-methylbutanal, they are most likely leucine derived, whilst the associated 2-methyl-1-butanol likely stems from isoleucine. Also the association between 2-isobutylthiazole and 3-methyl-1-butanol was observed before.19,39 Thus this whole sub-cluster of metabolites is derived from or associated to branched chain amino acids. The current model for the biosynthesis of leucine-derived flavor imparting compounds assumes a decarboxylation to an aldehyde followed by a reduction. The truth, however, is that the alcohols should derive from the aldehydes.
StARS has been shown to outperform the conventional regularization parameter selection methods, including AIC, BIC and cross-validation, in the reconstruction of high-dimensional graphs.12 In view of this, we exploited StARS to set regularization in LBNS. The R package “huge” implements StARS with two optional variability thresholds: 0.1 and 0.05. We tested both thresholds on the metabolic and sensory data separately, and found that 0.05 led to a bit sparser graph than 0.01 (Fig. S1A vs. S1B, Fig. S2A vs. S2B, ESI†). As we aimed to extract a consensus network, the variability threshold of 0.1 was then used in StARS to ensure that given the same dataset, edges obtained by LBNS can overlap, to a large extent, with those learnt by the PC-skeleton algorithm.
The PC-skeleton algorithm also requires a pre-specified parameter, i.e. the significance level of conditional independence tests. We tested the two most common significance levels, 0.01 and 0.05, on the metabolic and sensory data separately. Results on the same datasets indicated that the significance level of 0.05 recovered a few more edges than the level of 0.01 (Fig. S3A vs. S3B, Fig. S3C vs. S3D, ESI†). Again, to reach as many as possible consensus edges, we took the significance level of 0.05 in this study.
Our strategy, which first overfits an undirected graph by LBNS + StARS and then screens out the unlikely edges by comparison with the outcome of the PC-skeleton algorithm, was also tried on the mixture of metabolic and sensory data. Surprisingly, only a few links between metabolites and sensory traits were discovered by either method (see black edges in Fig. S4A and B, ESI†). After discarding edges unique to one graph, we were left with merely eight common links (see the boldfaced black edges in Fig. S4A or B, ESI†). This implies that the above strategy, when being used to decipher the relationships between substances of different nature, is very likely to produce an underfitted graph. We then tried a third method, i.e. regressing every sensory trait on the metabolites by the lasso + SS, to get the directed graph in Fig. S4C (ESI†). To draw safe conclusions but without losing too much useful information, we extracted those edges that appeared between metabolites and sensory traits at least twice over Fig. S4A–C (ESI†). Finally, 12 edges satisfying this criterion were reported (see black edges in Fig. 4).
We have identified a total of 21 QTLs for the 29 metabolites and 24 sensory traits. Most of the QTLs were found to have pleiotropic effects; in particular, a few of them, such as rs2050, rs6687, rs7089 and rs7448, served as hubs in the resulting dependency network (Fig. 4). A particularly noteworthy phenomenon was that a number of directed triangles appeared in Fig. 4, especially around the hubs. One may doubt whether the QTL really affects so many traits? Does its impact on a downstream trait actually pass through the upstream traits? Moreover, will two directly associated traits become independent of each other given their common QTL? A possible solution to these detailed questions is the triad analysis, which aims at identifying causal relationships in configurations consisting of two traits and one QTL.44,45
Though both SS and StARS can choose a proper regularization for high-dimensional sparse linear regression, they are essentially different. Given the same training dataset, StARS tolerates false positives (false edges in the reconstructed graph) but not false negatives (true edges absent in the reconstructed graph) and thus leads to a dense graph with high recall but relatively low precision (in the context of graphical modeling, recall refers to the fraction of true edges that are recovered in the resulting graph; precision refers to the fraction of recovered edges that are actually true); SS, contrariwise, allows false negatives but not false positives and therefore results in a sparse graph with high precision but comparatively low recall.
Footnote |
† Electronic supplementary information (ESI) available. See DOI: 10.1039/c5mb00477b |
This journal is © The Royal Society of Chemistry 2015 |