Simeone
Zomer
a,
Sarah J.
Dixon
a,
Yun
Xu
a,
Susanne P.
Jensen
b,
Huitu
Wang
c,
Clare V.
Lanyon
d,
Anthony G.
O'Donnell
d,
Anthony S.
Clare
c,
L. Morris
Gosling
b,
Dustin J.
Penn
e and
Richard G.
Brereton
*a
aCentre for Chemometrics, School of Chemistry, University of Bristol, Cantocks Close, Bristol, UK BS8 1TS
bEvolutionary Biology Group, University of Newcastle upon Tyne, Newcastle upon Tyne, UK NE2 4HH
cSchool of Marine Science and Technology, Ridley Building, Newcastle University, Newcastle upon Tyne, UK NE1 7RU
dSchool of Biology and Psychology, Division of Biology, IRES, University of Newcastle upon Tyne, Newcastle upon Tyne, UK NE1 7RU
eKonrad Lorenz Institute for Ethology, Austrian Academy of Sciences, Savoyenstr. 1a, A-1160 Vienna, Austria
First published on 19th November 2008
House mice (Mus domesticus) communicate using scent-marks, and the chemical and microbial composition of these ‘extended phenotypes’ are both influenced by genetics. This study examined how the genes of the major histocompatibility complex (MHC) and background genes influence the volatile compounds (analysed with Gas Chromatography Mass Spectrometry or GC/MS) and microbial communities (analysed using Denaturating Gradient Gel Electrophoresis or DGGE) in scent-marks produced by congenic strains of mice. The use of Consensus Principal Components Analysis is described and shows relationships between the two types of fingerprints (GC/MS and DGGE profiles). Classification methods including Support Vector Machines and Discriminant Partial Least Squares suggest that mice can be classified according to both background strain and MHC-haplotype. As expected, the differences among the mice were much greater between strains that vary at both MHC and background loci than the congenics, which differ only at the MHC. These results indicate that the volatiles in scent-marks provide information about genetic similarity of the mice, and support the idea that the production of these genetically determined volatiles is influenced by commensal microflora. This paper describes the application of consensus methods to relate two blocks of analytical data.
(1) Can we distinguish (and predict) the genetic strain of a mouse, including strains that differ genetically only at the MHC locus (‘MHC congenics’), using the volatile compounds in scent-marks measured by Gas Chromatography Mass Spectrometry or GC/MS?
(2) Can we also distinguish mouse strains using the microbial communities in scent-marks, using molecular genetic profiles (Denaturating Gradient Gel Electrophoresis or DGGE)?
(3) How does the MHC compare to the rest of the genome (background genes) in its influence on urinary volatiles (GC/MS) and microflora (DGGE)?
(4) Is there a relationship between the pattern of volatiles (GC/MS) and pattern of microflora (DGGE)?
Mouse scent-marks are likely to be multivariate in nature, in that no one single compound is uniquely responsible for the signal – rather like recognising a human face – it is a combination of features that allow for unique recognition. Chemometrics approaches work on multivariate models, that is models taking into account the correlation between the variables, and as such have an important role to play in biology where there are likely to be interactions between variables.
In this study, samples of scent-marks were obtained by inserting glass slides into mouse cages containing a male mouse, which are usually scent-marked by individual mice. This is a form of olfactory communication that provides information about signaller quality to competitors and mates.16 From these samples of scent-marks, DNA was extracted and the microbial community was genetically fingerprinted using DGGE (16S rRNA gene) to obtain a profile of the commensal microflora,14 and the chemical profile was analysed using GC/MS after sampling with solid phase microextraction (SPME).
The data analysis procedure began with a feature extraction process where raw microbiological and chemical data were summarized in two distinct data matrices.
Each matrix is composed of sample vectors whose elements identify a set of variables common to all samples in the population, which correspond to the relative intensity of gel bands (DGGE) and the areas of common chromatographic peaks (GC/MS). Following feature extraction, a series of processing steps including scaling, normalization and standardisation are applied in order to maximize the interpretability of the information embedded in the matrices. The last part of the procedure involves a variety of multivariate pattern recognition techniques to extract the information of interest from the processed data with which to answer the relevant questions. The main trends emerging from the two blocks are investigated using Consensus Principal Components Analysis (C-PCA)17–19 that identifies the common directions of variation and can be posed in relation to factors of interest such as background genetics and MHC-haplotypes of mice. The exploratory analysis is then completed with an estimation of the degree of overlap between the microbiological and chemical profile to verify the microflora hypothesis. Afterwards, other techniques such as Partial Least Squares Discriminant Analysis (PLS-DA)20–24 and Support Vector Machines (SVM)25,26 were applied to determine whether the fingerprint recorded can be used for prediction purposes. Results indicate that the mouse strains emerge clearly in both profiles recorded and that this information can be used to produce predictive models. The strains of the mice contribute to shape both the GC/MS and DGGE profiles, with the whole strain differences having a stronger effect than MHC. Finally the profiles show a certain degree of overlap.
Chemical and microbiological data are both acquired for 34 mice (dataset α) divided in the four groups indicated above (9 S1H1, 9 S1H2, 8 S2H1 and 8 S2H2). In addition, further GC/MS analysis is carried out on another 58 mice (dataset β, 13 S1H1, 11 S1H2, 15 S2H1 and 17 S2H2). These two datasets are used for two different purposes. Dataset α allows an initial exploration of the data, where main trends can be searched for after merging the microbiological and chemical information and can be subsequently posed in relation to the genetics of mice, i.e. their strain as well as MHC-haplotypes. A consensus analysis enables to verify which measure in each block contributes to the patterns in the data and most importantly whether and to what extent, similar trends are reproduced in the microbiological and chemical profile. This may provide clues to verify the possible shaping of the commensal microflora by the MHC-haplotypes.
Dataset β, which consists of more instances, can be used to build up a statistical model to verify whether the chemical profile can be used to predict the genetics of mice. Dataset β will train a classifier that will be tested on the instances of dataset α.
In DGGE, the variables correspond to the intensities of the bands identified along the gel. Feature extraction is straightforward as it requires a digitization of the image of the DGGE gel (Bionumerics gel analysis software, Applied Maths, Sint-Martens-Latem, Belgium) where lanes are aligned relative to internal markers and the intensity of the bands is measured and quantified.30,31
For GC/MS, variables correspond to the areas of the peaks having a common chemical origin in different chromatograms of all samples32,33 to form peak tables. Although faster alternatives are available for data screening (e.g. using Total Ion Currents or Single Ion Chromatograms), this method is preferred because it directly relates to the chemical components in the mixture, hence making feasible the identification of potential marker compounds in a subsequent phase of the analysis. However, the procedure is laborious, particularly because chromatograms may contain several hundreds of compounds per sample, the signal-to-noise ratio may be low, and peaks may partially overlap, so peaks must be compared on the basis of their spectral and retention characteristics. The procedure involves four main stages and the basis of the first three steps is described in detail elsewhere34 and takes advantage of the MS information to align peaks in different chromatograms and form a peak table. The main steps are summarized below.
(1) Smoothing of single ion chromatograms. Noise is smoothed with a DB5 (Daubechies wavelet) wavelet filter and three levels of decomposition, chosen based on the data under investigation: because the data contain a significant amount of high frequency noise which is caused by the column (high bleeding) such a wavelet appeared to be the most effective one with very little distortion.
(2) Determining the position and intensity of peaks. Next the first derivative is calculated using a Savitzky–Golay five-point quadratic filter.35 The profile is then processed using a switch-position loop. A peak start is taken as whenever the first derivative becomes positive. A peak maximum is taken when two sequential first derivative points are negative and either the peak is at least seven datapoints wide or the peak is at least four times higher than the lowest non-zero value in the chromatogram. The second criterion is necessary to detect partially overlapped peaks found on the shoulder of other peaks that are not sufficiently large. A peak end is taken as when two sequential first derivative points are positive. This step is performed on all the mass ions and peaks with same retention time for different ions are then merged together. The combination of their intensity is used to estimate their area while the combination of ions is assigned as the spectrum of the compound. The procedure is then applied on all samples and the dataset matrix is deduced by means of a pair-wise comparison of peaks across samples on the basis of their spectral profile. The intensity of peaks is determined by our previously published method.34
(3) Matching peaks from a common origin. The procedure starts from the first peak of the first sample, looking for peaks within a defined retention time window across the second sample. If a peak with a similar MS is found, the areas are allocated in the same column of the matrix (same variable), otherwise the unmatched peak is regarded as originating from a new compound and is placed in a new column and assigned as a new variable. The procedure is repeated for all peaks detected in all samples. Various further failsafe mechanisms are incorporated into our method.34
(4) Removing rare peaks. Because a number of unique peaks are detected in only a small portion of samples and have low diagnostic power for a class, only those detected in at least four samples are retained for further analysis.
A series of additional processing steps must be applied to make the application of the following pattern recognition techniques for exploration and classification effective. It is desirable to scale the data using a logarithmic transformation in order to reduce the influence of large peaks, and to reflect the fact that intensities are often log-normally distributed. However, an additional problem then relates to the sparseness of XM and XC, as both matrices contain a relatively high proportion of zeros because some bands and peaks are found only in a portion of the samples: we find in many areas of metabolomic profiling that marker peaks are present only in a portion of samples. In order to compute logarithms of the peaks that are either absent or below the detection we replace the zeros with half of the lowest intensity peak detected in the dataset. Next, each sample is normalized along rows so that the sum of variables equals one: this allows maximizing the comparison between samples, removing unwanted effects such as the quantity of material sampled and analysed. Finally data are standardised along columns to attribute to each variable an equal weight.
In the last few years, modifications of the PCA have been introduced to break the descriptors into several blocks to improve the interpretation of the models, one of which is consensus PCA (C-PCA).18,19,38 In our case it is initially of interest to verify what is the ‘consensual’ view obtained from merging the information available from XM and XC and to determine what is the contribution of each of the two profiles. In C-PCA, the data matrix X is broken down in different blocks (XC and XM in our case) and a consensus direction among all the blocks is sought. The method starts by selecting a super-score t that is multiplied on all blocks to give the block loadings. The block loadings are normalized to length one and then multiplied through the respective blocks to give the block scores. The block scores are combined in the super-block T. Then a PCA iteration on T is performed to give the super-weight wT that is normalized to length one and multiplied onto the super-block T to obtain an improved estimate of the super-score t. This is repeated until convergence of t. The super-score T summarizes the main trends between all blocks and block scores TM and TC represent the individual contribution of each block that are related to the former by means of the weight vector w (Fig. 1). The algorithm adopted here is the one originally proposed that incorporates a block-scaling to balance the contribution of XC and XM relative to the number of variables. In addition, to stabilise the algorithm, the super-scores are initialised as suggested18 with the first score vector calculated from the super-matrix determined as follows:
![]() | ||
Fig. 1 Principles of C-PCA. The scores obtained from each data-block t1…tN, are joined into a matrix from which a super-score is extracted to describe the overall trend in the data. |
The first technique used is PLS-DA. SVM is used as an alternative in its basic form (LIN-SVM) and with a radial basis kernel (RBF). Further details about SVMs are discussed elsewhere.25,26,40
The optimal parameters of the model (number of components for PLS-DA, penalty error C for LIN-SVM, C and radial width σ for RBF-SVM) are determined by a leave-one-out procedure: a model with a certain parameter setting (i.e. a given number of PLS components for PLS-DA or kernel parameters, i.e.C and σ, for SVM) is built using 57 of the 58 samples of dataset β and its prediction ability tested on the sample left out, with the procedure repeated for each sample. For PLS-DA, the number of PLS components varied from one to ten and for SVM, C and σ were varied according to a grid search procedure. The parameters of the best model, i.e. the one with the lowest error rate, is then used to build the predictive model on the full dataset β and this model was tested on the 34 samples that compose dataset α.
![]() | ||
Fig. 2 DGGE (XM) matrices obtained by feature extraction. |
![]() | ||
Fig. 3 C-PCA. The scatter plot of super-scores tsup (middle) are derived from the scores tm of block XM (on the left) and the scores tc of the block XC (on the right). Symbols for the four groups SnHn where n = 1 or 2 are defined in the top graph. |
![]() | ||
Fig. 4 Size of the super-scores tsup (top) and weights of the block scores (bottom) tm (grey) and tc (black). |
Whereas supervised feature selection could be performed to obtain better apparent graphical separation between the groups, such supervised methods can lead to overfitting of the data and are only legitimate if one is sure in advance that there will be a separation. It can be shown41 that using a completely random dataset, by choosing the variables that appear most diagnostic, apparently good separation can be obtained. In hypothesis-based biology, it is often dangerous to perform supervised feature selection prior to graphical visualisation of results, although legitimate to simplify the data using criteria unrelated to the hypothesis being studied; for example, we remove rare variables from the dataset that occur in less than four samples.
Summarizing, the results indicate that the genetic strain of the mice influences the composition of the commensal microbiota from scent-marks as well as the volatile compounds, with a stronger trend associating to the strain membership and a weaker but clearly visible trend associating with MHC-haplotype. Nevertheless, the weights (Fig. 4, bottom) for the chemical data suggest that a substantial amount of variation is not explained by these factors. This is further confirmed when calculating the amount of overlap between XM and XC, with the RV coefficient having a value of 0.47. The correlation value substantially increases when the blocks are reduced to their essential variability using the scores of the principal component space. Fig. 5 shows where the RV coefficient exceeds a value of 0.65 for each possible combination of the PCs. Each subset combination of PCs is considered because factors of higher order may relate to trends in the data that are unique for a block (e.g. instrumental drift in GC/MS), hence affecting the correlation negatively. It is possible to observe that the first two components are well correlated and PC2 from XM resembles PC2, PC3 and PC4 from XC. The correlation rises to a maximum of 0.84 for the configurations given by the first two PCs of both XM and XC. The matrices XM and XC show a degree of overlap, but also have a substantial variation that is specific to each block. This is expected, particularly for XC, where the matrix is composed by several hundred variables and may result in a richer and more complex pattern. Systematic variations that effect solely either XM or XC may decrease the amount of correlation. Examples may be the occasional presence of contaminants during the sampling phase or instrumental drift. In addition, it can be noticed that some genetically related factors may contribute to the distribution of only one of the two blocks, for example, volatiles that that do not interact with the metabolism of the microflora, or in turn, MHC-induced chemicals that interact with the microflora but that the sampling technique combined with GC/MS is unable to determine.
![]() | ||
Fig. 5 Map of the RV coefficients. |
The identification of potential chemical markers in XC relating to genetics can be readily carried out looking at the block loadings (Fig. 6). In PCA, the loadings represent the weights of the original variables in determining the block scores as illustrated in Fig. 3. Relevant variables for a given group observed in the plot of the scores locate in an analogous axis position in the loadings plot. It was observed that the block scores of the first two components allow a separation between S1 and S2 along the top left–bottom right diagonal and a (milder) separation between H1 and H2 along the bottom left–top right diagonal. Some potential markers, located at the extremes of such directions in the loadings plots are taken and their quantitative distribution examined in Fig. 7. Peak nos. 458 and 74 display a distribution consistent with S1 and S2 respectively, while peak nos. 534 and 528 are related to H1 and H2. The mass spectra of these four compounds are presented in Fig. 8.
![]() | ||
Fig. 6 Block loadings of XC. |
![]() | ||
Fig. 7 Potential markers for MHC and strains in the GC/MS data. |
![]() | ||
Fig. 8 Mass spectra of four potential marker peaks. |
Correct classification (%) | ID misclassified samples | |||||
---|---|---|---|---|---|---|
Training set | Test set | |||||
Strains | Haplotypes | Strains | Haplotypes | Strains | Haplotypes | |
LIN-SVM | 100 | 100 | 97.06 | 88.24 | No. 23 | Nos. 2, 3, 12, 34 |
RBF-SVM | 100 | 100 | 100.00 | 85.29 | N/a | Nos. 2, 3, 4, 12, 34 |
PLS-DA | 92.86 | 100 | 97.06 | 85.29 | No. 34 | Nos. 2, 14, 18, 28, 34 |
All three classifiers perform very similarly, with no substantial differences. Auto-prediction on the samples of dataset β, which were used to build the model, returns a perfect classification. Most notably, testing on a separate dataset α indicates that strain membership (genetic background) can be better predicted than haplotype membership (MHC-haplotype). However, a rate of 85–88% for classifying by MHC-haplotype further indicates that this feature is consistently present in the data and can be implemented to generate a predictive model that performs satisfactorily.
The raw data analysed with DGGE and GC/MS are organised by means of a feature extraction process in two distinct matrices XM (DGGE) and XC (GC/MS) that can be multivariately modelled. Consensus PCA shows that the genetic background (S1 and S2, Fig. 3) has a strong influence in shaping both XM and XC but a milder trend is clearly observable also in reference to the MHC-haplotype. Matrix correlation analysis indicates that XM and XC are partially correlated, but also bring consistent complementary information. These findings are associations, and further experimental manipulations are necessary to determine whether genetics influences individual microflora in scent-marks, and whether these influence profiles of volatile compounds.
Classification by means of PLS-DA and SVM demonstrates that the chemical signatures in scent-marks provide a surprisingly effective fingerprint for predicting mouse strain, as we found that the strains are perfectly classified, and MHC-haplotypes were also predicted with high accuracy.
This journal is © The Royal Society of Chemistry 2009 |