Mohamed N. Triba*a, Laurence Le Moyecb, Roland Amathieuc, Corentine Goossensa, Nadia Bouchemala, Pierre Nahond, Douglas N. Rutledgee and Philippe Savarina
aUniversité Paris 13, Sorbonne Paris Cité, Laboratoire Chimie, Structures, Propriétés de Biomatériaux et d'Agents Thérapeutiques (CSPBAT), Unité Mixte de Recherche (UMR) 7244, Centre National de Recherche Scientifique (CNRS), Equipe Spectroscopie des Biomolécules et des Milieux Biologiques (SBMB), 74 rue Marcel Cachin, 93037, Bobigny, France. E-mail: mohamed.triba@univ-paris13.fr
bUniversité d'Evry Val d'Essonne, Unité de Biologie Intégrative des Adaptations à l'Exercice (UBIAE), U902, INSERM, Bd François Mitterrand, 91025 Evry Cedex, France
cService d'Anesthésie et des Réanimations Chirurgicales, Université Paris 12, Hôpital Henri Mondor, Assistance Publique des Hôpitaux de Paris (AP-HP), Créteil, France
dService d'Hépatologie et Université Paris 13, Hôpital Jean Verdier, Assistance Publique des Hôpitaux de Paris (AP-HP), 93143 Bondy Cedex, France
eLaboratoire de Chimie Analytique, AgroParisTech, 16 rue Claude Bernard, 75231 Paris, France
First published on 23rd October 2014
Among all the software packages available for discriminant analyses based on projection to latent structures (PLS-DA) or orthogonal projection to latent structures (OPLS-DA), SIMCA (Umetrics, Umeå, Sweden) is the most widely used in the metabolomics field. SIMCA proposes many parameters or tests to assess the quality of the computed model (the number of significant components, R2, Q2, pCV-ANOVA, and the permutation test). Significance thresholds for these parameters are strongly application-dependent. For the Q2 parameter, a significance threshold of 0.5 is generally accepted. However, during the last few years, many PLS-DA/OPLS-DA models built using SIMCA have been published with Q2 values lower than 0.5. The purpose of this opinion note is to point out that, in some circumstances frequently encountered in metabolomics, the values of these parameters strongly depend on the individuals that constitute the validation subsets. As a result of the way in which the software selects members of the calibration and validation subsets, a simple permutation of dataset rows can, in several cases, lead to contradictory conclusions about the significance of the models when a K-fold cross-validation is used. We believe that, when Q2 values lower than 0.5 are obtained, SIMCA users should at least verify that the quality parameters are stable under permutation of the rows in their dataset.
| Bibliographic database | No. of articles containing metabolomics and PLS | Using MetaboAnalyst | Using SIMCA | Using R | Using Statistica | Using Unscrambler |
|---|---|---|---|---|---|---|
| ScienceDirect | 1117 | 51 | 464 (42%) | 49 | 37 | 37 |
| Royal Society of Chemistry | 135 | 9 | 54 (40%) | 5 | 3 | 2 |
| PLOS One | 245 | 13 | 108 (44%) | 28 | 16 | 3 |
| SpringerLink | 654 | 32 | 274 (42%) | 45 | 22 | 18 |
| ACS | 473 | 32 | 278 (59%) | 30 | 14 | 22 |
PLS/OPLS models try to find a linear relationship between an X predictor matrix (e.g. spectrometric data of biological samples) and a Y response matrix (e.g. clinical results, treatment, etc.). In metabolomics, the X predictor matrix frequently has more columns (predictor variables) than rows (individuals). Because of this property of metabolomics data, PLS/OPLS models can easily be overfitted and their predictability overestimated.
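This overfitting risk can be illustrated with a minimal NumPy sketch on synthetic data (an unregularized least-squares fit is used here as a simple stand-in, not the PLS algorithm itself): when X has more columns than rows, a linear model can reproduce even a purely random Y exactly on the training data.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 20, 200            # 20 individuals, 200 predictor variables (p >> n)
X = rng.normal(size=(n, p))
y = rng.normal(size=n)    # random response, unrelated to X

# Minimum-norm least-squares fit: with p > n the training residual is
# essentially zero, so the apparent R2 is ~1 even though X carries no
# information at all about y.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
y_fit = X @ coef
r2_train = 1 - np.sum((y - y_fit) ** 2) / np.sum((y - y.mean()) ** 2)
print(round(r2_train, 4))  # ~1.0: a "perfect" fit of pure noise
```

Only validation on individuals not used for fitting can expose this kind of spurious fit, which is why the cross-validated Q2 matters more than R2 here.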
The only way to reliably estimate the ability of the model to predict Y values of new individuals is to predict individuals from an independent dataset (i.e. individuals that were not used to build the model). This can be achieved by splitting the dataset into a training set and a test set. The training set is used to build the model and the test set is used to estimate the predictability. However, the cost of this splitting is that the model is built with only a fraction of the information present in the whole dataset. This may reduce the ability of the model to correctly predict a new dataset. Thus, splitting the dataset into a training set and a test set can be done only if enough individuals are available to build a reliable model. As in univariate statistics, the significance of the results of multivariate models depends on the sample size. However, the minimum number of individuals needed to attain a given significance threshold for PLS models is very application-dependent and no easily applicable rules have been proposed to estimate this number.7
When no test set is available, cross-validation is the main strategy proposed by commercial or academic statistical packages to assess the quality of a model. Different cross-validation procedures exist; the default SIMCA cross-validation is the so-called K-fold cross-validation. Results of the cross-validation procedure are summarized by the values of different quality parameters. The most frequently mentioned in the metabolomics literature are the R2 and Q2 parameters (Q2 is also called the cross-validated R2). R2 measures the goodness of fit while Q2 measures the predictive ability of the model. R2 = 1 indicates a perfect description of the data by the model, whereas Q2 = 1 indicates perfect predictability. R2 increases monotonically with the number of components (NC) and will automatically approach 1 as NC approaches the rank of the X matrix. Q2 will not necessarily approach 1: at a certain value of NC, Q2 reaches a plateau and usually decreases when more components are added. This indicates that beyond a certain degree of complexity the predictive ability of the model decreases.8 At this stage, it is very likely that the model is fitting dataset characteristics that are no longer representative of the studied population. A large discrepancy between R2 and Q2 indicates overfitting of the model through the use of too many components. According to the SIMCA users' guide, Q2 > 0.5 is accepted as indicating good predictability (SIMCA P12 users' guide, p. 514).9 It has been shown that in practice it is difficult to give a general limit that corresponds to good predictability, since this strongly depends on the properties of the dataset.8,10 For example, an acceptable Q2 threshold will strongly depend on the number of observations included. During the last few years, a large number of SIMCA PLS-DA/OPLS-DA models have been published with Q2 below 0.4 or even below 0.3 (for example, see ref. 11 and 12).
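For reference, Q2 is conventionally computed in chemometrics as 1 − PRESS/SS, where PRESS pools the squared errors of the cross-validation predictions and SS is the total sum of squares of the observed Y around its mean (this standard formula is shown as an illustration; SIMCA's exact implementation details may differ):

```python
import numpy as np

def q2(y_obs, y_pred_cv):
    """Cross-validated Q2 = 1 - PRESS/SS: PRESS is the prediction error
    sum of squares of the cross-validation submodels, SS the total sum
    of squares of the observed response around its mean."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred_cv = np.asarray(y_pred_cv, dtype=float)
    press = np.sum((y_obs - y_pred_cv) ** 2)
    ss = np.sum((y_obs - y_obs.mean()) ** 2)
    return 1 - press / ss

# Perfect cross-validated prediction gives Q2 = 1;
# always predicting the mean gives Q2 = 0.
y = np.array([0.0, 0.0, 1.0, 1.0])
print(q2(y, y))                       # 1.0
print(q2(y, np.full(4, y.mean())))    # 0.0
```

Note that Q2 can be negative when the cross-validated predictions are worse than simply predicting the mean, as seen in Table 2 below.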
These models with poor predictability are frequently validated by a permutation test, which consists of comparing the Q2 obtained for the original dataset with the distribution of Q2 values calculated when the original Y values are randomly assigned to the individuals.10 The cross-validation procedure also makes it possible to calculate a p-value estimating the significance of PLS/OPLS models (pCV-ANOVA).13
As recently published in this journal,14 metabolomics results based on PLS/OPLS models should always be reported together with the values of the quality parameters of the multivariate models. The number of components used in the final model, and the Q2 and pCV-ANOVA values, should be presented to allow the reader to assess the quality of the model calculated by SIMCA. However, in this Opinion piece, we want to point out that in some cases, because of the way in which the default SIMCA cross-validation procedure selects members of the calibration and validation subsets, permutation of the rows of a dataset can result in variations in the values of the quality parameters. As a consequence, in these circumstances, different conclusions on the quality of the PLS/OPLS models may be drawn from the same dataset. First, we will show that under some conditions a random permutation of the rows in the dataset strongly affects the quality parameter values obtained when the default SIMCA cross-validation settings are used. Second, we will discuss three types of situations frequently encountered in metabolomics studies in which the K-fold cross-validation procedure fails to produce a Q2 value that is independent of the arbitrary order of the rows in the dataset.
Cross-validation makes it possible to estimate the ability of a model to correctly predict the Y response matrix of new individuals. In the SIMCA software, cross-validation is also used to avoid overfitting by estimating the number of significant components (NSCs) to use in the model. Many cross-validation procedures are used in the metabolomics community (K-fold, Leave One Out, Monte Carlo, 2CV, etc.). The default SIMCA cross-validation procedure is a 7-fold cross-validation8 in which the dataset is split into 7 different subsets. For a fixed number of components (NC), the Y values of all individuals of each subset are predicted using a submodel built with the 6 other subsets (the calibration subset). The differences between the predicted and observed Y values are used to calculate the QNC2 parameter for this number of components. The procedure starts at NC = 1 and NC is incremented as long as the increase of QNC2 is larger than a limit value fixed by various rules.9
Each subset is constituted by selecting every seventh row of the dataset. The first subset is built with the individuals corresponding to rows 7, 14, 21 and so on. The second subset is constituted with the individuals corresponding to rows 1, 8, 15, etc. The other subsets are built in the same way (Scheme 1a).
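This interleaved selection can be sketched in a few lines of Python (an illustration of the rule described above, not SIMCA's internal code):

```python
def venetian_subsets(n_rows, k=7):
    """Interleaved ("venetian blinds") subset assignment: with 1-based
    row numbers, every k-th row goes to the same cross-validation subset."""
    subsets = [[] for _ in range(k)]
    for row in range(1, n_rows + 1):
        subsets[row % k].append(row)   # row % 7 == 0 -> rows 7, 14, 21, ...
    return subsets

subs = venetian_subsets(21)
print(subs[0])  # [7, 14, 21] -> the first subset described in the text
print(subs[1])  # [1, 8, 15]  -> the second subset

# Because membership depends only on the row position, permuting the rows
# of the dataset reassigns individuals to different subsets: an individual
# sitting on row 7 that moves to, say, row 12 changes subset.
```

The key point is that subset membership is a function of row position alone, which is why the arbitrary order of the rows matters at all.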
Scheme 1 Selection of the individuals used to build the cross-validation subsets in SIMCA for the original dataset (a) and when the rows of this dataset are randomly permuted (b).
Considering the way the subsets are built, it is clear that a permutation of the rows of the X and Y matrices changes the positions of the individuals and modifies the composition of these subsets (Scheme 1b). Thus, the submodels and the predicted Y values calculated during the cross-validation procedure are also affected by a permutation of the rows.
The major consequences of this are:
– Row permutations can potentially change the number of components considered as significant (NSC) by SIMCA.
– For the same number of significant components, row permutations will change the value of the QNSC2 parameter.
– The CV-ANOVA p-value, which depends on the cross-validation procedure, is also affected by row permutations in the dataset.
– The conclusion of the permutation test can be different when the order of rows is changed.
Thus, a better estimation of the number of components could help to reduce the variability of the quality parameters. According to Wheelock and Wheelock,14 “the default automatic fitting in SIMCA extracts the maximal number of significant components, which in most cases results in an overfitted model”. These authors suggested that the optimal number of components (ONC) can be estimated using the pCV-ANOVA parameter: once the ONC is reached, the addition of another component increases pCV-ANOVA. As shown in Fig. 1c, the values of pCV-ANOVA can strongly depend on the arbitrary order of the rows in the dataset. This dependence is also observed for a given number of components (colored lines). As a consequence, an ONC based on pCV-ANOVA may also vary strongly when the dataset rows are permuted. In order to estimate this variability, we performed 1000 permutations of the rows and, for each permutation, the value of the ONC was determined by looking for the first local minimum of pCV-ANOVA as NC is incremented. We found a large variability of the ONC with row rearrangement (Fig. S1, Supplementary Data A, ESI†). Thus, an ONC determined using pCV-ANOVA can also strongly depend on the arbitrary order of the rows in the dataset if the K-fold cross-validation procedure is used. More generally, the number of components estimated using parameters that depend on row order (such as Q2, pCV-ANOVA, etc.) can potentially exhibit a large variability with row permutations.
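The "first local minimum of pCV-ANOVA" rule used here can be written down explicitly; the p-values in the example are hypothetical, standing in for the values the software would report for NC = 1, 2, 3, …:

```python
def onc_first_local_min(p_values):
    """Optimal number of components (ONC) as the first local minimum of
    pCV-ANOVA when NC is incremented: the last NC before adding a
    component first makes the p-value increase. p_values[i] is the
    pCV-ANOVA obtained for NC = i + 1."""
    for nc in range(1, len(p_values)):
        if p_values[nc] > p_values[nc - 1]:
            return nc                 # 1-based NC just before the increase
    return len(p_values)              # p-values decreased throughout

# Hypothetical pCV-ANOVA values for NC = 1..5 (illustration only):
print(onc_first_local_min([0.20, 0.03, 0.01, 0.04, 0.02]))  # 3
```

Since the p-value sequence itself changes when the rows are permuted, the ONC returned by this rule changes as well, which is what Fig. S1 quantifies.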
The datasets with the row arrangements corresponding to the lowest and highest calculated values of QNSC2 (i.e. −0.09 and 0.42) were compared by calculating the quality parameters of the OPLS models. These permuted datasets are available as ESI† (Dataset2.xlsx and Dataset3.xlsx). We observed that, for the same experimental result, the quality parameters of the two models (Table 2) lead to contradictory conclusions on the significance of the metabolic differences between the two classes. Contradictory conclusions are also obtained when permutation tests (random permutation of group affiliation) are performed on these two models (Fig. 1e and f). This particular dataset demonstrates that, in some situations, the quality parameter values calculated with the default SIMCA cross-validation procedure are strongly determined by chance. This result also suggests that performing row permutations allows confidence intervals to be estimated for the various quality parameters.
| | Permuted dataset 1 | Permuted dataset 2 |
|---|---|---|
| NSC | 1 | 10 |
| R2 | 0.18 | 0.75 |
| Q2 | −0.09 | 0.42 |
| pCV-ANOVA | 1 | 0.00004 |
| CV-AUROC | 0.57 | 0.91 |
To illustrate this point, we modified the experimental results of a second metabolomics study in which we evaluated the influence of HCC on the metabolism of cirrhotic patients.18 OPLS-DA was used to discriminate 33 patients without HCC from 33 patients with large HCC. Spectra were normalized with the probabilistic quotient normalization method, divided into 230 domains of 0.05 ppm, and the water signal region was suppressed. The resulting X and Y matrices are available as ESI† (Dataset4.xlsx). The properties of this dataset correspond to the first situation mentioned by Eriksson et al.8 (i.e. the interclass variability is large enough relative to the intraclass variability) and no strong effect of row permutations on the Q2 parameter was observed for this dataset (Fig. 2a). We then modified this dataset until we reached the second condition mentioned by Eriksson and coworkers8 (i.e. non-homogeneous classes and large intraclass variability). The modifications introduced in the original dataset were chosen to simulate three types of circumstances frequently observed in metabolomics studies.
– The first situation arises when the main source of variability in the dataset is uncorrelated with the Y response variable. This can be observed, for example, when incorrect sample normalization has been applied to correct for dilution effects. To simulate this situation we multiplied each row of the original dataset by a dilution factor randomly chosen between 1 and 50 (ESI,† Dataset5.xlsx). We randomly permuted the rows of the resulting dataset and calculated the NSC and Q2 values for each permutation (Fig. 2b). We observed a wider distribution of Q2 than in Fig. 2a. For some permutations, no significant component was obtained.
– A second situation corresponds to inaccurate labeling of the group membership of individuals, a situation known as class noise.19 It is frequently encountered in metabolomics studies applied to clinical problems, especially when a reliable diagnostic tool is unavailable. To simulate this situation, 10% of the individuals of each group were incorrectly labeled in the original dataset (ESI,† Dataset6.xlsx). The NSC and Q2 distributions after random permutations of the rows were calculated (Fig. 2c). Here again, we observed a wider distribution of Q2 than in Fig. 2a.
– Finally, a third situation arises when the number of individuals used to build the model is too small. In this case, the probability of obtaining by chance a sample with at least one non-homogeneous or non-representative class is not negligible, even if the classes are homogeneous in the population. We selected 8 individuals of each class from the original dataset and randomly permuted the rows of the resulting 16-row dataset (ESI,† Dataset7.xlsx). For each permutation we calculated the NSC and Q2 parameters (Fig. 2d). In this case, Q2 values ranged from 0.23 to 0.92.
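The three perturbations above are easy to reproduce on any X matrix and class vector y. A minimal NumPy sketch on synthetic data with the same dimensions as the HCC dataset (66 individuals × 230 spectral domains; the dilution range and 10% flip rate come from the text, the data themselves are random):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(66, 230))       # 66 individuals x 230 spectral domains
y = np.repeat([0, 1], 33)            # two classes of 33, as in the HCC study

# 1) Uncorrected dilution: multiply each row by a random factor in [1, 50],
#    creating a dominant source of variability uncorrelated with y.
dilution = rng.uniform(1, 50, size=X.shape[0])
X_diluted = X * dilution[:, None]

# 2) Class noise: flip the labels of 10% of the individuals in each group.
y_noisy = y.copy()
for cls in (0, 1):
    idx = np.flatnonzero(y == cls)
    flip = rng.choice(idx, size=len(idx) // 10, replace=False)
    y_noisy[flip] = 1 - cls

# 3) Small sample: keep only 8 individuals per class (16 rows in total).
keep = np.concatenate([np.flatnonzero(y == c)[:8] for c in (0, 1)])
X_small, y_small = X[keep], y[keep]
```

Feeding each perturbed dataset through the same permute-rows/recompute-Q2 loop is then enough to reproduce the widened Q2 distributions of Fig. 2b–d qualitatively.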
These results show that, when the situations mentioned above are encountered, the quality parameters can be strongly affected by row permutations in the dataset if the K-fold cross-validation procedure is used. Moreover, many combinations of these three situations can be encountered in metabolomics studies.
The SIMCA software allows users to modify the cross-validation procedure by changing the number of cross-validation sets and/or selecting the individuals of each set. We believe that this possibility can also help users to estimate a confidence interval for the calculated quality parameters. The Leave One Out (LOO) procedure can also be tested in SIMCA by setting the number of subsets to the number of samples. This method does not depend on the order of the rows in the dataset; however, as pointed out by several authors,20,21 the LOO procedure can lead to overfitting and overestimation of Q2.
Other cross-validation methods that (to our knowledge) are not yet implemented in SIMCA should be tested. We particularly recommend the double cross-validation (2CV) method.22,23 Like the K-fold method, 2CV uses all the available individuals to build the models and to estimate their predictability. However, in 2CV, the estimations of NSC and Q2 are decoupled. This is a very important point since, as illustrated in Fig. 1b, an overestimation or underestimation of NSC frequently leads to an overestimation or underestimation of Q2. Thus, even though the double validation loop is time-consuming compared to the single validation loop performed in the K-fold procedure, the risk of overestimating or underestimating the predictability is reduced with the 2CV procedure.
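The structure of 2CV, and in particular the decoupling of model selection from predictability estimation, can be sketched as follows. For the example to stay self-contained, the PLS submodel is replaced by a ridge-regression stand-in whose penalty plays the role that NSC would play in a real application; only the nested-loop structure is the point here.

```python
import numpy as np

def press(y, yhat):
    return float(np.sum((y - yhat) ** 2))

def fit_predict(X_tr, y_tr, X_te, alpha):
    # Stand-in model: ridge regression via the normal equations. In real
    # 2CV the tuned hyperparameter would be the number of PLS components.
    p = X_tr.shape[1]
    w = np.linalg.solve(X_tr.T @ X_tr + alpha * np.eye(p), X_tr.T @ y_tr)
    return X_te @ w

def double_cv(X, y, alphas, k_out=7, k_in=7):
    n = len(y)
    total_press, ss = 0.0, float(np.sum((y - y.mean()) ** 2))
    for i in range(k_out):
        test = np.arange(n) % k_out == i       # outer validation subset
        Xc, yc = X[~test], y[~test]            # calibration part
        # Inner loop: choose the hyperparameter on the calibration part only.
        inner = []
        for a in alphas:
            pr = 0.0
            for j in range(k_in):
                val = np.arange(len(yc)) % k_in == j
                pr += press(yc[val], fit_predict(Xc[~val], yc[~val], Xc[val], a))
            inner.append(pr)
        best = alphas[int(np.argmin(inner))]
        # Outer loop: error on individuals never used in that choice.
        total_press += press(y[test], fit_predict(Xc, yc, X[test], best))
    return 1 - total_press / ss                # cross-validated Q2

rng = np.random.default_rng(2)
X = rng.normal(size=(42, 30))
y = X[:, 0] + 0.1 * rng.normal(size=42)        # response driven by one predictor
print(round(double_cv(X, y, alphas=[0.1, 1.0, 10.0]), 2))
```

Because the outer validation individuals never take part in the inner selection, the returned Q2 is not inflated by the model-selection step, at the cost of the extra inner loop.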
Another interesting method is the Monte Carlo Cross-validation (MCCV) procedure.24 By randomly building many subsets with many combinations of individuals, this procedure averages the opposite effects of too optimistic and too pessimistic cross-validation submodels.
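A minimal MCCV sketch (again with a least-squares stand-in where a real application would fit a PLS/OPLS submodel) makes the key property explicit: because the validation subsets are drawn at random, the result does not depend on the order of the rows in the dataset.

```python
import numpy as np

def mccv_q2(X, y, n_splits=100, test_frac=0.25, seed=0):
    """Monte Carlo cross-validation: repeatedly draw a random validation
    subset, predict it from the remaining rows, and pool the squared
    prediction errors over all splits."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_test = max(1, int(test_frac * n))
    press_sum, ss_sum = 0.0, 0.0
    for _ in range(n_splits):
        test = rng.choice(n, size=n_test, replace=False)
        train = np.setdiff1d(np.arange(n), test)
        # Stand-in predictor: least squares via the pseudoinverse.
        w = np.linalg.pinv(X[train]) @ y[train]
        press_sum += np.sum((y[test] - X[test] @ w) ** 2)
        ss_sum += np.sum((y[test] - y[train].mean()) ** 2)
    return 1 - press_sum / ss_sum

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 10))
y = X @ rng.normal(size=10)          # noiseless linear response
print(round(mccv_q2(X, y), 2))
```

Averaging over many random splits is what damps the over-optimistic and over-pessimistic submodels that a single fixed partition can produce.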
Finally, we want to remind readers that a truly reliable estimation of the predictability of a model is obtained with individuals that are independent of those used to build this model.25
| Abbreviation | Definition |
|---|---|
| PLS | Projection to latent structures |
| OPLS | Orthogonal projection to latent structures |
| DA | Discriminant analysis |
| PCA | Principal components analysis |
| NC | Number of components |
| NSC | Number of significant components |
| ONC | Optimal number of components |
| LOO | Leave one out |
| MCCV | Monte Carlo cross-validation |
Footnote
† Electronic supplementary information (ESI) available. See DOI: 10.1039/c4mb00414k |
This journal is © The Royal Society of Chemistry 2015 |