Anita Rácz,*a Marietta Fodor b and Károly Héberger a
aPlasma Chemistry Research Group, Research Centre for Natural Sciences, Hungarian Academy of Sciences, Magyar tudósok körútja 2, H-1117 Budapest XI., Hungary. E-mail: racz.anita@ttk.mta.hu
bSzent István University, Faculty of Food Science, Department of Applied Chemistry, Villányi út 29-43, H-1118 Budapest XI., Hungary
First published on 24th May 2018
Fat and dry material contents (connected to moisture) are among the most important parameters in the quality control of butter, margarine and margarine spreads (dairy spreads). More than a hundred margarine samples were used to model their fat and dry material content based on Fourier transform near-infrared (FT-NIR) spectroscopy in transmission and reflectance modes for the quality control of margarine. We also carried out a systematic comparison of various modeling techniques such as PLS regression, principal component regression (PCR) and support vector machines (SVM). Moreover, three types of cross-validation, three types of variable selection and the effect of the different spectral types (transmission and reflectance) were compared with factorial ANOVA tests. We examined the effect of the applied datasets (calibration samples, test samples, and both sets together) based on the original predicted values. Sum of ranking differences (SRD), a novel comparison tool, was applied for the task. We showed that SRD values can be used as a promising and useful performance parameter for the ranking and evaluation of numerous regression models. Four datasets were used for the evaluations: two with 42 transmission models each and two with 34 reflectance models each. Finally, the best models were identified in each case based on their SRD values. The properly validated SVM models proved to be the best for all four datasets. Although method comparison is data set dependent, the suggested methodology is generally and unambiguously applicable. These final models can be used for fast and easy quality control of margarine samples instead of the time-consuming original analytical techniques.
Fat and dry material contents (connected to moisture) are among the most important parameters in the quality control of butter, margarine and margarine spreads (dairy spreads). The production process requires continuous control. The original analytical techniques for the determination of fat and dry material content are very time-consuming. These methods date back to the nineties and the previous decade, yet the standard methods are unfortunately still based on them. On the other hand, environmentally safer methods, which reduce energy consumption and the amount of solvents used, are nowadays widely applied instead of the original techniques. Fourier transform near-infrared (FT-NIR) measurement is one of these commonly used environmentally friendly and time-efficient substitutes. FT-NIR is a non-destructive analysis for liquid, solid and colloidal (such as margarine) samples, and it can be applied as an on-line tool in process control. In the past few decades several articles have been published on this topic, using different spectroscopy-related analytical methods for this area of food products. A short summary of these publications can be found in Table 1. It is interesting that the improvement of the standard classical methods is based exclusively on spectroscopic methods. The majority of the related articles deal with the classification and quantitative analysis (quality control) of these products.
Author | Product | Determination | Method^a
---|---|---|---
Evers et al.37 | Butter | Moisture, solid-not-fat (SNF) | Classical analysis (standard methods)
van de Voort et al.38 | Mayonnaise, peanut butter | Fat, moisture | FT-IR
van de Voort et al.39 | Butter | Fat, moisture | FT-IR (ATR)
Safar et al.40 | Margarine, butter, edible oil | Classification of products | FT-IR
Wilson41 | Margarine | trans-Fatty acid | HATR FT-IR
Hernández-Martínez et al.42 | Margarine | trans-Fatty acid | HATR FT-IR
Da Costa Filho43 | Edible oil | trans-Fatty acid | HATR FT-IR
Rohman and Man44 | Edible oil and fat | Counterfeit of products | FT-IR (MIR)
Vlachos et al.45 | Edible oil and fat | Counterfeit of products | FT-IR (MIR)
Hermida et al.46 | Butter | Fat, moisture, solid-not-fat (SNF) | FT-NIR
Yang, Irudayaraj and Paradkar47 | Edible oil | Classification of products | FT-IR, FT-NIR, Raman

^a Abbreviations: ATR = attenuated total reflectance; HATR = horizontal ATR; MIR = mid-infrared spectroscopy.
In our study we wanted not only to develop predictive models for the fat and dry material content of margarine spreads, but also to compare the models and make a final decision about them based on sum of ranking differences (SRD) and ANOVA. Our aim was also to examine the differences or similarities between (i) different cross-validation techniques, (ii) different regression methods, (iii) different variable selection techniques and (iv) different NIR spectral modes (transmission and reflectance). It was important and interesting to evaluate how the other parameters depend on the spectral mode and how the different spectral modes affect the final models. We also wanted to search for better options than relying on only one of the internal and external validation techniques. In this way we provide a new perspective for the scientific community dealing with multivariate regression models.
The above mentioned parameters are essential for regression model building, and thus one can use our conclusions and findings to save more time, money and energy in other NIR spectroscopy related studies.
The regression models were optimized in the same way in each case. A derivative transformation and mean centering were applied to the X variables, and mean centering alone to the Y variables. A few examples of original and pre-processed spectra can be seen in Fig. 2.
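A minimal sketch of this preprocessing in Python is given below, assuming a Savitzky–Golay first derivative; the paper only states that a derivative and mean centering were used, so the filter window and polynomial order are illustrative choices, not the authors' actual settings.

```python
from scipy.signal import savgol_filter

def preprocess(X, y):
    """Assumed preprocessing: first derivative of each spectrum (Savitzky-Golay),
    then column-wise mean centering of X; y is mean centered only."""
    # X: (n_samples, n_wavelengths) spectra; window/polyorder are illustrative
    X_deriv = savgol_filter(X, window_length=11, polyorder=2, deriv=1, axis=1)
    X_centered = X_deriv - X_deriv.mean(axis=0)   # remove the mean spectrum
    y_centered = y - y.mean()                     # mean centering of the reference values
    return X_centered, y_centered
```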
In the case of the transmission spectral datasets, 42 regression models were built for the fat content and 42 for the dry material content; for the reflectance spectral datasets, 34 models were built for each. Different cross-validation techniques were used only for the transmission datasets; as a result, the number of models (and combinations of parameters) is higher than for the reflectance spectra. A summary of the parameter combinations is provided in Table 2.
Regression method | Validation | Variable selection
---|---|---
PLS | Random 5-fold CV (5-CV RANDOM) | iPLS/iPCR25
PCR | Systematic 5-fold CV (5-CV SYST) | iPLS/iPCR50
SVM | Leave-one-out CV (LOO) | Genetic algorithm (GA)
 | | No selection (ALL)
One example of the evaluated regression models can be seen in ESI, Fig. S1.†
In the factorial ANOVA we first examined the effects of two factors: three levels of cross-validation {leave-one-out (LOO) and fivefold cross-validation with systematic and random selection (5-CV syst and 5-CV random, respectively)} together with two levels of calibration method (PCR and PLS regression). Then, we also compared the effect of cross-validation with four levels of variable selection technique (VS): no VS (ALL), genetic algorithms (GA), and interval selection with 25 and 50 splits (I25 and I50, respectively). In both analyses the effect of the cross-validation method remained insignificant (at the 5% level). Fig. 3 shows the results of the factorial ANOVA for the calibration methods (a and b) and the variable selection techniques (c and d), in both cases together with the cross-validations. It can clearly be seen in Fig. 3(a) and (b) that the PCR method is less certain because of the larger confidence intervals (95%). The PLS method was much more reliable in this sense for the Q2 and RMSECV values as well. We can also see the difference between the confidence intervals (95%) of the models built without any variable selection protocol and those built with variable selection: variable selection improved the goodness of the models and narrowed the confidence intervals.
The smaller Q2 (and the larger RMSECV) obtained without variable selection necessitates some form of variable selection.
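A hedged sketch of such a two-way factorial ANOVA in Python (statsmodels) is shown below; the data layout, file name and column names (Q2, cv, method) are hypothetical placeholders, not the authors' actual files or software.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical results table: one row per model with its Q2 value, the
# cross-validation type (LOO / 5-CV syst / 5-CV random) and the method (PLS / PCR).
results = pd.read_csv("model_performance.csv")

# Two-way factorial ANOVA with interaction between CV type and calibration method
ols_model = smf.ols("Q2 ~ C(cv) * C(method)", data=results).fit()
anova_table = sm.stats.anova_lm(ols_model, typ=2)   # type II sums of squares
print(anova_table)
```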
In the case of the transmission datasets, the results can be seen in Fig. 4(a) and (b). Box and whisker plots were used for the visualization of the SRD values (the smaller the better). Sevenfold cross-validation was also applied in the SRD protocol; thus the SRD values can be plotted in this type of graph. Fig. 4 shows that the best models were the PLS regression model with iPLS50 variable selection and the SVM regression model with a genetic algorithm. The latter was among the best not just in one but in both cases, because there was no significant difference between SVM-GA and the next model, which was a PCR model (significance was tested by a Wilcoxon matched pair test, α = 0.05). The PLS and PCR models without any variable selection were the worst ones. However, even without any variable selection technique, SVM still yields a reliable regression model: the SVM-All models were in the third and fourth places. The R2 and Q2 values of the best models were checked: 0.986 and 0.970 for the PLS-iPLS50 model, respectively, and 0.990 and 0.989 for the SVM-GA model, respectively. The fact that SRD did not choose wrong models also verifies the SRD approach when compared with the original, commonly used performance parameters. Finally, we can conclude that fat and dry material content can be determined successfully with the chosen models. It is not advisable to use all wavelengths; it is better to use a few selected intervals.
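For illustration, a minimal sketch of the basic SRD calculation is given below, assuming that the measured reference values supply the reference ranking of the samples; the sevenfold cross-validation of the SRD values and the comparison with random rankings used in the full SRD protocol are omitted here.

```python
import numpy as np
from scipy.stats import rankdata

def srd(predictions, reference):
    """Sum of ranking differences for each model (column of `predictions`).

    predictions: (n_samples, n_models) predicted values, one column per model
    reference:   (n_samples,) measured reference values (assumed reference ranking)
    Returns one SRD value per model; smaller means closer to the reference ranking.
    """
    ref_rank = rankdata(reference)
    return np.array([
        np.abs(rankdata(predictions[:, j]) - ref_rank).sum()
        for j in range(predictions.shape[1])
    ])
```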
In the case of the reflectance datasets, the protocol of comparison was the same as above. Here, eleven and eight external test validation samples were used for the fat and dry material content models, respectively. The final results for fat and dry material content can be seen in Fig. 5. Here the SVM-All models, without any variable selection, were clearly the best. In both cases (a) and (b), these models were significantly better than the other models. The SVM-All model for fat content determination has an R2 value of 0.991 and a Q2 value of 0.982. In the case of dry material content, the R2 and Q2 values of the SVM-All model were 0.992 and 0.979, respectively.
It can also be seen that the difference between the worst PLS and PCR models was smaller than in the previous cases, but these models were still not suitable for a successful calibration. On the other hand, the SVM models were validated properly, and these models can be applied in quality control procedures as well. SVM does not require variable selection.
The average of the frequently used performance parameters of the models can be found in the ESI (Table S2).†
Factorial ANOVA was used with other indicators as well. The effects of the regression methods, the different variable selection techniques and the spectral types were examined in the following procedure. For this analysis, the predicted values of both sample sets together and of the external set alone were used. For both sample sets, the results can be seen in Fig. 7.
The plot has two splits: one for the transmission and one for the reflectance spectral type. It can be clearly seen that the use of variable selection causes larger differences in SRD values, especially in the case of transmission spectra. The effect of the different regression methods is also larger if transmission spectra are used, but SVM was clearly better than the other methods in both cases. ANOVA also proved that for SVM alone, variable selection has no significant effect. On the other hand, the largest improvement can be achieved with variable selection in the case of PCR models if transmission spectra are used. The confidence intervals (95%) are smaller for the reflectance spectra, but they did not indicate significant differences for the transmission spectra either. However, the effects of the different regression methods, variable selection techniques and spectrum types were significant.
The aforementioned effects were examined in the same way using the external sample set alone. We wanted to see what difference the application of the two sets makes. The results can be seen in Fig. 8.
Here the shapes of the lines (a "U" shape for reflectance and a distorted "U" shape for transmittance) are somewhat similar, although the confidence intervals (95%) are much larger in all cases compared to the previous analysis. It is also interesting that the tendency is not the same for the genetic algorithm: the SRD values increase a little for the reflectance spectra, but the increase is larger in the case of transmission spectra. Although ANOVA can detect significant differences between the models, variable selection techniques and spectrum types, it can be clearly seen that the decision about the best ones is not at all obvious. This also supports the conclusion that we cannot make a decision based only on the external test samples and their results. This conclusion corresponds to our earlier findings in two different case studies.5
The role of internal and external validation in model validation and in the calculation of predictive performance is still a debated issue in machine learning, chemometrics and every other modeling discipline. In the literature one can find several studies that prefer internal validation over the external one.8–10 However, other papers emphasize the importance of external sets.11,12
The debate continues: external validation based on a single split of the data set may not be as good as previously thought, because metrics calculated from the test set can lead to random decisions.5 Some consider external validation the gold standard for checking the predictive ability of QSAR models, while others think cross-validation is better suited for this purpose, in order to avoid the loss of information caused by splitting the data set into training and test sets.13
We used internal validation (cross-validation) and external sample sets (new, commercial samples) for the validation of our models. Our opinion, based on the results of this work and previous findings, is that drawing a final conclusion based only on the external test set can be misleading. We can assume that new samples have a fifty–fifty chance of belonging either to the same distribution as the calibration samples or to another distribution. Therefore, we can obtain very good and very bad external results with equal probability. If the external set belongs to the same distribution as the earlier samples, external validation cannot add any new information compared to internal validation. If the external set has been drawn from a different distribution, one cannot use the earlier developed model(s) for prediction (without updating). The external test set alone cannot provide as robust and reliable a result as internal validation. Conversely, if a model has poor quality parameters in internal validation, it will usually not be able to predict external samples either. In some rare cases an external set may provide somewhat better performance.
In our opinion SVM is a very promising tool for multivariate modeling, and the possibility of overfitting can easily be excluded with a proper validation protocol. On the other hand, SVM needs more regularization parameters than PLS, but this can be handled with proper validation. However, we also found some cases where SVM is worse than the other techniques. One can also find publications that deny the overfitting "feature" of the method (see e.g. Table 2 in ref. 14).
Our conclusion does not contradict the suggestions in the literature. We would not neglect the usage of an external set, but we have to be careful with it. A decision based solely on an external set is equivalent to leaving our models to a random choice. Our conclusion provides a new perspective on this debated issue.
The measured concentrations are given in w/w%. Each sample was measured in duplicate. If a value was significantly different from the nominal concentration, additional duplicates were measured. The relative standard deviation was 1.59 w/w% and 1.30 w/w% for the fat and dry material content, respectively. The original samples were not always the same for the reflectance and transmission spectra, because there was a time shift between the two types of measurements and some of the original samples had expired and thus could not be used again. This does not cause any problem, because the models should work on all the different samples that are commercially available.
The compounds used for the experiments were ignited silica sand (puriss, Spektrum 3D, Hungary), ethanol (100 v/v%, Reanal, Hungary) and petroleum ether (40–65 °C, Reanal, Hungary).
In transmission mode (800–1100 nm or 12500–9000 cm−1) an outer transmission interface and Si-diode detector were used and the homogenized samples were placed in Petri dishes, as sample compartments. In this case the device scanned the sample 64 times and an average spectrum was constructed from the scans.
In diffuse reflectance mode, a rotatable sample wheel and a PbS detector were used. In this mode, part of the infrared light is absorbed in the sample layer, while the rest is reflected and reaches the detector. Each spectrum was the average of 32 subsequent scans.
The comparison of the two different spectral types can be seen in the Results and discussion part as well (Fig. 2). Reflectance spectra are richer in peaks, but transmission spectra are used more often for this type of sample.
Every sample was measured in duplicate. The average of the duplicate spectra was used for the multivariate calibration.
Principal component regression (PCR) is closely related to multiple linear regression (MLR) and principal component analysis (PCA). The basic idea of PCR consists of two steps: (a) calculate the principal components from the original variables and (b) use the new virtual variables (PC scores) to build the regression model with the typical and well-known MLR equation:
Y = Xb + E  (1)
The advantage of this method is that it suppresses spectral collinearity. However, there is no guarantee that the calculated PCs are correlated with the reference variable Y.18,19
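A minimal PCR sketch in Python (scikit-learn) corresponding to eqn (1) is shown below; the number of components and the variable names in the usage comment are illustrative, not the settings used in this work.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

def build_pcr(n_components=8):
    """PCR as a pipeline: PCA scores fed into ordinary least squares (eqn (1))."""
    return make_pipeline(PCA(n_components=n_components), LinearRegression())

# Illustrative usage (X_cal, y_cal, X_new are hypothetical preprocessed data):
# pcr = build_pcr(); pcr.fit(X_cal, y_cal); y_hat = pcr.predict(X_new)
```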
Partial least-squares regression (PLS-R) has been the most frequently used multivariate regression technique in the field of NIR spectroscopy for the past few decades. The tutorial paper of Geladi and Kowalski gives a very good explanation of this method.2 The increasing popularity of PLS dates back to that publication, since PLS regression can be considered the basic tool for multivariate regression. The basic idea of PLS-R is a matrix transformation, which decomposes the original X and Y matrices into products of score and loading matrices, in much the same way as PCA does; these are called the outer relations.2,20 PLS regression can use the new "latent" variables (T and U) for the prediction of the Y values.21 There is also an inner relationship between the PLS components (U and T) of the X and Y matrices, which can be described by an equation similar to eqn (1). The determination of the number of components is an essential part of the model building: we can easily overfit or underfit the models if we do not pay attention to the fit/parsimony trade-off.22 A commonly used method for this purpose is the global or local minimum value of the root mean square error of cross-validation (RMSECV) or of the prediction error sum of squares (PRESS).
Further options can be found in the extensive literature of this field, for example randomization tests or decisions based on eigenvalues.23,24 In this study, the first local minimum of the RMSECV values was used; in the absence of a minimum, the starting point of a plateau was chosen based on visual inspection.
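A hedged sketch of this component selection in Python (scikit-learn) is given below; the fivefold cross-validation and the component range are illustrative choices, not the authors' exact settings, and the plateau rule is simplified to a global-minimum fallback.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict

def rmsecv_curve(X, y, max_components=15, cv=5):
    """RMSECV as a function of the number of PLS components."""
    rmsecv = []
    for a in range(1, max_components + 1):
        y_cv = cross_val_predict(PLSRegression(n_components=a), X, y, cv=cv)
        rmsecv.append(np.sqrt(np.mean((y - y_cv.ravel()) ** 2)))
    return np.array(rmsecv)

def first_local_minimum(rmsecv):
    """Pick the first local minimum of the RMSECV curve (1-based component count)."""
    for a in range(1, len(rmsecv) - 1):
        if rmsecv[a] < rmsecv[a - 1] and rmsecv[a] <= rmsecv[a + 1]:
            return a + 1
    return int(np.argmin(rmsecv)) + 1   # fall back to the global minimum
```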
Support vector machine (SVM) regression was also used for model building. This method belongs to the machine learning techniques; it is younger and not yet as popular as PLS or PCR. However, it has high potential, because the development and application of machine learning (especially SVM) algorithms have increased rapidly in the past few decades.25,26 This is why we also wanted to test and compare this method with the others. SVM finds a relationship between the regressors and the Y values (dependent variables). SVM projects the original data into a space of higher (rather than lower) dimension (feature space) using a suitable kernel function18,27 (the most popular functions include polynomial kernels and the Gaussian radial basis function). We have to note that SVM models can be very sensitive to overfitting and several meta-parameter combinations can provide the same results; thus, careful validation is advised.
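A sketch of an RBF-kernel SVM regression with a cross-validated meta-parameter search (scikit-learn) is shown below; the grid values and the fivefold cross-validation are illustrative assumptions, not the settings used in this study.

```python
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

def build_svm(cv=5):
    """RBF-kernel SVR tuned by cross-validated grid search (grid is illustrative)."""
    param_grid = {"C": [1, 10, 100, 1000],
                  "gamma": [1e-4, 1e-3, 1e-2, 1e-1],
                  "epsilon": [0.01, 0.1, 1.0]}
    return GridSearchCV(SVR(kernel="rbf"), param_grid, cv=cv,
                        scoring="neg_root_mean_squared_error")

# Illustrative usage: svm = build_svm(); svm.fit(X_cal, y_cal); print(svm.best_params_)
```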
More complex selection methods include the genetic algorithm (GA) and interval PLS/PCR (iPLS/iPCR). Usually the spectrum is divided into several equal parts (e.g. 10, 20 or 40). Working with intervals or windows can be a better choice because the spectral wavelengths are not independent of each other.28,29 Interval selection is highly recommended, especially in the case of GA.3 The final decision about the best intervals can be made based on RMSECV, R2 or its cross-validated counterpart (Q2). In our study, GA, iPLS and iPCR with 25 and 50 intervals were used in the model building phase.
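A simplified, single-window iPLS sketch (scikit-learn) is given below, assuming equal-width intervals scored by RMSECV; the commercial iPLS/iPCR implementations used in this work may combine or refine the windows differently, so this is only an illustration of the principle.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict

def ipls(X, y, n_intervals=25, n_components=5, cv=5):
    """Split the wavelength axis into equal windows, cross-validate a PLS model
    on each window and return the bounds of the window with the lowest RMSECV."""
    bounds = np.linspace(0, X.shape[1], n_intervals + 1, dtype=int)
    rmsecv = []
    for start, stop in zip(bounds[:-1], bounds[1:]):
        pls = PLSRegression(n_components=min(n_components, stop - start))
        y_cv = cross_val_predict(pls, X[:, start:stop], y, cv=cv)
        rmsecv.append(np.sqrt(np.mean((y - y_cv.ravel()) ** 2)))
    best = int(np.argmin(rmsecv))
    return bounds[best], bounds[best + 1], np.array(rmsecv)
```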
In the case of leave-one-out cross-validation, each sample is excluded once and only once, while the remaining samples are used for calibration (see Fig. 9). This means that if the number of samples is N, the cross-validation has to be repeated N times.
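A minimal leave-one-out cross-validation sketch with a PLS model (scikit-learn) follows; the number of components is an illustrative choice.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

def loo_cv(X, y, n_components=5):
    """Each of the N samples is predicted by a model calibrated on the other N-1."""
    y_cv = cross_val_predict(PLSRegression(n_components=n_components),
                             X, y, cv=LeaveOneOut()).ravel()
    rmsecv = np.sqrt(np.mean((y - y_cv) ** 2))
    q2 = 1.0 - np.sum((y - y_cv) ** 2) / np.sum((y - np.mean(y)) ** 2)
    return rmsecv, q2
```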
In this study, Unscrambler X 10.3 (Camo Software, Oslo, Norway) was used for the regression model building and the validation of models.
There is no statistically significant difference between leave-one-out, randomized and systematic fivefold cross-validations. Based on our findings we can therefore choose whichever cross-validation type we want, and the results will not be significantly different. Furthermore, external sample sets alone can give uncertain and biased results with higher error values. However, external sets behave differently; thus, the models should be applicable to them as well. To resolve this, the final decision about the models based on FT-NIR spectra should be made using the predicted values of the external and internal samples together. For this purpose, SRD values can be used successfully and can be considered a novel performance parameter as well. They give consistent, properly validated and reliable results about the models.
The effects of the applied regression models, variable selection techniques and spectral types were significant in each case. We can conclude that the variable selection techniques were useful in the case of transmission spectra and also in the case of the PCR method. Moreover, the effect of variable selection for SVM alone was not significant.
The applied statistical analysis protocol is applicable for other (even special or complicated) datasets as well. The SRD methodology is entirely general and it can be used not just as a performance parameter but in other comparison studies as well.
Footnote
† Electronic supplementary information (ESI) available. See DOI: 10.1039/c8ay01055b