Heng Xu, Wensheng Cai and Xueguang Shao*
Research Center for Analytical Sciences, College of Chemistry, Nankai University, Tianjin, 300071, P. R. China. E-mail: xshao@nankai.edu.cn; Fax: +86-22-23502458; Tel: +86-22-23503430
First published on 15th January 2010
A weighted partial least squares (PLS) regression method for multivariate calibration of near infrared (NIR) spectra is proposed. In the method, the variables of the spectra are split into groups according to a statistic of each variable, i.e., the stability, which has been used to evaluate the importance of variables in a calibration model. Because the stability reflects the relative importance of the variables for modeling, these groups carry different spectral information for the construction of PLS models. Therefore, if a weight proportional to the stability is assigned to each sub-model built with a different group of variables, a combined model can be built as a weighted combination of the sub-models. Unlike the commonly used variable selection strategies, this method makes full use of all variables according to their importance instead of using only the important ones. To validate its performance, the proposed method was applied to two different NIR spectral data sets. The results show that the method can effectively utilize all variables in the spectra and enhance the prediction ability of the PLS model.
Another technique for improving PLS modeling is to deal with redundant variables. Generally, NIR data sets may have thousands of wavelengths, sometimes measured on hundreds or thousands of samples. Not all wavelengths in a spectrum, however, contain equivalent information relevant to the component of interest. Variable selection is a common way to gather the wavelengths that do contain relevant information. Many variable selection methods have been developed, such as genetic algorithms (GA),19,20 uninformative variable elimination by PLS (UVE-PLS),21–25 interval PLS (iPLS),26,27 variable selection based on a randomization test for PLS (RT-PLS),28 and variable selection based on truncation of weight vectors in PLS.29 In our previous work, an integration of the Monte Carlo (MC) technique and UVE was proposed and named MC-UVE.25 These methods can significantly improve the performance of calibration techniques by removing the irrelevant variables. On the other hand, some approaches have been proposed to extract the useful information from all variables regardless of their relevance.30,31 In the spectra of complex samples, some useful information may be embedded in the background and noise components, so it is difficult to determine the relevance of a variable.31,32 These approaches try to improve the quality of the calibration model by weighting all the variables instead of discarding some of them, as is done in wavelength selection methods.
In this study, a combined PLS model with variable grouping based on stability is proposed for multivariate calibration of NIR spectra. In the proposed method, all variables (wavelengths) are grouped by their stability and sub-models are built with the grouped variables. The stability of each variable is calculated in the same way as in MC-UVE.25 The objective is to construct a model with good prediction performance by keeping all wavelengths and making the best use of the information from all variables. To demonstrate the performance of the method, two NIR spectral data sets are investigated. The results indicate that the proposed method is a feasible way to enhance the prediction quality of the PLS model.
In order to use the full spectral information in a calibration model, a method of weighted PLS with variable grouping for NIR spectral analysis is proposed in this study and named variable grouping (VG)-PLS. The VG-PLS model is a combination of the sub-models built with the grouped variables. If w denotes the weights of the sub-models, the linear combination of the sub-models can be represented as:
| $\hat{y} = [\hat{y}_1, \hat{y}_2, \hat{y}_3, \ldots, \hat{y}_n]\,w$ | (1) |
![]() | (2) |
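As a minimal numerical sketch of Eq. (1), assume that the predictions of the n sub-models are stored column-wise and that the weights are already known. The array values and the names `Y_sub`, `w` and `y_hat` are purely illustrative and not taken from the paper, and the weights are assumed here to be normalised to sum to one.

```python
import numpy as np

# Hypothetical predictions of n = 3 sub-models for 4 samples
# (one column per sub-model, one row per sample).
Y_sub = np.array([
    [10.0, 10.4,  9.0],
    [12.0, 11.6, 13.0],
    [ 8.0,  8.2,  7.0],
    [15.0, 14.8, 16.0],
])

# Hypothetical weights, assumed to be normalised to sum to one.
w = np.array([0.6, 0.3, 0.1])

# Eq. (1): the combined prediction is the weighted sum of the
# sub-model predictions.
y_hat = Y_sub @ w
```

With weights that sum to one, each combined prediction stays within the range spanned by the sub-model predictions for that sample.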
The detailed procedure of VG-PLS can be described as follows:
(1) Using the Monte Carlo technique as in MC-UVE,25 the stability of each variable is calculated. The stability is used for grouping the variables and deciding the weights of the groups in the following steps.
(2) The variables in the spectra are ranked in descending order of their stability.
(3) The variables are split into n groups following this order, each group containing almost the same number of variables, and a sub-model is built with each group.
(4) The contributions of the sub-models to the combined model, i.e., the weights, are calculated by eqn (2).
(5) The prediction set is predicted with the combined model: the variables of the spectra to be predicted are split into n groups in the same order as in step (3), n prediction values are produced by the n sub-models, and the final prediction is the weighted sum of the n values as shown in eqn (1).
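Steps (1) to (3) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the stability is taken as the absolute value of the mean of a PLS regression coefficient divided by its standard deviation over the Monte Carlo runs (the usual MC-UVE definition, assumed here), the PLS model is a simplified NIPALS PLS1, and the function names, the 80% sampling fraction and the synthetic data are hypothetical.

```python
import numpy as np

def pls1_coef(X, y, n_lv):
    """Regression vector of a PLS1 model (NIPALS) fitted on mean-centred data."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    W, P, q = [], [], []
    for _ in range(n_lv):
        w = Xc.T @ yc
        w = w / np.linalg.norm(w)    # weight vector
        t = Xc @ w                   # scores
        tt = t @ t
        p = Xc.T @ t / tt            # X loadings
        c = (yc @ t) / tt            # y loading
        Xc = Xc - np.outer(t, p)     # deflate X
        yc = yc - c * t              # deflate y
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q)

def mc_stability(X, y, n_lv, n_runs=100, frac=0.8, seed=0):
    """Step (1): stability of each variable from Monte Carlo sub-sampling.

    Assumed definition: |mean(b_j) / std(b_j)| over the runs.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    k = min(n, max(n_lv + 1, int(frac * n)))
    B = np.empty((n_runs, X.shape[1]))
    for r in range(n_runs):
        idx = rng.choice(n, size=k, replace=False)   # random calibration subset
        B[r] = pls1_coef(X[idx], y[idx], n_lv)
    return np.abs(B.mean(axis=0) / B.std(axis=0))

def group_by_stability(stability, n_groups):
    """Steps (2)-(3): rank by descending stability, split into near-equal groups."""
    order = np.argsort(-stability)
    return np.array_split(order, n_groups)

# Synthetic example: only variable 0 carries information about y.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 12))
y = 5.0 * X[:, 0] + 0.1 * rng.normal(size=60)

stability = mc_stability(X, y, n_lv=2, n_runs=50)
groups = group_by_stability(stability, n_groups=4)
```

Each entry of `groups` holds the column indices of one variable group, so a sub-model for group `g` could then be fitted with, for example, `pls1_coef(X[:, g], y, n_lv)`, using a separately chosen number of latent variables per group.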
Clearly, m (the parameter in eqn (2)) and n (the number of groups) are two important parameters of the combined model; they are discussed in the following sections. A validation set was used to optimize the two parameters.
Data set 2 was supplied by a tobacco corporation and includes the NIR spectra of 2199 tobacco lamina samples together with their sugar and nicotine contents. The spectra were measured on an MPA FT-NIR spectrometer (Bruker, Germany), and the sugar and nicotine contents were measured on an Auto Analyzer III (Bran + Luebbe, Germany) following the industrial standard method. Each spectrum was recorded in the wavenumber range 3999.7–11995.3 cm−1 (2500.2–833.7 nm) with a digitization interval of ca. 3.86 cm−1, giving 2074 data points per spectrum.
Before calculation, multiplicative scattering correction (MSC)9,10 is applied to the spectra to reduce the differences in light scatter between samples. The spectra are divided into calibration, validation and prediction sets by the Kennard-Stone (KS) method.33 For the first data set, 50 and 15 samples are used as the calibration and validation sets, respectively, and the remaining 15 samples are used as the prediction set. For the second data set, 1100 and 550 samples are used as the calibration and validation sets, respectively, and the other 549 samples are used as the prediction set. The calibration set is used for building the PLS model, the validation set for parameter optimization, and the prediction set for external validation of the method. It is worth noting that different numbers of latent variables (LVs) are used for the sub-models, because different groups may contain different information. Monte Carlo cross validation (MCCV) with Osten's F criterion34 is used to determine the LV number.
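The preprocessing and splitting described above can be sketched as follows. This is a minimal illustration, not the authors' code: MSC is implemented as a least-squares fit of each spectrum against the mean spectrum, the Kennard-Stone selection uses Euclidean distances, and the way the three sets are obtained (KS for the calibration set, then KS again on the remaining samples for the validation set) is an assumption, since the text does not specify it. All names and the synthetic spectra are hypothetical.

```python
import numpy as np

def msc(spectra, reference=None):
    """Multiplicative scatter correction against a reference (default: mean spectrum)."""
    ref = spectra.mean(axis=0) if reference is None else reference
    corrected = np.empty_like(spectra, dtype=float)
    for i, s in enumerate(spectra):
        slope, intercept = np.polyfit(ref, s, 1)   # s ~ intercept + slope * ref
        corrected[i] = (s - intercept) / slope
    return corrected

def kennard_stone(X, n_select):
    """Indices of n_select (>= 2) samples chosen by the Kennard-Stone algorithm.

    The full pairwise distance array is built by broadcasting, which is only
    practical for small data sets.
    """
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    i, j = np.unravel_index(np.argmax(dist), dist.shape)   # two farthest samples
    selected = [int(i), int(j)]
    remaining = [k for k in range(len(X)) if k not in selected]
    while len(selected) < n_select:
        # distance of each remaining sample to its nearest selected sample
        d_min = dist[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(d_min))]
        selected.append(pick)
        remaining.remove(pick)
    return selected

# Synthetic "spectra": one base curve with sample-specific scale and offset.
rng = np.random.default_rng(0)
base = np.sin(np.linspace(0.0, 3.0, 20)) + 2.0
a = rng.uniform(0.8, 1.2, size=(80, 1))
b = rng.uniform(-0.1, 0.1, size=(80, 1))
spectra = a * base + b + 0.01 * rng.normal(size=(80, 20))

X = msc(spectra)

# Assumed splitting scheme with the sizes used for data set 1 (50/15/15).
cal = kennard_stone(X, 50)
rest = [k for k in range(80) if k not in cal]
val = [rest[k] for k in kennard_stone(X[rest], 15)]
pred = [k for k in rest if k not in val]
```

For a data set as large as data set 2, the pairwise distance array in this sketch would need to be computed in a more memory-efficient way.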
In order to assign an optimal weight to each sub-model, the parameter m in eqn (2) is investigated. A series of combined models is constructed with different numbers of sub-models, and each is used to predict the validation set. Fig. 1, 2(a) and 2(b) show the variation of the root mean square error of prediction (RMSEP) of the validation set with the parameter m for the prediction of starch, sugar and nicotine, respectively. All three plots show a similar trend: the RMSEP reaches a minimum at m = 2 and increases gradually thereafter. Thus m = 2 is used in this study.
Fig. 1 Variation of the mean RMSEPs and standard deviation with the value of parameter m for data set 1.
Fig. 2 Variation of the mean RMSEPs and standard deviation with the value of parameter m for data set 2 of sugar (a) and nicotine (b).
Moreover, with m = 2 and eight sub-models, the weights w of the sub-models for data set 1 are shown in Fig. 3. The weights clearly decrease along the order of the sub-models, which means that variables with large stabilities make large contributions to the model. This is consistent with the results obtained with the stability-based methods.21–25 On the other hand, the figure also indicates that the variables with small stabilities contribute to the model, although less than the variables with large stabilities. Therefore, VG-PLS, which uses all the variables, should have an advantage in prediction ability.
Fig. 3 Distribution of the weights of the sub-models for data set 1.
As for data set 2, the weights w of the sub-models for sugar and nicotine are shown in Fig. 4(a) and (b), respectively. The distribution of the weights is different from that of data set 1. In Fig. 4, the sub-model constructed with the first variable group has a large weight, the next three sub-models have relatively small weights, and the other sub-models have very small weights. This may be explained by the large number of variables in data set 2 and by the fact that the variables are sorted according to their importance to the model: apart from the first four sub-models, the remaining sub-models consist mainly of the less relevant variables. This also indicates that reasonable weights are calculated for the sub-models, so that the prediction ability of the combined model can be improved with the advantage of using all the variables in the spectra.
Fig. 4 Distribution of the weights of the sub-models for data set 2 of sugar (a) and nicotine (b).
For data set 1, the variation of the RMSEP of the validation set with the number of sub-models is plotted in Fig. 5. Each point in the figure is the average of the RMSEPs over 100 runs, and the error bar is the standard deviation (σ). Fig. 5 shows that the RMSEP reaches its minimum when n is 3. For comparison, the variation of the RMSEP of the validation set obtained by MC-UVE-PLS, where only the variables of the first sub-model are used, is also plotted in the figure. When the number of sub-models is small, the mean values and the standard deviations of VG-PLS and MC-UVE are almost the same. When the number of sub-models increases, however, the mean value and the standard deviation of VG-PLS are clearly smaller than those of MC-UVE. The result indicates that the model built by VG-PLS is improved and more stable than that built by MC-UVE.
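The quantities plotted in such a figure, the mean RMSEP and its standard deviation over repeated runs, can be computed as in the following sketch. The reference values, the three sets of predictions and the names are hypothetical, and the use of the sample standard deviation (`ddof=1`) is an assumption, since the text does not state which estimator was used.

```python
import numpy as np

def rmsep(y_true, y_pred):
    """Root mean square error of prediction."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical reference values and predictions from three repeated runs.
y_ref = np.array([1.0, 2.0, 3.0, 4.0])
runs = np.array([
    [1.0, 2.0, 3.0, 4.0],   # perfect predictions
    [2.0, 3.0, 4.0, 5.0],   # every prediction off by 1
    [3.0, 4.0, 5.0, 6.0],   # every prediction off by 2
])

values = np.array([rmsep(y_ref, p) for p in runs])
mean_rmsep = values.mean()      # height of the plotted point
sigma = values.std(ddof=1)      # length of the error bar (assumed estimator)
```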
Fig. 5 Variation of the mean RMSEPs and standard deviation with different number of groups (n) for data set 1.
In the same way as for data set 1, the variation of the RMSEP of the validation set with the number of sub-models for data set 2 is plotted in Fig. 6, in which (a) and (b) correspond to sugar and nicotine, respectively. In Fig. 6(a), both the mean value and the standard deviation are comparatively large at the beginning. With the increase of n, however, the mean RMSEP decreases gradually and reaches a minimum at n = 8. This indicates that the prediction ability of the combined model is best when it is built with 8 sub-models. If only a few sub-models are built, e.g., 2 or 3, the advantage of grouping cannot be seen, because each group then includes variables with widely different stabilities. On the other hand, if more sub-models are used, fewer variables are included in each sub-model, which may weaken the predictive ability of the sub-models. Fig. 6(b) shows the variation of the RMSEP of the validation set with the number of sub-models for the nicotine content of data set 2. The RMSEP reaches a minimum when n is 10. This number is slightly larger than that for the sugar model, because the number of variables relevant to nicotine is relatively small compared with that for sugar.23,25 Therefore, n = 8 and n = 10 are used as the numbers of sub-models for the sugar and nicotine models of data set 2, respectively.
Fig. 6 Variation of the mean RMSEPs and standard deviation with different number of groups (n) for data set 2 of sugar (a) and nicotine (b).
When compared with MC-UVE-PLS, it can be seen from the two curves in both Fig. 6(a) and (b) that the mean RMSEP of VG-PLS is slightly smaller than that of MC-UVE-PLS. The result reveals that, for data set 2, which is a data set of real complex samples, VG-PLS can obtain better results by making use of all variables with a suitable weighting strategy.
Fig. 7 The mean spectrum of calibration set and the wavelengths distribution in different groups for data set 1.
Fig. 8 The mean spectrum of calibration set and the wavelengths distribution in different groups for data set 2 of sugar (a) and nicotine (b).
Fig. 8(a) and (b) show the mean spectrum of the calibration set and the distribution of the wavelengths in the different groups for the sugar and nicotine models of data set 2, respectively. Due to the complexity of the samples and the large number of groups, the distribution appears complicated. However, for both models, the variables in the first group are mainly located in 4000–6000 cm−1. This result is consistent with that obtained in our previous work using the MC-UVE25 and RT28 methods. Therefore, there is no essential difference between the wavelength selection and the variable grouping strategies. The former uses only the variables that are considered important, whereas the latter adjusts the importance of the variables through the weights.
| Data set | Contents | Model | Number of groups | Number of latent variables (LV) | RMSEP (σ)ᵃ |
|---|---|---|---|---|---|
| 1 | Starch | PLS | 1 | 6 | 1.142 |
| 1 | Starch | MSC + PLS | 1 | 6 | 0.552 |
| 1 | Starch | MSC + MC-UVE-1ᵇ (v = 236) | 1 | 6 | 0.515 (0.049) |
| 1 | Starch | MSC + MC-UVE-2 (v = 233) | 1 | 6 | 0.520 (0.041) |
| 1 | Starch | MSC + VG-PLS | 3 | 6 6 5 | 0.535 (0.029) |
| 2 | Sugar | PLS | 1 | 13 | 1.99 |
| 2 | Sugar | MSC + PLS | 1 | 13 | 1.71 |
| 2 | Sugar | MSC + MC-UVE-1ᵇ (v = 360) | 1 | 13 | 1.61 (0.0056) |
| 2 | Sugar | MSC + MC-UVE-2 (v = 259) | 1 | 13 | 1.63 (0.0019) |
| 2 | Sugar | MSC + VG-PLS | 8 | 13 12 11 10 10 8 8 7 | 1.60 (0.0009) |
| 2 | Nicotine | PLS | 1 | 13 | 0.312 |
| 2 | Nicotine | MSC + PLS | 1 | 13 | 0.303 |
| 2 | Nicotine | MSC + MC-UVE-1 (v = 215) | 1 | 13 | 0.290 (0.0012) |
| 2 | Nicotine | MSC + MC-UVE-2 (v = 207) | 1 | 13 | 0.291 (0.0013) |
| 2 | Nicotine | MSC + VG-PLS | 10 | 13 13 13 12 10 8 8 7 7 7 | 0.290 (0.0012) |

ᵃ RMSEP is the average value and σ is the standard deviation of the 100 RMSEPs. An RMSEP without σ is calculated from a single run because no stochastic factor is involved in the algorithm.
ᵇ MC-UVE-1 denotes the MC-UVE reported in the literature,25 where the number of retained variables is found by searching for the minimal RMSEP at different numbers of variables, whereas in MC-UVE-2 the same variables as in the first sub-model of VG-PLS are used for comparison. v is the number of wavelengths retained by MC-UVE.

This journal is © The Royal Society of Chemistry 2010