Weiting Wanga,
Yonghuan Yun*a,
Baichuan Dengab,
Wei Fanc and
Yizeng Liang*a
aCollege of Chemistry and Chemical Engineering, Central South University, Changsha 410083, P. R. China. E-mail: yizeng_liang@263.net; yunyonghuan@foxmail.com; Fax: +86-731-88830831; Tel: +86-731-88830824
bDepartment of Chemistry, University of Bergen, Bergen N-5007, Norway
cJoint Lab for Biological Quality and Safety, College of Bioscience and Biotechnology, Hunan Agriculture University, Changsha 410128, P. R. China
First published on 27th October 2015
Based on the theory that a large partial least squares (PLS) regression coefficient on autoscaled data indicates an important variable, a novel strategy for variable selection called iteratively variable subset optimization (IVSO) is proposed in this study. In addition, we take into consideration that the optimal number of latent variables generated by cross-validation can make a great difference to the regression coefficients, sometimes by several orders of magnitude. In this work, the regression coefficients generated in every sub-model are normalized to remove this influence. In each iterative round, the regression coefficients of each variable obtained from the sub-models are summed to evaluate its importance level. A two-step procedure including weighted binary matrix sampling (WBMS) and sequential addition is employed to eliminate uninformative variables gradually and gently in a competitive way and to reduce the risk of losing important variables, so that IVSO can achieve high stability. Investigated on one simulated dataset and two NIR datasets, IVSO shows much better prediction ability than two other outstanding and commonly used methods, Monte Carlo uninformative variable elimination (MC-UVE) and competitive adaptive reweighted sampling (CARS). The MATLAB code for implementing IVSO is available in the ESI.
At present, many variable selection methods have been employed for multi-component spectral data. In general, these methods can be classified into two categories: static and dynamic approaches. The static approaches use one criterion for the whole data space, while the dynamic approaches take the result of each iteration into account. The static approaches include t-statistics, the Akaike information criterion (AIC), uninformative variable elimination (UVE),8 Monte Carlo based uninformative variable elimination (MC-UVE),9,10 variable importance in projection (VIP),11 selectivity ratio (SR),12 and moving window partial least squares (MWPLS).13 The dynamic approaches include optimization-algorithm-based methods such as the genetic algorithm (GA),14–16 particle swarm optimization (PSO),17 the firefly algorithm,18 ant colony optimization (ACO),19,20 the gravitational search algorithm (GSA),21 and simulated annealing (SA).22 Variable selection methods such as random forest,23 the successive projection algorithm (SPA),24 iteratively retaining informative variables (IRIV),25 variable combination population analysis (VCPA),26 competitive adaptive reweighted sampling (CARS),27 and interval random frog (iRF)28 are also dynamic approaches. UVE, MC-UVE, CARS, and iRF are based on the principle that the larger the absolute regression coefficient on the autoscaled data is, the more important the variable is.8,29 In addition to the regression coefficient, Kvalheim et al. discussed the use of SR, which can assist in improved algorithms for variable selection in latent variable regression models.30
Among all the methods based on regression coefficients, MC-UVE and CARS are adopted extensively in multivariate calibration models because of their better prediction performance. In both MC-UVE and CARS, the Monte Carlo sampling technique is applied to the sample space to establish a large number of sub-models, which ensures that the number of samples selected randomly for modeling is strictly the same; for example, 80% of all samples are used to build each model. For MC-UVE, after N Monte Carlo sampling runs, each variable is evaluated according to a criterion equal to the ratio of the mean of its regression coefficients to their standard deviation; the variables with small criteria are eliminated. Unlike MC-UVE, in each iterative round CARS removes the variables with small mean regression coefficients forcibly by an exponentially decreasing function (EDF) and competitively by adaptive reweighted sampling (ARS). However, in MC-UVE it is the full spectrum that is used to establish the sub-models, so the regression coefficients of the informative variables can be influenced by the uninformative variables. With regard to CARS, the enforced elimination of variables by the EDF may lose important variables and further result in instability. Hence, in most cases the results achieved by MC-UVE and CARS are not satisfactory enough.
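To make the MC-UVE criterion described above concrete, the following minimal sketch computes the ratio of the mean of each variable's regression coefficients to their standard deviation over repeated Monte Carlo runs. It is an illustrative Python rendering, not the original implementation; the function name, the use of scikit-learn's PLSRegression, and the fixed number of latent variables are assumptions made for the example.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def mcuve_criterion(X, y, n_lv=6, n_runs=500, ratio=0.8, rng=None):
    """MC-UVE stability criterion: |mean(b_i)| / std(b_i) over n_runs sub-models."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    coefs = np.zeros((n_runs, p))
    for j in range(n_runs):
        # Monte Carlo sampling in the sample space, e.g. 80% of the samples per sub-model
        idx = rng.choice(n, size=int(ratio * n), replace=False)
        pls = PLSRegression(n_components=n_lv).fit(X[idx], y[idx])
        coefs[j] = np.ravel(pls.coef_)  # regression coefficient vector of sub-model j
    # variables with a small criterion are candidates for elimination
    return np.abs(coefs.mean(axis=0)) / coefs.std(axis=0)
```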
In this study, a novel strategy for variable selection based on regression coefficients, called iteratively variable subset optimization (IVSO), is proposed. First, we introduce a new random sampling method, named weighted binary matrix sampling (WBMS),31,32 which is an improvement of binary matrix sampling (BMS).25,33 By giving different weights to different variables, WBMS makes variables with larger weights more likely to be chosen; conversely, a variable with a small weight is selected with little or even no probability. Furthermore, by combining WBMS with another strategy called sequential addition, the variables with small criteria are deleted and a new variable subset is generated. After N WBMS runs, N different variable subsets are obtained, and the root mean square error of cross-validation (RMSECV) is used as the objective function to search for the best variable subset. In addition, the regression coefficients of each variable in all sub-models are summed to evaluate its importance level. This data fusion step is a good option, as the noise cancels out while the systematic information accumulates. However, we find that the optimal number of latent variables generated by cross-validation makes a great difference to the regression coefficients, which is consistent with the viewpoint in ref. 34. Thus, the regression coefficients of the same variable in different sub-models cannot be calculated or compared directly owing to this great difference, and a normalization strategy is applied to eliminate the influence. Tested on a simulated dataset and two NIR datasets, IVSO coupled with partial least squares (PLS) demonstrates better prediction ability and higher stability than the two outstanding methods mentioned above, MC-UVE and CARS. The results demonstrate that IVSO eliminates uninformative variables gradually and gently in a competitive way, which avoids the two problems of MC-UVE and CARS discussed above. This proves that IVSO is an efficient method for variable selection in multivariate calibration.
Additionally, it should be noted that although IVSO is evaluated only on NIR datasets with PLS in this study, it is a general strategy that can be combined with other regression and pattern recognition methods and applied to other kinds of datasets, such as metabolomics and quantitative structure–activity relationship (QSAR) data.
It works as follows: assume that the weight of the ith variable is wi. First, a binary matrix M with dimensionality N × P is generated, containing only '1' and '0'. A '1' indicates that the corresponding variable is included for modeling, while a '0' indicates that the variable is not sampled. In the ith column, there are Nwi entries of '1' and the remaining entries are '0'. The procedure is displayed in Fig. 1, where the number of rows of M is set to 5 and the number of columns to 7 for simplicity. During sampling, the weights of some variables may be too small for them to be sampled at all; the first and second columns in Fig. 1 represent this case. As the last column shows, if the weight of a variable is equal to 1, it will be sampled in every run. Next, each column of M is permuted to generate a new binary matrix NM. Remarkably, after the permutation, the number of '1's and '0's in each column remains unchanged.
In the matrix NM, each row represents one sampling process for building a sub-model. It follows that when Nwi of a variable is less than 1, the variable is eliminated.
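As a concrete illustration, the following minimal sketch generates a WBMS matrix in the way just described: column i receives round(N·wi) ones and each column is then permuted independently. This is a simplified Python rendering (the authors' MATLAB code is available in the ESI), and the function and variable names are illustrative.

```python
import numpy as np

def wbms(weights, n_runs, rng=None):
    """Weighted binary matrix sampling: column i holds round(N * w_i) ones, randomly placed."""
    rng = np.random.default_rng(rng)
    p = len(weights)
    nm = np.zeros((n_runs, p), dtype=int)
    for i, w in enumerate(weights):
        n_ones = int(round(n_runs * w))     # a variable with N*w_i < 1 is never sampled
        col = np.zeros(n_runs, dtype=int)
        col[:n_ones] = 1
        nm[:, i] = rng.permutation(col)     # permuting keeps the number of ones unchanged
    return nm                               # each row: one sampling run (1 = variable included)
```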
T = XW    (1)

y = Tc + e = XWc + e = Xb + e    (2)
In addition, the matrix X needs to be autoscaled to guarantee that each variable has the same variance before modeling. It should be noted that the regression coefficients mentioned in this study are converted into absolute values before the calculation. Thereafter, the larger the absolute regression coefficient is, the more important the variable is.
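For clarity, the sketch below shows one way to obtain the absolute PLS regression coefficients on autoscaled data, i.e. the vector b of eqn (2) taken in absolute value. The use of scikit-learn's PLSRegression and the helper name are assumptions made for this illustration, not the authors' implementation.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def abs_pls_coefficients(X, y, n_lv):
    """Absolute PLS regression coefficients of y = Xb + e, computed on autoscaled data."""
    # autoscaling: every variable gets zero mean and unit variance
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    ys = (y - y.mean()) / y.std(ddof=1)
    pls = PLSRegression(n_components=n_lv, scale=False).fit(Xs, ys)
    return np.abs(np.ravel(pls.coef_))  # absolute values serve as importance measures
```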
Moreover, it is found that the optimal number of latent variables generated by cross-validation makes a great difference to the regression coefficients, and sometimes the differences can even span several orders of magnitude.34 Thus, the regression coefficients of the same variable in different sub-models may change a great deal and cannot be calculated or compared directly. In this study, we employ normalization to remove this influence. Assume that after building N sub-models, a regression coefficient matrix B (B = [b1, b2, …, bN]T) is generated. The jth row vector of B, denoted by bj (1 ≤ j ≤ N), records the regression coefficients of the jth sub-model. The elements of B are set to 0 if the corresponding variables are not included in the sub-models. The regression coefficient bij of the ith variable in the jth sub-model is normalized as follows:
cij = bij/max(bj)    (3)
In each iterative round, the weight of the ith variable is defined as follows, where si is the sum of the normalized regression coefficients of the ith variable over all sub-models (i.e. the sum of the ith column of the matrix C):
wi = si/max(s), i = 1, 2, …, P    (4)
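The normalization of eqn (3) and the weight calculation of eqn (4) can be summarized in a few lines, as in the sketch below; the function name is illustrative and the input is assumed to be the matrix B of absolute regression coefficients defined above.

```python
import numpy as np

def coefficients_to_weights(B):
    """Eqns (3) and (4): row-wise normalization of B, column sums as criteria, rescaled weights."""
    # B: N x P matrix of absolute regression coefficients; entries of variables
    # that were not included in a sub-model are set to 0
    C = B / B.max(axis=1, keepdims=True)   # eqn (3): c_ij = b_ij / max(b_j)
    s = C.sum(axis=0)                      # criterion s_i of the ith variable
    return s / s.max()                     # eqn (4): w_i = s_i / max(s)
```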
In each iterative round, sequential addition selects a variable subset containing informative variables. Thus, if some important variables are lost by WBMS, they can still be retained in the variable subsets of previous rounds through sequential addition. When selecting the best variable subset among all iterative rounds, these lost variables therefore still have the opportunity to be included in the ultimate variable subset. In this way, the loss of important variables can be avoided; for the same reason, IVSO possesses high stability. Overall, IVSO eliminates variables gradually and gently in a competitive way and reduces the risk of losing important variables.
Step 1: create a binary matrix NM with dimensionality N × P by WBMS for N sampling runs. In each column of NM, there are Nwi entries of '1' and the remaining entries are '0'. If Nwi of a variable is less than 1, the variable is not sampled in any row. Record the number of variables that can be sampled by WBMS, denoted L1.
Step 2: build N PLS sub-models to calculate the regression coefficient matrix B. Each row of the matrix B is normalized to generate the matrix C, as in formula (3).
Step 3: each column of the matrix C is summed to give the criterion of the corresponding variable, denoted by the vector s. Rank the L1 variables according to their criteria.
Step 4: build L1 sub-models through sequential addition according to the ranking. Take the variable subset of the sub-model with the lowest RMSECV value as the objective variable subset of this iterative round. Record this RMSECV value R and the length of this variable subset L2.
Step 5: the vector s is normalized to calculate the weights as in formula (4). The weights obtained in this iterative round are only used for the sampling of the next iterative round.
Step 6: repeat steps 1–5 until L1 is equal to L2; many variable subsets are then obtained. The variable subset with the lowest R value is chosen as the ultimate variable subset of the algorithm.
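Putting the pieces together, the sketch below outlines the whole IVSO loop (steps 1–6) using the helper functions wbms, abs_pls_coefficients, and coefficients_to_weights sketched earlier. The cross-validation routine, the fixed maximum number of latent variables, and the run counts are simplifying assumptions; the authors' exact MATLAB implementation is available in the ESI.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict

def rmsecv(X, y, n_lv, cv=5):
    """Root mean square error of cross-validation for a PLS model on the given variables."""
    pls = PLSRegression(n_components=min(n_lv, X.shape[1]))
    y_hat = np.ravel(cross_val_predict(pls, X, y, cv=cv))
    return float(np.sqrt(np.mean((y_hat - y) ** 2)))

def ivso(X, y, n_lv=10, n_runs=100, ratio=0.8, rng=None):
    rng = np.random.default_rng(rng)
    n, p = X.shape
    weights = np.ones(p)                                # round '0': every variable has weight 1
    best_subset, best_r = np.arange(p), np.inf
    while True:
        nm = wbms(weights, n_runs, rng)                 # step 1: WBMS sampling matrix
        candidates = np.flatnonzero(nm.sum(axis=0))     # the L1 variables that can be sampled
        B = np.zeros((n_runs, p))
        for j in range(n_runs):                         # step 2: N PLS sub-models
            vars_j = np.flatnonzero(nm[j])
            if vars_j.size == 0:
                continue
            obs_j = rng.choice(n, size=int(ratio * n), replace=False)
            B[j, vars_j] = abs_pls_coefficients(X[np.ix_(obs_j, vars_j)], y[obs_j],
                                                min(n_lv, vars_j.size))
        weights = coefficients_to_weights(B)            # steps 3 and 5: criteria and new weights
        order = candidates[np.argsort(-weights[candidates])]
        # step 4: sequential addition - add variables one by one in decreasing importance
        rmse = [rmsecv(X[:, order[:k]], y, n_lv) for k in range(1, order.size + 1)]
        k_best = int(np.argmin(rmse))
        subset, r = order[:k_best + 1], rmse[k_best]
        if r < best_r:                                  # keep the best subset over all rounds
            best_r, best_subset = r, subset
        if candidates.size == subset.size:              # step 6: stop when L1 equals L2
            return best_subset, best_r
```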
Table 1 lists the results of the three methods on the three datasets; the mean and 95% confidence interval are given as well. The simulated dataset is investigated by comparing MC-UVE, CARS, the full spectrum, and the first 100 informative variables. Compared with the full spectrum, using only the first 100 variables drops the RMSEC value from 0.0644 to 0.0091 and the RMSEP value from 0.4043 to 0.0135, even with a smaller number of latent variables (6). This demonstrates the importance and necessity of variable selection in multivariate calibration.
| Methods | nVar^a | nLVs^b | RMSEC Min–Max | RMSEC Average ± CI^c | RMSEP Min–Max | RMSEP Average ± CI |
|---|---|---|---|---|---|---|
| **Simulated** | | | | | | |
| PLS^d | 200 | 10 | | 0.0644 | | 0.4043 |
| PLS^e | 100 | 6 | | 0.0091 | | 0.0135 |
| MC-UVE | 78.2 ± 6.3 | 6 | 0.0087–0.0093 | 0.0089 ± 8.3 × 10−5 | 0.0130–0.0135 | 0.0132 ± 5.8 × 10−5 |
| CARS | 27.6 ± 4.8 | 6.3 ± 0.7 | 0.0080–0.0124 | 0.0100 ± 3.1 × 10−4 | 0.0118–0.0197 | 0.0155 ± 5.3 × 10−4 |
| IVSO | 68.1 ± 2.1 | 6 | 0.0090–0.0093 | 0.0091 ± 1.5 × 10−5 | 0.0120–0.0137 | 0.0125 ± 8.7 × 10−5 |
| **Corn moisture** | | | | | | |
| PLS | 700 | 10 | | 0.017 | | 0.0237 |
| MC-UVE | 70.4 ± 2.6 | 10 | 2.4 × 10−3 to 3.0 × 10−3 | 2.7 × 10−3 ± 4.1 × 10−5 | 2.8 × 10−3 to 3.7 × 10−3 | 3.2 × 10−3 ± 4.0 × 10−5 |
| CARS | 3.4 ± 2.7 | 3.1 ± 1.9 | 2.4 × 10−4 to 2.7 × 10−3 | 4.6 × 10−4 ± 1.8 × 10−4 | 3.4 × 10−4 to 4.5 × 10−3 | 6.4 × 10−4 ± 2.8 × 10−4 |
| IVSO | 2.3 ± 0.8 | 2.3 ± 0.8 | 2.6 × 10−4 to 2.8 × 10−4 | 2.8 × 10−4 ± 1.1 × 10−6 | 3.3 × 10−4 to 3.6 × 10−4 | 3.4 × 10−4 ± 1.4 × 10−6 |
| **Wheat protein** | | | | | | |
| PLS | 175 | 10 | | 0.3923 | | 0.2382 |
| MC-UVE | 10.6 ± 1.3 | 9.9 ± 0.3 | 0.3370–0.3657 | 0.3475 ± 0.0012 | 0.2466–0.2791 | 0.2532 ± 0.0027 |
| CARS | 9.8 ± 2.8 | 8.2 ± 1.3 | 0.2501–0.3427 | 0.2969 ± 0.0054 | 0.1818–0.3535 | 0.2432 ± 0.0111 |
| IVSO | 14.8 ± 3.0 | 7.5 ± 1.0 | 0.2415–0.2695 | 0.2641 ± 0.0030 | 0.2313–0.2363 | 0.2339 ± 0.0009 |

^a The number of selected variables. ^b The number of selected latent variables of PLS. ^c 95% confidence interval (CI). ^d Results using the full spectrum with 200 variables by PLS. ^e Results using only the first 100 informative variables by PLS.
The statistical features of the results obtained by the different methods can be observed visually in the boxplots of Fig. 4, in which Fig. 4A displays the results on the simulated dataset. As can be seen from Table 1 and Fig. 4A, IVSO shows the best performance with regard to both improving the prediction ability of the model and maintaining good stability. The 95% confidence intervals of the RMSEC and RMSEP results show that IVSO has no overlap with the other methods. In addition, the number of latent variables selected by IVSO is 6, which is much smaller than that of the full spectrum.
The frequency distribution of the selected variables over 50 runs is displayed in Fig. 5. For all methods the selected variables concentrate in the first 100 variables. Both IVSO and MC-UVE can select variables with high frequencies. However, the variables selected by CARS have low frequencies, and not even one variable is selected in all 50 runs, which reveals its instability. This fact is consistent with its large confidence interval in Table 1 and its standard deviation in Fig. 4A.
Fig. 5 The frequencies of variables selected by different methods within 50 runs on the simulated dataset. (A) MC-UVE; (B) CARS; (C) IVSO.
Fig. 6A and B show the changing trend of the number of variables sampled by IVSO and CARS, respectively. The arrow indicates the point at which the optimal variable subset is reached. As MC-UVE uses the full spectrum to establish its sub-models, no iterative rounds occur for it. For the simulated dataset in Fig. 6A, the number of sampled variables decreases to 100 in the 3rd and 4th iterative rounds, and then the curve drops much more slowly. In stark contrast, the number of variables sampled for the simulated dataset in Fig. 6B varies tremendously in the early part of the curve. In the first iterative round alone the number decreases to 97, which means that even the first round removes not only uninformative but also informative variables. In the 11th iterative round CARS reaches its optimal variable subset containing only 26 variables. It can be concluded that CARS eliminates variables too quickly and thus loses informative variables, which may result from the enforced elimination of variables by the EDF. On the contrary, IVSO eliminates uninformative variables gradually and gently, and achieves much higher stability.
Fig. 7 shows the RMSECV value of the variable subset chosen by sequential addition in each iterative round, corresponding to the sampling curve of the simulated dataset in Fig. 6A. The '0' iterative round stands for the process in which the weights of all variables are set to '1' as initial values and all variables are subjected to sequential addition. In Fig. 7A, the RMSECV value of the simulated dataset drops at first because uninformative variables still exist and are being removed, and begins to rise again when informative variables start to be lost. This demonstrates the good ability of IVSO to eliminate uninformative variables while keeping informative ones.
The frequencies of the variables selected by these methods are displayed in Fig. 8. Both CARS and IVSO mainly select the variables at 1908 nm and 2108 nm, which have been discussed and proven experimentally and theoretically to be the key wavelengths in the CARS literature.27 These two wavelengths are related to water absorption and the combination band of the O–H bond.13 CARS, however, cannot select the key wavelength at 2108 nm in every run. MC-UVE, in contrast, selects too many other variables with high frequencies besides these two key variables.
Fig. 8 The frequencies of variables selected by different methods within 50 runs on the corn moisture dataset. (A) MC-UVE; (B) CARS; (C) IVSO.
From the corn moisture dataset in Fig. 6A and B, we can also see that the number of variables sampled by IVSO in the early rounds drops much more gradually and gently than that of CARS. In the later rounds, although this number changes more quickly for IVSO, the key variables at 1908 nm and 2108 nm can still be retained in every iterative round owing to sequential addition, whereas CARS cannot achieve this.
In Fig. 7B, IVSO first reaches the optimal variable subset containing the two key variables in the 4th iterative round. The optimal variable subset then remains unchanged, so the RMSECV value is stable. From the RMSECV values, we can conclude that the strategy of sequential addition used in every iterative round makes the result stable.
Fig. 9 The frequencies of variables selected by different methods within 50 runs on the wheat protein dataset. (A) MC-UVE; (B) CARS; (C) IVSO.
Although IVSO is coupled with partial least squares (PLS) to select variables in this study, it can also be combined with other regression or pattern recognition methods. Our future work will focus on investigating the application of IVSO in other fields, such as metabolomics and quantitative structure–activity relationship (QSAR) studies.
Footnote
† Electronic supplementary information (ESI) available. See DOI: 10.1039/c5ra08455e