Xueguang Shao*, Xihui Bian, Jingjing Liu, Min Zhang and Wensheng Cai
Research Center for Analytical Sciences, College of Chemistry, Nankai University, Tianjin, 300071, China. E-mail: xshao@nankai.edu.cn; Fax: +86-22-23502458; Tel: +86-22-23503430
First published on 4th October 2010
Near infrared (NIR) spectroscopy has been demonstrated as a powerful technique for both qualitative and quantitative analysis of complex systems in various fields. Calibration, however, is one of the important techniques needed to ensure the quality and practicability of the analyses. In this mini-review, recent developments in multivariate calibration methods for NIR spectroscopic analysis, including non-linear approaches and ensemble techniques, are briefly summarized. The advantages and disadvantages of these methods are compared and discussed critically.
Xueguang Shao graduated from Liaocheng University in 1984 and received his PhD in chemistry from the University of Science and Technology of China (USTC) in 1992. He worked at USTC from 1992 to 2005, and is now a professor of chemistry at the College of Chemistry, Nankai University. His research interest is the development of chemometric methods for chemical studies, including analytical signal processing, modeling and molecular simulations.
Xihui Bian received her B.S. degree from Tianjin Polytechnic University in 2006 and her M.S. degree from Tianjin University in 2008. She is a PhD student of analytical chemistry in the laboratory of Chemoinformatics, Nankai University. Her research interest is the development of new chemometric methods for near infrared and Raman spectral analysis.
Jingjing Liu gained her B.S. degree (2005) from Anyang Normal University and her M.S. degree (2008) from Henan Normal University. She is pursuing a PhD degree in the laboratory of Chemoinformatics, Nankai University. Her research interest is the development of new chemometric methods for classification and calibration.
Min Zhang graduated in 2008 from Tianjin University with a B.S. degree in chemical engineering and technology. She is currently pursuing an M.S. degree in analytical chemistry at the laboratory of Chemoinformatics, Nankai University. Her research interest is the development of new chemometric methods for modeling near infrared spectra of complex samples.
Wensheng Cai graduated from Anhui University in 1985 and received her PhD in chemistry from the University of Science and Technology of China (USTC) in 1994. She worked at USTC from 1988 to 2005, and is now a professor of chemistry at the College of Chemistry, Nankai University. Her research interests are chemometrics and molecular simulations.
The main role of multivariate calibration is to establish a regression model relating the measured NIR signals to certain properties of samples. The quantitative model is then applied to predict the same properties of samples not included in the calibration set from their measured responses. Many methods, such as multiple linear regression (MLR), principal component regression (PCR), partial least squares (PLS), artificial neural networks (ANN) and support vector machines (SVM), have been used for multivariate calibration in NIR spectral analysis.1–4 However, the reliability and accuracy of the predicted results depend to a great extent on the quality of the model. Therefore, much effort has been made to improve the quality of the models, including the optimization of calibration datasets,5 spectral signal processing,6 variable selection7 and the development of modeling strategies.8,9
Among the multivariate calibration methods, PLS has been the most commonly used modeling technique due to its practicability and versatility. PLS can be thought of as a combination of principal component analysis (PCA) and MLR; it is therefore able to handle data with strong collinearity and noise, as well as situations where the number of variables considerably exceeds the number of available samples. Furthermore, the only parameter that needs to be optimized during modeling is the number of principal components (PCs) or latent variables (LVs). However, PLS is a linear technique and a global calibration method. The non-linearity of NIR spectra, e.g., caused by optical scattering in diffuse reflectance spectra or by interactions between the analytes in high-concentration samples, may give PLS models poor predictive ability. Moreover, the calibration samples must be adequate for a PLS model to give accurate and stable predictions. In this mini-review, recent advances in the modeling of NIR spectral data, including non-linear approaches and ensemble techniques, are summarized, and the advantages and disadvantages of these methods are discussed.
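To make the role of the latent variable number concrete, the following is a minimal sketch of a single-response PLS fit in the NIPALS style, written with NumPy. The function names `pls1_fit` and `pls1_predict` and the synthetic data are illustrative assumptions, not code from the reviewed works; the example only shows that a PLS model can be built when the variables (200) far outnumber the samples (20) and are strongly collinear.

```python
import numpy as np

def pls1_fit(X, y, n_lv):
    """Fit a one-response PLS model with n_lv latent variables (illustrative sketch)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean            # mean-centred working copies
    W, P, q = [], [], []
    for _ in range(n_lv):
        w = Xr.T @ yr
        w = w / np.linalg.norm(w)              # weight: direction of maximal covariance with y
        t = Xr @ w                             # scores of the current latent variable
        tt = t @ t
        p = Xr.T @ t / tt                      # X loadings
        q_k = (yr @ t) / tt                    # y loading
        Xr = Xr - np.outer(t, p)               # deflate X and y before the next LV
        yr = yr - q_k * t
        W.append(w)
        P.append(p)
        q.append(q_k)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)     # regression vector for the original variables
    return coef, x_mean, y_mean

def pls1_predict(model, X):
    coef, x_mean, y_mean = model
    return (X - x_mean) @ coef + y_mean

# Synthetic data (assumption): 20 samples, 200 collinear variables, 3 underlying factors.
rng = np.random.default_rng(0)
T = rng.normal(size=(20, 3))
X = T @ rng.normal(size=(3, 200))
y = T @ np.array([1.0, -2.0, 0.5])
model = pls1_fit(X, y, n_lv=3)
rmse = np.sqrt(np.mean((pls1_predict(model, X) - y) ** 2))
print(f"calibration RMSE with 3 LVs: {rmse:.2e}")
```

In practice the number of LVs would be chosen by cross validation rather than fixed in advance as it is here.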
ANN, a mathematical model based on biological neural networks, can be viewed as a universal model-free approximator that can represent any non-linear function with sufficient accuracy by seeking the proper combination of several sigmoid functions. For this reason, ANN is often used to solve problems with non-linear relationships. However, the practical use of ANN is hampered by the difficulty of optimizing the tunable parameters, the low speed of the training process and the problem of over-fitting. To address these difficulties, genetic algorithm (GA), a powerful optimization technique, has been used to optimize the network topology.10,11 In addition, variable selection and dimension reduction methods based on GA have been developed to simplify and speed up the ANN method.12
SVM, a machine learning method arising from statistical learning theory and based on structural risk minimization, has been highlighted as a replacement for ANN due to advantages such as its good generalization ability and reliability for multivariate problems, especially those with small sample sizes. SVM consists essentially of support vector classification (SVC) and support vector regression (SVR). The theory and applications of the former are mature, while the latter is relatively weak in both scope and depth. To make SVM easier to understand, Liang et al.13 presented the basic idea and its applications in chemistry, and Brereton et al.3 gave a summary of SVM for classification and regression. Although a powerful tool for regression, SVR has only recently been used for NIR spectroscopic analysis. Chan et al.14 measured trace gas components by using SVR, and the results showed high feasibility and reliability. Ji et al.15 used ν-SVR to construct a calibration model between the soluble solids content of apples and acousto-optic tunable filter (AOTF) NIR spectra. Compared with PLS and BP-ANN, ν-SVR was superior, especially with fewer samples and with noise-polluted spectra. Yan et al.16 proposed adaptive weighted least squares support vector machine regression (AWLS-SVM) to eliminate the influence of unavoidable outliers on the model. SVR for functional data analysis (SVR-FDA) has also been introduced recently and used for nonlinear multivariate calibration problems.17 However, it should be pointed out that SVR has limitations and disadvantages when dealing with spectral datasets. On the one hand, there is no theoretical guide for the optimization of the parameters, which may limit its application. On the other hand, the method offers little benefit for large datasets, whose training is time-consuming compared with PLS. Therefore, SVR can be seen as a competitive and promising method for modeling small, nonlinear NIR datasets.
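As an illustration of kernel-based regression, the sketch below implements the plain least-squares variant (LS-SVM regression), in which training reduces to solving one linear system. This is a simplified stand-in and not ε-SVR, ν-SVR or the adaptive weighted method cited above. The function names, the RBF kernel and the values of `gamma` (regularization) and `sigma` (kernel width) are assumptions chosen for the toy data; as noted in the text, there is no theoretical guide for setting them.

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    d2 = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def lssvm_fit(X, y, gamma, sigma):
    """Solve the LS-SVM system [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]."""
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(X, X, sigma) + np.eye(n) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]                     # bias b and dual coefficients alpha

def lssvm_predict(X_train, b, alpha, sigma, X_new):
    return rbf_kernel(X_new, X_train, sigma) @ alpha + b

# Toy non-linear calibration problem (assumption): y = sin(x) sampled at 30 points.
x_train = np.linspace(0.0, 2.0 * np.pi, 30)[:, None]
y_train = np.sin(x_train).ravel()
b, alpha = lssvm_fit(x_train, y_train, gamma=1e4, sigma=1.0)
x_new = np.array([[1.0], [2.5], [4.0]])
pred = lssvm_predict(x_train, b, alpha, 1.0, x_new)
print(np.abs(pred - np.sin(x_new).ravel()).max())
```

The linear system has one row and column per calibration sample, so its cost grows quickly with the number of samples, which is consistent with the remark above about large datasets.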
By incorporating non-linear transformation techniques into the PLS framework, non-linear versions of PLS, including Polynomial PLS (PPLS),18 Spline PLS (SPLS),19 Quadratic Fuzzy PLS (QFPLS),20 ANN-NLPLS21 and kernel PLS (KPLS),22,23 were developed. These methods can successfully approximate complex non-linear relationships while retaining the power of PLS to combat the over-fitting of linear models in the presence of excessive variability. The basic idea of PPLS, SPLS and QFPLS is to replace the inner linear function in PLS with a non-linear function based on the quadratic function, spline function and quadratic fuzzy function, respectively. Although these non-linear inner functions are more advanced than the linear one, they have drawbacks, including the complexity of implementation and over-fitting of the parameters of the regression model. In ANN-NLPLS, spectra are first transformed from the input layer into the hidden-layer outputs following the ANN architecture, and a model is then constructed by PLS between the outputs of the hidden layer and the concentrations of the components. Results revealed that ANN-NLPLS offers a substantial improvement in the ability to model non-linearity while circumventing the over-fitting frequently encountered in ANN. KPLS takes advantage of the kernel function of SVM: the original data are mapped into a high-dimensional feature space by a kernel function, and the model is built with the kernel-transformed data. The advantage of this method is that it essentially requires only linear algebra, making it as simple as standard PLS, and that it can handle a wide range of non-linearity through different kinds of kernel functions and their adjustable parameters.
A nonlinear relationship can be regarded as a combination of a linear part and a nonlinear part. Based on this consideration, a mixed method combining GA-PLS and PC-ANN was proposed for building the nonlinear NIR model in the analysis of protein in milk powder.24 In this method, GA was first used to select the calibration region, a PLS model was then used to predict the linear part, and the principal components (PCs) were taken as the input of an ANN model to predict the nonlinear part.
Another approach to avoiding non-linearity is local regression, which is based on the idea that quantitative predictions can be obtained from a large library of samples by identifying library samples spectrally similar to the unknown sample and deriving a prediction from these similar samples only. Five classical local regression methods, i.e., Comparison Analysis using Restructured Near infrared and Constituent Data (CARNAC), Locally Weighted Regression (LWR), LOCAL, Locally Biased Regression (LBR) and Local Lazy Regression (LLR), have been developed and reviewed.25 Perez-Marin et al.26 compared three non-linear approaches, LS-SVM, CARNAC and LBR, and the best result was obtained by using CARNAC. In one of our previous works,27 wavelet transform (WT) was introduced into local regression. Discrete wavelet transform (DWT) was first utilized to compress the NIR spectra, and then the calibration sub-sets were individually selected for each prediction sample according to the Euclidean distance in the wavelet domain. Most recently, a new local approach named Local Central Algorithm (LCA) was proposed,28 in which the Mahalanobis distance in PC space is used to select the calibration set and the final prediction is calculated using a central tendency statistic, the mean of the local neighbours. Local regression can avoid the nonlinearity in large datasets that is caused by large concentration differences; however, further studies are still needed on the selection of the sub-set and on the way the subsequent regression is performed.
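The neighbour-selection idea described for LCA can be sketched as follows: project the library onto a few PCs, rank library samples by Mahalanobis distance to the unknown sample in that space, and average the reference values of the nearest ones. This is an illustrative reading of the description above, not the published algorithm; the function name `local_mean_predict`, the choices `n_pc` and `k`, and the synthetic library are assumptions.

```python
import numpy as np

def local_mean_predict(X_lib, y_lib, x_new, n_pc=3, k=5):
    """Predict x_new as the mean reference value of its k nearest library samples in PC space."""
    mean = X_lib.mean(axis=0)
    Xc = X_lib - mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_pc].T                            # loadings of the first n_pc PCs
    T = Xc @ V                                 # library scores
    t_new = (x_new - mean) @ V                 # scores of the unknown sample
    # PC scores are uncorrelated, so the Mahalanobis distance reduces to a
    # Euclidean distance on scores scaled by their standard deviations.
    std = T.std(axis=0, ddof=1)
    d = np.linalg.norm((T - t_new) / std, axis=1)
    idx = np.argsort(d)[:k]                    # indices of the local neighbours
    return y_lib[idx].mean(), idx

# Synthetic library (assumption): one absorption band whose height follows concentration.
rng = np.random.default_rng(1)
conc = np.linspace(0.0, 1.0, 50)
band = np.exp(-0.5 * ((np.arange(100) - 50) / 10.0) ** 2)
X_lib = np.outer(conc, band) + rng.normal(scale=1e-3, size=(50, 100))
x_unknown = 0.5 * band                         # unknown sample with true value 0.5
# One PC is enough for this one-factor toy data.
pred, neighbours = local_mean_predict(X_lib, conc, x_unknown, n_pc=1, k=5)
print(pred, sorted(conc[neighbours]))
```

Replacing the final mean with a regression fitted on the selected neighbours would give an LWR-like variant; which choice works better is the open question raised at the end of the paragraph above.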
There are three basic questions in ensemble modeling, i.e., the basic algorithm, the generation of the individual models (or member models) and their integration. Theoretically, any modeling technique, such as MLR, PLS, SVR or ANN, can be used as the basic algorithm, but PLS is the most commonly used one. The methods for generating the member models include re-sampling,9,30,31 blocking,32 clustering33,34 and signal decomposition such as wavelet transform.35–37 Integration of the member models is the key to ensemble modeling, and is generally implemented by combining, or statistically selecting from, the results obtained by all the member models. The former is realized by simple average38 or weighted average,39 and the latter by weighted median.30 Different criteria, such as prediction error,40 prediction residual error sum of squares (PRESS),36 hat matrix,41 nonnegative least squares,39 etc., can be used to determine the weights for the member models.
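The integration step can be illustrated with a few lines of NumPy. The sketch below shows a weighted average and a weighted median of member predictions; deriving the weights as normalized inverse PRESS values is an assumption made for illustration (the cited methods define their own weighting formulas), and all function names are hypothetical.

```python
import numpy as np

def press_weights(press):
    """Normalized inverse-PRESS weights (an illustrative choice, not a cited formula)."""
    inv = 1.0 / np.asarray(press, dtype=float)
    return inv / inv.sum()

def weighted_average(preds, weights):
    """preds has shape (n_models, n_samples); returns one prediction per sample."""
    return np.asarray(weights, dtype=float) @ np.asarray(preds, dtype=float)

def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half of the total weight."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    cum = np.cumsum(weights[order])
    return values[order][np.searchsorted(cum, 0.5 * cum[-1])]

# Three member models predicting two samples; the third model has twice the PRESS.
member_preds = np.array([[1.0, 1.0], [2.0, 2.0], [4.0, 4.0]])
w = press_weights([1.0, 1.0, 2.0])
print(w, weighted_average(member_preds, w))
print(weighted_median([1.0, 2.0, 100.0], [1.0, 1.0, 1.0]))
```

The weighted median ignores how far an extreme member prediction lies from the rest, which is why it is a selection rather than a combination of the member results.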
Both theoretical and empirical studies have shown that an ensemble is valid only if the member models encode different aspects of the whole calibration space and have a certain accuracy, which implies a trade-off between so-called diversity and accuracy. Different ways of achieving this trade-off have given rise to different ensemble methods. Two popular approaches for creating diversity are bagging and boosting.42 In bagging, each member model is independent, i.e., samples in a calibration set are picked out randomly (with replacement) in the generation of member models, while in boosting, each member model is dependent on the previous ones, i.e., samples in a calibration set are picked out (with replacement) according to the probability obtained from the previous member models under the rule of “poor prediction result with high probability”. Simple average and weighted median are generally used for the ensemble prediction in the two strategies, respectively. Methods such as boosting PLS,40 boosting SVR43 and consensus LS-SVR9 have been proposed and applied in the quantitative analysis of NIR spectra, showing better accuracy and stability than methods based on a single model. However, it is worth noting that the sample selection in boosting is sensitive to outliers, since the samples with high probability are always selected in the construction of member models. Although robust versions of boosting PLS have been developed to overcome the effect of outliers,44,45 efforts to improve the technique are still needed.
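A minimal sketch of the bagging strategy is given below: each member model is fitted to a bootstrap sample drawn with replacement, and the ensemble prediction is the simple average. For brevity the base learner here is ordinary least squares rather than PLS, which is an assumption of this example; the function names and the synthetic data are likewise hypothetical.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with intercept (stand-in for the base algorithm)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return coef[0] + X @ coef[1:]

def bagging_fit(X, y, n_models=50, seed=0):
    """Fit n_models member models, each on a bootstrap sample of the calibration set."""
    rng = np.random.default_rng(seed)
    n = len(y)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)       # draw n samples with replacement
        models.append(fit_linear(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Simple average of the member predictions."""
    return np.mean([predict_linear(m, X) for m in models], axis=0)

# Synthetic calibration data (assumption): 40 samples, 3 variables, small noise.
rng = np.random.default_rng(3)
true_coef = np.array([2.0, -1.0, 0.5])
X_cal = rng.normal(size=(40, 3))
y_cal = 1.0 + X_cal @ true_coef + rng.normal(scale=0.05, size=40)
X_test = rng.normal(size=(5, 3))
models = bagging_fit(X_cal, y_cal)
pred = bagging_predict(models, X_test)
print(np.abs(pred - (1.0 + X_test @ true_coef)).max())
```

A boosting variant would replace the uniform draw in `bagging_fit` with sampling probabilities updated from the errors of the previous member models, and the final average with a weighted median.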
Recently, an alternative ensemble scheme based on a self-organizing map (SOM), named SOMEPLS, was proposed.33 In this method, all the calibration samples are clustered into the neurons by SOM, and then samples from different neurons are chosen to form a calibration subset. The results indicated that SOMEPLS can achieve better prediction accuracy than reference algorithms without increasing the complexity of the corresponding calibration model. However, it is not easy to determine the number of neurons in practical use, and the performance of the algorithm depends on the size of the calibration set.
NIR spectra are generally composed of hundreds or thousands of wavelengths. Another approach, which generates member models from the spectra at different wavelengths, has therefore been proposed. Li et al.46 developed a random subspace (RS) based ensemble regression using MLR, which clearly improved the accuracy and robustness. Nevertheless, some member models were found to be very poor in predictive accuracy, because the wavelengths were selected in a completely random way. In order to reduce this blindness, controlled randomization was proposed for the wavelength selection.34 In this approach, variable clustering (VC) is also introduced to assist subspace generation. Using a similar idea, a weighted PLS based on a variable grouping strategy was developed in our work.47 Instead of using randomly selected wavelengths, ensemble models based on wavelength intervals may maintain the multi-channel advantage of multivariate calibration. Jiang et al.39 reported a method named Monte Carlo cross validation (MCCV) stacked regression, in which interval PLS (iPLS) is used to split the wavelengths into different intervals for building the member models. Stacked moving-window partial least squares (SMWPLS)48 has lately been developed based on this algorithm. In addition, multi-block PLS (MB-PLS)32 has also been used for the quantitative analysis of NIR spectra.
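The random subspace idea can be sketched as below: each member model is an MLR fitted on a small, randomly chosen set of wavelengths, and the member predictions are averaged. This is a simplified illustration under assumed settings (number of members, wavelengths per member, synthetic spectra), not the implementation of the cited work, and the function names are hypothetical.

```python
import numpy as np

def random_subspace_fit(X, y, n_models=30, n_vars=5, seed=0):
    """Fit one MLR member model per random subset of n_vars wavelengths."""
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(n_models):
        cols = rng.choice(X.shape[1], size=n_vars, replace=False)
        A = np.column_stack([np.ones(len(X)), X[:, cols]])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        members.append((cols, coef))           # keep the subset with its coefficients
    return members

def random_subspace_predict(members, X):
    """Simple average over the member predictions."""
    preds = [coef[0] + X[:, cols] @ coef[1:] for cols, coef in members]
    return np.mean(preds, axis=0)

# Synthetic spectra (assumption): every wavelength carries some signal of the analyte.
rng = np.random.default_rng(2)
conc = rng.uniform(0.0, 1.0, size=50)
profile = 1.0 + 0.5 * np.sin(np.linspace(0.0, 3.0 * np.pi, 100))
spectra = np.outer(conc, profile) + rng.normal(scale=0.01, size=(50, 100))
members = random_subspace_fit(spectra[:40], conc[:40])
pred = random_subspace_predict(members, spectra[40:])
rmse = np.sqrt(np.mean((pred - conc[40:]) ** 2))
print(f"prediction RMSE: {rmse:.4f}")
```

In this toy data every wavelength is informative, so no member is poor. With real spectra, subsets drawn entirely from uninformative regions would produce the weak member models mentioned above, which is the motivation for controlled randomization.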
DWT has been proved to be a powerful tool for signal decomposition and has been successfully applied to chemical signal processing.49,50 By means of DWT, the NIR spectra can be decomposed into components (or scale blocks) with different frequencies, representing different information contained in the NIR spectra. Using DWT decomposition, two ensemble methods, named weighted multi-scale regression (WMR)36 and weighted MB-PLS,37 were developed. In WMR, PLS member models are built with the decomposed components, and a combined model is built by a weighted average, with the weight of each model determined by the PRESS value obtained with MCCV. In weighted MB-PLS, the ensemble is performed on the regression coefficients. Both the interpretative and the predictive ability were found to be improved.
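The decomposition into scale blocks can be illustrated with the Haar wavelet, chosen here only because it is short to write out; the cited works may use other wavelet filters. In the sketch below, the hypothetical function `haar_components` returns the reconstructed detail component of each level plus the final approximation, all at full signal length, so that they sum back to the original spectrum. Each component could then serve as the input of one member model.

```python
import numpy as np

def haar_components(x, levels):
    """Split x into `levels` detail components and one approximation (Haar wavelet).

    All components have the same length as x and add up to x.
    """
    x = np.asarray(x, dtype=float)
    if x.size % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    comps, approx = [], x
    for j in range(1, levels + 1):
        block = 2 ** j
        # Haar approximation at level j: block means over 2**j neighbouring points.
        smoother = np.repeat(x.reshape(-1, block).mean(axis=1), block)
        comps.append(approx - smoother)        # detail lost between level j-1 and level j
        approx = smoother
    comps.append(approx)                       # coarsest approximation
    return comps

# Synthetic spectrum (assumption): a band on a sloping baseline with high-frequency noise.
rng = np.random.default_rng(4)
points = np.arange(128)
spectrum = (np.exp(-0.5 * ((points - 64) / 8.0) ** 2)
            + 0.002 * points
            + rng.normal(scale=0.01, size=128))
components = haar_components(spectrum, levels=3)
print([float(np.round(np.abs(c).max(), 3)) for c in components])
```

The first detail components carry mostly noise, while the approximation carries the band and the baseline, which is the kind of separation that allows member models to be weighted differently.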
This journal is © The Royal Society of Chemistry 2010