Multivariate calibration methods in near infrared spectroscopic analysis

Xueguang Shao *, Xihui Bian , Jingjing Liu , Min Zhang and Wensheng Cai
Research Center for Analytical Sciences, College of Chemistry, Nankai University, Tianjin, 300071, China. E-mail: xshao@nankai.edu.cn; Fax: +86-22-23502458; Tel: +86-22-23503430

Received 1st July 2010 , Accepted 30th July 2010

First published on 4th October 2010


Abstract

Near infrared (NIR) spectroscopy has been demonstrated as a powerful technique for both qualitative and quantitative analysis of complex systems in various fields. Calibration, however, is one of the important techniques needed to ensure the quality and practicability of the analyses. In this mini-review, recent developments in multivariate calibration methods for NIR spectroscopic analysis, including non-linear approaches and ensemble techniques, are briefly summarized. The advantages and disadvantages of these methods are compared and discussed critically.


Xueguang Shao

Xueguang Shao graduated from Liaocheng University in 1984 and received his PhD in chemistry from the University of Science and Technology of China (USTC) in 1992. He worked at USTC from 1992 to 2005, and is now a professor of chemistry at the College of Chemistry, Nankai University. His research interest is the development of chemometric methods for chemical studies, including analytical signal processing, modeling and molecular simulations.

Xihui Bian

Xihui Bian received her B.S. degree and M.S. degree from Tianjin Polytechnic University in 2006 and Tianjin University in 2008, respectively. She is a PhD student of analytical chemistry in the laboratory of Chemoinformatics, Nankai University. Her research interest is the development of new chemometric methods for near infrared and Raman spectral analysis.

Jingjing Liu

Jingjing Liu gained her B.S. degree (2005) from Anyang Normal University and M.S. degree (2008) from Henan Normal University. She is pursuing a PhD degree in the laboratory of Chemoinformatics, Nankai University. Her research interest is the development of new chemometric methods for classification and calibration.

Min Zhang

Min Zhang graduated in 2008 from Tianjin University with a B.S. degree in chemical engineering and technology. She is currently pursuing a M.S. degree in analytical chemistry at the laboratory of Chemoinformatics, Nankai University. Her research interest is the development of new chemometric methods for modeling near infrared spectra of complex samples.

Wensheng Cai

Wensheng Cai graduated from Anhui University in 1985 and received her PhD in chemistry from University of Science and Technology of China (USTC) in 1994. She worked at USTC from 1988 to 2005, and now is a professor of chemistry at the College of Chemistry, Nankai University. Her research interests are chemometrics and molecular simulations.


1. Introduction

Near infrared (NIR) spectroscopy, especially diffuse reflectance spectroscopy, has been a widely used technique for complex sample analysis in both scientific studies and industrial production. It provides a rapid, nondestructive analysis with little or no sample preparation. However, the relatively weak and highly overlapping spectral bands in NIR spectra pose a challenge for NIR spectroscopic analyses. On the one hand, signal processing methods for removing the background and extracting sample-specific or component-specific information must be developed, but on the other hand, it is crucial to apply a suitable calibration strategy to build a reliable quantitative model.

The main role of multivariate calibration is to establish a regression model relating the measured NIR signals to certain properties of samples. The quantitative model is then applied to predict the same properties of samples not involved in the calibration set from their measured responses. Many methods, such as multiple linear regression (MLR), principal component regression (PCR), partial least squares (PLS), artificial neural networks (ANN) and support vector machines (SVM), have been used for multivariate calibration in NIR spectral analysis.1–4 However, the reliability and accuracy of the predicted results depend to a great extent on the quality of the model. Therefore, much effort has been made to improve the quality of the models, including the optimization of calibration datasets,5 spectral signal processing,6 variable selection7 and the development of modeling strategies.8,9

Among the multivariate calibration methods, PLS has been the most commonly used modeling technique due to its practicability and versatility. PLS can be thought of as a combination of principal component analysis (PCA) and MLR; it is therefore able to handle data with strong collinearity and noise, as well as situations where the number of variables considerably exceeds the number of available samples. Furthermore, the only factor that needs to be optimized during the modeling is the number of principal components (PCs) or latent variables (LVs). However, PLS is a linear technique and a global calibration method. The non-linearity of NIR spectra, e.g., caused by optical scattering in diffuse reflectance spectra or interactions between the analytes in high concentration samples, may give PLS models poor predictive ability. Moreover, a PLS model needs an adequate number of calibration samples for accurate and stable prediction. In this mini-review, recent advances in the modeling of NIR spectral data, including non-linear approaches and ensemble techniques, are summarized, and the advantages and disadvantages of these methods are discussed.

2. Non-linear approaches

One way to tackle the issue of non-linearity is to resort to non-linear transformation techniques to simulate the intrinsic non-linear relationship between spectra and certain properties of samples. Examples include ANN, SVM and non-linear PLS. Another way is local regression, in which the calibration is still built using linear models but, unlike global calibration, which uses all the calibration samples, only those samples spectrally most similar to the sample to be predicted are selected for modeling.

ANN, a mathematical model based on biological neural networks, can be viewed as a universal model-free approximator that can represent any non-linear function with sufficient accuracy by seeking the proper combination of several sigmoid functions. For this reason, ANN is often used to solve problems with non-linear relationships. However, the practical use of ANN is hampered by the optimization of the tunable parameters, the low speed of the training process and the problem of over-fitting. To address these difficulties, the genetic algorithm (GA), a powerful optimization technique, has been used to optimize the network topology.10,11 In addition, variable selection and dimension reduction methods based on GA have been developed to simplify and speed up the ANN method.12

SVM, a machine learning method arising from statistical learning theory and based on structural risk minimization, has been put forward as a replacement for ANN because of advantages such as good generalization ability and reliability for multivariate problems, especially those with small sample sizes. SVM consists essentially of support vector classification (SVC) and support vector regression (SVR). The theory and applications of the former are mature, while the latter is relatively weak in both scope and depth. To make SVM more accessible, Liang et al.13 described the basic idea and its applications in chemistry, and Brereton et al.3 gave a summary of SVM for classification and regression. As a powerful tool for regression, SVR has only recently been used for NIR spectroscopic analysis. Chan et al.14 measured trace gas components by using SVR and the results showed high feasibility and reliability. Ji et al.15 used ν-SVR to construct a calibration model between the soluble solids content of apples and acousto-optic tunable filter (AOTF) NIR spectra. Compared with PLS and BP-ANN, ν-SVR was superior, especially with fewer samples and with noise-polluted spectra. Yan et al.16 proposed adaptive weighted least squares support vector machine regression (AWLS-SVM) to eliminate the influence of unavoidable outliers on the model. SVR for functional data analysis (SVR-FDA) has also been introduced recently and used for non-linear multivariate calibration problems.17 However, it should be pointed out that SVR has limitations and disadvantages when dealing with spectral datasets. On the one hand, there is no theoretical guide for the optimization of the parameters, which may limit its applications. On the other hand, it is hard to benefit from the method for large datasets, and the training on large datasets is time-consuming compared with PLS. Therefore, SVR can be seen as a competitive and promising method for modeling small, non-linear NIR datasets.

By incorporating non-linear transformation techniques into the PLS framework, non-linear versions of PLS, including Polynomial PLS (PPLS),18 Spline PLS (SPLS),19 Quadratic Fuzzy PLS (QFPLS),20 ANN-NLPLS21 and kernel PLS (KPLS),22,23 were developed. These methods can successfully approximate complex non-linear relationships and retain the power of PLS to combat over-fitting in the presence of excessive variability. The basic idea of PPLS, SPLS and QFPLS is to replace the inner linear function in PLS with a non-linear function based on the quadratic function, spline function and quadratic fuzzy function, respectively. Although these non-linear functions are more flexible than the linear one, they have some drawbacks, including the complexity of implementation and over-fitting of the parameters of the regression model. In ANN-NLPLS, spectra are first transformed from the input layer into the hidden outputs following the ANN architecture, and then a model is constructed between the outputs of the hidden layer and the concentration of the components by PLS. Results revealed that ANN-NLPLS offers a substantial improvement in the ability to model the non-linearity while circumventing the over-fitting frequently involved in ANN. KPLS takes advantage of the kernel function of SVM; its basic idea is to map the original data into a high-dimensional feature space by a kernel function and to build the model with the kernel-transformed data. The advantage of this method is that it essentially requires only linear algebra, making it as simple as standard PLS, and that it can handle a wide range of non-linearity through different kinds of kernel functions and their adjustable parameters.
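The kernel step can be illustrated with a short sketch. It is not taken from the cited KPLS papers: the Gaussian (RBF) kernel, the function names `rbf_kernel` and `gram_matrix`, and the parameter `gamma` are assumptions chosen for illustration. The point is only that a kernel method works on the matrix of pairwise kernel values between spectra instead of on the spectra themselves.

```python
import math

def rbf_kernel(x, y, gamma):
    """Gaussian (RBF) kernel between two spectra of equal length."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

def gram_matrix(spectra, gamma=1.0):
    """Matrix of pairwise kernel values; gamma is a tunable width (assumed)."""
    return [[rbf_kernel(a, b, gamma) for b in spectra] for a in spectra]

# Toy spectra with three "wavelengths" each (made-up numbers).
toy_spectra = [[0.1, 0.2, 0.3], [0.1, 0.25, 0.3], [0.9, 0.8, 0.7]]
K = gram_matrix(toy_spectra, gamma=2.0)
```

A kernel-based regression would then be fitted on `K` in place of the original spectral matrix; changing the kernel or `gamma` changes the kind of non-linearity that can be captured.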

The non-linear relationship can be regarded as a combination of a linear part and a non-linear part. Based on this consideration, a mixed method combining GA-PLS and PC-ANN was proposed for building a non-linear NIR model for the analysis of protein in milk powder.24 In this method, GA was first used to select the calibration region, then a PLS model was used to predict the linear part, and the principal components (PCs) were taken as the input of an ANN model to predict the non-linear part.

Another approach to avoid non-linearity is local regression, which is based on the idea that quantitative predictions can be obtained from a large library of samples by identifying library samples spectrally similar to the unknown sample and deriving a prediction from these similar samples only. Five classical local regression methods, i.e., Comparison Analysis using Restructured Near infrared and Constituent Data (CARNAC), Locally Weighted Regression (LWR), LOCAL, Locally Biased Regression (LBR) and Local Lazy Regression (LLR), have been developed and reviewed.25 Perez-Marin et al.26 compared three non-linear approaches, LS-SVM, CARNAC and LBR, and the best result was obtained by using CARNAC. In one of our previous works,27 wavelet transform (WT) was introduced into local regression. Discrete wavelet transform (DWT) was first used to compress the NIR spectra, and then the calibration sub-sets were individually selected for each prediction sample according to the Euclidean distance in the wavelet domain. Most recently, a new local approach named Local Central Algorithm (LCA) was proposed,28 in which the Mahalanobis distance in PC space is used to select the calibration set and the final prediction is calculated using a central tendency statistic, the mean of the local neighbours. Local regression can avoid the non-linearity in large datasets that is caused by large concentration differences. However, further studies are still needed on the selection of the sub-set and on the way of performing the subsequent regression.
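The neighbour-selection idea common to these local methods can be sketched as follows. This is a minimal illustration and not an implementation of any of the cited algorithms: it uses the plain Euclidean distance on raw spectra (not the Mahalanobis distance in PC space or a wavelet-domain distance) and predicts with the mean of the nearest library values. The function name `local_mean_predict` and the parameter `k` are assumptions introduced here.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two spectra of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def local_mean_predict(library_spectra, library_values, query, k=3):
    """Predict a property of `query` as the mean over its k nearest library samples."""
    if not 1 <= k <= len(library_spectra):
        raise ValueError("k must be between 1 and the library size")
    # Rank library samples by spectral distance to the query spectrum.
    ranked = sorted(
        range(len(library_spectra)),
        key=lambda i: euclidean(library_spectra[i], query),
    )
    neighbours = ranked[:k]
    return sum(library_values[i] for i in neighbours) / k

# Toy library: three similar samples and one very different sample (made-up numbers).
library_spectra = [[0.10, 0.20], [0.12, 0.21], [0.90, 0.80], [0.11, 0.19]]
library_values = [1.0, 1.2, 9.0, 1.1]
prediction = local_mean_predict(library_spectra, library_values, [0.11, 0.20], k=3)
```

In this toy case the dissimilar sample is excluded from the local sub-set, so it has no influence on the prediction. Replacing the final mean with a regression fitted on the selected neighbours would give a local regression in the stricter sense.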

3. Ensemble modeling techniques

Although linear and non-linear approaches have been developed for quantitative NIR analysis, satisfactory predictions sometimes cannot be obtained because of the large variation in the samples or in the properties to be predicted, especially when the number of samples in the calibration set is relatively small.9 In these approaches, a single model built with the full calibration set is used to estimate the relationship between NIR spectra and properties. When such methods are used to model the NIR spectra of complex samples, a single calibration model may overemphasize some aspects, underestimate others, and ignore some important features contained in the richly complex spectra. Therefore, ensemble modeling techniques have gained increasing attention in multivariate calibration of NIR spectra. Ensemble modeling is a statistical technique that combines the results of multiple individual models to produce a single prediction. The underlying assumption in ensemble modeling is that multiple models will effectively identify and encode more aspects of the relationship between independent and dependent variables than a single model. In most cases, individual models are constructed using different training subsets randomly selected from the full calibration set. Many terms, such as multi-model, committee, consensus, model fusion, combined model and model aggregation, have the same meaning as ensemble model.29

There are three basic questions in ensemble modeling, i.e., the choice of the basic algorithm, and the generation and the integration of the individual models (or member models). Theoretically, any modeling technique, such as MLR, PLS, SVR or ANN, can be used as the basic algorithm, but PLS is the most commonly used one. The methods for generating the member models include re-sampling,9,30,31 blocking,32 clustering33,34 and signal decomposition such as wavelet transform.35–37 Integration of the member models is key to ensemble modeling, and is generally implemented by combining, or statistically selecting from, the results obtained by all the member models. The former is realized by simple average38 or weighted average,39 and the latter by weighted median.30 Different criteria, such as prediction error,40 prediction residual error sum of squares (PRESS),36 hat matrix41 and non-negative least squares,39 can be used to determine the weights for the member models.
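The three integration rules named above can be written down in a few lines. The sketch below is illustrative only. In particular, taking the weights proportional to 1/PRESS is an assumed weighting scheme, not necessarily the one used in the cited works, and all function names are hypothetical.

```python
def simple_average(predictions):
    """Unweighted mean of the member-model predictions."""
    return sum(predictions) / len(predictions)

def inverse_press_weights(press_values):
    """Weights proportional to 1/PRESS, normalised to sum to one (assumed scheme)."""
    inverse = [1.0 / p for p in press_values]
    total = sum(inverse)
    return [v / total for v in inverse]

def weighted_average(predictions, weights):
    """Weighted mean of the member-model predictions."""
    return sum(p * w for p, w in zip(predictions, weights)) / sum(weights)

def weighted_median(predictions, weights):
    """Smallest prediction at which the cumulative weight reaches half the total."""
    pairs = sorted(zip(predictions, weights))
    half = 0.5 * sum(weights)
    cumulative = 0.0
    for prediction, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            return prediction
    return pairs[-1][0]

# Three member models predicting one sample (made-up numbers).
member_predictions = [10.0, 12.0, 20.0]
member_press = [1.0, 2.0, 4.0]
weights = inverse_press_weights(member_press)
avg = simple_average(member_predictions)
wavg = weighted_average(member_predictions, weights)
wmed = weighted_median(member_predictions, weights)
```

In this toy case the simple average is pulled towards the poorest member (PRESS = 4), while the weighted average and the weighted median stay close to the members with low PRESS.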

Both theoretical and empirical studies have shown that an ensemble is effective only if the member models encode different aspects of the whole calibration space and each has a certain accuracy, which implies a trade-off between so-called diversity and accuracy. Different ways of achieving this trade-off have given rise to different ensemble methods. Two popular approaches for creating diversity are bagging and boosting.42 In bagging, each member model is independent, i.e., samples in a calibration set are picked out randomly (with replacement) in the generation of member models, while in boosting, each member model is dependent on the previous ones, i.e., samples in a calibration set are picked out (with replacement) according to the probability obtained from previous member models under the rule of "poor prediction result with high probability". Simple average and weighted median are generally used for the ensemble prediction in the two strategies, respectively. Methods such as boosting PLS,40 boosting SVR43 and consensus LS-SVR9 have been proposed and applied in the quantitative analysis of NIR spectra, with better accuracy and stability than methods based on a single model. However, it is worth noting that the sample selection method in boosting is sensitive to outliers, since the samples with high probability are always selected in the construction of member models. Although robust versions of boosting PLS were developed to overcome the effect of outliers,44,45 efforts to improve the technique are still needed.
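The difference between the two sampling schemes can be shown with a small sketch. It is a simplification made for illustration: the boosting rule here sets the sampling probability proportional to the absolute prediction error of the previous member model, whereas the cited boosting methods use their own loss functions and update rules. All function names are hypothetical.

```python
import random

def bagging_indices(n_samples, rng):
    """Bagging: draw n_samples indices uniformly, with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def boosting_probabilities(abs_errors):
    """Sampling probabilities proportional to the previous absolute errors (assumed rule)."""
    total = sum(abs_errors)
    if total == 0:
        # No errors yet: fall back to uniform sampling.
        return [1.0 / len(abs_errors)] * len(abs_errors)
    return [e / total for e in abs_errors]

def boosting_indices(abs_errors, rng):
    """Boosting-style draw: poorly predicted samples are picked more often."""
    n_samples = len(abs_errors)
    probabilities = boosting_probabilities(abs_errors)
    return rng.choices(range(n_samples), weights=probabilities, k=n_samples)

rng = random.Random(0)
bag = bagging_indices(5, rng)
# Sample 4 is the only one that the previous model predicted badly.
previous_errors = [0.0, 0.0, 0.0, 0.0, 2.5]
boost = boosting_indices(previous_errors, rng)
```

The extreme example also shows the outlier sensitivity mentioned above: when one sample dominates the error, the next calibration subset consists almost entirely of that sample.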

Recently, an alternative ensemble scheme based on a self-organizing map (SOM), named SOMEPLS, was proposed.33 In this method, all the calibration samples are clustered into the neurons by SOM, and then samples from different neurons are chosen to form a calibration subset. The results indicated that SOMEPLS can achieve better prediction accuracy than reference algorithms without increasing the complexity of the corresponding calibration model. However, it is not easy to determine the number of neurons in practical use, and the performance of the algorithm depends on the size of the calibration set.

NIR spectra are generally composed of hundreds or thousands of wavelengths. Another approach is therefore to generate the member models from different wavelengths of the spectra. Li et al.46 developed a random subspace (RS) based ensemble regression using MLR, which clearly improved the accuracy and robustness. Nevertheless, some member models were found to be very poor in predictive accuracy, because the wavelengths were selected in a completely random way. To reduce this blindness, controlled randomization was proposed for the wavelength selection.34 In this approach, variable clustering (VC) is also introduced to assist subspace generation. Using a similar idea, a weighted PLS method based on a variable grouping strategy was developed in our work.47 Instead of using randomly selected wavelengths, ensemble models based on wavelength intervals may maintain the multi-channel advantage of multivariate calibration. Jiang et al.39 reported a method named MCCV stacked regression, in which iPLS is used to split the wavelengths into different intervals for building the member models. Stacked moving-window partial least squares (SMWPLS)48 has recently been developed based on this algorithm. In addition, multi-block PLS (MB-PLS)32 has also been used for the quantitative analysis of NIR spectra.
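The two ways of splitting the wavelength axis discussed here, random subsets and contiguous intervals, can be sketched as index generators. This is a minimal illustration: the function names, the subset size and the equal-width interval rule are assumptions, and no cited method is reproduced exactly.

```python
import random

def random_subspaces(n_wavelengths, subset_size, n_models, seed=0):
    """One sorted random subset of wavelength indices per member model."""
    rng = random.Random(seed)
    return [
        sorted(rng.sample(range(n_wavelengths), subset_size))
        for _ in range(n_models)
    ]

def interval_subspaces(n_wavelengths, n_intervals):
    """Contiguous, nearly equal-width wavelength intervals covering the whole spectrum."""
    bounds = [i * n_wavelengths // n_intervals for i in range(n_intervals + 1)]
    return [list(range(bounds[i], bounds[i + 1])) for i in range(n_intervals)]

random_sets = random_subspaces(n_wavelengths=10, subset_size=4, n_models=3)
interval_sets = interval_subspaces(n_wavelengths=10, n_intervals=3)
```

Each index list would select the columns of the spectral matrix used by one member model. The interval version keeps neighbouring wavelengths together, which is the property the interval-based methods rely on.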

DWT has been proved to be a powerful tool for signal decomposition and has been successfully applied to chemical signal processing.49,50 By means of DWT, the NIR spectra can be decomposed into components (or scale blocks) of different frequency, representing different information contained in the NIR spectra. Using DWT decomposition, two ensemble methods, named weighted multi-scale regression (WMR)36 and weighted MB-PLS,37 were developed. In WMR, PLS member models are built with the decomposed components, and a combined model is built by a weighted average. The weight of each model is determined by the PRESS value obtained with Monte Carlo cross validation (MCCV). In weighted MB-PLS, the ensemble is performed on the regression coefficients. Both the interpretative and the predictive ability were found to be improved.
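The decomposition into scale blocks can be illustrated with the simplest wavelet. The sketch below assumes a single-level Haar transform and an even number of data points; the cited methods may use other wavelets and more decomposition levels, and the function names are hypothetical. Each block is reconstructed to the full spectrum length, so that the blocks sum to the original spectrum.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_step(signal):
    """One level of the Haar DWT: approximation and detail coefficients."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    approx = [(signal[i] + signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    return approx, detail

def haar_inverse(approx, detail):
    """Inverse of haar_step: rebuild a signal from its coefficients."""
    signal = []
    for a, d in zip(approx, detail):
        signal.extend([(a + d) / SQRT2, (a - d) / SQRT2])
    return signal

def scale_blocks(signal):
    """Low- and high-frequency components, each as long as the input."""
    approx, detail = haar_step(signal)
    zeros = [0.0] * len(approx)
    low = haar_inverse(approx, zeros)    # keep only the approximation
    high = haar_inverse(zeros, detail)   # keep only the detail
    return low, high

spectrum = [1.0, 3.0, 2.0, 6.0]  # made-up absorbance values
low, high = scale_blocks(spectrum)
```

A member model would then be calibrated on each block separately (here `low` and `high`), and the member predictions combined with weights such as those derived from PRESS.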

4. Conclusion

NIR spectroscopy is a powerful technique for the analysis of real samples in various fields, and calibration is one of the crucial steps for its practicability. In this review, recent developments in non-linear approaches and ensemble techniques for NIR spectroscopic analysis are briefly summarized. The former provide approaches for improving the prediction accuracy when non-linearity occurs in the dataset, and the latter provide techniques for making the modeling more accurate and reliable in practical use. However, because NIR spectra are composed of weak, broad and overlapping spectral bands with a highly variable background, new techniques are still needed to further improve the robustness and practicability of the models. In addition, new methods for calibration set design, outlier removal from the calibration set, background correction or removal, and model transfer are also required.

Acknowledgements

This study is supported by National Natural Science Foundation of China (No. 20835002).

References

  1. A. A. Kardamakis and N. Pasadakis, Fuel, 2010, 89, 158–161.
  2. T. M. Venas and A. Rinnan, Chemom. Intell. Lab. Syst., 2008, 92, 125–130.
  3. R. G. Brereton and G. R. Lloyd, Analyst, 2010, 135, 230–267.
  4. M. Blanco and A. Peguero, Talanta, 2008, 77, 647–651.
  5. D. Pena and F. J. Prieto, J. Comput. Graphical Stat., 2007, 16, 228–254.
  6. W. D. Ni, S. D. Brown and R. L. Man, Chemom. Intell. Lab. Syst., 2009, 98, 97–107.
  7. H. Xu, Z. C. Liu, W. S. Cai and X. G. Shao, Chemom. Intell. Lab. Syst., 2009, 97, 189–193.
  8. J. Z. Li, S. Y. Li, B. L. Lei, H. X. Liu, X. J. Yao, M. C. Liu and P. Gramatica, J. Comput. Chem., 2010, 31, 973–985.
  9. Y. K. Li, X. G. Shao and W. S. Cai, Talanta, 2007, 72, 217–222.
  10. N. Qu, H. Mi, B. Wang and Y. L. Ren, J. Taiwan Inst. Chem. Eng., 2009, 40, 162–167.
  11. J. J. Yu, Y. He and Y. D. Bao, Spectrosc. Spect. Anal., 2008, 28, 2839–2842.
  12. Q. Fei, M. Li, B. Wang, Y. F. Huan, G. D. Feng and Y. L. Ren, Chemom. Intell. Lab. Syst., 2009, 2, 127–131.
  13. H. D. Li, Y. Z. Liang and Q. S. Xu, Chemom. Intell. Lab. Syst., 2009, 95, 188–198.
  14. N. Ni and C. C. Chan, Meas. Sci. Technol., 2009, 115601 (7 pp).
  15. D. Z. Zhu, B. P. Ji, C. Y. Meng, B. L. Shi, Z. H. Tu and Z. S. Qing, Anal. Chim. Acta, 2007, 598, 227–234.
  16. W. T. Cui and X. F. Yan, Chemom. Intell. Lab. Syst., 2009, 98, 130–135.
  17. N. Hernandez, I. Talavera, R. J. Biscay, D. Porro and M. M. C. Ferreira, Anal. Chim. Acta, 2009, 642, 110–116.
  18. J. Kohonen, S. P. Reinikainen, K. Aaljoki and A. Hoskuldsson, Chemom. Intell. Lab. Syst., 2009, 97, 159–163.
  19. R. M. Balabin, R. Z. Safieva and E. I. Lomakina, Chemom. Intell. Lab. Syst., 2007, 88, 183–188.
  20. A. I. Abdel-Rahman and G. J. Lim, J. Chemom., 2009, 23, 530–537.
  21. Y. P. Zhou, J. H. Jiang, W. Q. Lin, L. Xu, H. L. Wu, G. L. Shen and R. Q. Yu, Talanta, 2007, 71, 848–853.
  22. B. M. Nicolai, K. I. Theron and J. Lammertyn, Chemom. Intell. Lab. Syst., 2007, 85, 243–252.
  23. N. Labbe, S. H. Lee, H. W. Cho, M. K. Jeong and N. Andre, Bioresour. Technol., 2008, 99, 8445–8452.
  24. Q. Sun, J. H. Wang and D. H. Han, Spectrosc. Spect. Anal., 2009, 29, 1818–1821.
  25. D. Perez-Marin, A. Garrido-Varo and J. E. Guerrero, Talanta, 2007, 72, 28–42.
  26. D. Perez-Marin, A. Garrido-Varo, J. E. Guerrero, T. Fearn and A. M. C. Davies, Appl. Spectrosc., 2008, 62, 536–541.
  27. X. Shi, W. S. Cai and X. G. Shao, Chin. J. Anal. Chem., 2008, 36, 1093–1096.
  28. E. Zamora-Rojas, A. Garrido-Varo, F. Van den Berg, J. E. Guerrero-Ginel and D. C. Pérez-Marín, Chemom. Intell. Lab. Syst., 2010, 101, 87–94.
  29. A. Durand, O. Devos, C. Ruckebusch and J. P. Huvenne, Anal. Chim. Acta, 2007, 595, 72–79.
  30. D. S. Cao, Q. S. Xu, Y. Z. Liang, L. X. Zhang and H. D. Li, Chemom. Intell. Lab. Syst., 2010, 100, 1–11.
  31. Z. Cheng, A. S. Zhu and D. Z. Chen, Chin. J. Anal. Chem., 2007, 35, 978–982.
  32. M. Jing, X. G. Shao and W. S. Cai, Anal. Lett., 2010, 43, 1910–1921.
  33. C. Tan, X. Qin and M. L. Li, Anal. Bioanal. Chem., 2008, 392, 515–521.
  34. C. Tan, X. Qin and M. L. Li, Anal. Lett., 2009, 42, 1693–1710.
  35. H. W. Lee, M. W. Lee and J. M. Park, Chemom. Intell. Lab. Syst., 2009, 98, 201–212.
  36. Z. C. Liu, W. S. Cai and X. G. Shao, Analyst, 2009, 134, 261–266.
  37. M. Jing, W. S. Cai and X. G. Shao, Chemom. Intell. Lab. Syst., 2010, 100, 22–27.
  38. F. W. Pi, H. Shinzawa, J. H. Wang, D. H. Han and Y. Ozaki, J. Near Infrared Spectrosc., 2009, 17, 33–40.
  39. L. Xu, J. H. Jiang, Y. P. Zhou, H. L. Wu, G. L. Shen and R. Q. Yu, Chemom. Intell. Lab. Syst., 2007, 87, 226–230.
  40. M. H. Zhang, Q. S. Xu and D. L. Massart, Anal. Chem., 2005, 77, 1423–1431.
  41. B. L. Lei, L. L. Xi, J. Z. Li, H. X. Liu and X. J. Yao, Anal. Chim. Acta, 2009, 644, 17–24.
  42. H. Shinzawa, J. H. Jiang, P. Ritthiruangdej and Y. Ozaki, J. Chemom., 2006, 20, 436–444.
  43. Y. P. Zhou, L. Xu, L. J. Tang, J. H. Jiang, G. L. Shen, R. Q. Yu and Y. Ozaki, Anal. Sci., 2007, 23, 793–798.
  44. Y. P. Zhou, C. B. Cai, S. Huan, J. H. Jiang, H. L. Wu, G. L. Shen and R. Q. Yu, Anal. Chim. Acta, 2007, 593, 68–74.
  45. X. G. Shao, X. H. Bian and W. S. Cai, Anal. Chim. Acta, 2010, 666, 32–37.
  46. C. Tan, M. L. Li and X. Qin, Anal. Sci., 2008, 24, 647–653.
  47. H. Xu, W. S. Cai and X. G. Shao, Anal. Methods, 2010, 2, 289–294.
  48. W. D. Ni, S. D. Brown and R. L. Man, J. Chemom., 2009, 23, 505–517.
  49. X. G. Shao, A. K. M. Leung and F. T. Chau, Acc. Chem. Res., 2003, 36, 276–283.
  50. M. Kompany-Zareh and F. van den Berg, Anal. Chim. Acta, 2010, 668, 137–142.

This journal is © The Royal Society of Chemistry 2010