Yong-Huan Yun†a, Yang-Chao Wei†a, Xing-Bing Zhaob, Wei-Jia Wub, Yi-Zeng Lianga and Hong-Mei Lu*a
aCollege of Chemistry and Chemical Engineering, Central South University, Changsha 410083, PR China. E-mail: hongmeilu@csu.edu.cn; Tel: +86 731 88830831
bHunan Longshishan Dendrobium Candidum Wall.ex Lindl Base Co., Ltd, Changsha 410205, PR China
First published on 26th November 2015
Polysaccharides are among the active components of Dendrobium officinale (D. officinale), and their content is used as one of the main quality assessment criteria. The existing methods for polysaccharide quantification involve sample destruction, tedious sample processing, high cost, and non-environmentally friendly pretreatment. The aim of this study is to develop a simple, rapid, green and nondestructive analytical method based on near infrared (NIR) spectroscopy and chemometrics methods. A set of 84 D. officinale samples from different origins was analyzed using NIR spectroscopy. Potential outlying samples were initially removed from the collected NIR data in two steps using the Monte Carlo sampling (MCS) method. Spectral data preprocessing was studied in the construction of a partial least squares (PLS) model. To eliminate uninformative variables and improve the performance of the model, the pretreated full spectrum was processed using different wavelength selection methods, including competitive adaptive reweighted sampling (CARS), Monte Carlo-uninformative variable elimination (MC-UVE) and interval random frog (iRF). The selected wavelengths model met the following three points: (1) it improved the prediction performance; (2) it reduced the number of variables; (3) it provided a better understanding and interpretation. This indicates that it was necessary to conduct wavelength selection in the NIR analytical system. When comparing the three wavelength selection methods, the results show that CARS has the best performance, with the lowest root mean square error of prediction (RMSEP) on the independent test set and the smallest number of latent variables (nLVs). This study demonstrates that the NIR spectral technique with the wavelength selection algorithm CARS could be used successfully for the quantification of the polysaccharide content in D. officinale.
The content of polysaccharides is used as one of the quality assessment criteria (no less than 0.2500 g of glucose per g dry weight) in the Chinese Pharmacopoeia.8 It varies with geographical origin and the time of harvest. To date, quantification of the polysaccharides in D. officinale has mainly been performed using a colorimetric method, such as the phenol-sulphuric acid method or the anthrone-sulphuric acid method. However, these methods involve sample destruction, tedious sample processing, high cost, and non-environmentally friendly pretreatment, because they require the severe conditions of high temperature and a strong acid. Therefore, a simple, rapid, green and nondestructive analytical technique is in great demand to determine the polysaccharide content in D. officinale.
Nowadays, as a rapid, green, cost-effective and nondestructive analytical technique, near infrared (NIR) spectroscopy has been widely applied to qualitative and quantitative analysis in agriculture, pharmaceuticals, polymer production and food quality evaluation.9–18 Recently, NIR spectroscopy has been employed to study traditional Chinese herbs.19 Some studies on the quantitative analysis of total polysaccharides using NIR have been reported.20–22 NIR spectra assess chemical structures through the analysis of the molecular bonds (e.g. C–H, N–H and O–H, which are the primary structural components of organic molecules) in the NIR region, and their characteristic spectra comprise different overtones and combinations of vibrations that are attributable to the make-up of the molecules.23 As a powerful technique, NIR spectroscopy has gained wide acceptance in many fields by virtue of its advantages over other analytical techniques: it is highly efficient, economical and easy to operate, and, most notably, it can record spectra for solid and liquid samples without any sample preparation. However, NIR spectroscopy usually encounters a collinearity problem because of the strongly overlapped and broad absorption bands.24 To address this problem, partial least squares (PLS)25 has been proposed to create a calibration model with NIR data. Typically, a calibration model is established using all of the measured wavelengths. Such a full spectrum model may contain useless or irrelevant information, which may worsen the predictive ability of the developed model. Liang et al. have demonstrated the importance and necessity of wavelength selection in a NIR analytical system.26,27 Many papers have also shown that wavelength selection is essential for gaining better prediction performance.28–31 The aim and significance of wavelength selection can be summarized in three points: (1) improving the prediction performance of the calibration model, (2) providing faster and more cost-effective predictors by reducing the curse of dimensionality, (3) providing a better understanding and interpretation of the underlying process that generated the data.32,33
In this work, the first step is to establish the PLS calibration model between the NIR full spectrum data of D. officinale and its polysaccharide content. Then, the prediction results of the wavelength selection methods and the full spectrum are compared. Three recent and often-used wavelength selection methods, namely competitive adaptive reweighted sampling (CARS),34 Monte Carlo-uninformative variable elimination (MC-UVE)35 and interval random frog (iRF),36 were employed for the comparison. Finally, the best wavelength selection method is determined based on the prediction performance and model complexity to develop a calibration model for the prediction of the polysaccharide content in D. officinale.
Sample no. | Origin | Collection time |
---|---|---|
1–6 | Yunnan | Feb. 2013–Mar. 2013 |
7–12 | Zhejiang | Apr. 2012–Oct. 2012 |
13–14 | Hunan | Sep. 2012–Jul. 2013 |
15–16 | Zhejiang | Jul. 2013–Aug. 2013 |
17–20 | Henan | Jul. 2013–Aug. 2013 |
21–32 | Hunan | Dec. 2013 |
33–49 | Hunan | Feb. 2014 |
50–53 | Yunnan | Feb. 2014 |
54–61 | Yunnan | Mar. 2013 |
62–67 | Zhejiang | Apr. 2012–Jul. 2012 |
68–84 | Hunan | Apr. 2014 |
The D. officinale polysaccharide content was first measured with the phenol-sulphuric acid method provided by the Chinese Pharmacopoeia (State Pharmacopoeia Committee 2010). A glucose calibration curve was prepared first. The glucose (0.255 g) dried to constant weight at 105 °C was placed in a 250 ml volumetric flask, and water was added to obtain a 100 μg ml−1 solution. Glucose solution volumes of 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0 ml were accurately drawn and added to 10 ml test tubes with lids, and water was added to make the volume 1 ml in each case. Then 1 ml of a 5% phenol solution was added and mixed, and 5.0 ml of sulphuric acid was quickly added; the tubes were shaken, bathed in 90 °C water for 20 min, then put in an ice bath for 5 min. A BTT miniature array spectrophotometer (B&W Tek, Newark, DE, USA) equipped with glass or quartz cells of 1 cm path length was used for the measurement of absorbance spectra. A Lenovo personal computer was used to control the spectrometer and collect data via BWSpec4 Software. The absorbance was recorded at a wavelength of 488.02 nm. The calibration curve was constructed from the absorbance and the glucose concentration.
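The calibration curve described above amounts to a straight-line fit of absorbance against the amount of glucose in each tube, which is then inverted for unknown samples. The following is a minimal sketch of that idea, not the authors' actual procedure: the absorbance readings are invented for illustration, and the function names `fit_line` and `glucose_from_absorbance` are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def glucose_from_absorbance(absorbance, slope, intercept):
    """Invert the calibration line to get the glucose amount for a reading."""
    return (absorbance - intercept) / slope


# Glucose per tube (ug): 0.0-1.0 ml of the 100 ug/ml standard solution.
volumes_ml = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
glucose_ug = [v * 100.0 for v in volumes_ml]

# ASSUMPTION: hypothetical absorbance readings, invented for illustration only.
absorbance = [0.002, 0.151, 0.298, 0.452, 0.601, 0.748]

slope, intercept = fit_line(glucose_ug, absorbance)
unknown_ug = glucose_from_absorbance(0.400, slope, intercept)
print(f"slope={slope:.5f}, intercept={intercept:.5f}, unknown={unknown_ug:.1f} ug")
```

The value returned for an unknown sample is the glucose amount in the measured aliquot only; the dilution steps of the extraction still have to be applied to express it per gram of dry weight.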
Polysaccharide measurements were conducted as follows. An accurately weighed, powdered D. officinale sample (0.3 g) was loaded into a standard apparatus set and refluxed for 2 h with 200 ml of water. Subsequently, the sample was cooled to room temperature and transferred to a 250 ml volumetric flask, water was added up to the volume mark, then it was shaken and filtered. Then 2 ml of filtrate was precipitated using ethanol (10 ml) at 4 °C, followed by centrifugation for 30 min at 4000 rpm. The precipitate was washed twice with 8 ml of 80% ethanol. The precipitate obtained after filtering was dissolved in water and collected in a 25 ml volumetric flask. The following operation was based on the aforementioned calibration curve of glucose. The results were expressed as grams of glucose equivalents per gram of dry weight (g glucose per g DW) through the calibration curve with glucose. The content of each sample was determined in triplicate, and the mean of the three measurements was used for further analysis.
A standard sample cup was used to collect the spectra of the D. officinale samples. It was the standard accessory sample holder, specifically designed by Thermo Electron Co. About 0.5 g of the sample in powder form was filled into the sample cup in the standard procedure. In order to avoid errors from uneven samples, the sample cup was rotated 120° to record another spectrum after each measurement. Each sample was measured three times. The mean of three spectra which were collected from the same sample was used for the following analysis.
A set of 84 D. officinale samples from different origins in China was analyzed using NIR spectroscopy. The generated spectra of the 84 samples are shown in Fig. 1(a).
Fig. 1 (a) The raw NIR spectra of 84 D. officinale samples; (b) preprocessed spectra using SNV + SG 1st derivative of 75 D. officinale samples.
In addition to useful information, spectral signals contain systematic noise, such as baseline variation, sample background, light scattering and so on.38 In order to build a robust and reliable model, some preprocessing must be undertaken to weaken or eliminate interference in the spectra. In this study, eight different signal pre-treatment methods were evaluated and compared, including multiplicative scattering correction (MSC), standard normal variate (SNV), first and second derivatives computed using the Savitzky–Golay (S–G) method, and the combinations of MSC (or SNV) with the derivatives. MSC is an important procedure for the correction of scattered light caused by different particle sizes. It is also used to correct the additive and multiplicative effects in the spectra. SNV is a mathematical transformation of the log(1/R) spectra used to remove slope variation and to correct for scatter effects.39,40 In contrast to SNV, first and second derivatives are used to reduce peak overlap and to remove constant and linear baseline drift, respectively. Thus, they are often used to eliminate baseline drifts and enhance small spectral differences between samples.41
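To make the scatter-correction idea concrete, the sketch below applies SNV to a single spectrum: each spectrum is centred on its own mean and scaled by its own standard deviation, so a purely additive offset and a multiplicative scale factor cancel out. The `first_difference` helper is only a crude stand-in for the Savitzky–Golay first derivative used in the study (it applies no smoothing and no polynomial fit); both function names and the toy spectrum are assumptions made for illustration.

```python
from statistics import mean, stdev


def snv(spectrum):
    """Standard normal variate: centre and scale one spectrum by its own statistics."""
    m = mean(spectrum)
    s = stdev(spectrum)  # sample standard deviation (n - 1 denominator)
    return [(v - m) / s for v in spectrum]


def first_difference(spectrum):
    """ASSUMPTION: plain finite difference, a simplified stand-in for an S-G derivative."""
    return [b - a for a, b in zip(spectrum, spectrum[1:])]


# Toy spectrum and a copy distorted by a scale factor and an offset.
raw = [0.10, 0.40, 0.30, 0.80, 0.55]
distorted = [2.0 * v + 3.0 for v in raw]

# After SNV the two spectra coincide (up to floating-point rounding).
print(snv(raw))
print(snv(distorted))

# A derivative removes the constant offset but not the scale factor.
print(first_difference(raw))
print(first_difference([v + 3.0 for v in raw]))
```

This also suggests why SNV and a derivative are often combined: SNV handles scaling differences between samples, while the derivative suppresses baseline drift within a spectrum.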
Three different wavelength selection methods combined with PLS, including competitive adaptive reweighted sampling (CARS), Monte Carlo-uninformative variable elimination (MC-UVE) and interval random frog (iRF) were employed to compare and determine the effective wavelengths.
CARS34 is a variable selection algorithm based on the “survival of the fittest” principle in Darwin’s Theory of Evolution. The wavelengths with large absolute regression coefficients that are selected by CARS are defined as the key wavelengths. In each sampling run, CARS performs four successive steps: (1) using the MC sampling method to select modeling samples randomly; (2) employing an exponentially decreasing function (EDF) to forcibly remove the wavelengths with relatively small absolute regression coefficients; (3) adopting adaptive reweighted sampling (ARS) to realize a competitive selection of wavelengths; (4) employing cross-validation to evaluate the subset. Finally, the subset with the lowest root mean squared error of cross validation (RMSECV) is chosen. For CARS, the number of sampling runs was set to 100.
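The forced-elimination step (2) can be illustrated with a short sketch. The text above only states that an exponentially decreasing function is used; the specific form below, r_i = a·exp(−k·i) with a and k chosen so that all wavelengths are kept in the first run and two remain in the last, is the form commonly given in the CARS literature and is treated here as an assumption. The function name `edf_ratios` is hypothetical.

```python
import math


def edf_ratios(n_wavelengths, n_runs):
    """Fraction of wavelengths retained in each sampling run (1..n_runs).

    ASSUMPTION: r_i = a * exp(-k * i), with a and k fixed by the boundary
    conditions r_1 = 1 (keep everything) and r_N = 2 / p (keep two variables).
    """
    a = (n_wavelengths / 2) ** (1 / (n_runs - 1))
    k = math.log(n_wavelengths / 2) / (n_runs - 1)
    return [a * math.exp(-k * i) for i in range(1, n_runs + 1)]


# Settings taken from this work: 1557 spectral points, 100 sampling runs.
ratios = edf_ratios(1557, 100)
kept = [round(r * 1557) for r in ratios]

# Elimination is fast at first and progressively gentler later on.
for run in (1, 10, 25, 50, 100):
    print(run, kept[run - 1])
```

The steep early decay is what removes most wavelengths in the first runs, leaving the later runs to refine the choice among the survivors.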
MC-UVE35 is a variable selection algorithm that combines a Monte Carlo (MC) strategy with the uninformative variable elimination (UVE) method. The MC-UVE method first builds a large number of PLS sub-models with randomly selected calibration samples, and each variable is evaluated by the stability of its corresponding regression coefficient. Variables with poor stability are regarded as uninformative variables and are eliminated. The number of MC sampling runs was set to 1000 in this study.
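The stability criterion can be sketched as follows, assuming the usual definition of stability as the absolute mean of a coefficient across sub-models divided by its standard deviation. The sketch takes the sub-model coefficients as given rather than fitting PLS models, and the coefficient values, `stability` and `keep_most_stable` are hypothetical names and data used only for illustration.

```python
from statistics import mean, stdev


def stability(coefficients):
    """Stability of each variable; coefficients[m][j] is variable j in sub-model m.

    ASSUMPTION: stability_j = |mean(b_j)| / std(b_j) over all sub-models.
    """
    n_vars = len(coefficients[0])
    values = []
    for j in range(n_vars):
        column = [row[j] for row in coefficients]
        values.append(abs(mean(column)) / stdev(column))
    return values


def keep_most_stable(coefficients, n_keep):
    """Indices of the n_keep variables with the highest stability."""
    s = stability(coefficients)
    ranked = sorted(range(len(s)), key=lambda j: s[j], reverse=True)
    return sorted(ranked[:n_keep])


# Hypothetical regression coefficients from three sub-models, three variables.
coefs = [
    [0.9, 0.10, -0.5],
    [1.1, -0.20, -0.6],
    [1.0, 0.15, -0.4],
]
print(stability(coefs))
print(keep_most_stable(coefs, 2))
```

In this toy data the second variable changes sign between sub-models, so its stability is low and it is the one discarded.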
iRF36 is a wavelength interval selection method that considers the continuity of spectra. It is based on random frog,45 which employs a reversible jump Markov Chain Monte Carlo (RJMCMC)-like algorithm to search the model space through both fixed-dimensional and trans-dimensional moves between different models. The objective function is to find the subset which has the maximum regression coefficient. The whole spectrum is first divided into sub-intervals using a moving window of a fixed width, which yields all of the possible continuous spectral intervals. Each interval is regarded as a variable and is then input into the RJMCMC algorithm. A pseudo-MCMC chain is used to compute the selection probability of each interval, and all of the intervals are then ranked based on the selection probability. Afterwards, the best intervals with the lowest RMSECV are chosen. In this work, with 1557 full spectral points, the width of the interval was set to 20, resulting in 1538 intervals in total, each containing 20 variables.
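The interval count quoted above follows from sliding a window of width 20 one point at a time over 1557 points, which gives 1557 − 20 + 1 = 1538 start positions. A minimal sketch of that enumeration (the function name `moving_window_intervals` is hypothetical, and intervals are represented as half-open index ranges):

```python
def moving_window_intervals(n_points, width):
    """All contiguous intervals of a fixed width, as (start, stop) index pairs.

    The stop index is exclusive, so each interval covers exactly `width` points.
    """
    return [(start, start + width) for start in range(n_points - width + 1)]


intervals = moving_window_intervals(1557, 20)
print(len(intervals))   # 1538
print(intervals[0])     # (0, 20)
print(intervals[-1])    # (1537, 1557)
```

Because adjacent windows overlap by 19 points, neighbouring intervals are highly correlated, which is why they are ranked by selection probability rather than evaluated independently.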
In this work, selection was performed using a splitting ratio of 2:1 (50 samples formed the calibration set, and the remaining 25 samples served as the independent test set). The statistical values of the polysaccharide content in the calibration and independent test sets are listed in Table 2. After the division, the content values in the calibration and independent test sets covered a wide range, which is helpful for developing a robust model.
The calibration set was used for building a PLS model and wavelength selection, and the independent test set was used for external validation. The optimal nLVs on the calibration set was determined using 10-fold cross validation, with the maximum nLVs set to 15. The built model was then used to predict the calibration set and test set, generating a root mean squared error of fitting on the calibration set (RMSEC) value and a root mean squared error of prediction on the independent test set (RMSEP) value. Thus, RMSEC, R2cal, RMSEP and R2pre (R2 on the test set) were employed to assess the performance of the generated model. The RMSECV and R2cv were used to determine the spectral data preprocessing method.
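RMSEC, RMSECV and RMSEP are the same root-mean-squared-error formula applied to different sets of predictions (fitted, cross-validated and external test, respectively). The sketch below writes that formula out together with R2. The text does not define R2 explicitly, so the coefficient of determination 1 − SS_res/SS_tot is assumed here; the function names and example values are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between reference and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """ASSUMPTION: coefficient of determination, 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical reference and predicted contents (g glucose per g DW).
reference = [0.25, 0.31, 0.42, 0.38, 0.29]
predicted = [0.27, 0.30, 0.40, 0.41, 0.28]

print(f"RMSE = {rmse(reference, predicted):.4f}")
print(f"R2   = {r_squared(reference, predicted):.4f}")
```

Passing test-set predictions gives RMSEP and R2pre; passing cross-validated predictions for the calibration set gives RMSECV and R2cv.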
Pretreatment | nLVs | RMSECV | R2cv |
---|---|---|---|
Original | 14 | 0.0558 | 0.8211 |
Smooth + MSC | 11 | 0.0539 | 0.8330 |
Smooth + SNV | 6 | 0.0585 | 0.8036 |
SG 1st | 12 | 0.0540 | 0.8330 |
SG 2nd | 4 | 0.0651 | 0.7571 |
MSC + SG 1st | 6 | 0.0543 | 0.8308 |
MSC + SG 2nd | 6 | 0.0619 | 0.7800 |
SNV + SG 1st | 6 | 0.0543 | 0.8309 |
SNV + SG 2nd | 6 | 0.0619 | 0.7802 |
When compared to the full spectrum model, the selected wavelengths model should meet the three following points: (1) improve the prediction performance; (2) reduce the number of wavelengths; (3) provide a better understanding and interpretation. The calibration and validation results of the full spectrum and wavelength selection methods are shown in Table 4.
Parameter | Full spectrum | CARS | MC-UVE | iRF |
---|---|---|---|---|
N.Wa | 1557 | 39 | 339 | 364 |
nLVs | 10 | 8 | 10 | 9 |
RMSECV | 0.0549 | 0.0156 | 0.0260 | 0.0423 |
R2cv | 0.8397 | 0.9872 | 0.9640 | 0.9048 |
RMSEC | 0.0101 | 0.0096 | 0.0010 | 0.0025 |
R2cal | 0.9946 | 0.9952 | 0.9999 | 0.9997 |
RMSEP | 0.0542 | 0.0468 | 0.0533 | 0.0486 |
R2pre | 0.7978 | 0.8495 | 0.8044 | 0.8373 |
a N.W is the number of wavelengths.
For the full spectrum model, RMSEP and R2pre were 0.0542 and 0.7978, respectively, with 10 nLVs. All of the wavelength selection methods achieve a lower RMSEP and a higher R2pre than the full spectrum PLS model while using no more latent variables (8, 10 and 9 for CARS, MC-UVE and iRF, respectively), which satisfies the first point of improving the prediction performance. Moreover, the numbers of wavelengths selected using CARS, MC-UVE and iRF were 39, 339 and 364, respectively, far fewer than the 1557 wavelengths of the full spectrum. Thus, the model can obtain a good prediction performance when variables that are uninformative or carry irrelevant information are eliminated.
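The size of these gains can be read directly from Table 4. The short sketch below computes, for each method, the fraction of the 1557 wavelengths retained and the relative reduction in RMSEP against the full spectrum model; the numbers are copied from Table 4, while the helper name `relative_reduction` and the layout of the dictionary are illustrative choices.

```python
FULL_N_WAVELENGTHS = 1557
FULL_RMSEP = 0.0542

# (number of selected wavelengths, RMSEP) as reported in Table 4.
results = {
    "CARS": (39, 0.0468),
    "MC-UVE": (339, 0.0533),
    "iRF": (364, 0.0486),
}


def relative_reduction(baseline, value):
    """Fractional decrease of `value` relative to `baseline`."""
    return (baseline - value) / baseline


summary = {}
for name, (n_wavelengths, rmsep) in results.items():
    kept_fraction = n_wavelengths / FULL_N_WAVELENGTHS
    gain = relative_reduction(FULL_RMSEP, rmsep)
    summary[name] = (kept_fraction, gain)
    print(f"{name}: keeps {kept_fraction:.1%} of wavelengths, "
          f"RMSEP reduced by {gain:.1%}")

best = min(results, key=lambda name: results[name][1])
print("lowest RMSEP:", best)
```

Viewed this way, CARS retains the smallest share of the spectrum while giving the largest reduction in test-set error, whereas the improvement from MC-UVE is marginal.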
CARS and MC-UVE are discrete wavelength selection methods, while iRF is a wavelength interval selection method. All of them are based on the PLS regression coefficient. Here we do not aim to prove whether discrete wavelength selection or wavelength interval selection is better, since the performance of every wavelength selection method is data dependent. In this work, for the determination of the polysaccharide content in D. officinale, the comparison of the three wavelength selection methods indicated that CARS obtains the best prediction performance, with the lowest RMSEP and the highest R2pre. The smallest nLVs also indicates that CARS can establish the most parsimonious PLS model. The reason may be that there are too many irrelevant variables in the full spectral data. CARS is an effective procedure to eliminate uninformative variables and improve the predictive precision of the model. Based on the exponentially decreasing function, CARS first eliminates a large number of wavelengths rapidly and then selects among the remaining wavelengths in a more refined way. Although CARS runs fast, its results are not stable from run to run. Thus, CARS should be conducted many times to obtain the best result.
As polysaccharides belong to carbohydrates, they contain aliphatic cyclic groups with attached OH groups and ether linkages. In order to understand and interpret the wavelengths selected for polysaccharides by each of the wavelength selection methods, they are displayed in Fig. 3. The wavelengths selected by MC-UVE are very scattered, which results in MC-UVE performing only a little better than the full spectrum model. CARS and iRF have many selected regions in common. As CARS performs the best in this work, the interpretation of the selected wavelengths focuses on CARS. The wavelengths selected by CARS are mostly concentrated in the regions 4000–4200 cm−1, 4300–4450 cm−1, 4700–5250 cm−1, 5750–7300 cm−1, 7900–8950 cm−1 and 9000–10000 cm−1. The absorption at 4000–4200 cm−1 is related to C–H stretching and a C–C and C–O–C stretching combination.48 The absorption at 4300–4450 cm−1 corresponds to C–H stretching and a CH2 deformation combination, while that at 4700–5100 cm−1 corresponds to O–H bending, O–H stretching, a C–O stretching combination and an HOH bending combination.48 The absorption at 5750–7300 cm−1 is related to the first overtone of C–H stretching.48 The absorption at 7900–8950 cm−1 could be attributed to the first overtone of the O–H in polysaccharides,49 while that at 9000–10000 cm−1 corresponds to the second overtone of O–H.50
Fig. 3 The distribution of the selected variables obtained using different wavelength selection methods.
Taken together, these points indicate that wavelength selection is necessary in multivariate calibration for the NIR analytical system.
Fig. 4 shows the correlation between the values determined using the phenol-sulphuric acid method and the values predicted using the NIR full spectrum model (Fig. 4(a)) and CARS (Fig. 4(b)). The blue and red circles correspond to the calibration and independent test set, respectively. The diagonal line represents the ideal results. The closer the points are to the diagonal line, the better the model is. The samples are distributed more closely to the diagonal line in Fig. 4(b), which shows a good spectral analysis performance for CARS. The results demonstrate the feasibility of using NIR spectroscopy combined with CARS for the determination of the polysaccharide content of D. officinale.
Therefore, NIR spectroscopy could provide a fast and green alternative to classical reference methods, as it dramatically reduces analysis time and requires no chemical reagents. The established method will significantly improve the efficiency of quality control. Future work will develop similar NIR spectroscopy calibration models coupled with the CARS algorithm for predicting the quantity of additional components in D. officinale, such as alkaloids, sesquiterpenoids and aromatic compounds. More attention should also be paid to the robustness of the calibration models by collecting more samples and introducing more wavelength selection methods.
Footnote
† The first two authors contributed equally to this work.
This journal is © The Royal Society of Chemistry 2015