Li Yan-kun*
College of Environment Science and Engineering, North China Electric Power University, Baoding, Hebei 071003. E-mail: lyk800@tom.com
First published on 6th December 2011
Consensus modeling based on an improved Boosting algorithm (Boosting-PLS, BPLS), combined with wavelength (variable) selection by the MC-UVE (Monte Carlo-Uninformative Variable Elimination) method, is applied to the determination of the cetane number (CN) of diesel. MC-UVE is first used to select characteristic variables from the near-infrared (NIR) spectra of diesel, based on the principles of MC simulation and UVE; the selected variables, instead of the full spectra, are then used for BPLS modeling. The prediction results show that the proposed MC-UVE-BPLS algorithm improves on conventional linear PLS modeling in terms of accuracy and robustness, and that it is more efficient and parsimonious, using only a small number of useful variables to model the relationship between CN and the diesel NIR spectra. A comparison with MC-UVE-PLS, BPLS and CPLS (consensus modeling based on Bagging) shows that MC-UVE-BPLS is superior to the other models, which also verifies the effectiveness of MC-UVE and of the improved BPLS. The proposed MC-UVE-BPLS method therefore provides a new approach for determining diesel CN from NIR spectra.
Diesel cetane number (CN) is a key indicator for characterizing diesel combustion. The ASTM D613 test method is the traditional international method for determining CN. The GB/T 386 test method developed in China is also based on ASTM D613, a method that requires large amounts of oil sample and is time-consuming and expensive. The determination of CN in diesel fuel from NIR spectra has been reported.4–7 It is generally agreed that, because the relationship between diesel CN and the spectral signals is non-linear, non-linear methods such as support vector machines and artificial neural networks are superior to linear methods such as multiple linear regression and partial least squares (PLS). It is also found that variable selection methods4,7 are seldom used for pre-processing diesel spectral signals.
In this article, the Monte Carlo-Uninformative Variable Elimination (MC-UVE) method was first used to select characteristic wavelengths (variables) from the original NIR signals of diesel. Sample-specific or component-specific information was retained and uninformative variables were discarded. As a result, the calibration model was improved and considerably simplified, with fewer variables. MC-UVE was then combined with an improved consensus PLS method based on Boosting (called BPLS) to build the MC-UVE-BPLS model, which improved the accuracy and robustness of linear PLS modeling in predicting diesel CN despite the non-linear relationship between CN and the spectral signal. MC-UVE-BPLS gave better predictions than full-spectrum BPLS, the consensus PLS method based on Bagging (called CPLS) and conventional PLS. It also gave better predictions than MC-UVE-PLS, which uses no consensus strategy.
In this work, the MC-UVE-PLS method is used, in which MC cross-validation is introduced into UVE-PLS. The method has been applied to NIR data pre-processing, where it gave more accurate and robust predictions than UVE-PLS.11
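The variable-ranking step of MC-UVE can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's code: it assumes the usual MC-UVE definition of stability (mean of a variable's regression coefficient over the Monte Carlo runs divided by its standard deviation), which the text does not spell out, and it takes the coefficient vectors as given instead of fitting PLS models on random calibration subsets. The names `stability`, `select_variables` and `coef_runs` are hypothetical.

```python
import statistics

def stability(coef_runs):
    """Stability of each variable over Monte Carlo runs.

    coef_runs: one regression-coefficient vector per MC run (assumed input;
    in practice each vector would come from a PLS model fitted on a random
    subset of the calibration set).
    Assumes every coefficient varies across runs (non-zero spread).
    """
    n_vars = len(coef_runs[0])
    result = []
    for j in range(n_vars):
        column = [run[j] for run in coef_runs]
        result.append(statistics.mean(column) / statistics.stdev(column))
    return result

def select_variables(stab, n_keep):
    """Indices of the n_keep variables with the largest |stability|, sorted."""
    ranked = sorted(range(len(stab)), key=lambda j: abs(stab[j]), reverse=True)
    return sorted(ranked[:n_keep])

# Toy coefficients for 3 variables over 3 runs (illustrative values only)
coef_runs = [
    [0.90, 0.10, -0.50],
    [1.10, -0.20, -0.45],
    [1.00, 0.05, -0.55],
]
stab = stability(coef_runs)
kept = select_variables(stab, 2)  # variables 0 and 2 have stable coefficients
```

Variables whose coefficients change sign or size from run to run get a stability near zero and are the first to be eliminated.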
A consensus algorithm based on the Boosting sampling technique is adopted in this work,14 which is a modification of the AdaBoost algorithm (one instance of Boosting).15 The relative prediction error between the predicted value and the true (experimental) value of each sample in the training set is used directly to update the resampling weights of the samples. Relative errors are used instead of absolute errors because the absolute prediction errors of different samples, and especially of samples with different properties, can differ greatly, so they often do not reflect the predictive ability of the model well. Furthermore, compared with adding “robust” procedures to BPLS as in other research,16 the BPLS procedure proposed in this study is simple and intuitive, and retains the advantage of AdaBoost that only the number of iterations needs to be adjusted, with no other parameters to tune.
BPLS, which combines the modified Boosting with PLS regression, first constructs multiple models that are trained individually; their predictions are then combined by weighted averaging. A brief flowchart of the algorithm is given in Fig. 1, and the detailed procedure is as follows:
Fig. 1 Flowchart of the BPLS algorithm.
(1) All samples in the calibration set start with equal sampling weights 1/n (n is the number of samples in the calibration set). An individual PLS model Ct is established by drawing 63% (ref. 17) of the samples from the whole calibration set according to the sampling weights.
(2) Predict the samples of the whole calibration set with model Ct, and then calculate the relative prediction error (δi,t) between the predicted value and the true value of every sample xi in the calibration set.
(1)
(2)
(3)
Finally, update the resampling weights of the samples in the calibration set according to formula (4), and then normalize the new weights.
(4)
(3) In every iteration, step (2) is repeated and the prediction set is predicted by model Ct. The iteration stops when the prediction error for the assessing set starts to stabilize. The ensemble prediction is obtained as the weighted mean of the member models Ct in the boosting series, using the following formula.
(5)
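The loop in steps (1) to (3) can be sketched in code. This is only a sketch under several assumptions, not the author's implementation: the loss, the factor `beta`, the weight update and the model weights follow the standard AdaBoost.R2 pattern, with relative errors in place of absolute ones, and may differ from formulas (1) to (5); sampling is done with replacement; and a one-variable least-squares line stands in for the PLS base model so that the example stays self-contained. All function and variable names are hypothetical.

```python
import math
import random

def fit_line(xs, ys):
    """Stand-in base learner (NOT PLS): least-squares line on one variable."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return lambda x: my + slope * (x - mx)

def boost(xs, ys, n_iter, fit=fit_line, frac=0.63, seed=0):
    """Return (models, model_weights) from a Boosting run on the calibration set."""
    rng = random.Random(seed)
    n = len(xs)
    k = max(2, round(frac * n))
    weights = [1.0 / n] * n                      # step (1): equal sampling weights
    models, model_weights = [], []
    for _ in range(n_iter):
        # Draw 63% of the samples according to the weights (with replacement: assumption)
        idx = rng.choices(range(n), weights=weights, k=k)
        model = fit([xs[i] for i in idx], [ys[i] for i in idx])
        # Step (2): relative prediction error of every calibration sample
        delta = [abs(model(x) - y) / abs(y) for x, y in zip(xs, ys)]
        d_max = max(delta)
        if d_max == 0.0:                         # perfect fit: nothing left to reweight
            models.append(model)
            model_weights.append(1.0)
            break
        delta = [d / d_max for d in delta]       # scale into [0, 1] (assumption)
        loss = sum(w * d for w, d in zip(weights, delta))
        if loss >= 0.5:                          # AdaBoost.R2 stopping rule (assumption)
            if not models:
                models.append(model)
                model_weights.append(1.0)
            break
        beta = loss / (1.0 - loss)
        models.append(model)
        model_weights.append(math.log(1.0 / beta))
        # Well-predicted samples are down-weighted, then the weights are normalized
        weights = [w * beta ** (1.0 - d) for w, d in zip(weights, delta)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return models, model_weights

def predict(models, model_weights, x):
    """Ensemble prediction as the weighted mean of the member models."""
    total = sum(model_weights)
    return sum(a * m(x) for a, m in zip(model_weights, models)) / total

# Synthetic calibration data (illustrative only)
xs = [i / 10 for i in range(30)]
ys = [45.0 + 5.0 * x + 0.5 * math.sin(7 * i) for i, x in enumerate(xs)]
models, model_weights = boost(xs, ys, n_iter=20)
cn_hat = predict(models, model_weights, 1.5)
```

The sketch combines members by a weighted mean, as in step (3); the original AdaBoost.R2 uses a weighted median instead. In the procedure described above the number of iterations is not fixed in advance but chosen from the error on the assessing set.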
Fig. 2 An example of the measured NIR spectra of diesel.
The 245 samples (spectra) were arbitrarily divided into three sets: 155 samples were used as the calibration set, 45 as the assessing set, and the remaining 45 as the prediction set.
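A split of this size can be reproduced as below. The sketch assumes a seeded random shuffle of the sample indices, which is one way to obtain an arbitrary division; the text does not say how the division was actually made, and `split_samples` is a hypothetical helper.

```python
import random

def split_samples(n_total=245, n_cal=155, n_assess=45, seed=0):
    """Shuffle sample indices and cut them into calibration, assessing and prediction sets."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)   # seeded so the split is repeatable
    cal = indices[:n_cal]
    assess = indices[n_cal:n_cal + n_assess]
    pred = indices[n_cal + n_assess:]      # whatever remains (45 here)
    return cal, assess, pred

cal, assess, pred = split_samples()
```

Keeping the assessing set separate from the prediction set means that parameters can be tuned on the former without touching the samples used for the final comparison.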
In the comparison of MC-UVE-PLS, BPLS, CPLS and PLS methods, the same calibration set and prediction set were adopted. In the optimization of the parameters, the root mean squared error of prediction (RMSEP) was used as an evaluation criterion. RMSEP is defined as
(6)
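In the usual notation, RMSEP can be written as follows. The symbols are assumptions chosen for this sketch and not necessarily the author's notation: m is the number of predicted samples, and \hat{y}_i and y_i are the predicted and reference CN values of sample i.

```latex
\mathrm{RMSEP} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}_i - y_i\right)^{2}}
```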
Matlab 6.5 is used as the model calculation software.
Fig. 3 Variation of RMSEP versus the number of principal components.
Fig. 3 shows that the model accuracy changes with the number of PLS principal factors (nf). At first, the RMSEP values are large and unstable. RMSEP decreases gradually as the number of principal factors increases, but shows an increasing trend for nf > 11. Therefore, nf = 11 is used for further calculations in all five methods: MC-UVE-BPLS, MC-UVE-PLS, BPLS, CPLS and PLS.
Fig. 4 Stability distribution of the variables obtained by the MC-UVE method.
Fig. 4 shows that the stability values are relatively dispersed. This may be caused by the many components of diesel, including aromatics, naphthenes, alkanes and alkenes; that is, the NIR absorptions of diesel arise from a variety of groups such as methyl, methylene, vinyl and aryl.
In Fig. 4, the selected variables are concentrated in the 818–988 nm range and at several wavelengths around 1134, 1174 and 1206 nm. In addition, some narrow wavelength intervals are retained in the range 1342–1534 nm. These bands are tentatively assigned, respectively, to frequency absorptions of C–H groups in the methyl, methylene, vinyl and aryl functional groups; double-frequency absorptions of some of the C–H groups in methyl and methylene; and double-frequency absorptions of C–H groups present in all kinds of groups.18
A suitable number of retained variables (Nj) is an important parameter affecting the stability and accuracy of the model. If too few variables are retained, the robustness and accuracy of the model will be poor because some informative variables are left out. Conversely, if too many are retained, uninformative variables are included, which degrades the performance of the model. Nj is investigated from 30 to 300 in steps of 20. For each Nj, a PLS model is developed and used to predict the assessing set. The mean RMSEP and the standard error of RMSEP (σ) for the assessing set over 50 repeated runs are shown in Fig. 5.
Fig. 5 Variation of RMSEP with the number of selected wavelengths. The standard deviation over 50 runs is plotted as an error bar about the mean value.
Fig. 5 shows that the mean value and the standard error are both large at first, and then both decrease gradually as Nj increases. At Nj = 130 the mean value is lowest and the standard error is also small. For Nj larger than 130, the mean value increases slightly with Nj. Accordingly, Nj = 130 is used for further study.
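The selection of Nj amounts to a small grid search, sketched below under stated assumptions: the RMSEP values per candidate are taken as given (in practice each list would hold the 50 assessing-set results), the sample standard deviation is used for σ, and the numbers in `rmsep_runs` are toy values for illustration, not results from this study. The helper name `choose_n_variables` is hypothetical.

```python
import statistics

def choose_n_variables(rmsep_runs):
    """Pick the candidate with the lowest mean RMSEP.

    rmsep_runs maps each candidate Nj to a list of RMSEP values from repeated runs.
    Returns (best_nj, summary) where summary[nj] = (mean, standard deviation).
    """
    summary = {
        nj: (statistics.mean(values), statistics.stdev(values))
        for nj, values in rmsep_runs.items()
    }
    best_nj = min(summary, key=lambda nj: summary[nj][0])
    return best_nj, summary

# Candidate grid with a step of 20 starting at 30: 30, 50, ..., 290
candidates = list(range(30, 301, 20))

# Toy RMSEP values, three runs per candidate (illustrative only)
rmsep_runs = {
    30: [3.0, 3.4, 3.2],
    50: [2.0, 2.1, 1.9],
    70: [2.5, 2.6, 2.4],
}
best_nj, summary = choose_n_variables(rmsep_runs)
```

Reporting the spread next to the mean, as in Fig. 5, helps to distinguish a genuinely better Nj from one that happened to win in a few lucky runs.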
Fig. 6 shows the variation of the RMSEP of the assessing set with the iteration number over 1800 iterations. RMSEP is high and unstable when the iteration number is small. It decreases as the iteration number increases, and its variation gradually slows down. RMSEP reaches a minimum at 1300 iterations and rises when the iteration number exceeds 1300. This behaviour agrees with the theory of the consensus strategy, namely that an ideal combined system should consist of several accurate models that disagree in their predictions as much as possible.20 An iteration number of 1300 is therefore used for further study.
Fig. 6 Variation of the prediction error with the iteration number.
Table 1 Prediction results of the different methods
Method | RMSEP (σ) | Recovery (%) |
---|---|---|
MC-UVE-BPLS | 2.089 (0.007) | 87.96–110.09 |
MC-UVE-PLS | 2.142 (0.010) | 87.27–113.39 |
BPLS | 2.143 (0.005) | 87.63–110.20 |
CPLS | 2.255 (0.002) | 87.74–111.56 |
PLS | 2.308 (0.067) | 86.24–117.32 |
Table 1 shows that the proposed MC-UVE-BPLS model, using few variables, improves on conventional linear PLS modeling in predicting diesel CN in terms of accuracy and robustness (RMSEP 2.089 versus 2.308). Comparing MC-UVE-BPLS with full-spectrum BPLS and CPLS indicates that wavelength selection by MC-UVE improves the modeling ability and simplifies the model. The results of MC-UVE-BPLS are also better than those of MC-UVE-PLS, which reflects the advantage of consensus modeling based on Boosting. Overall, MC-UVE-BPLS performs better than the other models studied in dealing with NIR spectral signals.
Table 1 also shows that MC-UVE-PLS, with fewer wavelengths, gives prediction results equivalent to those of BPLS (RMSEP 2.142 versus 2.143). In addition, for the data in this study, BPLS based on Boosting gives more accurate results than CPLS based on Bagging (RMSEP 2.143 versus 2.255).
This journal is © The Royal Society of Chemistry 2012