Jun Bina,
Fang-Fang Aib,
Wei Fan*a,
Ji-Heng Zhou*a,
Yong-Huan Yunc and
Yi-Zeng Liangc
aCollege of Bioscience and Biotechnology, Hunan Agricultural University, Changsha, China. E-mail: wei_fan@foxmail.com; jihengzhou211@163.com; Tel: +86-731-84635356; +86-731-84785708
bShanghai Tobacco Group Co., Ltd., Shanghai, China
cCollege of Chemistry and Chemical Engineering, Central South University, Changsha, China
First published on 17th March 2016
To select informative variables and thereby improve ensemble performance in random forests (RF), a modified RF method, named random forest combined with Monte Carlo sampling and uninformative variable elimination (MC-UVE-RF), is proposed in this work for multi-class classification of near-infrared (NIR) spectra. The MC method is used to increase the diversity of the classification trees in RF, and the UVE method is applied to gradually eliminate the less important variables based on the variable reliability obtained by aggregating all sub-models; together, these two steps constitute a variable selection process. For comparison with MC-UVE-RF, the conventional RF, model population analysis combined with RF (MPA-RF) and the support vector machine (SVM) were also investigated for discriminating tobacco grades by NIR spectroscopy. MC-UVE-RF shows a marked superiority in discriminating the tobacco samples of datasets I and II into high-quality, medium-quality and low-quality groups, with external validation accuracies of 100% and 96.83%, respectively (coarse classification). Furthermore, the external validation accuracies in the subdivision of the high-quality, medium-quality and low-quality groups are 88.46%, 97.22% and 96% for dataset I, and 100%, 97.14% and 100% for dataset II, respectively, which are better than or equal to those of the other methods (refined classification). Therefore, MC-UVE-RF is a powerful alternative for multi-class classification problems. Moreover, coupled with NIR technology, it could serve as a fast and reliable method for discriminating tobacco leaf grades in place of artificial judgment.
Near-infrared (NIR) spectroscopy, as a rapid, simple and non-destructive technique, has become an increasingly popular analytical method for classification problems.14–17 In particular, the recent availability of portable NIR spectrometers enables real-time measurement and has made the technique even more attractive for classification tasks in practical production. The conventional method for discriminating tobacco leaf grades is artificial judgment based on the appearance of the leaves and sensory taste.18,19 The main factor influencing such discrimination results is the varying skill level of the classification operators. Therefore, establishing a standard and reliable method to classify the quality grade of tobacco leaf samples is valuable and essential. Many published studies20,21 have successfully utilized NIR spectra coupled with chemometrics to predict the content of chemical components that are important indicators of tobacco leaf quality.22,23 Consequently, this paper presents and discusses a fast and efficient method for distinguishing different tobacco leaf qualities by NIR spectroscopy.
The purposes of this study are the following: (i) to modify RF by gradually eliminating variables according to the order of variable reliability obtained by aggregating all sub-models; and (ii) to compare the novel approach, namely MC-UVE-RF, with RF, MPA-RF and the support vector machine (SVM) in discriminating different grades of tobacco leaf samples.
Dataset I, collected in 2014, contains 428 NIR spectra of samples in 3 groups: a high-quality group (I), a medium-quality group (I) and a low-quality group (I). The high-quality group (I) consists of 9 grades totalling 128 samples, the medium-quality group (I) of 14 grades totalling 179 samples, and the low-quality group (I) of 9 grades totalling 121 samples.
Dataset II, collected in 2015, likewise contains 317 NIR spectra of samples in 3 groups: a high-quality group (II), a medium-quality group (II) and a low-quality group (II). The high-quality group (II) consists of 5 grades totalling 59 samples, the medium-quality group (II) of 13 grades totalling 171 samples, and the low-quality group (II) of 8 grades totalling 87 samples. Detailed information about the samples is given in Tables 1 and 2.
Table 1 Grade composition of datasets I and II (Null: grade absent from that dataset)

| Dataset | Group | Grade label | Samples | Dataset | Group | Grade label | Samples |
|---|---|---|---|---|---|---|---|
| Dataset I | High-quality group (I) | B1F | 11 | Dataset II | High-quality group (II) | B1F | Null |
| | | B1L | 16 | | | B1L | 11 |
| | | B2F | 10 | | | B2F | Null |
| | | C1F | 16 | | | C1F | 9 |
| | | C1L | 9 | | | C1L | 15 |
| | | C2F | 27 | | | C2F | Null |
| | | C2L | 14 | | | C2L | 12 |
| | | C3F | 11 | | | C3F | Null |
| | | X1F | 14 | | | X1F | 12 |
| | Medium-quality group (I) | B2L | 15 | | Medium-quality group (II) | B2L | 12 |
| | | B2V | 12 | | | B2V | 9 |
| | | B3F | 15 | | | B3F | 14 |
| | | B3L | 15 | | | B3L | 13 |
| | | B3V | 7 | | | B3V | 12 |
| | | B4F | 11 | | | B4F | 13 |
| | | C3L | 15 | | | C3L | Null |
| | | C3V | 9 | | | C3V | 15 |
| | | C4L | 5 | | | C4L | 14 |
| | | X1L | 15 | | | X1L | 15 |
| | | X2F | 15 | | | X2F | 14 |
| | | X2L | 17 | | | X2L | 15 |
| | | X2V | 15 | | | X2V | 15 |
| | | X3F | 13 | | | X3F | 10 |
| | Low-quality group (I) | B1K | 11 | | Low-quality group (II) | B1K | 13 |
| | | B2K | 16 | | | B2K | 6 |
| | | B4L | 16 | | | B4L | Null |
| | | CX1K | 12 | | | CX1K | 12 |
| | | CX2K | 16 | | | CX2K | 12 |
| | | GY1 | 7 | | | GY1 | 13 |
| | | X3L | 13 | | | X3L | 14 |
| | | X4F | 15 | | | X4F | 7 |
| | | X4L | 16 | | | X4L | 10 |
Table 2 Partition of the samples into training and testing sets

| Datasets | Classes | Total | Training set | Testing set |
|---|---|---|---|---|
| Dataset I | 3 | 428 | 341 | 87 |
| High-quality group (I) | 9 | 128 | 102 | 26 |
| Medium-quality group (I) | 14 | 179 | 143 | 36 |
| Low-quality group (I) | 9 | 121 | 96 | 25 |
| Dataset II | 3 | 317 | 254 | 63 |
| High-quality group (II) | 5 | 59 | 47 | 12 |
| Medium-quality group (II) | 13 | 171 | 136 | 35 |
| Low-quality group (II) | 8 | 87 | 69 | 18 |
(1) Draw ntree bootstrap samples from the original training dataset. Each bootstrap sample contains about two-thirds of the original training samples and is used to grow a classification tree; the remaining one-third, the so-called out-of-bag (OOB) samples, can be used to obtain an internal-validation estimate of classification accuracy.
(2) For each bootstrap sample, grow an unpruned classification tree with the following modification: at each node, randomly select mtry variables and choose the best split from among those variables. That is, each tree is constructed using a bootstrap sample of the training data together with random feature selection.
(3) Predict new data by aggregating the predictions of the ntree trees, i.e. by the majority vote of all trees in the forest.
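A minimal sketch of this procedure in Python with scikit-learn is given below; the synthetic data (standing in for the NIR spectra), the split ratio and the ntree/mtry values are illustrative assumptions, not the settings used in this work. ntree and mtry map to the n_estimators and max_features parameters.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an NIR data matrix (samples x wavelength variables).
X, y = make_classification(n_samples=400, n_features=500, n_informative=40,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0, stratify=y)

rf = RandomForestClassifier(
    n_estimators=500,     # ntree: one tree per bootstrap sample, step (1)
    max_features="sqrt",  # mtry: variables tried at each node split, step (2)
    oob_score=True,       # OOB samples give the internal-validation accuracy
    random_state=0,
)
rf.fit(X_train, y_train)
print("OOB accuracy:", rf.oob_score_)              # internal validation
print("Test accuracy:", rf.score(X_test, y_test))  # step (3): majority vote
```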
The most frequently used measure of variable importance in RF is the MDA (mean decrease in accuracy) index based on permutation. For each tree, the classification accuracy on the OOB samples is determined both with and without random permutation of the values of each variable, one variable at a time. The MDA is then used to measure variable importance; the importance of each variable j is calculated as follows:
importance(j) = accuracy_normal(j) − accuracy_permuted(j)
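A sketch of this formula, continuing the code above, is shown below; the held-out test set stands in for the per-tree OOB samples, which scikit-learn does not expose directly, so the values are only an approximation of the per-tree MDA.

```python
import numpy as np

def mda_importance(model, X, y, rng=np.random.default_rng(0)):
    """importance(j) = accuracy_normal - accuracy_permuted(j), estimated on a
    held-out set as a stand-in for the per-tree OOB samples."""
    baseline = model.score(X, y)          # accuracy without permutation
    imp = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        X_perm = X.copy()
        rng.shuffle(X_perm[:, j])         # permute the values of variable j only
        imp[j] = baseline - model.score(X_perm, y)
    return imp

mda = mda_importance(rf, X_test, y_test)  # rf, X_test, y_test from the sketch above
```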
(1) By the MC sampling technique, a certain ratio of samples is randomly selected from the training set as a training subset, and this procedure is repeated thousands of times.
(2) With these training subsets, thousands of classification sub-models are established using RF.
(3) For each sub-model, the variable importance order is obtained by RF. Subsequently, the reliability of each variable j is calculated by aggregating the variable importance results of all sub-models, quantitatively measured as:
r(j) = mean(importance(j))/std(importance(j))
(4) Select the optimum variable combination by running RF while gradually eliminating variables in order of decreasing variable reliability, according to the OOB error of each model established with a different number of variables.
(5) Build the RF model with the newly selected variables and predict new data by aggregating the predictions of the ntree trees with the majority vote of all trees in the forest.
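A minimal sketch of the whole procedure, continuing the Python examples above, follows; note that scikit-learn's built-in Gini importance stands in here for the permutation-based MDA index, and the sub-model count, sampling ratio and elimination step size are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def mc_uve_rf(X, y, n_submodels=1000, ratio=0.8, step=10,
              rng=np.random.default_rng(0)):
    """Sketch of steps (1)-(5); Gini importance is used as a proxy for MDA."""
    n, p = X.shape
    imp = np.empty((n_submodels, p))
    for i in range(n_submodels):                 # steps (1)-(2): MC training subsets
        idx = rng.choice(n, size=int(ratio * n), replace=False)
        sub = RandomForestClassifier(n_estimators=100, n_jobs=-1)
        imp[i] = sub.fit(X[idx], y[idx]).feature_importances_   # step (3)
    reliability = imp.mean(axis=0) / (imp.std(axis=0) + 1e-12)  # r(j) = mean/std
    order = np.argsort(reliability)[::-1]        # most reliable variables first

    best_err, best_vars = np.inf, order          # step (4): backward elimination
    for k in range(p, 0, -step):                 # drop `step` variables at a time
        rf = RandomForestClassifier(n_estimators=500, oob_score=True,
                                    random_state=0, n_jobs=-1)
        rf.fit(X[:, order[:k]], y)
        if 1.0 - rf.oob_score_ < best_err:       # keep the lowest-OOB-error subset
            best_err, best_vars = 1.0 - rf.oob_score_, order[:k]

    final = RandomForestClassifier(n_estimators=500, oob_score=True,
                                   random_state=0, n_jobs=-1)
    final.fit(X[:, best_vars], y)                # step (5): final model
    return final, best_vars

model, keep = mc_uve_rf(X_train, y_train, n_submodels=100)  # fewer sub-models for speed
print("selected variables:", keep.size,
      "| test accuracy:", model.score(X_test[:, keep], y_test))
```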
To aid understanding of this approach, a simple flowchart of MC-UVE-RF is shown in Fig. 1.
The PLS–LDA method, a combination of PLS and LDA, employs class information to maximize the separation between classes; it improves on PLS alone because it can be validated with a set of independent data. Here, dataset I was analyzed by PLS–LDA to construct a linear discrimination model and to explore the distribution trends of the samples. The PLS–LDA score plot for dataset I is shown in Fig. 2, from which one can clearly see that the classification result of PLS–LDA is not satisfactory. Discrimination models were built with the best numbers of principal components: 7, 4, 6 and 5 for dataset I and its three sub-classes, and 2, 3, 4 and 5 for dataset II and its three sub-classes, respectively. The classification accuracies on the training and testing sets of dataset I are only 50% and 44.19%, respectively; the remaining PLS–LDA prediction results for datasets I and II are shown in Tables 3 and 4. As these classification results are unacceptable, it is necessary to look for a more powerful classification algorithm.
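For reference, a minimal PLS–LDA sketch under the same synthetic-data assumptions as above (fitting PLS against a one-hot class matrix, then LDA on the PLS scores) could read:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def fit_pls_lda(X, y, n_components):
    """PLS scores computed against a one-hot class matrix, then LDA on the scores."""
    Y = (y.reshape(-1, 1) == np.unique(y)).astype(float)  # one-hot class membership
    pls = PLSRegression(n_components=n_components).fit(X, Y)
    lda = LinearDiscriminantAnalysis().fit(pls.transform(X), y)
    return pls, lda

pls, lda = fit_pls_lda(X_train, y_train, n_components=7)  # 7 components, as for dataset I
print("test accuracy:", lda.score(pls.transform(X_test), y_test))
```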
Table 3 Classification accuracy (%) of dataset I and its three sub-classes by different methods

| Datasets | Samples | PLS–LDA | SVM | RF | MPA-RF | MC-UVE-RF |
|---|---|---|---|---|---|---|
| Dataset I | Training set | 50(171/342)a | 98.24(336/342) | 100(342/342) | 100(342/342) | 100(342/342) |
| | Testing set | 44.19(38/86) | 97.67(84/86) | 98.84(85/86) | 98.84(85/86) | 100(86/86) |
| High-quality group (I) | Training set | 67.65(69/102) | 100(102/102) | 100(102/102) | 100(102/102) | 100(102/102) |
| | Testing set | 53.85(14/26) | 84.62(22/26) | 88.46(23/26) | 88.46(23/26) | 88.46(23/26) |
| Medium-quality group (I) | Training set | 81.82(117/143) | 97.90(140/143) | 100(143/143) | 100(143/143) | 100(143/143) |
| | Testing set | 88.89(32/36) | 91.67(33/36) | 94.44(34/36) | 94.44(34/36) | 97.22(35/36) |
| Low-quality group (I) | Training set | 77.08(74/96) | 97.92(94/96) | 100(96/96) | 100(96/96) | 100(96/96) |
| | Testing set | 56(14/25) | 80(20/25) | 88(22/25) | 92(23/25) | 96(24/25) |

a A/B represents the correctly predicted samples/all predicted samples.
Table 4 Classification accuracy (%) of dataset II and its three sub-classes by different methods

| Datasets | Samples | PLS–LDA | SVM | RF | MPA-RF | MC-UVE-RF |
|---|---|---|---|---|---|---|
| Dataset II | Training set | 60.24(153/254) | 97.24(247/254) | 100(254/254) | 100(254/254) | 100(254/254) |
| | Testing set | 61.90(39/63) | 96.83(61/63) | 93.65(59/63) | 93.65(59/63) | 96.83(61/63) |
| High-quality group (II) | Training set | 85.11(40/47) | 93.62(44/47) | 100(47/47) | 100(47/47) | 100(47/47) |
| | Testing set | 75(9/12) | 91.67(11/12) | 91.67(11/12) | 91.67(11/12) | 100(12/12) |
| Medium-quality group (II) | Training set | 73.53(100/136) | 99.26(135/136) | 100(136/136) | 100(136/136) | 100(136/136) |
| | Testing set | 68.57(24/35) | 94.28(33/35) | 97.14(34/35) | 97.14(34/35) | 97.14(34/35) |
| Low-quality group (II) | Training set | 79.71(55/69) | 97.10(67/69) | 100(69/69) | 100(69/69) | 100(69/69) |
| | Testing set | 61.11(11/18) | 100(18/18) | 100(18/18) | 100(18/18) | 100(18/18) |
Fig. 3 Effects of the number of trees (ntree) on OOB errors for classifying the training samples of dataset I and its three sub-classes by RF.
Fig. 4 Effects of the number of random split variables (mtry) on OOB errors for classifying the training samples of dataset I and its three sub-classes by RF.
In the following, only the classification process for the overall samples of dataset I is described in detail. In building each sub-model, variable importance is calculated, with the permutation-based MDA index used to evaluate the importance of each variable: the prediction accuracy after permutation is subtracted from that before permutation and averaged over all trees in the forest to give the permutation importance value. A large MDA when a variable is permuted suggests that the variable is very important. However, as illustrated by Huang et al.,13 different variable importance orders can be obtained by RF models on the same dataset at different times, because the trees in the forest differ from one another. Therefore, in order to select informative and reliable variables, 1000 RF sub-models are established in combination with the MC sampling method, with 80% of the samples selected for each sub-model. Each variable receives an MDA index in each sub-model; the 1000 MDA values are gathered, and the reliability of each variable is calculated as the mean/std of its MDA values. The reliability of each variable is shown in Fig. 5(a). The reliabilities are then sorted, and the optimum variables are selected by running RF while gradually eliminating variables. As shown in Fig. 6, the lowest OOB error is obtained when the variable number is 80; accordingly, these 80 variables lie above the red line in Fig. 5(a). Three representative variables are selected to show the MDA distribution in Fig. 5(b): variables A, B and C rank 1st, 80th and 494th in reliability, respectively, A and B being selected variables and C an eliminated one. Clearly, the mean MDA decreases in the order A, B, C, while the spread of the MDA distribution decreases in the order C, B, A. It can be concluded that the selected variables are important and relatively stable.
Therefore, these 80 reliable variables are selected as the variable combination for the subsequent RF runs; the corresponding cutoff value of variable reliability is 5.1318. The other datasets are processed in the same way, and the cutoff values of variable reliability for all datasets are given in Table 5. From Tables 3 and 4, it can be seen that the external validation accuracies of MC-UVE-RF are excellent, demonstrating that MC-UVE-RF is adept at classifying the tobacco leaf samples into the high-quality, medium-quality and low-quality groups; it also performs well in all subdivisions. In RF, multidimensional scaling (MDS) can be used to visualize the similarities between samples to some extent, displaying most of the similarity information in a 2-dimensional or 3-dimensional space. It can be seen from Fig. 7 that a much better classification result is obtained for the three grades than with PLS–LDA. Owing to the projection, the samples in Fig. 7 appear imperfectly separated, but they are in fact completely separated in the higher-dimensional space.
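As an illustration, the RF proximity matrix (the fraction of trees in which two samples fall into the same terminal node) can be embedded with MDS; a sketch continuing the code above (model and keep from the MC-UVE-RF example) might read:

```python
import numpy as np
from sklearn.manifold import MDS

leaves = model.apply(X_train[:, keep])   # (n_samples, ntree) leaf indices per tree
prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)  # pairwise proximity
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(1.0 - prox)          # 2-D map as in Fig. 7
```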
Table 5 Cutoff values of variable reliability for all datasets

| Datasets | Cutoff values |
|---|---|
| Dataset I | 5.1318 |
| High-quality group (I) | 2.7907 |
| Medium-quality group (I) | 4.6030 |
| Low-quality group (I) | 2.9372 |
| Dataset II | 4.5035 |
| High-quality group (II) | 1.0605 |
| Medium-quality group (II) | 2.8718 |
| Low-quality group (II) | 3.5328 |
Fig. 7 Classification by MC-UVE-RF: (a) training set of dataset I and (b) testing set of dataset I.
The stability of discriminant methods that rely on a random component is an important issue, and high stability across different runs is considered an asset. Thus, close attention should be paid to the stability of MC-UVE-RF, an ensemble algorithm based on a random strategy. An experimental study was carried out to examine whether MC-UVE-RF gives stable results: the MC-UVE-RF algorithm was run 100 times for each sub-dataset. The external validation accuracy with the highest frequency for each dataset is shown in Table 6. The frequency of the most common external validation accuracy exceeds 70% for every dataset except dataset II, demonstrating the preferable stability of the MC-UVE-RF method.
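Such a stability check can be sketched as follows, continuing the earlier code (with fewer sub-models per run purely for speed; the loop is slow but shown for clarity):

```python
from collections import Counter
import numpy as np

accs = []
for run in range(100):                       # 100 independent MC-UVE-RF runs
    m, kp = mc_uve_rf(X_train, y_train, n_submodels=100,
                      rng=np.random.default_rng(run))
    accs.append(round(100 * m.score(X_test[:, kp], y_test), 2))

acc, freq = Counter(accs).most_common(1)[0]  # modal accuracy and its frequency
print(f"most frequent external accuracy: {acc}% in {freq}/100 runs")
```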
Table 6 External validation accuracy with the highest frequency over 100 runs of MC-UVE-RF for each dataset

| Datasets | External validation accuracy (%) | Highest frequency (%) |
|---|---|---|
| Dataset I | 100 | 71 |
| High-quality group (I) | 88.46 | 71 |
| Medium-quality group (I) | 97.22 | 72 |
| Low-quality group (I) | 96 | 74 |
| Dataset II | 96.83 | 47 |
| High-quality group (II) | 100 | 100 |
| Medium-quality group (II) | 97.14 | 94 |
| Low-quality group (II) | 100 | 100 |
SVM,32 proposed by Cortes and Vapnik in 1995, is a linear machine working in a high-dimensional feature space formed by the non-linear mapping of the n-dimensional input vector x into a K-dimensional feature space (K > n). To obtain the best results, PCA dimension reduction and normalization were applied before running SVM. Several SVM parameters need to be optimized, including the kernel function, the penalty parameter C and the kernel parameter gamma (γ). A grid search with exponentially growing sequences of C and γ was applied, and the parameters giving the best cross-validation accuracy were used. Finally, the radial basis function (RBF) was chosen as the kernel function for both datasets. From Tables 3 and 4, it can be seen that the external validation accuracies of the proposed method are better than or equal to those of SVM.
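A sketch of such a tuning pipeline (normalization, PCA, then an RBF-kernel SVM with an exponentially spaced grid over C and γ) is given below; the PCA component count and the grid ranges are assumptions, not the authors' exact settings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = make_pipeline(StandardScaler(), PCA(n_components=20), SVC(kernel="rbf"))
grid = {"svc__C": 2.0 ** np.arange(-5, 16, 2),      # exponentially growing C
        "svc__gamma": 2.0 ** np.arange(-15, 4, 2)}  # exponentially growing gamma
search = GridSearchCV(pipe, grid, cv=5).fit(X_train, y_train)  # best CV accuracy
print(search.best_params_, "| test accuracy:", search.score(X_test, y_test))
```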
As many method-comparison studies suffer from equivocalness or from comparisons that are not quite fair, a method-comparison procedure known as the sum of ranking differences (SRD)33–35 was employed to evaluate the classification ability of MC-UVE-RF and the other methods. SRD is a simple, objective procedure that fuses multiple method merits to rank methods, allowing automatic selection of a set of methods; the smaller the SRD value, the better the method. In many applications the row average is used as the benchmark, which corresponds to a hypothetical average column. Other data integration possibilities also exist, such as the row minimum for error rates or residuals and the row maximum for the best classification rates. Since classification problems are addressed in this paper, the row maximum was selected as the benchmark. As seen in Table 7, with the row maximum as the benchmark, MC-UVE-RF has the smallest SRD values and is therefore the best method.
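A minimal SRD computation consistent with this description is sketched below; as an illustration it uses only the testing-set rows of Table 3 for dataset I, so its output need not match Table 7, which was computed by the authors over their full merit table.

```python
import numpy as np
from scipy.stats import rankdata

def srd(scores, benchmark):
    """Rank each method's column and the benchmark column over the rows,
    then sum the absolute rank differences (smaller SRD = closer to benchmark)."""
    ref = rankdata(benchmark)
    return np.array([np.abs(rankdata(c) - ref).sum() for c in scores.T])

# Testing-set accuracies from Table 3 (dataset I); columns in the order
# PLS-LDA, SVM, RF, MPA-RF, MC-UVE-RF; rows are the four data(sub)sets.
acc = np.array([[44.19, 97.67, 98.84, 98.84, 100.00],
                [53.85, 84.62, 88.46, 88.46, 88.46],
                [88.89, 91.67, 94.44, 94.44, 97.22],
                [56.00, 80.00, 88.00, 92.00, 96.00]])
print(srd(acc, acc.max(axis=1)))  # row maximum as the benchmark
```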
Table 7 SRD values and ranks of the five methods with the row maximum as the benchmark

| Datasets | References | PLS–LDA | SVM | RF | MPA-RF | MC-UVE-RF |
|---|---|---|---|---|---|---|
| Dataset I | Row maximum | 18 | 12 | 4 | 2 | 0 |
| | Rank | 5 | 4 | 3 | 2 | 1 |
| Dataset II | Row maximum | 18 | 16 | 8 | 8 | 0 |
| | Rank | 5 | 4 | 3 | 2 | 1 |
The comparison of MC-UVE-RF with three other classification methods in chemometrics, together with the method-comparison methodology, shows that MC-UVE-RF achieves good prediction results and is superior to the other classical algorithms on tobacco leaf grade spectral data.
Therefore, the proposed method is a promising tool for solving tobacco leaf grade classification problems. Moreover, its applicability is by no means limited to the context discussed here, and wide application of the method to multi-class classification systems can be foreseen.
This journal is © The Royal Society of Chemistry 2016