Xiaqiong Fan, Wen Ming, Huitao Zeng, Zhimin Zhang* and Hongmei Lu*
College of Chemistry and Chemical Engineering, Central South University, Changsha, China. E-mail: zmzhang@csu.edu.cn; hongmeilu@csu.edu.cn
First published on 23rd January 2019
Raman spectroscopy is widely used as a fingerprint technique for molecular identification. However, Raman spectra contain molecular information from multiple components as well as interferences from noise and instrumentation. Thus, component identification using Raman spectra remains challenging, especially for mixtures. In this study, a novel approach termed deep learning-based component identification (DeepCID) was proposed to solve this problem. Convolutional neural network (CNN) models were established to predict the presence of components in mixtures. Comparative studies showed that DeepCID could learn spectral features and identify components in both simulated and real Raman spectral datasets of mixtures with higher accuracy and significantly lower false positive rates. In addition, DeepCID showed better sensitivity than logistic regression (LR) with L1-regularization, k-nearest neighbor (kNN), random forest (RF) and back-propagation artificial neural network (BP-ANN) models on ternary mixture spectral datasets. In conclusion, DeepCID is a promising method for solving the component identification problem in the Raman spectra of mixtures.
Searching algorithms combined with databases are the most important approach to component identification. The construction of databases provides researchers with a powerful tool to elucidate Raman spectra.6 In fact, Raman databases are the core of many applications. Many researchers have established relevant databases, such as a database of the Raman spectra of 21 azo pigments,7 a database of FT-Raman spectra,8 the e-VISART database,9 a biological molecule database,10 an explosive compound database,11 and a pharmaceutical raw material database,12 for the rapid and non-destructive identification of samples. With the increasing number of databases, various searching algorithms13–16 have been developed. Most of them measure the similarity between Raman spectra using correlation, Euclidean distance, absolute value correlation or least-squares.17 However, these correlation- and distance-based methods are effective only for the comparison of pure components. In practical applications, multiple components are usually present in the Raman spectra of complex samples. Therefore, there is an urgent need for algorithms that identify the components in the Raman spectra of mixtures.
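To make the correlation- and distance-based matching concrete, the following is a minimal sketch (not the code of any of the cited tools) of ranking a spectral library against a query spectrum using Pearson correlation and Euclidean distance; the function names are illustrative:

```python
import numpy as np

def correlation_similarity(query, reference):
    """Pearson correlation between two spectra sampled on the same axis."""
    q = query - query.mean()
    r = reference - reference.mean()
    return float(np.dot(q, r) / (np.linalg.norm(q) * np.linalg.norm(r)))

def euclidean_distance(query, reference):
    """Euclidean distance between two spectra; smaller means more similar."""
    return float(np.linalg.norm(query - reference))

def library_search(query, library):
    """Rank library entries (name -> spectrum) by correlation similarity."""
    scores = {name: correlation_similarity(query, spec)
              for name, spec in library.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

As the text notes, such similarity scores work well when the query is a pure component, but a mixture spectrum is a superposition of several references and no single library entry matches it closely.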
For component identification in the Raman spectra of mixtures, many statistical and chemometric methods have been proposed. The reverse search18 compares a subset of peaks in an unknown spectrum with the spectra in the database, which is more effective than the forward search method for component identification in mixtures. The interactive self-modelling mixture analysis method (SIMPLISMA)19 was developed by Windig to extract a pure spectrum from a mixture according to the purity spectrum and standard deviation spectrum. A spectral reconstruction algorithm20 based on minimum information entropy has been developed to recover the individual spectra contained in mixtures without any database or a priori knowledge, provided a sufficient number of mixture spectra is available. An automatic system has been presented to identify the Raman spectra of binary mixtures of pigments, with good robustness against some critical factors.21 A component identification method22 has been established based on the partial correlation of each identified component; a generalized linear model computes the probability that an identified component is truly present in a mixture. Recently, a sparse regularized model23 has been proposed as a complement to traditional regression methods to decompose the Raman spectra of a mixture and identify the target components. The reverse searching and non-negative least squares (RSearch-NNLS)12 method has demonstrated outstanding performance in component identification and ratio estimation; however, it requires extensive spectral pre-processing, including smoothing, baseline correction and peak detection.
Most of the abovementioned methods are based on the similarity or distance between spectra, on characteristic peaks, or on database searching algorithms. The choice of searching algorithm, similarity criterion and characteristic peaks significantly affects the identification results. Furthermore, spectral pre-processing, an essential step in these methods, risks introducing errors and variability.24,25 These factors may therefore impair the accuracy of component identification in mixtures.
Deep learning is one of the most actively studied research fields in recent years, aiming to learn features and build predictive models directly from large-scale raw datasets.26 It shows excellent performance in many fields of chemistry and biology, including spectroscopy,27–31 proteomics,32–34 metabolomics35,36 and genomics.37,38 These applications demonstrate the advantages of deep learning techniques in signal extraction, feature learning and the modelling of complex relationships. The convolutional neural network (CNN) is an important branch of deep learning, inspired by the biological visual cognition mechanism.39 The neural network model Neocognitron,40 proposed by Kunihiko Fukushima for visual pattern recognition, is considered the predecessor of the CNN. Other representative neural networks, such as the wavelet neural network (WNN),41 have also been developed according to the visual characteristics of multi-resolution, local correspondence and direction. In the 1990s, the framework of the modern CNN, called LeNet-5, was established by LeCun et al.42,43 LeNet-5 performs well in image representation and digit recognition; however, it was not suitable for complex problems owing to the lack of data and the limited computing capability of the time. In recent years, various methods have been developed for the effective training of deep CNNs. The deeper CNN AlexNet44 achieved excellent performance with the rectified linear unit (ReLU)45 and dropout46 methods. ZFNet,47 VGGNet,48 GoogLeNet49 and ResNet50 are other representative CNNs.
In this study, deep learning-based methods were used to directly identify the components in mixtures from raw Raman spectra. CNN models were built for each compound in the database to achieve the aim of component identification in mixtures. The proposed deep learning-based component identification (DeepCID) can distil useful information from raw Raman spectra and identify the components of mixtures.
| No. | Status | Ratio | Components | Prediction results | TP | TN | FP | FN | TPR (%) | FPR (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Liquid | 5:3:2 | Methanol, Ethanol, Acetonitrile | Methanol, Ethanol, Acetonitrile | 3 | 164 | 0 | 0 | 100.0 | 0.0 |
| 2 | Liquid | 7:2:1 | Methanol, Ethanol, Acetonitrile | Methanol, Ethanol, Acetonitrile | 3 | 164 | 0 | 0 | 100.0 | 0.0 |
| 3 | Liquid | 4:3:3 | Methanol, Ethanol, Acetonitrile | Methanol, Ethanol, Acetonitrile | 3 | 164 | 0 | 0 | 100.0 | 0.0 |
| 4 | Powder | 1:1 | Polyacrylamide, Sodium acetate | Polyacrylamide, Sodium acetate | 2 | 165 | 0 | 0 | 100.0 | 0.0 |
| 5 | Powder | 1:1:1 | Polyacrylamide, Sodium acetate, Sodium carbonate | Polyacrylamide, Sodium acetate, Sodium carbonate | 3 | 164 | 0 | 0 | 100.0 | 0.0 |
| 6 | Powder | 5:3:2 | Polyacrylamide, Sodium acetate, Sodium carbonate | Polyacrylamide, Sodium acetate, Sodium carbonate | 3 | 164 | 0 | 0 | 100.0 | 0.0 |
Fig. 2 Example showing how the convolution kernel (1 × w) works in a one-layer CNN. The one-dimensional network has one input layer and n output layers.
The rectified linear unit (ReLU) is used as the activation function between layers:
ReLU(x) = max(0, W^T x + b) | (1) |
The SoftMax function is often used in the final layer of a neural network-based classifier. It normalizes the predicted outputs into a probability distribution over the corresponding classes:
SoftMax(z_i) = e^{z_i} / Σ_j e^{z_j} | (2) |
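The two activation functions of eqn (1) and (2) can be written in a few lines of NumPy; this is a generic sketch rather than the paper's implementation:

```python
import numpy as np

def relu(z):
    # Eqn (1): element-wise max(0, z), where z = W^T x + b
    return np.maximum(0.0, z)

def softmax(z):
    # Eqn (2): normalize logits into a probability distribution.
    # Subtracting the max is a standard numerical-stability trick
    # that leaves the result unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()
```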
Pooling is another important concept in CNNs. It can be regarded as non-linear down-sampling, which reduces the size of the representation, the number of parameters and the amount of computation.52 The pooling function uses a statistical characteristic (the maximum, in this study) of adjacent outputs at a certain location. For Raman spectra, pooling provides translation invariance and increases robustness to peak displacement.
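A minimal sketch of one-dimensional max pooling illustrates the translation invariance mentioned above: a peak shifted by one point within a pooling window produces the same pooled output. The function is illustrative, not the paper's code:

```python
import numpy as np

def max_pool_1d(x, pool_size=2, stride=2):
    """Non-overlapping 1-D max pooling over a spectrum-like vector."""
    n = (len(x) - pool_size) // stride + 1
    return np.array([x[i * stride : i * stride + pool_size].max()
                     for i in range(n)])

# A peak at index 2 and the same peak shifted to index 3 fall in the
# same pooling window, so both pool to the same output vector.
peak_a = np.array([0.0, 0.0, 5.0, 0.0, 0.0, 0.0])
peak_b = np.array([0.0, 0.0, 0.0, 5.0, 0.0, 0.0])
```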
For component identification, a CNN can extract the features of the spectra and learn to identify compounds under complex interferences. Since each Raman spectrum can be regarded as a one-dimensional vector, a four-layered one-dimensional CNN is used. The convolutional layers extract the spectral features, and the fully connected layers (1024 and 2 nodes) build the relationship between the feature maps and the presence of the compound. Each convolutional layer is followed by a pooling function to improve learning efficiency. In each convolutional layer and fully connected layer, dropout is applied to prevent overfitting. The architecture of the CNN used herein is shown in Fig. 1.
As mentioned above, ReLU and SoftMax are used as activation functions. Adaptive moment estimation (Adam)53 is used to train the networks since it requires little memory and is computationally efficient; it is invariant to diagonal rescaling of the gradients and well suited to problems with large amounts of data. Cross-entropy loss was used as the loss function. The kernel weights of the network were initialized from a truncated normal distribution, and the biases were initialized to the constant 0.1 (values from 0.01 to 1.00 were considered during optimization). The learning rate was 10−4 and the batch size was 100 after optimization. The number of epochs ranged from 200 to 300 for the models of different components; the criterion for selecting the number of epochs was the point at which further increases no longer improved accuracy. The accuracy-epoch and loss-epoch curves are shown in Fig. 3 (taking tri(hydroxymethyl)aminomethane as an example; the other 166 models achieved similar results, reproducible by running the code in our GitHub repository).
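The architecture and training setup described above can be sketched with TensorFlow/Keras. The filter counts and kernel widths below are illustrative assumptions (the paper's exact values are in its GitHub repository); the conv + pool + dropout blocks, the 1024- and 2-node dense layers, the truncated-normal/constant-0.1 initialization, Adam with a 10−4 learning rate and the cross-entropy loss follow the description in the text:

```python
import tensorflow as tf
from tensorflow.keras import layers, models, initializers

def build_deepcid_like_model(n_points):
    """Binary CNN for one database compound: present vs. absent.
    Filter counts (32, 64) and kernel width (5) are assumptions."""
    init_w = initializers.TruncatedNormal(stddev=0.1)  # stddev assumed
    init_b = initializers.Constant(0.1)                # bias constant from text
    model = models.Sequential([
        layers.Input(shape=(n_points, 1)),
        layers.Conv1D(32, 5, activation="relu",
                      kernel_initializer=init_w, bias_initializer=init_b),
        layers.MaxPooling1D(2),
        layers.Dropout(0.5),
        layers.Conv1D(64, 5, activation="relu",
                      kernel_initializer=init_w, bias_initializer=init_b),
        layers.MaxPooling1D(2),
        layers.Dropout(0.5),
        layers.Flatten(),
        layers.Dense(1024, activation="relu",
                     kernel_initializer=init_w, bias_initializer=init_b),
        layers.Dropout(0.5),
        layers.Dense(2, activation="softmax"),  # probability of absent/present
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

One such model would be trained per database compound, and a mixture spectrum is then screened against all models.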
Fig. 3 Accuracy-epoch and loss-epoch curves of the validation set during training (using tri(hydroxymethyl)aminomethane as an example).
TPR = TP/(TP + FN) | (3) |
FPR = FP/(FP + TN) | (4) |
Accuracy = (TP + TN)/(TP + TN + FP + FN) | (5) |
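The performance metrics used throughout the tables (TPR, FPR and accuracy, computed from the TP/TN/FP/FN counts) can be expressed as a small helper; the function name is illustrative:

```python
def classification_metrics(tp, tn, fp, fn):
    """True positive rate, false positive rate and accuracy
    from confusion-matrix counts."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    acc = (tp + tn) / (tp + tn + fp + fn)
    return tpr, fpr, acc
```

For example, row 1 of the mixture table (TP = 3, TN = 164, FP = 0, FN = 0) gives TPR = 100.0% and FPR = 0.0%.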
Fig. 4 Receiver operating characteristic curves of the different methods (taking tri(hydroxymethyl)aminomethane as an example).
Accuracy (%) | DeepCID | RF | LR | kNN | BP-ANN | FCNN |
---|---|---|---|---|---|---|
Methanol | 99.2 | 98.7 | 99.7 | 91.2 | 96.7 | 98.7 |
Ethanol | 99.5 | 98.4 | 99.7 | 82.8 | 98.2 | 98.5 |
Acetonitrile | 99.6 | 98.4 | 99.7 | 82.2 | 95.0 | 99.2 |
Polyacrylamide | 99.7 | 98.2 | 99.6 | 73.7 | 91.0 | 97.7 |
Sodium acetate | 99.9 | 98.9 | 99.8 | 82.3 | 97.0 | 98.7 |
Sodium carbonate | 99.9 | 98.2 | 99.8 | 77.6 | 90.4 | 99.0 |
| No. | TPR (DeepCID) | TPR (RF) | TPR (LR) | TPR (kNN) | TPR (BP-ANN) | TPR (FCNN) | FP (DeepCID) | FP (RF) | FP (LR) | FP (kNN) | FP (BP-ANN) | FP (FCNN) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 0 | 2 | 5 | 37 | 3 | 1 |
2 | 3/3 | 3/3 | 3/3 | 3/3 | 2/3 | 3/3 | 0 | 0 | 4 | 23 | 3 | 0 |
3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 0 | 1 | 5 | 28 | 4 | 2 |
4 | 2/2 | 2/2 | 2/2 | 2/2 | 2/2 | 2/2 | 0 | 1 | 4 | 46 | 15 | 33 |
5 | 3/3 | 3/3 | 3/3 | 1/3 | 3/3 | 3/3 | 0 | 1 | 2 | 43 | 6 | 13 |
6 | 3/3 | 3/3 | 3/3 | 1/3 | 3/3 | 2/3 | 0 | 3 | 4 | 47 | 12 | 21 |
Statistic | 1.00 | 1.00 | 1.00 | 0.78 | 0.94 | 0.94 | 0 | 8 | 24 | 224 | 37 | 70 |
a For each method, the three marks correspond, in order, to the three acetonitrile volumes listed in the same row. ✓: methanol detected by the corresponding method; ×: methanol not detected.

| Percentage of methanol (%) | Volume of methanol/mL | Volume of acetonitrile/mL | Volume of distilled water/mL | DeepCID | RF | LR | kNN | BP-ANN | FCNN |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.5 | 0/10/20 | 49.5/39.5/29.5 | ×/×/× | ×/×/× | ×/×/× | ×/×/× | ×/×/× | ×/×/× |
| 4 | 2.0 | 0/10/20 | 48/38/28 | ✓/✓/✓ | ×/×/× | ✓/×/× | ×/×/× | ✓/×/× | ×/×/× |
| 7 | 3.5 | 0/10/20 | 46.5/36.5/26.5 | ✓/✓/✓ | ×/×/× | ✓/×/× | ×/×/× | ✓/×/× | ✓/×/× |
| 10 | 5.0 | 0/10/20 | 45/35/25 | ✓/✓/✓ | ×/×/× | ✓/✓/× | ✓/×/× | ✓/✓/× | ✓/✓/× |
| 13 | 6.5 | 0/10/20 | 43.5/33.5/23.5 | ✓/✓/✓ | ×/×/× | ✓/✓/× | ✓/×/× | ✓/✓/× | ✓/✓/✓ |
| 16 | 8.0 | 0/10/20 | 42/32/22 | ✓/✓/✓ | ×/×/✓ | ✓/✓/✓ | ✓/×/× | ✓/✓/× | ✓/✓/✓ |
| 20 | 10.0 | 0/10/20 | 40/30/20 | ✓/✓/✓ | ×/×/✓ | ✓/✓/✓ | ✓/×/× | ✓/✓/✓ | ✓/✓/✓ |
| 23 | 11.5 | 0/10/20 | 38.5/28.5/18.5 | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/× | ✓/✓/✓ | ✓/✓/✓ |
| 26 | 13.0 | 0/10/20 | 37/27/17 | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/× | ✓/✓/✓ | ✓/✓/✓ |
| 30 | 15.0 | 0/10/20 | 35/25/15 | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/× | ✓/✓/✓ | ✓/✓/✓ |
| 33 | 16.5 | 0/5/10 | 33.5/28.5/22.5 | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ |
| 36 | 18.0 | 0/5/10 | 32/27/22 | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ |
| 40 | 20.0 | 0/5/10 | 30/25/20 | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ |
| 43 | 21.5 | 0/5/10 | 28.5/23.5/18.5 | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ |
| 46 | 23.0 | 0/5/10 | 27/22/17 | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ |
| 50 | 25.0 | 0/5/10 | 25/20/15 | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ |
| 53 | 26.5 | 0/5/10 | 23.5/18.5/13.5 | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ |
| 56 | 28.0 | 0/5/10 | 22/17/12 | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ |
| 60 | 30.0 | 0/5/10 | 20/15/10 | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ |
| 63 | 31.5 | 0/5/10 | 18.5/13.5/8.5 | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ |
| 66 | 33.0 | 0/2/5 | 17/15/12 | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ |
| 70 | 35.0 | 0/2/5 | 15/13/10 | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ |
| 73 | 36.5 | 0/2/5 | 13.5/11.5/8.5 | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ |
| 76 | 38.0 | 0/2/5 | 12/10/7 | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ |
| 80 | 40.0 | 0/2/5 | 10/8/5 | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ |
| 83 | 41.5 | 0/2/5 | 8.5/6.5/3.5 | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ |
| 86 | 43.0 | 0/2/5 | 7/5/2 | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ |
| 90 | 45.0 | 0/2/5 | 5/3/0 | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ |
| 93 | 46.5 | 0/1/3.5 | 3.5/2.5/0 | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ |
| 96 | 48.0 | 0/1/2 | 2/1/0 | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ |
| 99 | 49.5 | 0/0.25/0.5 | 0.5/0.25/0 | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ | ✓/✓/✓ |
| 100 | 50.0 | 0 | 0 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
DeepCID achieved excellent TPR and FPR results, and Fig. 5 shows an example of the statistical results (tri(hydroxymethyl)aminomethane is taken as an example; the other 166 components achieved similar results, reproducible by running the code in our GitHub repository). Moreover, one hundred sample partitions and trainings were performed for the different models to obtain the histograms of TPR and FPR. The TPR distribution of DeepCID is closer to 1.0 than that of the other methods, and its FPR distribution is closer to 0.0 than those of RF, kNN, BP-ANN and FCNN. Although the FPR distribution of LR is close to that of DeepCID, DeepCID performs better on the other datasets (such as the liquid and powder mixture dataset and the ternary mixture dataset).
As mentioned above, the training datasets were generated randomly from the database. The spectra in the training datasets contained not only the noise of the original spectra but also interference from other compounds. Moreover, the liquid and powder mixture dataset and the ternary mixture dataset were acquired using different Raman spectrometers. Even so, DeepCID achieved excellent identification, with no false positive or false negative errors. This means that DeepCID can alleviate not only noise interference (different backgrounds) but also interference from other compounds.
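The random generation of training spectra from the database can be sketched as follows. The exact augmentation scheme is in the paper's GitHub repository; this sketch assumes random linear mixing of database spectra with additive Gaussian noise, and the function name and parameters are illustrative:

```python
import numpy as np

def augment_spectra(target, others, n_samples=1000, rng=None):
    """Generate labelled training spectra for one compound's binary CNN.
    target: pure spectrum of the compound of interest.
    others: list of pure spectra of interfering compounds."""
    rng = np.random.default_rng(rng)
    X, y = [], []
    for _ in range(n_samples):
        # Random mixture of two interfering components.
        idx = rng.choice(len(others), size=2, replace=False)
        mix = sum(rng.uniform(0, 1) * others[i] for i in idx)
        present = rng.random() < 0.5
        if present:
            # Add the target compound at a random concentration.
            mix = mix + rng.uniform(0.1, 1.0) * target
        # Simulated instrument noise.
        mix = mix + rng.normal(0, 0.01, size=target.shape)
        X.append(mix)
        y.append(int(present))
    return np.array(X), np.array(y)
```

The labels mark whether the target compound is present, so each CNN learns to detect its compound under varying interference and noise.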
In fact, DeepCID can also alleviate displacements in Raman shift and variations in Raman intensity between different instruments. Taking the pure spectrum of methanol as an example, it can be seen clearly in Fig. 6 that the Raman peaks shift from 1034 and 1452 cm−1 (database) to 1036 and 1454 cm−1 (ternary mixture dataset), respectively. In addition, the maximum intensity of 100% methanol in the ternary mixture dataset was nearly twice that in the database. Even so, DeepCID accurately identified methanol in the mixtures of the ternary mixture dataset. This is strong evidence that DeepCID tolerates displacements in Raman shift and variations in Raman intensity.
DeepCID showed excellent sensitivity, especially at low concentrations: it identified methanol in the ternary mixture dataset at percentages as low as 4%. By comparison, FCNN correctly detected methanol only above 13%, and LR with L1-regularization, BP-ANN, RF and kNN only above 16%, 20%, 23% and 33%, respectively. DeepCID therefore offers a clear improvement in sensitivity over the other methods.
An NVidia GeForce GTX Titan X GPU was used to accelerate the training of the CNNs. Training for 300 epochs took about five minutes. During prediction, reloading one model and making a prediction took less than one second, and the analysis of 6 mixture spectra with all 167 CNN models was completed within two minutes. If the 167 models were preloaded into memory, prediction for the 6 mixture spectra would finish within one second. DeepCID is therefore efficient enough to identify components in mixtures.
Currently, DeepCID can be viewed as a pre-trained model. When new samples become available for training, DeepCID can easily be improved further, since all training samples are generated by augmentation from the database; if real samples are available for fine-tuning, DeepCID will achieve even better performance.
With DeepCID, pre-processing can be avoided, which not only simplifies the pipeline but also avoids introducing errors and variations from pre-processing. To verify that DeepCID does not require pre-processing, models were also built from pre-processed data in the same way as DeepCID. The Whittaker smoother57 and adaptive iteratively reweighted penalized least squares (airPLS)51,58 were used to smooth the Raman spectra and correct their baselines. Some of the results are listed in Tables S2 and S3.† The comparison shows that although pre-processing improved the fitting accuracy, it reduced the generalization ability of the models; specifically, DeepCID with pre-processing increased the probability of false positive and false negative errors.
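For reference, the Whittaker smoother mentioned above solves the penalized least-squares problem min_z ||y − z||² + λ||Dz||², where D is a difference matrix; a minimal sketch with SciPy sparse matrices follows (λ and the difference order are tunable; airPLS builds on this solver by iteratively reweighting the fidelity term to estimate a baseline):

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def whittaker_smooth(y, lam=100.0, d=2):
    """Whittaker smoother: minimize ||y - z||^2 + lam * ||D z||^2,
    where D is the d-th order difference matrix.
    The solution satisfies (I + lam * D^T D) z = y."""
    m = len(y)
    D = sparse.eye(m, format="csc")
    for _ in range(d):
        D = D[1:] - D[:-1]  # build the d-th order difference matrix
    A = sparse.eye(m, format="csc") + lam * (D.T @ D)
    return spsolve(A.tocsc(), y)
```

Larger λ gives a smoother result; in the paper this step was applied only in the comparison experiment, since DeepCID itself works on raw spectra.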
Fig. 7 Stability and reproducibility of DeepCID. One hundred training runs were performed for the acetonitrile and methanol models on the simulated test set.
Footnote
† Electronic supplementary information (ESI) available. See DOI: 10.1039/c8an02212g |
This journal is © The Royal Society of Chemistry 2019 |