Jinchao Liu,†a Margarita Osadchy,b Lorna Ashton,c Michael Foster,†d Christopher J. Solomon,e and Stuart J. Gibson*e
aVisionMetric Ltd., Canterbury, Kent, CT2 7FG, UK. E-mail: liujinchao2000@gmail.com; Tel: +44 (0) 1227 811790
bDepartment of Computer Science, University of Haifa, Mount Carmel, Haifa 31905, Israel. E-mail: rita@cs.haifa.ac.il; Tel: +972-4-8288444
cDepartment of Chemistry, Lancaster University, Bailrigg, Lancaster, LA1 4YW, UK. E-mail: l.ashton@lancaster.ac.uk; Tel: +44 (0)1524 593552
dIS-Instruments Ltd., 220 Vale Road, Tonbridge, Kent TN9 1SP, UK. E-mail: mfoster@is-instruments.com; Tel: +44 (0)1732 373020
eSchool of Physical Sciences, University of Kent, Canterbury, CT2 7NH, UK. E-mail: S.J.Gibson@kent.ac.uk; Tel: +44 (0)1227 823271
First published on 28th September 2017
Machine learning methods have found many applications in Raman spectroscopy, especially for the identification of chemical species. However, almost all of these methods require non-trivial preprocessing such as baseline correction and/or PCA as an essential step. Here we describe our unified solution for the identification of chemical species in which a convolutional neural network is trained to automatically identify substances according to their Raman spectrum without the need for preprocessing. We evaluated our approach using the RRUFF spectral database, comprising mineral sample data. Superior classification performance is demonstrated compared with other frequently used machine learning algorithms including the popular support vector machine method.
In this work we focus on multivariate methods and introduce the application of convolutional neural networks (CNNs) in the context of Raman spectroscopy. Unlike current Raman analysis pipelines, a CNN combines preprocessing, feature extraction and classification in a single architecture that can be trained end-to-end with no manual tuning. We show that a CNN not only greatly simplifies the development of a machine classification system for Raman spectroscopy, but also achieves significantly higher accuracy. In particular, we show that a CNN trained on raw spectra significantly outperformed other machine learning methods, such as the support vector machine (SVM), applied to baseline corrected spectra. Our method is also extremely fast, with a processing rate of one sample per millisecond.‡
The baseline component of a Raman spectrum is caused primarily by fluorescence; it can be several orders of magnitude more intense than the actual Raman scatter and adversely affects the performance of machine learning systems. Despite considerable effort in this area, baseline correction remains a challenging problem, especially for a fully automatic system.1
A variety of methods for automatic baseline correction have been used, such as polynomial baseline modelling,1 simulation-based methods,2,3 and penalized least squares.4,5 Lieber et al.1 proposed a modified least-squares polynomial curve fitting for fluorescence subtraction which was shown to be effective. Eilers et al.6 proposed a method called asymmetric least squares smoothing, in which one first smooths the signal with a Whittaker smoother to obtain an initial baseline estimate, and then applies asymmetric least squares fitting in which positive deviations from the baseline estimate are weighted (much) less than negative ones. This has been shown to be a useful method and can in principle be used for automatic baseline correction, although it may occasionally require human input. Kneen et al.2 proposed a method called rolling ball, in which one imagines a ball with tunable radius rolling over/under the signal; the trace of its lowest/highest point is taken as the estimated baseline. A similar method is rubber band,3 where one simulates a rubber band to find the convex hull of the signal, which can then be used as a baseline estimate. Zhang et al.4 presented a variant of penalized least squares, the adaptive iteratively reweighted penalized least squares (airPLS) algorithm, which iteratively adapts weights controlling the residual between the estimated baseline and the original signal. A detailed review and comparison of baseline correction methods can be found in Schulze et al.7
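To make the asymmetric least squares idea concrete, the following minimal sketch estimates a baseline in the spirit of Eilers' method; it uses SciPy, and the smoothness parameter `lam` and asymmetry `p` are illustrative defaults rather than values taken from the cited work:

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def asls_baseline(y, lam=1e5, p=0.01, n_iter=10):
    """Asymmetric least squares baseline estimate (after Eilers).

    lam controls the smoothness of the baseline, p the asymmetry:
    points above the current baseline estimate (Raman peaks) get
    weight p, points below it get weight 1 - p.
    """
    L = len(y)
    # Second-order difference operator; penalises curvature of the baseline.
    D = sparse.diags([1, -2, 1], [0, -1, -2], shape=(L, L - 2))
    w = np.ones(L)
    for _ in range(n_iter):
        W = sparse.spdiags(w, 0, L, L)
        # Solve (W + lam * D D^T) z = W y for the baseline z.
        z = spsolve((W + lam * D @ D.T).tocsc(), w * y)
        # Re-weight: peaks count (much) less than points below the baseline.
        w = p * (y > z) + (1 - p) * (y < z)
    return z

# Usage: corrected = spectrum - asls_baseline(spectrum)
```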
Classification rates have been compared for various machine learning algorithms using Raman data. The method most frequently reported to outperform the alternatives is the support vector machine (SVM).8 An SVM is trained by searching for the hyperplane that optimally separates labelled training data, maximising the margin between the training samples and the hyperplane. Binary (two class) and small scale problems in Raman spectroscopy have previously been addressed using this method. A large proportion of these, relating to applications in the health sciences, used a non-linear SVM with a radial basis function kernel and an initial principal component analysis (PCA) data reduction step. In this context SVM was shown to: outperform linear discriminant analysis (LDA) and partial least squares discriminant analysis (PLS-LDA) in breast cancer diagnosis,9 successfully sort unlabelled biological samples into one of three classes (normal, hyperplastic polyps or adeno-carcinomas),10 and discriminate between three species of bacteria using a small number of training and test examples.11 Although multiclass classification is possible using SVM, in practice training a non-linear SVM is infeasible for large scale problems involving thousands of classes. Random forests (RF)12 represent a viable alternative to SVM for high dimensional data with a large number of training examples. RF is an ensemble learning method based on multiple decision trees that avoids overfitting the model to the training set. The method attracted considerable attention in the machine learning community in the decade prior to the widespread popularity of CNNs; however, when compared with PCA-LDA and RBF SVM on Raman microspectroscopy data13 it performed poorly. The method previously applied to spectral classification that is closest to our own approach is the fully connected artificial neural network (ANN). Unlike a CNN, an ANN is a shallow architecture which does not have enough capacity to solve large scale problems. Maquel et al.14 determined the major groupings for their data prior to a multilayered ANN analysis. Their study concluded that vibrational spectroscopic techniques are well suited to automatic classification and can therefore be used by non-experts and at low cost.
A drawback of the methods previously used is that they require feature engineering (or preprocessing) and do not necessarily scale easily to problems involving a large number of classes. Motivated by the recent and widespread success of CNNs in large scale image classification problems, we developed our network architecture for the classification of 1D spectral data. A suitable dataset for testing the efficacy of the CNN is the RRUFF mineral dataset. Previous work15,16 has focused on identifying the mineral species in this dataset using nearest neighbour methods with different similarity metrics, such as cosine similarity and correlation (also used in commercial software such as CrystalSleuth). Carey et al.16 achieved a species classification accuracy of 84.8% on a subset of the RRUFF database17 using a weighted neighbour (WN) classifier. Square root squashing, maximum intensity normalisation, and sigmoid transformations were applied to the data prior to classification. Accuracy was determined using cross validation with semi-randomised splits over a number of trials. The WN classifier compared favourably with the k = 1 nearest neighbour classifier (82.1% accuracy), on which the CrystalSleuth matching software is believed to be based. In Table 1 we summarise the sample data used in our own work and in some previous Raman based spectral classification studies.
| Study | #Classes | #Spectra | Baseline removal |
|---|---|---|---|
| Sattlecker et al.9 | 2 | 1905 | N/A^a |
| Kwiatkowski et al.18 | 10 | N/A | Yes |
| Carey et al.16 | 1215 | 3950 | Yes |
| Ours #1 | 1671 | 5168 | Yes |
| Ours #2 | 512 | 1676 | No |

^a Note that in this work special filtering methods were developed to discard spectra of bad quality, which accounted for 80% of the total amount.
CNNs are designed to extract features from an input signal at different levels of abstraction. A typical CNN comprises convolutional layers, which learn filter maps for different types of pattern in the input, and pooling operators, which extract the most prominent structures. The combination of convolutional and pooling layers extracts features (or patterns) hierarchically. Convolutional layers share weights, which reduces the amount of computation and makes the classifier invariant to spatial translation. The fully connected layers (that follow the convolutional and pooling layers) and the softmax output layer can be viewed as a classifier operating on the features of the Raman spectra extracted by the convolutional and pooling layers. Since all layers are trained together, CNNs integrate feature extraction with classification: the features determined by network training are optimal with respect to the performance of the classifier. Such end-to-end trainable systems offer a much better alternative to a pipeline in which each part is trained independently or crafted manually.
In this work, we evaluated the application of a number of prominent CNN architectures including LeNets,21 Inception22 and Residual Nets23 to Raman spectral data. All three showed comparable classification results even though the latter two have been considered superior to LeNet in computer vision applications. We adopted a variant of LeNet, comprising pyramid-shaped convolutional layers for feature extraction and two fully-connected layers for classification. A graphical illustration of the network is shown in Fig. 1.
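As a concrete illustration, the sketch below builds a LeNet-style 1D CNN in Keras of the kind described above. The number of convolutional blocks, the filter counts, kernel sizes and fully-connected layer width are illustrative assumptions, not the exact published configuration:

```python
from tensorflow.keras import layers, models

def build_model(input_len, n_classes):
    """LeNet-style 1D CNN: a pyramid of convolutional blocks for feature
    extraction followed by two fully-connected layers for classification."""
    m = models.Sequential()
    m.add(layers.Input(shape=(input_len, 1)))
    for n_filters in (16, 32, 64):                 # pyramid-shaped conv stack
        m.add(layers.Conv1D(n_filters, 9, padding='same'))
        m.add(layers.BatchNormalization())         # batch norm after each layer
        m.add(layers.LeakyReLU(alpha=0.1))         # eqn (1) nonlinearity
        m.add(layers.MaxPooling1D(pool_size=2))    # s x 1 max pooling
    m.add(layers.Flatten())
    m.add(layers.Dense(512, activation='tanh'))    # first fully-connected layer
    m.add(layers.Dropout(0.5))                     # dropout after the first FC layer
    m.add(layers.Dense(n_classes, activation='softmax'))
    return m
```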
Fig. 1 Diagram of the proposed CNN for spectrum recognition. It consists of a number of convolutional layers for feature extraction and two fully-connected layers for classification.
For our convolutional layers, we used LeakyReLU24 nonlinearity, defined as
$$f(x) = \begin{cases} x, & x > 0 \\ \alpha x, & x \le 0 \end{cases} \qquad (1)$$

where α is a small positive slope applied to negative inputs.
Formally, a convolutional layer can be expressed as follows:

$$y_j = \sum_i k_{ij} * x_i + b_j$$

where $x_i$ and $y_j$ are the $i$-th input map and the $j$-th output map, respectively, $k_{ij}$ is the convolutional kernel between maps $i$ and $j$, $*$ denotes convolution, and $b_j$ is the bias parameter of the $j$-th map.
The convolutional layer is followed by a max-pooling layer, in which each neuron in the output map $y^i$ pools over an $s \times 1$ non-overlapping region of the input map $x^i$. Formally,

$$y^i_j = \max_{0 \le m < s} x^i_{js+m}.$$
The upper layers of the CNN are fully connected, followed by a softmax with the number of outputs equal to the number of classes considered. We used tanh as the non-linearity in the fully connected layers. The softmax operates as a squashing function that re-normalizes a $K$-dimensional input vector $\mathbf{z}$ of real values to values in the range [0, 1] that sum to 1, specifically,

$$\sigma(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \qquad j = 1, \ldots, K.$$
To avoid overfitting the model to the data, we applied batch normalization25 after each layer and dropout26 after the first fully connected layer. Further details of the architecture are shown in Fig. 1.
To compensate for the imbalance in the number of spectra per class, each training sample $x_n$ was weighted by

$$w_n = \frac{N}{K \cdot \#C} \qquad (2)$$

where #C is the number of samples in the class C that $x_n$ belongs to, N is the total number of samples, and K is the number of classes.
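A short sketch of this weighting, assuming integer class labels, is:

```python
import numpy as np

def balanced_sample_weights(labels):
    """Per-sample weights w_n = N / (K * #C) as in eqn (2): N samples in
    total, K classes, #C samples in the class that sample n belongs to."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    N, K = len(labels), len(classes)
    count_of = dict(zip(classes, counts))
    return np.array([N / (K * count_of[c]) for c in labels])
```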
A CNN is a data hungry model. To reduce the data volume requirements we used augmentation, a very common approach for increasing the size of CNN training sets. We propose the following data augmentation procedure: (1) each spectrum was randomly shifted left or right by a few wavenumbers; (2) random noise, proportional to the magnitude at each wavenumber, was added; (3) for substances with more than one spectrum, we took linear combinations of all spectra belonging to the same substance as augmented data, with the coefficients of the linear combination chosen at random.
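A minimal sketch of this procedure follows; the shift range and noise scale are illustrative assumptions, since the exact settings are not stated here:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(spectrum, max_shift=3, noise_scale=0.02):
    """Return one augmented copy of a spectrum."""
    shift = rng.integers(-max_shift, max_shift + 1)
    s = np.roll(spectrum, shift)                      # (1) random wavenumber shift
    noise = noise_scale * np.abs(s) * rng.standard_normal(len(s))
    return s + noise                                  # (2) magnitude-proportional noise

def mix_spectra(spectra):
    """(3) Random convex combination of spectra of the same substance."""
    w = rng.random(len(spectra))
    return np.average(spectra, axis=0, weights=w / w.sum())
```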
The training of the CNN was performed using the Adam algorithm,27 a variant of stochastic gradient descent, for 50 epochs with a learning rate of 1 × 10−3, β1 = 0.9, β2 = 0.999, and ε = 1 × 10−8. The layers were initialised from a Gaussian distribution with zero mean and a variance of 0.05. We applied early stopping to prevent overfitting. Training was performed on a single NVIDIA GTX-1080 GPU and took around seven hours; at inference, processing a spectrum took less than one millisecond.
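Using the sketch functions defined above (build_model, balanced_sample_weights) and placeholder arrays x_train, y_train, x_val and y_val, the training setup described here could look roughly as follows in Keras; the early stopping patience and input sizes are assumptions:

```python
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

model = build_model(input_len=1024, n_classes=1671)   # sizes illustrative
model.compile(optimizer=Adam(learning_rate=1e-3, beta_1=0.9,
                             beta_2=0.999, epsilon=1e-8),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train,
          epochs=50,
          validation_data=(x_val, y_val),
          sample_weight=balanced_sample_weights(y_train),  # eqn (2) weights
          callbacks=[EarlyStopping(patience=5, restore_best_weights=True)])
```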
The proposed CNN was implemented using Keras29 and TensorFlow.30 The gradient boosting machine method was implemented with LightGBM, released by Microsoft. All other methods were implemented using scikit-learn.31
Fig. 2 (a) Spectra of an example mineral species (actinolite) indicating the within-class spectrum variation and (b) a frequency plot showing the imbalance in the number of spectra per species.
In large scale classification, some classes can be quite similar, and differentiating between them can be very difficult or even impossible. Hence, it is common to report top-1 and top-k accuracy. For the former, the class to which the classifier assigns the highest probability is compared with the true label; the latter reports whether the true label appears among the k classes with the highest probability (as assigned by the classifier).
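For reference, top-k accuracy can be computed directly from the matrix of predicted class probabilities, as in this short sketch:

```python
import numpy as np

def top_k_accuracy(probs, y_true, k=3):
    """Fraction of samples whose true label is among the k classes with
    the highest predicted probability. probs: (n_samples, n_classes)."""
    top_k = np.argsort(probs, axis=1)[:, -k:]   # indices of k most probable classes
    return np.mean([y in row for y, row in zip(y_true, top_k)])
```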
We report in Table 2 the top-1, top-3 and top-5 accuracies of the compared methods, averaged over 50 independent runs. One can see that the CNN outperformed all other methods, achieving a top-1 accuracy of 88.4%, a top-3 accuracy of 95.3% and a top-5 accuracy of 96.3%. The difference in classification accuracy between the CNN and the second best method is statistically significant, t(50) = 71.78, p < 0.001.
| Methods | KNN(k = 1) | Gradient boosting | Random forest§ | SVM(linear) | SVM(rbf) | Correlation | CNN§ |
|---|---|---|---|---|---|---|---|
| Top-1 accuracy | 0.779 ± 0.011 | 0.617 ± 0.008 | 0.645 ± 0.007 | 0.819 ± 0.004 | 0.746 ± 0.003 | 0.717 ± 0.006 | 0.884 ± 0.005 |
| Top-3 accuracy | 0.780 ± 0.011 | 0.763 ± 0.011 | 0.753 ± 0.010 | 0.903 ± 0.006 | 0.864 ± 0.006 | 0.829 ± 0.005 | 0.953 ± 0.002 |
| Top-5 accuracy | 0.780 ± 0.011 | 0.812 ± 0.010 | 0.789 ± 0.009 | 0.920 ± 0.003 | 0.890 ± 0.007 | 0.857 ± 0.005 | 0.963 ± 0.002 |
To better understand the trained CNN model, we also closely examined typical predictions, especially where these did not agree with the correct labelling. In Fig. 3 the top spectrum in each set is the test sample (shown in red), followed by the top-3 predictions given by the CNN. The correct prediction is highlighted in green, and the scores shown in each plot reflect the confidence of the predictions. Fig. 3(a) shows examples where the CNN made the correct prediction; Fig. 3(b) shows examples in which the correct prediction was ranked second; and in Fig. 3(c) the top-3 predictions do not include the correct label.
As shown in Fig. 3(a), the CNN successfully predicted the correct mineral, actinolite, and also ranked ferroactinolite and tremolite as the second and third most probable candidates. In fact, these three minerals are all members of the same mineral group. This is not uncommon: in Fig. 3(b), for instance, the most probable mineral predicted by the CNN, montebrasite, belongs to the same group as the correct one, amblygonite, and the two share similar spectral structure.
If we examine the peak similarity, for instance in Fig. 3(c), the peak locations of the top-1 prediction, Hydrokenoelsmoreite, are almost identical to those of the test sample Russellite. In Fig. 3(d), only the main peaks were matched correctly. These plots demonstrate that the CNN was capable of matching the peaks characteristic of a particular species even when the prediction did not agree with the correct label.
For this set of experiments, we selected another dataset from the RRUFF database, containing raw (uncorrected) spectra for 512 minerals, and six widely used baseline correction methods: modified polynomial fitting,1 rubber band,3 robust local regression estimation,33 iterative restricted least squares, asymmetric least squares smoothing,6 and rolling ball.2 We used the implementations of these methods in the R packages baseline34 and hyperSpec.35 An example of raw spectra and the corresponding spectra after baseline correction by asymmetric least squares is shown in Fig. 4. We followed the training and evaluation protocol described in section 2.3. The results are reported in Table 3.
Fig. 4 Spectra of a mineral, hydroxylherderite, from the RRUFF raw database and the corresponding spectra after baseline correction by asymmetric least squares.
| Preprocessing | KNN(k = 1) | Gradient boosting | Random forest§ | SVM(linear) | SVM(rbf) | Correlation | CNN§ |
|---|---|---|---|---|---|---|---|
| Raw | 0.429 ± 0.011 | 0.373 ± 0.019 | 0.394 ± 0.016 | 0.522 ± 0.011 | 0.434 ± 0.012 | 0.310 ± 0.007 | 0.933 ± 0.007 |
| Asymmetric least squares | 0.817 ± 0.010 | 0.773 ± 0.009 | 0.731 ± 0.019 | 0.821 ± 0.012 | 0.629 ± 0.016 | 0.777 ± 0.013 | 0.927 ± 0.008 |
| Modified polynomial | 0.778 ± 0.007 | 0.740 ± 0.016 | 0.650 ± 0.016 | 0.785 ± 0.014 | 0.629 ± 0.016 | 0.734 ± 0.013 | 0.920 ± 0.008 |
| Rolling ball | 0.775 ± 0.009 | 0.737 ± 0.008 | 0.689 ± 0.018 | 0.795 ± 0.011 | 0.624 ± 0.013 | 0.730 ± 0.010 | 0.918 ± 0.008 |
| Rubber band | 0.825 ± 0.007 | 0.792 ± 0.015 | 0.741 ± 0.009 | 0.806 ± 0.015 | 0.620 ± 0.010 | 0.789 ± 0.010 | 0.911 ± 0.008 |
| IRLS | 0.772 ± 0.010 | 0.710 ± 0.008 | 0.675 ± 0.007 | 0.781 ± 0.011 | 0.614 ± 0.010 | 0.711 ± 0.011 | 0.911 ± 0.008 |
| Robust local regression | 0.741 ± 0.009 | 0.694 ± 0.008 | 0.667 ± 0.012 | 0.759 ± 0.013 | 0.600 ± 0.013 | 0.696 ± 0.011 | 0.909 ± 0.007 |
For the conventional classification methods used for comparison in our work, PCA was adopted to reduce dimensionality and extract features, except for random forest, where we found that PCA decreased performance; methods trained without PCA are indicated in the tables by §. The number of principal components was determined such that 99.9% of the total variance was retained. One can see that the CNN on the raw spectra achieved an accuracy of 93.3%, which is significantly better, t(50) = 77.14, p < 0.001, than the second best method, KNN with rubber band baseline correction, which achieved an accuracy of 82.5%.
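A scikit-learn sketch of this preprocessing, retaining 99.9% of the variance in front of a k = 1 nearest neighbour classifier (x_train, y_train, x_test and y_test are placeholders), might look like:

```python
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

# Keep as many principal components as needed to retain 99.9% of the
# total variance, then classify in the reduced space.
pipe = make_pipeline(PCA(n_components=0.999, svd_solver='full'),
                     KNeighborsClassifier(n_neighbors=1))
pipe.fit(x_train, y_train)
print(pipe.score(x_test, y_test))   # top-1 accuracy
```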
There are a few remarks worth highlighting. Firstly, it is no surprise that baseline correction greatly improved the performance of all the conventional methods, by 20%–40%. On the other hand, the CNN's performance dropped by about 0.5%–2.5% when combined with baseline correction methods. This may indicate that the CNN learned a more efficient way of handling baseline interference, retaining more discriminant information than an explicit baseline correction method would allow. The advantage of CNNs in achieving highly accurate classification while requiring minimal preprocessing of spectra opens new possibilities for developing fully automatic spectrum recognition systems.
Besides the two datasets from the RRUFF database, we also validated the proposed method on another, independent dataset which includes both baseline corrected and uncorrected spectra. This additional dataset, which we refer to as Unipr-mineral,36 comprises 163 spectra covering 107 minerals. The same training and evaluation protocol and network architecture were used. Results for this dataset are presented in Table 4 and show that the CNN is again significantly more accurate than the second best method (t(50) = 4.20, p < 0.001).
| Preprocessing | KNN(k = 1) | Gradient boosting | Random forest§ | SVM(linear) | SVM(rbf) | Correlation | CNN§ |
|---|---|---|---|---|---|---|---|
| Raw | 0.893 ± 0.023 | 0.723 ± 0.069 | 0.695 ± 0.044 | 0.880 ± 0.037 | 0.874 ± 0.031 | 0.823 ± 0.022 | 0.947 ± 0.020 |
| Asymmetric least squares | 0.905 ± 0.020 | 0.787 ± 0.030 | 0.737 ± 0.051 | 0.913 ± 0.030 | 0.926 ± 0.018 | 0.903 ± 0.014 | 0.952 ± 0.018 |
| Modified polynomial | 0.882 ± 0.024 | 0.768 ± 0.052 | 0.722 ± 0.061 | 0.904 ± 0.026 | 0.893 ± 0.020 | 0.855 ± 0.023 | 0.952 ± 0.016 |
| Rolling ball | 0.912 ± 0.015 | 0.774 ± 0.047 | 0.751 ± 0.044 | 0.918 ± 0.019 | 0.924 ± 0.024 | 0.885 ± 0.021 | 0.949 ± 0.018 |
| Rubber band | 0.864 ± 0.030 | 0.723 ± 0.085 | 0.701 ± 0.049 | 0.873 ± 0.030 | 0.894 ± 0.032 | 0.912 ± 0.014 | 0.942 ± 0.022 |
| IRLS | 0.876 ± 0.014 | 0.794 ± 0.046 | 0.731 ± 0.046 | 0.912 ± 0.026 | 0.928 ± 0.025 | 0.873 ± 0.016 | 0.930 ± 0.019 |
| Robust local regression | 0.866 ± 0.017 | 0.770 ± 0.046 | 0.719 ± 0.045 | 0.878 ± 0.026 | 0.909 ± 0.024 | 0.850 ± 0.022 | 0.923 ± 0.017 |
Footnotes

† Funded by Innovate UK, project number 132200.

‡ Software processing time only, not including acquisition of the Raman signal from the spectrometer.