Automatic materials characterization from infrared spectra using convolutional neural networks

Infrared spectroscopy is a ubiquitous technique used to characterize unknown materials in the form of solids, liquids, or gases by identifying the constituent functional groups of molecules through the analysis of obtained spectra. The conventional method of spectral interpretation demands the expertise of a trained spectroscopist as it is tedious and prone to error, particularly for complex molecules which have poor representation in the literature. Herein, we present a novel method for automatically identifying functional groups in molecules given the corresponding infrared spectra, which requires no recourse to database-searching, rule-based, or peak-matching methods. Our model employs convolutional neural networks that are capable of successfully classifying 37 functional groups which have been trained and tested on 50 936 infrared spectra and 30 611 unique molecules. Our approach demonstrates its practical relevance in the autonomous analytical identification of functional groups in organic molecules from infrared spectra.


Introduction
Molecules can be divided into functional groups, which are specic groups of atoms bonded in certain arrangements that together exhibit their own characteristic properties. The same functional group will have the same or similar physical and chemical properties regardless of the molecule of which it is a part. 1 The presence of certain functional groups can affect the physical properties of a molecule, including its boiling point, 2 solubility, 3 and viscosity. 4 Since the chemical properties of a functional group can modify the reactivity of other functional groups nearby, the technique of retro-synthetic analysis can be implemented to solve problems in the planning of organic syntheses. 5,6 Inversely, structure elucidation of molecules can be achieved by observing the characteristic properties of constituent functional groups in unknown molecules. This applicable use of functional groups exemplies the undeniable importance of identifying them.
Infrared (IR) spectroscopy exploits the relationship between the vibrational frequencies of chemical bonds in functional groups of molecules that absorb IR light and the structure of a molecule in order to determine the chemical identity of a compound. These absorptions occur at resonant frequencies where the vibrational frequency of a specic bond matches the frequency of the absorbed radiation. An IR spectrometer is the instrument used to realize the type of data, producing an IR spectrum by observing the differences in the intensities of incident and transmitted electromagnetic rays across a frequency range of around 400-4000 cm −1 . 7 The measured change in the intensities is illustrated in a graph of absorbance (or transmittance) against frequency (or wavelength) to form an IR spectrum from which the presence of a given functional group in a sample molecule can be deduced by analyzing its characteristic patterns of peak frequencies.
IR spectra can be manually interpreted by a spectroscopist, albeit there are factors that can make this process particularly challenging and inefficient. There may be signicant shis in the characteristic peak frequencies of certain functional groups from their archetypal ranges due to changes in their structural environment as a result of various physical and chemical factors that inuence the components of the sample. Such factors include solvent effects, chemical reactions, and tautomerism. The region between 400-1500 cm −1 , the ngerprint region, typically contains a vast number of peaks, and there is a high degree of overlap between spectral bands of molecular vibrations, making it difficult for a spectroscopist to resolve individual peaks manually. Accordingly, manual interpretation of IR spectra is usually performed only in the range 1500-4000 cm −1 . This means that the unique information held within the ngerprint region of an IR spectrum, which can in principle be used to distinguish between different molecules, will be ignored. Even without the interpretation of the ngerprint region, the manual examination of many spectra is timeconsuming, and it calls for the skills of a trained spectroscopist, who is vulnerable to human bias and error, especially in the case of complex and multifunctional compounds. The clear limitations in manual interpretation of IR spectra to identify functional groups call for other characterization techniques to be employed, such as computer-aided spectral interpretation methods.
During the 1990s, a number of neural-network approaches to spectral interpretation were explored. [8][9][10][11] In comparison to current standards, the systems created were constrained by several factors including the size of the published training data, the complexity of the neural-network architecture explorable owing to low processing power, and a dearth in machinelearning techniques that were available at that time. The effects of these limitations are directly reected in the performance of these systems. Given these historical limitations of neural networks, library-search 12-17 methods and expert [18][19][20][21][22] systems have evolved as the most common spectral interpretation frameworks, and they remain dominant in the modus operando of available IR data-analysis soware to date.
The library-search method compares each IR spectrum from a collection of referenced IR spectra with an IR spectrum of an unknown molecule. Because IR spectra are subject to noise and to experimental conditions, exact matches cannot be expected. Therefore, the algorithm usually computes a degree of similarity between the IR spectra of the unknown and the database of reference IR spectra to construct a list of compounds, ranked by decreasing order of spectral similarity. The similarity metrics used in commercially available library-search algorithms include: the correlation coefficient, which is calculated from the absorbance values or the mean-centered absorbances of the spectra; the Euclidean distance; and the absolute absorbance differences. 23 The library-search method has two major drawbacks. First, the associated soware needs a large database of high-quality reference IR spectra, and it cannot identify the unknown compound if it is not stored in the database. However, it seems reasonable to think that similar spectra retrieved correspond to compounds whose molecular structure or key functional groups of molecular structure are more or less similar to the unknown. Second, a powerful computing system is necessary to be able to quickly search this large database.
Expert systems exploit the dependence of IR-spectral peak frequencies upon functional-group characteristics of a molecule to dene its structural fragments. The rst systems followed closely the classical ways that chemists identify unknown compounds where a table-driven classication procedure was applied for fast interpretation of IR spectra to receive a list of substructures that were probably a part of an analyzed compound; these substructures then need to be pieced together by the data analyst to identify the parent molecule of the sample compound. In most systems, knowledge about spectrumstructure correlation is formulated as interpretation rules that indicate the presence or absence of certain molecular fragments, when applied to a spectrum of an unknown structure. These rules can be based on the empirical examination of many reference IR spectra of known compounds or generated from a library of IR spectra via neural networks or the application of statistical algorithms. The major limitation of such expert systems is the nature of the knowledge concerned, in that the spectrum-structure correlations taken from the literature are directly stated by human experts or are extracted automatically from spectral databases which have been accumulated by the same type of experts. Affolter et al. 24 tested a set of normalfrequency ranges for large fragments of organic molecules that had been tabulated from a manually curated database of IR spectra. They showed that the spectrum-structure correlation tables cannot be reliably used for automatic spectral interpretation, as the direct use of tabulated normal-frequency ranges taken from the literature leads to a high percentage of cases in which at least one of the ranges lacked any peaks. These expert systems also involve acquiring and manipulating large volumes of information or data. Taken together, these limitations call for the development of more sophisticated methods for the interpretation of IR spectra.
There have been recent developments in the application of machine learning to IR spectral interpretation, most of which are focused on a specic group of materials. For example, Michel et al. 25 employed support-vector machines (SVMs) to develop a robust and reproducible method to identify six consumer plastics, in an attempt to understand the environmental fate and transport of macro-and micro-plastic debris. Chen et al. 26 combined machine-learning algorithms and mid-IR spectroscopy to identify patients with glioma, which has a low cure rate and a high mortality rate. They rst used principal-component-analysis algorithms to reduce data dimensionality, aer which an SVM, a backpropagation neural network, and a decision-tree model were established. They found that the SVM model had the highest classication accuracy. Fine et al. 27 successfully applied a multilayer perceptron (MLP) to classify 16 functional groups of a broader set of materials. MLPs have fully connected layers, where each node in one layer is connected to all nodes in the succeeding layer, making them prone to overtting data. The authors made use of IR spectra and the corresponding mass spectra of 7000 compounds. Wang et al. 28 also constructed SVM models to infer the presence or absence of 16 functional groups using a total set of 823 compounds. Their results were evaluated by a variety of indices and compared favorably with those obtained by using articial-neural-networks (ANN) methods.
In this work, we demonstrate that convolutional neural networks (CNNs) can be used to identify 37 functional groups of organic molecules from a dataset of IR spectra that spans 30 000 unique compounds. CNNs expand upon ANNs by sliding convolution kernels (or lters), which have a shared-weight architecture, across input features and yield translational equivariant responses known as feature maps. 29 The initial lters are embossed with small and simple patterns that are used to assemble increasingly complex patterns, taking advantage of the hierarchical patterns within the data. This signicantly increases the number of pixels handled per neuron compared to an MLP, reducing the connectivity and complexity of the network architecture without compromising its performance. Another major advantage of using CNNs is that there is an independence from prior knowledge and human intervention since they require little or no pre-processing, thereby enabling the network to learn to optimize the lters through automated learning for feature extraction. We show that a onedimensional CNN can be implemented to develop an automated IR spectral interpretation system capable of identifying 37 functional groups with a high degree of condence.

CNN architecture
The best-performing hyperparameters were used to build our CNN model, as shown in Fig. 1. It takes as input a spectrum of an unknown material and outputs a classication of functional groups that pertain to the distinctive spectral features. The full details of the methods that were used to produce this overall CNN architecture can be found in the ESI. †

Original versus extended model performance
We initially dened the 22 most common functional groups to train the original model; an extended model was then trained using all 37 functional groups, thereby incorporating less common functional groups into the model. We trained our model to be a multi-label classier in order to predict all functional groups simultaneously.
We found that the inclusion of the additional 15 functional groups did not hinder the performance of our model in predicting the original list of functional groups, as shown in Fig. 2b-d. The largest increase and decrease in the F1 scores, when implementing the extended list instead of the original list, were +0.031 and −0.034 for alkyne and acyl halide groups, respectively. These small changes in the F1 score indicate that the original and extended models are equally t for purpose. They furthermore suggest that even more functional groups  could, in fact, be introduced into the training of such a model, pending the availability of additional IR-spectral data on molecules that contain many instances of other functional groups. Such an expansion would stand to offer a more precise structural characterization of materials while maintaining model performance. For example, these other functional groups could be more complex molecular fragments, such as heterocyclic, aromatic rings, or subdivisions of functional groups. Both approaches would bring our model one step closer to full autonomy via minimal manual intervention. While the lack of suitable datasets renders this potential expansion beyond the scope of this work, the well-performing and open-source nature of our code base offers the tools for readers to create such an expansion. However, this comes with the caveat that the addition of certain molecular fragments, such as described above, can prove to be difficult. Due to vibrational coupling, spectral features can and frequently are very complex, adding to the complexity of the overall spectrum.

Relative abilities of the extended model to identify each functional group
The F1 score of all but a few of the functional groups ranges between 0.57 and 0.97, with values being above 0.73 except for those pertaining to the phosphine, hydrazine, and sulfoxide groups; see Fig. 2a. The precision values are consistently high, ranging from 0.77 to 1.00, and all but one functional group (acid anhydride) are greater than 0.83; see Fig. 2b. However, the range of recall values was relatively broad, extending as low as 0.42 and as high as 0.97; see Fig. 2c. There does not seem to be a clear reason as to why the recall values associated with certain functional groups are considerably lower than others, besides their lack of coverage in the training data cf. the numbers of samples for each of the 37 functional groups is displayed in Fig. 2d. However, our model could identify certain functional groups with low coverage very well. Nonetheless, there are exceptions to consider; for example, isothiocyanate had the highest F1 score of 0.97 with only 216 samples, while arene and alkane had F1 scores of 0.97 and 0.96 with 23 657 and 31 201 samples, respectively. Hence, functional groups with fewer samples should not be le out of consideration simply because of the lack of coverage.
Overall, the original model had an average F1 score of 0.87 and a weighted average F1 score of 0.93. The extended model had an average F1 score of 0.85 and a weighted average F1 score of 0.93. Since the inclusion of more functional groups has a minimal effect on the performance of our model, we will solely use the extended model for further analyses.

Multiple predictions
While it is important for our model to accurately predict the presence of a single functional group, a metric better suited to measure the performance of predicting all functional groups in a given molecule is required when considering the ultimate task of materials characterization. This is because the prediction of multiple functional groups in a single molecule poses a different optimization problem. To this end, we investigated how the EMR was affected for molecules with one through ten functional groups (maximum in our data) using the extended set of functional groups.
The EMR for individual EMR values as well as the accumulated EMR values is shown in Fig. 3. We hypothesized that an increase in the number of functional groups to be predicted would decrease the EMR. Although the individual EMR for higher numbers of functional groups in a single molecule was found to drop drastically aer six functional groups per molecule, the sample frequency of these molecules also dropped. Therefore, it turns out that the effect on the accumulative EMR, which plateaus at around 0.72, is minimal. Since molecules with more than ten functional groups will rarely be characterized via FTIR spectroscopy, and that the sample frequency of the less common functional groups will be relatively low, it is reasonable to say that our model has an overall accumulated EMR of 0.72.
We also explored how our model performed while varying the number of functional groups that were considered. The order of functional groups considered was chosen from the most abundant to the least abundant functional groups; although, this order is essentially arbitrary because considering all 37 functional groups in any order would result in the same nal EMR. As hypothesized, the EMR decreases as the number of classes predicted increases. Since most of the common functional groups in the organic chemistry domain have already been included in our model, the predictive classication of any additional functional groups would have fewer samples and its effect on the EMR would be negligible. The EMR for varying the number of functional groups in the predictive-classication task plateaus at around 0.72.
Given that EMR is a stringent metric, we conclude that the addition of further class labels to our classication model in order to predict more functional groups is possible without compromising its ability to predict other functional groups. In addition, the number of functional groups enclosed in a single molecule will not have a drastic effect on the EMR.

Class imbalance
The class label for each functional group carries a positive or negative sign according to whether the model classies the group as being present in a subject molecule or not. Given that there are typically many more options for functional groups (class labels) than the number of such groups which are present in a given molecule, there is a large imbalance between the positive and negative class labels in which the negatives dominate. Therefore, we also evaluated our model by focusing on the positive class labels.
We calculated the AP to summarize PR curves as the weighted mean of precisions acquired at each threshold, with the increase in recall from the previous threshold used as the weight. Here, threshold refers to the classication threshold used to convert predicted probabilities to class labels in order to indicate whether or not a functional group is present. The AP for the correct classication of each functional group is shown in Fig. 4. It shows if our model can accurately identify all relevant class labels for functional groups as positives without mistakenly labeling too many negatives as positives. Thus, AP is higher when our model can correctly handle positive class-label assignments for each functional group. The AP of a random classier is equal to the true proportion of positive class labels, and the AP of a perfect classier is 1.0. Fig. 4 shows that our model is efficient at identifying positive class labels even when the AP of a random classier is close to zero. The mean average precision (mAP) of our model for the classication of 37 functional groups is 0.88.
We implemented weighted binary cross-entropy (WBCE) as the loss function, as was dened in eqn (S7), † in order to improve the prediction accuracy of the minorities. We calculated these accuracies for the presence and absence of functional groups separately rather than calculate the overall accuracy for each class label in order to see the effects of the WBCE function more clearly. Fig. 5 shows the average accuracy for the presence of functional groups, while implementing binary cross-entropy (BCE) and WBCE, was 0.80 and 0.88, respectively. The average accuracy for the absence of functional groups, while implementing BCE and WBCE, was 0.99 and 0.96, respectively.
We managed to signicantly improve the accuracy for the predictions of the minorities with only a slight decrease in the accuracy for the predictions of the majorities. However, owing to the huge class imbalance, the slight decline in the accuracies Fig. 4 Average precision of our predictive classification model shown against that of a random classifier based on the proportion of the positive signs in each class label that represents a given functional group.
for the classication of minorities notably affected the overall accuracy, which dropped from 0.99 to 0.96. Therefore, the WBCE was not implemented in the nal model, but a separate model was made available; see soware and availability section for details.

Data augmentation and oversampling
As a CNN is a data-hungry model, we examined how oversampling and three different methods of augmentation (horizontal shi, vertical noise, and linear combination) would affect the performance of our CNN model. Oversampling was used as a control method to compare against the augmentation methods, by augmenting the same sampled data. We varied the amount of sampling from 25% to 100% of the training data, in steps of 25%. We also compared the results with the performance of our model trained only on the original data.
We hypothesized that all the augmentation methods and oversampling of the training data would generally improve the performance of the model. While this was true when performing and evaluation with the macro average F1 score, except in two cases (25% sampling of vertical augmentation and 100% sampling of linear combination as shown in Fig. 6a), this is not the case when evaluating the effect of these data treatments with weighted average F1 score, as shown in Fig. 6b. The latter is more meaningful as it represents the number of true instances for each class label by altering 'macro' to account for class label imbalance.
All sampling proportions of the data augmentation by linear combination resulted in F1 scores that were lower than those without this data treatment. A simple explanation could be the highly sensitive nature of IR spectra, where two spectra of the same compounds can show signicant differences in the peak intensities, shape, and frequency position (within a certain range) depending on various factors. As a result, depending on the coefficients selected, erroneous IR-spectral peak shapes might be created when two IR spectra are linearly mixed.
The horizontal shi of the spectrum is another form of augmentation that we investigated. Translational invariance is attained as a result of the max-pooling operation within our CNN model architecture. The max-pooling operation replaces the output of a convolutional layer at a specic position with the maximum of neighboring outputs. As a result, even if we adjust  the inputs to the convolutional layer slightly, the values of most of the pooled outputs should remain unchanged. This indicates that even if the inputs to the convolutional layer are translated, the predictive output of the CNN model will still be able to recognize the class of functional group to which it belongs. As a consequence, we hypothesized that augmenting the data with horizontal translations would improve the performance of our CNN model, as demonstrated in Fig. 6a and b.
When noise was added to a spectrum, the magnitude of the noise at a certain point was determined by the intensity value at that point. This method was employed to introduce noise with a corresponding level of magnitude throughout the spectrum. The addition of noise did not improve the performance of the model. One possible explanation for the lack of effectiveness of augmentation by adding noise is that it may have altered the shape of the peaks in a way that the CNN model would recognize the pattern to be completely different to the original rather than one with slight variation.
With the exception of augmentation by linear combination, the general trend of the augmentation methods may appear to be an uptrend as the sample size increases. When compared to oversampling, however, the augmentation methods did not enhance the performance of the model, considering that the same oversampled data were augmented. The best improvement was achieved by oversampling 100% of the training data where the F1 score improved from 0.926 to 0.932. Overall, the improvements gained in the performance of the model from all augmentation methods and oversampling turned out to be insignicant.

Optimal threshold tuning
One of the ways that we used to try to improve the performance of our model for the classication of functional groups was via optimal-threshold tuning. The optimal thresholds, which were used to convert probability predictions into crisp class labels, were computed based on the training dataset, following the training of the model. To begin, a line plot of recall (x-axis) against precision (y-axis) was produced in ascending order of the thresholds. Finding the threshold that results in the best balance between precision and recall is equivalent to optimizing for the F1 score that summarizes the harmonic mean of both metrics. We took the naive approach of calculating the F1 score for each threshold, then locating the threshold where the corresponding F1 score was the largest. When employing the calculated optimal thresholds (Table 1) for interpreting the predicted probabilities of all functional groups, the average F1 score and the weighted average F1 score were 0.85 and 0.93, respectively. As mentioned in Section 3.1, when 0.5 was used as the threshold, we obtained the same average F1 score and weighted average F1 score. As there were no differences between the sets of F1 scores when applying the optimal thresholds and when using 0.5 for all interpretations, we used the latter for simplicity.

Demonstration of the model
To showcase the ability of our model more clearly, Fig. 7 illustrates several examples of corresponding inputs and outputs to and from our CNN model. The inputs are the IR spectra of the unknown molecules, and the outputs are the functional groups that are predicted to be present in the molecule. We also depict the correct identication of functional groups on the actual molecule to aid in interpretation of the output predictions.

Conclusions
Most functional groups have the same or comparable innate qualities regardless of the molecule of which they are a part. When identifying or predicting the structure and properties of molecules, information regarding the existence of functional groups, which may be acquired by interpreting the corresponding spectrum, is critical. The conventional method of spectral interpretation requires the skills of an expert spectroscopist and can take a long time to interpret. We have presented a novel spectral interpretation approach that uses convolutional neural networks (CNNs) to identify spectral features in a spectrum, which are then analyzed to predict the presence of distinct functional groups. Our model was trained and validated on a huge collection of infrared (IR) spectra which includes a wide range of functional groups that work together to deliver a complete spectral interpretation. Our approach makes it simple to analyze organic compounds because of its autonomous nature. A range of case studies are used to demonstrate the practical application of our predictive CNN model. We conclude that CNNs are effective at identifying IR spectral features and are transferable to a chemical-identication application.

Data availability
We have made the code for scraping the two data sources, preprocessing data, hyperparameter optimization, model training, optimal-threshold tuning, data augmentation, and the trained models used in this study is available at: https://github.com/ gj475/irchracterizationcnn. The data sets used in this study are available from the NIST Chemistry WebBook (https:// webbook.nist.gov) and the National Institute of Advanced Science and Technology, SDBS Web (https://sdbs.db.aist.go.jp).