Kate Mottershead^a and Thomas H. Miller*^b

^a Department of Analytical, Environmental & Forensic Sciences, School of Population Health & Environmental Sciences, Faculty of Life Sciences and Medicine, King's College London, 150 Stamford Street, London SE1 9NH, UK
^b Centre for Pollution Research & Policy, Department of Life Sciences, Brunel University London, Kingston Lane, Uxbridge, UB8 3PH, UK. E-mail: thomas.miller@brunel.ac.uk
First published on 19th April 2023
With the advent of high-resolution mass spectrometry (HRMS), untargeted analytical approaches have become increasingly important across many different disciplines, including environmental fields. However, analysing mass spectra produced by HRMS can be challenging due to the detection of low-abundance analytes, the complexity of sample matrices and the volume of data produced. This is further compounded by the challenge of using pre-processing algorithms to reliably extract useful information from the mass spectra whilst removing experimental artefacts and noise. It is essential that we investigate innovative technology to overcome these challenges and improve analysis in this data-rich area. The application of artificial intelligence to support data analysis in HRMS has strong potential to improve current approaches and maximise the value of generated data. In this work, we investigated the application of a deep learning approach to classify MS peaks shortlisted by pre-processing workflows. The objective was to classify extracted ROIs into one of three classes to sort feature lists for downstream data interpretation. We developed and compared several convolutional neural networks (CNNs) for peak classification using the Python library Keras. The optimised CNN demonstrated an overall accuracy of 85.5%, a sensitivity of 98.8% and a selectivity of 97.8%. The CNN approach rapidly and accurately classified peaks, reducing the time and costs associated with manual curation of shortlisted features after peak picking. This will further support interpretation and understanding in this discovery-driven area of analytical science.
Environmental significance: Environmental fields are increasingly employing high-resolution mass spectrometry (HRMS) for non-target applications, with methods often developed from areas such as metabolomics. A current challenge with HRMS is that data processing in non-target screening is complex and requires manual curation and checking by the user, adding significant time and resource costs. To support data analysis in these areas, it is important to investigate the use of innovative technologies such as artificial intelligence (AI) to overcome these bottlenecks. A deep learning model was developed using image classification to determine whether features extracted from raw HRMS data could be reliably classified as a peak or not. The work demonstrated that these models could rapidly classify peaks from images, significantly reducing time and cost.
Untargeted analysis generates a significant amount of data requiring extensive processing before the data interpretation stages. This data processing, referred to as pre-processing, is used to generate a feature list (i.e. a list of m/z–tR pairs) measured by the mass analyser. There are many approaches and algorithms used to extract this information and perform peak picking, including vendor-specific software and freely available packages such as XCMS9 and MZmine.10 However, these algorithms are not without limitations, and peak picking can often extract noise instead of a true signal (i.e. a peak).11,12 Therefore, feature lists can contain a large number of false positives. This can be reduced somewhat by appropriate optimisation of parameter settings during peak picking but does not completely solve the issue. Furthermore, implementing more stringent parameter settings can also reduce the sensitivity for a true signal, so that potentially important peaks are not extracted from the raw data.13–15 To ensure that a signal is a true positive, feature lists must be manually inspected, a laborious process that is prone to human error.16
Different research groups have attempted to improve this situation by developing approaches to reduce false positives and improve quality control.14,16 A potential solution to significantly reduce the need for manual curation is the application of artificial intelligence (AI). AI is the 'cognitive ability' demonstrated by machines; the field is further divided into machine learning (ML) and deep learning (DL), both of which enable machines to learn. The advantage of ML and DL is that they can model a process in a rapid, automated and reliable way. Furthermore, DL models, such as convolutional neural networks (CNNs), can be trained on visual data such as images. This approach can be utilised for untargeted analysis, where feature lists require visual inspection and the extracted ion chromatograms can be exported as images. For example, a previous investigation by Melnikov et al. used a CNN to create an algorithm capable of peak classification and integration of raw LC-HRMS data.17 These early studies of DL applications for untargeted metabolomics data analysis show good potential for the use of AI in analytical science. With further advances in the accuracy and generalisability of models, AI has the potential to increase the accuracy of untargeted analysis whilst decreasing the overhead, further supporting data interpretation.
The aim of this work was to demonstrate the applicability of CNNs for peak classification of regions of interest (ROIs) extracted from an untargeted LC-HRMS analysis. Feature lists (tR–m/z pairs) were extracted using XCMS in R. Several CNN models were applied and compared for their ability to categorise these features (i.e., ROIs) as true positives, false positives and those requiring further investigation. The optimised CNN was then applied to two external datasets to test the generalisability of the model to classify peaks from other HRMS datasets. The application of CNNs to identify peaks during untargeted workflows would provide an automated, data-driven solution to support downstream data interpretation, saving significant time and costs.
The ROI files were labelled into three classes: files containing one or more true chromatographic peaks (Type I), files containing no identifiable chromatographic peak (Type II) or files that needed further investigation (Type III). Several criteria were applied during the manual labelling of the ROI files to ensure consistency and reproducibility (see ESI, Fig. S1†). Images were labelled with the class (i.e., Types I–III) against each file name in an Excel spreadsheet, and a script was then written and applied to append the class name to each ROI file name, as sketched below.
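A minimal sketch of this relabelling step, assuming a spreadsheet with "file" and "class" columns and a flat directory of ROI images; the file names and layout are hypothetical, not the authors' actual script.

```python
import os
import pandas as pd

# Hypothetical spreadsheet: one row per ROI image, with columns
# "file" (e.g. roi_0001.png) and "class" (Type I, Type II or Type III).
labels = pd.read_excel("roi_labels.xlsx")

for _, row in labels.iterrows():
    src = os.path.join("rois", row["file"])
    stem, ext = os.path.splitext(row["file"])
    # Append the class name to the file name, e.g. roi_0001_TypeI.png
    dst = os.path.join("rois", f"{stem}_{row['class'].replace(' ', '')}{ext}")
    if os.path.exists(src):
        os.rename(src, dst)
```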
The script first organises the data into train, validation and test subsets. The train and validation sets are used to develop and optimise the model parameters. The test set is then used to evaluate the model's classifications on new data not previously used in model development, giving a more robust estimate of model performance. The train set was given 2759 ROIs (70% of the data), split across the classes as 1290 Type I, 769 Type II and 700 Type III; the validation set contained 799 ROIs (20% of the data), split as 368 Type I, 231 Type II and 200 Type III; and the test set contained 442 ROIs (10% of the data), split as 185 Type I, 157 Type II and 100 Type III. The ROIs assigned to each class in each set were fixed. The ROI image files were pre-processed into an image data format (.png) that the Keras neural network could receive (e.g., png, jpeg, bmp, gif). Different pre-trained CNN models are available as applications online (https://keras.io/api/applications/), which were created externally on different image data types and can be utilised in the pre-processing step. In the CNN models applied in this study, three different apps were used to provide the image pre-processing: VGG16, Xception and MobileNet (see ESI, Fig. S2†).
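The snippet below sketches how this split and the app-specific image pre-processing might look in Keras; the directory layout, image size and use of the tf.data API are assumptions for illustration only.

```python
import tensorflow as tf
from tensorflow.keras.applications.mobilenet import preprocess_input

# Assumed layout: rois/{train,validation,test}/{type_i,type_ii,type_iii}/*.png
def load_split(split):
    return tf.keras.utils.image_dataset_from_directory(
        f"rois/{split}",
        image_size=(224, 224),     # resize ROI images to the backbone's input size
        batch_size=50,             # batch size used in this study
        label_mode="categorical")  # one-hot labels for the three classes

train_ds, val_ds, test_ds = (load_split(s) for s in ("train", "validation", "test"))

# Apply the MobileNet-specific pixel scaling to every batch of ROI images
train_ds = train_ds.map(lambda x, y: (preprocess_input(x), y))
val_ds = val_ds.map(lambda x, y: (preprocess_input(x), y))
test_ds = test_ds.map(lambda x, y: (preprocess_input(x), y))
```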
Three more complex, fine-tuned CNN models were then built using the pre-trained model applications from Keras: VGG16, Xception and MobileNet. In all cases the pre-trained models were downloaded and then modified, and a dense output layer (with 3 outputs) was added. All models had different layer types (see ESI, Table S1†). The VGG16 model had 23 layers containing a mixture of Conv2D and MaxPool2D layers. The Xception model had 126 layers, containing activation, separable convolution, batch normalization, Conv2D and MaxPool2D layers, and used a global average pooling layer as the final layer. The MobileNet model, the least complex of the three (4,253,864 parameters), contained 88 layers, including Conv2D, rectified linear unit, depthwise Conv2D, batch normalization, zero padding and global average pooling layers. A variety of batch sizes, numbers of epochs and learning rates were trialled across all models to assess overfitting and underfitting. Overfitting would mean increased sensitivity of the model to the training set's ROI image details, negatively impacting the model's performance on other ROI images. This can be observed, as parameters such as the number of epochs are increased, at the point when accuracy on the train set begins to improve at a faster rate than accuracy on the validation set; this criterion was used to assess overfitting of the models tested here. The final parameters were set to use the Adam optimizer, with a learning rate of 0.001, a batch size of 50, a categorical cross-entropy loss function and 10 epochs, and each model was trained twice. As with the simpler models, the test data were then run against the best of the two models for each optimised pre-trained model and a confusion matrix was produced.
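A sketch of how one of these fine-tuned models (model E, the MobileNet variant) might be assembled and trained with the stated hyperparameters; the input size and the decision to leave all backbone layers trainable are assumptions, and train_ds/val_ds are the datasets built in the previous sketch.

```python
import tensorflow as tf

# Download the pre-trained MobileNet backbone (ImageNet weights), drop its
# classifier, pool features, then add a dense output layer with 3 outputs.
base = tf.keras.applications.MobileNet(
    weights="imagenet", include_top=False,
    input_shape=(224, 224, 3), pooling="avg")

outputs = tf.keras.layers.Dense(3, activation="softmax")(base.output)
model = tf.keras.Model(inputs=base.input, outputs=outputs)

# Final hyperparameters reported above: Adam optimizer, learning rate 0.001,
# categorical cross-entropy; batch size 50 was set when building the datasets.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="categorical_crossentropy",
    metrics=["accuracy"])

history = model.fit(train_ds, validation_data=val_ds, epochs=10)
```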
In binary data classification, the reliability of a model is typically assessed using measures of sensitivity and selectivity (eqn (1) and (2)).
sensitivity = TP/(TP + FN) (1)

selectivity = TN/(TN + FP) (2)

where TP, FN, TN and FP are the numbers of true positives, false negatives, true negatives and false positives, respectively.
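As a worked illustration of eqn (1) and (2), using invented confusion-matrix counts rather than the study's actual values:

```python
# Illustrative binary confusion-matrix counts (not the paper's values)
tp, fn = 180, 5    # true positives, false negatives
tn, fp = 150, 4    # true negatives, false positives

sensitivity = tp / (tp + fn)   # eqn (1)
selectivity = tn / (tn + fp)   # eqn (2)
print(f"sensitivity = {sensitivity:.3f}, selectivity = {selectivity:.3f}")
```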
Table 1 Performance of the six CNN models on the test set, with the model reported by Melnikov et al.17 shown for comparison

| CNN model | Pre-processing app (PPA)/pre-trained model (PTM) | Test set accuracy (%) | Sensitivity | Selectivity | Type III misclassification (%) |
|---|---|---|---|---|---|
| A | VGG16 (PPA) | 76.47 | 0.965 | 0.857 | 12.6 |
| B | MobileNet (PPA) | 76.24 | 0.964 | 0.915 | 16.4 |
| C | Xception (PPA) | 76.24 | 0.971 | 0.892 | 14.2 |
| D | VGG16 (PTM) | 83.03 | 0.988 | 0.977 | 14.0 |
| E | MobileNet (PTM) | 85.52 | 0.988 | 0.978 | 9.9 |
| F | Xception (PTM) | 84.84 | 0.994 | 0.971 | 8.8 |
| Melnikov et al.17 | — | 87.33 | 0.994 | 0.989 | 24.0 |
Following this, the pre-trained models VGG16 (D), MobileNet (E) and Xception (F) were downloaded in full and fine-tuned. These more complex models took significantly longer to train, thus training was repeated only once. The new models were assessed using the same criteria as previously. Models D and E had the same sensitivity of 0.988, whereas model F had a better sensitivity of 0.994. However, model E had the highest test accuracy of 85.52% and the highest selectivity of 0.978 (Table 1).
As mentioned before, the selectivity and sensitivity used to assess the models did not account for the Type III ROIs due to the ambiguity of this category. However, comparisons of model performance for the Type III class can be made from the percentage of ROIs predicted as Type III that were not true Type III (i.e. either Type I or Type II). A higher number of ROIs predicted in the Type III class would mean an increased amount of time for manual investigation. Given this, models E and F showed better performance, as misclassification of Type I or Type II ROIs into the Type III class was 9.9% and 8.8%, respectively. In comparison, model D had a misclassification of 14.0% for ROIs that were Type I or Type II (Table 1). Model E was selected as the best of the six developed models, demonstrating the highest test set accuracy, high sensitivity, high selectivity and a low Type III misclassification.
The training of these models made use of available image pre-processing apps and, for models D–F, pre-trained CNNs were also applied. These models and their respective image pre-processing apps were developed from the ImageNet Large Scale Visual Recognition Challenge, which has significantly advanced image recognition in DL approaches.20 The ImageNet dataset contains >1.3 M images covering 1000 object classes (e.g. cats and dogs, different lizard species and numbers). Although it is possible to create an image pre-processing app or a full model from scratch, these methods are time consuming and typically yield worse accuracies than the use of pre-trained models. A paired two-tailed t-test showed that the test set accuracy of the simple models developed here (A–C) was statistically lower (p < 0.05) than that of the fine-tuned models (D–F) optimised from available pre-trained models (see the sketch below). This transfer learning, where a model developed for one task is used as the starting point for a second task, is common within CNN approaches.21 By using pre-trained models, effort can instead be spent improving and fine-tuning models to be more accurate for a specific task. This follows one of the main concepts in AI of continuous improvement. However, the pre-trained models used in this study were trained on image data that is considerably different to extracted ion chromatograms, which in comparison are much simpler. Thus, the feature extraction by the CNN models may not be optimal for recognition of image features in the ROI files. If these pre-trained models had been developed using similar objects, then the CNN accuracy would potentially improve for this application.
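This comparison can be reproduced from the Table 1 accuracies with SciPy, pairing models by backbone; this is a sketch, as the authors' exact test implementation is not reported.

```python
from scipy import stats

# Test set accuracies (%) from Table 1, paired by backbone
# (VGG16, MobileNet, Xception).
simple = [76.47, 76.24, 76.24]      # models A-C (pre-processing app only)
fine_tuned = [83.03, 85.52, 84.84]  # models D-F (fine-tuned pre-trained models)

# Paired two-tailed t-test on the accuracy differences
t_stat, p_value = stats.ttest_rel(simple, fine_tuned)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # p < 0.05, as reported
```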
Prediction probabilities are given for each class (Types I–III). ^a Poor performance of the peak picking algorithm (XCMS). ^b Not detected in the original publication.

| Feature # | Lipid | Calc. m/z | Type I | Type II | Type III | Class |
|---|---|---|---|---|---|---|
1 | GPCho(14:0/0:0) | 468.3085 | 0.90 | 0.10 | 0.00 | Type I |
2 | GPEtn(18:1/0:0) | 480.3085 | 0.81 | 0.14 | 0.05 | Type I |
3 | GPCho(O-16:1) | 480.3449 | 0.99 | 0.01 | 0.00 | Type I |
4 | GPEtn(18:0/0:0) | 482.3241 | 0.80 | 0.20 | 0.00 | Type I |
5a | GPCho(O-16:0) | 482.3605 | 0.12 | 0.70 | 0.18 | Type II |
6 | GPCho(16:1/0:0) | 494.3241 | 0.99 | 0.01 | 0.00 | Type I |
7 | GPCho(16:0/0:0) | 496.3398 | 1.00 | 0.00 | 0.00 | Type I |
8 | GPEtn(20:4/0:0) | 502.2928 | 1.00 | 0.00 | 0.00 | Type I |
9a | GPCho(O-18:1) | 508.3762 | 0.10 | 0.90 | 0.00 | Type II |
10 | GPCho(18:3/0:0) | 518.3241 | 0.99 | 0.01 | 0.00 | Type I |
11 | GPCho(18:2/0:0) | 520.3398 | 0.90 | 0.10 | 0.00 | Type I |
12 | GPCho(18:1/0:0) | 522.3554 | 0.99 | 0.01 | 0.00 | Type I |
13 | GPCho(18:0/0:0) | 524.3711 | 0.99 | 0.01 | 0.00 | Type I |
14 | GPEtn(22:6/0:0) | 526.2938 | 0.99 | 0.01 | 0.00 | Type I |
15 | GPCho(20:5/0:0) | 542.3241 | 0.89 | 0.02 | 0.09 | Type I |
16 | GPCho(20:4/0:0) | 544.3398 | 0.92 | 0.08 | 0.00 | Type I |
17 | GPCho(20:3/0:0) | 546.3554 | 0.75 | 0.25 | 0.00 | Type I |
18 | GPCho(22:6/0:0) | 568.3398 | 0.97 | 0.03 | 0.00 | Type I |
19 | SM(d18:1/14:0) | 675.5436 | 0.92 | 0.07 | 0.00 | Type I |
20 | SM(d18:1/15:0) | 689.5592 | 0.94 | 0.03 | 0.03 | Type I |
21 | SM(d18:1/16:1) | 701.5592 | 1.00 | 0.00 | 0.00 | Type I |
22 | GPCho(O-34:3) | 742.5745 | 0.99 | 0.01 | 0.00 | Type I |
23b | GPCho(O-34:2) | 744.5902 | — | — | — | — |
24 | GPCho(34:4) | 754.5381 | 0.97 | 0.03 | 0.00 | Type I |
25 | GPCho(34:3) | 756.5538 | 0.94 | 0.06 | 0.00 | Type I |
26 | GPCho(34:2) | 758.5694 | 1.00 | 0.00 | 0.00 | Type I |
27 | GPCho(34:1) | 760.5851 | 1.00 | 0.00 | 0.00 | Type I |
28 | GPCho(O-36:6) | 764.5589 | 0.85 | 0.15 | 0.00 | Type I |
29 | GPCho(O-36:5) | 766.5745 | 0.93 | 0.07 | 0.00 | Type I |
30 | GPCho(O-36:3) | 770.6058 | 0.97 | 0.03 | 0.00 | Type I |
31 | GPCho(36:5) | 780.5538 | 1.00 | 0.00 | 0.00 | Type I |
32 | GPCho(36:4) | 782.5694 | 1.00 | 0.00 | 0.00 | Type I |
33 | GPCho(36:3) | 784.5851 | 0.98 | 0.00 | 0.00 | Type I |
34a | GPCho(36:2) | 786.6007 | 0.15 | 0.79 | 0.06 | Type II |
35b | GPCho(O-38:7) | 790.5745 | — | — | — | — |
36 | GPCho(O-38:6) | 792.5902 | 0.99 | 0.01 | 0.00 | Type I |
37 | GPCho(O-38:5) | 794.6058 | 1.00 | 0.00 | 0.00 | Type I |
38b | GPCho(O-38:4) | 796.6215 | — | — | — | — |
39 | GPCho(38:7) | 804.5538 | 1.00 | 0.00 | 0.00 | Type I |
40 | GPCho(38:6) | 806.5694 | 1.00 | 0.00 | 0.00 | Type I |
41 | GPCho(38:5) | 808.5851 | 1.00 | 0.00 | 0.00 | Type I |
42 | GPCho(38:4) | 810.6007 | 1.00 | 0.01 | 0.00 | Type I |
43 | GPCho(38:3) | 812.6164 | 1.00 | 0.00 | 0.00 | Type I |
44b | GPCho(O-40:6) | 820.6215 | — | — | — | — |
45 | GPCho(40:7) | 832.5851 | 1.00 | 0.00 | 0.00 | Type I |
46 | GPCho(40:6) | 834.6007 | 0.99 | 0.01 | 0.00 | Type I |
47 | GPCho(40:5) | 836.6164 | 1.00 | 0.00 | 0.00 | Type I |
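The class probabilities in the table above are the softmax outputs of the network. A minimal sketch of scoring a new ROI image with the trained model is shown below; the saved-model and image file names are hypothetical.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.mobilenet import preprocess_input

# Load the trained model and one exported ROI image (hypothetical file names)
model = tf.keras.models.load_model("model_e_mobilenet.keras")
img = tf.keras.utils.load_img("roi_0001.png", target_size=(224, 224))
x = preprocess_input(np.expand_dims(tf.keras.utils.img_to_array(img), axis=0))

probs = model.predict(x)[0]  # [P(Type I), P(Type II), P(Type III)]
print(dict(zip(["Type I", "Type II", "Type III"], probs.round(2))))
```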
In the second external test, the model's prediction performance was assessed on 50 Type I and 50 Type II ROIs that were manually assigned. This was to test the CNN on other data and methods, and its confidence in classifying true positives versus true negatives. The CNN demonstrated a classification accuracy of 94%, with a sensitivity of 0.900 and a selectivity of 0.980, suggesting that the model had lower predictive performance for determining true positives (i.e. Type I ROIs). The CNN correctly classified 49 of the Type II ROIs and 45 of the Type I ROIs. Misclassifications included one Type II ROI classified as Type I, three Type I ROIs as Type II and two as Type III (Fig. 2). Two of the Type I ROIs misclassified as Type II contained multiple chromatographic peaks (Table S2 and Fig. S5†). Whilst the CNN could correctly classify ROIs with multiple peaks, it showed lower performance for these types of ROIs. Multiple peaks can appear in a single ROI due to poor chromatographic resolution and potential structural isomers. However, these cases are typically few in number, and with further training on such cases the model's classification of multi-peak ROIs could be improved.
The lower prediction probabilities corresponded to features that had more than one peak present in the image file or were narrow eluting peaks (<10 s). The CNN model trained on ROI images in the present study used a chromatography method with wider eluting peaks (typically >20 s) owing to the column dimensions and method parameters. Therefore, in these cases it is likely that model E had lower predictive confidence because it was classifying peaks outside of the data it was trained on. Nevertheless, the utility of model E was well demonstrated, as it was tested on data generated from different analytical methodologies analysing different classes of metabolites, with ROIs that varied considerably in terms of peak shape, number of points per peak and peak width. Furthermore, the model classification was not impacted by the retention window used, whether it covered the full chromatographic run, a constant window either side of the signal or only the signal from the extracted ion chromatogram. The number of training cases was relatively low, with <1000 cases for two of the classes (Type II and Type III). As a rule of thumb, training CNNs for image classification requires ∼1000 cases per class to ensure good model performance. Whilst accuracy for the first external test was perfect, it is likely that with a larger dataset containing thousands of features the prediction accuracy would be closer to the internal test set accuracy (85.52%). Further model improvements could be made by using training cases from several analytical methods, a larger number of ROI training cases for each class and further fine-tuning of CNN parameters. Nevertheless, the model presented here has demonstrated high accuracy on three datasets from three different analytical methods.
The ambiguous nature of this data makes assessing model performance difficult, especially when comparing models that have been trained on image data very different from MS ROIs. Melnikov et al. noted that ROI classification was particularly difficult for those ROIs that required further investigation and for peaks that were noisy and of low intensity.17 Therefore, depending on the user assigning the data, the process can be prone to subjective classification. By incorporating simple steps into the manual assignment stage, the labelling process becomes more transparent and reproducible. Several criteria (Fig. S1†) were used to assign ROIs to their respective classes to improve model robustness and reproducibility, which can be challenging in the ML and DL fields. For example, signal intensity and baseline noise are important in assessing whether an ROI contains a true positive (i.e. a peak). Similarly, the tR of a peak can also give an indication of whether the ROI is a true positive or not. Early eluting peaks in the void can be attributed to unretained analytes or different types of contamination, such as that arising from carry-over. Furthermore, CNNs can handle multiple input data types, including text, numerical and image data. Such numerical data could be appended to a fully connected layer at the end of the model and include multiple parameters relating to tR, intensity, baseline noise, peak asymmetry and peak width to further improve model classification of ROIs.24 A sketch of this idea is given below. Nevertheless, these studies demonstrate the applicability of image classification in DL to support data analysis in untargeted mass spectrometry. With further improvements, the integration of DL approaches into these data-intensive workflows will significantly reduce the resource use associated with time and costs.
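A hypothetical sketch of that multi-input idea, concatenating a vector of numerical peak descriptors with the CNN's pooled image features before the final dense layers; all layer sizes and input names are assumptions.

```python
import tensorflow as tf

# Two inputs: the ROI image and a vector of numerical peak descriptors
# (e.g. tR, intensity, baseline noise, peak asymmetry, peak width).
image_in = tf.keras.Input(shape=(224, 224, 3), name="roi_image")
numeric_in = tf.keras.Input(shape=(5,), name="peak_descriptors")

base = tf.keras.applications.MobileNet(
    weights="imagenet", include_top=False, pooling="avg")
image_features = base(image_in)

# Append the numerical descriptors at the fully connected stage
combined = tf.keras.layers.Concatenate()([image_features, numeric_in])
hidden = tf.keras.layers.Dense(64, activation="relu")(combined)
outputs = tf.keras.layers.Dense(3, activation="softmax")(hidden)

model = tf.keras.Model(inputs=[image_in, numeric_in], outputs=outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```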
A challenge in these areas is due in part to the complexity and scale of the data;2,30 thus automated, rapid and accurate approaches will be vital if our understanding is to keep pace with the volume of data being generated in environmental research. Furthermore, predictive tools can support toolbox development for specific approaches. For example, ML has been demonstrated to accurately predict tR across multiple environmental matrices, to predict collision cross sections in ion mobility spectrometry and to support identification and annotation of compounds during non-target HRMS.28,31–33 These predictive models would complement the CNN developed here to improve annotation of unknown compounds. Moreover, a further advantage of HRMS is that it enables retrospective analysis to improve current knowledge related to exposure and risk.34
The use of DL and ML is not without challenges. There is a critical need for improved standards surrounding reporting, data sharing and model accessibility in DL and ML research.35,36 Additionally, the models and tools developed are often not maintained, so their use becomes limited and is often short-lived.37 Principles such as the findability, accessibility, interoperability and reusability of digital assets38 and guidance put forward by NeurIPS35 are beginning to address the issue. Improving these standards will be necessary to ensure that the well-known 'reproducibility crisis' does not extend to DL and ML applications.
Overall, the application of image recognition to untargeted mass spectrometry showed good potential to improve the speed and accuracy of peak picking and compound annotation in complex matrices analysed by untargeted HRMS. These approaches will complement strategies aimed at improving feature detection and annotation, including increasing the number of replicates, better quality assurance and quality control, and improved peak picking algorithms. The model will be most beneficial in situations where there is a low number of biological replicates, where peak picking parameters have not been optimised or where several thousand features have been extracted. There is no single solution, but by developing these models, researchers have sophisticated post-processing tools available to support downstream data processing and interpretation. The biggest investment of time in these approaches is the training of the network, which in this application took approximately 4 weeks to train and develop 21 models (five versions of each of the three simple sequential models, and two versions of each of the three complex models). However, once the model has been developed it can be applied to any dataset, as demonstrated here. Furthermore, running the model takes seconds to predict hundreds of ROIs without the need for manual inspection. This would save the costs associated with this time-consuming stage and further support downstream data interpretation. Moreover, AI should be further explored and applied to new avenues across all areas of environmental research to fully understand the potential benefits of this technology.
Footnote

† Electronic supplementary information (ESI) available: Supplementary information; manual labelling criteria (Fig. S1), example CNN feature extraction of image files (Fig. S2), example predictions of ROI images using the optimised model (Fig. S3 and S4), architecture of CNN models (Table S1) and misclassification cases of the external test from Fiorini et al., 2020 (Table S2 and Fig. S5). See DOI: https://doi.org/10.1039/d3va00005b
This journal is © The Royal Society of Chemistry 2023 |