Application of deep learning to support peak picking during non-target high resolution mass spectrometry workflows in environmental research

With the advent of high-resolution mass spectrometry (HRMS), untargeted analytical approaches have become increasingly important across many different disciplines, including environmental fields. However, analysing mass spectra produced by HRMS can be challenging due to the sensitivity required for low-abundance analytes, the complexity of sample matrices and the volume of data produced. This is further compounded by the challenge of using pre-processing algorithms to reliably extract useful information from the mass spectra whilst removing experimental artefacts and noise. It is essential that we investigate innovative technology to overcome these challenges and improve analysis in this data-rich area. The application of artificial intelligence to support data analysis in HRMS has strong potential to improve current approaches and maximise the value of generated data. In this work, we investigated the application of a deep learning approach to classify MS peaks shortlisted by pre-processing workflows. The objective was to classify extracted regions of interest (ROIs) into one of three classes to sort feature lists for downstream data interpretation. We developed and compared several convolutional neural networks (CNNs) for peak classification using the Python library Keras. The optimised CNN demonstrated an overall accuracy of 85.5%, a sensitivity of 98.8% and a selectivity of 97.8%. The CNN approach rapidly and accurately classified peaks, reducing the time and costs associated with manual curation of shortlisted features after peak picking. This will further support interpretation and understanding in this discovery-driven area of analytical science.


Introduction
Mass spectrometry has long supported understanding in environmental research concerned with characterising chemical mixtures in the environment, identifying transformation and/or degradation products, measuring kinetics of uptake and elimination, and identification of toxins, among many other applications.1-6 The technique typically has three approaches: targeted, untargeted and suspect screening (i.e. semi-targeted). Targeted analysis determines specific compounds, or compound classes, and is often quantitative. However, targeted methods cover a relatively small proportion of compounds that must be known a priori, preventing discovery of unknown compounds, and can further result in analytical bias (i.e. the Matthew effect).7 Alternatively, untargeted mass spectrometry is a comprehensive approach that aims to detect all compounds present within a sample, enabling discovery of novel compounds. HRMS produces signals, or peaks, described by the retention time (tR) and mass-to-charge ratio (m/z) that require annotation to enable interpretation of the data. Peaks can be matched to mass spectral libraries, but annotation is limited by the availability of certified reference standards for confirmatory analysis. Moreover, for applications such as metabolomics, several thousand peaks can be detected in one sample, which requires significant time to process to ensure extracted features are real signals.8 Untargeted analysis generates a significant amount of data requiring extensive processing before data interpretation stages. This data processing, referred to as pre-processing, is used to generate a feature list (i.e. a list of m/z-tR pairs) that have been measured by the mass analyser. There are many approaches and algorithms used to extract this information and perform peak picking, including vendor-specific software and freely available packages such as XCMS9 and MZmine.10
However, these algorithms are not without limitations, and peak picking can often extract noise instead of a true signal (i.e. a peak).11,12 Therefore, feature lists can contain a large number of false positives. This can be reduced somewhat by appropriate optimisation of parameter settings during peak picking but does not completely solve the issue.13-15 To ensure that a signal is a true positive, feature lists must be manually inspected, a laborious process that is prone to human error.16 Different research groups have attempted to improve this situation by developing approaches to reduce false positives and improve quality control.14,16 A potential solution to significantly reduce the need for manual curation is the application of artificial intelligence (AI). AI is the 'cognitive ability' demonstrated by machines, and the field is further separated into machine learning (ML) and deep learning (DL), which enable machines to learn. The advantage of ML and DL is that they can model a process in a rapid, automated and reliable way. Furthermore, DL models, such as convolutional neural networks (CNNs), can be trained on visual data such as images. This approach can be utilised for untargeted analysis, where feature lists require visual inspection and the extracted ion chromatograms can be exported as images. For example, a previous investigation by Melnikov et al. used a CNN to create an algorithm capable of peak classification and integration of raw LC-HRMS data.17 These early studies of DL applications for untargeted metabolomics data analysis show good potential for the use of AI in analytical science. With further advancements in the accuracy and generalisability of models, AI in untargeted analysis has the potential to increase accuracy whilst decreasing overheads, and further support data interpretation.
The aim of this work was to demonstrate the applicability of CNNs for peak classification of regions of interest (ROIs) extracted from an untargeted LC-HRMS analysis. Feature lists (tR-m/z pairs) were extracted using XCMS in R. Several CNN models were applied and compared for their ability to categorise these features (i.e. ROIs) as true positives, false positives or those requiring further investigation. The optimised CNN was then applied to two external datasets to test the generalisability of the model to classify peaks from other HRMS datasets. The application of CNNs to identify peaks during untargeted workflows would provide an automated, data-driven solution to support downstream data interpretation, saving significant time and costs.

Model development and optimisation
Three simple sequential models (Models A-C) were built first using a 2D convolution layer (Conv2D), a 2D max pooling layer (MaxPool2D), another Conv2D layer and MaxPool2D layer, a flatten layer and then a dense output layer. Layers here correspond to the internal architecture of the CNN model. The model was compiled with the Adam optimizer, the learning rate was set to 0.001 and the loss function was set as categorical cross-entropy. Finally, the model was run over 10 epochs (rounds). The process was repeated 5 times for each pre-processing application, with the model with the highest validation accuracy and lowest variance selected. The test set was then run against this model and a confusion matrix was produced to represent the predicted class labels versus the true class labels.
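The simple sequential architecture described above can be sketched in Keras as follows. The filter counts and the 224 × 224 × 3 input shape are illustrative assumptions (they are not reported in the text); the optimizer, learning rate, loss function and epoch count follow the description above.

```python
# Sketch of the simple sequential CNN (Models A-C). Filter counts and the
# input shape are illustrative assumptions; optimizer, learning rate, loss
# and epochs follow the text.
from tensorflow import keras
from tensorflow.keras import layers


def build_simple_model(input_shape=(224, 224, 3), n_classes=3):
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), activation="relu"),   # first convolution
        layers.MaxPool2D((2, 2)),                        # first max pooling
        layers.Conv2D(64, (3, 3), activation="relu"),   # second convolution
        layers.MaxPool2D((2, 2)),                        # second max pooling
        layers.Flatten(),                                # flatten layer
        layers.Dense(n_classes, activation="softmax"),   # dense output layer
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=0.001),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


# Training would then run over 10 epochs, e.g.:
# model = build_simple_model()
# model.fit(x_train, y_train, epochs=10, validation_data=(x_val, y_val))
```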
Three more complex, fine-tuned CNN models were then built using the pre-trained model applications from Keras: VGG16, Xception and MobileNet. In all cases the pre-trained models were downloaded and then modified, and a dense output layer (with 3 outputs) was added. All models had different layer types (see ESI, Table S1†). The VGG16 model had 23 layers containing a mixture of Conv2D and MaxPool2D layers. The Xception model had 126 layers, containing activation, separable convolution, batch normalization, Conv2D and MaxPool2D layers, and used a global average pooling layer as the final layer. The MobileNet model, the least complex of the three (4 253 864 parameters), contained 88 layers including Conv2D, Rectified Linear Unit, depthwise Conv2D, batch normalization, zero padding and global average pooling layers. A variety of batch sizes, numbers of epochs and learning rates were trialled across all models to assess overfitting and underfitting. Overfitting would mean increased sensitivity of the model to details of the training set ROI images, resulting in a negative impact on the performance of the model against other ROI images. This can be observed, whilst increasing parameters such as epochs, at the point when the accuracy of the model against the training set begins to improve at a faster rate than the accuracy on the validation set, and this was used to assess overfitting of the models tested here. The final parameters were set to use the Adam optimizer, with a learning rate of 0.001, a batch size of 50, a loss function of categorical cross-entropy and epochs set to 10, and each model was trained twice. As with the simpler models, the test data were then run against the better of the two models for each optimised pre-trained model and a confusion matrix was produced.
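The fine-tuning approach can be sketched as below, using MobileNet (the basis of Model E) as the example. The input size and the use of average pooling for the head are assumptions; `weights=None` is used here only so the sketch runs without a download, whereas the ImageNet weights (`weights="imagenet"`) would be used in practice, as in the text.

```python
# Sketch of fine-tuning a pre-trained Keras application (here MobileNet).
# Input shape and pooling choice are assumptions; the compile settings
# (Adam, lr=0.001, categorical cross-entropy) follow the text.
from tensorflow import keras
from tensorflow.keras import layers


def build_finetuned_mobilenet(input_shape=(224, 224, 3), n_classes=3):
    # In practice weights="imagenet"; None here avoids downloading weights.
    base = keras.applications.MobileNet(
        include_top=False,
        weights=None,
        input_shape=input_shape,
        pooling="avg",  # global average pooling as the final base layer
    )
    # Dense output layer with 3 outputs, one per ROI class (Types I-III).
    outputs = layers.Dense(n_classes, activation="softmax")(base.output)
    model = keras.Model(inputs=base.input, outputs=outputs)
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=0.001),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


# Training would use a batch size of 50 over 10 epochs, e.g.:
# model = build_finetuned_mobilenet()
# model.fit(train_images, train_labels, batch_size=50, epochs=10)
```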

Model assessment
The best CNN model was selected and externally tested without retraining using two available datasets from the MetaboLights repository (https://www.ebi.ac.uk/metabolights/index). The first study applied a targeted HRMS method to determine 50 lipids from human plasma samples (Koulman et al., 2009, MTBLS4, https://www.ebi.ac.uk/metabolights/MTBLS4/descriptors). We used a targeted method to externally test and validate our model as these ROIs could be confirmed as true positives. The external dataset was downloaded and the same pre-processing workflow (i.e. peak picking) was applied as described above. In a second study, which investigated metabolism during the sexual cycle of a marine diatom (Fiorini et al., 2020, MTBLS1714, https://www.ebi.ac.uk/metabolights/MTBLS1714/descriptors), we applied our pre-processing workflow to extract features from the available files and randomly split a small subset of the final feature list into 50 Type I ROIs and 50 Type II ROIs to further test the model predictions. The peak picking here was not optimised but used values specified by the authors where possible19 (the original study used MZmine with different parameters).
In binary data classification, the reliability of a model is typically assessed using measures of sensitivity and selectivity (eqn (1) and (2)):

SN = TP/(TP + FN) (1)

where SN is the sensitivity, TP is the number of true positives and FN is the number of false negatives.

SL = TN/(TN + FP) (2)

where SL is the selectivity, TN is the number of true negatives and FP is the number of false positives. Type III ROIs were excluded from these calculations as these ROIs' true value could be either Type I or Type II after further investigation.
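Eqn (1) and (2) can be expressed as two small helpers, assuming raw TP/FN/TN/FP counts are available after Type III ROIs have been excluded (the function names are illustrative):

```python
# Sensitivity (eqn (1)) and selectivity (eqn (2)) from binary counts.
# Type III ROIs are excluded before the counts are taken, as in the text.

def sensitivity(tp: int, fn: int) -> float:
    """SN = TP / (TP + FN): fraction of true peaks correctly recovered."""
    return tp / (tp + fn)


def selectivity(tn: int, fp: int) -> float:
    """SL = TN / (TN + FP): fraction of non-peaks correctly rejected."""
    return tn / (tn + fp)
```

For example, a model recovering 45 of 50 true peaks has a sensitivity of `sensitivity(45, 5)` = 0.90.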

Results & discussion
CNNs for peak picking in HRMS

Initially, three simple sequential models were trained using three different image pre-processing applications. Model A had the highest test set accuracy of 76.47% but the lowest selectivity of the three models (0.857); Models B and C had the same test set accuracy of 76.24%, but Model B had the highest selectivity of 0.915 and Model C the highest sensitivity of 0.971 (Fig. 1 and Table 1). Following this, the pre-trained models VGG16 (D), MobileNet (E) and Xception (F) were downloaded in full and fine-tuned. These more complex models took a significantly longer time to train, thus training was repeated only once. The new models were assessed using the same criteria as previously. Models D and E had the same sensitivity of 0.988, compared to Model F which had a better sensitivity of 0.994. However, Model E had the highest test accuracy of 85.52% and the highest selectivity of 0.978 (Table 1).
As mentioned before, the selectivity and sensitivity used to assess the models did not account for the Type III ROIs due to the ambiguity of this category. However, comparisons of model performance for the Type III class can be made from the percentage of ROIs predicted as Type III that were not true Type III (i.e. either Type I or Type II). A higher number of ROIs predicted in the Type III class would mean an increased amount of time for manual investigation. Given this, Models E and F showed better performance, as misclassification of Type I or Type II ROIs into the Type III class was 9.9% and 8.8%, respectively. In comparison, Model D had a misclassification of 14.0% for ROIs that were Type I or Type II (Table 1). Model E was selected as the best of the six developed models, demonstrating the highest test set accuracy, high sensitivity, high selectivity and a low Type III misclassification.
The training of these models made use of available image pre-processing applications and, for Models D-F, pre-trained CNNs were also applied. These models and their respective image pre-processing applications were developed from the ImageNet Large Scale Visual Recognition Challenge, which has significantly advanced image recognition in DL approaches.20 The ImageNet dataset contains >1.3 M images covering 1000 object classes (e.g. cats and dogs, different lizard species and numbers). Although it is possible to create an image pre-processing application or a full model from scratch, these methods are time consuming and typically yield worse accuracies than the use of pre-trained models. A paired two-tailed t-test showed that the test set accuracies of the simple models developed here (A-C) were statistically lower (p < 0.05) than those of the fine-tuned models (D-F) optimised from available pre-trained models. This transfer learning, where a model developed for a specific task is used as a starting point for a second task, is common within CNN approaches.21 By using pre-trained models, effort can instead be spent improving and fine-tuning models to be more accurate for a specific task. This follows one of the main concepts in AI of continuous improvement. However, the pre-trained models used in this study were trained on image data that is considerably different to extracted ion chromatograms, which in comparison are much simpler. Thus, the feature extraction by the CNN models may not be optimal for recognition of image features in the ROI files. If these pre-trained models had been developed using similar objects, then the CNN accuracy would potentially improve for this application.
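The paired two-tailed t-test mentioned above can be run with `scipy.stats.ttest_rel`. In this sketch the accuracies of Models A-C come from the text, but only Model E's accuracy (85.52%) is reported for the fine-tuned models, so the values for Models D and F are placeholders, not the study's measured results.

```python
# Illustrative paired two-tailed t-test comparing test-set accuracies of the
# simple models (A-C) with the fine-tuned models (D-F). Values for Models D
# and F are placeholders; A-C and E are taken from the text.
from scipy import stats

simple_acc = [0.7647, 0.7624, 0.7624]     # Models A, B, C (from the text)
finetuned_acc = [0.8100, 0.8552, 0.8300]  # Models D, E, F (D and F assumed)

t_stat, p_value = stats.ttest_rel(simple_acc, finetuned_acc)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```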

External validation of the optimised CNN
Model E was externally tested using two published studies. The first used LC-HRMS to target 50 lipids in human plasma samples, thus all extracted ion chromatograms were confirmed true positives.18 The second used both a targeted and an untargeted approach to characterise metabolism in a marine diatom during the sexual phase of its life cycle. This also demonstrated that the CNN would generalise well to other HRMS datasets that targeted different metabolites and used different methodologies (i.e. chromatography and mass spectrometry methods). The accuracy of Model E for the Koulman et al., 2009 study was 100%. The CNN did not misclassify any of the ROI files extracted by the peak picking workflow. Of the 50 lipids targeted, only 47 were reported in the original publication and a further four were not detected in the previous study,18 leaving a total of 43 lipid targets. The peak picking workflow applied here using XCMS performed poorly for three of the lipids (Table 2) and extracted noise instead of the analytes. For these three features, Model E correctly classified them as Type II ROIs (i.e. containing no chromatographic peak). The remaining 40 lipid targets were correctly identified as Type I ROIs. In addition to simple predictions into separate class labels, the CNN model can also present predictions as a probability for each class (Fig. S3†). This enables further confidence when applying the model and assessing the peak classification of extracted features from HRMS datasets. The model showed high confidence in the prediction of the peak classes, with the probability of Type I ≥89% in most cases. The probability of a Type I prediction was lower for features 2, 4, 17 and 28 (75-85%) (Table 2).
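Converting the per-class softmax probabilities into a label plus a confidence value, as described above, amounts to taking the arg-max of the probability vector. The class order and probability vector below are illustrative assumptions, not values from the study.

```python
# Turn a softmax probability vector into (class label, probability).
# The class order and example probabilities are illustrative assumptions.
import numpy as np

CLASS_NAMES = ["Type I", "Type II", "Type III"]  # assumed label order


def classify_with_confidence(probs):
    """Return the highest-probability class and its probability."""
    probs = np.asarray(probs, dtype=float)
    idx = int(np.argmax(probs))
    return CLASS_NAMES[idx], float(probs[idx])


label, p = classify_with_confidence([0.89, 0.07, 0.04])
print(label, p)  # -> Type I 0.89
```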
In the second external test, the model was tested on its prediction performance for 50 Type I and 50 Type II ROIs that were manually assigned. This was to test the CNN on other data and methods, and its confidence in classifying true positives versus true negatives. The CNN demonstrated a classification accuracy of 94%, with a sensitivity of 0.900 and a selectivity of 0.980, suggesting that the model had lower predictive performance for determining true positives (i.e. Type I ROIs). The CNN correctly classified 49 of the Type II ROIs and 45 of the Type I ROIs. Misclassifications included one Type II ROI as Type I, three Type I ROIs as Type II and two as Type III (Fig. 2). Two of the Type I ROIs misclassified as Type II contained multiple chromatographic peaks (Table S2 and Fig. S5†). Whilst the CNN could correctly classify ROIs with multiple peaks, it showed lower performance for these types of ROIs. Multiple peaks can appear in a single ROI due to poor chromatographic resolution and potential structural isomers. However, these cases are typically few in number, and with further training on these specific cases the model's image classification of multi-peak ROIs could be improved.
The lower prediction probabilities corresponded to features that had more than one peak present in the image file or a narrow eluting peak (<10 s). The chromatography method used to generate the ROI images on which the CNN was trained in the present study produced wider eluting peaks (typically >20 s) due to the column dimensions and method parameters. Therefore, for these cases it is likely that Model E had lower predictive confidence as it classified peaks outside of the data it was trained on. Nevertheless, the utility of Model E was well demonstrated, as it was tested on data generated from different analytical methodologies analysing different classes of metabolites, with ROIs that varied considerably in terms of peak shape, number of points per peak and peak width. Furthermore, the model classification was not impacted by the retention window used, whether it covered the full chromatographic run, a constant window either side of the signal or only the signal from the extracted ion chromatogram. The number of training cases was relatively low, with <1000 cases for two of the classes (Type II and Type III). As a rule of thumb, training CNNs for image classification needs ∼1000 cases per class to ensure good model performance. Whilst accuracy for the first external test was perfect, it is likely that with a larger dataset containing thousands of features the prediction accuracy would be closer to the internal test set accuracy (85.52%). Further model improvements could be made by using training cases from several analytical methods, a larger number of ROI training cases for each class and further fine-tuning of CNN parameters. Nevertheless, the model presented here has demonstrated high accuracy on three datasets from three different analytical methods.

Comparison to Melnikov et al. CNN model performance
The approach used in this study was similar to the work presented by Melnikov et al., including ROI classification into three classes, ROI images scaled to unity at maximum and an ROI dataset of 4000 images split across three classes.17 However, in our approach ROIs were extracted as images and the model was used for image classification. The benefit of this approach is that our model can be applied to established data processing workflows, irrespective of the tools used, provided the ROIs can be exported as image files. We also used the concept of transfer learning, where pre-trained models can be utilised to reduce the time needed for development, and the model can be applied to any HRMS data provided the feature lists can be exported as images. We compared the performance of the CNNs in this study with the CNN developed by Melnikov et al.17 This area is rapidly developing and will drive important downstream discovery during data interpretation stages.
The ambiguous nature of this data makes assessing model performance difficult, especially when comparing models that have been trained on image data different to MS ROIs. Melnikov et al. noted that ROI classification was particularly difficult for those ROIs that required further investigation and for peaks that were noisy and of low intensity.17 Therefore, depending on the user assigning the data, the process can be prone to subjective classification. By incorporating simple steps in the manual assignment stage, the labelling process becomes more transparent and reproducible. Several criteria (Fig. S1†) were used to assign ROIs into the respective classes to improve model robustness and reproducibility, which can be challenging in the ML and DL fields. For example, signal intensity and baseline noise are important in assessing whether a ROI contains a true positive (i.e. a peak). Similarly, the tR of a peak can also give an indication of whether the ROI is a true positive or not. Early eluting peaks in the void can be attributed to unretained analytes or different types of contamination, such as that arising from carry-over. Furthermore, CNNs can handle multiple input data types including text, numerical and image data. Such numerical data could be appended to a fully connected layer at the end of the model and include multiple parameters relating to tR, intensity, baseline noise, peak asymmetry and peak width to further improve model classification of ROIs.24 Nevertheless, these studies demonstrate the applicability of image classification in DL to support data analysis in untargeted mass spectrometry. With further improvements, the integration of DL approaches into these data-intensive workflows will significantly reduce resource use associated with time and costs.
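The idea of appending numerical peak descriptors to a fully connected layer can be sketched with the Keras functional API. All shapes, layer sizes and the five example descriptors (tR, intensity, baseline noise, asymmetry, width) are illustrative assumptions, not the study's architecture.

```python
# Sketch of a mixed-input model: a small CNN branch for the ROI image is
# concatenated with numerical peak descriptors before the dense head.
# Shapes and layer sizes are illustrative assumptions.
from tensorflow import keras
from tensorflow.keras import layers

# Image branch (illustrative ROI image size).
image_in = keras.Input(shape=(224, 224, 3), name="roi_image")
x = layers.Conv2D(16, (3, 3), activation="relu")(image_in)
x = layers.MaxPool2D((2, 2))(x)
x = layers.GlobalAveragePooling2D()(x)

# Numerical branch: e.g. tR, intensity, baseline noise, asymmetry, width.
numeric_in = keras.Input(shape=(5,), name="peak_descriptors")

# Concatenate both branches into the fully connected head.
combined = layers.concatenate([x, numeric_in])
combined = layers.Dense(32, activation="relu")(combined)
out = layers.Dense(3, activation="softmax")(combined)  # Types I-III

model = keras.Model(inputs=[image_in, numeric_in], outputs=out)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```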

AI to support data-driven science in environmental research
In the present study, image classification was demonstrated to show high accuracy, sensitivity and selectivity, reducing the need for manual investigation of ROIs during pre-processing stages, which would improve data analysis for investigations using non-target HRMS. Indeed, this analytical approach is becoming more widely used in areas such as environmental toxicology, where metabolomics, lipidomics and exposomics are supporting discovery-driven research to understand mechanisms of toxicological responses across different species29 and fully characterise chemical exposure in the environment.28,30 A challenge in these areas when using these techniques is in part the complexity and scale of the data,2,30 thus automated, rapid and accurate approaches will be vital for our understanding to keep pace with the volume of data being generated in environmental research. Furthermore, predictive tools can support toolbox development for specific approaches.31-33 These predictive models would complement the CNN developed here to improve annotation of unknown compounds. Moreover, the advantage of HRMS enables retrospective analysis to improve current knowledge related to exposure and risk.34 The use of DL and ML is not without challenge. There is a critical need for improved standards surrounding reporting, data sharing and model accessibility in DL and ML research.35,36 Additionally, models and tools developed are often not maintained, so their use becomes limited and is often short-lived.37 Principles such as findability, accessibility, interoperability and reuse of digital assets38 and guidance put forward by NeurIPS35 are beginning to address the issue. Improving these standards will be necessary to ensure that the well-known 'reproducibility crisis' does not extend to DL and ML applications.
Overall, the application of image recognition to untargeted mass spectrometry showed good potential to improve the speed and accuracy of peak picking and compound annotation in complex matrices analysed by untargeted HRMS. These types of approaches will complement strategies aimed at improving feature detection and annotation, including increasing the number of replicates, better quality assurance and quality control, and improvements in peak picking algorithms. The model will be most beneficial in situations where there is a low number of biological replicates, peak picking parameters have not been optimised or several thousand features have been extracted. There is no single solution, but by developing these models researchers have the option of sophisticated post-processing tools to support downstream data processing and interpretation. The biggest investment of time in these approaches is the training of the network, which in this application took approximately 4 weeks to train and develop 21 models (five versions of each of the three simple sequential models, and two versions of each of the three complex models). However, once the model has been developed it can be applied to any dataset, as demonstrated here. Furthermore, running the model takes seconds to predict hundreds of ROIs without the need for manual inspection. This would save costs associated with this time-consuming stage and further support downstream data interpretation. Moreover, AI should be further explored and applied to new avenues across all areas of environmental research to fully understand the potential benefits of this technology.

Fig. 1
Fig. 1 Confusion matrices of the CNNs: (i) Model A, (ii) Model B, (iii) Model C, (iv) Model D, (v) Model E, (vi) Model F. Each matrix represents the number of files predictively labelled by the CNN model as Type I, Type II or Type III ROIs from the test data against the true label. Each matrix includes 442 ROIs: 185 Type I, 157 Type II and 100 Type III.

Fig. 2
Fig. 2 Confusion matrix of the CNN tested on ROIs extracted from Fiorini et al., 2020. The matrix represents the number of files predictively labelled by the CNN model as Type I, Type II or Type III ROIs from the test data against the true label.
From the reported data for the CNN developed by Melnikov et al., the accuracy, sensitivity and selectivity were calculated (Table 1). The Melnikov et al. CNN had a test set accuracy of 87.33%, a sensitivity of 0.994 and a selectivity of 0.989.17 Compared with the in-house Model E, the Melnikov et al. model performed marginally better on all three metrics.
However, the Melnikov et al. model assigned a much higher percentage of ROIs (24.4%) to the Type III class. It is important to highlight that multiple approaches and models are being developed to support data-driven tools in non-target mass spectrometry. For example, tools have also been developed that can evaluate the raw signal extracted from feature extraction tools (NeatMS22) and toolboxes that can support feature extraction and data visualization (MStractor23).

Table 1
Comparison of CNN performance. CNN Models A-C were built in-house and used the image pre-processing feature extraction applications of VGG16, MobileNet or Xception. Models D-F were developed by fine-tuning the full pre-trained (ImageNet) models VGG16, MobileNet or Xception. For each model, the test set accuracy (percentage of correctly labelled ROIs out of the total ROIs), sensitivity, selectivity and Type III misclassification are given. In-house Models A-F used 442 test set ROIs, whilst the Melnikov model test set contained 600 ROIs (Melnikov et al., 2020)

Table 2
Model E peak classification performance on an external test set of 47 features.18 For each feature, the CNN predicted the probability of the feature belonging to each of the three class labels (Types I-III). The final classification is based on the highest probability. a Poor performance of the peak picking algorithm (XCMS). b Not detected in the original publication.