Christian Ieritano (a,b), J. Larry Campbell (a,c,d) and W. Scott Hopkins* (a,b,c,e)
a Department of Chemistry, University of Waterloo, 200 University Avenue West, Waterloo, Ontario N2L 3G1, Canada. E-mail: shopkins@uwaterloo.ca
b Waterloo Institute for Nanotechnology, University of Waterloo, 200 University Avenue West, Waterloo, Ontario N2L 3G1, Canada
c WaterMine Innovation, Inc., Waterloo, Ontario N0B 2T0, Canada
d Bedrock Scientific Inc., Milton, Ontario L6T 6J9, Canada
e Centre for Eye and Vision Research, Hong Kong Science Park, New Territories, 999077, Hong Kong
First published on 29th June 2021
Although there has been a surge in popularity of differential mobility spectrometry (DMS) within analytical workflows, determining separation conditions within the DMS parameter space still requires manual optimization. A means of accurately predicting differential ion mobility would benefit practitioners by significantly reducing the time associated with method development. Here, we report a machine learning (ML) approach that predicts dispersion curves in an N2 environment, which are the compensation voltages (CVs) required for optimal ion transmission across a range of separation voltages (SVs) from 1500 to 4000 V. After training a random forest-based model using the DMS information of 409 cationic analytes, dispersion curves were reproduced with a mean absolute error (MAE) of ≤ 2.4 V, approaching typical experimental peak FWHMs of ±1.5 V. The predictive ML model was trained using only m/z and ion-neutral collision cross section (CCS) as inputs, both of which can be obtained from experimental databases before being extensively validated. By updating the model via inclusion of two CV datapoints at lower SVs (1500 V and 2000 V), accuracy was further improved to MAE ≤ 1.2 V. This improvement stems from the ability of the “guided” ML routine to accurately capture Type A and B behaviour, which was exhibited by only 2% and 17% of ions, respectively, within the dataset. Dispersion curve predictions of the database's most common Type C ions (81%) using the unguided and guided approaches exhibited average errors of 0.6 V and 0.1 V, respectively.
The separation of ions within any ion mobility spectrometry (IMS) device depends on the ion's field-dependent mobility [K(E)] through a neutral buffer gas,19,20 which is specific to the identity of the gas as well as the electric field strength (E) as per eqn (1):
$$v = K(E)\cdot E \tag{1}$$
[Eqn (2)–(4): rendered as images in the source; not recoverable from this text.]
Separations in DMS,33,34 a term used synonymously with field asymmetric waveform ion mobility spectrometry (FAIMS),31,35 harness the field-dependence of ion mobility to achieve a spatial separation of ions (Fig. S1†). The DMS waveform, denoted as the separation voltage (SV), consists of an electric field that oscillates between its high- and low-field phases. Due to the non-linear dependence of ion mobility on field strength, the SV causes the ion to adopt trajectories that divert from the transmission axis. The field-dependent mobility of an ion is encoded within the compensation voltage (CV) required for transmission through the DMS cell, as the CV is related to the alpha function,34 and by association, the ion's CCS.
Based on this first-principles consideration, mapping the field-dependent mobility should be feasible using only the intrinsic properties associated with the ion's mobility (i.e., mass and CCS). Haack and coworkers made a first step in this regard by reproducing the DMS behaviour of the tetramethylammonium36 and tricarbastannatrane ([N(CH2CH2CH2)3Sn]+)37 cations using only temperature-dependent CCS calculations in the free molecular regime. Given the reasonable accuracy of this approach, we hypothesized that dispersion plots could be generated in silico using machine learning (ML) models trained only with CCS and m/z as inputs. This hypothesis follows from the absence of a closed-form expression that can relate the ion-neutral interaction potential with the ion's field-dependent mobility. Using ML to complete this connection would enable predictions of dispersion plots using only intrinsic ion properties that are accessible via CCS libraries38–45 or calculation packages.46–48 This would be of tremendous utility for method development within the various ‘omics realms’, where the CV space occupied by the desired analytes could be mapped prior to data acquisition with minimal effort. The methodology simply requires a “reverse-engineering” of the ML model used to obtain CCSs from DMS-MS data.49 However, broadly applicable predictions of an ion's dispersion behaviour necessitate the use of a calibration set spanning several chemical classes, CCSs, and m/z ratios. As a first step in our endeavour to globally map differential ion mobility, we report on the ML-based in silico generation of dispersion plots in an N2 environment for a compendium containing 409 molecular cations. Since the interaction potential between N2 and a protonated analyte differs from cationic adducts (e.g., [M + Na]+), we chose to model protonated species ([M + H]+), which were present in significantly greater quantities.
[…]:50 MeOH:H2O or MeCN:H2O ESI solvent mixture, both of which contained 0.1% formic acid. Analyte mixtures were infused into the ESI source (positive mode) at a flow rate of 10 μL min−1. DMS-MS measurements were conducted using N2 as both the curtain gas (20 psi) and as the collision gas (ca. 7 mTorr) for data acquisition in multiple reaction monitoring (MRM) mode. MRM transitions (available in the ESI†) were monitored as the SV was stepped from 1500 to 4000 V in 500 V increments, with additional data taken at SV = 3250 V and 3750 V to ensure thorough mapping of the dispersion curves at high field strengths. At each SV, the ion current was recorded while ramping the CV from −30 V to 30 V in increments of 0.1 V to produce an ionogram. Each ionogram was fit with a Gaussian distribution, for which the centroid was taken as the CV required for maximum ion transmission. The m/z and CCS of the parent ion, as calculated using MobCal-MPI,48 were used as the inputs for training the ML model to predict SV/CV pairs. Full details of experimental parameters related to data acquisition are provided in Table S1.† Details concerning CCS calculations are available in the ESI in section S1.† The ML source code, which employs the Random Forest Regression model as implemented in the scikit-learn Python package, and the associated benchmarking data are available on the Hopkins Laboratory GitHub repository (https://github.com/HopkinsLaboratory).
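The centroid extraction described above can be sketched in a few lines. The text does not say which fitting routine was used, so this example swaps in a weighted parabola fit to the logarithm of the intensity (Caruana's method) using NumPy only. The function name `gaussian_centroid`, the threshold, and the synthetic ionogram (peak position, width, noise level) are illustrative assumptions, not values from the study.

```python
import numpy as np

def gaussian_centroid(cv, intensity, rel_threshold=0.2):
    """Estimate the centroid and width of a single Gaussian-shaped peak.

    Uses a weighted parabola fit to ln(intensity) (Caruana's method),
    a stand-in for whatever fitting routine the authors actually used.
    """
    cv = np.asarray(cv, dtype=float)
    intensity = np.asarray(intensity, dtype=float)
    # Keep only points well above baseline so the logarithm is defined.
    mask = intensity > rel_threshold * intensity.max()
    # For a Gaussian, ln(I) = c2*CV^2 + c1*CV + c0 with c2 < 0.
    c2, c1, _ = np.polyfit(cv[mask], np.log(intensity[mask]), 2, w=intensity[mask])
    if c2 >= 0:
        raise ValueError("no peak-shaped maximum found")
    centroid = -c1 / (2.0 * c2)
    sigma = np.sqrt(-1.0 / (2.0 * c2))
    return centroid, sigma

# Synthetic ionogram on the CV grid from the text (-30 V to 30 V, 0.1 V steps).
rng = np.random.default_rng(0)
cv_axis = np.linspace(-30.0, 30.0, 601)
true_cv, true_sigma = 7.3, 1.3  # hypothetical peak, not a measured value
signal = 1.0e5 * np.exp(-((cv_axis - true_cv) ** 2) / (2.0 * true_sigma**2))
ionogram = signal + rng.normal(0.0, 500.0, cv_axis.size)

fitted_cv, fitted_sigma = gaussian_centroid(cv_axis, ionogram)
fwhm = 2.0 * np.sqrt(2.0 * np.log(2.0)) * fitted_sigma
print(f"CV at maximum transmission: {fitted_cv:.2f} V (FWHM {fwhm:.2f} V)")
```

Repeating this fit at each SV step would yield the SV/CV pairs that make up one dispersion curve.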
The range of CVs adopted by the 409 cations is shown in Fig. 1B. At low SVs, the CVs of Type A, B, and C ions are similar. However, differential mobilities become more pronounced at higher SVs due to the field-dependence of ion mobility. At SV = 4000 V, the optimum CV for ion elution ranges from −26 V for glycine to +20 V for atenolol. Untargeted analysis would necessitate sampling this entire window to ensure adequate coverage of the chemical space, even though most ions are Type C and elute within the CV = 0–15 V window (Fig. 1C). As it stands, there are no “rules” for predicting an ion's DMS behaviour, which presents a significant challenge for coupling DMS-MS to some front-end interfaces (e.g., LC). Introduction of the desired analytes to the DMS cell within a short time window precludes a full scan of the CV range, necessitating predictive technologies to facilitate method development in tandem separation workflows that incorporate DMS.
Modelling the dispersion curves (i.e., the DMS behaviour) of an ion requires metrics that capture the ion-neutral interaction potential. This is especially important in the case of the dataset used here, where 331 ions exhibit Type C behaviour, but only 72 and 6 ionic species exhibit Type B and A behaviours, respectively. The interaction potential is heavily influenced by the charge density and conformation of the ion, both of which can be reasonably captured through the ion's m/z and CCS.36,37 However, the broad distributions of m/z and CCS within this dataset (Fig. S3†) require an ML framework to incorporate these properties in the prediction of an ion's differential mobility.53 One must also be cognisant of bias, variance, and overfitting in the chosen ML model, all of which contribute to poor predictive capabilities for systems outside of the training set. Random Forest Regression (RFR), an unbiased decision-tree-based model, has demonstrated low variance and low susceptibility to overfitting.54,55 The resistance to overfitting stems from the law of large numbers, which states that the average obtained from many trials will become closer to the expected (real) value as more trials are performed. As such, we employed an RFR algorithm to create a predictive model for DMS dispersion curve data utilizing 200 randomized decision trees as implemented in the scikit-learn Python package. To train the RFR framework, our DMS-MS database was randomly split into a training set and an “out-of-the-bag” external validation set using only analyte m/z and CCSs as inputs.
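As a minimal sketch of the training setup described above, the following fits a 200-tree random forest to m/z and CCS and scores it on a held-out 5% split. The data are synthetic stand-ins generated on the spot (the functional forms are arbitrary assumptions), so the printed MAE says nothing about the published model. It is a sketch, not the authors' released code.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n_ions = 409  # dataset size quoted in the text

# Synthetic stand-in features; NOT the authors' measurements.
mz = rng.uniform(75.0, 900.0, n_ions)
ccs = 60.0 + 0.25 * mz + rng.normal(0.0, 8.0, n_ions)  # loosely tracks m/z
# Synthetic target: CV at one SV, an arbitrary smooth function plus noise.
cv_target = 15.0 - 3000.0 / mz + rng.normal(0.0, 1.0, n_ions)

# Unguided feature matrix: only m/z and CCS, as in the text.
X = np.column_stack([mz, ccs])
X_train, X_val, y_train, y_val = train_test_split(
    X, cv_target, test_size=0.05, random_state=0
)

# 200 randomized decision trees, matching the number stated in the text.
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

mae = mean_absolute_error(y_val, model.predict(X_val))
print(f"validation MAE: {mae:.2f} V on {len(y_val)} held-out ions")
```

Note that the held-out split here is an external validation set. It is distinct from scikit-learn's internal out-of-bag estimate, which is enabled separately through the `oob_score` parameter.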
The mean absolute error (MAE) of the RFR predictions, averaged across 100 randomized training/validation set splits, is plotted as a function of training set size (i.e., a learning curve) for SV = 4000 V in the top panel of Fig. 2. Since the CV window occupied by the analytes is largest at SV = 4000 V, the associated MAE can be thought of as the upper limit of error for the RFR model. An RFR model trained using 95% of the database at SV = 4000 V predicts the corresponding CV with an MAE of 2.4 V. This is an encouraging result considering the relatively small size of the dataset and the limited number of parameters used in the ML framework. The model is especially accurate at lower SVs, for which optimal CVs can be predicted with even lower MAEs (Fig. S4†). Moreover, the MAEs associated with CV predictions typically lie within the full-width half-maximum (FWHM) range of a DMS peak (±1.5 V). It is also worth noting that the unguided learning curve shown in the top panel of Fig. 2 does not plateau at large training set sizes. This implies that more accurate predictions using the unguided approach are to be expected as the DMS-MS dataset expands with the addition of information for more analytes.
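A learning curve of this kind can be sketched by averaging the validation MAE over repeated random splits at several training set sizes. The helper `learning_curve_mae` and the synthetic data are hypothetical, and the number of splits and trees is reduced from the 100 splits and 200 trees quoted in the text so the example runs quickly.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import ShuffleSplit

def learning_curve_mae(X, y, train_fractions, n_splits=5, n_trees=50, seed=0):
    """Return the split-averaged validation MAE for each training fraction."""
    curve = []
    for frac in train_fractions:
        # Each split holds out 5% for validation and trains on `frac` of the data.
        splitter = ShuffleSplit(
            n_splits=n_splits, train_size=frac, test_size=0.05, random_state=seed
        )
        maes = []
        for train_idx, val_idx in splitter.split(X):
            model = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
            model.fit(X[train_idx], y[train_idx])
            maes.append(mean_absolute_error(y[val_idx], model.predict(X[val_idx])))
        curve.append(float(np.mean(maes)))
    return curve

# Synthetic stand-in data; NOT the authors' measurements.
rng = np.random.default_rng(1)
mz = rng.uniform(75.0, 900.0, 409)
ccs = 60.0 + 0.25 * mz + rng.normal(0.0, 8.0, 409)
cv_4000 = 15.0 - 3000.0 / mz + rng.normal(0.0, 1.0, 409)
X = np.column_stack([mz, ccs])

fractions = [0.1, 0.3, 0.5, 0.7, 0.95]
curve = learning_curve_mae(X, cv_4000, fractions)
for frac, mae in zip(fractions, curve):
    print(f"train fraction {frac:.2f}: mean validation MAE {mae:.2f} V")
```

Whether such a curve has flattened by the largest training fraction is the diagnostic the text uses to argue that more data should still help.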
Recalling that the proportions of Type A, B, and C ions within the database are 2%, 17%, and 81%, respectively, it is necessary to investigate the accuracy of model predictions for each different DMS behaviour. If a validation set is disproportionately composed of Type A or B ions, the MAE for the data set can be especially high. Conversely, if the validation set is entirely composed of Type C ions, the associated MAE will be low and not representative of the global accuracy. To ensure adequate validation, we performed an additional 1000 randomized trials using a 95:5 partition of the dataset for training/validation. The deviations of calculated versus experimental CV values at SV = 4000 V are shown as a boxplot in the bottom panel of Fig. 2 according to their classification as Type A, B, or C ions. For the unguided ML model (i.e., just using m/z and CCS as input), dispersion curve predictions for Type A, B, or C ions exhibit average errors of −7.9, −2.3, and 0.6 V, respectively. The low errors for Type C ions from the out-of-the-bag external validation set demonstrate that the ML model is accurate to within the day-to-day variance in SV/CV pairs (typically the peak's FWHM).
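The need for many randomized trials can be made concrete with a short hypergeometric calculation using the class counts quoted in the text (6 Type A, 72 Type B, 331 Type C). The validation set size of 21 ions is an assumption (5% of 409, rounded up), and the helper name is hypothetical.

```python
from math import ceil, comb

# Class counts quoted in the text.
N_TOTAL, N_TYPE_A, N_TYPE_B, N_TYPE_C = 409, 6, 72, 331
# Assumption: the 5% validation set is rounded up to a whole number of ions.
n_val = ceil(0.05 * N_TOTAL)

def prob_none_drawn(n_total, n_special, n_draw):
    """Hypergeometric probability that a random draw contains no 'special' items."""
    return comb(n_total - n_special, n_draw) / comb(n_total, n_draw)

# Chance that a single random validation set contains no Type A ion at all.
p_no_type_a = prob_none_drawn(N_TOTAL, N_TYPE_A, n_val)
# Chance that it contains only Type C ions (no Type A and no Type B).
p_only_type_c = prob_none_drawn(N_TOTAL, N_TYPE_A + N_TYPE_B, n_val)

print(f"validation set size: {n_val}")
print(f"P(no Type A ions)   = {p_no_type_a:.3f}")
print(f"P(only Type C ions) = {p_only_type_c:.3f}")
```

Under these assumptions, about 73% of single random validation sets would contain no Type A ion, while about 1% would consist solely of Type C ions, which is why per-class errors are only meaningful when pooled over many splits.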
While predictions of Type C curves lie within the FWHM of the associated ionogram peak, the predictions for Type A and B ions are consistently at more positive CV values than those observed experimentally. It should be noted that the RFR-predicted Type A and B dispersion curves only deviate appreciably from experiment at SV > 2000 V. Therefore, we hypothesized that a “guided” ML model supplemented with CV values measured at SV = 1500 and 2000 V would provide the curvature required to capture Type A and B behaviour. Indeed, this was the case, as demonstrated by the two-point guided learning curve and the distribution of errors in Fig. 2. Although this procedure only marginally improved Type C curve predictions (average error 0.1 V), the overall predictive capability when all species were considered improved by a factor of two (Fig. 2, top panel; 1.2 V MAE for guided model). This improvement stems from the considerable error reduction in predictions of Type A and B behaviour, which exhibit average errors of −4.4 V and 0.2 V, respectively, for the guided model (see bottom panel of Fig. 2).
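The difference between the unguided and guided feature sets can be sketched as follows. The synthetic data are deliberately constructed so that the low-SV CVs carry information that m/z and CCS do not (the hidden `shape` factor is a made-up device), so the comparison only illustrates the mechanics of adding the two guide points and is not evidence for the published result.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n_ions = 409

# Synthetic stand-in data; NOT the authors' measurements.
mz = rng.uniform(75.0, 900.0, n_ions)
ccs = 60.0 + 0.25 * mz + rng.normal(0.0, 8.0, n_ions)
# Hypothetical hidden "curvature" factor that m/z and CCS do not encode.
shape = rng.uniform(-1.0, 1.0, n_ions)
cv_1500 = 0.5 * shape + rng.normal(0.0, 0.05, n_ions)
cv_2000 = 1.0 * shape + rng.normal(0.0, 0.05, n_ions)
cv_4000 = 8.0 * shape + 0.005 * mz + rng.normal(0.0, 0.3, n_ions)

# Unguided: m/z and CCS only. Guided: add the CVs measured at SV = 1500 and 2000 V.
X_unguided = np.column_stack([mz, ccs])
X_guided = np.column_stack([mz, ccs, cv_1500, cv_2000])

# Use the same 95:5 split for both models so the comparison is like for like.
idx_train, idx_val = train_test_split(np.arange(n_ions), test_size=0.05, random_state=0)

def validation_mae(X):
    """Train a 200-tree forest on the training rows and score the held-out rows."""
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X[idx_train], cv_4000[idx_train])
    return mean_absolute_error(cv_4000[idx_val], model.predict(X[idx_val]))

mae_unguided = validation_mae(X_unguided)
mae_guided = validation_mae(X_guided)
print(f"unguided MAE: {mae_unguided:.2f} V, guided MAE: {mae_guided:.2f} V")
```

In practice the guided approach costs two quick low-SV measurements per analyte, which is the trade-off against the fully in silico unguided model.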
The success of the ML approach in predicting an ion's DMS behaviour is further exemplified by analysis of the experimental and predicted dispersion curves. Fig. 3 shows three representative Type A, B, and C dispersion plots taken from a single validation set. Predicted dispersion plots for the remaining molecules of the validation set are provided in section S2 of the ESI.† The Type C behaviour of flufenoxuron is captured almost exactly by both the guided and unguided RFR approaches, which is true for nearly all Type C ions in this study. Although the unguided ML model captures the shape of the Type A and B dispersion curves, the predicted CV values are ca. 2 V more positive at the high SV region of the curves. This shift to more positive CV values is consistently observed for predictions of the other Type A and B ions, likely arising from their under-representation in the training set (and thus positive skewing due to over-representation of Type C). The 2-point guided approach substantially improves predictions of Type B ions (e.g., niacin) and, in some instances, produces a near exact prediction of Type A dispersion curves (e.g., sarcosine). Overall, the ability of RFR to replicate an ion's DMS behaviour is impressive and is expected to improve further with the addition of more examples to the database.
Accurate prediction of DMS behaviour will streamline method development for practitioners interested in adding an orthogonal separation dimension to their workflows. The unguided approach requires only m/z and CCS as input features, both of which can be found in published repositories38–45 or determined by calculation.46–48 Since the MAE for Type C ions (1.6 V) aligns with the typical FWHM of an ionogram peak (±1.5 V), employing this ML model to inform experiment will generally result in transmission of the desired analyte. Targeted approaches, in which the identity of the analyte is known, will benefit the most from predictions of DMS behaviour since the ability to set a specific SV/CV pair for a desired analyte will cut down on the time required for method development and mitigate redundant data acquisition. Extension of the predictive capabilities towards other common MS adducts (e.g., [M + Na]+, [M + NH4]+) and negative ions [M − H]− will become possible as more data is acquired. For untargeted approaches, it would be fruitful to utilize the dispersion plot as an additional metric for compound identification. Specifically, one could implement a characterization methodology whereby an ion's CCS could be inferred from its dispersion plot to enhance confidence in unknown compound identifications. The work reported here is intended to serve as the framework for these future endeavours, which will be reported on in due course.
Footnote
† Electronic supplementary information (ESI) available: Supplementary Fig. S1–S25, Table S1, and Supplementary sections S1 and S2 (PDF). DMS-MS database used for model training, MRM transitions, and ClassyFire molecular classifications (XLSX). See DOI: 10.1039/d1an00557j
This journal is © The Royal Society of Chemistry 2021