Open Access Article
Daniel J. Bryant *a, Alfred W. Mayhew a, Kelly L. Pereira ‡a, Sri Hapsari Budisulistiorini a, Connor Prior b, William Unsworth b, David O. Topping c, Andrew R. Rickard ad and Jacqueline F. Hamilton a
aWolfson Atmospheric Chemistry Laboratories, Department of Chemistry, University of York, Heslington, York, YO10 5DD, UK. E-mail: daniel.bryant@york.ac.uk
bDepartment of Chemistry, University of York, Heslington, York, YO10 5DD, UK
cDepartment of Earth and Environmental Science, University of Manchester, Brunswick St., Manchester, UK
dNational Centre for Atmospheric Science, University of York, Heslington, York, YO10 5DD, UK
First published on 14th December 2022
Liquid chromatography coupled to electrospray ionisation high resolution mass spectrometry is an extremely powerful technique for both targeted and non-targeted analysis of organic aerosol. However, quantification of biogenic secondary organic aerosol species (BSOA) is hindered by a lack of commercially available authentic standards. To overcome the lack of authentic standards, this study proposes a quantification method based on the prediction of relative ionisation efficiency (RIE) factors to correct concentrations obtained via calibration using a proxy standard. RIE measurements of 89 commercially available standards were made relative to cis-pinonic acid and coupled to structural descriptors. A regularised random forest predictive model was developed using the authentic standards (R2 = 0.66, RMSE = 0.59). The model was then used to predict the RIEs of 87 biogenic organic acid markers from α-pinene, limonene and β-caryophyllene without available authentic standards. The predicted RIEs ranged from 0.27 to 13.5, with a mean ± standard deviation of 4.2 ± 3.9. 25 markers were structurally identified in chamber samples and ambient aerosol filter samples collected in summertime Beijing. The markers were quantified using a cis-pinonic acid calibration and then corrected using the predicted RIE factors. This resulted in the average BSOA concentration decreasing from 146 ng m−3 to 51 ng m−3. This change in concentration is shown to affect the average aerosol metrics commonly used to describe bulk composition. This study is the first of its kind to use predicted ionisation efficiency factors to overcome known differences in BSOA concentrations due to the inherent lack of authentic standards in aerosol chemistry.
Environmental significance
Organic aerosol is a major contributor to particulate matter concentrations and is made up of an extremely complex mixture of thousands of different compounds. Due to this complexity and an inherent lack of authentic standards, understanding aerosol composition and individual compound concentrations is extremely challenging. Significant differences in ionisation efficiencies have been observed between biogenic secondary organic aerosol compounds, when using electrospray ionisation techniques. These differences lead to significant uncertainties in organic aerosol composition and quantification. A predictive relative ionisation efficiency model allows for these differences to be considered, in turn leading to more reliable composition and concentration results.
One of the challenges in analysing OA is the number of chemical degradation pathways available for precursor VOCs and the subsequent range of products that can be formed, with one precursor having the potential to create tens of different compounds.17,18 High resolution mass spectrometry (HRMS) with electrospray ionisation (ESI) sources has become an extremely versatile technique for improving our understanding of complex environmental samples.19 ESI is a soft ionisation technique allowing for the molecular identification of thousands of individual species.20,21 However, a species' ability to be ionised is highly structurally specific,22 meaning the relative contribution of each species to a sample is hard to determine. Many previous studies have used direct injection techniques, without prior separation by liquid chromatography.23–26 Direct injection allows for the identification of thousands of different molecular formulae within one sample. However, due to a lack of isomer identification, quantification of individual compounds is not possible. Most studies use data visualisation techniques such as van Krevelen and Kendrick mass diagrams, and other chemical metrics such as average O:C and H:C ratios to draw conclusions about the aerosol composition, ageing and sources.18,21,26–31
However, these chemical metrics are based on signal response, not quantified concentrations, and as such assume all species ionise with equal efficiency. While semi-quantitative information can be obtained for samples of similar chemical speciation, ionisation efficiencies can vastly differ resulting in data bias and misinterpretation as shown by Pereira et al., 2021 and references therein. Targeted analysis and quantification using authentic standards overcomes these issues. However, due to the sheer number of compounds present in OA and inherent lack of authentic standards, proxy standards are routinely used where equivalent analyte ionisation efficiencies are assumed.32,33,70 However, the use of proxy standards still assumes all species in one functionality group or retention time window have the same ionisation efficiency.
Ideally, all species would have their own authentic standard for quantification, and recently groups have started to synthesise compounds such as organosulfates from isoprene and monoterpenes,34–37 nitrooxy organosulfates from monoterpenes38 and organic acids from a range of monoterpenes.39–42 Kenseth et al., 2020 recently synthesised 6 α-pinene derived carboxylic and dimer ester species and found large differences between their ionisation efficiencies.39 The measured relative response (RF) to cis-pinonic acid ranged from 0.46 to 35.65. The large differences in response factors observed by this study highlight the need to consider these differences in the quantification of species with similar functionalities and retention time windows. However, two issues arise with this approach; firstly, the time and expenses to synthesise different standards limits the work to larger laboratories with synthesis facilities, and secondly the sheer number of standards that would need to be synthesised for the hundreds of identified compounds makes this approach impractical.
A species' ability to ionise in the ESI source, both in the negative and positive modes, is highly dependent on its functionality and structure as well as the ionisation conditions.22,43,44 This has led to the development of models that can predict how well a compound can ionise based on structural descriptors or properties relative to a standard compound.43,45–47 The RF, i.e. how well a compound ionises in comparison to a reference compound, is calculated as shown in eqn (1), where log RF(C1,C2) is the log value of the ratio of the gradients for compounds C1 and C2 across a concentration–response curve.43,46

$$\log \mathrm{RF}(C_1, C_2) = \log \frac{\mathrm{slope}(C_1)}{\mathrm{slope}(C_2)} \qquad (1)$$
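The definition of log RF as the log of the ratio of two calibration gradients can be illustrated with a short sketch. This is not the authors' processing code: the function names and the peak areas are hypothetical, the gradients are taken from an ordinary least-squares fit, and base-10 logarithms are assumed (consistent with the later statement that an RF 10 times higher corresponds to log RF = 1).

```python
import math

def slope(conc, response):
    """Ordinary least-squares gradient of response against concentration."""
    n = len(conc)
    mean_x = sum(conc) / n
    mean_y = sum(response) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(conc, response))
    sxx = sum((x - mean_x) ** 2 for x in conc)
    return sxy / sxx

def log_rf(conc, response_analyte, response_reference):
    """log10 of the ratio of calibration gradients (analyte / reference)."""
    ratio = slope(conc, response_analyte) / slope(conc, response_reference)
    return math.log10(ratio)

# Seven-point concentration series in ppm; the peak areas are invented.
conc = [0.0625, 0.125, 0.25, 0.5, 1.0, 2.5, 5.0]
reference = [2.0e6 * c for c in conc]  # hypothetical reference compound response
analyte = [2.0e7 * c for c in conc]    # hypothetical analyte, 10x steeper gradient

value = log_rf(conc, analyte, reference)
print(round(value, 3))
```

An analyte whose calibration gradient is ten times that of the reference gives a log RF of 1 on this scale.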
This type of RF scale has been used to investigate the structural or chemical features which affect a species' ability to ionise in the ESI. Early studies focussed on measured or calculated physical properties such as log P or pKa.48–52 Recent studies have focussed on using computationally calculated molecular fingerprints or structural descriptors to assess and predict a species' RF.46,47 Mayhew et al., 2020 measured RFs of 51 carboxylic acids which, combined with structural fingerprints, were used to develop a Bayesian ridge regression model.47 The model showed R2 and RMSE values in line with comparable studies, without the need to measure or predict physical properties of compounds. Liigand et al., 2020 recently developed a predictive machine learning model, which can predict the RFs of species relative to benzoic acid based on their structure, both in the positive and negative ionisation modes across a range of solvent compositions.46 Their model used data collected over a decade and contains RF measurements of 3139 and 1286 compound–solvent combinations in the positive and negative modes respectively. Previously, to our knowledge only one study has predicted RF factors for BSOA species.53 Zhang et al., 2015 estimated the RFs of a range of α-pinene derived organic acids54 based on a linear model developed by Kruve et al., 2014.45 The predicted RFs ranged from 0.54 to 51.64, with dimer species such as C16H26O6 having the largest predicted RF values, in line with the observations in Kenseth et al., 2020.
In this study we aim to establish an RF model and apply it to improving our ability to reliably quantify BSOA markers. RF measurements of 89 authentic organic compounds relative to cis-pinonic acid were conducted in the negative ionisation mode. These measurements were then coupled to predicted, easily obtained chemical descriptors of molecular structure from ChemDes55 as well as pKa and log P values to develop a random forest model for the prediction of BSOA RF factors. RF's were then predicted for previously identified BSOA markers which were used to correct concentrations calculated from a proxy cis-pinonic acid calibration of ambient samples collected in summertime Beijing. Overall, this study is the first to apply a method for the prediction of BSOA ESI response factors based on RF measurements, and as such provides a basis for future studies to establish more reliable quantification methods.
:10 H2O:MeOH, then changed linearly to 10:90 over 24 minutes, returning to 90:10 over 2 minutes and then held for 2 minutes, with a flow rate of 300 μL min−1. The MS was operated in negative mode, using full scan data-dependent MS2. The scan range was set between 50 and 750 m/z, with a mass resolution of 120 000. The capillary and auxiliary gas temperatures were 320 °C. The sheath and auxiliary gas flow rates were 45 arb. and 10 arb., respectively. The spray voltage was set to 4 kV. The number of most abundant precursors for MS2 fragmentation was set to 10. Data was analysed using TraceFinder 4.1 General Quan software (Thermo Fisher Scientific) using a targeted compound library of both standards and BSOA species, with a mass accuracy of 3 ppm for marker identification. All isotopic peaks were corrected with the theoretical isotope correction factor within the software.
:50 MeOH:H2O, where no compound had the same retention time (RT) in order to reduce matrix effects which would affect the measured RF. The mixtures were prepared across a 7-point concentration gradient (5, 2.5, 1, 0.5, 0.25, 0.125, 0.0625 ppm, R2 > 0.95), with 3 replicate measurements per concentration. However, some compounds reached limit of detection before the lowest concentration. A 9-point cis-pinonic acid calibration was run alongside the ambient PM2.5 samples which was used for quantification (R2 > 0.99). 35 compounds were common between this study and that conducted in Mayhew et al., 2020, using the same mass spectrometer but via direct infusion. The 35 compounds showed a high correlation (R2 = 0.83) across the two methods, with an average difference in the measured log RF's of 0.24 ± 0.42, highlighting the reliability of these measurements. The errors in the measured log RF values were small, on average 3.6% across the 89 standards based on the standard error of the calibration slopes.
:50 MeOH:H2O (optima, LC-MS grade, Fisher Scientific, UK). Individual α-pinene markers were isolated and collected based on their retention times from generated BSOA mass using a HPLC-ion-trap mass spectrometer coupled to an automated fraction collector, using the method described in Finessi et al., 2014.57
| Tag | Molecular formula | PA concentration (ng m−3) | RF concentration (ng m−3) |
|---|---|---|---|
| Lim_173a | C7H10O5 | 71.9 | 14.4 |
| Lim_187a | C8H12O5 | 39.8 | 7.8 |
| Pinene_185a | C9H14O4 | 10.4 | 1.4 |
| Pinene_183 | C10H16O3 | 8.9 | 8.8 |
| Bcary_253b | C14H22O4 | 5.4 | 0.4 |
| Bcary_197 | C11H18O3 | 3.2 | 5.9 |
| Lim_183 | C10H16O3 | 2.3 | 8.4 |
| Pinene_171a | C8H12O4 | 2.1 | 3.6 |
| Bcary_255a | C13H20O5 | 2.0 | 0.2 |
| Total | | 146.0 | 51.0 |
:50 MeOH:H2O. Triplicate recovery tests showed an almost complete recovery of cis-pinonic acid (99 ± 15.6%, n = 3) from the filter.
RF values of the species, given in Table S1†, ranged from −2.84 to 1.75, covering more than four orders of magnitude. Several basic parameters were tested for correlation with the measured log RF, including mass, RT, number of carbon and oxygen atoms and the O:C and H:C ratios; however, no correlation (R < 0.1) was observed. On average, the lowest RF values were observed for species eluting before 6 minutes and after 15 minutes, with the highest between 9 and 12 minutes. Matrix effects were investigated using the same method as in Bryant et al., 2021, using cis-pinonic acid to determine if signal suppression was occurring due to the highly complex nature of the samples.32 No significant matrix effect was observed for cis-pinonic acid, but further work is required for a range of different acid species. These measured log RF values were then combined with over 3000 chemical structural descriptors predicted using the ChemDes platform for computing molecular descriptors and fingerprints (molecular descriptors were taken from the Chemopy, CDK, RDKit, Pybel and PaDEL open-source packages).55
Several data cleaning steps were undertaken before model development. Firstly, non-numeric descriptors and descriptors containing only one value were removed, resulting in 1766 descriptors. Descriptors with a pairwise correlation greater than R2 = 0.8 were then removed using the “findCorrelation” function from the Caret R package,58 in line with previous studies,46 resulting in 224 descriptors. The remaining descriptors were then correlated to the log RF values of the standards, and those with an R greater than 0.3 were selected (Table S2†). Two descriptors were removed (“fr_nitro_arom_nonortho” and “fr_phenol”), due to their lack of applicability to functionalities of the organic acid markers being studied here. pKa and log P were also predicted using ChemDraw Prime 18.1 software, based on previous studies highlighting their importance to the ionisation efficiencies of compounds.47 The pKa had a correlation of R = 0.32 with log RF, whereas only a limited correlation was observed for log P (R < 0.1); nevertheless, a more accurate model was obtained with the inclusion of log P. The predicted pKa and log P values were combined with the remaining descriptors, giving 18 descriptors for model development. Several predictive models were developed using the Caret R package58 including random forest, Bayesian ridge regression and linear regression, with regularised random forest (RRF) being the best performing based on the lowest RMSE. Regularised random forest models work in the same way as random forest models but reduce model complexity by disregarding features that share information. The number of trees used in the random forest was optimised to 100 trees, and mtry (the number of variables available for splitting at each tree node) was optimised to 10. The RMSE and R2 values were calculated by default by the built-in functionality of the Caret R package. The final model was chosen based on minimising the RMSE.
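The three descriptor-cleaning steps described above (dropping single-valued descriptors, dropping one of each highly inter-correlated pair, then keeping descriptors correlated with log RF) can be sketched in a few lines. This is an illustrative stand-in, not the study's code: the pairwise step here is a simple greedy filter rather than Caret's `findCorrelation` algorithm, the use of the absolute value of R against log RF is an assumption, and the descriptor names and values are invented.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def filter_descriptors(descriptors, target, pair_r2=0.8, target_r=0.3):
    """Return descriptor names surviving the three cleaning steps."""
    # Step 1: drop descriptors that contain only one value.
    varying = {k: v for k, v in descriptors.items() if len(set(v)) > 1}
    # Step 2 (greedy stand-in): keep a descriptor only if its R^2 with every
    # descriptor already kept does not exceed the pairwise threshold.
    kept = []
    for name in varying:
        if all(pearson_r(varying[name], varying[o]) ** 2 <= pair_r2 for o in kept):
            kept.append(name)
    # Step 3: keep descriptors whose |R| with the target exceeds the threshold.
    return [n for n in kept if abs(pearson_r(varying[n], target)) > target_r]

# Invented toy data: six "standards", four hypothetical descriptors.
log_rf = [0.1, 0.5, 0.9, 1.4, 2.0, 2.2]
descriptors = {
    "acidity": [1, 2, 3, 4, 5, 6],
    "acidity_copy": [2, 4, 6, 8, 10, 12.5],  # near-duplicate of "acidity"
    "constant": [1, 1, 1, 1, 1, 1],          # single-valued
    "noise": [3, 1, 2, 2, 1, 3],             # unrelated to log RF
}
selected = filter_descriptors(descriptors, log_rf)
print(selected)
```

In this toy case only the first descriptor survives: the duplicate is removed in the pairwise step, the constant in the first step, and the uncorrelated descriptor in the last.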
Owing to the small dataset size, leave one out cross validation (LOOCV) was used to test the predictive capabilities of the model. LOOCV uses each compound in the data set once as a test set, with the other (n − 1) compounds as the training set. Full details of the model development and the dataset containing the predicted descriptors can be found at https://github.com/djb96/Response_factor_model.
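The leave-one-out procedure can be sketched as follows. To stay self-contained, the sketch swaps the regularised random forest for a one-descriptor least-squares line, so it illustrates only the cross-validation loop and the RMSE calculation, not the model used in this study; all values are invented.

```python
import math

def fit_line(x, y):
    """Least-squares intercept and slope for a single descriptor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - b * mx, b

def loocv_predictions(x, y):
    """Predict each compound from a model trained on the other n - 1."""
    preds = []
    for i in range(len(x)):
        train_x = x[:i] + x[i + 1:]   # leave compound i out
        train_y = y[:i] + y[i + 1:]
        intercept, gradient = fit_line(train_x, train_y)
        preds.append(intercept + gradient * x[i])
    return preds

def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))

# Invented descriptor values and measured log RF for five hypothetical standards.
descriptor = [0.0, 1.0, 2.0, 3.0, 4.0]
log_rf = [-1.0, -0.4, 0.1, 0.4, 1.1]

preds = loocv_predictions(descriptor, log_rf)
print(round(rmse(log_rf, preds), 3))
```

Each compound is predicted exactly once by a model that never saw it, so the resulting RMSE reflects out-of-sample error rather than goodness of fit.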
As shown in Table 2, the 18 descriptors used for model development mainly describe acidity and polarisation. Of these, the most influential were MLFER_A and SpMAD_Dzp. MLFER_A describes the overall solute hydrogen bond acidity and SpMAD_Dzp is a measure of a compound's polarizability. These specific descriptors were not identified as important in Liigand et al., 2020, but other descriptors for acidity/basicity were.
:C – oxygen to carbon ratio, H:C – hydrogen to carbon ratio, DBE – double bond equivalent, C – number of carbons, H – number of hydrogens, O – number of oxygens, MF – average molecular formula
| Metric | Number | cis-Pinonic acid calibration | RF calibration |
|---|---|---|---|
| O:C | 0.43 | 0.61 | 0.48 |
| H:C | 1.55 | 1.48 | 1.53 |
| DBE | 3.22 | 3.05 | 3.01 |
| C | 10 | 8.1 | 8.84 |
| H | 15.6 | 12.1 | 13.65 |
| O | 4.0 | 4.7 | 4.0 |
| MF | C10H15.6O4 | C8.1H12.1O4.7 | C8.8H13.7O4.0 |
RF values for the 89 readily available standards. The optimised model shows similar accuracy and linearity to previous studies43,45–47,59,60 with an R2 of 0.66 and RMSE of 0.59. The RMSE means that if compound A is predicted to have an RF 10 times higher (log RF = 1) than cis-pinonic acid (log RF = 0), the actual RF would be in the range 2.6–38.9 (log RF = 1.0 ± 0.59). Overall, the model performed similarly to previous studies, although it starts to perform poorly for compounds with log RF values less than −1, as seen previously, likely due to the lack of observations.46,47 Further work is needed to increase the RF measurement database for more accurate model development and optimisation.
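The conversion from a predicted log RF and the model RMSE to a range on the linear RF scale is simple arithmetic, shown here as a minimal sketch (the function name is hypothetical; base-10 logarithms are assumed, as implied by the text).

```python
def rf_range(log_rf_pred, rmse):
    """Linear-scale RF bounds for a predicted log10 RF plus/minus the RMSE."""
    return 10 ** (log_rf_pred - rmse), 10 ** (log_rf_pred + rmse)

# Predicted log RF of 1.0 with the model RMSE of 0.59 reported above.
low, high = rf_range(1.0, 0.59)
print(f"{low:.1f} to {high:.1f}")
```

Because the error is symmetric on the log scale, the bounds are asymmetric on the linear scale: the upper bound lies much further from the predicted RF of 10 than the lower bound does.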
Fig. 1 Comparison between measured log RF (log RFM) and predicted log RF (log RFP) produced by a RRF model. All log RFM and log RFP values are given in Table S1† for the standards used in this study. The solid black line is 1:1, i.e. would represent perfect predictions of the measured values. The blue dotted lines represent 2× RMSE from the 1:1 line. The grey vertical lines represent predicted log RF ± RMSE.
Liigand et al., 2020 have previously developed this type of machine learning quantitative ESI-LC-MS approach, using the largest dataset of RF measurements compiled to date. This is a complex dataset, spanning an array of different solvent compositions, ionisation modes and instruments. The model presented here is the first to predict BSOA RF factors based on an experimentally derived predictive model. The model was built for the purpose of quantifying BSOA compounds in a set solvent mixture, and only on one instrument, meaning the dataset could be less complex. This study therefore highlights a method for quantification of BSOA species without authentic standards, without the need for large datasets which take a long time to accumulate, using commercially available, low-cost standards. This method also negates the need to perform numerous standard calibrations for component quantification, leading to faster throughput of samples. However, more authentic BSOA standards are needed to further develop the model and to validate the predicted values. Due to the lack of commercially available authentic BSOA standards, predicted RF values could not be compared to numerous measured RFs of authentic standards. However, pinic acid was synthesised as part of this study, adapted from the procedure developed by Kenseth et al., 2020. The RF of pinic acid was analysed across the same concentration range as part of a mixture with cis-pinonic acid. The measured log RF of pinic acid, after accounting for the purity of the synthesised compound, was 0.46, compared to a predicted value of 0.86 ± 0.59, highlighting the relative accuracy of the model and its ability to predict reliable RF values. Further BSOA authentic standards are needed to fully validate the predictive RF model.
Furthermore, Liigand et al., 2020 showed that these models can be transferred between instruments: while each instrument and method would produce a specific RF value for a compound, specific compounds have been shown to be effective at moving the model across instruments. This suggests an aerosol community model could be developed, but more work is needed. An open-source database has now been developed by the Kruve group, allowing large amounts of RF measurements to be compiled across instruments and laboratories.22 This would allow a generalised RF model to be produced for standardised RF factors of SOA species based on a set of defined authentic standards.
The chemical descriptors for these structures were obtained and, using the optimised model described in Section 3.1, the log RF values were predicted. The predicted RFs ranged from 0.27 to 13.5, with an average of 4.2 ± 3.9 (mean ± SD). These values are of similar magnitude to those measured by Kenseth et al., 2020 and proposed by Zhang et al., 2015, although they are notably smaller.71 Due to the lack of intercomparison studies, the cause of this deviation is unknown, but it is likely due to different instrument set-ups and analysis. The α-pinene and β-caryophyllene markers had similar average RF values of 5.2 ± 4.0 and 5.6 ± 4.5 respectively, while limonene markers had an average of 2.4 ± 2.3. It should be noted that multiple isomers are likely for most of the markers; however, only selected isomers which had previously proposed structures were used in this study. For example, 10 isomeric structures of Lim_199 (C10H16O4) were proposed by Hammes et al., 2019.62 The average RF of these 10 structures was 1.4 ± 1.7, with a range of 0.49–4.80, highlighting the importance of structure confirmation for quantification. The highest predicted RF value for the α-pinene markers was 12.7 for Pinene_353a (C19H30O6) and Pinene_353b (C20H34O5), both of which are dimer species. This is in line with high RF values measured via authentic standards of 35.6 and 21.1 for Pinene_353a and Pinene_353b respectively by Kenseth et al., 2020. Measured and predicted RF values were not expected to be the same due to the method-specific nature of the values as discussed earlier, but they are in line with one another.
The species in Table S3† were then targeted in the SOA chamber samples generated from α-pinene, limonene and β-caryophyllene precursors. Most of the α-pinene marker structures were confirmed via comparison to either authentic standards56 or matching product ion mass spectra with previous studies.39,53 Comparatively less MS2 data was available for the limonene markers, with far more isomers identified and structures proposed. Several markers were authenticated via matching MS2 peaks to Witkowski et al., 2017.63 For the β-caryophyllene markers, only one species (β-caryophyllinic acid) was authenticated via MS2. Several of the β-caryophyllene markers identified in the chamber samples only had one previously proposed structure, as such markers with only one isomer were assumed to be the same structure. 25 markers were added into a database containing accurate masses and retention times for targeted identification in the ambient samples.
High resolution MS studies of aerosol composition generally employ mass spectral data evaluation methods such as Van Krevelen diagrams, double bond equivalents (DBE), average oxidation states and average molecular formulae (MF) based on the number of detected molecular formulae.18,20,27,64 For example, Kundu et al., 2012 investigated the relative abundance of compounds with different O:C and H:C ratios and found a high abundance of high molecular weight functionalised aliphatic compounds. These relative abundances, when corrected by RF factors, could be drastically different to those proposed using the raw signal.65
To investigate the effect of RF factors on these evaluation methods, the hydrogen to carbon (H:C) and oxygen to carbon (O:C) ratios, DBE and average MF were standardised by number, proxy concentration (i.e. proportional to peak area) and RF corrected concentrations as summarised in Table 2. First, the average O:C and H:C ratios were calculated for the 9 markers. O:C was calculated to be 0.43 based on the number of markers but increased to 0.61 when the average was weighted by the cis-pinonic acid derived concentrations and 0.48 when weighted by the RF corrected concentrations. This is a significant difference considering relatively small differences in O:C ratios between different grouped MF based on mass ranges65,66 and different sources.23,26,67–69 A significant shift in average MF was seen when using the number of unique formulae identified (C10H15.6O4), weighted by cis-pinonic acid calibration concentrations (C8.1H12.1O4.7) and weighted by the RF corrected concentrations (C8.8H13.7O4.0). Overall, this shows that even with a small number of markers, the average MF can change, moving from C10 species to C8/C9 depending on the weighting of the average. More work is needed to understand the impact of different RFs when many hundreds of compounds are used to calculate these metrics.
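The effect of the weighting choice on the average O:C ratio can be reproduced from the nine marker formulae and the two concentration columns tabulated earlier. The sketch below is illustrative rather than the authors' calculation: it assumes the metric is the weighted mean of per-marker ratios, and the helper names are hypothetical.

```python
import re

def parse_formula(formula):
    """Element counts from a simple molecular formula such as 'C7H10O5'."""
    pairs = re.findall(r"([A-Z][a-z]?)(\d*)", formula)
    return {element: int(count) if count else 1 for element, count in pairs}

def weighted_mean(values, weights):
    """Mean of values weighted by the given weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# (formula, cis-pinonic acid calibrated ng m-3, RF corrected ng m-3) per marker.
markers = [
    ("C7H10O5", 71.9, 14.4), ("C8H12O5", 39.8, 7.8), ("C9H14O4", 10.4, 1.4),
    ("C10H16O3", 8.9, 8.8), ("C14H22O4", 5.4, 0.4), ("C11H18O3", 3.2, 5.9),
    ("C10H16O3", 2.3, 8.4), ("C8H12O4", 2.1, 3.6), ("C13H20O5", 2.0, 0.2),
]

counts = [parse_formula(formula) for formula, _, _ in markers]
o_to_c = [c["O"] / c["C"] for c in counts]

by_number = weighted_mean(o_to_c, [1.0] * len(markers))
by_proxy = weighted_mean(o_to_c, [pa for _, pa, _ in markers])
by_rf = weighted_mean(o_to_c, [rf for _, _, rf in markers])
print(f"{by_number:.2f} {by_proxy:.2f} {by_rf:.2f}")
```

Under that assumption the three averages agree with the tabulated O:C values of 0.43, 0.61 and 0.48 to two decimal places, which shows how strongly the two most abundant limonene markers dominate the proxy-weighted average.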
RF value of 0.62 compared to the predicted value of 0.86. RF values for the BSOA markers ranged from 0.27 to 13.5, meaning that by using cis-pinonic acid as a proxy calibrant, concentrations of individual BSOA components could be underpredicted by a factor of 3.7 or overpredicted by a factor of 13.5. Nine BSOA markers, including cis-pinonic acid, were then quantified in 25 ambient Beijing PM2.5 samples. Time averaged quantified compound concentrations decreased from 146.0 ng m−3 to 51 ng m−3 when calibrating using a standard cis-pinonic acid calibration and then correcting using the model predicted RF factors. The effect of these factors was then investigated on common aerosol evaluation methods, with differences in O:C ratios of 0.61 vs. 0.48 for cis-pinonic acid calibrated and RF corrected weighted average concentrations. A geometric mean average RF value was calculated to be 4.2 ± 3.9, highlighting the large variability in the predicted RFs, and therefore its lack of reliability if used as a generalised RF. We feel it is important to highlight the issues when assuming a single response factor, whether that be from cis-pinonic acid, or a “corrected” RF. Overall, this study highlights a need to account for the differences in ionisation efficiencies when investigating organic aerosol composition, due to the significant differences in calculated aerosol evaluation metrics, which could influence source contributions. Further work is needed to develop this method to predict RFs without the need for structure elucidation and to expand it to include newly synthesised organic compounds and the range of functional groups and gas phase precursors. Previous studies have suggested the applicability of transferring the predictive model between instruments, suggesting an open-source aerosol community model could be developed in the future.
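The under- and over-prediction factors quoted above follow if the correction is a division of the proxy-calibrated concentration by the RF. That form is an assumption inferred from the stated factors of 3.7 and 13.5, not a formula given explicitly in the text, and the function name is hypothetical.

```python
def correct_concentration(proxy_conc, rf):
    """RF corrected concentration, assuming corrected = proxy / RF."""
    if rf <= 0:
        raise ValueError("RF must be positive")
    return proxy_conc / rf

# Extremes of the predicted RF range for the BSOA markers.
low_rf, high_rf = 0.27, 13.5

# A marker with RF < 1 ionises less efficiently than cis-pinonic acid,
# so its proxy-calibrated concentration is too low.
underprediction = correct_concentration(1.0, low_rf)
# A marker with RF > 1 ionises more efficiently, so its proxy-calibrated
# concentration is too high by a factor equal to the RF.
overprediction = 1.0 / correct_concentration(1.0, high_rf)
print(round(underprediction, 1), round(overprediction, 1))
```

A single generalised RF would apply the same divisor to every marker, which is why the wide spread of predicted RFs makes such a shortcut unreliable.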
Footnotes
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d2ea00074a
‡ Present address: Faculty of Science and Technology, Bournemouth University, BH12 5BB, UK.
This journal is © The Royal Society of Chemistry 2023