Xiaohui Qiao a, Xiaoxiao Li a, Chao Yan bcd, Nina Sarnela c, Rujing Yin a, Yishuo Guo d, Lei Yao cd, Wei Nie b, Dandan Huang e, Zhe Wang f, Federico Bianchi cd, Yongchun Liu d, Neil M. Donahue gh, Markku Kulmala cd and Jingkun Jiang *a
a State Key Joint Laboratory of Environment Simulation and Pollution Control, School of Environment, Tsinghua University, 100084, Beijing, China. E-mail: jiangjk@tsinghua.edu.cn
b Joint International Research Laboratory of Atmospheric and Earth System Research, School of Atmospheric Sciences, Nanjing University, Nanjing, China
c Institute for Atmospheric and Earth System Research/Physics, Faculty of Science, University of Helsinki, 00014, Helsinki, Finland
d Aerosol and Haze Laboratory, Beijing Advanced Innovation Center for Soft Matter Science and Engineering, Beijing University of Chemical Technology, 100029, Beijing, China
e State Environmental Protection Key Laboratory of Formation and Prevention of Urban Air Pollution Complex, Shanghai Academy of Environmental Sciences, Shanghai, China
f Division of Environment and Sustainability, The Hong Kong University of Science and Technology, Hong Kong SAR, China
g Center for Atmospheric Particle Studies, Carnegie Mellon University, Pittsburgh, PA, USA
h Department of Chemistry, Carnegie Mellon University, Pittsburgh, PA, USA
First published on 3rd December 2022
Gas-phase oxygenated organic molecules (OOMs) can contribute significantly to both atmospheric new particle growth and secondary organic aerosol formation. Precursor apportionment of atmospheric OOMs connects them with volatile organic compounds (VOCs). Since atmospheric OOMs are often highly functionalized products of multistep reactions, it is challenging to reveal the complete mapping relationships between OOMs and their precursors. In this study, we demonstrate that machine learning is useful for attributing atmospheric OOMs to their precursors using several chemical indicators, such as the O/C ratio and H/C ratio. The model is trained and tested using data acquired in controlled laboratory experiments, covering the oxidation products of four main types of VOCs (isoprene, monoterpenes, aliphatics, and aromatics). The model is then used to analyze atmospheric OOMs measured in both urban Beijing and a boreal forest environment in southern Finland. The results suggest that atmospheric OOMs in these two environments can be reasonably assigned to their precursors. Beijing is an anthropogenic VOC dominated environment with ∼64% aromatic and aliphatic OOMs, whereas the boreal forest environment has ∼76% monoterpene OOMs. This pilot study shows that machine learning can be a promising tool in atmospheric chemistry for connecting the dots.
Environmental significance
The formation of new particles and secondary organic aerosols has potential effects on the Earth's radiation balance and air quality, and atmospheric oxygenated organic molecules (OOMs) are currently acknowledged to contribute significantly to both. Volatile organic compounds (VOCs) are important precursors for OOMs. Since OOMs are often highly functionalized products of multistep reactions in complex atmospheric environments, it is challenging to reveal the complete mapping relationships between OOMs and their precursors. In this study, we demonstrate that machine learning is useful for attributing OOMs to their precursors using several chemical indicators, such as the O/C ratio and H/C ratio. Based on controlled oxidation experiments of four main types of VOCs (isoprene, monoterpenes, aliphatics, and aromatics), we trained the machine learning model to extract the internal mapping relationships between OOMs and their precursors. The model showed unique advantages in precursor apportionment over current methods. We then used it to analyze atmospheric OOMs measured in both urban Beijing and forested Hyytiälä, where it identified the differences in OOMs between anthropogenic and biogenic dominated atmospheric environments well. This pilot study shows that machine learning can be a promising tool in atmospheric chemistry for connecting the dots and is worth exploring further.
Both the complex functionalization processes and the differences between controlled laboratory conditions and real atmospheric environments make precursor apportionment of atmospheric OOMs very challenging. The VOC precursors of OOMs can generally be divided into anthropogenic volatile organic compounds (e.g., aromatics and aliphatics) and biogenic volatile organic compounds (e.g., monoterpenes and isoprene).7 Positive matrix factorization (PMF)8 is a widely used source apportionment analysis method,9,10 which classifies numerous species into several factors based on similarities in their temporal variations. However, PMF cannot sufficiently attribute atmospheric OOMs to their precursors.9,11 This is partly because the concentration of OOMs is affected not only by their precursors but also, strongly, by the oxidation processes under the given atmospheric conditions. Atmospheric OOMs from different precursors may share similar time series if they are formed by the same oxidants. For example, OOMs generated from the photo-oxidation of aromatics and monoterpenes cannot be separated by PMF analysis.3 Recently, Nie et al.3 developed a workflow method, which performs precursor apportionment of atmospheric OOMs based on up-to-date knowledge of the characteristics of the products of VOC oxidation processes. In that method, however, the identification of OOMs from monoterpene oxidation (monoterpene OOMs) still relies on PMF. This is known to underestimate the concentration of monoterpene OOMs, because the products of monoterpene oxidation by the OH radical, which is present under the given conditions, cannot easily be retrieved by PMF.3,12 Controlled laboratory experiments provide important information on OOMs produced from different precursors. However, the chemical settings of controlled laboratory experiments cannot fully reflect the complexity of the real atmosphere, which makes it difficult to interpret atmospheric OOMs solely on the basis of laboratory results.
Machine learning methods possess the advantage of mining the relationships among complex data. For instance, the decision tree model, one of the classical machine learning methods, has been successfully applied in predicting amino acid sequencing and in proteomics research due to its high interpretability and tolerance of data scale.13,14 With the development and application of high-resolution mass spectrometry in the field of atmospheric chemistry, large datasets of atmospheric OOMs at the molecular level are obtained with high time resolution for long-term periods.4,15 Thus, finding precursors for atmospheric OOMs appears to be a natural playground for machine learning methods. They have the potential to mine large datasets from controlled laboratory experiments with no reliance on the variation of OOM concentration and further attribute OOMs measured in ambient atmosphere to their likely precursors.
In this work, we use the decision tree model to test the feasibility of machine learning for precursor apportionment of atmospheric OOMs. This model is trained and tested using the datasets from controlled laboratory experiments using various VOCs as OOM precursors. It helps build a mapping relationship between OOMs and their precursors, including isoprene, monoterpenes, aliphatics, and aromatics. Finally, we apply the model to atmospheric datasets obtained in both urban Beijing and a remote forest environment of Hyytiälä.
To reduce the uncertainty caused by a single trained model, we repeated the above process ten times to obtain ten independent decision trees. The outputs of the ten decision trees are combined by voting to give the final apportionment result. Specifically, the upper limit is defined as the mean of the votes received by all precursors plus their standard deviation, and the lower limit as the mean minus the standard deviation. Precursors with more votes than the upper limit are retained. If no precursor receives more votes than the upper limit, precursors with more votes than the lower limit are retained instead. Examples illustrating this rule are given in the ESI.†
This voting strategy also helps to reduce the uncertainty caused by the overlapping formulae of OOMs generated from different precursors. According to the laboratory experiments, there are overlaps between OOMs oxidized from those precursors (Fig. S1†). Therefore, in the pre-labeled dataset, the same descriptors of an OOM molecule may correspond to more than one label of precursors. The ten decision trees would take into account the overlaps and give a combination of the most likely answers. Table S2† shows two examples of the determination of overlapping molecules. The optimization process for the number of trees is described in the ESI.†
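The voting rule described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `vote`, the class labels, and the use of the population standard deviation (`statistics.pstdev`) are assumptions, since the text does not state which form of the standard deviation is used.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical class labels for the four precursor types named in the text.
PRECURSORS = ("isoprene", "monoterpenes", "aliphatics", "aromatics")


def vote(tree_outputs, classes=PRECURSORS):
    """Combine the labels predicted by the individual trees for one OOM.

    tree_outputs: one predicted precursor label per decision tree.
    Returns the list of retained precursors (possibly more than one).
    """
    counts = Counter(tree_outputs)
    # Votes per precursor, including precursors that received no votes.
    votes = [counts.get(c, 0) for c in classes]

    mu = mean(votes)
    sigma = pstdev(votes)  # assumption: population standard deviation
    upper, lower = mu + sigma, mu - sigma

    # First try the strict criterion: votes above mean + standard deviation.
    kept = [c for c, v in zip(classes, votes) if v > upper]
    if not kept:
        # Fall back to the loose criterion: votes above mean - standard deviation.
        kept = [c for c, v in zip(classes, votes) if v > lower]
    return kept


# A clear majority yields a single precursor.
print(vote(["aromatics"] * 6 + ["monoterpenes"] * 4))
# A split vote yields a combination of likely precursors.
print(vote(["aromatics"] * 4 + ["monoterpenes"] * 4 + ["aliphatics"] * 2))
```

In the second call no precursor exceeds the upper limit, so the fallback keeps every precursor above the lower limit, which is how a molecule with overlapping formulae can end up with more than one label.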
Precursors | Products | nC (a) | nH (a) | nO (a) | nN (a) | DBE (b) | H/C (c) | O/C (c) | OSc (d) |
---|---|---|---|---|---|---|---|---|---|
Monoterpenes: α-pinene, limonene | 872 | 12.7 | 19.3 | 9.3 | 0.4 | 3.9 | 1.5 | 0.9 | 0.2 |
Aliphatics: t-decalin, decalin, cyclohexane, n-decane, generated (e) | 346 | 11.4 | 20.5 | 5.8 | 0.2 | 2.0 | 1.8 | 0.6 | −0.6 |
Aromatics: benzene, toluene, ethylbenzene, xylene, mesitylene | 485 | 9.7 | 11.8 | 8.8 | 0.2 | 4.7 | 1.2 | 1.1 | 0.9 |
Isoprene | 30 | 4.7 | 8.7 | 4.9 | 0.9 | 0.9 | 1.9 | 1.4 | 1.02 |

(a) nC, nH, nO, and nN: the number of carbon, hydrogen, oxygen, and nitrogen atoms.
(b) DBE: double bond equivalent, which is calculated as (2nC + 2 − nH − nN)/2 (ref. 3).
(c) H/C and O/C: the ratio of nH over nC, and nO over nC.
(d) OSc: carbon oxidation state, which is calculated as 2 × O/C − H/C (ref. 32).
(e) 63 reported aliphatic OOMs and 283 generated substances (adding an integer number of CH2 groups to the reported aliphatic OOMs).
(f) Values of nC, nH, nO, nN, DBE, H/C, O/C, and OSc in the table are averaged from each product with equal weight.
Aliphatics, including some endocyclic alkenes and straight-chain model compounds, also give considerable yields of OOMs, which are present at relatively high atmospheric concentrations in urban environments. Because laboratory experiments on aliphatics are limited, only 63 oxidation products of aliphatics are available, and these are mainly C6 and C10 substances.21 To reduce possible biases due to the imbalanced number of input compounds for different precursors, we tested artificially increasing the number of aliphatic compounds by adding unstudied but likely existing ones that are homologous to the studied aliphatic compounds. The number of oxidation products of aliphatics was thereby extended to 346. Details about the expansion of aliphatic OOMs and comparisons with the results obtained without the expansion are provided in the ESI.† It should be noted that this cannot be done for isoprene OOMs, because isoprene itself has no homologous compounds.
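The homolog expansion can be illustrated with a short sketch that adds CH2 groups to a reported formula. It is a minimal illustration under stated assumptions, not the procedure documented in the ESI: the function name `expand_homologs`, the tuple representation `(nC, nH, nO, nN)`, the example formula, and the choice of how many CH2 groups to add (`ks`) are all hypothetical.

```python
def expand_homologs(reported, ks=(1, 2, 3)):
    """Generate homologous formulae by adding k CH2 groups to each reported one.

    reported: iterable of (nC, nH, nO, nN) tuples.
    ks: integer numbers of CH2 groups to add (assumed values, for illustration).
    Returns the set of generated formulae that are not already reported.
    """
    reported = set(reported)
    generated = set()
    for n_c, n_h, n_o, n_n in reported:
        for k in ks:
            # Each CH2 group adds one carbon and two hydrogens.
            candidate = (n_c + k, n_h + 2 * k, n_o, n_n)
            # Skip physically meaningless or already reported formulae.
            if candidate[0] > 0 and candidate[1] >= 0 and candidate not in reported:
                generated.add(candidate)
    return generated


# Hypothetical C6 product expanded towards C7 to C9 homologs.
print(sorted(expand_homologs({(6, 10, 5, 0)})))
```

Adding a CH2 group leaves nO and nN unchanged, so the generated formulae differ from the reported one only in carbon and hydrogen number, and therefore in H/C, O/C, and OSc.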
As shown in Table 1, eight descriptors of the oxidation products are derived from their chemical formulae. The first four are the original elemental composition, i.e., the numbers of C, H, O, and N atoms. The other four are processed information about the molecules: the double bond equivalent (DBE) and H/C ratio, which reflect the carbon saturation state of OOMs, and the O/C ratio and OSc, which reflect their carbon oxidation state. As the VOC precursors differ in carbon number, DBE, and functional groups, their oxidation pathways and the generated OOMs differ significantly in these characteristics. For example, monoterpene OOMs and aromatic OOMs generally have higher DBE than aliphatic OOMs based on the current knowledge of their oxidation reactions.3 We found that at least 3–4 features, including H/C and O/C, are needed, and that increasing the number of features slightly improves the performance of the models. We therefore selected all eight features.
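As a worked illustration of these descriptors, the sketch below computes all eight from the elemental composition. The function name `descriptors` and the example formula are hypothetical, and the DBE expression (2nC + 2 − nH − nN)/2 is an assumption chosen because it reproduces the DBE column of Table 1 from the averaged atom numbers.

```python
def descriptors(n_c, n_h, n_o, n_n):
    """Return the eight descriptors of an OOM from its elemental composition."""
    if n_c <= 0:
        raise ValueError("an organic molecule needs at least one carbon atom")
    # Assumed DBE form; it matches the averaged values listed in Table 1.
    dbe = (2 * n_c + 2 - n_h - n_n) / 2
    h_c = n_h / n_c
    o_c = n_o / n_c
    # Carbon oxidation state, 2 x O/C - H/C, as defined in the table notes.
    os_c = 2 * o_c - h_c
    return {
        "nC": n_c, "nH": n_h, "nO": n_o, "nN": n_n,
        "DBE": dbe, "H/C": h_c, "O/C": o_c, "OSc": os_c,
    }


# Hypothetical formula C10H16O8, used only as an example input.
print(descriptors(10, 16, 8, 0))

# Averaged monoterpene composition from Table 1 (nC, nH, nO, nN).
print(round(descriptors(12.7, 19.3, 9.3, 0.4)["DBE"], 2))
```

Because DBE is linear in the atom numbers, applying it to the averaged monoterpene composition gives 3.85, in line with the tabulated 3.9. The ratio-based descriptors are not linear, so the averaged H/C, O/C, and OSc in Table 1 need not equal the values computed from averaged atom numbers.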
The atmospheric OOM dataset was acquired from measurements in Beijing (324 OOM species) and Hyytiälä (328 OOM species). The Beijing measurements were conducted from 26 December 2018 to 26 January 2019 at the BUCT-AHL site, which is ∼500 m west of the Third Ring Road in Beijing, is subject to heavy traffic loading, and is surrounded by residential and commercial areas.4,22 The Hyytiälä measurements were made at the SMEAR II station from 9 to 31 March 2018.9 Details of these two sites can be found in previous studies. Note that the atmospheric OOM dataset does not contain nitrated phenols: their concentration is very high in Beijing (∼75%), and we excluded them from this analysis to reduce the impact of extremely high signals.3
As suggested by the model, atmospheric OOMs in Beijing have a significant contribution from anthropogenic precursors, while those in Hyytiälä are mainly the oxidation products of biogenic monoterpenes. In Beijing, 37% and 27% of the OOM species are attributed to the oxidation of aromatics and aliphatics, respectively. In terms of number concentration, aromatic OOMs and aliphatic OOMs account for 43% and 23%, respectively. The observed dominance of aromatic and aliphatic OOMs in urban Beijing is consistent with that obtained using the workflow method9 (Fig. S2†). The contribution from isoprene OOMs (13%) is also similar to that predicted by the workflow (10%). The non-negligible urban isoprene could come from traffic and biomass burning, with a minor contribution from vegetation.23–25 The predicted fraction of monoterpene OOMs (21%) is much higher than that given by the workflow (4%). This is partly because the workflow misses OH-oxidized monoterpene oxidation products.3 In contrast, monoterpene OOMs in forested Hyytiälä dominate both in species (76%) and in number concentration (62%). Even in forested Hyytiälä, however, aromatic OOMs and aliphatic OOMs still account for ∼33% and ∼5% of the total number concentration, respectively. Non-negligible proportions of anthropogenic OOMs in Hyytiälä have also been reported in previous PMF results on OOMs by Yan et al.9 They may come from distant anthropogenic sources or from local wood combustion sources.26–28
As shown in Fig. 4, atmospheric OOMs formed from monoterpene oxidation in Beijing and Hyytiälä have different characteristics. In Hyytiälä, monoterpene OOMs consist of 76% nitrogen-free compounds, 18% compounds with one nitrogen atom, and 6% compounds with two nitrogen atoms. In Beijing, 79% of monoterpene OOMs contain one nitrogen atom, 17% are nitrogen-free, and 4% contain two nitrogen atoms. Moreover, monoterpene OOMs in Hyytiälä show a lower volatility distribution than those in Beijing (Fig. 4a). In Hyytiälä, ∼22% of monoterpene OOMs are extremely low-volatility organic compounds (ELVOCs, C* ≤ 10^−4.5 μg m^−3). In Beijing, only ∼1% of monoterpene OOMs are ELVOCs. This could be caused by the large difference in NOx concentrations between Beijing and Hyytiälä, which average ∼30 ppb and ∼2 ppb, respectively.17,29,30 Laboratory experiments showed that high NOx concentrations can reduce the formation of dimer products and inhibit consecutive oxygen addition in the auto-oxidation of monoterpenes, and consequently shift the volatility distribution of OOMs to a higher range.
Improvement of model precision requires more laboratory results to train the model, including experiments with diverse precursors and oxidation conditions, and more information about the oxidation products, e.g., the relative concentration of the oxidation products to the fingerprint molecules. As shown by the expansion example of aliphatic OOMs, a more balanced and comprehensive dataset would decrease the uncertainty of the machine learning method (Fig. S3 & S4 in the ESI†).
This work shows a vivid example of connecting laboratory data under various experimental conditions to measurements in the complex atmospheric environment. With the rapid development of analytical technologies, a large amount of multi-dimensional data with higher temporal resolution, higher spatial resolution, and higher chemical or physical resolutions are obtained. Developing data analysis methods using machine learning will largely improve data interpretation and provide an effective way of comparing and utilizing data from various collectors.
Footnote
† Electronic supplementary information (ESI) available: Detailed information on the evaluation of the precursor apportionment model (Table S1); examples of the voting strategy (Table S2); examples of the expansion of aliphatic OOMs (Table S3); the Venn-plot of laboratory-generated OOMs used in this study (Fig. S1); the apportionment results of OOMs in urban Beijing using the workflow method (Fig. S2); the overall accuracy of the decision tree model using a laboratory dataset without the expansion of aliphatic OOMs (Fig. S3); the application results for the model trained with the dataset without the expansion of aliphatic OOMs (Fig. S4); and the performances of models consisting of different number of trees (Fig. S5). See DOI: https://doi.org/10.1039/d2ea00128d
This journal is © The Royal Society of Chemistry 2023