Tommaso
Galeazzo
* and
Manabu
Shiraiwa
*
Department of Chemistry, University of California, Irvine, CA92697, USA. E-mail: tommaso.galeazzo@gmail.com; m.shiraiwa@uci.edu
First published on 5th April 2022
Gas-particle partitioning of secondary organic aerosols is impacted by particle phase state and viscosity, which can be inferred from the glass transition temperature (Tg) of the constituting organic compounds. Several parametrizations were developed to predict Tg of organic compounds based on molecular properties and elemental composition, but they are subject to relatively large uncertainties as they do not account for molecular structure and functionality. Here we develop a new Tg prediction method powered by machine learning and “molecular embeddings”, which are unique numerical representations of chemical compounds that retain information on their structure, inter atomic connectivity and functionality. We have trained multiple state-of-the-art machine learning models on databases of experimental Tg of organic compounds and their corresponding molecular embeddings. The best prediction model is the tgBoost model built with an Extreme Gradient Boosting (XGBoost) regressor trained via a nested cross-validation method, reproducing experimental data very well with a mean absolute error of 18.3 K. It can also quantify the influence of number and location of functional groups on the Tg of organic molecules, while accounting for atom connectivity and predicting different Tg for compositional isomers. The tgBoost model suggests the following trend for sensitivity of Tg to functional group addition: –COOH (carboxylic acid) > –C(O)OR (ester) ≈ –OH (alcohol) > –C(O)R (ketone) ≈ –COR (ether) ≈ –C(O)H (aldehyde). We also developed a model to predict the melting point (Tm) of organic compounds by training a deep neural network on a large dataset of experimental Tm. The model performs reasonably well against the available dataset with a mean absolute error of 31.0 K. These new machine learning powered models can be applied to field and laboratory measurements as well as atmospheric aerosol models to predict the Tg and Tm of SOA compounds for evaluation of the phase state and viscosity of SOA.
Environmental significanceSecondary organic aerosols (SOA) represent a major component of atmospheric particulate matter and their accurate representation in aerosol models is a demanding problem in atmospheric chemistry. SOA partitioning is impacted by the particle phase state, viscosity and glass transition temperature (Tg) of organic compounds. Here, we develop a machine learning model to predict glass transition temperature of organic compounds. The new model considers molecular structure, functionality and atomic interconnectivity, discerning compositional isomers. It reproduces experimental measurements very well, outperforming previous compositional parametrizations. This powerful tool offers state-of-the-art performances and its implementation in aerosol models would contribute to a better evaluation of SOA effects on climate and air quality. |
SOA viscosity can be estimated using the glass transition temperature (Tg) of the constituting organic compounds.7–9 While there is a fair amount of measured Tg of organic compounds,7 there are only limited number of Tg measurements for SOA compounds.10–12 To fill such measurement gaps, various Tg parametrizations have been developed based on molecular properties including molar mass and atomic oxygen to carbon ratio,13 saturation mass concentration,12,14,15 and elemental composition (e.g., number of C, H, N, O, S atoms).8,15 Moreover, Rothfuss and Petters16 have introduced an empirical group contribution estimation based on functional groups presence within a molecule. Their results suggest functionalization is a crucial predictive parameter for molecular Tg. These parameterizations are simple, practical, and versatile prediction methods, which have been applied to estimate SOA viscosity for high-resolution mass spectrometry8,17–19 and also implemented into thermodynamic,20,21 gas-phase chemistry models9 and chemical transport models.5,22,23 These parameterizations, however, have relatively larger uncertainties (∼25 K) and they do not account for molecular structure and functional groups presence. Notably, Rothfuss and Petters16 found that viscosity of weakly functionalized organic compounds is highly sensitive to functional groups addition and location within a molecule. Thus, there is a strong need to discern between compositional isomers and to account explicitly for functionality and molecular structure.
In cheminformatics molecular descriptors are mathematical objects representing chemical species at different levels of complexity, covering from 0-D atomic information to 3-D molecular structures.24 They are commonly used in medicinal chemistry to develop models predicting chemical activities (Quantitative Structure–Activity Relationship, QSAR) or physical properties (Quantitative Structure–Property Relationship, QSPR) of pure compounds in absorption, distribution, metabolism, excretion and toxicity (ADMET) studies.25 Over the years there have been studies focusing on the prediction of the melting point (Tm) of an organic compound via QSPR models: it is a task of particular interest because of the correlation of the Tm with the vapor pressure, boiling point, glass transition temperature, and water solubility.26–28 In environmental sciences, the MPBPWIN module from the EPI Suite software by the Environmental Protection Agency (EPA) is the standard reference for the estimation of the Tm of environmentally-relevant organic compounds.29 This QSPR model is built on a simple group-contribution-method where a vectorized list of functional groups acts as the molecular descriptor of a species. The EPI Suite performs well for predicting the Tm of small molecules with a few functional groups, but it overestimates Tm of more structurally complex and aromatic compounds.29 Therefore, the need for a powerful estimation model that could be more accurate in its prediction and in capturing molecular complexity. However, developing an accurate Tm model is challenging due to the quality of the available datasets and the variability of the used molecular descriptors which induce large errors and uncertainty in model generalizability (i.e., accurate error estimation when analyzing new molecules).
Over the last years, developments in artificial intelligence and machine learning (ML) have overspilled to cheminformatics. Notably, text analysis techniques borrowed from Natural Language Processing (NLP) have been used to learn chemically contextualized molecular descriptors from canonical molecular representations including SMILES (Simplified Molecular Input Line Entry System: molecular strings notations to describe unique chemical structures). Jastrzębski et al.30 showed that SMILES can be used to learn contextualized chemical descriptors of molecules via convolutional neural networks (CNN). Recently, Gómez-Bombarelli et al.31 developed a chemical Variational Auto Encoder (VAE): a generative model based on a Recurrent Neural Network (RNN) that learns compressed numerical representations from SMILES (encoding step) and generates molecules with target properties from the chemical space (generative step). Segler et al.32 showed that SMILES can be used to generate bioactive molecular candidates from small datasets using RNNs and by transferring the learned molecular descriptors and knowledge to large datasets. Jaeger et al.33 developed mol2vec, an algorithm that learns molecular descriptors as high dimensional embeddings of molecular substructures (i.e. “molecular embeddings”) from SMILES notations by combining the word2vec and Morgan algorithms.
Recently, a few studies have explored the performances of different combinations of ML models and molecular descriptors in predicting the Tm of organic compounds.34–36 The resulting models largely outperform the EPI Suite in predicting Tm, suggesting an increasing ability of ML models and complex molecular descriptors in predicting Tm of pure compounds and potentially their Tg. These studies have focused on the development of molecular descriptors from molecular graphs via convolutional embedding methods.60,61 The developed embeddings (i.e., convolutional embeddings) reach extremely high prediction accuracies, but they can result in significant drawbacks with regards to model deployment and portability. In these approaches both the embedding and the property prediction steps are engrained in the Convolutional Neural Network (CNN). The major caveat of using the former approach is that the dataset and CNN architecture cannot be decoupled, and the embeddings are generated in situ from the very specific dataset. Therefore, the resulting convolutional embeddings are dataset specific and cannot be loaded and used for other tasks. Moreover, in such approaches the development of the target QSAR model requires the optimization and training of the CNN, which are very computationally demanding tasks. As a consequence, these models lack portability, transferability and scalability due to the in situ generation of molecular descriptors dependent on the dataset of origin, and the absence of a finalized trained model which could be transferred to other datasets. On the other hand, mol2vec reaches state-of-the-art performances to statistically infer various molecular properties in supervised learning tasks by generating unique high-dimensional vectors from a pretrained embedding model. As a result, it can be easily transferred and included in the analysis of complex chemical systems with large numbers of diverse compounds.
Here we introduce the first ML-driven Tg prediction method based on molecular embeddings. We use different machine learning algorithms to predict Tg by explicitly considering molecular structure, functionality and atomic interconnectivity, outperforming previous Tg parameterizations. The new model can reproduce experimental Tg data and the influence of number and location of functional groups within the molecule on Tg. We extend the investigation to the prediction of Tm using a large experimental dataset with ∼200000 compounds. We develop a new model for Tm prediction, reaching close to state-of-the-art performances. These new ML powered Tg and Tm models can be exploited to predict viscosity in aerosol models involving organic molecules, with future applications that go beyond aerosol chemistry and extend to modeling of organic mixtures.
The largest measurement dataset reporting Tg of organic compounds (Tg-Measured) contains 394 measurements.7 The original dataset has been enriched with recently measured Tg for 7 atmospheric compounds12 and with theoretically-derived Tg for 9 linear alkanes from molecular dynamics simulations.38 The Tg-Measured dataset is composed of 415 entries and after the cleaning step it comprises 298 unique entries. Due to the scarcity of available experimental Tg data, we train separate Tm models using larger datasets of experimental Tm. Tg can be estimated from Tm using the structure–activity relationship known as the Boyer-Kauzmann rule: Tg = g × Tm with g as a constant to be 0.7 based on analysis by Koop et al.7 The first experimental Tm dataset is formed by values extracted from patents by Tetko et al.39 (Tm-Tetko), while the second dataset is the “Bradley good Tm dataset” (Tm-Bradley),40 a highly curated experimental Tm dataset of drug-like small molecules. The Tm-Tetko dataset is the largest publicly available Tm dataset: it contains 228174 entries and it accounts for 220348 species after cleaning.
The final step of dataset preprocessing is the conversion of canonical SMILES into molecular embeddings. We have used the mol2vec library to generate unique 300-dimensional embeddings (i.e., “molecular embedding”) for each chemical species in the different datasets (Table 1).
Model selection and optimization are conducted via a nested cross-validation (also known as “double cross-validation”). This model development technique allows to estimate an almost unbiased and low variance true error when data are scarce.42–44 A previous study has shown that a nested cross-validation is particularly suitable for QSAR/QSPR model development and when datasets are small (i.e. <1000 entries).43Fig. 1a shows a simplified representation of the 10-fold nested cross-validation implemented for the training of Tg models with the Tg-Measured dataset. The model development task is structured as a double loop composed by an outer loop for model evaluation (i.e., outer K-loop) and an inner loop for model selection and optimization (i.e., inner J-loop).44 Initially, the entire dataset is divided into K folds (K = 10 in this study): K-1 folds are used to train the model on the best set of hyperparameters and one fold is kept aside to evaluate the error of the trained model on unseen data for model evaluation. At each i iteration of the outer loop (1 < i < K), the ith fold used for model training (i.e., Ki) is passed to the inner J-loop for model optimization via hyperparameters selection. To carry the optimization of the model, Ki is further divided into J sub-folds with J = 10 in this study.
During this step, J-1 sub-folds are used to select the best hyperparameters (i.e., hyperparameters tuning based on the specific model architecture) and the remaining J sub-fold is a validation set used to evaluate the performance of the model developed with this specific parameter combination. The process is repeated for J times on the different J-folds combinations obtained from further split of Ki. The model optimization step is repeated for each K-fold of the outer loop, resulting in K different models developed from the K data combinations. As a result, for each model architecture (e.g., RF, XGBoost) we have conducted n × 10 × 10 fits and error estimations, where n is the total number of single values that can be assumed by individual model architecture parameters. Once we have identified the best model parameters, we have trained the model on the whole dataset and used the estimated errors from the outer cross-validation as the final true value of MAE. This approach enables to reach the best trade-off between bias and variance by selecting the best model parameters, while obtaining a true error estimation by accounting for a vast number of possible data combinations and cross-validation. The RF regressor model is implemented in Python using scikit-learn, a library for scientific computing and machine learning.45 The XGBoost regressor model is implemented via xgboost, a Python library for optimized distributed gradient boosting model development.41 The hyperparameters of our regressor models were selected through a grid search approach: we selected a range of plausible values for each hyperparameter (e.g., estimators, maximum depth of trees, learning rate, etc.) and we have trained as many models as the possible combinations of available hyperparameters. Finally, we selected the best Tg regression model whose combination of hyperparameters provided the lowest error in the nested cross-validation step. The best Tg regression model (tgBoost) developed via a XGBoost regressor and through the nested cross-validation is composed by: 100 estimators, a maximum depth of 9 trees, a learning rate of 0.1, and a γ equal to 30.
Dataset | Algorithma | MAE (K) | RMSE | R CV 2 | R | Study |
---|---|---|---|---|---|---|
a RF = random forest, XGBoost = extreme gradient boosting method, OLS = ordinary least squares (i.e., compositional parametrizations8,15), DNN = deep neural network, CNN = convolutional neural network, GCNN = graph convolutional neural network, ASNN = adversarial neural network, GPR = Gaussian process regression. b The datasets used in these studies are all different variations of the “Bradley Good Melting Points Dataset” from Bradley et al.40 | ||||||
T g-Measured | RF | 22.2 | 26.9 | 0.86 | 0.94 | This work |
XGBoost | 18.3 | 15.3 | 0.99 | 0.99 | This work | |
OLS | 27.2 | 0.83 | 0.91 | 8 and 15 | ||
T m-Tetko | DNN | 31.0 | 40.1 | 0.6 | 0.77 | This work |
T m-Bradley | CNN | 26.2 | 35.5 | 35 | ||
Baseline | 43.3 | 57.7 | 35 | |||
ASNN | 32.0 | 34 | ||||
GCNN | 28.85 | 0.78 | 36 | |||
RF | 34.62 | 0.66 | 36 | |||
GPR | 29.41 | 0.75 | 36 |
Fig. 2 Correlation plot between the predicted Tg values by the tgBoost model and the experimental Tg values from the Tg-Measured dataset. |
Fig. 3 T g values predicted for the n-alkanes (n = 2–20) by the developed tgBoost model (red dots), molecular dynamics simulations (yellow squares),38 the Tg compositional parametrizations (black crosses).8 |
Remarkably, the tgBoost model repeats a smooth increase of Tg for n > 11 after a dip between n = 10–11. MD simulations suggest that the slower Tg increase is a non-linear phenomenon resulting from the combination of structural inner- and inter-molecular effects occurring in the bulk phase of pure n-alkanes.38 It is possible to assume that the higher degree of available conformational rearrangements of longer n-alkanes would lead to a lower Tg with each addition of a C atom to the alkyl chain. However, MD simulations suggest that longer n-alkanes can be easily trapped in pipe cages within the bulk phase of the glassy material and be prevented from rearranging in stable conformations that would lead to crystallization. With each addition of a C atom to the alkyl chain the interplay between these contrasting effects would need to be taken into account for Tg evaluation. As the MD values were included in the training dataset for the tgBoost model, it is possible that the tgBoost model might have extrapolated the trend observed for the lower mass molecules (i.e., n < 10) and expanded it to n > 11 based on the similarity between molecular embeddings of n-alkanes. To validate and resolve this issue, Tg measurements or MD simulations for higher n-alkanes are necessary. This result confirms the high performance of ML driven molecular descriptors (i.e., information rich embeddings) in capturing subtle variations in experimental/simulated data in relation to changes in the molecular structures and non-linear combinatorial physical effects.
Fig. 4 compares the experimental Tg measurements of n-alkyl alcohols (n = 1–16) with the respective Tg values predicted by the tgBoost model and the compositional parametrization.8 Primary, secondary and tertiary alcohols are all isomers with same elemental composition and consequently the compositional parametrization predicts the same values for all species, representing its major limitation. Overall, the compositional parametrization tends to overestimate the Tg of primary alcohols and to underestimate for secondary alcohols. In contrast, the tgBoost model predicts Tg in consistence with measurements, showing lower Tg for primary alcohols and higher Tg for secondary and tertiary alcohols when n < 7. This behavior is consistent with the results by Rothfuss and Petters,16 who highlighted that smaller Tg values are typically observed for primary alcohols whereas longer chain alcohols with branching and midchain –OH group have larger Tg values. The values modeled by tgBoost for secondary and tertiary alcohols overlap at this stage. It is hard to refine the tgBoost model performance for this task due to the lack of sufficient experimental measurements for tertiary alcohols. However, our results indicate that molecular embeddings would be able to capture oscillations in Tg due to small variations in molecular structures (such as the displacement of the –OH group along the alkyl chain) if more data were included during model training.
Fig. 4 T g values of primary alcohols (black), secondary alcohols (magenta), and tertiary alcohols (light blue) as a function of the number of carbons of the n-alkyl chain. The markers represent the available experimental measurements,7 the solid lines represent the predicted values by the tgBoost model, and the dashed lines represent the Tg predictions by the compositional parametrization.8 Note that primary alcohols have a terminal –OH group, and secondary and tertiary alcohols have one –OH group placed on the second and third carbon atoms of the alkyl chain, respectively. |
Fig. 5 shows the experimental and modeled Tg of n-alkanes, monoalcohols, diols and triols. Note that, monoalcohols have the –OH group attached at the end of their alkyl chain, diols have the two –OH groups attached at the two opposite ends of their alkyl chain, and triols have the same structure of diols with an additional –OH group attached to the second carbon of their alkyl chain counted from one of its ends. The measurements are compared to the corresponding predictions by the tgBoost model and the compositional parametrization.8 Both models reproduce an increase in Tg observed with the addition of one to three –OH groups to the alkyl chain. Remarkably, both models reproduce the increase of 30–50 K observed with the addition of each –OH group. These results are in good agreement with a previous study, which reported an almost linear increase in Tg value for the addition of –OH groups to the skeleton of raw alkyl chains.16 The compositional parametrization exhibits a logarithmic growth upon an increase of n, overestimating in general Tg values of alcohols. These results emphasize the predictive power of molecular embeddings and ML models in aerosol modeling over simple compositional parametrizations, given that atmospheric SOA contain many alcohols.48–50
Fig. 5 T g values of n-alkanes (black), monoalcohols (magenta), diols (red), and triols (yellow) as a function of the number of carbons of the n-alkyl chain. The markers represent experimental measurements,7 the solid lines represent the predicted values by the tgBoost model, and the dashed lines represent the Tg predictions by the compositional parametrization.8 Note that mono-alcohols have one terminal –OH group, diols have two –OH groups placed at the extremities of the carbon chain, and triols have a similar structure to diols but with an additional –OH group placed on the second carbon atom of the chain counted from one of the extremities. |
These results should motivate future studies to adopt molecular embeddings and machine learning algorithms to develop predictive models of organic molecules. Note that there are limitations to be accounted when using models like tgBoost. As reported in Fig. 4 and 5 the model can discern among compositional isomers of simple mono-alcohols (i.e. primary, secondary and tertiary alcohols) and it can simulate the increase in Tg by each –OH addition similarly to the compositional parametrization and the few available experimental data points of diols and triols. However, tgBoost should be used with precautions when working with these compound classes: as there are limited amount of observational data for tertiary alcohols, diols and triols, tgBoost might neglect possible trends in Tg for those molecular classes. Therefore, when possible, it would be good to compare tgBoost predictions with other Tg QSPR models developed on different datasets and physical properties. These limitations underline the importance of collecting more experimental data on Tg of atmospherically-relevant organic molecules.
We have conducted further proof-of-concept investigations on the performance of tgBoost in distinguishing other compositional isomers and on the effects on Tg due to the addition of carboxylic groups to an alkyl chain. Fig. 6 shows the Tg of different compositional isomers as predicted by tgBoost. It illustrates the Tg predictions as a function of the number of carbon atoms of the species in compositional isomers grouped as (a) ethers and alcohols, (b) ketones and aldehydes, and (c) esters and carboxylic acids, with the respective functional groups positioned at the end of the alkyl chain. Our results suggest the following trend for sensitivity of Tg to functional group addition: –COOH (carboxylic acid) > –C(O)OR (ester) ≈ –OH (alcohol) > –C(O) (carbonyl) ≈ –COR (ether) where the carbonyl group category comprises –C(O)R (ketone) and –C(O)H (aldehyde). Overall, the results are in good agreement with previous viscosity measurements, which suggested the following trend in viscosity sensitivity to functional group addition –COOH (carboxylic acid) ≈ –OH (alcohol) > –ONO2 (nitrate) > –C(O) (carbonyl) ≈ –C(O)OR (ester) > –CH2 (methylene).16 Our results suggest that for weakly functionalized compounds the addition of an ester functional group to the alkyl chain can strongly increase the Tg of a molecule, particularly for smaller compounds with n < 6 (see Fig. S4† in ESI). This effect may be due to conformational effects resulting from the addition of an alkoxycarbonyl group, which would induce lower flexibility in the aliphatic component and a lower degree of transformation between trans- and cis-stereoisomers in the carbon chain. It is expected that ketones and aldehydes have a relatively lower Tg compared to alcohols and carboxylic acids as there are no functional groups that may be involved directly in hydrogen bonds in the bulk phase of pure materials. However, their potential role in increasing the overall Tg of SOA mixtures should be noted since the carbonyl group may still participate in hydrogen bonds in presence of hydrogen donors. Further investigations are needed to assess how the interplay of multiple functional groups can affect Tg of multicomponent complex mixtures.
Carboxylic acids represent a major fraction of SOA50 and a better representation of carboxylic acids data is particularly relevant for aerosol models. Fig. 7 illustrates the predicted Tg by the tgBoost model and the compositional parametrization8 for the addition of one to three carboxylic groups to the alkyl chain. Both methods predict increasing Tg values for each addition of a carboxylic group to the molecule. At equal number of C atoms in the molecule, the mean increase in Tg between mono carboxylic and dicarboxylic acids is 63 and 53 K for the tgBoost model and compositional parametrizations, respectively. The tgBoost model shows a steady increase in Tg for mono-carboxylic acids and interestingly it predicts a decline in Tg for dicarboxylic acids with n > 5 and no increase in Tg value for tricarboxylic acids. It should be noted that the acidity of dicarboxylic acids lowers with the addition of C atoms to their alkyl chain due to the electron donating nature of the alkyl group. This decrease is proportional to the addition of new C atoms to the aliphatic chain with the highest reductions observed for the first four carbon additions.51 The acidity of dicarboxylic acids depends also on the conformation of the species, with trans-structural isomers being considerably more acidic than the equivalent cis-structural isomers. A lower acidity implies a reluctancy for the molecule to release protons, therefore a more stable structure and a lower strength in hydrogen bonding.52,53 It has been shown that Tg is strongly influenced by the presence of hydrogen bonds, notably because hydrogen bonds promote the crystallization process that leads to the transition into a solid state.54Tg is also strongly affected by stereoisomerism because rotamers and conformers slow down the crystallization process.55 Similar acidity and the presence of more stereoisomers in di- and tri-carboxylic acids with longer alkyl chains would combine and result in comparable strengths of hydrogen bonds with addition of C atoms to the molecular structure. It is possible that this effect may justify the decreasing or constant predicted Tg of dicarboxylic and tricarboxylic acids upon increasing carbon number.
Fig. 7 T g of linear monocarboxylic, dicarboxylic and tricarboxylic acids as a function of the number of carbon atoms in the alkyl chain as predicted by the tgBoost model (solid lines) and the Tg compositional parametrization (dashed lines).8 |
Fig. 8a shows the results of such comparison. Overall, the values predicted using the tgBoost are similar to the ones estimated using the Tm from EPI Suite with MAE = 27.6 K, R = 0.794, and RCV2 = 0.455. Many datapoints are positioned below the 1:1 correlation line, indicating that the tgBoost model tends to underpredict the Tg of SOA compounds compared to EPI Suite. This behavior is consistent with the EPI Suite guideline that reports a tendency for the MPBPWIN module to overestimate the Tm of large and multi-functionalized molecules, such as aromatics and complex carbonyl bearing compounds.29 The shaded pink square in Fig. 8a highlights a cluster of 20 molecules whose values are overpredicted by a factor of 100–150 K by EPI Suite, and which are all complex multi-functional branched carbonyl compounds with ether and alcohol segments within their molecular structure (see Fig. S2†).
Fig. 8 (a) Correlation plot between the Tg values predicted by the tgBoost model and by the MPBPWIN module of EPI Suite for SOA compounds from Shiraiwa et al.56 The pink squared area highlights the cluster of 20 molecules with the highest deviation between predictions. (b) Correlation plot between the Tg values predicted by the compositional parametrizations8,15 and by the MPBPWIN module of EPI Suite for SOA compounds from Shiraiwa et al.56 The orange squared area highlights the cluster of 30 molecules with the highest deviation between predictions. |
Fig. 8b shows the correlation between the Tg values predicted from the compositional parametrization8 and those estimated from EPI Suite Tm. In this case the Tg values predicted by the two models are very similar with a very low MAE of 14.6 K, a high positive correlation of R = 0.917, and a relatively low variance of RCV2 = 0.839. This result suggests that the models have similar prediction capability and the molecular descriptors used to develop the models have similar limitations. Notably, due to the limitations of EPI Suite, the compositional parametrizations may also tend to overestimate Tg of organic species as pointed by the high correlation between the two methods. The shaded orange square highlights a cluster of 30 molecules whose values are overpredicted by a factor of 55–75 K by EPI Suite. The largest divergences are observed for nitrogen bearing large compounds with carbonyl, alkane ring and alcohol segments within their molecular structures (see Fig. S3†).
These results imply that the tgBoost model is applicable to SOA compounds, providing more realistic Tg of organic molecules with complex structure and multiple functional groups compared to the EPI Suite. Remarkably, we observe that molecular embeddings can overcome the limitations affecting the performances of the EPI Suite and the compositional parametrizations. The tgBoost model performs well in predicting Tg of SOA compounds and it has good potential for applications to the modelling of aerosol chemistry.
Molecular embeddings have already shown to be able to capture slight variations in molecular structures and physical trends for predicting Tg as discussed above. Therefore, we expect similar behavior for Tm and here we have focused our analysis on the assessment of the domain of applicability of the DNN model developed from the Tm-Tetko dataset. Fig. 9 exhibits the correlation plot between the Tm predictions from the DNN model and EPI Suite for SOA compounds from Shiraiwa et al.56 It shows a positive correlation with R = 0.582, a high variance of RCV2 = 0.261 and a substantial divergence with MAE = 48.7 K. A deep investigation shows that the Tm of complex multi-functional species is overpredicted by EPI Suite, while the DNN model tends to overestimate the Tm of very simple compounds. The highest divergences are observed for complex multi-functional nitrate groups (EPI Suite predictions are on average 170–150 K higher than DNN ones) and for simple hydroxy acids (DNN predictions are on average 100–150 K higher than EPI Suite ones). These results suggest that our Tm model has limitations that need to be accounted if to be used to predict the Tm of atmospheric species composing SOA. The prime cause of the discrepancy between the predictions of the two models is likely linked to the different nature of the chemical species of the datasets. Notably, the Tm-Tetko dataset has an abundance of drug-like complex compounds such as alkaloids, aromatic cyclic nitrogen bearing compounds, steroids, and polycyclic molecules as well as more molecules with Br and Cl in their structures. This chemical dissimilarity could be responsible for the low performance of the model when it tries to predict the Tm of small low functionalized organic compounds. Despite the general good performance of the DNN model on complex molecules, its application to SOA chemistry may be limited. Further investigations are needed to develop a better Tm prediction model for applications to atmospheric chemistry. Notably, future work should focus on the retrieval of a more representative dataset of experimental Tm for atmospheric species to be used for model development.
Fig. 9 Correlation plot between the Tm values predicted by the DNN model and by the MPBPWIN module of EPI Suite for SOA compounds from Shiraiwa et al.56 |
The developed tgBoost model for Tg estimation is a very powerful tool: it has a low MAE of 18.3 K and its predictions are in very good agreement with experimental measurements, even capturing very subtle trends in data. The tgBoost model can reproduce non-linear trends observed in Tg for n-alkanes due to inter- and intra-molecular interactions in the bulk phase. The model can also distinguish structural isomers and discern how the positioning of a functional group within the molecular structure can influence its Tg. For this task, the tgBoost model was tested on primary, secondary and tertiary alcohols, showing how the displacement of an –OH group along the alkyl chain can influence Tg. The advantage of the tgBoost model lies in its ability to discern structural isomers; it can distinguish between aldehydes and ketones, alcohols and ethers, and esters and carboxylic acids, predicting different Tg for all pairs of isomeric chemical classes. The tgBoost model predicts the following trend in Tg sensitivity to functional group addition: –COOH (carboxylic acid) > –C(O)OR (ester) ≈ –OH (alcohol) > –C(O) (carbonyl) ≈ –COR (ether). This result is in good agreement with the trend in sensitivity to viscosity to functional group addition observed by Rothfuss and Petters.16 The tgBoost model has also been tested on mono-, di- and tri-carboxylic acids in order to assess if it can capture how the interplay between multiple functional groups can affect Tg. The results are in relatively good agreement with experimental measurements, but further investigations and data are needed to refine the model for this task. Finally, the model has been tested for its applicability to SOA compounds by comparing the tgBoost model with the compositional parametrizations and EPI Suite. The tgBoost model predicts Tg in reasonable agreement with or somewhat lower compared to EPI Suite and the compositional parametrizations. This is reasonable because a major limitation of EPI Suite is known to be the overprediction of Tm of structurally complex and multi-functionalized chemical species.29
The DNN model developed for the prediction of the Tm of organic molecules performs well against the available dataset with MAE of 31.0 K. The model has been tested for its applicability to SOA compounds in comparison with the EPI Suite. The model performance was limited for SOA compounds, which may be due to the nature of the training dataset that is rich in drug-like complex molecules with heavy atoms within their structures. Nevertheless, the model has great potential of improvement and further studies should concentrate on the retrieval of better experimental datasets with higher fraction of atmospheric organic compounds for model training.
Considerable progress has been made with regards of the development of a Tg prediction model that includes the effects of functionality and molecular structure on Tg explicitly. This aspect is crucial for aerosol models treating gas-particle partitioning of semi-volatile compounds as strongly affected by Tg and viscosity. A difference in Tg of a few K between compositional isomers could affect the viscosity of a particle. As a result, aerosol models of complex SOA systems such as GECKO-A,57 AIOMFAC-VISC20 and the Master Chemical Mechanism (MCM)58,59 would highly benefit from a more complex treatment of Tg to estimate particle viscosity. For instance, GECKO-A can generate complex chemical mechanisms formed by the reactions and partitioning between gas and particle phases. These SOA chemical systems are rich in alcohols, carboxylic acids and multi-functionalized compounds. A detailed treatment of Tg based on functionality would enhance the accuracy of model simulations and provide better insight on complex aerosol systems. The developed tgBoost model offers state-of-the-art performances in predicting the Tg of organic molecules involved in SOA chemistry. It is a very useful and powerful tool for estimations of SOA phase state and hence it can contribute to a better evaluation of SOA effects on climate and air quality.
Footnote |
† Electronic supplementary information (ESI) available. See https://doi.org/10.1039/d1ea00090j |
This journal is © The Royal Society of Chemistry 2022 |