From the journal Environmental Science: Atmospheres Peer review history

Round 1

Manuscript submitted on 29 Oct 2021

Editor’s decision letter

15-Feb-2022

Dear Dr Shiraiwa:

Manuscript ID: EA-ART-10-2021-000090
TITLE: Predicting Glass Transition Temperature and Melting Point of Organic Compounds via Machine Learning and Molecular Embeddings

Thank you for your submission to Environmental Science: Atmospheres, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

I have carefully evaluated your manuscript and the reviewers’ reports, and the reports indicate that major revisions are necessary.

Please submit a revised manuscript which addresses all of the reviewers’ comments. Further peer review of your revised manuscript may be needed. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Please submit your revised manuscript as soon as possible using this link:

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/esatmos?link_removed

(This link goes straight to your account, without the need to log on to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/esatmos) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process. We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

Environmental Science: Atmospheres strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy from CASRAI, https://casrai.org/credit/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines http://www.rsc.org/journals-books-databases/journal-authors-reviewers/author-responsibilities/ for more information.

I look forward to receiving your revised manuscript.

Yours sincerely,
Dr Nønne Prisle
Associate Editor, Environmental Science: Atmospheres

************

Reviewer comments

Reviewer 1

General comments:
This manuscript describes the development of a machine learning-based algorithm that predicts the glass transition temperatures (Tg) and melting points (Tm) of organic molecules using molecular embeddings as molecular descriptors that capture the molecular structure, functionality, and atomic interconnectivity of test compounds. The model output was evaluated against the experimental data, existing computational parameterizations, and the EPI Suite, showing very good agreement and improved regression performance. Overall, this study demonstrates a practical approach to advancing the understanding of the physicochemical properties of organic molecules, with potential applications to parameterize Tg and Tm of organic aerosol constituents in atmospheric aerosol models. The results presented seem very interesting and useful. However, one major comment for this study is regarding the selection of the training dataset. What exactly are the "drug-like" molecules mentioned throughout the manuscript? Are they organometallics or metal-ligand complexes? If they do not represent the atmospheric organic compounds, can these structurally dissimilar compounds be removed from the training dataset since the model is meant for the prediction of atmospheric organic compounds? What are the rationales to include these compounds in the training dataset? Additionally, I have a few specific comments listed below for the authors’ consideration.

Specific comments:
1. Line 71-73: I am not sure how directly Tm could be linked to toxicity. The current statement seems a bit far-fetched. I would suggest revising this statement as “Tm is an important physicochemical parameter controlling the fate and transport of organic pollutants in the environment.”
2. Line 78-79: The authors mentioned that “EPI Suite performs well for predicting the Tm of small molecules with a few functional groups, but it overestimates Tm of more structurally complex and aromatic compounds.” What was not considered in the EPI Suite that results in the overestimation of Tm for more structurally complex and aromatic compounds? Can the authors provide more information here?
3. Line 105-107: Can the authors clarify what “in-situ generation of molecular descriptors” mean here, and why this approach is significantly disadvantaged?
4. Line 128-129: Could the authors provide a few specific examples of the "nitrogen aromatic compounds with implicit hydrogen bonds" mentioned here? In the current description, it is unclear what these compounds are.
5. Line 137: What does the g value represent? Please provide the definition and explanation. How was the g value of 0.7 determined?
6. Line 199-201: Can the authors provide more details for how the "best” Tg regression model (tgBoost)" was determined? What are the considerations and criteria for the determination, and how the parameters listed in Line 200-201 were finally selected?
7. Line 220-224: Similarly, can the authors please provide more details for the determination of the "best DNN model" and how the parameters listed in Line 221-224 were finally selected?
8. Line 352-354: To support this statement, an explicit quantitative analysis based on the data presented in Figure 6 would be helpful (i.e., including quantitative information on the Tg increment for n >= 6 versus n < 6 cases).
9. Page 19 Figure 8: The "shaded pink square" (Line 421) and the "shaded orange square" (Line 432) are not visible on the figure.

Reviewer 2

Summary:

This manuscript by Galeazzo et al. and Shiraiwa used two machine learning algorithms to predict the glass transition temperatures and melting temperatures of atmospherically relevant organic compounds. The results show that one of the models, tgBoost, performed better than the other model with a mean absolute error (MAE) of 18.3 K. In addition, the study also shows the influence of different functional groups on the glass transition temperature of organic compounds, and these results agree with a previous study. Overall, the manuscript is clear and the results are informative. It is nice to see that the authors use machine learning to try to predict the glass transition temperature, an important parameter to describe aerosol phase state. However, there are a few typographic errors in the manuscript and the methods sections need improvements and clarification. I suggest the manuscript undergo a major revision to address these issues.

Major Comments:

I enjoyed reading this manuscript and it is nice to know that the authors brought new ideas and methods (machine learning) into the modeling of atmospheric chemistry by using machine learning algorithms. This aspect of the manuscript is definitely refreshing to read and I believe the authors did an overall good job with it. My major comments are the methods and data interpretation using the ML algorithm. I believe after addressing these comments the manuscript would be a really interesting and solid paper for many readers.

The first major comment is the machine learning method used here. The tgBoost method was published in 2004 and current machine learning research has moved on from tgBoost to other methods. I wonder whether the authors pick tgBoost over a few more modern ML methods?

Also, I am curious whether the amount of the training dataset is enough to be able to accurately predict the glass transition temperature results. As the authors have shown in Table 1, there are only 298 entries of species that were used for the machine learning algorithm. Is the number of entries enough, especially when compared with thousands of entries that machine learning algorithms usually need to generate accurate results? When it comes down to certain functional groups, such as tertiary alcohols, diols, or triols, there are only 1 or 2 data points to train the machine learning algorithm, which may not be enough to predict the glass transition temperature of these types of compounds. I do not think the author should make the predictions of the above type of compounds that have limited data (Figure 4-5). Instead, it might be better to state out the limitation of the ML algorithm in predicting compounds of certain types of functional groups to avoid future misuse by the readers.

The second comment is the representativity of the training dataset over the testing dataset. The training entries need to be representative of the final testing entries for ML results to be accurate. It looks like the entries used to predict the melting temperature of organic compounds (Tetko et al.) often consist of pharmaceutical compounds that have heavy elements, which are not representative of the atmospheric molecules that the authors are trying to predict. Maybe the author should screen the Tetko et al. entries and remove these non-representative species when training the ML algorithm.

Thirdly, when the author evaluated the effectiveness of the ML algorithm, the testing compounds were also in the training datasets (lines 263-266). My understanding is that the training datasets and the testing datasets should be different groups of data to avoid double counting the results and over-estimating the effectiveness of the model (Figure 5-7). The author should probably exclude some data from the training sets and use those data only in the testing datasets to test how accurate the ML algorithm is.

The last major comment is method comparison. The author concludes that the ML algorithm outperforms the elemental composition model that the group developed earlier. Li et al. (2019) already shows that the O:C ratio model and the volatility models perform better than the elemental ratio model based on average absolute value of the relative error (AAVRE). I wonder whether the author can compare the results from the ML results with the O:C model and the two volatility models to discuss the pros and cons of each model regarding AAVRE or MAE.

Minor Comments:

Line 53-55: The author should probably include that Rothfuss and Petters (19) also used functional groups to predict Tg and that Zhang et al. (12) also used saturation mass concentration to predict Tg.

Line 58: I suggest including a few citations to show another thermodynamic model (Octaviani et al 2019) and chemical transport models also include the viscosity of SOA (Shrivastava et al., 2017; Schmedding et al., 2019, 2020)

Figure 4: why was the prediction for the tertiary alcohol not shown?

Figure 6: in the caption, the letters referring to each panel were out of order.

Line 421 & 432: the pink and orange boxes were not included in Figure 8. Also, why would author not exclude these types of data with known large errors in the ML training dataset? I think by excluding these types of data, the ML results will improve.

References:

Shrivastava, M.; Lou, S.; Zelenyuk, A.; Easter, R. C.; Corley, R. A.; Thrall, B. D.; Rasch, P. J.; Fast, J. D.; Massey Simonich, S. L.; Shen, H.; Tao, S., Global long-range transport and lung cancer risk from polycyclic aromatic hydrocarbons shielded by coatings of organic aerosol. Proc. Natl. Acad. Sci. USA 2017, 114 (6), 1246-1251

Octaviani, M.; Shrivastava, M.; Zaveri, R. A.; Zelenyuk, A.; Zhang, Y.; Rasool, Q. Z.; Bell, D. M.; Riva, M.; Glasius, M.; Surratt, J. D., Modeling the Size Distribution and Chemical Composition of Secondary Organic Aerosols during the Reactive Uptake of Isoprene-Derived Epoxydiols under Low-Humidity Condition. ACS Earth and Space Chemistry 2021, 5 (11), 3247-3257.

Schmedding, R.; Ma, M.; Zhang, Y.; Farrell, S.; Pye, H. O. T.; Chen, Y.; Wang, C.-t.; Rasool, Q. Z.; Budisulistiorini, S. H.; Ault, A. P.; Surratt, J. D.; Vizuete, W., α-Pinene-Derived organic coatings on acidic sulfate aerosol impacts secondary organic aerosol formation from isoprene in a box model. Atmos. Environ. 2019, (213), 456-462.

Schmedding, R.; Rasool, Q. Z.; Zhang, Y.; Pye, H. O. T.; Zhang, H.; Chen, Y.; Surratt, J. D.; Lopez-Hilfiker, F. D.; Thornton, J. A.; Goldstein, A. H.; Vizuete, W., Predicting secondary organic aerosol phase state and viscosity and its effect on multiphase chemistry in a regional-scale air quality model. Atmos. Chem. Phys. 2020, 20 (13), 8201-8225.

Author response

This text has been copied from the PDF response to reviewers and does not include any figures, images or special characters.

REVIEWER REPORT(S):
Anonymous Referee #1
Comments to the Author
General comments:
This manuscript describes the development of a machine learning-based algorithm that predicts
the glass transition temperatures (Tg) and melting points (Tm) of organic molecules using
molecular embeddings as molecular descriptors that capture the molecular structure,
functionality, and atomic interconnectivity of test compounds. The model output was evaluated
against the experimental data, existing computational parameterizations, and the EPI Suite,
showing very good agreement and improved regression performance. Overall, this study
demonstrates a practical approach to advancing the understanding of the physicochemical
properties of organic molecules, with potential applications to parameterize Tg and Tm of
organic aerosol constituents in atmospheric aerosol models. The results presented seem very
interesting and useful. However, one major comment for this study is regarding the selection
of the training dataset. What exactly are the "drug-like" molecules mentioned throughout the
manuscript? Are they organometallics or metal-ligand complexes? If they do not represent the
atmospheric organic compounds, can these structurally dissimilar compounds be removed from
the training dataset since the model is meant for the prediction of atmospheric organic
compounds? What are the rationales to include these compounds in the training dataset?
Additionally, I have a few specific comments listed below for the authors' consideration.
We thank Referee #1 for the review and positive evaluation of our manuscript.
It is a fair point to raise questions about the selection and quality of the training dataset. Our Tg
model is developed using the largest available dataset of experimental measurements of Tg as
originally compiled by Koop et al. (2011). To clarify, this dataset mainly consists of
monomeric organic molecules, and not of drug-like molecules, organometallics, or
metal-ligand complexes. While not all data points correspond specifically to commonly studied
atmospheric compounds, the scope of this work is to develop a Tg regression model accounting
for complex molecular structures. As a result, we think that the Koop dataset constitutes a
good base for the development of a Tg regression ML model of monomeric organic molecules
because of the variability and complexity in molecular structures of the chemical species in the
dataset. Concerning the development of the Tm model, we use the term “drug-like compounds”
to refer to organic molecules such as alkaloids, aromatic cyclic nitrogen bearing compounds,
steroids, and polycyclic molecules within the dataset. Some of these compounds such as
Sulfathiazole, Flufenamic acid, Spironolactone, Tolbutamide, etc. are recognized
pharmaceuticals used for their bioactive properties. We did not exclude these compounds, as it
is challenging to judge which molecules are not atmospherically relevant (for example,
polycyclic compounds are often found in atmospheric particles) and such arbitrary exclusion
may lead to bias in the model. Given the broad spectrum of molecular structures covered in the
dataset, we think that the resulting model can have applications beyond atmospheric
chemistry, ranging from toxicological to environmental studies. We now clarify what
the term “drug-like compounds” stands for in the revised manuscript.
“[…] Notably, the Tm-Tetko dataset has an abundance of drug-like complex compounds such
as alkaloids, aromatic cyclic nitrogen bearing compounds, steroids, and polycyclic molecules
as well as molecules with Br and Cl in their structures. […]”
Specific comments:
1. Line 71-73: I am not sure how directly Tm could be linked to toxicity. The current
statement seems a bit far-fetched. I would suggest revising this statement as "Tm is an
important physicochemical parameter controlling the fate and transport of organic pollutants
in the environment."
Thanks for pointing this out and to avoid confusion we simplified this sentence in the revised
manuscript:
“[…] it is a task of particular interest because of the correlation of the Tm with the vapor
pressure, boiling point, glass transition temperature, and water solubility (22–24). […]”
2. Line 78-79: The authors mentioned that "EPI Suite performs well for predicting the Tm
of small molecules with a few functional groups, but it overestimates Tm of more structurally
complex and aromatic compounds." What was not considered in the EPI Suite that results in
the overestimation of Tm for more structurally complex and aromatic compounds? Can the
authors provide more information here?
EPI Suite uses the MPBPWIN module to estimate the Tm of organic molecules. MPBPWIN
estimates Tm by giving a weighted average from the results of two prediction methods: (1) the
Joback Method (a group contribution method); and (2) the Gold and Ogle method where Tm =
0.5839 * Tb (i.e. Tb = boiling point) (QSPR method). Joback’s method is a group contribution
estimation method developed in 1984 which uses basic structural information of a chemical
molecule to infer its physical-chemical properties. In more detail, a molecule is divided into a
list of its functional group segments, and the final Tm is inferred by summing the
contribution of each functional group to Tm, as illustrated in the following equation:
$T_m = 122.5 + \sum_i T_{m,i}$, where $T_{m,i}$ corresponds to the inferred contribution to Tm of a specific
functional group i (e.g. methyl group, ketone, ester, alcohol, etc.).
The major drawback arises from the fact that the dataset used to develop Joback’s estimation
method was relatively small for such a bold and simple statistical parametrization (i.e.
388 data points for Tm). Secondly, Joback assumed that each functional group of a molecule
would have an equal contribution to Tm. As a result, the final Tm value would be the result of a
linear cumulative contribution for each functional group parameter. This assumption does not
correctly describe the real behavior of Tm. A more correct approach would have been a decreasing
contribution to Tm with increasing number of groups added to a single molecule. More
accurate group contribution methods suggest, indeed, that enthalpic and entropic corrections
should be included (Jain et al., 2006). As a result, Joback’s method leads to a good estimation
only for mid-sized and relatively low functionalized components and to high deviations for
small and large molecules (EPA, 2017). These major drawbacks were known to the EPI Suite
developers of MPBPWIN, who strongly advise using it purely for estimation
and informational assessments. We think this issue is too detailed to be described in the
manuscript, and interested readers can find this information in the cited reference.
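The two estimation routes described in this response can be sketched in a few lines of Python. This is a minimal illustration only, not the MPBPWIN implementation: the group contribution values are illustrative and should be checked against the published Joback table before any use, and the `weight_joback` parameter is an assumption, since the response does not state how MPBPWIN weights the two methods.

```python
JOBACK_BASE_K = 122.5       # constant term of the Joback Tm equation
GOLD_OGLE_FACTOR = 0.5839   # Tm = 0.5839 * Tb (Gold and Ogle)

# ASSUMPTION: illustrative per-group contributions to Tm (K); verify against
# the published Joback table before relying on them.
ILLUSTRATIVE_TM_CONTRIBUTIONS = {"CH3": -5.10, "CH2": 11.27, "OH": 44.45}


def joback_tm(group_counts, contributions=ILLUSTRATIVE_TM_CONTRIBUTIONS):
    """Tm = 122.5 + sum of group contributions, one term per group occurrence."""
    return JOBACK_BASE_K + sum(
        contributions[group] * count for group, count in group_counts.items()
    )


def gold_ogle_tm(tb_kelvin):
    """Tm estimated from the boiling point Tb (K)."""
    return GOLD_OGLE_FACTOR * tb_kelvin


def combined_tm(group_counts, tb_kelvin, weight_joback=0.5):
    """Weighted average of both estimates; the weight is a placeholder."""
    return (weight_joback * joback_tm(group_counts)
            + (1.0 - weight_joback) * gold_ogle_tm(tb_kelvin))


# Each extra CH2 group adds the same increment, however long the chain already is.
short_chain = joback_tm({"CH3": 1, "CH2": 2, "OH": 1})
long_chain = joback_tm({"CH3": 1, "CH2": 3, "OH": 1})
```

The difference between `long_chain` and `short_chain` equals the single CH2 contribution, which is the linear cumulative behavior criticized above.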
3. Line 105-107: Can the authors clarify what "in-situ generation of molecular
descriptors" mean here, and why this approach is significantly disadvantaged?
Most recent QSAR studies regarding Tm prediction (Coley et al., 2018) have focused on the
generation of specific molecular descriptors via convolutional embedding methods (Duvenaud
et al., 2015; Kearnes et al., 2016). With these approaches both the embedding and the property
prediction steps are engrained in the Convolutional Neural Network (CNN) and in model
training. The embedding step constitutes the bulk of the CNN, and the prediction step is the
last fully connected layer of the CNN. The major caveat of this approach is that the dataset and
CNN architecture cannot be decoupled, and the embeddings are generated in-situ from the very
specific dataset. Another disadvantage is that training and optimization of a convolutional
neural network are very computationally demanding tasks. As a consequence, with such
approaches the development of prediction models and pipelines loses much of its flexibility
and infrastructure portability. In contrast, mol2vec comes with a pretrained model that
can be loaded independently and nested in a pipeline, and which generates molecular embeddings
independently and without any further pre-training. In addition, the embedder of mol2vec has
been trained using molecular segments of 19.9 million compounds from the ZINC and
ChEMBL datasets, therefore covering a very large spectrum of molecular structures.
Computationally, using mol2vec considerably reduces the cost of the embedding step, so we
can focus on model optimization and training, and on model distribution.
We have revised the manuscript and inserted the following sentences to clarify:
“[…] The resulting models largely outperform the EPI Suite in predicting Tm, suggesting an
increasing ability of ML models and molecular descriptors in predicting Tm of pure compounds
and potentially their Tg. These studies have focused on the development of molecular
descriptors from molecular graphs via convolutional embedding methods (Duvenaud et al.,
2015; Kearnes et al., 2016). The developed embeddings (i.e., convolutional embeddings)
reach extremely high prediction accuracies, but they can result in significant drawbacks with
regards to model deployment and portability. In these approaches both the embedding and
the property prediction steps are engrained in a Convolutional Neural Network (CNN). The
major caveat of using the former approach is that the dataset and CNN architecture cannot
be decoupled, and the embeddings are generated in-situ from the very specific dataset.
Therefore, the resulting convolutional embeddings are dataset specific and cannot be loaded
and used for other tasks. Moreover, in such approaches the development of the target QSAR
model requires the optimization and training of the CNN, which are very computationally
demanding tasks. As a consequence, these models lack portability, transferability and
scalability due to the in-situ generation of molecular descriptors dependent on the dataset of
origin and the absence of a finalized trained model which could be transferred to other
datasets. […]”
4. Line 128-129: Could the authors provide a few specific examples of the "nitrogen
aromatic compounds with implicit hydrogen bonds" mentioned here? In the current description,
it is unclear what these compounds are.
In cheminformatics we use SMILES strings and specific chemistry libraries (i.e. Python
toolkits such as RDKit) to generate molecular graphs representing molecules and that can be
read by machines. Kekulization is the process of assigning double bonds to a molecular graph.
SMILES can exist in kekulized and unkekulized forms, as reported in the following table:
Compound        Unkekulized SMILES   Kekulized SMILES
Benzimidazole   n2c1ccccc1nc2        C1=CC=C2C(=C1)NC=N2
Pyridine        c1ccncc1             C1=CC=NC=C1
Pyrrole         c1ccnc1              C1=CNC=C1
In SMILES notation the H atoms are implicit, and in the unkekulized notation double bonds are
not explicitly written/assigned. Normally, cheminformatics toolkits are able to reconstruct
molecular graphs from SMILES notations. However, for certain carbon and nitrogen aromatic
compounds they cannot recognize the allocation of hydrogens or aromatic bonds, because the
ambiguous notation allows different interpretations and therefore different molecular
structures. This is the case for the unkekulized notation of pyridine which is not recognized by
toolkits. The algorithms in cheminformatics toolkits fail to recognize these structures and
usually try to add hydrogens to satisfy aromaticity, therefore incurring valence errors caused
by the ambiguous structure. The same applies to the unkekulized form of pyrrole,
which the machine reads either as pyrrole (c1cc[nH]c1) or as the pyrrole radical (c1c[N]cc1).
The discussion on algorithms used by toolkits to interpret SMILES and on errors due to
ambiguous unkekulized aromatic species is beyond the scope of this study. To avoid confusion
for the reader we have changed the paragraph and eliminated the sentence:
“[…] Data cleaning is composed by three different steps: filtering of molecules that cannot be
recognized by RDKit (i.e., nitrogen aromatic compounds with implicit hydrogen bonds),
conversion of SMILES strings to their canonical form, and averaging over the target property
for compounds that have multiple entries with different Tg or Tm values. […]”
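The three cleaning steps quoted above (filter, canonicalize, average) can be sketched as follows. This is a hedged illustration rather than the authors' code: the function names `clean_dataset` and `rdkit_canonicalize` are hypothetical, and the canonicalizer is passed in as an argument so the averaging logic can be followed without RDKit installed.

```python
from collections import defaultdict
from statistics import mean


def rdkit_canonicalize(smiles):
    """Return the canonical SMILES, or None if RDKit cannot parse the string."""
    from rdkit import Chem  # imported lazily; only needed for this helper

    mol = Chem.MolFromSmiles(smiles)  # returns None for unparseable input
    return None if mol is None else Chem.MolToSmiles(mol)


def clean_dataset(records, canonicalize):
    """Filter unparseable SMILES, canonicalize the rest, average duplicates.

    records: iterable of (smiles, target_value) pairs, e.g. Tg or Tm in K.
    canonicalize: callable returning a canonical SMILES string or None.
    """
    grouped = defaultdict(list)
    for smiles, value in records:
        canonical = canonicalize(smiles)
        if canonical is None:      # step 1: drop molecules that cannot be parsed
            continue
        grouped[canonical].append(value)  # step 2: key on the canonical form
    # step 3: one averaged target value per unique compound
    return {smiles: mean(values) for smiles, values in grouped.items()}
```

With RDKit available, `clean_dataset(records, rdkit_canonicalize)` would apply the same steps to real SMILES strings.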
5. Line 137: What does the g value represent? Please provide the definition and
explanation. How was the g value of 0.7 determined?
The Tm and Tg of organic compounds are linked by a particular Structure Activity Relationship
(SAR) referred to as the Boyer-Kauzmann rule. By analysis of experimental data, Koop et al.
(2011) found empirically that for organic molecules g = 0.7. We clarify this point in the revised
manuscript.
“[…] Tg can be estimated from Tm using the structure-activity relationship known as the Boyer-Kauzmann rule: Tg = g • Tm with g as a constant to be 0.7 based on analysis by Koop et al.
(7). […]”
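As a small worked form of the rule quoted above, the estimate is a single multiplication. The function name `tg_from_tm` is hypothetical, and both temperatures are assumed to be in kelvin.

```python
BOYER_KAUZMANN_G = 0.7  # empirical constant for organic molecules (Koop et al., 2011)


def tg_from_tm(tm_kelvin, g=BOYER_KAUZMANN_G):
    """Estimate the glass transition temperature as Tg = g * Tm (both in K)."""
    if tm_kelvin <= 0:
        raise ValueError("Tm must be a positive absolute temperature in kelvin")
    return g * tm_kelvin
```

For example, a compound melting at 300 K would be assigned a Tg of about 210 K under this rule.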
6. Line 199-201: Can the authors provide more details for how the "best" Tg regression
model (tgBoost)" was determined? What are the considerations and criteria for the
determination, and how the parameters listed in Line 200-201 were finally selected?
The hyperparameters were selected through the grid search approach from scikit-learn. We
selected a range of plausible values for each different hyper-parameter (e.g. number_of_trees
= [10, 20, 50, 100, 250, 500, 1000]), and we have trained as many models as the possible
combinations of available hyperparameters (i.e. 54 000 models). Finally, we selected the best
regression model whose combination of hyperparameters provided the lowest error in the
nested cross-validation step. We have added a few sentences to the text to clarify how we
selected the best model:
“[…] This approach enables to reach the best trade-off between bias and variance by selecting
the best model parameters, while obtaining a true error estimation by accounting for a vast
number of possible data combinations and cross-validation. The RF regressor model is
implemented in Python using scikit-learn a library for scientific computing and machine
learning (44). The XGBoost regressor model is implemented via xgboost, a Python library
for optimized distributed gradient boosting model development (45). The hyperparameters of
our regressor models were selected through a grid search approach: we selected a range of
plausible values for each hyper-parameter (e.g., estimators, maximum depth of trees,
learning rate, etc.) and we have trained as many models as the possible combinations of
available hyperparameters. Finally, we selected the best Tg regression model whose
combination of hyperparameters provided the lowest error in the nested cross-validation
step. The best Tg regression model (tgBoost) developed via a XGBoost regressor and through
the nested cross-validation is composed by: 100 estimators, a maximum depth of 9 trees, a
learning rate of 0.1, and a g equal to 30. […]”
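The exhaustive search described in this response can be sketched with the standard library alone. This is a stand-in for the scikit-learn grid search the authors mention, not their code: `grid_search` and `toy_error` are hypothetical names, and `toy_error` replaces the nested cross-validation error that the real search would compute for each trained model.

```python
from itertools import product


def grid_search(param_grid, error_fn):
    """Evaluate every hyperparameter combination and keep the lowest error.

    param_grid: dict mapping a hyperparameter name to its candidate values.
    error_fn: callable taking a dict of hyperparameters and returning an error.
    """
    names = sorted(param_grid)
    best_params, best_error, n_models = None, float("inf"), 0
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        error = error_fn(params)
        n_models += 1
        if error < best_error:
            best_params, best_error = params, error
    return best_params, best_error, n_models


def toy_error(params):
    """Hypothetical error surface; a real run would return a cross-validated error."""
    return (abs(params["n_estimators"] - 100)
            + abs(params["max_depth"] - 9)
            + abs(params["learning_rate"] - 0.1))


toy_grid = {
    "n_estimators": [10, 100, 1000],
    "max_depth": [3, 9],
    "learning_rate": [0.01, 0.1],
}
best_params, best_error, n_models = grid_search(toy_grid, toy_error)
```

The number of models grows as the product of the candidate list lengths (here 3 x 2 x 2 = 12), which is how a handful of hyperparameters with several values each reaches tens of thousands of trained models.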
7. Line 220-224: Similarly, can the authors please provide more details for the
determination of the "best DNN model" and how the parameters listed in Line 221-224 were
finally selected?
Similar process as reported in Q6, but optimized for the Neural Network architecture (e.g.
number of hidden layers, number of activation nodes, optimizer, learning rate, activation
function, number of nodes in hidden layers, dropout, training batches etc.). We have added a
few sentences to the text to clarify how we selected the best DNN architecture:
“[…] The training sets have been further divided into 10 folds and model parameters have been
selected (i.e., hyperparameters tuning) using iteratively 9 folds for training and one fold for
validation. Similar to the process used to select the best regressor model for Tg, we have used
a grid search approach to select the best hyperparameter for Tm regression. We selected a
range of plausible values for each hyper-parameter (e.g., number of nodes in the activation
layer, number of hidden layers, optimizer, number of nodes in hidden layers, activation
functions, dropout, learning rate, etc.) and we have trained as many models as the possible
combinations of available hyperparameters. The parameters for the best Tm model are the
ones reporting the lowest average error on the 10 K-fold cross-validation step. […]”
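The 10-fold procedure quoted above (nine folds for training, one for validation, errors averaged over the folds) can be sketched as follows. It is a simplified illustration under stated assumptions: the names `k_fold_indices` and `cross_val_mae` are hypothetical, the folds are contiguous and unshuffled, and `fit` and `predict` are placeholders for whatever model is being tuned.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    if n_samples < k:
        raise ValueError("need at least one sample per fold")
    # Spread the remainder over the first folds so sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


def cross_val_mae(xs, ys, fit, predict, k=10):
    """Average the per-fold mean absolute error over k train/validation splits."""
    fold_errors = []
    for train, validation in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        fold_mae = sum(
            abs(predict(model, xs[i]) - ys[i]) for i in validation
        ) / len(validation)
        fold_errors.append(fold_mae)
    return sum(fold_errors) / len(fold_errors)
```

In a hyperparameter search, this averaged error would be computed once per candidate configuration, and the configuration with the lowest value would be kept.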
8. Line 352-354: To support this statement, an explicit quantitative analysis based on the
data presented in Figure 6 would be helpful (i.e., including quantitative information on the Tg
increment for n >= 6 versus n < 6 cases).
A figure comparing tgBoost predictions between ethers and alkanes has been added to SI. The
text has been changed to refer to the additional information provided in SI:
“[…] Our results suggest that for weakly functionalized compounds the addition of an ether
functional group to the alkyl chain can strongly increase the Tg of a molecule, particularly for
smaller compounds with n < 6 (see Fig. S4 in SI). […]”
Figure S4: Estimated Tg of alkanes and ethers as a function of the number of carbon atoms within the
molecule. The functional groups in ethers are positioned at the end of the alkyl chain.
9. Page 19 Figure 8: The "shaded pink square" (Line 421) and the "shaded orange square"
(Line 432) are not visible on the figure.
The figures have been corrected to include the shaded orange and pink squares.
Anonymous Referee #2
Comments to the Author
Summary:
This manuscript by Galeazzo et al. and Shiraiwa used two machine learning algorithms to
predict the glass transition temperatures and melting temperatures of atmospheric relevant
organic compounds. The results show that one of the models, tgBoost, performed better than
the other model with a mean absolute error (MAE) of 18.3 K. In addition, the study also shows
the influence of different functional groups on the glass transition temperature of organic
compounds, and these results agree with a previous study. Overall, the manuscript is clear and
the results are informative. It is nice to see that the authors use machine learning to try to predict
the glass transition temperature, an important parameter to describe aerosol phase state.
However, there are a few typographic errors in the manuscript and the methods sections need
improvements and clarification. I suggest the manuscript undergo a major revision to address
these issues.
Major Comments:
I enjoyed reading this manuscript and it is nice to know that the authors brought new ideas and
methods (machine learning) into the modeling of atmospheric chemistry by using machine
learning algorithms. This aspect of the manuscript is definitely refreshing to read and I believe
the authors did an overall good job with it. My major comments are the methods and data
interpretation using the ML algorithm. I believe after addressing these comments the
manuscript would be a really interesting and solid paper for many readers.
[Figure: tgBoost estimated Tg (K) as a function of the number of carbon atoms, for alkanes and ethers.]
We thank Referee #2 for the review and positive evaluation of our manuscript.
A1. The first major comment is the machine learning method used here. The tgBoost
method was published in 2004 and current machine learning research has moved on from
tgBoost to other methods. I wonder whether the authors pick tgBoost over a few more modern
ML methods?
The Extreme Gradient Boosting algorithm (XGBoost) is a gradient boosting implementation
developed by Tianqi Chen and Carlos Guestrin in 2016. It is a relatively recent algorithm and an
important improvement over previous Gradient Boosting Method (GBM) algorithms. It is
designed to be both computationally efficient (e.g. fast to execute) and highly accurate and
powerful. Since its introduction in 2016 the XGBoost algorithm has been gaining large traction
in the ML community due to its effectiveness in developing robust classification and regression
models. GBMs are commonly used in QSAR/QSPR studies due to the relatively low number of
hyperparameters to be tuned during optimization. Moreover, GBM algorithms are less prone
to overfitting with small datasets than neural networks. Therefore, XGBoost was
chosen for its reliability, accuracy and computational efficiency while
developing a nested cross-validation model.
We have implemented the following changes in the text to highlight why we focused our
attention on Gradient Boosting Method algorithms, notably on XGBoost:
“[…] We apply two different model architectures, Random Forest (RF) and Extreme Gradient
Boosting Method (XGBoost), to develop Tg regressor models based on the Tg-measured
dataset. We focused our investigations on Gradient Boosting Method (GBM) algorithms due
to the relatively easy training process, and their high rate of success in both regression and
classification tasks in QSAR/QSPR studies (33,34). Notably, XGBoost is a recent gradient
boosting implementation developed by Chen and Guestrin (2016) (45) with important
improvements over previous GBM algorithms. It is designed to be both computationally
efficient (e.g. fast to execute) and highly accurate and powerful. Since its introduction in 2016
the XGBoost algorithm has been gaining large traction in the ML community due to its
effectiveness in developing robust classification and regression models. The performance
metrics employed to evaluate the regression tasks include mean absolute error (MAE), mean
squared error (MSE), and coefficient of determination (R²_CV). […]”
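For reference, the three metrics named in the quoted passage can be written out as below. This is a sketch of the standard textbook definitions in plain Python; it assumes that the cross-validated coefficient of determination is the ordinary R² evaluated on held-out predictions, which the response does not spell out.

```python
def mean_absolute_error(y_true, y_pred):
    """MAE: average magnitude of the prediction error (same unit as the target, here K)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """MSE: average squared prediction error (unit K^2), penalizing large misses more."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

Applied to held-out folds only, these functions give the cross-validated scores; applied to the training data they would overstate performance.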
A2. Also, I am curious whether the amount of the training dataset is enough to be able to
accurately predict the glass transition temperature results. As the authors have shown in Table
1, there are only 298 entries of species that were used for the machine learning algorithm. Is
the number of entries enough, especially when compared with thousands of entries that
machine learning algorithms usually need to generate accurate results? When it comes down
to certain functional groups, such as tertiary alcohols, diols, or triols, there are only 1 or 2 data
points to train the machine learning algorithm, which may not be enough to predict the glass
transition temperature of these types of compounds. I do not think the author should make the
predictions of the above type of compounds that have limited data (Figure 4-5). Instead, it
might be better to state out the limitation of the ML algorithm in predicting compounds of
certain types of functional groups to avoid future misuse by the readers.
We agree that the low number of data points as constrained by available dataset is not optimal.
Machine learning algorithms are powerful tools for statistical inference and the rule of thumb
is that the more the data the better the quality of prediction and model. This is particularly true
for more complex models and architectures, such as deep neural networks. However, during
machine learning model development true success is achieved when the model optimization
leads to the right balance between bias and variance. This equilibrium is obtained when a model
can estimate accurate predictions, while concurrently it can minimize the error on unseen data.
There are various techniques that can be used to reach this objective, and cross-validation
techniques are among the most accurate. Notably, nested cross-validation is among the best
and most complex cross-validation techniques in circulation, and it is suitable for datasets
composed of <300 data points. tgBoost is a very strong model because it can reproduce
experimental data, but mostly because the 18.3 K error has been estimated by thorough and
iterative splits and recombination of data in train/test/validation datasets during its
development.
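The structure of a nested cross-validation can be sketched as follows: an inner loop selects hyperparameters, and an outer loop estimates the error on data that played no part in that selection. This is a minimal standard-library illustration, not the tgBoost code. The k-nearest-neighbour regressor is a stand-in for XGBoost, and the fold counts, seeds and function names are assumptions made for the example.

```python
import random
import statistics


def make_folds(indices, n_folds, seed=0):
    """Shuffle indices and deal them into n_folds roughly equal folds."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_folds] for i in range(n_folds)]


def knn_predict(train, x, k):
    """Predict y at x as the mean y of the k nearest training points (1-D feature)."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return statistics.mean(y for _, y in nearest)


def mae(train, test, k):
    """Mean absolute error of the k-NN stand-in model on the test points."""
    return statistics.mean(abs(knn_predict(train, x, k) - y) for x, y in test)


def nested_cv(data, k_candidates, n_outer=5, n_inner=4):
    """Return the MAE of each outer fold; hyperparameters are tuned on inner folds only."""
    outer_scores = []
    all_idx = range(len(data))
    for outer_test in make_folds(all_idx, n_outer, seed=1):
        held_out = set(outer_test)
        outer_train = [i for i in all_idx if i not in held_out]
        inner_folds = make_folds(outer_train, n_inner, seed=2)

        def inner_error(k):
            # Average validation error of candidate k over the inner folds.
            errors = []
            for val_idx in inner_folds:
                val_set = set(val_idx)
                fit = [data[i] for i in outer_train if i not in val_set]
                val = [data[i] for i in val_idx]
                errors.append(mae(fit, val, k))
            return statistics.mean(errors)

        best_k = min(k_candidates, key=inner_error)
        train = [data[i] for i in outer_train]
        test = [data[i] for i in outer_test]
        # The outer test fold was never seen during hyperparameter selection.
        outer_scores.append(mae(train, test, best_k))
    return outer_scores
```

Averaging the returned outer-fold scores gives an error estimate of the kind quoted above; the final model can then be retrained on all data, since the estimate does not depend on any single held-out split.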
We agree that the model would benefit greatly from more datapoints during its development,
and we encourage the community to address this problem by collecting more experimental
measurements of Tg of atmospherically-relevant organic molecules. The model is wrapped in
a Python package to be released to the public (via GitHub), and its pipeline enables quick
updates by allowing retraining on new additional data. Therefore, we recognize the limitations
posed by the low number of experimental measurements, but we believe tgBoost has accurate
predictive power, error estimation and potential for improvements in the future.
We believe that it is useful to include Tg predictions for tertiary alcohols, diols and triols in
Fig. 4-5 to show how tgBoost compares to our compositional parametrization. We do not claim
precise predictions for these compounds, but we think it is important to stress how tgBoost can
resolve different predictions for compositional isomers, which is one of the major
developments reached by this ML model. Both Fig. 4 and Fig. 5 act as proof-of-concept
demonstrations of the capabilities of the model to predict different Tg due to structural
differences. In Fig. 4 the main observation is that, despite the same chemical composition,
tgBoost can predict different values for simple mono-alcohols, as opposed to the compositional
parametrization. In Fig. 5 we stress that tgBoost can predict larger Tg with the addition of -OH
groups to the alkane chain, similarly to the compositional parametrization and previous studies
(Rothfuss and Petters, 2017). Nevertheless, we agree that more data are needed to gain a better
insight into tgBoost performance in predicting Tg of triols and diols, or isomeric alcohols, and
we stress that our predictions come with a MAE of 18.3 K. To address your comment, we have
added a paragraph discussing the limitations of the model at Line 341:
“[…] These results should motivate future studies to adopt molecular embeddings and
machine learning algorithms to develop predictive models of organic molecules. Note that
there are limitations to be accounted for when using models like tgBoost. As reported in Fig. 4-5
the model can discern among compositional isomers of simple mono-alcohols (i.e. primary,
secondary and tertiary alcohols) and it can simulate the increase in Tg by each -OH addition
observed by the compositional parametrization and the few available experimental data
points of diols and triols. However, tgBoost should be used with caution when working
with these compound classes: as there is a very limited amount of observational data for
tertiary alcohols, diols and triols, tgBoost might neglect possible trends in Tg for those
molecular classes. Therefore, when possible, it would be good to compare tgBoost
predictions with other Tg QSAR models developed on different datasets and physical
properties. These limitations underline the importance of collecting more experimental data
on Tg of atmospherically-relevant organic molecules.
We have conducted further proof-of-concept investigations on the performance of tgBoost
in distinguishing other compositional isomers and on the effects on Tg due to the addition of
carboxylic groups to an alkyl chain. Fig.6 […]”
B. The second comment is the representativity of the training dataset over the testing
dataset. The training entries need to be representative of the final testing entries for ML results
to be accurate. It looks like the entries used to predict the melting temperature of organic
compounds (Tetko et al.) often consist of pharmaceutical compounds that have heavy elements,
which are not representative of the atmospheric molecules that the authors are trying to predict.
Maybe the author should screen the Tetko et al. entries and remove these non-representative
species when training the ML algorithm.
We developed two regression models for Tm prediction trained on the Tetko and Bradley
datasets (see SI for more information on the former) and by using Deep Neural Network
architectures. We have performed initial screening to delete most of the heavier compounds
from the dataset (MW > 600) and to include only compounds with H, C, O, N, S, Cl, and Br
atoms. As a result, neither dataset includes organo-metallics, while both consider organics
which could have some environmental and toxicological activity. We clarify this information in the
revised manuscript:
“[…] Data cleaning consists of three different steps: filtering of molecules that cannot be
recognized by RDKit, conversion of SMILES strings to their canonical form, and averaging
over the target property for compounds that have multiple entries with different Tg or Tm values.
We have performed initial screening to delete most of the heavier compounds from the
datasets (MW > 600) and to include only compounds with H, C, O, N, S, Cl, and Br atoms.
[…]”
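The screening and averaging steps quoted above can be sketched as follows. This is a standard-library illustration under stated assumptions: it presumes that RDKit parsing and SMILES canonicalization have already been done upstream, that each record carries a molecular formula and molar mass, and that the 600 threshold is in g/mol. The record layout and function names are hypothetical, not the authors' code.

```python
import re
from collections import defaultdict

ALLOWED_ELEMENTS = {"H", "C", "O", "N", "S", "Cl", "Br"}
MAX_MOLAR_MASS = 600.0  # threshold quoted in the response; unit assumed to be g/mol


def elements_in(formula):
    """Return the set of element symbols in a molecular formula such as 'C2H6O'."""
    return set(re.findall(r"[A-Z][a-z]?", formula))


def clean(records):
    """Filter records and average the target value over duplicate SMILES.

    Each record is a (canonical_smiles, formula, molar_mass, value) tuple,
    where value is the measured Tg or Tm in K.
    """
    grouped = defaultdict(list)
    for smiles, formula, molar_mass, value in records:
        if molar_mass > MAX_MOLAR_MASS:
            continue  # drop heavy compounds
        if not elements_in(formula) <= ALLOWED_ELEMENTS:
            continue  # drop compounds containing any other element
        grouped[smiles].append(value)
    # Compounds with several entries get the mean of their reported values.
    return {smiles: sum(values) / len(values) for smiles, values in grouped.items()}
```

Averaging after filtering keeps one target value per canonical SMILES, so the same compound cannot appear in both the training and the validation folds.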
C. Thirdly, when the author evaluated the effectiveness of the ML algorithm, the testing
compounds were also in the training datasets (lines 263-266). My understanding is that the
training datasets and the testing datasets should be different groups of data to avoid double
counting the results and over-estimating the effectiveness of the model (Figure 5-7). The author
should probably exclude some data from the training sets and use those data only in the testing
datasets to test how accurate the ML algorithm is.
We have addressed the limitations of tgBoost in predicting Tg values of diols, triols and
isomeric alcohols by adding a paragraph at line 341 (original text). We have also addressed
why we are using all the available data to train our final model and how the error has been
estimated rigorously via a nested cross-validation (please see our detailed response for A2).
D. The last major comment is method comparison. The author concludes that the ML
algorithm outperforms the elemental composition model that the group developed earlier. Li et
al. (2019) already shows that the O:C ratio model and the volatility models perform better than
the elemental ratio model based on average absolute value of the relative error (AAVRE). I
wonder whether the author can compare the results from the ML results with the O:C model
and the two volatility models to discuss the pros and cons of each model regarding AAVRE or
MAE.
We have added a new section in the Supporting Information to compare the performances of
the model by Li et al. (2020) and tgBoost:
1.2 Comparison of tgBoost predictions to Tg estimations from O:C and C0
Li et al. (3) developed Tg parameterizations based on molecular O:C ratio and C0. They
compared their Tg predictions with Tg estimated through the Boyer-Kauzmann rule on Tm
from EPI Suite for compounds in the dataset from Shiraiwa et al. (4). Their estimated
absolute mean percentage relative error (MPE) (i.e., defined as AAVRE in their study) is 6%
with R = 0.96. We have compared the Tg predicted by the tgBoost model with Tg estimated
through the Boyer-Kauzmann rule (2) on Tm from EPI Suite for compounds in the dataset
from Shiraiwa et al. (4). Our MPE is 10.6% with R = 0.8. Note that the C0 used by Li et al.
(3) was evaluated using the EVAPORATION model by Compernolle et al. (5), and that the
Tg values estimated from Tm evaluated by EPI Suite were based on the MPBPWIN model.
Both MPBPWIN and EVAPORATION are QSAR models developed on a mix of
experimental measurements and model predictions, and they use chemical species boiling
points to build their QSAR. EVAPORATION and MPBPWIN use a combination of different
estimation methods, but they are both using derivations of the Antoine equation. As a result,
both MPBPWIN and EVAPORATION have predictions strongly correlated to boiling point
values. These approaches might introduce a correlation bias based on the similar estimation
methods and linked to the same variable implicitly used for the prediction. As a result, even
if C0 is suited to predict Tg, there might be a correlation bias to account for when comparing the
estimated MPE of the two methods in relation to the Tg estimated from Tm evaluated by EPI
Suite.
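The comparison described in this section can be sketched in a few lines. The sketch assumes that the Boyer-Kauzmann rule is applied as Tg = g · Tm with g = 0.7 (the response does not state the value used), and that MPE is the mean absolute relative deviation expressed in percent; both are assumptions made here, and the function names are illustrative.

```python
import math

BOYER_KAUZMANN_G = 0.7  # assumed Tg/Tm ratio; the value used by the authors is not given


def tg_from_tm(tm_kelvin, g=BOYER_KAUZMANN_G):
    """Estimate the glass transition temperature (K) from the melting point (K)."""
    return g * tm_kelvin


def mean_percentage_error(reference, predicted):
    """Mean absolute relative deviation of predicted from reference, in percent."""
    total = sum(abs(p - r) / r for r, p in zip(reference, predicted))
    return 100.0 * total / len(reference)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)
```

Because the reference Tg values here are themselves model output (Tm from EPI Suite scaled by g), a low MPE measures agreement with that estimate rather than accuracy against experiment, which is the caveat raised in the text above.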
Minor Comments:
Line 53-55: The author should probably include that Rothfuss and Petters (19) also used
functional groups to predict Tg and that Zhang et al. (12) also used saturation mass
concentration to predict Tg.
We have changed the sentence with addition of the suggested references:
“[…] To fill such measurement gaps, various Tg parametrizations have been developed based
on molecular properties including molar mass and atomic oxygen to carbon ratio (13),
saturation mass concentration (14, 15), and elemental composition (e.g., number of C, H, N,
O, S) (8,14). Moreover, Rothfuss and Petters (19) have introduced an empirical group
contribution estimation based on functional groups presence within a molecule. Their
results suggest functionalization is a crucial predictive parameter for molecular Tg. […]”
Line 58: I suggest including a few citations to show another thermodynamic model (Octaviani
et al 2019) and chemical transport models also include the viscosity of SOA (Shrivastava et
al., 2017; Schmedding et al., 2019, 2020)
Following your comment, we have changed the sentence to add the suggested references:
“[…] These parameterizations are simple, practical, and versatile prediction methods, which
have been applied to estimate the viscosity of SOA for high-resolution mass spectrometry
(8,16-18) and also implemented into thermodynamic (19,20), gas-phase chemistry models (9)
and chemical transport models (5,21,22). […]”
20. Octaviani, M.; Shrivastava, M.; Zaveri, R. A.; Zelenyuk, A.; Zhang, Y.; Rasool,
Q. Z.; Bell, D. M.; Riva, M.; Glasius, M.; Surratt, J. D., Modeling the Size Distribution
and Chemical Composition of Secondary Organic Aerosols during the Reactive Uptake of
Isoprene-Derived Epoxydiols under Low-Humidity Condition. ACS Earth and Space
Chemistry 2021, 5 (11), 3247-3257.
21. Schmedding, R.; Ma, M.; Zhang, Y.; Farrell, S.; Pye, H. O. T.; Chen, Y.; Wang,
C.-t.; Rasool, Q. Z.; Budisulistiorini, S. H.; Ault, A. P.; Surratt, J. D.; Vizuete, W.,
α-Pinene-Derived organic coatings on acidic sulfate aerosol impacts secondary organic
aerosol formation from isoprene in a box model. Atmos. Environ. 2019, (213), 456-462.
22. Schmedding, R.; Rasool, Q. Z.; Zhang, Y.; Pye, H. O. T.; Zhang, H.; Chen,
Y.; Surratt, J. D.; Lopez-Hilfiker, F. D.; Thornton, J. A.; Goldstein, A. H.; Vizuete, W.,
Predicting secondary organic aerosol phase state and viscosity and its effect on multiphase
chemistry in a regional-scale air quality model. Atmos. Chem. Phys. 2020, 20 (13), 8201-
8225.
Figure 4: why was the prediction for the tertiary alcohol not shown?
The prediction for tertiary alcohols made by tgBoost is already included in Fig.4 as a blue line.
Figure 6: in the caption, the letters referring to each panel were out of order.
Letters and the caption have been corrected to:
“Figure 6: Estimated Tg of weakly functionalized isomeric molecules by functional group as a
function of the number of carbon atoms within the molecule. Isomers are grouped as a) esters
and ketones, b) alcohols and aldehydes, and c) ethers and carboxylic acids. The respective
functional groups are positioned at the end of the alkyl chain for all species.”
Line 421 & 432: the pink and orange boxes were not included in Figure 8. Also, why would
author not exclude these types of data with known large errors in the ML training dataset? I
think by excluding these types of data, the ML results will improve.
The figures have been corrected to include the shaded orange and pink squares.
Fig. 8 shows the correlation plots between a) Tg predicted by tgBoost and Tg estimated from
the melting points predicted by EPI Suite; b) Tg predicted by the compositional parametrization
and Tg estimated from the melting points predicted by EPI Suite, for molecules from Shiraiwa
et al. (2014) dataset. Note that, in order to develop the tgBoost model we have only used values
from Koop’s dataset. As a result, the large deviations indicate only the large divergences
between EPI Suite and tgBoost. As highlighted in response 2 for Reviewer #1, EPI Suite tends
to overestimate the values for highly functionalized compounds and for small and large
molecules. This is reflected in the “banana shaped” distribution of data along the correlation
line. Our results confirm EPI Suite limitations and suggest that tgBoost might be more accurate
in predicting Tg of large and small compounds, and of molecules with a high
number of functional groups.
References:
Shrivastava, M.; Lou, S.; Zelenyuk, A.; Easter, R. C.; Corley, R. A.; Thrall, B. D.; Rasch,
P. J.; Fast, J. D.; Massey Simonich, S. L.; Shen, H.; Tao, S., Global long-range transport and
lung cancer risk from polycyclic aromatic hydrocarbons shielded by coatings of organic
aerosol. Proc. Natl. Acad. Sci. USA 2017, 114 (6), 1246-1251.
Octaviani, M.; Shrivastava, M.; Zaveri, R. A.; Zelenyuk, A.; Zhang, Y.; Rasool, Q. Z.; Bell,
D. M.; Riva, M.; Glasius, M.; Surratt, J. D., Modeling the Size Distribution and Chemical
Composition of Secondary Organic Aerosols during the Reactive Uptake of Isoprene-Derived
Epoxydiols under Low-Humidity Condition. ACS Earth and Space Chemistry 2021, 5 (11),
3247-3257.
Schmedding, R.; Ma, M.; Zhang, Y.; Farrell, S.; Pye, H. O. T.; Chen, Y.; Wang, C.-
t.; Rasool, Q. Z.; Budisulistiorini, S. H.; Ault, A. P.; Surratt, J. D.; Vizuete, W.,
α-Pinene-Derived organic coatings on acidic sulfate aerosol impacts secondary organic aerosol formation
from isoprene in a box model. Atmos. Environ. 2019, (213), 456-462.
Schmedding, R.; Rasool, Q. Z.; Zhang, Y.; Pye, H. O. T.; Zhang, H.; Chen, Y.; Surratt, J.
D.; Lopez-Hilfiker, F. D.; Thornton, J. A.; Goldstein, A. H.; Vizuete, W., Predicting
secondary organic aerosol phase state and viscosity and its effect on multiphase chemistry in a
regional-scale air quality model. Atmos. Chem. Phys. 2020, 20 (13), 8201-8225.
Additional References
Coley CW, Barzilay R, Green WH, Jaakkola TS, Jensen KF. Convolutional Embedding of
Attributed Molecular Graphs for Physical Property Prediction. J Chem Inf Model.
2017;57(8):1757–72.
Compernolle, S., Ceulemans, K., and Müller, J.-F.: EVAPORATION: a new vapour pressure
estimation method for organic molecules including non-additivity and intramolecular
interactions, Atmos. Chem. Phys., 11, 9431–9450, https://doi.org/10.5194/acp-11-9431-2011,
2011.
Duvenaud, D. K., Maclaurin, D., Aguilera-Iparraguirre, J., Gomez-Bombarelli, R., Hirzel, T.,
Aspuru-Guzik, A., Adams, R. P. Convolutional Networks on Graphs for Learning Molecular
Fingerprints. NIPS 2015, 2215−2223.
EPA U. Estimation Programs Interface Suite™ for Microsoft Windows v4.1.1.
Washington, DC, USA: United States Environmental Protection Agency; 2017.
Jain, A., Yalkowsky, S. H. Estimation of melting points of organic compounds‐II. J. Pharm.
Sci. 2006, 95, 2562–2618
Kearnes, S., McCloskey, K., Berndl, M., Pande, V., Riley, P. Molecular Graph Convolutions:
Moving Beyond Fingerprints. arXiv preprint arXiv:1603.00856, 2016.

Round 2

Revised manuscript submitted on 05 Mar 2022

Editor’s decision letter

22-Mar-2022

Dear Dr Shiraiwa:

Manuscript ID: EA-ART-10-2021-000090.R1
TITLE: Predicting Glass Transition Temperature and Melting Point of Organic Compounds via Machine Learning and Molecular Embeddings

Thank you for your submission to Environmental Science: Atmospheres, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

After careful evaluation of your manuscript and the reviewers’ reports, I will be pleased to accept your manuscript for publication after revisions.

Please revise your manuscript to fully address the reviewers’ comments. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Please submit your revised manuscript as soon as possible using this link:

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/esatmos?link_removed

(This link goes straight to your account, without the need to log in to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/esatmos) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process. We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

Environmental Science: Atmospheres strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy from CASRAI, https://casrai.org/credit/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines http://www.rsc.org/journals-books-databases/journal-authors-reviewers/author-responsibilities/ for more information.

I look forward to receiving your revised manuscript.

Yours sincerely,
Dr Nønne Prisle
Associate Editor, Environmental Sciences: Atmospheres

************

Reviewer comments

Reviewer 2

The author made significant improvements on revised version of the manuscript and addressed most of my comments. There is one minor comment the author did not address regarding Figure 4. After addressing this one the manuscript should be fully ready to be published.

In Figure 4, the author showed the prediction of tertiary alcohols. The smallest tertiary alcohol should be tert-butyl alcohol, which has 4 carbon atoms. But it was not shown in Figure 4. Is there any reason why the Tg prediction of the tert-butyl alcohol was not shown?

In addition, there is one experimental data of tertiary alcohol by Koop et al., shown as having 2 carbon atoms in Figure 4. But the smallest tertiary alcohol should have at least 4 carbon atoms. Please correct this error.

I also do not understand why the prediction of primary alcohol starts at 3 carbon atoms, instead of 1 carbon atom in Figure 4. Can the author either include the alcohols with less than 3 carbon atoms, or explain why? Thank you.

Author response

This text has been copied from the PDF response to reviewers and does not include any figures, images or special characters.

REVIEWER REPORT(S):
Referee: 2

Comments to the Author
The author made significant improvements on revised version of the manuscript and addressed most of my comments. There is one minor comment the author did not address regarding Figure 4. After addressing this one the manuscript should be fully ready to be published.

In Figure 4, the author showed the prediction of tertiary alcohols. The smallest tertiary alcohol should be tert-butyl alcohol, which has 4 carbon atoms. But it was not shown in Figure 4. Is there any reason why the Tg prediction of the tert-butyl alcohol was not shown?

In addition, there is one experimental data of tertiary alcohol by Koop et al., shown as having 2 carbon atoms in Figure 4. But the smallest tertiary alcohol should have at least 4 carbon atoms. Please correct this error.

I also do not understand why the prediction of primary alcohol starts at 3 carbon atoms, instead of 1 carbon atom in Figure 4. Can the author either include the alcohols with less than 3 carbon atoms, or explain why? Thank you.
************

We thank Referee #2 for the review and positive evaluation of our revised manuscript and for pointing out our error in Figure 4.

In Figure 4 we compare the Tg predictions of the developed tgBoost model and the compositional parametrization to experimental Tg values of molecules in Koop’s dataset. In order to maintain a level of structural consistency between primary, secondary and tertiary alcohols, we show only the predictions of mono-alcohols whose carbon skeleton is a linear alkane chain: primary alcohols have a terminal -OH group, and secondary and tertiary alcohols have the -OH group placed on the second and third carbon atoms of the alkyl chain, respectively. As a result, the Tg prediction for tert-butyl alcohol, the smallest tertiary alcohol from Koop’s dataset, is not shown in Figure 4. Following the latter assumption, the smallest tertiary alcohol would be 3-pentanol (i.e. CCC(O)CC), whose carbon chain is composed of 5 carbon atoms. We would like to stress that Koop’s dataset includes more triol and diol species than the ones addressed in Figure 4, and that these species were included during model training but are not shown in our qualitative analysis since they represent more complex branching carbon species. The same approach has been taken for other species shown in Figures 6 and 7, whose functional groups are always placed at the end of the corresponding linear alkyl chain.

Following this comment, we took this opportunity to carefully check all our original data and figures, and we have updated Figures 4, 5 and 6 and the respective captions. In Figure 4 we have included the correct values of linear tertiary alcohols, and added two data points to primary alcohols corresponding to methanol and ethanol. We have also updated Figure 5 to include three additional data points for triols, one additional data point for diols, and two additional data points for methanol and ethanol for mono-alcohols. In Figure 6 we have corrected the captions within the different panels, and regrouped isomeric species by elemental compositions. We have also updated the main text at lines 398-404 to correct our initial error where we mislabeled ethers as esters and vice versa:

“[…] It illustrates the Tg predictions as a function of the number of carbon atoms of the species in compositional isomers grouped as a) ethers and alcohols, b) ketones and aldehydes, and c) esters and carboxylic acids, with the respective functional groups positioned at the end of the alkyl chain. Our results suggest the following trend for sensitivity of Tg to functional group addition: -COOH (carboxylic acid) > -C(=O)OR (ester) ≈ -OH (alcohol) > -C(=O) (carbonyl) ≈ -COR (ether) where the carbonyl group category comprises -C(=O)R (ketone) and -C(=O)H (aldehyde). […]”

And we have corrected the text at lines 410-416:

“[…] This effect may be due to conformational effects resulting from the addition of an alkoxycarbonyl group, which would induce lower flexibility in the aliphatic component and a lower degree of transformation between trans- and cis- stereoisomers in the carbon chain. It is expected that ketones and aldehydes have a relatively lower Tg compared to alcohols, and carboxylic acids as there are no functional groups that may be involved directly in hydrogen bonds in the bulk phase of pure materials. […]”

Accordingly, we have changed the text at lines 30-31:

“[…] -COOH (carboxylic acid) > -C(=O)OR (ester) ≈ -OH (alcohol) > -C(=O)R (ketone) ≈ -COR (ether) ≈ -C(=O)H (aldehyde). […]”

And at lines 577-578:

“[…]-COOH (carboxylic acid) > -C(=O)OR (ester) ≈ -OH (alcohol) > -C(=O) (carbonyl) ≈ -COR (ether). […]”

Finally, we have corrected Fig. S4 and the corresponding caption in the Supporting Information to correct for the same mislabeling error.

Round 3

Revised manuscript submitted on 25 Mar 2022

Editor’s decision letter

02-Apr-2022

Dear Dr Shiraiwa:

Manuscript ID: EA-ART-10-2021-000090.R2
TITLE: Predicting Glass Transition Temperature and Melting Point of Organic Compounds via Machine Learning and Molecular Embeddings

Thank you for submitting your revised manuscript to Environmental Science: Atmospheres. I am pleased to accept your manuscript for publication in its current form. I have copied any final comments from the reviewer(s) below.

You will shortly receive a separate email from us requesting you to submit a licence to publish for your article, so that we can proceed with the preparation and publication of your manuscript.

You can highlight your article and the work of your group on the back cover of Environmental Science: Atmospheres. If you are interested in this opportunity please contact the editorial office for more information.

Promote your research, accelerate its impact – find out more about our article promotion services here: https://rsc.li/promoteyourresearch.

We will publicise your paper on our Twitter account @EnvSciRSC – to aid our publicity of your work please fill out this form: https://form.jotform.com/211263048265047

How was your experience with us? Let us know your feedback by completing our short 5 minute survey: https://www.smartsurvey.co.uk/s/RSC-author-satisfaction-energyenvironment/

Thank you for publishing with Environmental Science: Atmospheres, a journal published by the Royal Society of Chemistry – connecting the world of science to advance chemical knowledge for a better future.

With best wishes,

Dr Nønne Prisle
Associate Editor, Environmental Sciences: Atmospheres

Reviewer comments

Reviewer 2

The author addressed all the issues the reviewer raised and the manuscript is suitable to be published.

Transparent peer review

To support increased transparency, we offer authors the option to publish the peer review history alongside their article. Reviewers are anonymous unless they choose to sign their report.

We are currently unable to show comments or responses that were provided as attachments. If the peer review history indicates that attachments are available, or if you find there is review content missing, you can request the full review record from our Publishing customer services team at RSC1@rsc.org.

Find out more about our transparent peer review policy.

Content on this page is licensed under a Creative Commons Attribution 4.0 International license.
