From the journal Digital Discovery Peer review history

Repurposing quantum chemical descriptor datasets for on-the-fly generation of informative reaction representations: application to hydrogen atom transfer reactions

Round 1

Manuscript submitted on 13 Feb 2024
 

12-Mar-2024

Dear Dr Stuyver:

Manuscript ID: DD-ART-02-2024-000043
TITLE: Repurposing Quantum Chemical Descriptor Datasets for on-the-Fly Generation of Informative Reaction Representations: Application to Hydrogen Atom Transfer Reactions

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

After careful evaluation of your manuscript and the reviewers’ reports, I will be pleased to accept your manuscript for publication after revisions.

Please revise your manuscript to fully address the reviewers’ comments. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy, https://credit.niso.org/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link:

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log in to the system. For your account security you should not share this link with others.)

Alternatively, you can log in to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process. We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Dr Joshua Schrier
Associate Editor, Digital Discovery

************


 
Reviewer 1

The article "Repurposing Quantum Chemical Descriptor Datasets for on-the-Fly Generation of Informative Reaction Representations: Application to Hydrogen Atom Transfer Reactions" describes the use of graph neural networks to train descriptor prediction models for bond dissociation free energies and a few other simulated electronic and steric descriptors, based on large existing molecular databases, that can then be used to generate features for downstream prediction tasks of activation barriers for chemical reactions. The article is well-written and demonstrates the application of these surrogate descriptor prediction models for supervised learning based on a diverse set of datasets. This approach is shown to decrease the data needed to train accurate supervised learning models generally and lead to predictive models. Overall, I think the manuscript fits the readership of Digital Discovery very well and demonstrates the importance of making use of large available (quantum chemical) descriptor databases. Nevertheless, I do think there are a few aspects that should be addressed before publication. I recommend minor revisions. Please find my detailed comments below.

Major aspects:
page 1, left column: The authors state the following: "For tasks for which hundreds of thousands to millions of data points are available, e.g., forward and retrosynthesis prediction, accurate and robust machine learning (ML) models have been designed." I think this sentence needs to be backed up by literature. I think this sentence makes it sound like these problems are solved, but the opposite is true in my opinion. I do not think that the models that have been built are either accurate or robust. I think this is in part because the "popular USPTO benchmark dataset" is of relatively low quality, and while there have been attempts to curate it, it is simply a very hard task.

From a fundamental point of view, using a machine learning model to predict QM descriptors, which are then in turn used as representation for downstream supervised learning models, is a variant of finetuning pre-trained models by only modifying the prediction head. I think this should be mentioned explicitly in the introduction and should be acknowledged. In my opinion, this approach is then essentially transfer learning. The authors claim to have used transfer learning in some of the application examples, but I think already learning to predict descriptors and using the learned parameters as representation for another model is one way of performing transfer learning. In my opinion, this also explains why this approach works so well in the presented case studies.

I think Figure 7 takes quite some time to process at the moment. I think that Figure 7 could be easier to interpret if the various reaction sites are given a color and the same color is used in the table. Additionally, I think this figure could be condensed significantly by putting the predicted and experimental values directly next to the molecular structures. Furthermore, I think it would help a lot if the predicted values are converted to selectivity ratios, so that the values can be directly compared.

Minor aspects:
In the abstract, the authors state: "As a bonus, by basing their final predictions on physically meaningful descriptors, our models become inherently interpretable." I find this claim potentially dangerous. Interpretability even for multivariate linear regression is questionable, and this is even more the case for nonlinear supervised learning approaches. Since this interpretability is not even addressed in the manuscript, I think it would be best to simply remove this statement from the abstract, or rewrite it in a more cautious way.

Figure 3: It should be explained in the caption what the colors indicate.

Figure 5: It is very nice to see learning curves. I think it would be even better if the individual MAE values had error bars estimated based on the cross-validation.

Reviewer 2

“Repurposing Quantum Chemical Descriptor Datasets for on-the-Fly Generation of Informative Reaction Representations: Application to Hydrogen Atom Transfer Reactions”

The authors investigate repurposing existing quantum chemical property datasets to develop data-efficient machine learning models for predicting the activation energy of hydrogen atom transfer (HAT) reactions. They address the challenge of poor data efficiency in generic machine learning models, especially in specialized tasks like chemical reactivity prediction where limited training data is available. Their approach involves creating an informative reaction representation based on a valence bond analysis for HAT reactions and constructing a surrogate model using publicly available descriptors and geometries of organic radicals. Combining this surrogate model with a secondary reactivity model, they achieve significantly better performance on a dataset of HAT reactions.

The code accompanying the paper is available on a public data repository (GitHub) with well-explained instructions on how to install and run the code. The dataset is well-documented and could be accessed and tested without much difficulty.


Reviewer 3

In this manuscript, the authors report the use of existing databases to train quantum-chemical descriptors within the VB theory, for their subsequent use in the prediction of HAT reaction energetics. While not fully innovative, the proposed learning strategy is sound, very interesting, and well executed. However, I share reviewer 1's concerns about the organization/writing of the manuscript, besides a couple of scientific questions. Hence, I recommend publication of the manuscript in Digital Discovery after minor revisions.

-In section 4.3, the authors use their VB-inspired representation to learn on the datasets introduced in section 3.5. The representation is trained on data computed at the M06-2X/def2-SVP + M06-2X/def2-TZVP level (section 3.2), but it is used to predict on datasets computed with different functionals and basis sets. Thus, it is not clear to me whether the computational level might be playing a role here. Could the authors comment on that?

-In general, the aim of descriptors is to capture the physical/chemical principles that explain the property of interest, so that ML exercises can succeed. Thus, if a theoretical framework exists to explain HAT reactions, it is no surprise that its exploitation in ML tasks will bring better accuracy (as seen in Table 1). For this reason, I think section 4.2 should be more concise. Figure 6 might be unnecessary as well.

-The large number of datasets and applications is often confusing, particularly in sections 3.2 and 3.5 of the methodology, and then in section 4.3. There is a lack of graphical aid to help understand this part of the manuscript. Also, some of the results (MAE, RMS) and characteristics of the dataset (size, type of reaction) could be collected in a table instead.

- Finally, one of the claims of the manuscript is the accuracy of the model on low-N cases (i.e., its efficiency). I wonder if this claim is legitimate. After all, the application-specific models, even if done on small datasets, required the previous training of a separate model using the much larger HAT dataset. So, I would dispute the gain in efficiency associated with their strategy. This is just an opinion that I guess won't be shared by the authors, but a comment on the manuscript would be appreciated.


 

Dear Prof. Schrier,

Thank you for your email on the 12th of March and the three referee reports on the manuscript. All reports were positive and only contained a handful of suggestions to improve the quality of the manuscript. We want to thank the reviewers for these thoughtful comments and have implemented the appropriate modifications. Below, we provide a point-by-point overview of these changes.

We hope that the revised manuscript can now be accepted in Digital Discovery.

Yours sincerely,
Thijs Stuyver
__________________________________________________________

************
REVIEWER REPORT(S):
Referee: 1

Comments to the Author
The article "Repurposing Quantum Chemical Descriptor Datasets for on-the-Fly Generation of Informative Reaction Representations: Application to Hydrogen Atom Transfer Reactions" describes the use of graph neural networks to train descriptor prediction models for bond dissociation free energies and a few other simulated electronic and steric descriptors, based on large existing molecular databases, that can then be used to generate features for downstream prediction tasks of activation barriers for chemical reactions. The article is well-written and demonstrates the application of these surrogate descriptor prediction models for supervised learning based on a diverse set of datasets. This approach is shown to decrease the data needed to train accurate supervised learning models generally and lead to predictive models. Overall, I think the manuscript fits the readership of Digital Discovery very well and demonstrates the importance of making use of large available (quantum chemical) descriptor databases. Nevertheless, I do think there are a few aspects that should be addressed before publication. I recommend minor revisions. Please find my detailed comments below.

Author reply: We want to thank the referee for this positive appraisal of our manuscript.

Major aspects:
page 1, left column: The authors state the following: "For tasks for which hundreds of thousands to millions of data points are available, e.g., forward and retrosynthesis prediction, accurate and robust machine learning (ML) models have been designed." I think this sentence needs to be backed up by literature. I think this sentence makes it sound like these problems are solved, but the opposite is true in my opinion. I do not think that the models that have been built are either accurate or robust. I think this is in part because the "popular USPTO benchmark dataset" is of relatively low quality, and while there have been attempts to curate it, it is simply a very hard task.

Author reply: We want to thank the referee for this remark. We fully agree with the argumentation that characterizing the current generation of models for forward and retrosynthesis as robust is probably a step too far – their performance indeed tends to deteriorate rapidly as soon as one leaves the training data distribution. At the same time, it is hard to deny that they are accurate – at least within the distribution of the training data – since, as indicated further down the paragraph in the manuscript, accuracies of 90% and above are reached in cross-validation.

To introduce some more nuance on this point, we made the following modification to this specific sentence:

“… significant strides towards accurate machine learning (ML) models have been made”

Additionally, we added some additional references to retrosynthesis models that have been developed in recent years.

From a fundamental point of view, using a machine learning model to predict QM descriptors, which are then in turn used as representation for downstream supervised learning models, is a variant of finetuning pre-trained models by only modifying the prediction head. I think this should be mentioned explicitly in the introduction and should be acknowledged. In my opinion, this approach is then essentially transfer learning. The authors claim to have used transfer learning in some of the application examples, but I think already learning to predict descriptors and using the learned parameters as representation for another model is one way of performing transfer learning. In my opinion, this also explains why this approach works so well in the presented case studies.

Author reply: We want to thank the referee for sharing this alternative viewpoint. In response, we included the following sentence in the introduction, together with a new reference:

“It should be noted that the presented strategy can also be regarded as an alternative to the more conventional approach of fine-tuning pre-trained models by modifying the prediction head.”

I think Figure 7 takes quite some time to process at the moment. I think that Figure 7 could be easier to interpret if the various reaction sites are given a color and the same color is used in the table. Additionally, I think this figure could be condensed significantly by putting the predicted and experimental values directly next to the molecular structures. Furthermore, I think it would help a lot if the predicted values are converted to selectivity ratios, so that the values can be directly compared.

Author reply: We want to thank the referee for this useful suggestion. We modified Figure 7, by removing the column ‘Site’ and adding a color code. We also added in the caption of the figure:
“Bold values represent the major regioisomer.”

We also would like to point out that we are training our models on DFT-predicted activation energies, and consequently, our models are constrained by the accuracy of these approaches. Quantitatively predicting experimental ratios with DFT is challenging, though the qualitative trends are typically recovered faithfully. As such, we decided to quantitatively compare our predictions to DFT values; for the experimental ratios, we focus on qualitative agreement. Additionally, it should also be noted that there are two entries (3 and 6) without experimental rates, so that no comparison to experimental rates can be made there.

Minor aspects:
In the abstract, the authors state: "As a bonus, by basing their final predictions on physically meaningful descriptors, our models become inherently interpretable." I find this claim potentially dangerous. Interpretability even for multivariate linear regression is questionable, and this is even more the case for nonlinear supervised learning approaches. Since this interpretability is not even addressed in the manuscript, I think it would be best to simply remove this statement from the abstract, or rewrite it in a more cautious way.

Author reply: We agree with the referee that caution is needed, but as we indicate in the manuscript, we can in fact extract insights about the nature of the various datasets from the model predictions, e.g., by analyzing the feature importance, we can clearly distinguish between datasets that are mainly focused on alkoxy-radicals, and more chemically diverse datasets.

Nonetheless, we decided to rewrite the sentence in a more cautious way:

“… our models enable the extraction of chemical insights, providing an additional benefit.”

Figure 3: It should be explained in the caption what the colors indicate.

Author reply: We thank the referee for this suggestion, and we added the following sentence to the legend in all the figures with correlation plots (Figures 3, 4, and 5):

“Note that color brightness here is inversely proportional to the density of the points, i.e., dark patches correspond to a high point density and vice versa.”

Figure 5: It is very nice to see learning curves. I think it would be even better if the individual MAE values had error bars estimated based on the cross-validation.

Author reply: We thank the referee for the suggestion and we modified the figure accordingly.

Referee: 2

Comments to the Author
“Repurposing Quantum Chemical Descriptor Datasets for on-the-Fly Generation of Informative Reaction Representations: Application to Hydrogen Atom Transfer Reactions”

The authors investigate repurposing existing quantum chemical property datasets to develop data-efficient machine learning models for predicting the activation energy of hydrogen atom transfer (HAT) reactions. They address the challenge of poor data efficiency in generic machine learning models, especially in specialized tasks like chemical reactivity prediction where limited training data is available. Their approach involves creating an informative reaction representation based on a valence bond analysis for HAT reactions and constructing a surrogate model using publicly available descriptors and geometries of organic radicals. Combining this surrogate model with a secondary reactivity model, they achieve significantly better performance on a dataset of HAT reactions.

The code accompanying the paper is available on a public data repository (GitHub) with well-explained instructions on how to install and run the code. The dataset is well-documented and could be accessed and tested without much difficulty.

Author reply: We want to thank the referee for these kind words.

Referee: 3

Comments to the Author
In this manuscript, the authors report the use of existing databases to train quantum-chemical descriptors within the VB theory, for their subsequent use in the prediction of HAT reaction energetics. While not fully innovative, the proposed learning strategy is sound, very interesting, and well executed. However, I share reviewer 1's concerns about the organization/writing of the manuscript, besides a couple of scientific questions. Hence, I recommend publication of the manuscript in Digital Discovery after minor revisions.

Author reply: We want to thank the referee for this positive appraisal of our manuscript.

-In section 4.3, the authors use their VB-inspired representation to learn on the datasets introduced in section 3.5. The representation is trained on data computed at the M06-2X/def2-SVP + M06-2X/def2-TZVP level (section 3.2), but it is used to predict on datasets computed with different functionals and basis sets. Thus, it is not clear to me whether the computational level might be playing a role here. Could the authors comment on that?

Author reply: It is certainly possible that the heterogeneity of the levels of theory used in the various datasets introduces some noise in the representations, which can be expected to have a slight detrimental effect on the model accuracy. Nevertheless, in all datasets analyzed, levels of theory were selected based on benchmarking, so that we can expect the resulting DFT data to be reasonable, i.e., they should all be approximating the “true” value with a reasonable accuracy. This also means that the underlying physical connections between the input descriptors and the target output are preserved.

-In general, the aim of descriptors is to capture the physical/chemical principles that explain the property of interest, so that ML exercises can succeed. Thus, if a theoretical framework exists to explain HAT reactions, it is no surprise that its exploitation in ML tasks will bring better accuracy (as seen in Table 1). For this reason, I think section 4.2 should be more concise. Figure 6 might be unnecessary as well.

Author reply: We appreciate the referee’s opinion, but we still think that it is useful to do an in-depth comparison of model architectures on our in-house dataset, and to briefly discuss the importance of the various descriptors in the model predictions, which is one of the strengths of the presented approach in our opinion.

-The large number of datasets and applications is often confusing, particularly in sections 3.2 and 3.5 of the methodology, and then in section 4.3. There is a lack of graphical aid to help understand this part of the manuscript. Also, some of the results (MAE, RMS) and characteristics of the dataset (size, type of reaction) could be collected in a table instead.

Author reply: We thank the referee for pointing out this issue. We included a new table (Table 2) where we summarized the information regarding every dataset and application. We also added a sentence at the end of section 3.5:

“A summary with the main information of every dataset can be found in Table 2.”

- Finally, one of the claims of the manuscript is the accuracy of the model on low-N cases (i.e., its efficiency). I wonder if this claim is legitimate. After all, the application-specific models, even if done on small datasets, required the previous training of a separate model using the much larger HAT dataset. So, I would dispute the gain in efficiency associated with their strategy. This is just an opinion that I guess won’t be shared by the authors, but a comment on the manuscript would be appreciated.

Author reply: We agree with the referee that training the descriptor prediction model inherently requires some computational overhead. However, the gain in efficiency that we are claiming stems from avoiding costly QM calculations. These dwarf the cost of training the descriptor prediction model, and as we demonstrate on our in-house dataset, informative representations based on (predicted) descriptors are essential to be able to learn in the low data regime.

************




Round 2

Revised manuscript submitted on 17 Mar 2024
 

29-Mar-2024

Dear Dr Stuyver:

Manuscript ID: DD-ART-02-2024-000043.R1
TITLE: Repurposing Quantum Chemical Descriptor Datasets for on-the-Fly Generation of Informative Reaction Representations: Application to Hydrogen Atom Transfer Reactions

Thank you for submitting your revised manuscript to Digital Discovery. I am pleased to accept your manuscript for publication in its current form. I have copied any final comments from the reviewer(s) below.

You will shortly receive a separate email from us requesting you to submit a licence to publish for your article, so that we can proceed with the preparation and publication of your manuscript.

You can highlight your article and the work of your group on the back cover of Digital Discovery. If you are interested in this opportunity please contact the editorial office for more information.

Promote your research, accelerate its impact – find out more about our article promotion services here: https://rsc.li/promoteyourresearch.

If you would like us to promote your article on our LinkedIn account [https://rsc.li/Digital_showcase] please fill out this form: https://form.jotform.com/213544038469056.

We are offering all corresponding authors on publications in gold open access RSC journals who are not already members of the Royal Society of Chemistry one year’s Affiliate membership. If you would like to find out more please email membership@rsc.org, including the promo code OA100 in your message. Learn all about our member benefits at https://www.rsc.org/membership-and-community/join/#benefit

By publishing your article in Digital Discovery, you are supporting the Royal Society of Chemistry to help the chemical science community make the world a better place.

With best wishes,

Dr Joshua Schrier
Associate Editor, Digital Discovery


 
Reviewer 3

The authors have addressed my comments, so I recommend the manuscript for publication in its current form.

Reviewer 1

The article "Repurposing Quantum Chemical Descriptor Datasets for on-the-Fly Generation of Informative Reaction Representations: Application to Hydrogen Atom Transfer Reactions" is a revised version of a manuscript that I reviewed previously. I think that the authors did an excellent job to address all the remarks of the reviewers. Hence, it is my pleasure to recommend this article for publication.

Reviewer 2

The authors have revised the manuscript and have addressed the comments by all the referees. The accompanying data and code is well documented.




Transparent peer review

To support increased transparency, we offer authors the option to publish the peer review history alongside their article. Reviewers are anonymous unless they choose to sign their report.

We are currently unable to show comments or responses that were provided as attachments. If the peer review history indicates that attachments are available, or if you find there is review content missing, you can request the full review record from our Publishing customer services team at RSC1@rsc.org.

Find out more about our transparent peer review policy.

Content on this page is licensed under a Creative Commons Attribution 4.0 International license.