From the journal Digital Discovery Peer review history

Reaction classification and yield prediction using the differential reaction fingerprint DRFP

Round 1

Manuscript submitted on 26 Aug 2021
 

24-Oct-2021

Dear Dr Probst:

Manuscript ID: DD-ART-08-2021-000006
TITLE: Reaction Classification and Yield Prediction using the Differential Reaction Fingerprint DRFP

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

I have carefully evaluated your manuscript and the reviewers’ reports, and the reports indicate that major revisions are necessary.

Please submit a revised manuscript which addresses all of the reviewers’ comments. Further peer review of your revised manuscript may be needed. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy from CASRAI, https://casrai.org/credit/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link:

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log on to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process. We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Dr Kedar Hippalgaonkar
Associate Editor, Digital Discovery
Royal Society of Chemistry

************


 
Reviewer 1

The idea is interesting and the research question is important; however, this work, and especially the presentation of its context, is still too preliminary to be published. I believe that after a major revision this work could reach adequate quality.

Overall, the scholarly presentation needs substantial improvement in terms of clarity. Although I could understand how the algorithm works from the GitHub source code and the pseudocode, this level of detail should be presented in the main text as well. This is an interdisciplinary journal, meaning its readers come from different backgrounds; better graphical aids would help researchers from materials science or chemistry backgrounds understand the work better. The same authors have written very nice papers: "Extraction of organic chemistry grammar from unsupervised learning of chemical reactions" and "Mapping the Space of Chemical Reactions Using Attention-Based Neural Networks".
I believe they can achieve the same presentation quality here.

For the encoding algorithm, the symmetric difference is taken between the substructures of the reactants plus reagents and the substructures of the products. However, there is a potential problem. For example, if there are no reagents and A + B >> C is the reaction, one can compute a hashed fingerprint of this reaction. But for another reaction C >> A + B, considering only the difference between the substructure lists leads to the same fingerprint.

The whole reaction-yield prediction experiment is unclear. Why is there a need to split the datasets in this way? Why not simply fix a test set and compare the performance of the different fingerprints on it? If the effects of model hyperparameters need to be eliminated, an ensemble approach can be used. Please revise this section so that it is much clearer why the different experiments were carried out and how exactly they were carried out. In addition, please clarify the differences between the four test sets.

For each of the different experiments, could you provide some figures instead of just one table, so that the properties of the datasets (the target value range, the distribution of the data such as the sub-gram data) and the differences between your methods and the benchmarked methods (what information is encoded, and how it differs) can be clearly shown to readers?






Reviewer 2

The manuscript titled “Reaction Classification and Yield Prediction using the Differential Reaction Fingerprint DRFP” describes a structure-based method for constructing reaction representations. The analysis compares a few ML representations that have been reported in the literature to the newly developed DRFP on both reaction classification and yield prediction. DRFP is not only an elegant method for representing reactions; it also does not take weeks to calculate the descriptors, which makes it a valuable tool for on-the-fly predictions. The manuscript is suggested to be published with minor revisions.



- How does naively subtracting the reactant FP from the product FP perform on yield prediction? Basically Schneider et al. without any weighting. Does it come up with results similar to those of the DRFP?
- The authors state that they standardize reactions by moving reagents to the reactant side of the reaction representation. What is the impact if reagents are not included at all? Chemically speaking they should obviously be important, but in the reviewer's experience, performance is sometimes not greatly affected by the exclusion of reagents.
- A brief description of the USPTO 1k TPL set should be included. It would only add two sentences and would spare the reader from having to remember or look up the details. It should also be noted that these are templates, not named reactions from NextMove/NameRxn, just so the reader is aware.
- While the reaction shown in Figure 1 is interesting, it is probably not the best example to show the intended audience of this publication. It is suggested to provide a more straightforward reaction that appeals to the intended hybrid (chemistry and ML) audience. This is of minor importance, and if the authors think this reaction best demonstrates the features, then it is OK.
- It should be noted, in the case of yield data from high-throughput experimentation (HTE), that one cannot predict with higher accuracy than the baseline reproducibility of the reactions. In chemistry the accepted metric is usually ±10%. While this representation is much more straightforward, easier to use, and less time-consuming than many alternatives, the authors should dial back claims of "significant" or "impressive" performance. A lesson we should learn from the high-throughput assay and property-prediction field is that fitting a dataset without reproducibility metrics or replicates can be done, but we cannot predict better than the inherent error in the measurements.
- The authors did not try one-hot-type descriptors. How do these compare to features with no structural information? The rebuttal to Ahneman et al. essentially demonstrated that DFT descriptors are not necessary for high accuracy (i.e., the proper baselines were not run to statistically claim that their DFT method was better than simpler representations).


 

Dear Editor and Reviewers,
We would like to express our thanks for the valuable input. We think that incorporating the proposed changes has improved the quality of our manuscript markedly. Please find our point-by-point answers below. In addition, you will find all changes to the manuscript annotated (using the LaTeX package changes with default settings) in the PDF. The scripts used to generate the newly added plots have been collected in a notebook that is available from the project repository on GitHub.

===============
Editor
===============
1.
Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy from CASRAI, https://casrai.org/credit/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Answer: We have added the section “Author Contributions”.



=============== 
Reviewer 1
===============
1.
The idea is interesting and the research question is important; however, this work, and especially the presentation of its context, is still too preliminary to be published. I believe that after a major revision this work could reach adequate quality.
Overall, the scholarly presentation needs substantial improvement in terms of clarity. Although I could understand how the algorithm works from the GitHub source code and the pseudocode, this level of detail should be presented in the main text as well. This is an interdisciplinary journal, meaning its readers come from different backgrounds; better graphical aids would help researchers from materials science or chemistry backgrounds understand the work better. The same authors have written very nice papers: "Extraction of organic chemistry grammar from unsupervised learning of chemical reactions" and "Mapping the Space of Chemical Reactions Using Attention-Based Neural Networks".
I believe they can achieve the same presentation quality here.

Answer: Thank you for this comment, the text was indeed lacking details to be well-understood by a more general audience. We made the following changes to the manuscript:

- Introduction; Text:
- - More detailed introduction to the reaction classification problem.
- - Clarification on what is meant by physics- and structure-based fingerprints.
- - Clarification that the required attention weights relate to the specific transformer-based model used by Schwaller et al.
- - Added a short description of circular substructures.
- - Added references to Figure 1 to the text.
- - Clarification on why hashing and folding are required.

- Introduction; Figure 1:
- - Changed the example to a simpler reaction in order to make the scheme easier to read.
- - Added the respective SMILES representations in addition to the structures.
- - Replaced the caption with a more detailed version.
- - Added an example of circular substructure extraction.

- Introduction & Results and Discussion
- - We have moved the now more detailed description of the fingerprint to its own subsection, “Fingerprint Design”, under Results and Discussion.

- Methods; Molecular n-grams
- - Added a short introduction on SMILES.
- - Added a reference to the visual representation of the process in Figure 1.
- - Detailed the use of the term “molecular n-grams”.


2.
For the encoding algorithm, the symmetric difference is taken between the substructures of the reactants plus reagents and the substructures of the products. However, there is a potential problem. For example, if there are no reagents and A + B >> C is the reaction, one can compute a hashed fingerprint of this reaction. But for another reaction C >> A + B, considering only the difference between the substructure lists leads to the same fingerprint.

Answer: This is indeed the case. While we consider this a rare edge-case, we have added this as a limitation to the discussion.

Added text: “A current limitation of DRFP is that it fails to distinguish between a reaction and its reverse, e.g. A + B -> C + D and C + D -> A + B. However, we consider this to be an edge-case that, if necessary, could be addressed in a specialised variant of the fingerprint.”
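The limitation acknowledged above is easy to see with plain Python sets standing in for the hashed substructure sets (the single-letter substructures are illustrative placeholders, not DRFP output):

```python
# Toy substructure sets for the two sides of a reaction.
reactants = {"A", "B"}  # left-hand side of A + B -> C + D
products = {"C", "D"}   # right-hand side

# The symmetric difference is commutative, so the forward reaction and
# its reverse yield the same set, and hence the same hashed fingerprint.
forward = reactants.symmetric_difference(products)  # A + B -> C + D
reverse = products.symmetric_difference(reactants)  # C + D -> A + B

assert forward == reverse == {"A", "B", "C", "D"}
```

A direction-aware variant, as hinted at in the added text, would have to break this symmetry, for instance by encoding the two one-sided differences separately.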


3.
The whole reaction-yield prediction experiment is unclear. Why is there a need to split the datasets in this way? Why not simply fix a test set and compare the performance of the different fingerprints on it? If the effects of model hyperparameters need to be eliminated, an ensemble approach can be used. Please revise this section so that it is much clearer why the different experiments were carried out and how exactly they were carried out. In addition, please clarify the differences between the four test sets.

Answer: Great point! We have added a paragraph to clarify the reaction-yield prediction experiment and explain the different splits.

Added text: “As a regression task, we investigate yield prediction, where, given a chemical reaction, the percentage of product formed relative to the theoretical maximum has to be predicted. One of the best-studied yield data sets comes from a high-throughput experimentation study by Ahneman et al. Numerous studies have previously used this data set to evaluate different machine learning models and representations (one-hot, physical, molecular, and learned descriptors). The data set contains 10 random splits and 4 out-of-distribution test sets. In the out-of-distribution test sets, the split is made on the additives, which strongly influence the reactivity. Hence, the models have to extrapolate to unseen additives to perform well.”


4.
For each of the different experiments, could you provide some figures instead of just one table, so that the properties of the datasets (the target value range, the distribution of the data such as the sub-gram data) and the differences between your methods and the benchmarked methods (what information is encoded, and how it differs) can be clearly shown to readers?

Answer: We added a visual analysis, including a TMAP, on the Schneider 50k data set (Figure 3). In addition, we have added enhanced reaction plots for the experiments on the Buchwald-Hartwig data set, which include 2D KDE plots that show the distributions of both the ground truth as well as the predicted values (Figure 4). In the text, we then compared our results to those of Yield-BERT / Augmented Yield-BERT.
For the gram and sub-gram splits of the USPTO data set, we added the reasoning behind the split.


===============
Reviewer 2
===============
The manuscript titled “Reaction Classification and Yield Prediction using the Differential Reaction Fingerprint DRFP” describes a structure-based method for constructing reaction representations. The analysis compares a few ML representations that have been reported in the literature to the newly developed DRFP on both reaction classification and yield prediction. DRFP is not only an elegant method for representing reactions; it also does not take weeks to calculate the descriptors, which makes it a valuable tool for on-the-fly predictions. The manuscript is suggested to be published with minor revisions.


1.
How does naively subtracting the reactant FP from the product FP perform on yield prediction? Basically Schneider et al. without any weighting. Does it come up with results similar to those of the DRFP?

Answer: An initial version of the fingerprint performed such a subtraction (the asymmetric set difference S_products − S_reactants). However, we then moved to the exclusive or (the symmetric set difference), as it performed better.
The subtraction version reached an accuracy of 0.852 (compared to 0.917 for the released DRFP version) in the kNN experiment on the USPTO 1k TPL data set. We have added the results of this version to Table 1 and discuss it in the text.
A version using the subtraction instead of the exclusive or can also be found as a branch in the DRFP GitHub repository (https://github.com/reymond-group/drfp/tree/subtraction).
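The distinction between the two set operations discussed in this answer can be sketched with plain Python sets (the substructure strings are invented placeholders, not actual DRFP output):

```python
# Placeholder substructure sets for a toy reaction.
reactants = {"ring", "Br"}   # substructures present before the reaction
products = {"ring", "N"}     # substructures present after the reaction

# Symmetric set difference (exclusive or, the released DRFP):
# keeps what changed on either side of the reaction.
symmetric = reactants ^ products        # {"Br", "N"}

# Asymmetric subtraction (the initial variant):
# keeps only the substructures gained in the products.
asymmetric = products - reactants       # {"N"}

assert symmetric == {"Br", "N"}
assert asymmetric == {"N"}
```

The symmetric variant retains substructures consumed by the reaction (here `"Br"`), which the subtraction discards; this is consistent with the accuracy gap reported above.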


2.
The authors state that they standardize reactions by moving reagents to the reactant side of the reaction representation. What is the impact if reagents are not included at all? Chemically speaking they should obviously be important, but in the reviewer's experience, performance is sometimes not greatly affected by the exclusion of reagents.

Answer: The issue regarding the separation of reactants and reagents is that it is often not known which molecules are the reactants and which are the reagents. This means that, in order to keep the fingerprint generally applicable, we cannot distinguish between the two. In addition, the high-throughput yield data sets would not support such an experiment, as the reactions only differ in their reagents while the reactant is fixed.

Added text: “Similar to the transformer-based learned fingerprint, DRFP does not distinguish between reactants and reagents, and accepts an arbitrary number of molecules on both sides of the chemical equation.” (This can now be found under the newly introduced subsection of Results and Discussion “Fingerprint Design”)

Added text: “One of the best studied yield data sets comes from a high-throughput experimentation study by Ahneman et al., which contains the yields of 4,608 palladium-catalysed Buchwald–Hartwig reactions with a fixed reactant and varying reagents.”


3.
A brief description of the USPTO 1k TPL set should be included. It would only add two sentences and would spare the reader from having to remember or look up the details. It should also be noted that these are templates, not named reactions from NextMove/NameRxn, just so the reader is aware.

Answer: Thank you for the comment. We have included a paragraph to describe the USPTO 1k TPL data set.

Added text: “As a reaction classification task, we investigated the open-source USPTO 1k TPL dataset, which we previously introduced. In USPTO 1k TPL, the reaction classes were generated by extracting the 1,000 most common templates from the USPTO dataset. Atom-maps that are required to extract templates were predicted using RXNMapper. The task is to predict the corresponding template class given a chemical reaction.”
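The classification setup described in the added text can be pictured as nearest-neighbour classification over binary reaction fingerprints. The sketch below uses invented four-bit fingerprints, invented template labels, and a plain-Python Tanimoto (Jaccard) similarity; it is an illustration of the kNN idea, not the actual DRFP vectors or evaluation code:

```python
def tanimoto(a, b):
    """Jaccard/Tanimoto similarity between two binary bit vectors."""
    intersection = sum(x & y for x, y in zip(a, b))
    union = sum(x | y for x, y in zip(a, b))
    return intersection / union if union else 0.0

# Tiny labelled "training set": (fingerprint, template class) pairs.
train = [([1, 1, 0, 0], "tpl_12"), ([0, 0, 1, 1], "tpl_857")]

# Predict the class of a query fingerprint from its nearest neighbour.
query = [1, 0, 0, 0]
prediction = max(train, key=lambda pair: tanimoto(query, pair[0]))[1]
assert prediction == "tpl_12"
```

With real DRFP vectors the fingerprints are much longer folded bit vectors, but the nearest-neighbour lookup works the same way.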

4.
While the reaction shown in Figure 1 is interesting, it is probably not the best example to show the intended audience of this publication. It is suggested to provide a more straightforward reaction that appeals to the intended hybrid (chemistry and ML) audience. This is of minor importance, and if the authors think this reaction best demonstrates the features, then it is OK.

Answer: Thank you for this remark. We intended to use a reaction that would draw the attention of organic chemists; however, it is true that this is not appropriate for a general audience in terms of explaining the algorithm. We replaced the reaction in Figure 1 with a simple Favorskii rearrangement. In addition, we improved the explanation and added SMILES strings to further clarify the workings of the algorithm.


5.
It should be noted, in the case of yield data from high-throughput experimentation (HTE), that one cannot predict with higher accuracy than the baseline reproducibility of the reactions. In chemistry the accepted metric is usually ±10%. While this representation is much more straightforward, easier to use, and less time-consuming than many alternatives, the authors should dial back claims of "significant" or "impressive" performance. A lesson we should learn from the high-throughput assay and property-prediction field is that fitting a dataset without reproducibility metrics or replicates can be done, but we cannot predict better than the inherent error in the measurements.

Answer: Where performance differs only by a small margin, we have replaced the strong "outperform" with the weaker "performs better than". In addition, we added the following sentence to the conclusion:

Added text: “While our method only slightly improves on the classification and prediction accuracies of other state-of-the-art methods, its value lies within its conceptual simplicity, low use of computational resources, and reproducibility.”


6.
The authors did not try one-hot-type descriptors. How do these compare to features with no structural information? The rebuttal to Ahneman et al. essentially demonstrated that DFT descriptors are not necessary for high accuracy (i.e., the proper baselines were not run to statistically claim that their DFT method was better than simpler representations).

Answer: One-hot encodings make sense for high-throughput experiments, where the number of molecules used in the reactions is limited and the same molecules are used multiple times. For the yield experiments, we now include the results of one-hot encodings as a comparison.
However, the USPTO 1k TPL data set contains more than 580k unique molecules, of which only 27k appear more than three times. On such a data set, one-hot encodings become impractical.
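For the HTE case, a one-hot baseline reduces to indicator vectors per categorical slot. The sketch below illustrates the idea; the component names and vocabularies are illustrative stand-ins, not taken from the manuscript or its data files:

```python
# Illustrative vocabularies for two categorical slots of an HTE reaction.
ligands = ["XPhos", "t-BuXPhos", "t-BuBrettPhos", "AdBrettPhos"]
bases = ["P2Et", "BTMG", "MTBD"]

def one_hot(value, vocabulary):
    """Indicator vector: 1 at the position of `value`, 0 elsewhere."""
    return [1 if v == value else 0 for v in vocabulary]

# The reaction encoding is the concatenation of the per-slot indicators.
encoding = one_hot("t-BuXPhos", ligands) + one_hot("BTMG", bases)
assert encoding == [0, 1, 0, 0, 0, 1, 0]
```

The vector length grows with the vocabulary size of each slot, which is why this approach breaks down on a corpus with hundreds of thousands of unique, mostly singleton molecules.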




Round 2

Revised manuscript submitted on 05 Dec 2021
 

12-Jan-2022

Dear Dr Probst:

Manuscript ID: DD-ART-08-2021-000006.R1
TITLE: Reaction Classification and Yield Prediction using the Differential Reaction Fingerprint DRFP

Thank you for submitting your revised manuscript to Digital Discovery. After considering the changes you have made, I am pleased to accept your manuscript for publication in its current form. I have copied any final comments from the reviewer(s) below.

You will shortly receive a separate email from us requesting you to submit a licence to publish for your article, so that we can proceed with publication of your manuscript.

You can highlight your article and the work of your group on the back cover of Digital Discovery, if you are interested in this opportunity please contact me for more information.

Discover more Royal Society of Chemistry author services and benefits here:

https://www.rsc.org/journals-books-databases/about-journals/benefits-of-publishing-with-us/ 

Thank you for publishing with Digital Discovery, a journal published by the Royal Society of Chemistry – the world’s leading chemistry community, advancing excellence in the chemical sciences.

With best wishes,

Dr Kedar Hippalgaonkar
Associate Editor, Digital Discovery
Royal Society of Chemistry


 
Reviewer 1

I think the major questions are now addressed; however, I really think the authors need to solve the C >> A + B (reverse reaction) issue in the future. Overall, the manuscript is in a condition ready to be published.

Reviewer 2

The reviewer thanks the authors for their very thorough consideration of the review comments and addressing the concerns. The authors have sufficiently revised the manuscript and answered the questions the reviewer had. It is suggested to be accepted.




Transparent peer review

To support increased transparency, we offer authors the option to publish the peer review history alongside their article. Reviewers are anonymous unless they choose to sign their report.

We are currently unable to show comments or responses that were provided as attachments. If the peer review history indicates that attachments are available, or if you find there is review content missing, you can request the full review record from our Publishing customer services team at RSC1@rsc.org.

Find out more about our transparent peer review policy.

Content on this page is licensed under a Creative Commons Attribution 4.0 International license.