From the journal Digital Discovery: peer review history

Investigating the reliability and interpretability of machine learning frameworks for chemical retrosynthesis

Round 1

Manuscript submitted on 13 Jan 2024
 

26-Feb-2024

Dear Dr Zhang:

Manuscript ID: DD-ART-01-2024-000007
TITLE: Investigating the Reliability and Interpretability of Machine Learning Frameworks for Chemical Retrosynthesis

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

I have carefully evaluated your manuscript and the reviewers’ reports, and the reports indicate that major revisions are necessary.

Please submit a revised manuscript which addresses all of the reviewers’ comments. Further peer review of your revised manuscript may be needed. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy, https://credit.niso.org/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link:

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log on to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process.   We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Linda Hung
Associate Editor
Digital Discovery
Royal Society of Chemistry

************


 
Reviewer 1

Hastedt and co-workers present a fairly comprehensive benchmark of several one-step retrosynthesis ML approaches. They highlight the importance of going beyond the typical top-k metrics used to evaluate such approaches, which is important for the practical application of such models. In addition, the authors used interpretability approaches to qualitatively evaluate the models’ “chemical understanding”. While the methodological novelty may be limited, the work is of high practical importance as the field of ML-guided synthesis planning is finding broader application also in industry.

Overall, I found the work to be thorough and the manuscript clear and well written. The data and code used for the experiments are also available on GitHub. My view is that the manuscript is ready for publication after minor corrections; see comments below.

- Page 3: “In 2017, Segler and Waller devised the first deep-learning model to smartly rank templates” -> “smartly” could be replaced by an adverb (it seems to imply previous approaches were not smart).
- The proposed Diversity metric rewards models that return a distribution of proposed reaction classes that resembles the prior distribution of reactions observed across all precedents. I wonder if this is a reasonable expectation. Wouldn’t it be reasonable to assume that the distribution of proposed reaction classes should in fact be biased, considering it is conditioned on a specific target molecule that might be more easily achieved via certain transformations? In fact, isn’t a bias toward the more suitable reaction classes indicative of learning? If the proposed distribution is the same as the prior one, without knowledge of the molecule that needs to be synthesized, couldn’t this be seen as a failure of the model to learn the chemistry that is needed to achieve the target molecule of interest? All in all, I feel that the rationale behind this metric may need to be revised and/or discussed in more detail.
- Also on the diversity metric: rather than measuring similarity based on reaction classes, wouldn’t it be possible to define similarity/diversity using, e.g., reaction fingerprints (e.g., https://pubs.rsc.org/en/content/articlelanding/2022/dd/d1dd00006c)? This may make it possible to evaluate the diversity of proposed retrosynthetic steps (e.g. by looking at the distribution and dispersion of pairwise similarities for each model) without having to classify reactions by type or compare to a reference distribution. It may be worth considering or discussing.
- Page 12: “This difference is possibly linked to a dissimilar method of calculating the top-k accuracy.” It might be good to add a sentence clarifying how the top-k calculations differ.
- The “GNNExplainer” seems to be referred to also as the “GraphExplainer” and the “Explainer”. If these three terms do indeed refer to the same method, there should be consistency across the text to avoid possible confusion.
- Page 16: “Case Study 4 - Kinetic Inhibitor” -> “Kinase Inhibitor”
- Case Study 5, across all text and figures: “Waferin” -> “Warfarin”
- Page 18: “The node (atom) features are only updated once, at the end of the message passing operation”. This does not seem to be right, or phrased correctly, certainly not for all D-MPNN implementations (e.g., see https://pubs.acs.org/doi/full/10.1021/acs.jcim.9b00237). In D-MPNNs with edge-centered updates the atom features are concatenated to the bond ones, but then are still updated via message passing, and aggregated into the final atom embeddings (only once) after message passing. But the data/information in the initial atom features is still used and updated during message passing. I think I see what the authors may have wanted to convey but it might be worth rephrasing the sentence to be more precise.

Reviewer 2

Please fix the references in the paper. There are too many uncertain and inappropriately formatted citations. Without fixing those problems, it is not possible to review the manuscript.

Reviewer 3

This paper is very well written and explains well the limits of the use of single metrics for evaluating retrosynthesis models. I think there is value in asking for careful evaluation of these models for the benefit of the whole community.
The code and the data to reproduce the results are reported and publicly available.
Although I do appreciate all the details furnished in the introduction to define the types of models, the validation methods, and associated metrics, I think Part 3 could probably come a little earlier, or this manuscript should be advertised as a mini-review. The discussion of the metrics is very valuable, but I am a little less convinced about the use of a selected number of case studies to conclude on the chemical interpretability of the models... maybe some of them could go to the SI part of the manuscript to bring the conclusions a little earlier.
There are a few typos in "SELFIES" at the end of the manuscript.

Reviewer 4

The manuscript of Hastedt and co-workers takes a deep dive into the evaluation of one-step retrosynthesis models and proposes a benchmark framework, in addition to investigating the interpretability of such models. The study offers plenty of material and is to some extent an interesting read. However, it would require extensive editing before it is ready for publication.

The authors offer an extensive overview of the field of one-step retrosynthesis models in Section 2.1, but at a length of over three pages it is way too extensive and detailed. This kind of overview is better suited for a review, and a recent one is indeed referenced by the authors; this should suffice for the interested reader. The authors do, however, need to update Ref 5 as it is now published in a peer-reviewed journal (https://doi.org/10.1002/wcms.1694). My strong recommendation is to cut Section 2.1 by at least 50%.

It is of course highly subjective how to categorize single-step retrosynthesis models, but the authors do deviate from the consensus and Ref 5, with their alternative definition of semi-template methods. Yes, models like MEGAN depend on atom-mapping, but they do not rely on the extraction of a template. Equating atom-mapping with templates stretches the definition of templates considerably. I recommend that the authors insert 2-3 sentences discussing the different categorization of the models.

The authors base their benchmark on USPTO-50K with the motivation “The USPTO-50k is the preferred dataset for retrosynthesis thanks to its rich data”. This is almost laughably controversial. From what I gather from the section, the “rich data” refers to the classification information as well as the improved NextMove atom-mapping, but this hardly compensates for the limited number of reaction types available in the dataset and the low number of data points. It has been proven several times (see Ref 48 for instance) that performance on USPTO-50K is not transferable to higher-volume datasets with more diverse reactions. Hence, any conclusion drawn on USPTO-50K is very limited. The authors at least need to acknowledge that, and they should preferably repeat some of their evaluation on a larger dataset drawn from the USPTO set (e.g. USPTO-MIT, full USPTO, or PaRoutes USPTO).

Dropbox is not the best medium to distribute research material. I would recommend that the authors use a service meant for research dissemination that provides long-term, fixed identifiers, such as Zenodo or FigShare.

With regard to the evaluation metrics, I have several remarks. First, I don’t think it is as uncommon to evaluate one-step retrosynthesis models with something other than top-n accuracy as the authors make it out to be. It has been acknowledged for many years that this is an insufficient metric for one-step performance. Furthermore, I think that all the evaluation metrics should be re-scaled and reported on a common scale from 0 to 1, with 1 being the best. This would aid, for instance, in the interpretation of Table 1, where it is now rather difficult to judge the different models as you must think about the magnitude and scale of each number. Also, in Table 2 it is just plain confusing to have top-k accuracy in percentages and round-trip accuracy in fractions.

The authors have chosen two metrics for diversity, but I would recommend picking one and sticking with it. My recommendation is diversity (Div) as it is easier to understand and interpret. It should be acknowledged by the authors that a diverse retrosynthesis is not always possible. For instance, a molecule could have only one bond that can be disconnected in one specific reaction.

The authors need to elaborate on why template-based models can generate invalid SMILES, as this might seem counterintuitive to the reader. It should be explained that it has a different origin than the invalid SMILES from generative models.

The authors need to elaborate that some reaction classes, like protections, will lead to increased complexity, and that is perfectly alright. Hence it is not always desirable to have a lowering of the SCScore. On a related issue, the authors need to acknowledge the merit of evaluating one-step models in a multi-step fashion more than they currently do. Yes, it is important that the model provides feasible reactions, but it does not matter much if the proposals always lower the SCScore, produce unique and diverse solutions, etc., if those solutions do not lead to starting materials that are purchasable, i.e., if you cannot find a synthetic route for your target molecule. I think this needs to be emphasized further in Section 2.2.

Lastly, the authors use XAI to compare the internal model reasoning between two GNNs and a transformer model. They use a masking algorithm together with maximizing mutual information to gather node importance from the GNN algorithms, and the attention with the highest values between reactant and product tokens to gather importance for the transformer model. The intention of the evaluation is interesting, but it comes out as a different study altogether, and arguably this subject deserves proper attention in its own paper. Moreover, the comparison is inherently biased towards EGAT and DMPNN for the task of reaction center prediction, as this is their primary task, whereas the task of the transformer model is generation. Expecting a transformer to give attention to the reaction center by using a random attention map is flawed logic. Transformer attention considers important functional groups, discerning molecular features as well as previous tokens to then generate new molecules token-by-token.

This is also the reason why the transformer will fail to identify stabilizing functional groups, as this effect will be hidden by other features important for the model to remember. Using attention in such a way and subsequently drawing conclusions is biased and should be removed. Comparisons between the GNNs (and possibly GCN as well) can be informative, especially because the same XAI method is used across the approaches. However, this may not fit the article as it is comparing between different modalities. I understand the authors' intention to include the text-based modality, but this should not be done as an afterthought. In conclusion, the transformer XAI methods should be improved or removed from the comparisons completely, as the comparisons made currently do not measure identical values between the different models and are false comparisons.


 

Comments from Referee #1:

Comment 1) Page 3: “In 2017, Segler and Waller devised the first deep-learning model to smartly rank templates” -> “smartly” could be replaced by an adverb (it seems to imply previous approaches were not smart)

Response:
- Thank you for pointing out the wording. We have rephrased it to “rank templates by probability” accordingly.

Comment 2a) The proposed Diversity metric rewards models that return a distribution of proposed reaction classes that resembles the prior distribution of reactions observed across all precedents. I wonder if this is a reasonable expectation. Wouldn’t it be reasonable to assume that the distribution of proposed reaction classes should in fact biased, considering it is conditioned on a specific target molecule that might be more easily achieved via certain transformations? In fact, isn’t a bias toward the more suitable reaction classes indicative of learning? If the proposed distribution is the same as the prior one, without knowledge of the molecule that needs to be synthesized, couldn’t this be seen as a failure of the model to learn the chemistry that is needed to achieve the target molecule of interest? All in all, I feel that the rationale behind this metric may need to be revised and/or discussed in more detail.

Response:
- We thank you for this suggestion. The metric was initially chosen to reflect the algorithm’s ability to propose all of the different reaction types in the dataset. However, we agree that bias can indeed be desirable to indicate the model’s progress in learning about feasible reaction chemistry for a specific target. Additionally, as the metric is not as informative as Div, we decided to remove it from the manuscript to avoid confusion. We left the reaction class distribution plot in the ESI as we believe that it could still be insightful for an interested reader to observe certain trends for a specific model category.

Comment 2b) Also on the diversity metric: rather than measuring similarity based on reaction classes, wouldn’t it be possible to define similarity/diversity using, e.g., reaction fingerprints (e.g. https://pubs.rsc.org/en/content/articlelanding/2022/dd/d1dd00006c)? This may allow to evaluate the diversity or proposed retrosynthetic steps (e.g. by looking at the distribution and dispersion of pairwise similarities for each model) without having to classify reactions by type or compare to a reference distribution. It may be worth considering or discussing.

Response:
- Thank you for bringing up this interesting idea. We have added a short introduction to this idea within Section 2.2.2 – Diversity. For our work, the reactions are classified to offer greater interpretability/understanding of preferred reaction types for a given model. However, if no reaction class information is available in the dataset, the idea of measuring diversity by pairwise dispersion is indeed a good alternative.
- It now reads:
“However, it should be noted that there are other methods to measure diversity. For example, one could use data-driven reaction fingerprints (e.g. rxnfp or DRFP) to measure average pairwise dispersion between reactions, with a larger dispersion indicating a higher diversity. Nonetheless, this would come with reduced interpretability.”
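As a rough illustration of the alternative mentioned in the added sentence, the following is a minimal sketch, assuming binary reaction fingerprints (e.g. DRFP-style bit vectors) have already been computed elsewhere for the top-k proposals of one target; the Jaccard distance and plain averaging are illustrative choices, not the Div metric used in the paper.

import numpy as np

def mean_pairwise_dispersion(fps):
    """
    fps: (n_proposals, n_bits) binary reaction fingerprints for the top-k
         retrosynthetic proposals of a single target molecule.
    Returns the average pairwise Jaccard distance: 0 means identical
    proposals, values closer to 1 indicate more diverse proposals.
    """
    fps = np.asarray(fps, dtype=bool)
    dists = []
    for i in range(len(fps)):
        for j in range(i + 1, len(fps)):
            union = np.logical_or(fps[i], fps[j]).sum()
            inter = np.logical_and(fps[i], fps[j]).sum()
            dists.append(1.0 - inter / union if union else 0.0)
    return float(np.mean(dists)) if dists else 0.0

# toy example: three 8-bit fingerprints standing in for three proposed reactions
print(mean_pairwise_dispersion([[1, 0, 1, 0, 0, 1, 0, 0],
                                [1, 0, 0, 0, 0, 1, 1, 0],
                                [0, 1, 0, 1, 1, 0, 0, 1]]))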

Comment 3) Page 12: “This difference is possibly linked to a dissimilar method of calculating the top-k accuracy.” It might be good to add a sentence clarifying how the top-k calculations differ.

Response:
- Thank you for the suggestion. Upon closer examination of the code, we could not find any flaws with the top-k accuracy. Instead, we found the performance difference due to different “optimal” hyperparameters used in the paper compared to the GitHub repo, namely:
Paper: GNN hidden_dim: 512, MPNN depth: 5 (for 1st model), 7 (for 2nd model)
Github: GNN hidden_dim: 256, MPNN depth: 10 (for 1st & 2nd model)
- Additionally, for LocalRetro, the authors have updated their top-k accuracy metric on their GitHub. Their updated results for the top-k accuracy are in agreement with our study.
- We have therefore removed the sentence about dissimilar top-k accuracy in Section 3.2.1. for LocalRetro and added a sentence for the different hyperparameter selection of G2Retro.

Comment 4) The “GNNExplainer” seems to be referred to also as the “GraphExplainer” and the “Explainer”. If these three terms do indeed refer to the same method, there should be consistency across the text to avoid possible confusion.

Response:
- We appreciate highlighting this inconsistency within our manuscript. To ensure consistency, we have rephrased all occurrences to GNNExplainer.

Comments 5a/b) Page 16: “Case Study 4 - Kinetic Inhibitor” -> “Kinase Inhibitor”; Case Study 5, across all text and figures: “Waferin” -> “Warfarin”
Response:
- Thank you for pointing out the typos. We have updated the naming of the two molecules, accordingly.

Comment 6) “The node (atom) features are only updated once, at the end of the message passing operation”. This does not seem to be right, or phrased correctly, certainly not for all D-MPNN implementations […]. In D-MPNNs with edge-centered updates the atom features are concatenated to the bond ones, but then are still updated via message passing, and aggregated into the final atom embeddings (only once) after message passing. But the data/information in the initial atom features is still used and updated during message passing. I think I see what the authors may have wanted to convey but it might be worth rephrasing the sentence to be more precise.

Response:
- We appreciate the reviewer’s comment, and we agree that our wording does not convey clearly the principled operation of the D-MPNN. According to this suggestion, we have rephrased the sentence in Section 3.3.3, item 3.
- It now reads:
“The D-MPNN differs from the conventional Message Passing Network in a major fashion: the messages in the graph propagate via directed edges (bonds) rather than nodes (atoms). This has the advantage of preventing information from being passed back and forth between adjacent nodes. Furthermore, in the case of edge-centered updates, the finalised node embeddings are constructed by aggregating the updated edge embeddings along with initial node features. Consequently, the atoms (nodes) incorporate a larger proportion of the initial atom features.”
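For readers unfamiliar with the edge-centered scheme described in the rephrased sentence, the following is a minimal numpy sketch of D-MPNN-style message passing; the weight matrices, hidden dimension and toy graph are illustrative placeholders and not any published implementation.

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def dmpnn_node_embeddings(x, e, edges, depth=3, d_hidden=16, seed=0):
    """
    x:     (n_atoms, d_atom) initial atom features
    e:     (n_edges, d_bond) initial bond features; each chemical bond is
           listed twice, once per direction
    edges: list of (u, v) pairs, the directed bond u -> v for each row of e
    """
    rng = np.random.default_rng(seed)
    d_atom, d_bond = x.shape[1], e.shape[1]
    W_in = 0.1 * rng.standard_normal((d_atom + d_bond, d_hidden))
    W_msg = 0.1 * rng.standard_normal((d_hidden, d_hidden))
    W_out = 0.1 * rng.standard_normal((d_atom + d_hidden, d_hidden))

    src = np.array([u for u, _ in edges])
    # initial directed-edge states h0_{u->v} = relu(W_in [x_u || e_{uv}])
    h0 = relu(np.hstack([x[src], e]) @ W_in)
    h = h0.copy()

    for _ in range(depth):
        m = np.zeros_like(h)
        for i, (u, v) in enumerate(edges):
            # messages entering u, excluding the reverse edge v -> u
            for j, (w, t) in enumerate(edges):
                if t == u and w != v:
                    m[i] += h[j]
        h = relu(h0 + m @ W_msg)  # only the directed-edge states are updated here

    # node embeddings are formed once, after message passing, by combining each
    # atom's *initial* features with its aggregated incoming edge states
    agg = np.zeros((x.shape[0], d_hidden))
    for i, (_, v) in enumerate(edges):
        agg[v] += h[i]
    return relu(np.hstack([x, agg]) @ W_out)

# toy example: three atoms in a chain, two bonds listed in both directions
atoms = np.eye(3)
bonds = np.ones((4, 2))
directed = [(0, 1), (1, 0), (1, 2), (2, 1)]
print(dmpnn_node_embeddings(atoms, bonds, directed).shape)  # (3, 16)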


Comments from Referee #2:

Comment 1) Please fix the references in the paper. There are too many uncertain and inappropriately formatted citations. Without fixing those problems, it is not possible to review the manuscript.

Response:
- Thank you for your comment regarding referencing. We have changed references (5, 13, 27, 29, 38, 43, 45, 47, 58) from their (arXiv) preprint versions to the respective peer-reviewed publications. Regarding the formatting: the style is inherent to the Digital Discovery journal template (which omits publication titles).


Comments from Referee #3:

Comment 1) Although I do appreciate all the details furnished in the introduction to define the types of models, the validation methods, and associated metrics, I think Part 3 could probably come a little earlier, or this manuscript should be advertised as a mini-review.

Response:
- We appreciate this suggestion and have accordingly revised Section 2.1 by cutting it by roughly 45-50%. Section 3 now comes 1 page earlier.

Comment 2) The discussion of the metrics is very valuable, but I am a little less convinced about the use of a selected number of case studies to conclude the chemical interpretability of the models... maybe some of them could go to the SI part of the manuscript to bring the conclusions a little earlier.

Response:
- We appreciate this suggestion. We decided to move the Warfarin case study to the ESI. Along with the reduction in Section 2 and an additional (added) benchmarking simulation for Section 3.2, the conclusion comes 1 page earlier.

Comment 3) There are a few typos in "SELFIES" at the end of the manuscript.
Response:
- We thank you for pointing out the typos. We have corrected the typos within Section 4, accordingly.


Comments from Referee #4:

Comment 1a) The authors offer an extensive overview of the field of one-step retrosynthesis models in Section 2.1, but at a length of over three pages it is way too extensive and detailed. This kind of overview is better suited for a review, and a recent one is indeed referenced by the authors, and this should suffice for the interested reader. […]. My strong recommendation is to cut Section 2.1 by at least 50%.

Response:
- Thank you for your suggestion. We have reduced the length of Section 2.1.1-2.1.3 by roughly 50% (Figures and Text) to streamline the introduction.

Comment 1b) Although, the authors need to update Ref 5 as it is now published in a peer-reviewed journal (https://doi.org/10.1002/wcms.1694).

Response:
- We appreciate the reviewer bringing this to our attention. We have updated Ref. 5, along with references 13, 27, 29, 38, 43, 45, 47 and 58, from their (arXiv) preprint versions to the respective peer-reviewed publications.

Comment 2) It is of course highly subjective how to categorize single-step retrosynthesis models, but the authors do deviate from the consensus and Ref 5, with their alternative definition of semi-template methods. Yes, models like MEGAN depend on atom-mapping, but they do not rely on the extraction of a template. Equating atom-mapping with templates stretches the definition of templates considerably. I recommend that the authors insert 2-3 sentences discussing the different categorization of the models.

Response:
- We thank you for highlighting the difference between our model categorization and that of Ref. 5. Our categorization follows two principal statements in the following references:
1. Reference 5 (Zhong et al.): “Since the chemical reaction in the datasets is atom-mapped, the transformations of atoms and bonds during the reaction can be automatically identified by comparing the product to their corresponding reactants. The retrosynthesis can be resolved by predicting these transformations.”
Following suit with their definition, MEGAN should be classified as a semi-template model, too, which was not done in their review paper – possibly an oversight by the authors.
2. Reference 27 (Schwaller et al.): “However, correctly mapping the product back to the reactant atoms is still an unsolved problem, and, more disconcertingly, commonly used tools to find the atom mapping (e.g., NameRXN) are themselves based on libraries of expert rules and templates. This creates a vicious circle. Atom-mapping is based on templates and templates are based on atom mapping, and ultimately, seemingly automatic techniques are actually premised on handcrafted and often artisanal chemical rules.”
As the USPTO-50k utilizes atom-mapping by NameRXN, there is a strong dependence between templates and atom mapping in the dataset. Accordingly, only models that do not utilize any information in the form of templates or atom mapping can be classified as template-free.
- Following the reviewer’s suggestion, we have introduced two additional sentences (and the Schwaller et al. reference) in Section 2.1 - Background to elucidate the categorization.
- It now reads:
“Utilising atom mapping, one can extract the sequence of atom and bond transformations during a reaction computationally. Thus, semi-template models address the prediction of the transformation sequence5. Since reaction templates are curated from atom mapping (and atom mapping algorithms themselves depend on templates and/or expert rules21), semi-template and template-based models share a certain degree of knowledge, thus giving rise to the naming convention. Within this paper, an algorithm utilising exact atom mapping is categorised as semi-template.”

Comment 3a) The authors base their benchmark on USPTO-50K with the motivation “The USPTO-50k is the preferred dataset for retrosynthesis thanks to its rich data”. This is almost laughably controversial.

Response:
- We appreciate this concern and we do acknowledge the fact that larger datasets exist for retrosynthesis prediction (such as USPTO-full/Pararoutes). We utilized the USPTO-50k in this work as the literature and research community primarily use this dataset for SOTA comparison and to advertise performance within research abstracts. Below, we have copied several recent (peer-reviewed) publications, up until early 2024, that evaluate their models solely on the USPTO-50k. Accordingly, the USPTO-50k continues to be the preferred dataset within the community for retrosynthesis predictions. We therefore follow the general practice in the research community, and our benchmark focuses on the USPTO-50k.
List of References:
2024: G-MATT: Single-step retrosynthesis prediction using molecular grammar tree transformer doi: 10.1002/aic.18244
2024: MARS: a motif-based autoregressive model for retrosynthesis prediction doi: 10.1093/bioinformatics/btae115
2024: RCsearcher: Reaction center identification in retrosynthesis via deep Q-learning doi: 10.1016/j.patcog.2024.110318
2023: Retrosynthesis prediction with local template retrieval (RetroKNN) doi: 10.1609/aaai.v37i4.25664
2023: Enhancing diversity in language based models for single-step retrosynthesis doi: 10.1039/D2DD00110A
2023: Retrosynthesis prediction using an end-to-end graph generative architecture for molecular graph editing (Graph2Edits) doi: 10.1038/s41467-023-38851-5
2023: G2Retro as a two-step graph generative models for retrosynthesis prediction doi: 10.1038/s42004-023-00897-3
2023: SynCluster: Reaction Type Clustering and Recommendation Framework for Synthesis Planning doi: 10.1021/jacsau.3c00607
2022: RetroComposer: Composing Templates for Template-Based Retrosynthesis Prediction doi: 10.3390/biom12091325
2022: Improving the performance of models for one-step retrosynthesis through re-ranking doi: 10.1186/s13321-022-00594-8

Comment 3b) From what I gather from the section, the “rich data” refers to the classification information as well as the improved NextMove atom-mapping, but this hardly compensates for the limited number of reaction types available in the dataset and the low number of data points. It has been proven several times (see Ref 48 for instance) that performance on USPTO-50K is not transferable to higher-volume datasets with more diverse reactions. Hence, any conclusion drawn on USPTO-50K is very limited. The authors at least need to acknowledge that, and they should preferably repeat some of their evaluation on a larger dataset drawn from the USPTO set (e.g. USPTO-MIT, full USPTO, or PaRoutes USPTO).

Response:
- Thank you for this suggestion. We extended our methodology for the top-performing models from each category to USPTO-Pararoutes (~1M reactions) – the cleaner version of the USPTO-full. We added Section 3.2.3 – Scalability of Benchmarking Results for this. From our findings, we see that the performance on USPTO-50k is reasonably transferable to larger datasets drawn from the USPTO, although the difference (and magnitude) in rt-accuracy becomes smaller for all 3 models. The drawn conclusion is limited as we have only tested 3/12 models on this dataset.
- A similar finding was made previously by Maziarz et al. (Ref. 18) for the proprietary Pistachio dataset, which is a superset of the USPTO-full/Pararoutes: “Surprisingly, model ranking on USPTO-50K transfers to Pistachio quite well, although all results are substantially degraded, e.g. in terms of top-50 accuracy all models still fall below 55% […]”.
- On the other hand, we cannot be certain that our results would be transferable to datasets outside the distribution of the USPTO.

Comment 4) Dropbox is not the best medium to distribute research material. I would recommend that the authors use a service meant for research dissemination that provides long-term, fixed identifiers, such as Zenodo or FigShare.

Response:
- Thank you for raising the matter of data distribution. The data is now hosted on FigShare and the links in the paper and GitHub have been updated to reflect this change.

Comment 5a) With regard to the evaluation metrics, I have several remarks. First, I don’t think it is as uncommon to evaluate one-step retrosynthesis models with something other than top-n accuracy as the authors make it out to be. It has been acknowledged for many years that this is an insufficient metric for one-step performance.

Response:
- Thank you for pointing this out. You are correct that this issue has been known for several years (which we acknowledged in Section 1 – Introduction by citing references 5,17).
- However, researchers (see list of references below) are still heavily reliant on the top-k accuracy to compare their models and to claim “SOTA” performance up until now (early 2024). As the references listed below are published in peer-reviewed journals, we suggest that it is still considered common practice to evaluate models solely on the top-k accuracy and publish the findings accordingly.
- To our knowledge, only a select few algorithms like LocalRetro, Retroformer and RetroPrime utilise round-trip accuracies, but with different forward models/ways of calculating the round-trip accuracy. There is no unifying framework for this, which we provide within our paper (a minimal sketch of the round-trip idea is given after this response).
- Other papers utilize the MaxFrag accuracy, which, under the hood, applies the same principle as the top-k accuracy, thereby inheriting the same flaws. The invalidity metric has only been applied to template-free models, as we have acknowledged in the paper; we extend this measure to semi-template models. The other metrics employed in the paper (duplicity/diversity) are generally not reported in single-step papers.
- In Section 2.2.2 (first sentence), we acknowledged the presence of other evaluation metrics. To further clarify that the top-k accuracy is the most popular rather than the only metric, we changed “current” to “most popular” in Section 1 – Introduction.
List of References:
2023: Single-step retrosynthesis prediction by leveraging commonly preserved substructures doi: 10.1038/s41467-023-37969-w
2023: BiG2S: A dual task graph-to-sequence model for the end-to-end template-free reaction prediction doi: 10.1007/s10489-023-05048-8
2023: Retrosynthesis prediction with an interpretable deep-learning framework based on molecular assembly tasks doi: 10.1038/s41467-023-41698-5
- Please also see comment 3a for other references relying heavily on the top-k accuracy for performance comparison (2024 G-Matt, 2023 RetroKNN, 2023 Graph2Edits, 2023 G2Retro, 2023 SynCluster etc.)
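As referenced in our response above, the following is a minimal sketch of the round-trip idea; retro_model and forward_model are hypothetical placeholder callables, and published implementations differ in how proposals are counted and canonicalised.

def round_trip_accuracy(targets, retro_model, forward_model, k=10):
    """
    Fraction of top-k retrosynthesis proposals that 'round-trip': applying a
    forward reaction-prediction model to the proposed reactants recovers the
    target product. retro_model(product) is assumed to return a ranked list of
    reactant-set SMILES; forward_model(reactants) is assumed to return the
    predicted product SMILES. Both are placeholders, not a specific published API.
    """
    hits, total = 0, 0
    for product in targets:
        for reactants in retro_model(product)[:k]:
            total += 1
            # comparison assumes both SMILES strings are canonicalised beforehand
            if forward_model(reactants) == product:
                hits += 1
    return hits / total if total else 0.0

# usage (hypothetical): round_trip_accuracy(test_products, my_retro, my_forward, k=10)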

Comment 5b) Furthermore, I think that all the evaluation metrics should be re-scaled and reported on a common scale from 0 to 1, with 1 being the best. This would aid, for instance, in the interpretation of Table 1, where it is now rather difficult to judge the different models as you must think about the magnitude and scale of each number.

Response:
- We are very thankful for this great advice. All metrics have been rescaled from 0 to 1 with 1 being the best in Table 1. This change directly leads to an important change in the methodology: the invalidity metric now becomes a validity metric, and this has been changed throughout the manuscript. The invalidity can be calculated as Inv_k = 1 – Val_k.
- The SCScore metric remains unscaled, but based on the results, it is highly unlikely that it will exceed 1 or fall below 0, i.e. the highest SC difference in the paper is 0.47, with the lowest being 0.3.

Comment 5c) Also, in Table 2 it is just plain confusing to have top-k accuracy in percentages and round-trip accuracy in fractions.

Response:
- Thank you for your suggestion. We have changed the round-trip accuracy to percentages (as opposed to Table 1, where rt-accuracy is in fractions). This is because the top-k accuracy is usually reported as a percentage in the literature and we utilize Table 2 to corroborate our results with the literature. Thus, we opted for percentages in Table 2.

Comment 6) The authors have chosen two metrics for diversity, but I would recommend picking one and sticking with it. My recommendation is diversity (Div) as it is easier to understand and interpret. It should be acknowledged by the authors that a diverse retrosynthesis is not always possible. For instance, a molecule could have only one bond that can be disconnected in one specific reaction.

Response:
- We appreciate this suggestion. We removed the second metric accordingly and added information in Section 2.2.2 – Diversity to acknowledge the reviewer’s comment on the fact that diverse retrosynthesis is desired but not possible in all cases.
- It now reads:
“Finally, note that while a diverse set of predictions is desired, it might not always be possible e.g. for molecules that only have one feasible disconnection site.”

Comment 7) The authors need to elaborate on why template-based models can generate invalid SMILES, as this might seem counterintuitive to the reader. It should be explained that it has a different origin than the invalid SMILES from generative models.

Response:
- Thank you for the comment. We added a sentence to elaborate on the origin of “invalid” SMILES for template-based models in Section 3.2.2.
- It now reads:
“As template-based models guarantee to return a valid chemical transformation, the invalidity herein refers to the inability to retrieve a relevant template that matches the target, i.e. a template whose subgraph pattern oT matches a subgraph o in the product molecule. As the number of relevant templates for a specific product is limited, the model will fail to return a relevant template after a certain top-k.”
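For illustration, the following is a minimal RDKit-based sketch of the relevant-template check described in the added sentence; the SMARTS pattern and molecules are hypothetical examples, and the actual models apply full retro-templates rather than a single product-side pattern.

from rdkit import Chem

def has_relevant_template(product_smiles, product_side_smarts):
    """Return True if any template's product-side pattern (the subgraph o_T)
    occurs as a substructure of the target product molecule."""
    product = Chem.MolFromSmiles(product_smiles)
    for smarts in product_side_smarts:
        pattern = Chem.MolFromSmarts(smarts)
        if pattern is not None and product.HasSubstructMatch(pattern):
            return True
    return False

# ethyl acetate contains an ester, so an ester-disconnection pattern is "relevant";
# benzene matches no such pattern, so a template-based model has nothing to apply
print(has_relevant_template("CCOC(C)=O", ["[C:1](=[O:2])[O:3][C:4]"]))  # True
print(has_relevant_template("c1ccccc1", ["[C:1](=[O:2])[O:3][C:4]"]))   # False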

Comment 8a) The authors need to elaborate that some reaction classes, like protections, will lead to increased complexity, and that is perfectly alright. Hence it is not always desirable to have a lowering of the SCScore.

Response:
- We thank you for raising this issue. Indeed, lowering the SCScore (when going from products to reactants) is not always desirable for some molecules. We have added a sentence in Section 2.2.2 – SCScore to clarify this.
- It now reads:
“It should be noted that a positive SC is not desired for all reaction classes, such as protection reactions. As these only make up 1.2% of the dataset, the overall aim remains to maximise SC.”

Comment 8b) On a related issue, the authors need to acknowledge the merit of evaluating one-step models in a multi-step fashion more than they currently do. Yes, it is important that the model provides feasible reactions, but it does not matter much if the proposals always lower the SCScore, produce unique and diverse solutions, etc., if those solutions do not lead to starting materials that are purchasable, i.e., if you cannot find a synthetic route for your target molecule. I think this needs to be emphasized further in Section 2.2.

Response:
- Thank you for this comment. We have elaborated on this in Section 2.2, where we added a short description of the shortcomings of single-step benchmarking and how one can synergize between our benchmark and the multistep benchmark by Maziarz et al.
- It now reads:
“As a final note: Our pipeline does not guarantee that a single-step model can find synthesis routes towards purchasable building blocks. We suggest that once a promising model is identified through our pipeline, it could be further validated for synthesis planning on the benchmark proposed by Maziarz et al.18.”
- Additionally, we also further highlighted the disadvantage of this methodology, namely the high resource and time requirements needed to benchmark a large selection of models (see Ref. 48, Torren-Peraire et al., limitations/conclusion).

Comment 9) Lastly, the authors use XAI to compare the internal model reasoning between two GNNs and a transformer model. They use a masking algorithm together with maximizing mutual information to gather node importance from the GNN algorithms, and the attention with the highest values between reactant and product tokens to gather importance for the transformer model. The intention of the evaluation is interesting, but it comes out as a different study altogether, and arguably this subject deserves proper attention in its own paper.

Response:
- Thank you for your suggestion. The XAI could be considered a “separate” study from the benchmarking. However, as our benchmarking mostly concerns reaction feasibility, investigating the model interpretability is integral to understanding why certain models fail to propose mostly feasible reaction chemistry (i.e., semi-template and template-free models). We believe that answering this question is therefore an important addition to this study.

Comment 10a) Moreover, the comparison is inherently biased towards EGAT and DMPNN for the task of reaction center prediction, as this is their primary task, whereas the task of the transformer model is generation.

Response:
- We appreciate this comment, and we agree with the statement that the Transformer’s primary task is generation, whereas the GNNs’ task is reaction centre prediction. While stabilising functional groups often fall on/around the reaction centre, it is reasonable to assume that the comparison could be somewhat biased towards the semi-template (GNN) models. However, it is important to interpret the model architectures as they appear in the template-free and semi-template settings, which are sequence-to-sequence generation and classification, respectively. Our conclusions within Section 3.3.3 state: “1. Transformer sequence-to-sequence models […]”. We do not make any conclusion regarding the Transformer architecture for classification tasks in this regard.
- To address these concerns, we have added further content in Section 2.3.2 – Template-free Interpretability to clarify that the Transformer task is generation compared to classification and acknowledge the potential “bias” towards the classification GNN models in Section 2.3 – Black-Box Interpretability.
- It now reads:
“Furthermore, its main task concerns sequence generation, which is more challenging compared to the semi-template reaction centre classification”
and
“The aim of this study is to uncover whether the other two framework categories capture chemically important functional groups, sterics and charge transfers in the reaction. Note that these important thermodynamic features often appear in and around the reaction centre, potentially favouring the interpretability of ‘reaction-centre aware’ models.”

Comment 10b) Expecting a transformer to give attention to the reaction center by using a random attention map is flawed logic. Transformer attention considers important functional groups, discerning molecular features as well as previous tokens to then generate new molecules token-by-token. This is also the reason why the transformer will fail to identify stabilizing functional groups, as this effect will be hidden by other features important for the model to remember. Using attention in such a way and subsequently drawing conclusions is biased and should be removed.

Response:
- Thank you for highlighting the use of attention for interpretability. We highlight two relevant references below which demonstrate that attention provides human interpretability to the end-user and that (cross-)attention correlates with feature importance.
List of References:
Attention Interpretability Across NLP Tasks doi: 10.48550/arXiv.1909.11218
Attention is not not Explanation doi: 10.18653/v1/D19-1002 (This reference disproves the notion that attention is not explanation)
- Specifically, the first reference highlights the following findings:
1. While attention is less meaningful for single-sequence tasks, it holds strong token correlation/importance for tasks involving two sequences, such as sequence translation (as in our case)
2. The most important attention weights in the Transformer are within the Cross Attention layer (the attention we extracted for our interpretability study)
3. Attention weights correlate strongly with feature importance in translation tasks i.e. the higher the attention, the more important a token is
4. Attention is most importantly human-interpretable
- Accordingly, the most important tokens in the product SMILES should receive the highest attention within the model. The extracted attention is cross-attention, not self-attention; it is therefore concerned only with the input product tokens rather than previously generated tokens.
- By doing a summation, we obtain the accumulated importance of a token/feature (a minimal sketch is given after this response). For human-interpretable retrosynthesis, the most important tokens should be the ones that undergo a change during translation (i.e. in the reaction centre), as this determines the reaction chemistry.
- We agree that attention is potentially not the most rigorous approach to Transformer XAI (a rapidly developing research field) compared to newer attribution/gradient methods. Accordingly, we have added further content in Section 2.3.2 – Template-free interpretability highlighting different approaches (plus a relevant reference) and acknowledging the fact that attention might not be as rigorous as newer XAI methods.
- It now reads:
“While using attention directly might not be as rigorous as recent attribution/gradient methods61, attention has been shown to provide a reliable measure of model interpretability to the end-user62”
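To make the extraction step concrete, the following is a minimal sketch, assuming the cross-attention weights have already been collected during decoding; the tensor layout, averaging and rescaling are illustrative choices rather than the exact pipeline used in the paper.

import numpy as np

def product_token_importance(cross_attn):
    """
    cross_attn: array of shape (n_layers, n_heads, n_decode_steps, n_product_tokens)
                holding decoder-to-encoder (cross) attention weights collected
                while the reactant SMILES is generated.
    Returns one accumulated importance score per product token, rescaled to [0, 1].
    """
    per_step = cross_attn.mean(axis=(0, 1))     # average over layers and heads
    importance = per_step.sum(axis=0)           # accumulate over decoding steps
    return importance / importance.max()        # rescale for visualisation

# toy example with random "attention": 2 layers, 4 heads, 5 decoding steps, 6 product tokens
tokens = ["C", "C", "(", "=", "O", ")"]
attn = np.random.default_rng(0).random((2, 4, 5, len(tokens)))
for tok, score in zip(tokens, product_token_importance(attn)):
    print(tok, round(float(score), 3))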

Comment 10c) In conclusion, the transformer XAI methods should be improved or removed from the comparisons completely, as the comparisons made currently do not measure identical values between the different models and are false comparisons.

Response:
- While the GNNExplainer and attention do not return identical values, they do provide a measure of importance for each token/node in the sequence/graph. This importance value can be interpreted for each architecture independently. In the paper, the main conclusion and discussion in Section 3.3.3 are not presented as a comparison of the GNN versus Transformer architectures claiming that one is superior to the other, but rather highlight the human interpretability of each model for its respective retrosynthesis task.




Round 2

Revised manuscript submitted on 05 Apr 2024
 

01-May-2024

Dear Dr Zhang:

Manuscript ID: DD-ART-01-2024-000007.R1
TITLE: Investigating the Reliability and Interpretability of Machine Learning Frameworks for Chemical Retrosynthesis

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

After careful evaluation of your manuscript and the reviewers’ reports, I will be pleased to accept your manuscript for publication after revisions.

Please revise your manuscript to fully address the reviewers’ comments. Note that we do not expect you to provide additional referencing information at this time (Referee 2 point 1). When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy, https://credit.niso.org/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link :

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log in to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process.   We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Linda Hung
Associate Editor
Digital Discovery
Royal Society of Chemistry

************


 
Reviewer 4

I thank the author for the updated manuscript and commend them on their thorough and thoughtful response to the many remarks made by the reviewers. I believe that the manuscript reads very well, and I think it is ready for publication after some minor remarks have been addressed.

On the response to the remark about model categorization, I do not necessarily object to the authors' own categorization (although I hope that one day we as a community can agree on one categorization and nomenclature) and I do think that the authors have done good work to justify their approach. However, it appears that the authors insist on equating atom-mapping with templates. The USPTO-50k dataset is indeed atom-mapped with NameRxn, which uses expert rules to derive the atom-mapping, but first, these SMARTS/SMIRKS are very different from templates extracted with RDChiral and used for retrosynthesis. Furthermore, there are plenty of atom-mappers that are not based on rules, such as rxnmapper, which is used to atom-map the USPTO-PaRoutes dataset. So the authors cannot claim, as they do in the updated manuscript, that “atom mapping algorithms themselves depend on templates and/or expert rules”, and they need to update their text accordingly.

Thanks for including some results on the USPTO-PaRoutes dataset. The only remaining comment I have on this is that Table 4 should preferably use the same column headers as Table 1.

For the XAI part, I believe that the authors have done reasonable work in justifying why they want to keep this part of the manuscript and they have made a reasonable number of edits to the manuscript. I would still insist on tempering the conclusions from this part in the abstract, so that readers who only read the abstract obtain a balanced and fair view of what the authors are actually concluding.

Reviewer 2

The following points should be amended before publication. If the following points are improved, I agree with the publication of the manuscript in Digital Discovery.

1. Reference Style Issues: The referencing style used, as adopted from the Digital Discovery Journal, lacks crucial details such as publication titles and DOIs, which significantly hampers the review process. Although this minimalist style might be suitable for the final publication, it impedes the rigorous verification and assessment of sources during review. It is concerning that these issues persist despite previous comments. The current referencing approach forces reviewers to refer to the bib.tex file for verification, an impractical and cumbersome step.

2. Concerns About Black-box Interpretability in Section 3.3.3: The conclusions drawn in Section 3.3.3 regarding thermodynamic stability are based solely on a few case studies without statistical support. This method of deriving conclusions from limited, human-evaluated case studies, particularly when using low-level representations like SMILES or graphs, is problematic. Such examples, while illustrative, are insufficient for drawing general conclusions about model capabilities. I recommend revising or removing any claims regarding chemical stability unless they can be substantiated through robust computational methods or quantifiable assessments.

3. Choice of Metrics: The justification for selecting SCScore as a metric within the retrosynthesis framework is inadequately explained, described merely as a "natural choice." Given the documented limitations and variability of synthetic accessibility scores, a more rigorous approach would involve averaging multiple metrics (SAscore, SCScore, and RAscore) normalized to a consistent scale. The work by Skoraczyński et al. provides an excellent reference for understanding these metrics in greater depth.

4. Methodological Overview in Section 2.1.2: When introducing non-SMILES methodologies, it would be beneficial to first discuss similarity-based approaches before transitioning to graph-based methodologies. The following references offer a comprehensive overview of these approaches and should be included to enhance understanding:
- https://pubs.acs.org/doi/10.1021/acscentsci.7b00355
- https://link.springer.com/article/10.1186/s13321-020-00482-z
- https://www.nature.com/articles/s41467-022-28857-w

5. Inaccurate Citations in Section 2.1.2: References 38 and 39 are currently misplaced and do not accurately reflect the discussed content. Consider replacing them with the following sources that provide a more relevant and thorough discussion of the topics:

- https://chemrxiv.org/engage/chemrxiv/article-details/60c73ed6567dfe7e5fec388d
- https://iopscience.iop.org/article/10.1088/2632-2153/aba947
- https://link.springer.com/article/10.1186/s13321-023-00725-9


 

Comments from Referee #2:

Comment 1) Reference Style Issues:
The referencing style used, as adopted from the Digital Discovery Journal, lacks crucial details such as publication titles and DOIs, which significantly hampers the review process. Although this minimalist style might be suitable for the final publication, it impedes the rigorous verification and assessment of sources during review. It is concerning that these issues persist despite previous comments. The current referencing approach forces reviewers to refer to the bib.tex file for verification, an impractical and cumbersome step.

Response:
- We appreciate and acknowledge the reviewer’s concern about the minimalistic referencing style and apologise for rendering the review process difficult.
- We would suggest that the editor consider providing two LaTeX templates on the RSC webpage to overcome this issue:
1. A template tailored to the peer-review process, including more detailed referencing information such as title and doi
2. A template (the current .tex document) for the final publication

Comment 2) Concerns About Black-box Interpretability in Section 3.3.3:
The conclusions drawn in Section 3.3.3 regarding thermodynamic stability are based solely on a few case studies without statistical support. This method of deriving conclusions from limited, human-evaluated case studies, particularly when using low-level representations like SMILES or graphs, is problematic. Such examples, while illustrative, are insufficient for drawing general conclusions about model capabilities. I recommend revising or removing any claims regarding chemical stability unless they can be substantiated through robust computational methods or quantifiable assessments.

Response:
- We thank the reviewer for their comment. We have entirely removed the conclusion about the chemical stability from Section 3.3.3, Point 1. Instead, we now focus on the model interpretability.
- We revised Section 3.3.3 Point 2, to show that the drawn conclusion is based on model interpretability and our hypothesis.

Comment 3) Choice of Metrics:
The justification for selecting SCScore as a metric within the retrosynthesis framework is inadequately explained, described merely as a "natural choice." Given the documented limitations and variability of synthetic accessibility scores, a more rigorous approach would involve averaging multiple metrics (SAscore, SCScore, and RAscore) normalized to a consistent scale. The work by Skoraczyński et al. provides an excellent reference for understanding these metrics in greater depth.

Response:
- We thank the reviewer for this suggestion and agree that the selection of the SCScore might appear arbitrary to the reader.
- We discussed the reviewer’s interesting idea in Section 2.2.2 – SCScore.

Comment 4) Methodological Overview in Section 2.1.2:
When introducing non-SMILES methodologies, it would be beneficial to first discuss similarity-based approaches before transitioning to graph-based methodologies. The following references offer a comprehensive overview of these approaches and should be included to enhance understanding:
- https://pubs.acs.org/doi/10.1021/acscentsci.7b00355
- https://link.springer.com/article/10.1186/s13321-020-00482-z
- https://www.nature.com/articles/s41467-022-28857-w

Response:
- Thank you for this suggestion. To enhance the reader’s understanding, we added these references in Section 2.1.2.
- It now reads:
“Another alternative to SMILES generation was previously explored in Ucak et al. and Coley et al. By comparing the molecular similarity of possible precursors to entries in the reaction database, one ensures the generation of valid reactants (although the reaction is not necessarily feasible).”

Comment 5) Inaccurate Citations in Section 2.1.2: References 38 and 39 are currently misplaced and do not accurately reflect the discussed content. Consider replacing them with the following sources that provide a more relevant and thorough discussion of the topics:
- https://chemrxiv.org/engage/chemrxiv/article-details/60c73ed6567dfe7e5fec388d
- https://iopscience.iop.org/article/10.1088/2632-2153/aba947
- https://link.springer.com/article/10.1186/s13321-023-00725-9

Response:
- Thank you for the suggestion. We have added the references to the end of Section 2.1.2 and removed references 38/39.

Comments from Referee #4:

Comment 1) On the response to the remark about model categorization, I do not necessarily object to the authors' own categorization (although I hope that one day we as a community can agree on one categorization and nomenclature) and I do think that the authors have done good work to justify their approach. However, it appears that the authors insist on equating atom-mapping with templates. The USPTO-50k dataset is indeed atom-mapped with NameRxn, which uses expert rules to derive the atom-mapping, but first, these SMARTS/SMIRKS are very different from templates extracted with RDChiral and used for retrosynthesis. Furthermore, there are plenty of atom-mappers that are not based on rules, such as rxnmapper, which is used to atom-map the USPTO-PaRoutes dataset. So the authors cannot claim, as they do in the updated manuscript, that “atom mapping algorithms themselves depend on templates and/or expert rules”, and they need to update their text accordingly.

Response:
- We thank the reviewer for this suggestion and acknowledge that the previous claim is not correct due to the existence of non-rule-based mappers. We removed this statement from the manuscript.

Comment 2) Thanks for including some results on the USPTO-PaRoutes dataset. The only remaining comment I have on this is that Table 4 should preferably use the same column headers as Table 1.

Response:
- Thank you for this comment. The table headers have been updated to match Table 1.
- We marginally reduced the font size of the headers, to facilitate the readability of Table 4.

Comment 3) For the XAI part, I believe that authors have done a reasonable work in justifying why they want to keep this part of the manuscript and they have done a reasonable number of edits to the manuscript. I would still insist on tempering the conclusions from this part in the abstract, so that the readers that only read the abstract obtains a balanced and fair view of what the authors are actually concluding.

Response:
- We acknowledge the reviewer’s concern and modified the abstract, according to their suggestion.
- Before, it read:
“For simple molecules, we demonstrate that Graph Neural Networks identify relevant functional groups within the product molecule, providing thermodynamic stabilisation over the reactant precursors. The popular Transformer fails to identify such crucial stabilisation.”
- It now reads:
“For simple molecules, we show that Graph Neural Networks identify relevant functional groups in the product molecule for a reaction, offering model interpretability. Sequence-to-sequence Transformers are not found to provide such an explanation.”




Round 3

Revised manuscript submitted on 05 May 2024
 

23-May-2024

Dear Dr Zhang:

Manuscript ID: DD-ART-01-2024-000007.R2
TITLE: Investigating the Reliability and Interpretability of Machine Learning Frameworks for Chemical Retrosynthesis

Thank you for submitting your revised manuscript to Digital Discovery. I am pleased to accept your manuscript for publication in its current form. I have copied any final comments from the reviewer(s) below.

You will shortly receive a separate email from us requesting you to submit a licence to publish for your article, so that we can proceed with the preparation and publication of your manuscript.

You can highlight your article and the work of your group on the back cover of Digital Discovery. If you are interested in this opportunity please contact the editorial office for more information.

Promote your research, accelerate its impact – find out more about our article promotion services here: https://rsc.li/promoteyourresearch.

If you would like us to promote your article on our LinkedIn account [https://rsc.li/Digital_showcase] please fill out this form: https://form.jotform.com/213544038469056.

We are offering all corresponding authors on publications in gold open access RSC journals who are not already members of the Royal Society of Chemistry one year’s Affiliate membership. If you would like to find out more please email membership@rsc.org, including the promo code OA100 in your message. Learn all about our member benefits at https://www.rsc.org/membership-and-community/join/#benefit

By publishing your article in Digital Discovery, you are supporting the Royal Society of Chemistry to help the chemical science community make the world a better place.

With best wishes,

Linda Hung
Associate Editor
Digital Discovery
Royal Society of Chemistry


 
Reviewer 4

I thank the authors for incorporating the final remarks. I now think the study is ready for publication.

Reviewer 2

Authors addressed all concerns.




Transparent peer review

To support increased transparency, we offer authors the option to publish the peer review history alongside their article. Reviewers are anonymous unless they choose to sign their report.

We are currently unable to show comments or responses that were provided as attachments. If the peer review history indicates that attachments are available, or if you find there is review content missing, you can request the full review record from our Publishing customer services team at RSC1@rsc.org.

Find out more about our transparent peer review policy.

Content on this page is licensed under a Creative Commons Attribution 4.0 International license.