From the journal Digital Discovery Peer review history

Link-INVENT: generative linker design with reinforcement learning

Round 1

Manuscript submitted on 01 Nov 2022
 

25-Dec-2022

Dear Dr Patronov:

Manuscript ID: DD-ART-11-2022-000115
TITLE: Link-INVENT: Generative Linker Design with Reinforcement Learning

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

After careful evaluation of your manuscript and the reviewers’ reports, I will be pleased to accept your manuscript for publication after revisions - merry xmas and a happy new year!

Please revise your manuscript to fully address the reviewers’ comments. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy, https://credit.niso.org/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link :

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log in to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process.   We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Dr Kedar Hippalgaonkar
Associate Editor, Digital Discovery
Royal Society of Chemistry

************


 
Reviewer 1

Data review:

1b. If using an external database, is an access date or version number provided? No. Authors can amend this by adding the ChEMBL version number.

2a. Are the data cleaning steps clearly and fully described, either in text or as a code pipeline? Yes, although the implementation could be more specific and may affect processing. E.g., what Python libraries/functions were used.

4b. Are baseline comparisons to simple/trivial models (for example, 1-nearest neighbour, random forest, most frequent class) provided? No. Such an example could include a library search method, or selecting linkers from the training set at random given a set of warheads.

6b. Are scripts to reproduce the findings in the paper provided? Yes, the software exists and can theoretically reproduce the results, although the JSON configuration files are not provided (at least to the reviewer).

6c. Have the authors clearly specified which versions of the software libraries they depend upon were used in the course of the work? No. This can be determined by the environment.yml in the software provided but the version of REINVENT used to produce the work must be specified.

Reviewer 2

The authors propose an extension of the REINVENT package for generative modeling to linker design, called Link-INVENT. The paper is well written, the approach well motivated, and the paper is a nice addition to the literature demonstrating the benefits of reinforcement learning to assist in the optimization of physico-chemical properties when generating molecular linkers.

I recommend this paper being accepted with only minor revisions (see below).

Figure 2. The green colour to highlight some of the structure is almost impossible to see. I would recommend a different way of emphasising this partial structure. The ratio of rotatable bonds was initially confusing to me, since this is a ratio/percentage but is not presented as such. I think this would be clearer either as a ratio or making the percentage more explicit (e.g. using the percentage sign %). Finally, the way linkers are displayed could be slightly clarified, since the two terminal atoms displayed (one at each end) are the attachment points rather than generated. This was initially confusing for me since the authors also use the "*" notation for the attachment point.

In Figure 3, the furthest right molecule appears to be missing a bond. Note this might just be a rendering issue on the version of the paper I'm reviewing.

I think the following sentence is overly strong and the authors should consider removing it: "This suggests that neither docking protocol was able to capture the constituent fragments’ binding poses." I believe the point the authors are trying to make is already captured by the previous sentence and this sentence that comes later in the manuscript: "DeLinker and SyntaLinker example molecules also show good pose agreement when docked with our protocol".

I commend the authors for "facilitating a fair comparison" when comparing their approach to others. An additional minor detail that should be noted is that the three approaches being compared (Link-INVENT, DeLinker, and SyntaLinker) all use different training sets.

On this note, were any of the ground-truth linkers in the case study in the training set? If not, what were the most similar linkers in the training set?

I think the discussion would benefit from some (albeit brief) discussion of the compute resources required, given inclusion of a docking protocol.

A few recent papers describing deep learning approaches for generating linkers were missing from the discussion:
- 3DLinker: An E(3) Equivariant Variational Autoencoder for Molecular Linker Design. Yinan Huang, Xingang Peng, Jianzhu Ma, Muhan Zhang Proceedings of the 39th International Conference on Machine Learning, PMLR 162:9280-9294, 2022.
- Deep generative design with 3D pharmacophoric constraints. Fergus Imrie, Thomas E. Hadfield, Anthony R. Bradley, Charlotte M. Deane. Chem. Sci., 2021,12, 14577-14589
- SyntaLinker-Hybrid: A deep learning approach for target specific drug design. Yu Feng, Yuyao Yang, Wenbin Deng, Hongming Chen, Ting Ran, Artificial Intelligence in the Life Sciences, Volume 2, December 2022, 100035

Finally, I think the paper could benefit from being slightly reduced in length in several places. I believe that can be done without eliminating any content from the manuscript. For example, some sentences are repeated a number of times, e.g. "See Supporting Information Fig. S7 for the Scoring Function transformation."

Reviewer 3

This paper is well written and can be published as is.


 

General Remark
We would like to thank the editor and the reviewers for providing constructive feedback. We have made the suggested changes and respond to each reviewer's comments below.

Reviewer 1

Comment 1
If using an external database, is an access date or version number provided? No. Authors can amend this by adding the ChEMBL version number.

Response
The ChEMBL version used was 27. This information has been added to the manuscript in the following sentence (the change is bolded):

“Filter the raw ChEMBL data (version 27) to keep ‘drug-like’ compounds only (see Supporting Information for details).”
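
For illustration only, a minimal RDKit sketch of such a 'drug-like' filter is given below. The thresholds shown are hypothetical Lipinski-style cutoffs and may differ from the exact criteria described in the Supporting Information.

from rdkit import Chem
from rdkit.Chem import Descriptors

def is_drug_like(smiles):
    # Hypothetical Lipinski-style cutoffs; the actual filter criteria are
    # detailed in the Supporting Information and may differ.
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    return (Descriptors.MolWt(mol) <= 500
            and Descriptors.MolLogP(mol) <= 5
            and Descriptors.NumHDonors(mol) <= 5
            and Descriptors.NumHAcceptors(mol) <= 10)

# Example: keep only the drug-like compounds from a list of SMILES
smiles_list = ["CC(=O)Oc1ccccc1C(=O)O", "CCCCCCCCCCCCCCCCCC(=O)O"]
drug_like = [s for s in smiles_list if is_drug_like(s)]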
Comment 2
Are the data cleaning steps clearly and fully described, either in text or as a code pipeline? Yes, although the implementation could be more specific and may affect processing. E.g., what Python libraries/functions were used.

Response
The specific version of REINVENT used in the manuscript is version 3.2. The version numbers of all Python libraries used are available in the reinvent.yml conda environment file in the corresponding code repository: https://github.com/MolecularAI/Reinvent.

This information has been added to the manuscript in the following sentence (the change is bolded):

“Lastly, Link-INVENT was built on the latest version of REINVENT (3.2) and supports an extensive selection of physico-chemical properties that can be optimized through RL.”
Comment 3
Are baseline comparisons to simple/trivial models (for example, 1-nearest neighbour, random forest, most frequent class) provided? No. Such an example could include a library search method, or selecting linkers from the training set at random given a set of warheads.

Response
Thank you for the comment. We did not compare to a library search method as this was done in the DeLinker1 work which is also a deep learning based model for linker design. In the DeLinker work, the authors show that their method outperforms a library search.

In Experiment 1b: Fragment Linking, we compared Link-INVENT to both DeLinker and SyntaLinker2 (another deep learning-based method) and show that Link-INVENT outperforms these previous models. We did not compare Link-INVENT to randomly sampling the training data. The training data were generated by slicing ChEMBL compounds with reaction SMIRKS, so each pair of warheads has its own linker. Randomly sampling linkers from the training set would therefore be equivalent to randomly sampling linkers from a library, which has already been done in the DeLinker work (to which we compare), and it removes the crucial conditioned sampling of Link-INVENT. Moreover, our objective is to highlight Link-INVENT's ability to accomplish goal-directed learning: satisfying a user-defined multi-parameter optimization (MPO) objective. Link-INVENT can generate linkers satisfying the desired objective through reinforcement learning (RL), as demonstrated by the training plots.
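
To illustrate what reaction-based slicing looks like in practice, below is a minimal, hypothetical RDKit sketch using a single amide-cutting SMIRKS; the actual slicing SMIRKS used to build the Link-INVENT training data are described in the manuscript and Supporting Information and are not reproduced here.

from rdkit import Chem
from rdkit.Chem import AllChem

# Hypothetical amide-cutting SMIRKS: cleave an acyclic C(=O)-N bond and cap
# both fragments with attachment points ([*]). The SMIRKS actually used for
# training-data generation are described in the manuscript.
slicing = AllChem.ReactionFromSmarts(
    "[C:1](=[O:2])!@[N:3]>>[C:1](=[O:2])[*].[*][N:3]"
)

mol = Chem.MolFromSmiles("CC(=O)NCCc1ccccc1")
for fragments in slicing.RunReactants((mol,)):
    for frag in fragments:
        Chem.SanitizeMol(frag)
    print([Chem.MolToSmiles(frag) for frag in fragments])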
Comment 4
Are scripts to reproduce the findings in the paper provided? Yes, the software exists and can theoretically reproduce the results, although the JSON configuration files are not provided (at least to the reviewer).

Response
The JSON configuration files used for the Illustrative Experiment can be found in the following repository: https://github.com/MolecularAI/ReinventCommunity. This information is stated in the Data Availability section. We further provide the JSON configuration files for the remaining experiments in this revision.
Comment 5
Have the authors clearly specified which versions of the software libraries they depend upon were used in the course of the work? No. This can be determined by the environment.yml in the software provided but the version of REINVENT used to produce the work must be specified.

Response
The following response is identical to response 2:

The specific version of REINVENT used in the manuscript is version 3.2. The version numbers of all Python libraries used are available in the reinvent.yml conda environment file in the corresponding code repository: https://github.com/MolecularAI/Reinvent.

This information has been added to the manuscript in the following sentence (the change is bolded):

“Lastly, Link-INVENT was built on the latest version of REINVENT (3.2) and supports an extensive selection of physico-chemical properties that can be optimized through RL.”



Reviewer 2

Comment 1
Figure 2. The green colour to highlight some of the structure is almost impossible to see. I would recommend a different way of emphasising this partial structure. The ratio of rotatable bonds was initially confusing to me, since this is a ratio/percentage but is not presented as such. I think this would be clearer either as a ratio or making the percentage more explicit (e.g. using the percentage sign %). Finally, the way linkers are displayed could be slightly clarified, since the two terminal atoms displayed (one at each end) are the attachment points rather than generated. This was initially confusing for me since the authors also use the "*" notation for the attachment point.

Response
Thank you for your comment. We have made the following changes for better clarity:

All linker terminal atoms are now depicted as R-groups rather than asterisks, which are reserved for denoting the exit vectors (attachment points) of warheads. This change affects Figures 2 and 9.

When describing a linker in the context of the full molecule, it is now highlighted in yellow rather than with green bonds. This change affects Figures 1, 2, and 3.

The ratio of rotatable bonds is now described as a percentage, and all figures referencing this metric now include the "%" symbol. This change affects Figures 2 and 9.
Comment 2
In Figure 3, the furthest right molecule appears to be missing a bond. Note this might just be a rendering issue on the version of the paper I'm reviewing.

Response
The missing bond appears to be a rendering issue. The latest version of the manuscript now depicts the linker with yellow highlighting rather than a green colour, which should improve clarity.
Comment 3
I think the following sentence is overly strong and the authors should consider removing it: "This suggests that neither docking protocol was able to capture the constituent fragments’ binding poses." I believe the point the authors are trying to make is already captured by the previous sentence and this sentence that comes later in the manuscript: "DeLinker and SyntaLinker example molecules also show good pose agreement when docked with our protocol".

Response
Thank you for your comment. When designing the experiment to compare Link-INVENT with DeLinker and SyntaLinker, the common objective is to generate linkers such that the constituent fragments of the linked molecule have good pose agreement with the fragment crystal structures. Both DeLinker and SyntaLinker dock their generated molecules post hoc using AutoDock Vina3. However, the poses they show have a high RMSD relative to the fragment crystal structures. Consequently, we designed an alternative docking protocol using Glide4–7 to better capture the binding site dynamics. With our docking protocol, the fragment poses have a much lower RMSD to the crystal structures; to facilitate a fair comparison, we apply our docking protocol to the DeLinker and SyntaLinker molecules as well.

Thus, the purpose of the sentence "This suggests that neither docking protocol was able to capture the constituent fragments’ binding poses." is to provide the rationale for why we introduced an alternative docking protocol rather than using AutoDock Vina.
Comment 4
I commend the authors for "facilitating a fair comparison" when comparing their approach to others. An additional minor detail that should be noted is that the three approaches being compared (Link-INVENT, DeLinker, and SyntaLinker) all use different training sets.

Response
Thank you for your comment. We have added a sentence explicitly acknowledging the difference in training data. The following sentence is newly added:

“We note that the training data used in DeLinker, SyntaLinker, and Link-INVENT are different which can contribute to differences in performance.”
Comment 5
On this note, were any of the ground-truth linkers in the case study in the training set? If not, what were the most similar linkers in the training set?

Response

The following three experiments contained reference linkers; for each, the most similar linker in the training set is shown:


Experiment 1a: Fragment Linking

Reference: [*]C(CCNC(CC[*])=O)=O
Most similar in training set: O=C(CC[*])NCCC(N[*])=O (Tanimoto similarity 0.694)

The information above has been added to the Supporting Information. The following sentence has been modified in the main manuscript in the Experiment 1a: Fragment Linking section to highlight this information (change in bold):

It is important to note that information about the reference ligand was not available to the Link-INVENT agent during the generative process and is not present in the training set (see Supporting Information for more details).
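
For reference, a minimal sketch of how such a nearest-neighbour search could be reproduced with RDKit is shown below. Morgan fingerprints (radius 2, 2048 bits) are assumed and may differ from the settings behind the reported similarities; the training-set linkers beyond the first entry are illustrative placeholders.

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit import DataStructs

def fingerprint(smiles):
    # Assumed fingerprint settings; the exact settings behind the reported
    # Tanimoto similarities may differ.
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

reference = "[*]C(CCNC(CC[*])=O)=O"   # Experiment 1a reference linker
training_linkers = [                  # training-set linker SMILES (illustrative)
    "O=C(CC[*])NCCC(N[*])=O",
    "[*]CCOCC[*]",
    "[*]c1ccc(N[*])cc1",
]

ref_fp = fingerprint(reference)
best = max(training_linkers,
           key=lambda s: DataStructs.TanimotoSimilarity(ref_fp, fingerprint(s)))
print(best, DataStructs.TanimotoSimilarity(ref_fp, fingerprint(best)))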



Experiment 1b: Comparison Fragment Linking

Reference (this linker was present in the training set): [*]OC(C(N[*])=O)C

The information above has been added to the Supporting Information. The following sentence has been added in the main manuscript in the Experiment 1b: Comparison Fragment Linking section to highlight this information:

“We note that the reference linker is present in the training data.”

The reference linker is a simple structure containing relatively few atoms, and is thus more likely to appear from reaction-based slicing of the ChEMBL data during preparation of the training data. Our objective was to show Link-INVENT’s capability of generating diverse linker ideas that satisfy the desired objective. This is shown explicitly in Fig. S8f in the Supporting Information, which reports thousands of unique Bemis-Murcko scaffolds generated across replicate runs. Since the warheads are fixed, the unique scaffolds correspond to unique linker scaffolds that all satisfy the desired objective. This information is conveyed in the main manuscript in the following line:

“We further note that since the fragments are held constant, this means that the unique scaffolds pertain to the linker itself. Therefore, Link-INVENT generates diverse linker ideas that satisfy the core constrained docking protocol.”

Moreover, if the goal were to recover a specific linker, one could define the Scoring Function such that a high reward is given to linkers with high Tanimoto similarity to the reference linker (and a Tanimoto similarity of 1 would indicate recovery of the linker). The ability of SMILES-based LSTM models such as REINVENT (Link-INVENT builds upon REINVENT) to satisfy “reference recovery” objectives has been validated in the GuacaMol benchmark.8


Experiment 2: Scaffold Hopping

Reference: [*]NC1=NN(C([*])=C1)C2CCCC2
Most similar in training set: [*]N1CCC(CC1)NC2=CC(C)=NC(N[*])=N2 (Tanimoto similarity 0.288)


The information above has been added to the Supporting Information. The following sentence has been modified in the main manuscript in the Experiment 2: Scaffold Hopping section to highlight this information (change in bold):

Correspondingly, we define the Scoring Function with the following components (see Supporting Information S11 and S12 for Scoring Function transformations) and note that the reference linker is not present in the training set:


Experiment 3: PROTACs was designed strictly to highlight the flexibility of Link-INVENT’s Scoring Function in optimizing explicitly for linker properties. Thus, we do not consider any ground-truth linker for this case study.
Comment 6
I think the discussion would benefit from some (albeit brief) discussion of the compute resources required, given inclusion of a docking protocol.

Response
Information on compute resources and compute times is provided in the Supporting Information in the “Hardware Experiment Compute Times” section.
Comment 7
A few recent papers describing deep learning approaches for generating linkers were missing from the discussion:
- 3DLinker: An E(3) Equivariant Variational Autoencoder for Molecular Linker Design. Yinan Huang, Xingang Peng, Jianzhu Ma, Muhan Zhang Proceedings of the 39th International Conference on Machine Learning, PMLR 162:9280-9294, 2022.
- Deep generative design with 3D pharmacophoric constraints. Fergus Imrie, Thomas E. Hadfield, Anthony R. Bradley, Charlotte M. Deane. Chem. Sci., 2021,12, 14577-14589
- SyntaLinker-Hybrid: A deep learning approach for target specific drug design. Yu Feng, Yuyao Yang, Wenbin Deng, Hongming Chen, Ting Ran, Artificial Intelligence in the Life Sciences, Volume 2, December 2022, 100035

Response
Thank you for the comment. These citations have been added to the manuscript and the introduction section has been modified accordingly to introduce these papers.
Comment 8
Finally, I think the paper could benefit from being slightly reduced in length in several places. I believe that can be done without eliminating any content from the manuscript. For example, some sentences are repeated a number of times, e.g. "See Supporting Information Fig. S7 for the Scoring Function transformation."

Response
Thank you for your comment. Repeated references to the Supporting Information for Scoring Function transformations have now been removed; a single mention is made when first introducing the Scoring Function for each experiment.


References:
(1) Imrie, F.; Bradley, A. R.; van der Schaar, M.; Deane, C. M. Deep Generative Models for 3D Linker Design. J. Chem. Inf. Model. 2020, 60 (4), 1983–1995. https://doi.org/10.1021/acs.jcim.9b01120.
(2) Yang, Y.; Zheng, S.; Su, S.; Zhao, C.; Xu, J.; Chen, H. SyntaLinker: Automatic Fragment Linking with Deep Conditional Transformer Neural Networks. Chem. Sci. 2020, 11 (31), 8312–8322. https://doi.org/10.1039/D0SC03126G.
(3) Trott, O.; Olson, A. J. AutoDock Vina: Improving the Speed and Accuracy of Docking with a New Scoring Function, Efficient Optimization and Multithreading. J Comput Chem 2010, 31 (2), 455–461. https://doi.org/10.1002/jcc.21334.
(4) Friesner, R. A.; Banks, J. L.; Murphy, R. B.; Halgren, T. A.; Klicic, J. J.; Mainz, D. T.; Repasky, M. P.; Knoll, E. H.; Shelley, M.; Perry, J. K.; Shaw, D. E.; Francis, P.; Shenkin, P. S. Glide: A New Approach for Rapid, Accurate Docking and Scoring. 1. Method and Assessment of Docking Accuracy. J. Med. Chem. 2004, 47 (7), 1739–1749. https://doi.org/10.1021/jm0306430.
(5) Halgren, T. A.; Murphy, R. B.; Friesner, R. A.; Beard, H. S.; Frye, L. L.; Pollard, W. T.; Banks, J. L. Glide: A New Approach for Rapid, Accurate Docking and Scoring. 2. Enrichment Factors in Database Screening. J. Med. Chem. 2004, 47 (7), 1750–1759. https://doi.org/10.1021/jm030644s.
(6) Friesner, R. A.; Murphy, R. B.; Repasky, M. P.; Frye, L. L.; Greenwood, J. R.; Halgren, T. A.; Sanschagrin, P. C.; Mainz, D. T. Extra Precision Glide: Docking and Scoring Incorporating a Model of Hydrophobic Enclosure for Protein-Ligand Complexes. J. Med. Chem. 2006, 49 (21), 6177–6196. https://doi.org/10.1021/jm051256o.
(7) Schrödinger Release 2019-4: Glide, Schrödinger, LLC, New York, NY, 2019.
(8) Brown, N.; Fiscato, M.; Segler, M. H. S.; Vaucher, A. C. GuacaMol: Benchmarking Models for de Novo Molecular Design. J. Chem. Inf. Model. 2019, 59 (3), 1096–1108. https://doi.org/10.1021/acs.jcim.8b00839.




Round 2

Revised manuscript submitted on 15 Jan 2023
 

01-Feb-2023

Dear Dr Patronov:

Manuscript ID: DD-ART-11-2022-000115.R1
TITLE: Link-INVENT: Generative Linker Design with Reinforcement Learning

Thank you for submitting your revised manuscript to Digital Discovery. I am pleased to accept your manuscript for publication in its current form. I have copied any final comments from the reviewer(s) below.

You will shortly receive a separate email from us requesting you to submit a licence to publish for your article, so that we can proceed with the preparation and publication of your manuscript.

You can highlight your article and the work of your group on the back cover of Digital Discovery. If you are interested in this opportunity please contact the editorial office for more information.

Promote your research, accelerate its impact – find out more about our article promotion services here: https://rsc.li/promoteyourresearch.

If you would like us to promote your article on our Twitter account @digital_rsc please fill out this form: https://form.jotform.com/213544038469056.

We are offering all corresponding authors on publications in gold open access RSC journals who are not already members of the Royal Society of Chemistry one year’s Affiliate membership. If you would like to find out more please email membership@rsc.org, including the promo code OA100 in your message. Learn all about our member benefits at https://www.rsc.org/membership-and-community/join/#benefit

By publishing your article in Digital Discovery, you are supporting the Royal Society of Chemistry to help the chemical science community make the world a better place.

With best wishes,

Dr Kedar Hippalgaonkar
Associate Editor, Digital Discovery
Royal Society of Chemistry


 
Reviewer 1

I thank the authors for making changes in response to all the reviewer comments. I am satisfied with the changes that improve the overall quality and FAIRness of the manuscript.

Reviewer 3

Publish as is

Reviewer 2

The authors have adequately addressed all of my questions/concerns. I look forward to seeing this paper in its published form.




Transparent peer review

To support increased transparency, we offer authors the option to publish the peer review history alongside their article. Reviewers are anonymous unless they choose to sign their report.

We are currently unable to show comments or responses that were provided as attachments. If the peer review history indicates that attachments are available, or if you find there is review content missing, you can request the full review record from our Publishing customer services team at RSC1@rsc.org.

Find out more about our transparent peer review policy.

Content on this page is licensed under a Creative Commons Attribution 4.0 International license.