Peer review - Group SELFIES: a robust fragment-based molecular string representation

15-Feb-2023

Dear Mr Cheng:

Manuscript ID: DD-ART-01-2023-000012
TITLE: Group SELFIES: A Robust Fragment-Based Molecular String Representation

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

I have carefully evaluated your manuscript and the reviewers’ reports, and the reports indicate that major revisions are necessary.

Please submit a revised manuscript which addresses all of the reviewers’ comments. Further peer review of your revised manuscript may be needed. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy, https://credit.niso.org/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link:

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log on to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process. We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Dr Kedar Hippalgaonkar
Associate Editor, Digital Discovery
Royal Society of Chemistry

************

Reviewer comments

Reviewer 1

This paper presents an interesting and valuable contribution to the field of molecular string representations. The authors introduce Group SELFIES, a novel representation that builds upon the robustness guarantees of SELFIES while adding the flexibility of group tokens. The group tokens allow the representation to capture chemical motifs such as functional groups or entire substructures. As a result, the readability of the molecular strings is enhanced compared to SELFIES, and functional groups remain invariant upon modification of the string (which is not guaranteed by SMILES or SELFIES).

In my opinion, the most important point is the following: more complex molecular properties such as extended chirality, which cannot be represented in SELFIES, can be represented with Group SELFIES. For example, it can represent ferrocenes (Fig. 7). This extension thereby solves, at least partially, a question which has recently been discussed from a community perspective (Cell Patterns 3(10), 100588 (2022)), and could be a stepping stone towards digital discovery of molecules beyond organic chemistry.

The authors also performed computational experiments, and here I would like to highlight Fig. 5, where the (expected) improvement over SELFIES is seen for randomly generated strings in terms of SAScore and QED. This advance is appreciated, as the generated molecules seem to resemble the structures from the ZINC database.

I have a few technical questions and suggestions:
1) The quality of many figures is very low. Please improve them.

2) Some of the references (such as [32]) are missing bibliometric information.

3) The authors modify the [BranchX] operation and introduce a symbol [pop]. They write "Unlike in SMILES, however, [Branch] and [pop] tokens need not come in pairs, which helps maintain robustness.". This sounds similar to the structure of SMILES's branch where a branch is opened and closed (by brackets). However, their unclosed brackets cannot be interpreted uniquely and thus are invalid. How is this problem circumvented in Group-SELFIES? What happens if the string has multiple consecutive [BranchX] or [pop]?

4) In the chapter "3.5 Determining Fragments", it is not entirely clear to me whether all fragments have been extracted from datasets and other sources by hand, or whether autonomous fragment discovery mechanisms have/can be applied. The latter one would be very interesting, and even if it has not been implemented, I would like to read more about how robustness can still be guaranteed when the fragment is autonomously extracted.

5) The authors write "One tradeoff of Group SELFIES is that encoding and decoding are usually slower than with SELFIES, likely due to overhead of RDKit operations.". I did not understand why RDKit should be the reason for the slowdown. The authors do not use RDKit during encoding/decoding if I understood correctly, just for computing validity scores. Can the authors please elaborate on that?

The paper is well-written, clearly presented, and easy to follow. The methodology and experiments are appropriately described. The authors have made a significant effort to make their open-source implementation available, which is a great resource for the scientific community.

Overall, this is a very useful extension to molecular string representations with the ability to move digital discovery beyond organic chemistry. For that reason, I highly recommend its acceptance for publication.

Reviewer 2

Please see attached

Reviewer 3

As a molecular string representation format, the concept of Group SELFIES and its utility are clearly presented and supported. I envision it could be used for lead compound generation and optimization in drug discovery. The publication of this paper could benefit the broad cheminformatics and machine learning community.

While the encoding and decoding of Group SELFIES will be handled by computers, I think it would help the readers and users of the toolkit if the authors could elaborate the encoding process in section 3.2. I had difficulty following Figure 2 in my first read, but no problem following Appendix A.1. It would be nice if tokens used in these two sections could be harmonized.

I think the manuscript is ready to publish after these minor revisions.

Reviewer 4

The manuscript and the associated code and data from Cheng et al. provide an interesting overview of the novel Group SELFIES approach and its potential to represent complex structures in a compact manner suitable for ML methods and molecular generative models. Nevertheless, some aspects should be refined to make the extent of the work more understandable.

1. The code for the fragmentation schemes employed to gather the actual group set for the ZINC-250k dataset (Section 4.1, Figure 4) is not available: only the resulting text file with the predefined fragments is provided. Considering the importance of group definition for Group SELFIES, more actual examples on how this process takes place could be of interest to both readers and users. While the tutorial notebook does briefly introduce this topic, presenting the results of using either the “default” or the “MMPA” schemes, further details and a brief discussion of the performance and the kinds of groups resulting from the different strategies available in fragment_utils.py, over this same ZINC-250k dataset, would come in useful, possibly as an additional notebook.

2. The study in Section 4.3 (Distribution Learning, involving Table 2 and Figure 6) is not available in the repository, while the other examples in the paper are. From the MOSES paper and repository, the benchmark may be assumed to be called from the command-line interface of this tool, but having the script would be more consistent with the rest of the provided experiments. More importantly, due to the absence of this specific script/notebook, the specific fragmentation and group selection strategy followed to build the corresponding group set for Group SELFIES is only hinted at in the text, but not directly available.

3. While specific details on the protocol for Distribution Learning can be readily checked in either the article or the repository for the MOSES framework, a brief mention of MOSES being based on the ZINC Clean Leads dataset, the approximate number of molecules it contains (~2M) and the train/test/scaffold split would make the section clearer and more self-contained.

4. The “tutorial.ipynb” notebook includes a clear explanation of how the fragments (or “groups”) are expressed and defined. This explanation is largely missing from the manuscript, which only points to a “SMILES-like syntax” being possible, without any further details. Given that these groups are the cornerstone of the method, I think that the overall consistency of the work would improve if the manuscript provided a better description of this essential part of the Group SELFIES approach, together with the repository showing more direct examples on group set generation as raised in point #1.

5. The “extended chirality” section, which addresses a main issue of molecular string representations, explicitly states “We leave the proper implementation of representing global chirality to future work”, yet it is not placed under the actual “Future work” subsection. Depending on the extent of the preliminary work on this problem, it could be useful to have some code examples of the chiral groups in Figure 7 actually being employed in the Group SELFIES framework, e.g., some custom definition of how the corresponding group set dictionaries might be introduced instead of using the standard SMILES-like syntax of the other examples, even if it is not yet fully standardized. In any case, I find that the current organization can be somewhat misleading regarding the extent of applicability of Group SELFIES, and it might become clearer by just moving this discussion on extended chirality under the “Future work” headline.

Author response

We thank the reviewers for their valuable time and feedback.

Referee 1:

1) The quality of many figures is very low. Please improve them.

Figures 1, 2, and 3 have been improved and now use vector graphics.

2) Some of the references (such as [32]) are missing bibliometric information.

Thank you for catching these errors - the bibliography has been corrected and carefully checked.

3) The authors modify the [BranchX] operation and introduce a symbol [pop]. They write "Unlike in SMILES, however, [Branch] and [pop] tokens need not come in pairs, which helps maintain robustness.". This sounds similar to the structure of SMILES's branch where a branch is opened and closed (by brackets). However, their unclosed brackets cannot be interpreted uniquely and thus are invalid. How is this problem circumvented in Group-SELFIES? What happens if the string has multiple consecutive [BranchX] or [pop]?

In Group SELFIES, [Branch] tokens create a new branch, which persists until the next [pop] token is read, or until the string ends. If a [Branch] is never followed by a [pop], this just means that the decoder continues building on that branch until all tokens have been read. If multiple [pop] tokens are read, then branches are popped for each [pop] token, unless the decoder is on the starting main branch, in which case [pop] tokens are ignored.

For example, if [X][Branch][Branch] is read, then subsequent tokens will build on the branch created by the second [Branch] token. Once a [pop] token is read, then decoding returns to atom X and continues building on the branch created by the first [Branch] token. If X happens to already have full valency, then this branch is immediately popped, which may end decoding if there were no previous branches before [X].

A few sentences have been added to Section 3.2 to better clarify [Branch] and [pop].
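
As an illustration only, the [Branch]/[pop] behaviour described above can be sketched with a small stack-based toy decoder. This is not the Group SELFIES implementation: the function name `decode_branches` is hypothetical, valence checks and group tokens are left out, and atoms are simply joined by single bonds.

```python
def decode_branches(tokens):
    """Toy decoder returning (atoms, bonds) for a list of tokens.

    Assumption: no valence limits and no group tokens, so the sketch only
    shows how unpaired [Branch] and [pop] tokens can still be interpreted.
    """
    atoms = []      # atom symbols in creation order
    bonds = []      # (parent_index, child_index) pairs
    stack = []      # one entry per open branch: the atom to return to
    current = None  # index of the atom the next atom attaches to

    for tok in tokens:
        if tok == "[Branch]":
            if current is not None:
                stack.append(current)  # remember where this branch started
        elif tok == "[pop]":
            if stack:
                current = stack.pop()  # go back to the branching atom
            # with no open branch (main branch), the token is ignored
        else:
            atoms.append(tok.strip("[]"))
            new_index = len(atoms) - 1
            if current is not None:
                bonds.append((current, new_index))
            current = new_index
    return atoms, bonds


# Two consecutive [Branch] tokens and surplus [pop] tokens still decode.
star = decode_branches(
    ["[C]", "[Branch]", "[Branch]", "[O]", "[pop]", "[N]", "[pop]", "[F]", "[pop]", "[pop]"]
)
# A [Branch] that is never popped simply persists to the end of the string.
chain = decode_branches(["[C]", "[Branch]", "[O]", "[N]"])
```

In this toy, every token sequence yields some tree of atoms, which is the property the unpaired tokens are meant to preserve.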

4) In the chapter "3.5 Determining Fragments", it is not entirely clear to me whether all fragments have been extracted from datasets and other sources by hand, or whether autonomous fragment discovery mechanisms have/can be applied. The latter one would be very interesting, and even if it has not been implemented, I would like to read more about how robustness can still be guaranteed when the fragment is autonomously extracted.

In this study, fragments were extracted automatically using two basic methods we implemented. We believe that several other autonomous fragmentation algorithms can be readily applied, and we have cited these in Section 3.5. Any fragmentation algorithm which produces a set of SMILES strings can be readily applied by setting the attachment points to all atoms with available valency. Robustness is still guaranteed for all groups, whether obtained from autonomous fragmentation or manually specified, because any sequence of tokens interleaved with group tokens can always be interpreted by the decoder to navigate the attachment points of those groups.

5) The authors write "One tradeoff of Group SELFIES is that encoding and decoding are usually slower than with SELFIES, likely due to overhead of RDKit operations.". I did not understand why RDKit should be the reason for the slowdown. The authors do not use RDKit during encoding/decoding if I understood correctly, just for computing validity scores. Can the authors please elaborate on that?

The RDKit Mol data structure is used as an underlying molecule representation inside the encoder and decoder because it provides a suitable, well-maintained API for manipulating molecular graphs. We believe RDKit is the reason for the slowdown because when profiling our code, we found that RDKit operations took up the majority of time. Though RDKit is fast with a C++ implementation, our implementation of Group SELFIES consecutively uses several RDKit operations. The slowdown might be caused by repetitive data transfers from Python to RDKit and back. These data transfers would be redundant if all operations were implemented in a single language.
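
For readers who want to repeat this kind of measurement, a minimal profiling sketch with Python's standard `cProfile` and `pstats` modules is shown below. The functions `slow_graph_op` and `toy_encode` are hypothetical stand-ins, not Group SELFIES or RDKit calls; the point is only how a report sorted by cumulative time exposes which repeated call dominates.

```python
import cProfile
import io
import pstats


def slow_graph_op(n):
    """Stand-in for a library routine that an encoder calls many times."""
    return sum(i * i for i in range(n))


def toy_encode(items):
    """Hypothetical encoder that calls the routine once per item."""
    return [slow_graph_op(2000) for _ in items]


def profile_report(func, *args, top=5):
    """Run func under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


report = profile_report(toy_encode, range(200))
```

Printing `report` lists the call counts and cumulative times per function, so a routine that is cheap per call but invoked very often becomes visible.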

Referee 2:
Major comments
1. In Table 2, it would be interesting to supplement the MOSES metrics with other metrics that might help differentiate the performance of SELFIES and Group SELFIES. For example, Wasserstein distances between molecular weight, length, synthetic complexity, number of ring and substructure distributions.

We have added distribution plots of molecular weight, SAScore, logP, and QED in Appendix A.6, though they indicate that Group SELFIES and SELFIES have similar performance. We believe that FCD should capture a more intrinsic measure of distribution learning because ChemNet has likely learned what substructures are relevant for bioactivity, whereas molecular weight, SAScore, logP, and QED are largely a combination of atom-based contributions.

Substructure distributions should also be captured by the Frag and Scaf metrics in Table 2.
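
For reference, the Wasserstein distance suggested by the referee can be computed for one-dimensional property distributions as the area between the two empirical CDFs. The sketch below is a plain-Python illustration with a hypothetical function name `wasserstein_1d` and made-up toy inputs; it is not part of the MOSES benchmark or of the analysis in the manuscript.

```python
def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two empirical 1-D distributions.

    Each sample carries equal weight within its own list. The distance is
    the integral of |F_x(t) - F_y(t)| over t, evaluated piecewise between
    consecutive sample values.
    """
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    xs, ys = sorted(xs), sorted(ys)
    points = sorted(set(xs) | set(ys))
    i = j = 0
    total = 0.0
    for left, right in zip(points, points[1:]):
        # advance the counters so i and j count the samples <= left
        while i < len(xs) and xs[i] <= left:
            i += 1
        while j < len(ys) and ys[j] <= left:
            j += 1
        total += abs(i / len(xs) - j / len(ys)) * (right - left)
    return total


# Toy values only: the second sample is the first one shifted by 5.
shifted = wasserstein_1d([0.0, 1.0, 3.0], [5.0, 6.0, 8.0])
same = wasserstein_1d([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```

A pure shift of a distribution gives a distance equal to the shift, which makes the value easy to read in the units of the property being compared.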

2. How does Group SELFIES compare to simply tokenizing common SELFIES substrings and increasing the vocabulary size? Appendix A.4 suggests performance is similar for random generation, but how about for simple generative models working with an expanded SELFIES substring vocabulary?

While it would be interesting to see how fragment inductive biases can be incorporated in regular SELFIES generative models, the main scope of this work is the representation of Group SELFIES itself. Additionally, we believe our results in Appendix A.4 provide sufficient indication that a generative model working with SELFIES substrings would have similar performance to a generative model using Group SELFIES. Regarding other work, JANUS, a genetic algorithm that uses regular SELFIES, uses mutations that add SELFIES substrings corresponding to random radius-3 fragments in the dataset, and performs quite well.

Nigam, A., Pollice, R., & Aspuru-Guzik, A. (2022). Parallel tempered genetic algorithm guided by deep neural networks for inverse molecular design. Digital Discovery, 1(4), 390-404.

3. Can the authors provide some analysis for various common molecular datasets (enamine, zinc, chembl, etc.) on how many groups are needed as a function of dataset size scaling? It would be useful to provide these pre-computed group sets and some indication about how many groups are needed to capture different percentiles of the total number of groups that occur in common datasets.

Studying the performance of generative models using Group SELFIES with different group set sizes is an interesting direction that we have added to Section 5.2 Future Work. This direction can place results in the context of several fragment-based analyses of these datasets.
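
As a rough sketch of the kind of analysis the referee describes, the number of groups needed to cover a given fraction of all group occurrences can be read off a frequency count. The function name `groups_needed` and the toy input are assumptions for illustration; no dataset results are implied.

```python
from collections import Counter


def groups_needed(group_occurrences, fraction):
    """Smallest number of most-frequent groups covering `fraction` of occurrences.

    `group_occurrences` is any iterable of group identifiers, one entry per
    time a group appears in the dataset.
    """
    counts = Counter(group_occurrences)
    total = sum(counts.values())
    covered = 0
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered >= fraction * total:
            return rank
    return len(counts)


# Toy input: group "a" occurs 6 times, "b" 3 times, "c" once.
toy = ["a"] * 6 + ["b"] * 3 + ["c"]
half = groups_needed(toy, 0.5)
most = groups_needed(toy, 0.8)
everything = groups_needed(toy, 1.0)
```

Repeating this for several fractions and dataset sizes would give the scaling curve the referee asks about.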

4. Considering the ~65x slowdown in encoding Group SELFIES compared to SELFIES, and the comparable performance to substring SELFIES on random generation, it may be useful to provide utilities for translating group sets into substring SELFIES tokens that can be directly tokenized for use in string-based generative methods.

This utility already exists - because groups are stored as RDKit Mol objects, they can be readily translated to SELFIES substrings via Mol -> SMILES -> SELFIES.

5. In the common generative task of “infilling”, a motif or multiple motifs in a lead molecule are conserved, while “decorating” the molecule or selectively replacing a motif. Can the authors compare, perhaps using the random generation setup, how Group SELFIES performs against SELFIES and SMILES in this sort of infilling task?

Thank you for the suggestion. We believe that experiments presented in Appendix A.3 on the dataset of nonfullerene acceptors (NFA) provide sufficient indication that Group SELFIES can perform scaffold decoration while preserving the scaffold. NFA contains several conjugated aromatic systems. While random SELFIES rarely ever preserve aromatic rings, and random SMILES are almost never valid, Group SELFIES can preserve aromatic scaffolds and combine them in new ways.

Minor comments
1. In section 3.2, can the authors clarify why [Branch] and [pop] tokens do not need to come in pairs?

[Branch] and [pop] tokens do not need to come in pairs because whenever a [Branch] token is read, a branch is created that persists until a [pop] token is read, or until the end of the string. A [Branch] token that is not followed by a [pop] token just means that the branch created by this [Branch] token persists until the end of the string. A few sentences have been added to Section 3.2 for better explanation.

2. How is the “essential set” of chiral centers from eMolecules determined? Is this set sufficient to represent chiral centers across commonly used datasets, e.g., Chembl, Enamine?

The “essential set” of chiral centers was determined by encoding and decoding the entire dataset without any groups, then collecting all examples where the encoding and decoding did not return the same molecule, and then manually adding all necessary chiral centers to the essential set. We believe that this essential set likely covers all boron/carbon/nitrogen/sulfur/phosphorus chiral centers used in ChEMBL and Enamine.
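
The round-trip check described above can be sketched generically as follows. The helper `find_roundtrip_failures` and the toy `encode`/`decode` functions are hypothetical placeholders, not the Group SELFIES encoder and decoder; in practice, molecules would be compared through a canonical form rather than as raw strings.

```python
def find_roundtrip_failures(molecules, encode, decode, canonical=lambda m: m):
    """Return (original, recovered) pairs where decode(encode(m)) differs from m.

    `canonical` maps a molecule to a comparable form; it defaults to the
    identity, which is only adequate for this toy example.
    """
    failures = []
    for mol in molecules:
        recovered = decode(encode(mol))
        if canonical(recovered) != canonical(mol):
            failures.append((mol, recovered))
    return failures


def toy_encode(smiles):
    """Toy encoder that loses stereo markers, to simulate a missing chiral group."""
    return smiles.replace("@", "")


def toy_decode(encoded):
    """Toy decoder: returns its input unchanged."""
    return encoded


failures = find_roundtrip_failures(["C[C@H](N)O", "CCO"], toy_encode, toy_decode)
```

Each failure points at a molecule whose stereocentre is not yet representable, which is where a chiral group would then be added by hand.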

Referee 3:

While the encoding and decoding of Group SELFIES will be handled by computers, I think it would help the readers and users of the toolkit if the authors could elaborate the encoding process in section 3.2. I had difficulty following Figure 2 in my first read, but no problem following Appendix A.1. It would be nice if tokens used in these two sections could be harmonized.

Figure 2 and its caption have been revised for clarity. A paragraph explaining Figure 2 has also been added in Section 3.4.

Referee 4:
1. The code for the fragmentation schemes employed to gather the actual group set for the ZINC-250k dataset (Section 4.1, Figure 4) is not available: only the resulting text file with the predefined fragments is provided. Considering the importance of group definition for Group SELFIES, more actual examples on how this process takes place could be of interest to both readers and users. While the tutorial notebook does briefly introduce this topic, presenting the results of using either the “default” or the “MMPA” schemes, further details and a brief discussion of the performance and the kinds of groups resulting from the different strategies available in fragment_utils.py, over this same ZINC-250k dataset, would come in useful, possibly as an additional notebook.

We believe that a detailed study of how fragmentation strategies affect the performance of Group SELFIES-based generative models should be tackled by future work. Future work can also draw on a large body of literature on fragmentation algorithms which we cite in Section 3.5. The fragmentation strategies described in this work were intended to provide a simple set of groups to demonstrate basic functionality of Group SELFIES as a representation.

2. The study in Section 4.3 (Distribution Learning, involving Table 2 and Figure 6) is not available in the repository, while the other examples in the paper are. From the MOSES paper and repository, the benchmark may be assumed to be called from the command-line interface of this tool, but having the script would be more consistent with the rest of the provided experiments. More importantly, due to the absence of this specific script/notebook, the specific fragmentation and group selection strategy followed to build the corresponding group set for Group SELFIES is only hinted at in the text, but not directly available.

We accidentally lost access to the original training script, but we have reproduced it to the best of our ability and included it in the repository. The fragmentation strategy for the “useful set” of 30 groups was the “default” scheme, just as demonstrated in the tutorial notebook, though a different number of groups was generated.

3. While specific details on the protocol for Distribution Learning can be readily checked in either the article or the repository for the MOSES framework, a brief mention of MOSES being based on the ZINC Clean Leads dataset, the approximate number of molecules it contains (~2M) and the train/test/scaffold split would make the section clearer and more self-contained.

This change has been added in Section 4.3.

4. The “tutorial.ipynb” notebook includes a clear explanation of how the fragments (or “groups”) are expressed and defined. This explanation is largely missing from the manuscript, which only points to a “SMILES-like syntax” being possible, without any further details. Given that these groups are the cornerstone of the method, I think that the overall consistency of the work would improve if the manuscript provided a better description of this essential part of the Group SELFIES approach, together with the repository showing more direct examples on group set generation as raised in point #1.

A code snippet of how groups are defined in Group SELFIES has been included in Section 3.3.

5. The “extended chirality” section, which addresses a main issue of molecular string representations, explicitly states “We leave the proper implementation of representing global chirality to future work”, yet it is not placed under the actual “Future work” subsection. Depending on the extent of the preliminary work on this problem, it could be useful to have some code examples of the chiral groups in Figure 7 actually being employed in the Group SELFIES framework, e.g., some custom definition of how the corresponding group set dictionaries might be introduced instead of using the standard SMILES-like syntax of the other examples, even if it is not yet fully standardized. In any case, I find that the current organization can be somewhat misleading regarding the extent of applicability of Group SELFIES, and it might become clearer by just moving this discussion on extended chirality under the “Future work” headline.

The section on extended chirality was moved to future work. We have added details on what defining extended chirality might look like, i.e., defining a 3D fragment with special atoms indicating attachment points.

Editor’s decision letter

28-Mar-2023

Dear Mr Cheng:

Manuscript ID: DD-ART-01-2023-000012.R1
TITLE: Group SELFIES: A Robust Fragment-Based Molecular String Representation

Thank you for submitting your revised manuscript to Digital Discovery. I am pleased to accept your manuscript for publication in its current form. I have copied any final comments from the reviewer(s) below.

You will shortly receive a separate email from us requesting you to submit a licence to publish for your article, so that we can proceed with the preparation and publication of your manuscript.

You can highlight your article and the work of your group on the back cover of Digital Discovery. If you are interested in this opportunity please contact the editorial office for more information.

Promote your research, accelerate its impact – find out more about our article promotion services here: https://rsc.li/promoteyourresearch.

If you would like us to promote your article on our Twitter account @digital_rsc please fill out this form: https://form.jotform.com/213544038469056.

We are offering all corresponding authors on publications in gold open access RSC journals who are not already members of the Royal Society of Chemistry one year’s Affiliate membership. If you would like to find out more please email membership@rsc.org, including the promo code OA100 in your message. Learn all about our member benefits at https://www.rsc.org/membership-and-community/join/#benefit

By publishing your article in Digital Discovery, you are supporting the Royal Society of Chemistry to help the chemical science community make the world a better place.

With best wishes,

Dr Kedar Hippalgaonkar
Associate Editor, Digital Discovery
Royal Society of Chemistry

Reviewer comments

Reviewer 2

The authors have adequately responded to all comments and suggestions. I recommend publication of the article.

Reviewer 1

Thank you for the detailed answers to my questions and those of the other reviewers. I now recommend the acceptance of this manuscript as-is.

From the journal Digital Discovery Peer review history
