From the journal Digital Discovery Peer review history

Recent advances in the self-referencing embedded strings (SELFIES) library

Round 1

Manuscript submitted on 17 Mar 2023
 

30-Apr-2023

Dear Mr Lo:

Manuscript ID: DD-TRV-03-2023-000044
TITLE: Recent advances in the Self-Referencing Embedded Strings (SELFIES) library

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

After careful evaluation of your manuscript and the reviewers’ reports, I will be pleased to accept your manuscript for publication after revisions.

Please revise your manuscript to fully address the reviewers’ comments. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy, https://credit.niso.org/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link:

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log in to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process. We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Linda Hung
Associate Editor
Digital Discovery
Royal Society of Chemistry

************


 
Reviewer 1

Congratulations on your excellent work in developing and improving the SELFIES library.

Reviewer 2

I am one of the contributors to the DECIMER and STOUT systems. We use SELFIES encoding extensively and consider it a major milestone in the overall history of chemical structure representation. The manuscript is important in advancing this exceptionally promising area of research, so I strongly support its publication in Digital Discovery. Because it is very well written, logically concise, and convincingly structured, I recommend its publication as is.

Nevertheless, I would like to encourage the authors to enrich their work with (more) examples and corresponding graphical illustrations. Even if this is not necessary from a purely scientific point of view, it would facilitate understanding and promote a wider dissemination of the topic in the (quite diverse) chemistry community. From my teaching experience, I can say that SELFIES are initially considered more difficult to understand than SMILES and even DeepSMILES - so I would place a clear bet on the outcome of "Future project 12: Experiment on readability of molecular string representations" in Ref. 20. Since I fully agree with the authors' vision “that SELFIES [should] become a standard computer representation for molecular matter”, broader chemical attention is mandatory.
Last but not least, in view of its ubiquitous availability, SELFIES would additionally benefit from direct implementation in widely used open cheminformatics libraries such as RDKit and/or CDK.

Reviewer 3

I believe that this tutorial has the potential to serve as a valuable educational resource in the field. However, several points need to be addressed and clarified to enhance the tutorial's effectiveness and comprehensiveness, ensuring its suitability for publication.

I have the following suggestions and questions.

1. The product rule could be better explained with a concrete example, such as a small molecule with multiple atom types. This would help readers understand the concept more easily.

2. The use of the Kleene star operation in the context-free grammar is not introduced, and although it is a valid notation in formal language theory, it may not be familiar to readers who are not well-versed in this area. It could be helpful for the authors to provide some explanation or use an alternative notation that is more accessible to a broader audience.

3. The handling of multi-valency in SELFIES is not well-explained in the tutorial. The authors should provide more information on this topic, as well as clarify whether SELFIES is limited to organic molecules or if it can handle a broader range of elements and bond types. A table or figure illustrating the applicable elements, bond types, and periodic table coverage would be helpful.

4. The encoding process (SMILES to SELFIES) is not adequately explained in the tutorial. If the reason for this omission is that encoding is simply the reverse of decoding, then the authors should clarify this point. If the procedures are different, the tutorial should include a detailed explanation of the encoding process. Additionally, the authors should clarify their use of the term "reverse translation" when referring to encoding, as the purpose is definitely unclear.

5. The paper emphasizes that SMILES are prone to syntactic and semantic errors and implies that the error rate is significant. The authors should provide a clear benchmark of invalidity rates for SMILES on large, diverse datasets using state-of-the-art models in molecular generation tasks. Such benchmarks should include percentages of parsing errors (e.g., unclosed rings, extra open or close parentheses), kekulization errors, and explicit valence errors, among others.
Similarly, regarding model accuracy, we expect a performance gain due to zero invalid outcomes, but as seen in previous works, there is a substantial performance drop in forward and backward reaction prediction tasks (https://pubs.acs.org/doi/full/10.1021/acs.jcim.1c01467), and in fingerprint to SMILES/SELFIES translation tasks (https://jcheminf.biomedcentral.com/articles/10.1186/s13321-023-00693-0). Can the performance gain compensate for the loss? This is critical in terms of the applicability of SELFIES. Can the authors highlight the situations in which SELFIES is a more favorable choice than SMILES for deep learning tasks?

6. If SMILES has a simple underlying grammar, as the authors suggest, why do state-of-the-art models struggle to understand it fully? The authors should address this question in conjunction with their discussion of invalidity rates, which are directly related to syntactic errors.

7. As far as I know, there are no tools available for converting molecular structures directly to SELFIES, bypassing SMILES altogether.
Considering the initial motivation for SELFIES, the limitation it addresses, the concepts it borrows from SMILES, and the prevalence of SMILES in the literature, it seems that SELFIES is strongly tied to SMILES. The authors should clarify the relationship between the two representations in the beginning, and discuss the possibility of creating SELFIES libraries independent of SMILES as outlook.


 

Referee: 1

Congratulations on your excellent work in developing and improving the SELFIES library.

Authors: We thank the reviewer for this kind comment.

---

Referee: 2

I am one of the contributors to the DECIMER and STOUT systems. We use SELFIES encoding extensively and consider it a major milestone in the overall history of chemical structure representation. The manuscript is important in advancing this exceptionally promising area of research, so I strongly support its publication in Digital Discovery. Because it is very well written, logically concise, and convincingly structured, I recommend its publication as is.

Authors: We thank the reviewer for this kind assessment.

Nevertheless, I would like to encourage the authors to enrich their work with (more) examples and corresponding graphical illustrations. Even if this is not necessary from a purely scientific point of view, it would facilitate understanding and promote a wider dissemination of the topic in the (quite diverse) chemistry community. From my teaching experience, I can say that SELFIES are initially considered more difficult to understand than SMILES and even DeepSMILES - so I would place a clear bet on the outcome of "Future project 12: Experiment on readability of molecular string representations" in Ref. 20. Since I fully agree with the authors' vision “that SELFIES [should] become a standard computer representation for molecular matter”, broader chemical attention is mandatory.

Authors: We thank the reviewer for this insightful comment. We have added examples throughout subsections III.C-III.E to better convey the SELFIES derivation rules.

Last but not least, in view of its ubiquitous availability, SELFIES would additionally benefit from direct implementation in widely used open cheminformatics libraries such as RDKit and/or CDK.

Authors: We thank the reviewer for this comment. We agree that SELFIES could benefit from implementation in established cheminformatics libraries. Our focus so far has been on making SELFIES a lightweight standalone library that can naturally be incorporated in any Python-based package. Hence, inclusion in RDKit would be very straightforward. Inclusion in CDK would likely require rewriting SELFIES in Java, which could be a future project to pursue.

---

Referee: 3
I believe that this tutorial has the potential to serve as a valuable educational resource in the field. However, several points need to be addressed and clarified to enhance the tutorial's effectiveness and comprehensiveness, ensuring its suitability for publication. I have the following suggestions and questions.

1. The product rule could be better explained with a concrete example, such as a small molecule with multiple atom types. This would help readers understand the concept more easily.

Authors: We thank the reviewer for this comment. We provide concrete examples of the derivation rules at the ends of subsections III.C-III.E.

2. The use of the Kleene star operation in the context-free grammar is not introduced, and although it is a valid notation in formal language theory, it may not be familiar to readers who are not well-versed in this area. It could be helpful for the authors to provide some explanation or use an alternative notation that is more accessible to a broader audience.

Authors: We thank the reviewer for this important comment. We have added a footnote defining the Kleene star operation after it is first used.
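For readers of this record who are unfamiliar with the notation, the Kleene star of an alphabet is the set of all finite strings (including the empty string) that can be built from its symbols. The sketch below is not taken from the manuscript's footnote; the function name `kleene_star` and the two-symbol alphabet are illustrative assumptions, and the enumeration is cut off at a maximum length because the full set is infinite.

```python
from itertools import product


def kleene_star(alphabet, max_len):
    """Yield every string over `alphabet` with at most `max_len` symbols.

    The true Kleene star is infinite; this generator truncates it at
    `max_len` symbols and yields shorter strings first.
    """
    for n in range(max_len + 1):
        # product(..., repeat=0) yields one empty tuple, i.e. the empty string.
        for symbols in product(alphabet, repeat=n):
            yield "".join(symbols)


# Illustrative alphabet of two bracketed symbols (an assumption for this sketch).
words = list(kleene_star(["[C]", "[O]"], 2))
print(words)
```

With an alphabet of k symbols, the truncated set contains 1 + k + k^2 + ... + k^max_len strings, which is why the example above lists seven entries.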

3. The handling of multi-valency in SELFIES is not well-explained in the tutorial. The authors should provide more information on this topic, as well as clarify whether SELFIES is limited to organic molecules or if it can handle a broader range of elements and bond types. A table or figure illustrating the applicable elements, bond types, and periodic table coverage would be helpful.

Authors: We thank the reviewer for this remark. Table IV lists the default constraints of SELFIES, illustrating the applicable elements and their assumed valences. For bond types, SELFIES only supports single, double, and triple bonds. We have clarified this at the beginning of section IV.
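As a rough illustration of how valence constraints and the three supported bond types can interact, the sketch below caps a requested bond order by the valence still available on each atom. It is a simplified assumption about the general idea, not the selfies library's implementation: the names `MAX_VALENCE` and `capped_bond_order` are hypothetical, and the valences are textbook values chosen for illustration rather than copied from Table IV.

```python
# Textbook valences for illustration only; NOT copied from Table IV.
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1}

# Only single (1), double (2), and triple (3) bonds are considered.
MAX_BOND_ORDER = 3


def capped_bond_order(requested, remaining_prev, new_element):
    """Return the largest bond order <= `requested` that both atoms allow.

    `remaining_prev` is the unused valence on the atom already in the chain;
    `new_element` is the element symbol of the atom being attached.
    A result of 0 means no bond can be formed.
    """
    if not 1 <= requested <= MAX_BOND_ORDER:
        raise ValueError("bond order must be 1, 2, or 3")
    return min(requested, remaining_prev, MAX_VALENCE[new_element])


print(capped_bond_order(3, 4, "O"))  # triple bond requested to oxygen
print(capped_bond_order(2, 1, "C"))  # previous atom has one valence left
```

In this toy model a requested triple bond to oxygen is reduced to a double bond, and a double bond to an atom with a single free valence is reduced to a single bond, so no atom ever exceeds its assumed maximum valence.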

4. The encoding process (SMILES to SELFIES) is not adequately explained in the tutorial. If the reason for this omission is that encoding is simply the reverse of decoding, then the authors should clarify this point. If the procedures are different, the tutorial should include a detailed explanation of the encoding process. Additionally, the authors should clarify their use of the term "reverse translation" when referring to encoding, as the purpose is definitely unclear.

Authors: We thank the reviewer for this insightful comment. Our use of “reverse translation” was intended to mean a translation in the reverse direction, as opposed to a strict functional inverse. We have clarified our usage of the phrase, and added comments to explain the behaviour of the encoder() function.

5. The paper emphasizes that SMILES are prone to syntactic and semantic errors and implies that the error rate is significant. The authors should provide a clear benchmark of invalidity rates for SMILES on large, diverse datasets using state-of-the-art models in molecular generation tasks. Such benchmarks should include percentages of parsing errors (e.g., unclosed rings, extra open or close parentheses), kekulization errors, and explicit valence errors, among others.
Similarly, regarding model accuracy, we expect a performance gain due to zero invalid outcomes, but as seen in previous works, there is a substantial performance drop in forward and backward reaction prediction tasks (https://pubs.acs.org/doi/full/10.1021/acs.jcim.1c01467), and in fingerprint to SMILES/SELFIES translation tasks (https://jcheminf.biomedcentral.com/articles/10.1186/s13321-023-00693-0). Can the performance gain compensate for the loss? This is critical in terms of the applicability of SELFIES. Can the authors highlight the situations in which SELFIES is a more favorable choice than SMILES for deep learning tasks?

Authors: We thank the reviewer for this important comment. We agree that a benchmark of invalidity rates and performance across various tasks would be both valuable and important. However, we believe that the subject and scope of these analyses make them more suitable for future work, as this tutorial article is intended to primarily serve as a description of the algorithms and API of the current selfies library.

6. If SMILES has a simple underlying grammar, as the authors suggest, why do state-of-the-art models struggle to understand it fully? The authors should address this question in conjunction with their discussion of invalidity rates, which are directly related to syntactic errors.

Authors: We thank the reviewer for this important point. Although the SMILES grammar is simple (by which we mean conceptually simple), SMILES can be fragile due to the rigidity of its grammar rules. For example, a single misplaced bracket or ring number can invalidate a generated SMILES string. We have clarified this in the introduction, when the concept of syntactic invalidity is first discussed.
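The fragility described in this response can be made concrete with a toy checker. The sketch below is an assumption-laden illustration, not a SMILES parser and not part of the selfies library: the function name `toy_syntax_errors` is hypothetical, and it only tracks branch parentheses and single-digit ring closures while skipping the contents of bracket atoms. It shows how dropping one character turns an otherwise well-formed string into an invalid one.

```python
def toy_syntax_errors(smiles):
    """Return a list of parenthesis / ring-closure problems in a SMILES-like string.

    Deliberately minimal: handles '(' ')' and single-digit ring closures,
    ignores anything inside '[...]', and does not support '%nn' ring labels,
    aromaticity, or valence checks.
    """
    errors = []
    depth = 0            # current branch nesting level
    open_rings = set()   # ring-closure digits seen an odd number of times
    in_bracket = False   # True while inside a bracket atom such as [13CH4]
    for ch in smiles:
        if in_bracket:
            in_bracket = ch != "]"
            continue
        if ch == "[":
            in_bracket = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                errors.append("unmatched ')'")
                depth = 0
        elif ch.isdigit():
            # First sighting opens the ring, second sighting closes it.
            open_rings ^= {ch}
    if depth > 0:
        errors.append("unclosed '('")
    if open_rings:
        errors.append("unclosed ring " + ",".join(sorted(open_rings)))
    return errors


print(toy_syntax_errors("c1ccccc1"))  # benzene, well formed
print(toy_syntax_errors("c1ccccc"))   # same string with the final ring digit dropped
print(toy_syntax_errors("CC(=O"))     # acetaldehyde-like string missing ')'
```

A real validity check would also need kekulization and valence tests, which this sketch omits.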

7. As far as I know, there are no tools available for converting molecular structures directly to SELFIES, bypassing SMILES altogether. Considering the initial motivation for SELFIES, the limitation it addresses, the concepts it borrows from SMILES, and the prevalence of SMILES in the literature, it seems that SELFIES is strongly tied to SMILES. The authors should clarify the relationship between the two representations in the beginning, and discuss the possibility of creating SELFIES libraries independent of SMILES as outlook.

Authors: We thank the reviewer for this remark. Under the hood, the selfies encoder() and decoder() translation functions go through an internal molecular graph representation as an intermediate step, which was refactored in selfies v2. Although we have not exposed the SELFIES-to-graph utilities, this is an important step towards a SMILES-independent selfies library. We have clarified the relationship between SELFIES and SMILES with additional comments in section IV.A.




Round 2

Revised manuscript submitted on 11 Jun 2023
 

23-Jun-2023

Dear Mr Lo:

Manuscript ID: DD-TRV-03-2023-000044.R1
TITLE: Recent advances in the Self-Referencing Embedded Strings (SELFIES) library

Thank you for submitting your revised manuscript to Digital Discovery. I am pleased to accept your manuscript for publication in its current form. I have copied any final comments from the reviewer(s) below.

You will shortly receive a separate email from us requesting you to submit a licence to publish for your article, so that we can proceed with the preparation and publication of your manuscript.

You can highlight your article and the work of your group on the back cover of Digital Discovery. If you are interested in this opportunity please contact the editorial office for more information.

Promote your research, accelerate its impact – find out more about our article promotion services here: https://rsc.li/promoteyourresearch.

If you would like us to promote your article on our Twitter account @digital_rsc please fill out this form: https://form.jotform.com/213544038469056.

We are offering all corresponding authors on publications in gold open access RSC journals who are not already members of the Royal Society of Chemistry one year’s Affiliate membership. If you would like to find out more please email membership@rsc.org, including the promo code OA100 in your message. Learn all about our member benefits at https://www.rsc.org/membership-and-community/join/#benefit

By publishing your article in Digital Discovery, you are supporting the Royal Society of Chemistry to help the chemical science community make the world a better place.

With best wishes,

Linda Hung
Associate Editor
Digital Discovery
Royal Society of Chemistry


 
Reviewer 3

Thanks for answering all of my questions.
I confirmed that the quality of the manuscript was improved according to my previous comments. I believe the article is more consistent now.




Transparent peer review

To support increased transparency, we offer authors the option to publish the peer review history alongside their article. Reviewers are anonymous unless they choose to sign their report.

We are currently unable to show comments or responses that were provided as attachments. If the peer review history indicates that attachments are available, or if you find there is review content missing, you can request the full review record from our Publishing customer services team at RSC1@rsc.org.

Content on this page is licensed under a Creative Commons Attribution 4.0 International license.