From the journal Digital Discovery
Peer review history

Definition and exploration of realistic chemical spaces using the connectivity and cyclic features of ChEMBL and ZINC

Round 1

Manuscript submitted on 06 Sep 2022
 

24-Jan-2023

Dear Dr Cauchy:

Manuscript ID: DD-ART-09-2022-000092
TITLE: Definition and exploration of realistic chemical spaces using the connectivity and cyclic features of ChEMBL and ZINC.

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

I have carefully evaluated your manuscript and the reviewers’ reports, and the reports indicate that major revisions are necessary.

Please submit a revised manuscript which addresses all of the reviewers’ comments. Further peer review of your revised manuscript may be needed. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy, https://credit.niso.org/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link:

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log on to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process.   We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Linda Hung
Associate Editor
Digital Discovery
Royal Society of Chemistry

************


 
Reviewer 1

In this work, the authors propose an interesting approach for limiting the chemical space explored by generative models in order to focus on the “realistic” subspace. As the “data reviewer”, I focus only on data and code availability.

Comments to the “Data reviewer checklist”

1. DATA SOURCES
1a. There is no “DATA” section in the article. The data sources are either briefly mentioned in the Introduction and Methodology (ChEMBL25 and ZINC20) or introduced only when reporting results (QM9, PC9, GDBChEMBL, GDB11). The number of compounds retrieved for each dataset is not specified. Only the total number of molecules used from ChEMBL and ZINC together is provided, in the (!) Conclusion (!) section, which is not the place for this information to be given for the first time.

1b. The version numbers provided for each dataset partially compensate for the absence of an access date for these external sources. However, the ZINC library is updated regularly, while its version number changes only once every five years (ZINC15 was released in 2015, and ZINC20 in 2020). Therefore, simply stating the version of this collection is not sufficient to trace back to the molecules that were downloaded. In addition, the ZINC website offers several filters that can be applied when downloading the collection, and they influence the number of retrieved compounds. However, the details of the download were not discussed in the manuscript.

1c. A slight bias lies in the removal of nitro-group-containing compounds from the dataset due to a compatibility issue with EvoMol (see section 1.3). Apparently, EvoMol does not process SMILES with ions and zwitterions, and thus nitro-group-containing molecules were discarded. However, this is a question of structure standardization and can be solved by a simple transformation rule: [#8-:2]-[#7+:1]=[O:3] >> [O:3]=[N:1]=[O:2]. The absence of nitro groups does not influence the GCFscore, but it does influence filters based on ECFP4. Even if the authors decide to still remove nitro-group-containing compounds completely, the number of “lost” molecules should be provided.

2. DATA CLEANING

There is almost no mention of the data cleaning and SMILES standardization steps in the manuscript. The only information that the reader can get from the fuzzily named section “1.3. Computational detail” is the previously discussed removal of nitro-group-containing molecules and of stereochemistry. However, it is not clear whether there was any other data cleaning or SMILES standardization step, which is especially important when combining data from different sources. Also, it is not clear how the authors dealt with the tautomer problem, which can be crucial for this kind of research.

REVISION CHECKLIST FOR AUTHORS

I can recommend this article for publication only after these minor revisions:

1. A separate “Data” section should be added to the corrected version of the manuscript. There, all datasets used (and their separate subsets) should be discussed, with the exact number of molecules in each provided. A detailed description of data cleaning and standardization is also needed. A statement concerning the treatment of tautomers will be appreciated.

2. Another file containing the cycle features whitelist should be added online as claimed in the manuscript (page 5: “lists of connectivity and cycle features used to define the realistic chemical spaces are also available” ). On the figshare there is only “ChEMBL and ZINC ECFP dictionnaries for whitelisting” JSON file and no mention of cycle features whitelist.

3. Modification of the file names and their descriptions on the figshare page with respect to how corresponding results were described in the article will be appreciated. For example, it is a bit confusing that authors mention “sillywalk scores” on the figshare, even though such a term was never used in the manuscript itself.

4. I would recommend being more precise with numerical values (number of compounds in the library or number of connectivity features) by providing the exact number instead of possibly misleading vague approximations. For example, in Conclusions, it was mentioned that “nearly 1 000 000” connectivity features were generated, while in JSON there are 1 156 416 records.

Reviewer 2

In my opinion the following issues should be addressed before the paper is suitable for publication

1. It is not clear to me exactly what the nature of the filters is. Are they SMARTS strings (like Walters filters) or something else?

2. How much CPU time is required to filter a molecule?

3. If I understand correctly, the 9 molecules in Figure 7 that are marked with red have passed the cyclic feature filter. However, 5 of the 9 are macrocycles that don’t look synthesizable to me. A discussion of this should be added; preferably with the nearest analog in ChEMBL or ZINC to demonstrate synthesizability.

Jan Jensen (I choose to review this paper non-anonymously)

Reviewer 3

In this work, Cauchy, Da Mota, and coworkers developed an approach for filtering out unrealistic molecules based on their connectivity and cyclic features. As the authors stated in their cover letter, using connectivity features (i.e., fingerprints) as molecular filters is not a new concept, and multiple scores have already been created to evaluate synthetic feasibility (e.g., SAscore, RAscore). In my opinion, the paper lacks sufficient novelty and does not present a compelling argument demonstrating that the method proposed is more convenient and effective than other alternatives. For these reasons, I do not believe that this paper is suitable for publication in Digital Discovery.

Reviewer 4

In this manuscript, the authors introduced an extra checklist of fingerprints extracted from ChEMBL and ZINC. This additional checklist contains general cyclic substructures and complements the ECFP4 fingerprints extracted from the same databases. These fragments, as substructures found in purchasable compounds, could be used to examine the synthesizability of computer-generated molecules, and thus narrow down the chemical scope of generative models. The proposed methods could serve as a ready-to-use filter for molecule discovery. I recommend publishing this work in Digital Discovery subject to the following improvements/concerns.

1. The authors extracted fingerprints and GCF from the ChEMBL and ZINC databases. Those templates should be suitable for bio-molecule discovery. However, the authors also apply such templates to material discovery tasks like HOMO/LUMO optimization. Could the authors comment on whether such templates will filter out reasonable molecules with substructures not found in a bio-related molecule?
2. The authors argued that ECFP4 fingerprints are not sufficient for cycles. Generative models may take advantage and generate molecules that still pass filtration. Would ECFP6 fingerprints be able to filter out molecules that pass filter 1 but fail the GCF filter? Could the authors also explain why they choose the ECFP4 fingerprint instead of the ECFP6?
3. Could the authors give some details on computing resources and the running time of applying all proposed filters?
4. Some property labels under molecules in the pictures are too small to read. I found some parts of the manuscript hard to follow. The authors may want to polish the wording and reorganize some sections before publication. For example, I would suggest swapping sections 2.1 and 2.2 so that the results part starts from the data structure of the filtration lists, followed by applications in drug and material discovery.


 

First of all, we would like to thank the experts for their careful reading and for pointing out parts of the study that needed clarification. You will find below our answers (following the > sign) to each of these remarks.

REVIEWER REPORT(S):
Referee: 1

1. DATA SOURCES
1a. There is no “DATA” section in the article. The data sources are either briefly mentioned in the Introduction and Methodology (ChEMBL25 and ZINC20) or introduced only when reporting results (QM9, PC9, GDBChEMBL, GDB11). The number of compounds retrieved for each dataset is not specified. Only the total number of molecules used from ChEMBL and ZINC together is provided, in the (!) Conclusion (!) section, which is not the place for this information to be given for the first time.

> That is a very valid point. The article was at first focused on the impact of whitelists on the chemical spaces. The cost of building these lists convinced us of their potential interest to other researchers, so it is indeed important to describe these data better. We have therefore added a DATA section that specifies the versions and composition of the datasets used. In this section we also clarify the role of these data, either as tools for the generation (the whitelists) or as reference lists for comparison. The descriptions on figshare have also been updated.

1b. The version numbers provided for each dataset partially compensate for the absence of an access date for these external sources. However, the ZINC library is updated regularly, while its version number changes only once every five years (ZINC15 was released in 2015, and ZINC20 in 2020). Therefore, simply stating the version of this collection is not sufficient to trace back to the molecules that were downloaded. In addition, the ZINC website offers several filters that can be applied when downloading the collection, and they influence the number of retrieved compounds. However, the details of the download were not discussed in the manuscript.

> We hope that the new DATA section addresses this point concerning the ZINC database. Concerning the download of ChEMBL, the PhD student does not recall activating any filters other than selecting the substances category.

1c. A slight bias lies in the removal of nitro-group-containing compounds from the dataset due to a compatibility issue with EvoMol (see section 1.3). Apparently, EvoMol does not process SMILES with ions and zwitterions, and thus nitro-group-containing molecules were discarded. However, this is a question of structure standardization and can be solved by a simple transformation rule: [#8-:2]-[#7+:1]=[O:3] >> [O:3]=[N:1]=[O:2]. The absence of nitro groups does not influence the GCFscore, but it does influence filters based on ECFP4. Even if the authors decide to still remove nitro-group-containing compounds completely, the number of “lost” molecules should be provided.

> The text must not have been clear enough (especially the conclusion), because the datasets were not filtered at all. The ECFP4 whitelists are based on the entire datasets and therefore include all ECFPs regardless of the generator's limits, including those of compounds containing nitro groups. The ECFP4 whitelists are thus not biased by EvoMol's limitations. Subsets of these datasets were filtered only for comparisons with the chemical space accessible to EvoMol (Table 1, for example). We have clarified this in the text.

> The impossibility of generating nitro groups in EvoMol is a consequence of its graph representation, in which edges (bonds) carry integer bond orders. Moreover, the valences of the atoms cannot be exceeded, since implicit hydrogens complete the valence shells. It is therefore not currently possible to generate this kind of functional group. The text has been modified to explain this bias in the chemical space accessible to our generator.
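
> The valence constraint described above can be illustrated with a minimal sketch. The `DEFAULT_VALENCE` table and the `implicit_hydrogens` helper are hypothetical names introduced for this illustration only and are not EvoMol code; the sketch assumes neutral atoms, integer bond orders, and hydrogens filling whatever valence remains.

```python
# Default valences for neutral atoms (an assumption made for this sketch).
DEFAULT_VALENCE = {"C": 4, "N": 3, "O": 2}


def implicit_hydrogens(element, bond_orders):
    """Return the implicit hydrogen count, or None if the valence is exceeded."""
    remaining = DEFAULT_VALENCE[element] - sum(bond_orders)
    return remaining if remaining >= 0 else None


# Amine nitrogen in CH3-NH2: one single bond, so two implicit hydrogens.
amine_n = implicit_hydrogens("N", [1])

# Nitro nitrogen drawn without charges, C-N(=O)=O: bond orders 1 + 2 + 2 = 5.
nitro_n = implicit_hydrogens("N", [1, 2, 2])

print(amine_n)  # 2
print(nitro_n)  # None: the neutral valence of 3 would be exceeded
```

> In this toy model the uncharged drawing of the nitro group needs a bond-order sum of 5 on nitrogen, while the charge-separated drawing needs formal charges, so neither can be reached by a generator restricted to neutral atoms and fixed valences.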

2. DATA CLEANING

There is almost no mention of the data cleaning and SMILES standardization steps in the manuscript. The only information that the reader can get from the fuzzily named section “1.3. Computational detail” is the previously discussed removal of nitro-group-containing molecules and of stereochemistry. However, it is not clear whether there was any other data cleaning or SMILES standardization step, which is especially important when combining data from different sources. Also, it is not clear how the authors dealt with the tautomer problem, which can be crucial for this kind of research.

> The SMILES standardization procedure was discussed in the EvoMol article, but we agree that it is an important point here. The procedure and its impact on the datasets are therefore discussed in the DATA section.

REVISION CHECKLIST FOR AUTHORS

I can recommend this article for publication only after these minor revisions:

1. A separate “Data” section should be added to the corrected version of the manuscript. There, all datasets used (and their separate subsets) should be discussed, with the exact number of molecules in each provided. A detailed description of data cleaning and standardization is also needed. A statement concerning the treatment of tautomers will be appreciated.

> We hope that the new DATA section meets the expectations of the reviewer.

2. Another file containing the cycle features whitelist should be added online as claimed in the manuscript (page 5: “lists of connectivity and cycle features used to define the realistic chemical spaces are also available” ). On the figshare there is only “ChEMBL and ZINC ECFP dictionnaries for whitelisting” JSON file and no mention of cycle features whitelist.

> Indeed. Sorry for the oversight. The cyclic features lists are now available on Figshare.

3. Modification of the file names and their descriptions on the figshare page with respect to how corresponding results were described in the article will be appreciated. For example, it is a bit confusing that authors mention “sillywalk scores” on the figshare, even though such a term was never used in the manuscript itself.

> Indeed, at first we used "silly walks", the original name chosen by Pat Walters in homage to Monty Python. During the writing process, however, we preferred a less poetic but more precise name. We have now adapted the texts on figshare.

4. I would recommend being more precise with numerical values (number of compounds in the library or number of connectivity features) by providing the exact number instead of possibly misleading vague approximations. For example, in Conclusions, it was mentioned that “nearly 1 000 000” connectivity features were generated, while in JSON there are 1 156 416 records.

> The exact numbers are now specified in the methodology section and in the conclusion. We hope that the changes made to this section meet the expectations of the referee and will allow future readers to better understand the workflow.
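
> One way to obtain such exact counts directly from a feature dictionary is sketched below. The inline JSON string is a made-up stand-in, and the sketch assumes (without having checked the deposited file) that the top-level JSON object holds one key per feature; `format_count` is a hypothetical helper.

```python
import json


def format_count(n):
    """Format an integer with space-separated digit groups, e.g. 1 156 416."""
    return f"{n:,}".replace(",", " ")


# Made-up stand-in for a feature dictionary: one key per feature (assumption).
example_json = '{"feature_a": 12, "feature_b": 3, "feature_c": 1}'

features = json.loads(example_json)
n_features = len(features)

print(format_count(n_features))
print(format_count(1156416))
```

> Reporting the value returned by `len` rather than a rounded figure avoids the kind of discrepancy noted by the referee.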

Referee: 2

Comments to the Author
In my opinion the following issues should be addressed before the paper is suitable for publication

1. It is not clear to me exactly what the nature of the filters is. Are they SMARTS strings (like Walters filters) or something else?

> Dear Prof. Jensen, our article was not clear enough. P. Walters has proposed several filters. The filter named "rd_filters" is based on SMARTS strings that act as a blacklist of features that are not valid (https://github.com/PatWalters/rd_filters). The filter that inspired us is named "silly_walks". It is based on ECFP features, which are used as a whitelist of valid features (https://github.com/PatWalters/silly_walks). Our work is based on the latter.
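
> The difference between the two filtering logics can be sketched as follows. This is a minimal illustration that assumes each molecule has already been reduced to a collection of feature labels; the labels and function names are hypothetical and come from neither rd_filters nor silly_walks.

```python
def passes_blacklist(features, blacklist):
    """Reject a molecule if it contains any forbidden feature."""
    return not (set(features) & set(blacklist))


def passes_whitelist(features, whitelist):
    """Accept a molecule only if every one of its features is already known."""
    return set(features) <= set(whitelist)


# Hypothetical labels standing in for substructure or fingerprint identifiers.
blacklist = {"peroxide"}
whitelist = {"amide", "phenyl", "ether"}

known_molecule = ["amide", "phenyl"]
novel_molecule = ["amide", "strained_cage"]

print(passes_blacklist(novel_molecule, blacklist))  # True: nothing forbidden
print(passes_whitelist(novel_molecule, whitelist))  # False: one unseen feature
```

> A blacklist lets through anything that nobody thought to forbid, whereas a whitelist rejects anything that has not been observed in the reference data.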


2. How much CPU time is required to filter a molecule?

> At the end of subsection 1.2 (whitelist definitions), we have added a paragraph that indicates the computational cost of these evaluations.
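
> A per-molecule CPU cost of this kind can be estimated along the following lines. This is a sketch only: `toy_filter` is a hypothetical stand-in working on precomputed feature labels, not the filter from the manuscript, and the measured value says nothing about the actual cost reported there.

```python
import time


def mean_cpu_seconds_per_item(filter_fn, items, repeats=3):
    """Return the best-of-`repeats` mean CPU time spent per item."""
    best = float("inf")
    for _ in range(repeats):
        start = time.process_time()  # CPU time of the current process
        for item in items:
            filter_fn(item)
        elapsed = time.process_time() - start
        best = min(best, elapsed / len(items))
    return best


# Hypothetical stand-in filter: subset test on precomputed feature labels.
whitelist = {"a", "b", "c"}


def toy_filter(features):
    return set(features) <= whitelist


molecules = [["a", "b"], ["a", "z"]] * 1000
per_molecule = mean_cpu_seconds_per_item(toy_filter, molecules)
print(f"{per_molecule:.2e} CPU seconds per molecule")
```

> Taking the best of several repeats reduces the influence of unrelated system activity on the estimate.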

3. If I understand correctly, the 9 molecules in Figure 7 that are marked with red have passed the cyclic feature filter. However, 5 of the 9 are macrocycles that don’t look synthesizable to me. A discussion of this should be added; preferably with the nearest analog in ChEMBL or ZINC to demonstrate synthesizability.

> Thank you for this comment. We have added a discussion paragraph about them justifying their presence in the authorized cycles.

Jan Jensen (I choose to review this paper non-anonymously)


Referee: 3

Comments to the Author
In this work, Cauchy, Da Mota, and coworkers developed an approach for filtering out unrealistic molecules based on their connectivity and cyclic features. As the authors stated in their cover letter, using connectivity features (i.e., fingerprints) as molecular filters is not a new concept, and multiple scores have already been created to evaluate synthetic feasibility (e.g., SAscore, RAscore). In my opinion, the paper lacks sufficient novelty and does not present a compelling argument demonstrating that the method proposed is more convenient and effective than other alternatives. For these reasons, I do not believe that this paper is suitable for publication in Digital Discovery.

> We tried in the introduction to show that the current scores, being based on average similarity to datasets, are useful for estimating an average realism. However, they cannot discriminate more subtle cases where the non-existence of the molecule is linked to a specific unstable chemical environment (see the example in the introduction). The method presented in this article proposes a solution to this problem. It furthermore shows that filtering based only on ECFP does not treat cyclic properties correctly. The definition of a new generic cyclic descriptor is a contribution that seems particularly important to us. We have modified the abstract to highlight this difference explicitly. We are sorry if the referee was not convinced by our arguments.
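
> The contrast between an averaged score and a strict whitelist can be made concrete with a toy example. The score below is a deliberately simplified average over made-up feature frequencies; it is not the SAscore, the RAscore, or any published score, and all names and values are hypothetical.

```python
def average_score(features, frequency):
    """Mean of per-feature frequencies; unseen features contribute 0."""
    return sum(frequency.get(f, 0.0) for f in features) / len(features)


def passes_whitelist(features, frequency):
    """Strict test: every feature must have been seen in the reference data."""
    return all(f in frequency for f in features)


# Made-up frequencies of three features in a reference dataset.
frequency = {"f1": 0.9, "f2": 0.8, "f3": 0.85}

# Nine common features plus a single environment never seen in the reference.
mostly_common = ["f1", "f2", "f3"] * 3 + ["unseen"]

score = average_score(mostly_common, frequency)
accepted = passes_whitelist(mostly_common, frequency)

print(round(score, 3))  # 0.765: the average barely registers the outlier
print(accepted)         # False: the whitelist rejects the molecule
```

> In this toy setting the single unseen environment is diluted by the nine common ones in the average, while the whitelist test fails on it immediately.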

Referee: 4

Comments to the Author
In this manuscript, the authors introduced an extra checklist of fingerprints extracted from ChEMBL and ZINC. This additional checklist contains general cyclic substructures and complements the ECFP4 fingerprints extracted from the same databases. These fragments, as substructures found in purchasable compounds, could be used to examine the synthesizability of computer-generated molecules, and thus narrow down the chemical scope of generative models. The proposed methods could serve as a ready-to-use filter for molecule discovery. I recommend publishing this work in Digital Discovery subject to the following improvements/concerns.

1. The authors extracted fingerprints and GCF from the ChEMBL and ZINC databases. Those templates should be suitable for bio-molecule discovery. However, the authors also apply such templates to material discovery tasks like HOMO/LUMO optimization. Could the authors comment on whether such templates will filter out reasonable molecules with substructures not found in a bio-related molecule?

> We have described the datasets in more detail in the methodology section. Presence in the ZINC dataset is not tied to bioactivity, since ZINC includes catalogs from general manufacturers. As stated in the introduction, an iconic molecule for molecular materials such as tetrathiafulvalene is not present in ChEMBL but can be found in the ZINC dataset.

2. The authors argued that ECFP4 fingerprints are not sufficient for cycles. Generative models may take advantage and generate molecules that still pass filtration. Would ECFP6 fingerprints be able to filter out molecules that pass filter 1 but fail the GCF filter? Could the authors also explain why they choose the ECFP4 fingerprint instead of the ECFP6?

> The ECFP6 descriptor is now discussed in the article.

3. Could the authors give some details on computing resources and the running time of applying all proposed filters?

> At the end of subsection 1.2 (whitelist definitions), we have added a paragraph that indicates the computational cost of these evaluations.

4. Some property labels under molecules in the pictures are too small to read. I found some parts of the manuscript hard to follow. The authors may want to polish the wording and reorganize some sections before publication. For example, I would suggest swapping sections 2.1 and 2.2 so that the results part starts from the data structure of the filtration lists, followed by applications in drug and material discovery.

> We are sorry if the data are not always easy to read. The figures showing lists of molecules are generated with the RDKit MolsToGridImage function. We tried unsuccessfully to change the size of the labels in these figures. The only solution we can propose for this article is for the referee to tell us precisely which figure is difficult to read, so that we can manually rewrite the corresponding labels. We have done so for the two figures representing the most common cyclic features. Another option would be to switch the figures to double-column width. We have also made a few tweaks to the text, hoping that it will be easier to read.

> We would like to keep section 2.1 before the exploration because we think that it clearly illustrates the importance of cyclic features, which are then used in the chemical space definitions. We have therefore renamed the section to emphasize its role.

> Kindly,
> Thomas Cauchy




Round 2

Revised manuscript submitted on 02 Mar 2023
 

22-Mar-2023

Dear Dr Cauchy:

Manuscript ID: DD-ART-09-2022-000092.R1
TITLE: Definition and exploration of realistic chemical spaces using the connectivity and cyclic features of ChEMBL and ZINC.

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

After careful evaluation of your manuscript and the reviewers’ reports, I will be pleased to accept your manuscript for publication after revisions.

Please revise your manuscript to fully address the reviewers’ comments. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Please submit your revised manuscript as soon as possible using this link :

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log in to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

I look forward to receiving your revised manuscript.

Yours sincerely,
Linda Hung
Associate Editor
Digital Discovery
Royal Society of Chemistry

************


 
Reviewer 2

The authors have addressed my concerns.

Reviewer 3

Personally, I do not believe that the article has demonstrated enough importance to warrant publication in Digital Discovery. However, after seeing the comments from the other reviewers, I acknowledge that my opinion might not be shared by other researchers.

Regarding the technical part, I do not have any major issues, and I think the authors have successfully addressed the comments from the referees. The only point that I find still confusing is the issue regarding nitro groups in EvoMol raised by referee 1. In the text, the authors state that "The generator does not generate radicals and charged atoms. Ions and zwitterions have therefore been left out of the datasets during the chemical space comparison part. In this representation, the nitro function is considered as a zwitterion" (p.4) and "It is important here to recall that EvoMol does not currently handle cases with formal charges, which unfortunately excludes nitro compounds" (p. 5). I believe referee 1 was referring to the biases in datasets generated by EvoMol (i.e., section 2.3, Figs 11, 12, and 13). For this reason, I think the authors should address this point as mentioned by referee 1 (i.e., either using a transformation rule or providing the number of lost molecules).

Rather than technical issues in the protocols presented, the main reason for my decision to ask for a major revision is that the manuscript does not present a compelling argument demonstrating the advantages of the proposed method compared to other alternatives. For example, multiple scores have already been created to evaluate synthetic feasibility (e.g., SAscore, RAscore). What are the advantages of the new method compared to other filters? In Figure 10 and Table 5, the authors showed RAscore/SAscore distributions of compounds passing the proposed new filter. However, in the practical case shown in Figure 13, the authors did not mention whether applying RAscore/SAscore or similar filters leads to similar results (i.e., are the five lowest LUMOs similar when applying RAscore/SAscore filters and the proposed new filter?). I think this point is especially relevant and should be addressed since, as the authors stated in their cover letter, using connectivity features as molecular filters is not a new concept.

Reviewer 4

The authors addressed all of my concerns and the manuscript is now suitable for publication.


 

REVIEWER REPORT(S):

Referee: 3

Comments to the Author
Personally, I do not believe that the article has demonstrated enough importance to warrant publication in Digital Discovery. However, after seeing the comments from the other reviewers, I acknowledge that my opinion might not be shared by other researchers.

Regarding the technical part, I do not have any major issues, and I think the authors have successfully addressed the comments from the referees. The only point that I find still confusing is the issue regarding nitro groups in EvoMol raised by referee 1. In the text, the authors state that "The generator does not generate radicals and charged atoms. Ions and zwitterions have therefore been left out of the datasets during the chemical space comparison part. In this representation, the nitro function is considered as a zwitterion" (p.4) and "It is important here to recall that EvoMol does not currently handle cases with formal charges, which unfortunately excludes nitro compounds" (p. 5). I believe referee 1 was referring to the biases in datasets generated by EvoMol (i.e., section 2.3, Figs 11, 12, and 13). For this reason, I think the authors should address this point as mentioned by referee 1 (i.e., either using a transformation rule or providing the number of lost molecules).

> Dear referee, the number of molecules lost from the catalogues is provided. However, Figs 11, 12 and 13 mentioned by the referee correspond to molecules generated by EvoMol. We cannot provide a number of discarded molecules, since at no step of the generation can the program mutate the RDKit graph object to produce a nitro group. As stated in section 1.2: "In EvoMol, hydrogen atoms are treated implicitly and the bond orders are integers. The valence of the atoms is used as a reference to place the hydrogen atoms. The generator does not generate radicals and charged atoms." The transformation rule cannot be used since the RDKit graph object is not based on a SMARTS notation. This part of the chemical space cannot be reached by the program in its current state. However, the whitelisting method presented here is not limited to EvoMol. Any other generative program can use the provided ECFP4 and GCF lists to explore its own realistic chemical space.

Rather than technical issues in the protocols presented, the main reason for my decision to ask for a major revision is that the manuscript does not present a compelling argument demonstrating the advantages of the proposed method compared to other alternatives. For example, multiple scores have already been created to evaluate synthetic feasibility (e.g., SAscore, RAscore). What are the advantages of the new method compared to other filters? In Figure 10 and Table 5, the authors showed RAscore/SAscore distributions of compounds passing the proposed new filter. However, in the practical case shown in Figure 13, the authors did not mention whether applying RAscore/SAscore or similar filters leads to similar results (i.e., are the five lowest LUMOs similar when applying RAscore/SAscore filters and the proposed new filter?). I think this point is especially relevant and should be addressed since, as the authors stated in their cover letter, using connectivity features as molecular filters is not a new concept.

> We do agree that it is a relevant point. We have added the SAscore, CLscore, and RAscore to Figures 1, 2, 3 and 12. The limitations of these scores as realistic filters are now discussed more thoroughly in the introduction. We hope that this will help to clarify our motivations.
The SAscore and CLscore cannot be used as a filter because there is no obvious threshold, as noted in the introduction. Since the RAscore is found to be 0.0 (no synthetic route) for a molecule used in molecular electronic applications (see Figure 2), we know that its training set does not include all the chemistry expected for this kind of problem. Furthermore, the scores of the compounds in Figure 12, and the corresponding added discussion, show that none of these scores can be used directly as a filter.

Cordially




Round 3

Revised manuscript submitted on 30 Mar 2023
 

03-Apr-2023

Dear Dr Cauchy:

Manuscript ID: DD-ART-09-2022-000092.R2
TITLE: Definition and exploration of realistic chemical spaces using the connectivity and cyclic features of ChEMBL and ZINC.

Thank you for submitting your revised manuscript to Digital Discovery. I am pleased to accept your manuscript for publication in its current form. I have copied any final comments from the reviewer(s) below.

You will shortly receive a separate email from us requesting you to submit a licence to publish for your article, so that we can proceed with the preparation and publication of your manuscript.

You can highlight your article and the work of your group on the back cover of Digital Discovery. If you are interested in this opportunity please contact the editorial office for more information.

Promote your research, accelerate its impact – find out more about our article promotion services here: https://rsc.li/promoteyourresearch.

If you would like us to promote your article on our Twitter account @digital_rsc please fill out this form: https://form.jotform.com/213544038469056.

We are offering all corresponding authors on publications in gold open access RSC journals who are not already members of the Royal Society of Chemistry one year’s Affiliate membership. If you would like to find out more please email membership@rsc.org, including the promo code OA100 in your message. Learn all about our member benefits at https://www.rsc.org/membership-and-community/join/#benefit

By publishing your article in Digital Discovery, you are supporting the Royal Society of Chemistry to help the chemical science community make the world a better place.

With best wishes,

Linda Hung
Associate Editor
Digital Discovery
Royal Society of Chemistry


 
Reviewer 3

I believe the authors have addressed all my comments. Their demonstration of how most common scores fail to filter compounds from Figure 12 was particularly useful. I congratulate all the authors on their work.




Transparent peer review

To support increased transparency, we offer authors the option to publish the peer review history alongside their article. Reviewers are anonymous unless they choose to sign their report.

We are currently unable to show comments or responses that were provided as attachments. If the peer review history indicates that attachments are available, or if you find there is review content missing, you can request the full review record from our Publishing customer services team at RSC1@rsc.org.


Content on this page is licensed under a Creative Commons Attribution 4.0 International license.