From the journal Digital Discovery
Peer review history

Automated patent extraction powers generative modeling in focused chemical spaces

Round 1

Manuscript submitted on 14 Mar 2023
 

12-Apr-2023

Dear Dr Gomez-Bombarelli:

Manuscript ID: DD-ART-03-2023-000041
TITLE: Automated patent extraction powers generative modeling in focused chemical spaces

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

I have carefully evaluated your manuscript and the reviewers’ reports, and the reports indicate that major revisions are necessary. Please pay particular attention to both the technical and data/code review aspects; regarding the latter it appears that a substantial reorganization effort is needed to make the work conform to our data/code availability standards.

Please submit a revised manuscript which addresses all of the reviewers’ comments. Further peer review of your revised manuscript may be needed. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy, https://credit.niso.org/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link:

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log on to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process.   We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Dr Joshua Schrier
Associate Editor, Digital Discovery

************


 
Reviewer 1

The authors developed a framework to automatically extract molecular structures from the USPTO patent repository based on user-defined keyword searches and use the molecules for generative chemistry.
The workflow is of interest to the technically minded reader; my main concern is that not all components of the workflow are open source and reusable by the interested reader.

How do the structures extracted with the workflow compare to the structures in SureChEMBL?

Is it only the exemplified compounds that are extracted from the patents, or are other types of molecules, such as reactants, also extracted and therefore included in the modelling?

The main value of the paper is the pipeline, so it is unfortunate that parts of it are proprietary. Is there any possibility of creating a fully open-source pipeline?

It is not best scientific practice to skip downloading the SMILES for certain years because of additional challenges. How can the authors be certain that the results are not impacted by the missing SMILES?

The authors write "Any Markush structures in the dataset were filled in with ethyl groups." Does that mean that the modelling was done on structures that were not part of the patents?

Since the datasets are small, the results might be improved with data augmentation methods such as randomizing the strings representing the molecules.

Reviewer 2

The authors present a pipeline to train generative models on focused subsets of chemical space as obtained by filtering information from US patents. The authors motivate their choice by stating that patents remain relatively untapped while containing useful data.

Before starting with comments about the content of the manuscript, I would like to ask the authors to avoid using a tiny font size with no line spacing, which makes the reviewing process very annoying, as it is hard to read and hard to annotate.

Major comments:
- I think the paper lacks a crucial baseline for the part about property optimization, where one starts from a generative model pretrained on Zinc/Chembl, and just optimizes the two given properties. For this baseline, for the OPD use case, the patent compounds would be useful solely for training the oracle; the TKI compounds would not be necessary at all. Should this baseline achieve results similar to those of the models introduced by the authors, this would contradict one of the main statements of the manuscript.
- Unfortunately, it seems that neither of the two use cases ends up being very convincing. The property optimization for TKI is basically only a similarity optimization - the fact that it is about kinase inhibitors is irrelevant. The property optimization for OPD suffers from the concerns also noted by the authors about adversarial generation of compounds. Hence, the statement "high-performing molecular structures" in the conclusion does not convince me.

Other comments:
- I find that the context in which the figures are first mentioned, and what they contain, makes them difficult to understand. When first mentioned in 3.1 and looking at the figures, one sees "post-hoc filter" but has no idea what it is about; it is only explained in 3.2.1, and only for the JTVAE case. It may be worth splitting the figure; see also the comment below.
- I find the presentation with the "post-hoc filter" confusing and biased. Why not call it simply "top 20%"? Even then, the comparison with the patent dataset distribution is unfair; the correct comparison would be to look at the "top 20%" from the patents as well.
- It is unclear how many compounds were sampled for Table 1, Figure 3, Figure 4, Figure 5. Most critical is Table 1: do I guess correctly that only 100 compounds were sampled? If so, this may lead to an over-optimistic estimate of the novelty: the GuacaMol paper relied on many more samples.
- When selecting the compounds from the patents after filtering by keyword: how do you make sure that the structures represent the actual compounds of interest (final products) and that for instance reagents or intermediates are not included?
- What is the rationale behind filling in the Markush structures with ethyl groups? Is this an adequate thing to do? How many compounds does this represent compared to the total number?
- In 2.4.2: I would have loved to see comparisons with more models. It should be relatively easy to use the implementations in MOSES / GuacaMol, no?


Minor comments:
- In 2.2, there are two "empty citation"s.
- 2.3.2: It is not clear how the similarity is computed when starting from the Morgan fingerprints.
- Why use different fingerprints in 2.3.2 and in 2.4.1?
- In 2.4.1: I am not sure whether the comparison between ZINC-trained and TKI-trained models is very relevant: it is expected that they will lead to different distributions of compounds.
- I am not convinced by the statement that "[transformers] require large amounts of training data" (compared to RNNs). Can you elaborate?

Comments on the attached "data reviewer checklist":
- General comment: The authors say that the forks of the code are available in Zenodo. This is maybe true but hard to check: since the ZIP is 4 GB and takes 3 hours to download, I don't think this counts as "accessible". I would suggest splitting the model data (big / slow to download) from the code (should be small / fast to download) to make it easier for reviewers and readers to access the code easily.
- The instructions to set up the conda environment for PatentChem did not work for me. I had to use "conda env create --file environment.yml" instead. I couldn't test all the code, because the first step would have taken at least 5 hours for one single zip, and the next steps depend on it.
- 2a: "No" because no code is provided for the data processing in steps 4 and 5. Section 2.3 may give enough context for step 5, but not for step 4.
- 4a: can't say because of the download issue mentioned above. Not clear if it contains the updates to the JTVAE/REINVENT code. Not clear if it contains the code for the property predictors.
- 4b/4c: lacking the baselines mentioned in the major comments above
- 5: The splitting procedure is not clear for the property predictor
- 6: it looks like the code contains the individual components (download from patents, extract SMILES, de novo models), but not the scripts to obtain the results shown in the paper (metrics, graphs, etc.).


 

REVIEWER REPORT(S):
Referee: 1

Comments to the Author
The authors developed a framework to automatically extract molecular structures from the USPTO patent repository based on user-defined keyword searches and use the molecules for generative chemistry.
The workflow is of interest to the technically minded reader; my main concern is that not all components of the workflow are open source and reusable by the interested reader.

How do the structures extracted with the workflow compare to the structures in SureChEMBL?

Our extracted structures are a subset of those in SureChEMBL. SureChEMBL extracts molecules from text (1976-2023), images (2007-2023), and MOL files, while we only get them from MOL files (2001-2023). SureChEMBL covers patent applications and patents granted by authorities other than the USPTO, while we only search USPTO granted patents.

Despite its more comprehensive coverage, SureChEMBL is less straightforward to use for querying data in bulk. SureChEMBL provides options for extracting data in bulk (via their FTP server), but these downloadable bulk files only contain chemical structures and maps to patents, which are not sufficient for doing keyword queries in bulk without the setup of a database (as required by the SureChEMBL data client: https://github.com/chembl/surechembl-data-client). Our workflow automates keyword queries with a simple and user-friendly approach, without requiring the setup of a local database. Since automation and ease of use are important for our pipeline, this outweighs the fact that our patent set is less comprehensive.

It is also worth noting that SureChEMBL’s web interface no longer appears to be reliable. Their website (https://www.surechembl.org/search/) is up, but searching doesn’t appear to be working (it always times out when we try). The most recent tweets from their official Twitter account (https://twitter.com/SureChEMBL), in January 2022 and April 2022, advise that SureChEMBL is unavailable pending updates, though files on their FTP server (https://ftp.ebi.ac.uk/pub/databases/chembl/SureChEMBL/data/) were updated as recently as February 2023. The community would benefit from the dissemination of a patent extraction and query approach that is not dependent on the maintenance of one organization’s cloud infrastructure.

Is it only the exemplified compounds that are extracted from the patents, or are other types of molecules, such as reactants, also extracted and therefore included in the modelling?

[Response duplicated for similar question by reviewer 2] The goal of our pipeline is to generate structures with limited domain knowledge beyond keywords, so we kept preprocessing to a minimum except for constraints that allowed for better computational tractability and basic filters on molecular mass. For example, we applied the 1000 g/mol maximum cutoff for the OPD dataset because JT-VAE has a sequential decoding process that enumerates combinations of fragment pairs (very slow for large molecules). There are certainly some structures in our training datasets that are not domain-relevant (such as reagents or intermediates, as you note). However, the “false positives” (molecules that the model generates because it thinks they are relevant, when in reality they are not relevant) that come from this can be easily filtered out by the property labeling step. Just as a user can choose their own property-labeling method appropriate for their design task when using our code, they could also insert additional domain-knowledge-based preprocessing of the training dataset. Our current work demonstrates that the approach can still be useful even without this preprocessing, but additional filtering may improve results in some domains. We have provided some options for possible filters in our PatentChem code, such as minimum and maximum molecular weight and charged/neutral molecules. We have added more detail about these filters to section S2 of the supplementary information.
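For illustration, the sketch below shows what this kind of post-extraction filter can look like in RDKit; the exact option names and defaults in PatentChem may differ, and raw_smiles is a hypothetical input list.

from rdkit import Chem
from rdkit.Chem import Descriptors

def keep_molecule(smiles, max_mw=1000.0, neutral_only=True):
    # Return True if the molecule passes the basic filters described above.
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:                      # drop entries that fail to parse
        return False
    if Descriptors.MolWt(mol) > max_mw:  # e.g. the 1000 g/mol cap used for the OPD set
        return False
    if neutral_only and Chem.GetFormalCharge(mol) != 0:
        return False
    return True

raw_smiles = ["CCO", "c1ccccc1", "[NH4+]"]  # hypothetical extracted SMILES
filtered = [s for s in raw_smiles if keep_molecule(s)]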

The main value of the paper is the pipeline, so it is unfortunate that parts of it are proprietary. Is there any possibility of creating a fully open-source pipeline?

[Similar answer to question 2a in the data reviewer checklist] The internal database is simply a form of data storage that we use; it can be replaced by CSV files or other file types in a local directory, and all calculations can be reproduced by following the steps described in section 2.3.1. However, in case readers are interested in a similar form of data storage, an open-source database framework very similar to the one we use can be found at https://github.com/mkite-group/. We have added a line to section 2.1 to clarify this.
We agree that the main value of the paper is the general pipeline to go from patents to generative modeling, but the type of property labeling depends on the user’s domain of interest. For the OPD application demonstrated in the paper, we did property labeling using the ORCA quantum chemistry package, which is freely available for academic use. All instructions to use this software are listed in section 2.3.1. Users wanting to apply our pipeline to their own applications would need to supply their own property-labeling approach.

It is not best scientific practice to skip downloading the SMILES for certain years because of additional challenges. How can the authors be certain that the results are not impacted by the missing SMILES?

Our goal is to find a general region of domain-relevant chemical space rather than to extract a comprehensive set of domain-relevant molecules. As we demonstrate, this is doable even while omitting all patent years prior to 2001 (since they are not machine readable). For the same reason, omitting a subset of years after 2001 does not invalidate the approach either. That being said, we’ve now adapted the code to work with the patents from 2001-2004 and 2009 for the sake of completeness. In doing so, we found that some chemical patents from 2008 and 2010 were also missed in our original extraction because they share a file structure similar to that of the 2009 patents. We’ve added a note to the SI about these initial extraction problems, since they impacted the initial training datasets for our generative models. All future users of our code will be able to avoid these problems when doing their own queries.

The authors write "Any Markush structures in the dataset were filled in with ethyl groups." Does that mean that the modelling was done on structures that were not part of the patents?

[Response duplicated for similar question by reviewer 2] The Markush structures are part of the dataset extracted from patents but aren’t complete molecules on their own. Some patent MOL files provide a core structure and describe several different functionalizations of that structure in the text of the patent. Since this data is not stored in a structured way, extracting the correct substituents for each Markush structure would require sophisticated natural language processing techniques and is far outside the scope of this work. The cores of Markush structures are often what plays the largest role in determining a property; different substituents or functionalizations may shift the property. Only 17% of the OPD dataset and 11% of the TKI dataset are made up of substituted Markush structures. Since the goal of this patent extraction process is to bootstrap a region of chemical space based on relevance to keyword queries, rather than to obtain a specific set of molecules, this is acceptable. For some applications, users may wish to modify our code to use a substitution other than ethyl groups. We’ve included a note about this in the README. We’ve added a note in section 2.2 about the proportion of the dataset that includes Markush structures.
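As a rough illustration of the substitution step, the RDKit sketch below fills attachment points with ethyl groups; it assumes the R-groups appear as dummy atoms ("*") in the extracted structures, which may not match the exact representation our code handles.

from rdkit import Chem

def fill_markush_with_ethyl(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    dummy = Chem.MolFromSmarts('[#0]')  # matches any dummy/R-group atom
    ethyl = Chem.MolFromSmiles('CC')    # attached through its first carbon
    filled = Chem.ReplaceSubstructs(mol, dummy, ethyl, replaceAll=True)[0]
    Chem.SanitizeMol(filled)
    return Chem.MolToSmiles(filled)

print(fill_markush_with_ethyl('*c1ccccc1'))  # -> CCc1ccccc1 (ethylbenzene)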

Since the datasets are small, the results might be improved with data augmentation methods such as randomizing the strings representing the molecules.

We agree that data augmentation is a good suggestion for improving the reconstruction performance of the explored generative models. However, the main bottleneck for property optimization was the property predictors, whose performance is governed by the difficulty of the task and by the quality and quantity of property labels, which are harder to expand with data augmentation techniques. The reconstruction performance of all tested models was good, as seen from the sampled molecules shown in the SI and the GuacaMol benchmarks shown in Table 1.
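For reference, the augmentation the reviewer suggests can be done in a few lines with RDKit's randomized SMILES output; the sketch below is illustrative, not part of our pipeline.

from rdkit import Chem

def randomized_smiles(smiles, n=10):
    # Generate up to n distinct non-canonical SMILES for the same molecule.
    mol = Chem.MolFromSmiles(smiles)
    # doRandom=True picks a random atom ordering on each call
    return {Chem.MolToSmiles(mol, doRandom=True) for _ in range(n)}

print(randomized_smiles('CC(=O)Oc1ccccc1C(=O)O'))  # aspirin, several string variants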

Referee: 2

Comments to the Author
The authors present a pipeline to train generative models on focused subsets of chemical space as obtained by filtering information from US patents. The authors motivate their choice by stating that patents remain relatively untapped while containing useful data.

Before starting with comments about the content of the manuscript, I would like to ask the authors to avoid using a tiny font size with no line spacing, which makes the reviewing process very annoying, as it is hard to read and hard to annotate.

Major comments:
- I think the paper lacks a crucial baseline for the part about property optimization, where one starts from a generative model pretrained on Zinc/Chembl, and just optimizes the two given properties. For this baseline, for the OPD use case, the patent compounds would be useful solely for training the oracle; the TKI compounds would not be necessary at all. Should this baseline achieve results similar to those of the models introduced by the authors, this would contradict one of the main statements of the manuscript.

Thanks very much for the suggestion; this is indeed a good baseline. We performed experiments by training a JTVAE model on ZINC and applying post-hoc filters to optimize for the OPD and TKI objectives. These results have been added to the paper.

- Unfortunately, it seems that neither of the two use cases ends up being very convincing. The property optimization for TKI is basically only a similarity optimization - the fact that it is about kinase inhibitors is irrelevant. The property optimization for OPD suffers from the concerns also noted by the authors about adversarial generation of compounds. Hence, the statement "high-performing molecular structures" in the conclusion does not convince me.

Thanks for the comment. We would like to clarify that our statements about “high-performing” molecules are in reference to the training data. As we stated in the introduction, “[patented molecules] are likely to be high-performance since they merited the investment of a patent application”. Our results demonstrate that our models are capable of reproducing the property distributions of the training data. Additionally, our added baseline on ZINC and the cross-optimization on OPD and TKI targets demonstrate that domain-focused training enforces structural priors that steer generation towards more optimal properties than training on a generic dataset.

Other comments:
- I find that the context in which the figures are first mentioned, and what they contain, makes them difficult to understand. When first mentioned in 3.1 and looking at the figures, one sees "post-hoc filter" but has no idea what it is about; it is only explained in 3.2.1, and only for the JTVAE case. It may be worth splitting the figure; see also the comment below.

Thanks for the comment. We agree that the presentation was a bit confusing. We have re-categorized the figures into one for distribution learning and another for property optimization, which we believe resolves the confusion around the post-hoc filter.

- I find the presentation with the "post-hoc filter" confusing and biased. Why not call it simply "top 20%"? Even then, the comparison with the patent dataset distribution is unfair; the correct comparison would be to look at the "top 20%" from the patents as well.

The goal in the property optimization tasks is to generate novel compounds whose properties are shifted towards more optimal values than those of the dataset “seen” by the model during training. We therefore believe it is fair to compare with the full training dataset rather than a fraction of it, since we make use of the entire dataset for training rather than just 20%. We have, however, replotted all histograms with only novel samples, since it is unfair to include training data points that might have been sampled.
The post-hoc filter is (1) a filter based on the predictions of an approximate neural network predictor when the property is expensive to obtain, as in the OPD case, or (2) simply the top 20% of sampled molecules based on the oracle property when it is cheaply available, as in the TKI case.
We believe that the term ‘post-hoc filter’ serves as an umbrella term for both these scenarios, while the term ‘top 20%’ suggests only the latter case.
We have added a subsection describing in more detail what the term ‘post-hoc filter’ means, which reduces the ambiguity.
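To make the definition concrete, a schematic version of the post-hoc filter is sketched below; score_fn is a stand-in for either the neural network surrogate (OPD) or the cheap oracle (TKI), and is not part of our released code.

import numpy as np

def posthoc_filter(smiles_list, score_fn, keep_frac=0.2):
    # Keep the top keep_frac of sampled molecules by (predicted) property.
    scores = np.array([score_fn(s) for s in smiles_list])
    k = max(1, int(keep_frac * len(smiles_list)))
    top = np.argsort(scores)[::-1][:k]  # indices of the highest-scoring samples
    return [smiles_list[i] for i in top]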

- It is unclear how many compounds were sampled for Table 1, Figure 3, Figure 4, Figure 5. Most critical is Table 1: do I guess correctly that only 100 compounds were sampled? If so, this may lead to an over-optimistic estimate of the novelty: the GuacaMol paper relied on many more samples.

Thanks for pointing this out. The number of sampled molecules was 1000 for all GuacaMol benchmarks. We have clarified the number in the caption of Table 1. The histogram figures were plotted from the properties of between 600 and 1000 molecules, depending on the computational budget and the convergence of the property labeling and sampling steps. We believe that this is a sufficiently large number for our qualitative interpretations to be statistically meaningful. All sampled molecular data and plotting code are open-sourced with the paper for reproducibility.
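For context, novelty in GuacaMol-style evaluations is essentially the fraction of valid, canonicalized samples absent from the training set, as in the sketch below; with small sample counts this fraction carries a larger statistical error, which was the reviewer's concern.

from rdkit import Chem

def canonical(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None

def novelty(sampled, training):
    train = {canonical(s) for s in training} - {None}
    valid = [c for c in (canonical(s) for s in sampled) if c is not None]
    # fraction of valid samples not seen during training
    return sum(c not in train for c in valid) / len(valid)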

- When selecting the compounds from the patents after filtering by keyword: how do you make sure that the structures represent the actual compounds of interest (final products) and that for instance reagents or intermediates are not included?

[Response duplicated for similar question by reviewer 1] The goal of our pipeline is to generate structures with limited domain knowledge beyond keywords, so we kept preprocessing to a minimum except for constraints that allowed for better computational tractability and basic filters on molecular mass. For example, we applied the 1000 g/mol maximum cutoff for the OPD dataset because JT-VAE has a sequential decoding process that enumerates combinations of fragment pairs (very slow for large molecules). There are certainly some structures in our training datasets that are not domain-relevant (such as reagents or intermediates, as you note). However, the “false positives” (molecules that the model generates because it thinks they are relevant, when in reality they are not relevant) that come from this can be easily filtered out by the property labeling step. Just as a user can choose their own property-labeling method appropriate for their design task when using our code, they could also insert additional domain-knowledge-based preprocessing of the training dataset. Our current work demonstrates that the approach can still be useful even without this preprocessing, but additional filtering may improve results in some domains. We have provided some options for possible filters in our PatentChem code, such as minimum and maximum molecular weight and charged/neutral molecules. We have added more detail about these filters to section S2 of the supplementary information.

- What is the rationale behind filling in the Markush structures with ethyl groups? Is this an adequate thing to do? How many compounds does this represent compared to the total number?

[Response duplicated for similar question by reviewer 1] The Markush structures are part of the dataset extracted from patents but aren’t complete molecules on their own. Some patent MOL files provide a core structure and describe several different functionalizations of that structure in the text of the patent. Since this data is not stored in a structured way, extracting the correct substituents for each Markush structure would require sophisticated natural language processing techniques and is far outside the scope of this work. The cores of Markush structures are often what plays the largest role in determining a property; different substituents or functionalizations may shift the property. Only 17% of the OPD dataset and 11% of the TKI dataset are made up of substituted Markush structures. Since the goal of this patent extraction process is to bootstrap a region of chemical space based on relevance to keyword queries, rather than to obtain a specific set of molecules, this is acceptable. For some applications, users may wish to modify our code to use a substitution other than ethyl groups. We’ve included a note about this in the README. We’ve also added a note in section 2.2 about the proportion of the dataset that includes Markush structures.

- In 2.4.2: I would have loved to see comparisons with more models. It should be relatively easy to use the implementations in MOSES / GuacaMol, no?

We agree that testing more models would have given more examples for the points we make. However, we believe we were able to get the main story of the paper across with the three models presented, which range across a variety of representations and architectures. The primary purpose of this paper is to demonstrate the utility of patents in bootstrapping chemical space, which is to some extent agnostic to the model architecture chosen, so we omitted a comprehensive benchmark of models as beyond the scope of this work.

Minor comments:
- In 2.2, there are two "empty citation"s.

Thanks for pointing this out, it should be fixed now.

- 2.3.2: It is not clear how the similarity is computed when starting from the Morgan fingerprints.

Tanimoto similarity was the similarity measure of choice. We have updated section 2.3.2 with this information.
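For readers, a minimal version of this computation in RDKit looks like the following; the radius and bit-vector length shown are common illustrative defaults, not necessarily the settings used in the paper.

from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def tanimoto(smiles_a, smiles_b, radius=2, n_bits=2048):
    # Morgan bit-vector fingerprints for the two molecules
    fps = [AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), radius, nBits=n_bits)
           for s in (smiles_a, smiles_b)]
    return DataStructs.TanimotoSimilarity(fps[0], fps[1])

print(tanimoto('CCO', 'CCN'))  # similarity of ethanol and ethylamine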

- Why use different fingerprints in 2.3.2 and in 2.4.1?

Thank you for pointing this out. It was an arbitrary choice, but we have re-plotted the PCA with the same fingerprints for consistency.

- In 2.4.1: I am not sure whether the comparison between ZINC-trained and TKI-trained models is very relevant: it is expected that they will lead to different distributions of compounds.

It is a way to show the utility that training on domain-focused data (TKI patents) has over training on publicly accessible large databases (ZINC) that cover a similar chemical space (drug-like molecules) but are less focused on the domain of interest. We have added a line to section 2.4.1 providing more context on this.

- I am not convinced by the statement that "[transformers] require large amounts of training data" (compared to RNNs). Can you elaborate?

Thanks for the comment. On further thought, we agree that this statement is not very convincing, nor is it necessary. We have edited the paragraph to omit this line. As we state earlier in that paragraph, we chose Recurrent Neural Networks (RNNs) because they have been shown to be simple but powerful text-based models for complex distribution modeling tasks in molecules (https://www.nature.com/articles/s41467-022-30839-x).

Comments on the attached "data reviewer checklist":
- General comment: The authors say that the forks of the code are available in Zenodo. This is maybe true but hard to check: since the ZIP is 4 GB and takes 3 hours to download, I don't think this counts as "accessible". I would suggest splitting the model data (big / slow to download) from the code (should be small / fast to download) to make it easier for reviewers and readers to access the code easily.

Thanks for this comment. The download speed appears to be limited on Zenodo’s end: even on our ethernet connection at ~100 MB/s, Zenodo estimates about 2 hours to complete the download when we start it ourselves. We have split the code into a separate, smaller Zenodo repository per your suggestion.

- The instructions to set up the conda environment for PatentChem did not work for me. I had to use "conda env create --file environment.yml" instead. I couldn't test all the code, because the first step would have taken at least 5 hours for one single zip, and the next steps depend on it.

Thanks for this feedback. We tested the original installation instructions on several different machines to confirm that they should work. In case they don’t for some people, we’ve added another line to the README instructions with "conda env create -f environment.yml" as an alternative. The mamba installation should do the same thing as conda but may be faster.
As noted in our code’s README, “The download speed seems to be restricted to ~5-10MB/s [by USPTO], which means downloading the full set could require > 4 days if done in series. Alternatively, you can run multiple downloads in parallel. Either way, we recommend starting the downloads in a tmux session to run in the background.”
We recognize that even if the downloads are done in parallel, this step could still be prohibitive for some users due to the very large space requirement (2.1 TB). To help with this, we’ve created a single .tar.gz file containing all chemistry-relevant patents and excluding all irrelevant patents, which will effectively allow users to bypass steps 1 (download) and 2 (chemistry selection) of our pipeline. However, this file alone is still 70 GB in its compressed form. Zenodo does not typically allow for file uploads over 50 GB, but we’ve submitted a request for them to consider allowing our 70 GB upload. They have accepted our request for a quota increase, and we are waiting for their instructions on how to proceed with the upload. We will update the README with instructions for downloading and using this file.
As you noted in your previous comment, Zenodo downloads can be even slower than USPTO downloads for large files (3 hours for 4 GB). Thus, downloading a 70 GB file from Zenodo (if they allow us to upload it) will still be very time consuming, so we expect users to do this in the background. However, we feel that providing this slower option may still be beneficial for users who do not have the disk space for 2.1+ TB of data. We will also retain the original approach for faster download times and to allow users to download and process future patents that wouldn’t be included in a static Zenodo archive.

- 2a: "No" because no code is provided for the data processing in steps 4 and 5. Section 2.3 may give enough context for step 5, but not for step 4.

Step 4 is based on simple post-processing steps, as mentioned in section S2 of the SI, and the removal of duplicate SMILES strings. The database is just a form of storage that we use; it can be replaced by storing data as CSV files in a local directory, for instance. Mkite (https://github.com/mkite-group/) is an open-source database framework similar to what we used, in case readers are interested in a similar form of data storage. We have added a line to section 2.1 to clarify this.

- 4a: can't say because of the download issue mentioned above. Not clear if it contains the updates to the JTVAE/REINVENT code. Not clear if it contains the code for the property predictors.

Please see response to first comment for data reviewer checklist.

- 4b/4c: lacking the baselines mentioned in the major comments above

We have resolved the comments on baselines as mentioned above.

- 5: The splitting procedure is not clear for the property predictor

Random splitting into train, validation, and test sets in the ratio 60:20:20 was used for all models in this work, including the property predictor. We have clarified this point in section 3.2.1.
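A sketch of this splitting procedure (here with scikit-learn; dataset is a hypothetical list of labeled molecules):

from sklearn.model_selection import train_test_split

dataset = list(range(1000))  # stand-in for the labeled dataset
train, rest = train_test_split(dataset, test_size=0.4, random_state=0)  # 60% train
val, test = train_test_split(rest, test_size=0.5, random_state=0)       # 20% val / 20% test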

- 6: it looks like the code contains the individual components (download from patents, extract SMILES, de novo models), but not the scripts to obtain the results shown in the paper (metrics, graphs, etc.).

We already provide code to reproduce all figures in the Zenodo repository along with the model code. This code will now be more easily accessible since the code and data are in separate repos.


Key modifications made beyond the reviewer comments:

- The JTVAE metrics on the OPD dataset in Table 1 were updated with new values based on more samples (1000) for consistency with the rest of the table. The previous values of those entries were based on a smaller sample size.
- Figure S4, containing the PCA plot of the latent space, and its reference in the main text were removed, since we felt it wasn't accurate to make statements about the smoothness of a high-dimensional latent space based on lower-dimensional projections.
- We identified a bug in the code used for plotting Figure 3e that caused the post-hoc filtering to be based on DFT predictions instead of Chemprop predictions. This code was corrected, and the figure was re-plotted in the new version as Figure 4f.




Round 2

Revised manuscript submitted on 02 Jun 2023
 

14-Jun-2023

Dear Dr Gomez-Bombarelli:

Manuscript ID: DD-ART-03-2023-000041.R1
TITLE: Automated patent extraction powers generative modeling in focused chemical spaces

Thank you for submitting your revised manuscript to Digital Discovery. I am pleased to accept your manuscript for publication in its current form. I have copied any final comments from the reviewer(s) below. (The minor point raised by the reviewer can be addressed in the proof stage; please do take a look.)

You will shortly receive a separate email from us requesting you to submit a licence to publish for your article, so that we can proceed with the preparation and publication of your manuscript.

You can highlight your article and the work of your group on the back cover of Digital Discovery. If you are interested in this opportunity please contact the editorial office for more information.

Promote your research, accelerate its impact – find out more about our article promotion services here: https://rsc.li/promoteyourresearch.

If you would like us to promote your article on our Twitter account @digital_rsc please fill out this form: https://form.jotform.com/213544038469056.

We are offering all corresponding authors on publications in gold open access RSC journals who are not already members of the Royal Society of Chemistry one year’s Affiliate membership. If you would like to find out more please email membership@rsc.org, including the promo code OA100 in your message. Learn all about our member benefits at https://www.rsc.org/membership-and-community/join/#benefit

By publishing your article in Digital Discovery, you are supporting the Royal Society of Chemistry to help the chemical science community make the world a better place.

With best wishes,

Dr Joshua Schrier
Associate Editor, Digital Discovery


 
Reviewer 1

The authors have responded well to my questions.

Reviewer 2

I am positively surprised by how the authors addressed my concerns. I think the paper has become much more readable and clearer, especially in section 3.2.

It would have been interesting to see additional baseline models (among the very many that have been published in recent years), but I think that this is not essential for the publication of the paper.

Minor comment:
- The meaning of "GD" in Figure 4a is not explained and the abbreviation is never introduced. It is only clear from the section referring to that figure that it means gradient descent. I would suggest that the authors write out "gradient descent" or mention the abbreviation in the caption.




Transparent peer review

To support increased transparency, we offer authors the option to publish the peer review history alongside their article. Reviewers are anonymous unless they choose to sign their report.

We are currently unable to show comments or responses that were provided as attachments. If the peer review history indicates that attachments are available, or if you find there is review content missing, you can request the full review record from our Publishing customer services team at RSC1@rsc.org.

Find out more about our transparent peer review policy.

Content on this page is licensed under a Creative Commons Attribution 4.0 International license.