Peer review history from the journal Digital Discovery

Mining patents with large language models elucidates the chemical function landscape

Round 1

Manuscript submitted on 18 Jan 2024
 

21-Mar-2024

Dear Dr Ellington:

Manuscript ID: DD-ART-01-2024-000011
TITLE: Mining Patents with Large Language Models Elucidates the Chemical Function Landscape

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

I have carefully evaluated your manuscript and the reviewers’ reports, and the reports indicate that major revisions are necessary.

Please submit a revised manuscript which addresses all of the reviewers’ comments. Further peer review of your revised manuscript may be needed. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy, https://credit.niso.org/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link:

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log on to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process.   We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Dr Joshua Schrier
Associate Editor, Digital Discovery

************


 
Reviewer 1


My comments:

Do the first 3500 characters (Patent summarization) of the description of the patent contain any information about chemical function labels?

Is it possible that the 3500 characters contain chemicals alongside a medical-institute or chemical-specific institute name, so that the labels are predicted based on the chemical industry or institute name instead of the CID or IUPAC name?

How many unique molecules and functional terms were in the training dataset that was summarised by ChatGPT? This information could be added to the abstract.

Did ChatGPT generate any new functional labels that were not in the training dataset? Although the authors mention this in the ethics statement, any findings about novel labels could be mentioned in the results section.

Gene ontology provides hierarchical relationships; did the authors investigate whether the clustering has any hierarchical relationship? The number of the unique label clusters [i.e. the summarization column] is “61675”, and the unique CID count is “99,454”. So is it possible that a cluster which is common to a CID (many labels in a row of the “summarization” column of the dataset “CheF_100K_final.csv”, https://github.com/kosonocky/CheF/blob/main/results/CheF_100K_final.csv).

Is it possible with the current pipeline to restrict the training patents by date (keeping the 3500-character restriction), so the reader can see whether the LLM captures differences between more recently patented chemicals and older ones?

Does a representative term capture hierarchical relationships? Please mention this in the discussion section.

What input (a chemical IUPAC name or CID) is needed to predict the functional label? Please mention this in the discussion section.


Best regards,

Reviewer 2

The manuscript titled "Mining Patents with Large Language Models Elucidates the Chemical Function Landscape" delves into the utilization of a vast corpus of chemical literature to construct a Chemical Function (CheF) dataset, positing that it accurately reflects the chemical functionality landscape. Through comprehensive analyses and modeling, the study underscores the potential of employing text-derived functional labels in drug discovery, presenting an alternative paradigm to conventional structure-based methodologies. The manuscript is interesting, warranting acceptance pending the resolution of the following major concerns:

Several limitations in this research manuscript merit consideration:

1. The manuscript refers to "Google Scholar" three times, yet based on Figure S2, the correct reference should be "Google Patents."

2. While the manuscript predominantly explores compound functionalities pertinent to disease treatment, the inclusion of agriculture in Figure 3 appears incongruous. A cursory examination of patent US20230046892A1, cited in Figure S2, reveals its categorization under the agricultural domain, specifically under A01 (AGRICULTURE; FORESTRY; ANIMAL HUSBANDRY; HUNTING; TRAPPING; FISHING) in the International Patent Classification (IPC). Given the manuscript's primary focus on medical or pharmaceutical patents, it would be more fitting to concentrate on patents classified under A61 (MEDICAL OR VETERINARY SCIENCE; HYGIENE).

3. The IPC classification system indeed provides a meticulous categorization of patents, covering both their chemical structural and functional aspects. Compared to the summarization provided by ChatGPT, IPC offers a more coherent and precise classification framework for summarizing patents. Given that SureChEMBL and Google Patents offer access to IPC data, it is puzzling why the manuscript does not integrate IPC classifications.

4. The challenge of activity cliffs, where structurally similar compounds display significant variations in activity, is a prevalent issue in drug discovery. Patents primarily serve legal and commercial protection purposes, and not all molecules within them explicitly state their function. How did the authors tackle this issue in their study?

Reviewer 3

The authors present “CheF”, a dataset formed using large language models to elucidate the chemical function landscape. Overall, I enjoyed reading the article and found it interesting; I have primarily minor comments and clarifications (detailed further below). I enjoyed the interactive visualization, but it takes a little while to load, which at first makes the website look a little strange. The authors may want to consider optimizing the layout, indicating that something is “loading”, or leaving a ghost image as a placeholder rather than a big chunk of whitespace on the left of the page while it loads (this is especially pronounced on slower connections). The authors may also wish to link this website to their repository and dataset, as the documentation on the app is currently rather limited.
There are no line numbers which makes writing this review a little challenging.
Abstract line 3: why “orthogonal methods”? I don’t see a clear orthogonality. Please explain or rephrase?
Abstract “approximately 100K molecules … corresponding 188K unique patents” – I appreciate this is detailed later in the paper, but a word or two in the abstract saying how these 100K were selected could be useful.
Abstract “identify drugs with target functionality … structure alone” – was any evaluation done?
End of introduction – again I had the same question – was there any comparison or evaluation done?
Results, second paragraph: “This was done to exclude over-patented molecules like penicillin with over 40,000 patents, most of which are irrelevant to its functionality” – but won’t this also exclude well known drugs? Can you do the training with and without these limits to explore the influence of this decision? I would be concerned that this will eliminate pretty much all known drugs in use, as they have extensive patent numbers.
Results, third paragraph: “the patent title, … description were scraped from Google Scholar” – Google Scholar or Google patents?
Results, third paragraph: “when considering the function of the primary patented molecule, of which the labeled molecule is an intermediate” – I’m afraid I don’t quite understand what this sentence is saying.
Figure 1: This is a nice overview, but I found some of the symbol choices confusing. Should “Molecule Database” be SureChEMBL? PubChem is not mentioned anywhere yet, but the PubChem “C” symbol is included in this image and retrieving CIDs and patent IDs is mentioned, the CIDs implies that PubChem was used, but where is this detailed? [comment added after: later on in the methods this is clearer, please consider restructuring and putting the methods before the results to avoid this confusion]
3rd paragraph after Fig 1 (p3): “And due to molecules” – delete “And”.
4th paragraph after Fig 1 (p3), second line: which fingerprint was used? Insufficient detail. If in the methods, please provide a cross-reference? [comment added later: I did not recall seeing this in the methods either]
Figure 4: consider putting the titles “hcv”, “electroluminescence”, “serotonin”, “5-ht” on the tops of the plot groups (ab, cd, ef, gh) to aid interpretation (it gets lost in the text of the caption). For (a), what are “ns” and “c” – are there some text snippets coming through the extraction that are less meaningful?
Last paragraph on p6: “To examine the best model’s capability in drug repurposing … to mitigate further pandemics” – does this also show that further refinement with keywords may help? Is this a future perspective?
Figure 5b – what is the name of CID 59611288? In PubChem (assuming it was a PubChem CID) I see only a long IUPAC name or SCHEMBL2709855; maybe at least adding the latter could help? But again, readers need to know where to search for the CID to find out more; this is not clear yet. Please add a clarification about the CID to the caption (or consider shifting the methods as suggested).
Figure 5 caption: “true positives in green, false positives in red” – how were these determined?
Page 8 1st paragraph: “the entire 32M+ molecule database” – which database? It would be easier to understand if the database name was used.
P8 2nd paragraph: “further quality increases may result from integration of well-documented chemical-gene and chemical-disease relationships into CheF” – note that these are available in PubChem (and other resources).
P8 last paragraph of the results – I agree!
P8: the methods section only appears here, after the results; I would encourage the authors and journal to consider presenting the methods before the results in this case, as it’s one of the most interesting parts of the paper and helps with understanding the results.
P8 1st paragraph of methods: “InChI keys” – correct to InChIKeys (no space), which is the InChI Trust’s spelling.
P8 1st paragraph of methods: “…and used to obtain PubChem CIDs” – how? There are many methods, and the mappings are not 1:1. More details are needed here.
P8 1st paragraph of methods: “Google Scholar” – as mentioned above, why Google Scholar and not e.g. patents?
P8 2nd paragraph: details like this are definitely interesting further up, please move the methods section above the results, it would make things a lot clearer.
Ethics statement: Is this an ethics statement or a discussion? To me this sounds more appropriate in the discussion. How can the authors be sure about “As patents typically describe the beneficial applications of molecules … unlikely … to identify novel toxic compounds” – I am not sure about this, there are not just beneficial applications of chemicals, there are also plenty of industrial processes described in patents (which are not necessarily “toxic” either) – or is the subset used in this paper only drug discovery-related material? This may be worthy of a little extra discussion / background.
Supplemental data: There seems a lot of supplemental data presented that was not so obvious (at least not to me) when reading the article – perhaps a clearer description of extra material could be beneficial in the main text.
Supplemental A: prompts – consider including this in the main text? The main text came across as very short, whereas the supplemental contains interesting information that was missing in the main text.
Table S1: This seems to show that further keyword requirement may be needed? Or how else to trim patents to relevant numbers? Consider adding names to the table for the benefit of chemists? The PubChem CID is just a number (but sometimes the names are long, which is also a problem).
Table S3: this is a tiny table (is it worthy of a table?) and could/should be integrated in the main text in my opinion – perhaps the caption and table could be merged into a paragraph in the main text.
Table S4: I find this very interesting and would recommend integrating this into the main text, I was missing a comparison in the main text and was surprised to see this only in the supplemental.
Table S6: as also commented above for another example, there are some single letters/numbers like “7”, “c” – what meaning do these have and had the authors considered a character limit? Some other text mining applications give penalties to or trim anything under a length of 3 or 4, for instance.
Table S7: Something looks a bit strange with the “anti – anti-malarial” row (is it two rows or one?). Again, several single digit numbers or single letters (6, 5, 4, a) – why do these entries occur? Also, some of these entries appear to be functional groups, whereas others are functions. How can CheF distinguish? How are users meant to interpret these?


 

We thank the editor and reviewers for providing useful feedback on our manuscript. Here, we address each of the comments in order.

Editor:

> Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy,

We have added a section on author contributions using the CRediT system. This is above the ‘Conflict of Interest’ and ‘Acknowledgement’ sections.

> The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process. We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

We have provided the ORCID IDs for all authors. They are as follows:
- Clayton Kosonocky: 0000-0002-6420-8615
- Claus Wilke: 0000-0002-7470-9261
- Edward Marcotte: 0000-0001-8808-180X
- Andrew Ellington: 0000-0001-6246-5338


Response to Reviewer 1:

> Do the first 3500 characters (Patent summarization) of the description of the patent contain any information about chemical function labels?

The first 3500 characters contain highly relevant information on the molecule’s function. Typically this includes background on the relevant subfield and the summary of claims. We have added a sentence in the second paragraph of the methods section to clarify this:

Line 91: “The first 3500 characters of the description were included because the start of the patent description typically contains relevant background, mechanistic information, and/or a summary of the claim.”
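For illustration, a minimal sketch of this truncation step might look as follows (the field names here are hypothetical, not those of our actual pipeline):

```python
# Assemble the text passed to summarization: title, abstract, and the first
# 3500 characters of the description. Field names are hypothetical.
def build_summarization_input(patent: dict) -> str:
    return "\n\n".join([
        patent["title"],
        patent["abstract"],
        patent["description"][:3500],  # typically background and summary of claims
    ])
```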

> Is it possible that the 3500 characters contain chemicals alongside a medical-institute or chemical-specific institute name, so that the labels are predicted based on the chemical industry or institute name instead of the CID or IUPAC name?

It is possible that the 3500 characters contain chemical industry or institute names instead of CIDs or IUPAC names. However, this has no impact on the dataset. The dataset was created using the SureChEMBL database, whose molecule-patent associations were established by text mining irrespective of where the molecule appears in the patent. We use these pre-established associations and then pass in the title, abstract, and first 3500 characters to summarize the functionality described by the patent, irrespective of its molecule-patent associations. We have clarified the data processing steps in the methods section:

Line 79: The SureChEMBL database, a database of text-mined associations between molecules and the patents they are mentioned in, was shuffled and converted to chiral RDKit-canonicalized SMILES strings to remove malformed strings (Weininger, 1988; Papadatos et al., 2016; Landrum et al., 2013).
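A minimal sketch of this canonicalization and filtering step with RDKit (error handling simplified):

```python
from rdkit import Chem

def canonicalize(smiles: str) -> str | None:
    """Return the chiral RDKit-canonical SMILES, or None for malformed strings."""
    mol = Chem.MolFromSmiles(smiles)  # returns None when the string cannot be parsed
    if mol is None:
        return None
    return Chem.MolToSmiles(mol, isomericSmiles=True)  # keep stereochemistry
```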

> How many unique molecules and functional terms were in the training dataset that was summarised by ChatGPT? This information could be added to the abstract.

There were 99,182 molecules and 1,522 functional terms in the dataset that was later used to train the model. In the revised manuscript, these are mentioned in the methods section:

Line 112: This resulted in a 99,182-molecule dataset with 1,522 unique functional labels, deemed the Chemical Function (CheF) dataset.

And the results section:

Line 248: The final CheF dataset consisted of 99,182 molecules and their 1,522 descriptive functional labels

We have listed approximate numbers in the abstract for readability:

Line 12: This dataset, comprising 631K molecule-function pairs, was created using an LLM- and embedding-based method to obtain 1.5K unique functional labels for approximately 100K randomly selected molecules from their corresponding 188K unique patents.

> Did ChatGPT generate any new functional labels that were not in the training dataset? Although the authors mention this in the ethics statement, any findings about novel labels could be mentioned in the results section.

We created a dataset of molecules and their functional labels by using the pre-trained ChatGPT (gpt-3.5-turbo) to extract functional information from patents. There was no training dataset used to fine-tune ChatGPT further, and thus the question being asked does not apply to our situation. To clear up this misunderstanding, we have clarified the following sentence in the methods section:

Line 88: The patent title, abstract, and first 3500 characters of the description were summarized into concise functional labels using ChatGPT (gpt-3.5-turbo) with no further fine-tuning from July 15th, 2023, chosen for low cost and high speed.
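For concreteness, a hedged sketch of such a summarization call using the openai Python client; the prompt shown is a placeholder, not our exact prompt (which is reproduced in the methods):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def summarize_patent(patent_text: str) -> str:
    # The prompt below is a stand-in; the actual prompts are given in the manuscript.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": "Summarize the molecular function described in this "
                       "patent as concise labels:\n\n" + patent_text,
        }],
    )
    return response.choices[0].message.content
```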

> Gene ontology provides hierarchical relationships; did the authors investigate whether the clustering has any hierarchical relationship? The number of the unique label clusters [i.e. the summarization column] is “61675”, and the unique CID count is “99,454”. So is it possible that a cluster which is common to a CID (many labels in a row of the “summarization” column of the dataset “CheF_100K_final.csv”

Investigating hierarchical relationships in the clustering would be beneficial to improve the dataset. We have added a few sentences explaining the additional benefits that creating an ontology would provide in future work:

Line 437: Increasing label quality and ignoring extraneous claims might be achieved through an LLM fine-tuned on high-quality examples or through the organization of functional labels into an ontology.

We are unsure what the referee is asking in the second half of this comment as it appears that the comment got cut off.

> Is it possible with the current pipeline to restrict the training patents by date (keeping the 3500-character restriction), so the reader can see whether the LLM captures differences between more recently patented chemicals and older ones?

Assessing the differences in functional terms that the LLM extracts from patents depending on the specific revision year of a patent is a worthwhile experiment. We have added this as Table S3 to show that identical labels were obtained regardless of the patent version, even in cases when publication dates differ by 7 years. This provides evidence that patent publication date has minimal impact on the generated terms.

> Does a representative term capture hierarchical relationships? Please mention this in the discussion section.

It is possible that the representative term would capture hierarchical relationships, but it is not guaranteed given the method used in the manuscript. This has been mentioned in the discussion section:

Line 439: While it is possible that some of the representative terms created with GPT-4 capture hierarchical relationships, it is not guaranteed from the method used herein.

> What input (a chemical IUPAC name or CID) is needed to predict the functional label? Please mention this in the discussion section.

The input to predict functional labels is a SMILES string that gets converted to an RDKit Daylight-based molecular fingerprint. We have clarified the methods section to emphasize this.

Line 350: To employ the text-based chemical function landscape for drug discovery, several multi-label classification models were trained on CheF to predict functional labels from Daylight molecular fingerprints
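To make the required input concrete, a minimal sketch of the SMILES-to-fingerprint conversion (the classifier call at the end is hypothetical):

```python
import numpy as np
from rdkit import Chem, DataStructs

def smiles_to_fingerprint(smiles: str) -> np.ndarray:
    """Convert a SMILES string to the binary Daylight-style fingerprint
    used as model input."""
    mol = Chem.MolFromSmiles(smiles)
    fp = Chem.RDKFingerprint(mol)  # RDKit's Daylight-like path fingerprint
    arr = np.zeros((fp.GetNumBits(),), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

# Hypothetical usage with a trained multi-label classifier `model`:
# label_probs = model.predict_proba(smiles_to_fingerprint("CCO").reshape(1, -1))
```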



Response to Reviewer 2:

> The manuscript refers to "Google Scholar" three times, yet based on Figure S2, the correct reference should be "Google Patents."

We apologize as this was an error on our end. We have corrected all references from "Google Scholar" to "Google Patents" throughout the text and in Figures 1 and S2.

> While the manuscript predominantly explores compound functionalities pertinent to disease treatment, the inclusion of agriculture in Figure 3 appears incongruous. A cursory examination of patent US20230046892A1, cited in Figure S2, reveals its categorization under the agricultural domain, specifically under A01 (AGRICULTURE; FORESTRY; ANIMAL HUSBANDRY; HUNTING; TRAPPING; FISHING) in the International Patent Classification (IPC). Given the manuscript's primary focus on medical or pharmaceutical patents, it would be more fitting to concentrate on patents classified under A61 (MEDICAL OR VETERINARY SCIENCE; HYGIENE)

This is an interesting suggestion and we thank you for bringing it to our attention. While we focus on medicinal applications in some portions of this manuscript, we are hoping to capture molecular functionality in the broadest sense of the word, meaning the inclusion of non-human biological interactions (e.g., agricultural antifungal agents) and molecules with non-biological functions (e.g., polymer, electroluminescent). We have added a sentence discussing this in the manuscript to better clarify what we are accomplishing with our dataset:

Line 209: As our goal was to capture molecular functionality in the broadest sense, we chose to include patents irrespective of their International Patent Classification categories

> The IPC classification system indeed provides a meticulous categorization of patents, covering both their chemical structural and functional aspects. Compared to the summarization provided by ChatGPT, IPC offers a more coherent and precise classification framework for summarizing patents. Given that SureChEMBL and Google Patents offer access to IPC data, it is puzzling why the manuscript does not integrate IPC classifications.

This is a great suggestion as we have indeed neglected IPC categories in our manuscript. IPC categories appear suitable for classifying some functionalities, but seem insufficient at other levels of detail. To illustrate this, we created a table comparing IPC categories to ChatGPT-generated functional labels for the molecule SCHEMBL4156630 (PubChem CID: 87628486) associated with patent US-2009156572-A1 (Table S1). From this we can see that the IPC classifications include structural classifications (e.g., Ortho-condensed systems, Heterocyclic compounds containing) and broad disease classifications (e.g., Centrally acting analgesics, Antidepressants) but miss out on the fine-grained mechanistic terms that the ChatGPT-aided method captures (e.g., Dopamine-Antagonistic, Serotonin Agonist). The IPC classifications capture some terms that the ChatGPT-aided method misses, so in future work it may make sense to integrate these to fill in gaps and/or supplement the dataset. We have now mentioned this in the discussion:

Line 433: The inclusion of over-patented chemicals, like those in Table S2, could be accomplished through supplementation from other data sources like PubChem, PubMed, or International Patent Classification categories (Table S1).

> The challenge of activity cliffs, where structurally similar compounds display significant variations in activity, is a prevalent issue in drug discovery. Patents primarily serve legal and commercial protection purposes, and not all molecules within them explicitly state their function. How did the authors tackle this issue in their study?

Activity cliffs are indeed a major problem in drug discovery, causing the molecular property landscape to be rough if viewed at a fine-grained level. Our aim was to create a model that can predict the chemical function landscape at a coarse-grained level to give plausible suggestions of molecular functionality rather than providing guaranteed predictions of activity. Models created with the CheF dataset would be used to find leads that would then be optimized further through other means. As this is an important consideration, we have added a paragraph to the discussion clarifying this:

Line 446: The CheF dataset was created from patented molecules. This includes the active molecules responsible for the patent’s existence, but also derivatives that may or may not be active. Models trained on the CheF dataset are therefore learning a coarse-grained map of the chemical function landscape rather than a fine-grained map with activity cliffs. As such, we foresee CheF-trained models being used to annotate broad functionality at a high level, capturing general chemical trends, rather than providing precise guarantees of activity.



Response to Reviewer 3:

> I enjoyed the interactive visualization, but it takes a little while to load and at first this makes the website look a little strange, the authors may want to consider optimizing the layout, or indicating that something is “loading”, or leaving a ghost image as a placeholder rather than having a big chunk of whitespace to the left of the web page while loading (this is especially pronounced for slower connections).

We appreciate your enthusiasm for the interactive visualization. We have made the following changes to the app to help ensure a smoother user experience:
1. A “Loading” indicator while the plot is loading
2. Additional instructions and information on the web app page

> The authors may also wish to link this website to their repository and dataset, the documentation is rather limited on the app currently.

We have added a link to our GitHub repository with additional information on how to use the web app.

> There are no line numbers which makes writing this review a little challenging.

We apologize for that and hope it didn’t impact the review process too much on your end. For our revisions we have added line numbers and have provided them in this document wherever changes have been made.

> Abstract line 3: why “orthogonal methods”? I don’t see a clear orthogonality. Please explain or rephrase?

We agree that this was phrased poorly and have changed “orthogonal” to “alternative” in the abstract:

Line 20: We believe that functional label-guided molecular discovery may serve as an alternative approach to traditional structure-based methods in the pursuit of designing novel functional molecules.

And in the introduction:
Line 74: We believe that functional label-guided molecular discovery may serve as an alternative approach to traditional structure-based methods in the pursuit of designing novel functional molecules

> Abstract “approximately 100K molecules … corresponding 188K unique patents” – I appreciate this is detailed later in the paper, but a word or two in the abstract saying how these 100K were selected could be useful.

We have clarified this sentence. It now reads:

Line 12: This dataset, comprising 631K molecule-function pairs, was created using an LLM- and embedding-based method to obtain 1.5K unique functional labels for approximately 100K randomly selected molecules from their corresponding 188K unique patents

> Abstract “identify drugs with target functionality … structure alone” – was any evaluation done? End of introduction – again I had the same question – was there any comparison or evaluation done?

These sentences were alluding to the experiments conducted in Figure 5 and Figure S7. Here, we quantified the model’s ability to predict functional labels given a structural input (5a) as measured by the ROC-AUC and PR-AUC. This was then illustrated through the examples in 5b, 5c, 5d, and Fig. S7. We have rephrased these sentences in the abstract and introduction.

Line 17: We then demonstrate through several examples that this text-based functional landscape can be leveraged to identify drugs with target functionality using a model able to predict functional profiles from structure alone.

Line 72: We then demonstrate through several examples that this text-based functional landscape can be harnessed to identify drugs with target functionality using a model able to predict functional profiles from structure alone.

> Results, second paragraph: “This was done to exclude over-patented molecules like penicillin with over 40,000 patents, most of which are irrelevant to its functionality” – but won’t this also exclude well known drugs? Can you do the training with and without these limits to explore the influence of this decision? I would be concerned that this will eliminate pretty much all known drugs in use, as they have extensive patent numbers.

We understand your concern with leaving out the over-patented molecules. Despite leaving out these well-studied molecules from the training set, we hypothesized that the model is still able to predict their functions. We outline the hypothesis in the results:

Line 214: This was done to exclude over-patented molecules like penicillin with over 40,000 patents, most of which are irrelevant to its functionality (Table S2). We acknowledge that this filter removes the most well-studied molecules from the dataset. However, we hypothesize that the impact of this holdout is minimal as models trained on the dataset will be able to infer functionality of well-studied molecules from their less-patented derivatives.

We then obtained the top predicted labels for several over-patented molecules that the model had not seen before (Table S8). The success rate in this table is quite high, meaning that excluding these molecules is not as detrimental as it might first seem. We believe this works because molecules, when they are the primary subject of a patent, are often patented alongside their derivatives, so models trained on the CheF dataset can infer the function of over-patented molecules from their less-patented derivatives.

We also added a sentence discussing the table in the results:

Line 354: Despite excluding over-patented molecules from the dataset, the CheF-trained model is often able to confidently retrodict their primary functions, giving evidence to our earlier hypothesis (Table S8).

And in the discussion:

Line 431: Additionally, by restricting the dataset to chemicals with <10 patents, it neglects important well-studied molecules like Penicillin. However, we found the impact of this omission to be negligible (Table S8).
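For clarity, the patent-count filter referenced here amounts to a simple threshold; a sketch with hypothetical variable names and placeholder data:

```python
# Hypothetical mapping from molecule CID to its associated patent IDs.
mol_to_patents = {
    2244: ["US-1", "US-2"],                   # sparsely patented: kept
    5904: [f"US-{i}" for i in range(40000)],  # penicillin-like: dropped
}
# Keep only molecules with fewer than 10 patents, per the filter above.
filtered = {cid: pats for cid, pats in mol_to_patents.items() if len(pats) < 10}
assert list(filtered) == [2244]
```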

> Results, third paragraph: “the patent title, … description were scraped from Google Scholar” – Google Scholar or Google patents?

Apologies, this was an error on our end. We have corrected all references from "Google Scholar" to "Google Patents" throughout the text and in Figure S2.

> Results, third paragraph: “when considering the function of the primary patented molecule, of which the labeled molecule is an intermediate” – I’m afraid I don’t quite understand what this sentence is saying.

We have rewritten this entire paragraph to better convey the results of the validation. It now reads as follows:

Line 223: The LLM-assisted function extraction method’s success was validated manually across 1,738 labels generated from a random 200 CheF molecules. Of these labels, 99.6% had correct syntax and 99.8% were relevant to their respective patent. In the SureChEMBL database, molecules can be linked to patents in which they serve as intermediates to the final patented molecule. Because of this, 77.9% of the labels correctly describe the labeled molecule’s function. However, if considering associations through synthesis, then 98.2% of the molecules are correctly described by their functional labels. This shows that the deviation from near-perfect accuracy is due to the molecule-patent associations rather than the ChatGPT-assisted functional extraction.

> Figure 1: This is a nice overview, but I found some of the symbol choices confusing. Should “Molecule Database” be SureChEMBL? PubChem is not mentioned anywhere yet, but the PubChem “C” symbol is included in this image and retrieving CIDs and patent IDs is mentioned, the CIDs implies that PubChem was used, but where is this detailed? [comment added after: later on in the methods this is clearer, please consider restructuring and putting the methods before the results to avoid this confusion]

We agree that moving the methods before the results is a better choice for the flow of the manuscript and have implemented this change. We also agree that the PubChem “C” symbol is confusing in Figure 1 and have replaced it with the ChEMBL symbol.

> 3rd paragraph after Fig 1 (p3): “And due to molecules” – delete “And”.

Line 255: We have deleted this “And”.

> 4th paragraph after Fig 1 (p3), second line: which fingerprint was used? Insufficient detail. If in the methods, please provide a cross-reference? [comment added later: I did not recall seeing this in the methods either]

The fingerprints were generated using RDKit’s RDKFingerprint() function, which produces Daylight-style fingerprints. We have clarified this in the main text as well as in the methods, with citations to RDKit.

The methods now reads:

Line 135: The 99,182 molecules were converted to Daylight molecular fingerprints with the RDKFingerprint() method in RDKit (Landrum et al., 2013).

And the results:
Line 261: To evaluate this hypothesis, we embedded the CheF dataset in structure space by converting the molecules to Daylight molecular fingerprints (binary vectors representing a molecule’s substructures), visualized with t-distributed Stochastic Neighbor Embedding (t-SNE) (Fig. 2, S5) (Landrum et al., 2013).

Line 350: To employ the text-based chemical function landscape for drug discovery, several multi-label classification models were trained on CheF to predict functional labels from Daylight molecular fingerprints (Table S7) (Landrum et al., 2013).
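A minimal sketch of the structure-space embedding described at Line 261, assuming the fingerprints have been stacked into a matrix (placeholder data shown):

```python
import numpy as np
from sklearn.manifold import TSNE

# fps: (n_molecules, n_bits) matrix of Daylight fingerprints; placeholder here.
fps = np.random.randint(0, 2, size=(1000, 2048)).astype(float)
coords = TSNE(n_components=2, random_state=0).fit_transform(fps)
# coords[:, 0] and coords[:, 1] give the 2-D layout used for plotting.
```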

> Figure 4: consider putting the titles “hvc”, “electroluminescence”, “serotonin”, “5-ht” on the tops of the plot groups (ab, cd, ef, gh) to aid interpretation (it gets lost in the text of the caption). For (a), what is “ns” and “c” – are there some text snippets coming through the extraction that are less meaningful?

We have added the titles “hcv”, “electroluminescence”, etc. to the plot for easier interpretation (Figures 2, 4), as well as the supplemental examples (Figure S5).

For (a), the terms “ns” and “c” refer to the nonstructural (NS) proteins found in HCV and to the “C” in “Hepatitis C”, respectively. These short labels are an artifact of our label cleaning process rather than of the functional label text extraction, as we split terms by word for better label consolidation. This produced many useful labels like “polymerase, replication” but sometimes resulted in strange shortened labels like “ns” and “c”. Despite the occasional utility of single-character labels, we decided it was best to remove them and have done so in the updated manuscript. To correct this, we updated our dataset, re-trained our models, and changed all affected results and figures, including Figures 2, 3, 4, 5, S5, and S7 and Tables S5, S6, and S7. The updated dataset, which no longer includes the single-character labels, has been pushed to GitHub.
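The cleanup itself reduces to a length filter over the label vocabulary; a minimal sketch (removing single-character labels, as described above):

```python
def drop_short_labels(labels: list[str], min_len: int = 2) -> list[str]:
    """Drop artifact labels shorter than min_len characters (e.g. "c", "7")."""
    return [label for label in labels if len(label) >= min_len]

assert drop_short_labels(["hcv", "ns", "c", "7"]) == ["hcv", "ns"]
```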

> Last paragraph on p6: “To examine the best model’s capability in drug repurposing … to mitigate further pandemics” – does this also show that further refinement with keywords may help? Is this a future perspective?

This is a future perspective: since we were able to correctly identify relevant drugs for Hepatitis C virus, it is plausible that future approaches in functional label-guided drug discovery could help in the rapid development and repurposing of antivirals for future pandemics. We have rewritten this sentence to clarify that this is a future perspective.

Line 388: Beyond showing its power, this example suggests that functional label-guided drug discovery may serve as an additional approach for antiviral repurposing, which could help contribute to mitigating future pandemics.

This statement doesn’t directly show that further refinement with keywords may help, though that does seem to be an avenue to potentially improve dataset quality. We discuss this in the discussion section:

Line 433: The inclusion of over-patented chemicals, like those in Table S2, could be accomplished through supplementation from other data sources like PubChem, PubMed, or International Patent Classification categories (Table S1).

> Figure 5b – what is the name of CID 59611288? In PubChem (assuming it was a PubChem CID) I see only a long IUPAC name or SCHEMBL2709855, maybe at least adding the latter could help? But again, readers need to know where to search the CID to find out more, this is not clear yet. Please add a clarification about the CID into the caption (or consider shifting the methods as suggested).

We agree that this is confusing and have added “SCHEMBL2709855” below the CID in Figure 5b as the chemical name is too long to fit. We have also changed the CID portion to read “PubChem CID: 59611288”.

> Figure 5 caption: “true positives in green, false positives in red” – how were these determined?

The true and false positives were determined by going through each molecule’s associated patents and checking whether they mentioned serotonin or serotonin receptors. If they did, the molecule was marked as a true positive; if not, a false positive. This methodology has been clarified in the figure caption:

Line 400: Functional label-based drug candidate identification, showcasing the top 10 test set molecules for ‘serotonin’ or ‘5-ht’; true positives in green and false positives in red, determined by whether their associated patents mention serotonin or serotonin receptors.
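Although this determination was made by manual inspection, an automated equivalent would be a keyword scan over the associated patent texts; a sketch:

```python
def is_true_positive(patent_texts: list[str]) -> bool:
    """True if any associated patent mentions serotonin or 5-HT receptors."""
    keywords = ("serotonin", "5-ht")
    return any(k in text.lower() for text in patent_texts for k in keywords)
```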

> Page 8 1st paragraph: “the entire 32M+ molecule database” – which database? It would be easier to understand if the database name was used.

This refers to the SureChEMBL database. The text has been updated to make this clear:

Line 425: Since the CheF dataset is scalable to the entire 32M+ SureChEMBL database, we anticipate that many of these predictions will only get better into the future.

> P8 2nd paragraph: “further quality increases may result from integration of well-documented chemical-gene and chemical-disease relationships into CheF” – note that these are available in PubChem (and other resources).

We are aware that these are available but wanted to keep the focus of this paper on LLM-mined data. These relationships could be useful for supplementing missing functional labels. We have updated the sentence to note that, while such integration would benefit future work, this manuscript focuses on LLM-extracted information, leaving further merging and supplementation to future work:

Line 441: Further quality increases may result from integration of well-documented chemical-gene and chemical-disease relationships from PubChem into CheF. As the scope of the manuscript lies with using LLMs to mine functionality from text, we leave dataset merging and supplementation to future work.

> P8 last paragraph of the results – I agree!

We appreciate that you are excited about this method’s potential!

> P8: the methods only appears here after the results, I would encourage the authors and journal to consider presenting the methods section before the results for this case, as it’s one of the most interesting parts of the paper and helps understand the results.

We agree and have moved the methods to appear before the results section.

> P8 1st paragraph of methods: “InChI keys” – correct to InChIKeys (no space), which is the InChI Trust’s spelling.

Thank you for catching that. This has been corrected in the methods.

> P8 1st paragraph of methods: “…and used to obtain PubChem CIDs” – how? There are many methods, and the mappings are not 1:1. More details are needed here.

Thank you for pointing this out. We were not aware that the mappings are not 1:1 and checked whether this caused any clashes in our dataset. In the CheF-100K dataset there were 273 total clashes. However, 171 of these clashes shared a patent, meaning they were silent. Based on our methodology, the remaining 102 clashes defaulted to the larger CID. The effect is that the labels are less comprehensive than they should be for 102 molecules in the dataset, which, thankfully, is a very minor effect. In the future this should be handled either by using patent IDs from all clashing InChIKeys, or by obtaining CIDs and their patent IDs via a different method.

We have clarified our methodology in the methods section, in which we now say:

Line 81: SMILES strings were converted to InChIKeys and used to obtain PubChem CIDs, using the larger CID when conflicting pairs exist.
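A hedged sketch of this lookup via the PubChem PUG REST service, mirroring the larger-CID tie-breaking rule (batching and rate limiting omitted):

```python
import requests
from rdkit import Chem

def smiles_to_cid(smiles: str) -> int | None:
    """Resolve SMILES -> InChIKey -> PubChem CID, keeping the largest CID
    when one InChIKey maps to several compounds."""
    inchikey = Chem.MolToInchiKey(Chem.MolFromSmiles(smiles))
    url = ("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/inchikey/"
           f"{inchikey}/cids/JSON")
    resp = requests.get(url, timeout=30)
    if resp.status_code != 200:
        return None
    cids = resp.json().get("IdentifierList", {}).get("CID", [])
    return max(cids) if cids else None
```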

> P8 1st paragraph of methods: “Google Scholar” – as mentioned above, why Google Scholar and not e.g. patents?

We have corrected all references from "Google Scholar" to "Google Patents" throughout the text and in Figures 1 and S2.

> P8 2nd paragraph: details like this are definitely interesting further up, please move the methods section above the results, it would make things a lot clearer.

We agree and have moved the methods to appear before the results section.

> Ethics statement: Is this an ethics statement or a discussion? To me this sounds more appropriate in the discussion. How can the authors be sure about “As patents typically describe the beneficial applications of molecules … unlikely … to identify novel toxic compounds” – I am not sure about this, there are not just beneficial applications of chemicals, there are also plenty of industrial processes described in patents (which are not necessarily “toxic” either) – or is the subset used in this paper only drug discovery-related material? This may be worthy of a little extra discussion / background.

We agree that the ethics statement should be part of the discussion rather than separate. It has been moved to the end of the discussion in the updated manuscript.

You are correct that many molecules are patented for industrial applications and could be repurposed for malicious purposes. As this is now apparent to us, we have removed the sentence due to the unsoundness of the claim. What remains of the paragraph seems to us a suitable discussion of the risks of this model, which we believe are very minor, as the model is built from data mined from existing public literature. The paragraph reads as follows:

Line 453: Consideration of ML chemistry dual-use often focuses on the identification of toxic chemicals and drugs of abuse. To test the dual use potential of CheF, functional labels for the chemical weapons VX and mustard gas were predicted from our model and were found to contain no obvious indications of malicious properties. On the contrary, drugs of abuse were more easily identifiable, as the development of neurological compounds remains a lucrative objective. 5-MeO-DMT, LSD, fentanyl, and morphine all had functional labels of their primary mechanism predicted with moderate confidence. However, benign molecules also predicted these same labels, indicating that it may be quite challenging to intentionally discover novel drugs of abuse using the methods contained herein.

> Supplemental data: There seems a lot of supplemental data presented that was not so obvious (at least not to me) when reading the article – perhaps a clearer description of extra material could be beneficial in the main text.

This is a good point. We have now ensured that every supplementary figure and table has a corresponding reference in the main text to help guide the reader through the information.

> Supplemental A: prompts – consider including this in the main text? The main text came across as very short, whereas the supplemental contains interesting information that was missing in the main text.

We think this is a good suggestion and have added this to the methods in the main text (Lines 156-181).

> Table S1: This seems to show that further keyword requirement may be needed? Or how else to trim patents to relevant numbers? Consider adding names to the table for the benefit of chemists? The PubChem CID is just a number (but sometimes the names are long, which is also a problem).

This does show that a way to filter irrelevant patents would be needed to include these molecules. We have added a reference to this table in our discussion of ways to improve dataset quality:

Line 435: These over-patented molecules could also be included through keyword filtering or by only using the most common terms for each molecule.

We have added the names of these molecules, in addition to their PubChem CIDs, where they fit in what is now Table S2.

> Table S3: this is a tiny table (is it worthy of a table?) and could/should be integrated in the main text in my opinion – perhaps the caption and table could be merged into a paragraph in the main text.

We agree that this is a tiny table and have removed it, incorporating its content into the main text. We have made this change for this table and for the other similarly small supplemental table in the revised manuscript. We outline both changes below, in the methods and the results. The methods now reads:

Line 116: ChatGPT patent summarization validation. Manual validation was performed on 200 molecules randomly chosen from the CheF dataset. These 200 molecules had 596 valid associated patents, and 1,738 ChatGPT summarized labels. These labels were manually validated to determine the ratio of correct syntax, relevance to patent, and relevance to the molecule of interest.

Line 121: Validation of ChatGPT-aided label consolidation. The first 500 of the 3,178 clusters of greater than one label (sorted in descending cluster size order) were evaluated for whether or not the clusters contained semantically common elements. The ChatGPT consolidated cluster labels were then analyzed for accuracy and representativeness. Common failure modes for clustering primarily included the grouping of grammatically similar, but not semantically similar labels (e.g., ahas-inhibiting, ikk-inhibiting). Failure modes for ChatGPT commonly included averaging the terms to the wrong shared common element (e.g., anti-fungal and anti-mycotic being consolidated to the label “anti”).

And the results section reads:
Line 223: The LLM-assisted function extraction method’s success was validated manually across 1,738 labels generated from a random 200 CheF molecules. Of these labels, 99.6% had correct syntax and 99.8% were relevant to their respective patent. In the SureChEMBL database, molecules can be linked to patents in which they serve as intermediates to the final patented molecule. Because of this, 77.9% of the labels correctly describe the labeled molecule’s function. However, if considering associations through synthesis, then 98.2% of the molecules are correctly described by their functional labels. This shows that the deviation from near-perfect accuracy is due to the molecule-patent associations rather than the ChatGPT-assisted functional extraction.

Line 243: The embedding-based clustering and summarization process was validated across the 500 largest clusters. Of these, 99.2% contained semantically common elements and 97.6% of the cluster summarizations were accurate and representative of their constituent labels.
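For readers interested in the consolidation step being validated here, a generic sketch of embedding-based label clustering; the embedding model and clustering algorithm shown are assumptions for illustration, not necessarily those used in our pipeline:

```python
import numpy as np
from openai import OpenAI
from sklearn.cluster import AgglomerativeClustering

client = OpenAI()  # assumes OPENAI_API_KEY is set

def embed_labels(labels: list[str]) -> np.ndarray:
    """Embed each functional label; the embedding model is an assumption."""
    resp = client.embeddings.create(model="text-embedding-ada-002", input=labels)
    return np.array([item.embedding for item in resp.data])

labels = ["anti-fungal", "anti-mycotic", "kinase inhibitor"]
vectors = embed_labels(labels)
cluster_ids = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.5,  # threshold is illustrative
).fit_predict(vectors)
# Each resulting cluster would then be summarized into one representative label.
```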

> Table S4: I find this very interesting and would recommend integrating this into the main text, I was missing a comparison in the main text and was surprised to see this only in the supplemental.

We agree and have moved this table to the main text. It is now Table 1.

> Table S6: as also commented above for another example, there are some single letters/numbers like “7”, “c” – what meaning do these have and had the authors considered a character limit? Some other text mining applications give penalties to or trim anything under a length of 3 or 4, for instance.

We have answered this question in an earlier response, which we paste below:

For (a), the terms “ns” and “c” refer to the nonstructural (NS) proteins found in HCV and to the “C” in “Hepatitis C”, respectively. These short labels are an artifact of our label cleaning process rather than of the functional label text extraction, as we split terms by word for better label consolidation. This produced many useful labels like “polymerase, replication” but sometimes resulted in strange shortened labels like “ns” and “c”. Despite the occasional utility of single-character labels, we decided it was best to remove them and have done so in the updated manuscript. To correct this, we updated our dataset, re-trained our models, and changed all affected results and figures, including Figures 2, 3, 4, 5, S5, and S7 and Tables S5, S6, and S7. The updated dataset, which no longer includes the single-character labels, has been pushed to GitHub.

> Table S7: Something looks a bit strange with the “anti – anti-malarial” row (is it two rows or one?). Again, several single digit numbers or single letters (6, 5, 4, a) – why do these entries occur? Also, some of these entries appear to be functional groups, whereas others are functions. How can CheF distinguish? How are users meant to interpret these?

This strange formatting was an error from LaTeX (which we used to compile the original pdf). This has been fixed in the updated manuscript.

As mentioned in the above response, we have removed the single-character labels.

The structural terms are somewhat unavoidable in the current implementation, as filtering them algorithmically (or with an LLM) tends to remove receptors named after structural terms. An example is the 5-HT receptor, which is named after the structure of serotonin (5-HT; 5-hydroxytryptamine); removal of all structural terms would remove labels for this receptor. Any structural terms should therefore be interpreted in the context of the other associated or predicted labels. We have noted this in the methods section:

Line 109: Despite this, some structural terms remained which correspond either to receptor names (e.g., ATP, 5-HT) or to chemical moieties (e.g., aryl, azetidine).




Round 2

Revised manuscript submitted on 03 Apr 2024
 

19-Apr-2024

Dear Dr Ellington:

Manuscript ID: DD-ART-01-2024-000011.R1
TITLE: Mining Patents with Large Language Models Elucidates the Chemical Function Landscape

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

After careful evaluation of your manuscript and the reviewers’ reports, I will be pleased to accept your manuscript for publication after minor revisions.

Please revise your manuscript to fully address the reviewers’ comments. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy, https://credit.niso.org/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link:

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log in to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process.   We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Dr Joshua Schrier
Associate Editor, Digital Discovery

************
EDITOR'S NOTE:

It appears that the References section was omitted in the revised manuscript submitted for review. This needs to be corrected.


************


 
Reviewer 3

The authors have done a very good job addressing the reviewer comments, providing detailed explanations and revising the manuscript where possible, including new analyses in some cases. The manuscript reads clearly, especially now that the methods section precedes the results and some of the supplementary material is included in the main text. They have clearly indicated where they prefer to leave some investigations for future work, and this seems reasonable. The adjustments to the app are nice and will help users cross-discover the material.
It does appear that the submitted revised manuscript is missing the references section, judging by the comment on line 500, and the supporting information section may need some layout tweaks to avoid captions and tables breaking across pages. Scientifically, however, this seems ready to accept once those editorial issues are resolved; I think this will be a very valuable contribution.

Reviewer 1

Please ignore the part (“The number of the unique..”) in comment 5. It was related to my comment no. 7, which has been answered.


 

Apologies for forgetting to render the references. I have updated the manuscript to include them.




Round 3

Revised manuscript submitted on 29 Apr 2024
 

03-May-2024

Dear Dr Ellington:

Manuscript ID: DD-ART-01-2024-000011.R2
TITLE: Mining Patents with Large Language Models Elucidates the Chemical Function Landscape

Thank you for submitting your revised manuscript to Digital Discovery. I am pleased to accept your manuscript for publication in its current form.

You will shortly receive a separate email from us requesting you to submit a licence to publish for your article, so that we can proceed with the preparation and publication of your manuscript.

You can highlight your article and the work of your group on the back cover of Digital Discovery. If you are interested in this opportunity please contact the editorial office for more information.

Promote your research, accelerate its impact – find out more about our article promotion services here: https://rsc.li/promoteyourresearch.

If you would like us to promote your article on our LinkedIn account [https://rsc.li/Digital_showcase] please fill out this form: https://form.jotform.com/213544038469056.

By publishing your article in Digital Discovery, you are supporting the Royal Society of Chemistry to help the chemical science community make the world a better place.

With best wishes,

Dr Joshua Schrier
Associate Editor, Digital Discovery


******
******

Please contact the journal at digitaldiscovery@rsc.org

************************************





Transparent peer review

To support increased transparency, we offer authors the option to publish the peer review history alongside their article. Reviewers are anonymous unless they choose to sign their report.

We are currently unable to show comments or responses that were provided as attachments. If the peer review history indicates that attachments are available, or if you find there is review content missing, you can request the full review record from our Publishing customer services team at RSC1@rsc.org.
