Mining patents with large language models elucidates the chemical function landscape

The fundamental goal of small molecule discovery is to generate chemicals with target functionality. While this often proceeds through structure-based methods, we set out to investigate the practicality of methods that leverage the extensive corpus of chemical literature. We hypothesize that a sufficiently large text-derived chemical function dataset would mirror the actual landscape of chemical functionality. Such a landscape would implicitly capture complex physical and biological interactions given that chemical function arises from both a molecule's structure and its interacting partners. To evaluate this hypothesis, we built a Chemical Function (CheF) dataset of patent-derived functional labels. This dataset, comprising 631 K molecule–function pairs, was created using an LLM- and embedding-based method to obtain 1.5 K unique functional labels for approximately 100 K randomly selected molecules from their corresponding 188 K unique patents. We carry out a series of analyses demonstrating that the CheF dataset contains a semantically coherent textual representation of the functional landscape congruent with chemical structural relationships, thus approximating the actual chemical function landscape. We then demonstrate through several examples that this text-based functional landscape can be leveraged to identify drugs with target functionality using a model able to predict functional profiles from structure alone. We believe that functional label-guided molecular discovery may serve as an alternative approach to traditional structure-based methods in the pursuit of designing novel functional molecules.

(Figure residue: example CPC classification codes paired with CheF labels, e.g. C07D401/04, "Heterocyclic compounds containing two or more hetero rings, having nitrogen atoms as the only ring hetero atoms, at least one ring being a six-membered ring with only one nitrogen atom, containing two hetero rings directly linked by a ring-member-to-ring-member bond", alongside "Dopamine-Antagonistic"; and A61P25/00, "Drugs for disorders of the nervous system", alongside "Dopamine-Agonistic".)
Figure S5: Additional CheF labels and their clusters in structure space. For each of the labels "crystal", "protease", "opioid", and "beta-lactamase", molecules in the CheF dataset were projected based on molecular fingerprints and colored if the selected label was contained in the molecule's set of descriptors. To measure the degree of clustering for a single label, the maximum fingerprint Tanimoto similarity from each molecule containing the selected label to the other molecules containing that label was compared against the maximum fingerprint Tanimoto similarity to a random subset of molecules of the same size. To measure the coincidence between the primary and co-occurring labels, the maximum fingerprint Tanimoto similarity from each molecule containing the primary label to each molecule containing any of the 10 nearest-neighbor labels was compared against the maximum fingerprint Tanimoto similarity to a random subset of molecules of the same size.
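The statistic in Figure S5 reduces to a maximum pairwise Tanimoto similarity between fingerprint bit sets. A minimal sketch, assuming fingerprints represented as sets of "on" bit indices (in practice these would come from e.g. RDKit Morgan fingerprints; the toy data below is illustrative):

```python
# Minimal sketch of the Figure S5 clustering statistic. Fingerprints are
# modeled as sets of "on" bit indices; the query/group values are toy data,
# not real molecular fingerprints.

def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def max_similarity_to_group(query: set, group: list) -> float:
    """Max Tanimoto similarity from one molecule to a group of molecules."""
    return max(tanimoto(query, fp) for fp in group)

# Toy example: one molecule compared against a "same-label" group.
query = {1, 2, 3, 4}
labeled_group = [{1, 2, 3, 5}, {7, 8, 9}]
score = max_similarity_to_group(query, labeled_group)  # 3/5 vs. first member
```

Comparing this score against the analogous score for a random, equally sized subset of molecules gives the clustering measure described in the caption.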

Figure S2:
Figure S2: Example of LLM-based chemical function extraction. Patent IDs are used to retrieve the patent title, abstract, and description from Google Patents. ChatGPT is then prompted to extract the chemical function of the molecule described by the patent.
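The extraction step in Figure S2 amounts to templating retrieved patent text into an instruction for the LLM. A minimal sketch of prompt assembly; the prompt wording and `build_extraction_prompt` helper are assumptions for illustration, not the authors' exact prompt, and the actual model call (ChatGPT via an API) is omitted:

```python
# Illustrative sketch of Figure S2's prompt construction. The wording below
# is a hypothetical stand-in for the paper's actual extraction prompt.

def build_extraction_prompt(title: str, abstract: str) -> str:
    """Assemble a chemical-function extraction prompt from patent text."""
    return (
        "Below are the title and abstract of a patent describing a molecule.\n"
        "List the chemical functions of the molecule as short, "
        "comma-separated labels (e.g., 'antiviral, protease inhibitor').\n\n"
        f"Title: {title}\n"
        f"Abstract: {abstract}\n"
        "Labels:"
    )

prompt = build_extraction_prompt(
    "Macrocyclic NS3 protease inhibitors",
    "Disclosed are macrocyclic compounds useful for treating HCV infection.",
)
```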

Figure S3:
Figure S3: Most frequent patent summarizations. The most frequent patent summarizations do not immediately exhibit any dataset-independent biases. The bias towards broad treatment

Figure S4:
Figure S4: DBSCAN clustering on ada-002 text embeddings reduces the number of labels. (a) The optimal DBSCAN epsilon value was defined as the cutoff resulting in the smallest number of clusters without overtly false categories appearing (e.g., merging antiviral, antibacterial, and antifungal). The optimal epsilon was found to be 0.340 for the dataset considered herein (marked by a black star), resulting in a consolidation from 29,854 labels to 20,030 clusters. The labels in each cluster were then consolidated with ChatGPT, creating a set of 20,030 labels. (b) t-SNE of the ada-002 text embeddings, colored by the top 10 largest clusters. The largest cluster, found to consist entirely of IUPAC structural terms, was removed from the dataset to reduce excessive non-generalizable labels.
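The consolidation in Figure S4 clusters embedding vectors by cosine distance under a tuned epsilon. A self-contained sketch of that idea, using a pure-Python DBSCAN as a stand-in for a library implementation such as `sklearn.cluster.DBSCAN(metric="cosine")`; the toy vectors are illustrative, not real ada-002 embeddings:

```python
# Minimal DBSCAN over cosine distance, mirroring Figure S4's label
# consolidation. Toy 2-D vectors stand in for 1536-D ada-002 embeddings.
from math import sqrt

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def dbscan(points, eps, min_pts=2):
    """Return cluster labels (-1 = noise) under cosine distance."""
    n = len(points)
    neighbors = [
        [j for j in range(n) if cosine_distance(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisional noise; may later join as border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point, do not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:  # core point: expand the cluster
                queue.extend(neighbors[j])
    return labels

# Two tight direction groups and one outlier.
vecs = [(1, 0), (0.99, 0.1), (0, 1), (0.1, 0.99), (-1, -1)]
labels = dbscan(vecs, eps=0.05)
```

Sweeping `eps` and watching when semantically distinct groups (e.g., antiviral vs. antibacterial) start to merge is the selection criterion described in panel (a).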

Figure S6:
Figure S6: K-means clustering on molecules containing 'hcv' elucidates Hepatitis C Virus (HCV) antiviral modalities. The top 20 most frequently occurring labels were obtained for each of 8 clusters to determine their modalities (if applicable). Cluster 4 was the only cluster to contain 'nucleoside' (n=65) and 'nucleotide' (n=12) in the top 20 labels, indicating that this cluster primarily contained HCV antiviral nucleoside derivatives, likely inhibiting the NS5B polymerase. Cluster 2 contained 'protease' (n=85), 'peptide' (n=35), and 'serine' (n=15), indicating that this cluster primarily contained peptidomimetic protease inhibitors acting on the NS3 serine protease. Cluster 5 contained 'protease' (n=108), 'macrocyclic' (n=42), and 'serine' (n=8), indicating that this cluster contained macrocyclic compounds likely acting as NS3 serine protease inhibitors. Cluster 6 contained no specific mechanistic terms, alluding to the possibility that these molecules inhibit the NS5A protein.
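Once k-means has assigned each 'hcv' molecule to a cluster, the modality call in Figure S6 comes from tallying the most frequent CheF labels per cluster. A minimal sketch of that tallying step (the k-means assignment itself is not shown; the assignments and label sets below are toy data):

```python
# Sketch of Figure S6's per-cluster label inspection: count CheF labels
# within each cluster and report the top-k. Toy data, not the real dataset.
from collections import Counter, defaultdict

def top_labels_per_cluster(assignments, label_sets, k=20):
    """Map cluster id -> k most common (label, count) pairs."""
    tallies = defaultdict(Counter)
    for cluster, labels in zip(assignments, label_sets):
        tallies[cluster].update(labels)
    return {c: counter.most_common(k) for c, counter in tallies.items()}

assignments = [0, 0, 0, 1, 1]
label_sets = [
    {"hcv", "nucleoside"},
    {"hcv", "nucleoside"},
    {"hcv", "nucleotide"},
    {"hcv", "protease", "macrocyclic"},
    {"hcv", "protease"},
]
top = top_labels_per_cluster(assignments, label_sets, k=2)
```

In this toy example, cluster 0 surfaces 'nucleoside' and cluster 1 surfaces 'protease', mirroring how the caption reads mechanistic modality off label counts such as 'nucleoside' (n=65) or 'protease' (n=108).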

Figure S7:
Figure S7: Top 50 FDA-approved drugs predicted to contain the label 'hcv'. The Stage-4 approved drugs list from OpenTargets was passed through the CheF label prediction model. Results were sorted by 'hcv' probability. Relevant and high-abundance labels are displayed for clarity. Green cells represent approved-use labels from the OpenTargets page, and red cells represent no approved usage relevant to the given term.

Table S2: Random sampling from highly patented molecules would pollute the dataset labels.
Electronic Supplementary Material (ESI) for Digital Discovery. This journal is © The Royal Society of Chemistry 2024. To determine whether random sampling could be an option for including molecules with a high number of patent associations (>10,000 patents per molecule), a random sample of 10 patents from each of 11 distinct molecules was obtained. These patents were categorized as follows: the patent describes the molecule's function (correct); the patent describes the function of a final product for which the linked molecule is an intermediate (intermediate); or the patent is irrelevant to the linked molecule (irrelevant).

Table S4:
GPT-4 graph community summarizations. All labels from the ten most abundant clusters were fed into GPT-4 for categorical summarization. These outputs were verified to be representative of the labels and were further consolidated by the authors into concise categories.

Table S5: Arbitrary 20 CheF labels from each summarized co-occurrence neighborhood.
Modularity-based community detection was performed on the CheF co-occurrence graph to obtain 19 distinct communities. The communities appeared to broadly coincide with the semantic meaning of the contained labels, and the largest 10 communities were summarized to a common label. Shown are 20 random labels from the first five summarized communities.
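The co-occurrence graph behind Tables S5 and S6 connects two labels whenever they appear on the same molecule, weighted by how often. A minimal sketch of the graph-construction step, on toy label sets; community detection would then run on these weighted edges (the paper used modularity-based detection, for which e.g. networkx's `greedy_modularity_communities` could serve, though that choice of library is an assumption here):

```python
# Sketch of building the CheF label co-occurrence graph: each unordered
# label pair gets an edge weighted by its co-occurrence count across
# molecules. The molecule label sets below are toy data.
from collections import Counter
from itertools import combinations

def cooccurrence_edges(label_sets):
    """Count how often each unordered label pair co-occurs on a molecule."""
    edges = Counter()
    for labels in label_sets:
        for a, b in combinations(sorted(labels), 2):
            edges[(a, b)] += 1
    return edges

molecule_labels = [
    {"antiviral", "hcv", "protease"},
    {"antiviral", "hcv"},
    {"antibacterial", "protease"},
]
edges = cooccurrence_edges(molecule_labels)
```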

Table S6: Arbitrary 20 CheF labels from each summarized co-occurrence neighborhood.
Modularity-based community detection was performed on the CheF co-occurrence graph to obtain 19 distinct communities. The communities appeared to broadly coincide with the semantic meaning of the contained labels, and the largest 10 communities were summarized to a common label. Shown are 20 random labels from the second five summarized communities.

Table S7:
Fingerprint models benchmarked on CheF. To establish a baseline on the CheF dataset of ~100K molecules, several molecular fingerprint-based models were trained on 90% of the data and evaluated on a 10% held-out test set. Macro-average ROC-AUC and PR-AUC were calculated across all 1,522 labels. Logistic regression (LR), a random forest classifier (RFC), and a 2-layer multilayer perceptron (MLP) were trained. Parameters for LR and RFC were chosen to be common default values, whereas the MLP layer number and size were chosen through 5-fold cross-validation.
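The macro-average ROC-AUC in Table S7 is the per-label ROC-AUC averaged over all 1,522 labels. A stdlib sketch via the Mann-Whitney rank interpretation (a stand-in for `sklearn.metrics.roc_auc_score(..., average="macro")`; the scores and labels below are toy data):

```python
# Sketch of Table S7's macro-average ROC-AUC. Per label, ROC-AUC equals the
# probability that a random positive is scored above a random negative
# (ties count 0.5); the macro average is the unweighted mean across labels.

def roc_auc(y_true, y_score):
    """ROC-AUC via the Mann-Whitney rank statistic."""
    pos = [s for s, y in zip(y_score, y_true) if y == 1]
    neg = [s for s, y in zip(y_score, y_true) if y == 0]
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg
    )
    return wins / (len(pos) * len(neg))

def macro_roc_auc(y_true_per_label, y_score_per_label):
    """Unweighted mean of per-label ROC-AUCs."""
    aucs = [roc_auc(t, s) for t, s in zip(y_true_per_label, y_score_per_label)]
    return sum(aucs) / len(aucs)

# Toy case: one perfectly ranked label, one perfectly inverted label.
truths = [[1, 1, 0, 0], [1, 1, 0, 0]]
scores = [[0.9, 0.8, 0.2, 0.1], [0.1, 0.2, 0.8, 0.9]]
macro = macro_roc_auc(truths, scores)
```

Averaging per-label AUCs unweighted, as here, means rare labels count as much as common ones, which is the usual rationale for reporting macro rather than micro averages on a long-tailed label set like CheF's.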

Table S8: Structure-to-function model can retrodict functions of held-out over-patented molecules.
Early in the data pipeline, molecules with >10 patents were withheld to avoid data pollution. Shown are the top 10 predicted terms for five arbitrary over-patented molecules withheld from the dataset. Withholding over-patented molecules appears to have minimal detrimental impact on models trained on the CheF dataset, as the model can infer the function of an over-patented molecule from its less-patented derivatives.