Automated patent extraction powers generative modeling in focused chemical spaces

Deep generative models have emerged as an exciting avenue for inverse molecular design, with progress coming from the interplay between training algorithms and molecular representations. One of the key challenges in their applicability to materials science and chemistry has been the lack of access to sizeable training datasets with property labels. Published patents contain the first disclosure of new materials prior to their publication in journals, and are a vast source of scientific knowledge that has remained relatively untapped in the field of data-driven molecular design. Because patents are filed seeking to protect specific uses, molecules in patents can be considered to be weakly labeled into application classes. Furthermore, patents published by the US Patent and Trademark Office (USPTO) are downloadable and have machine-readable text and molecular structures. In this work, we train domain-specific generative models using patent data sources by developing an automated pipeline to go from USPTO patent digital files to the generation of novel candidates with minimal human intervention. We test the approach on two in-class extracted datasets, one in organic electronics and another in tyrosine kinase inhibitors. We then evaluate the ability of generative models trained on these in-class datasets on two categories of tasks (distribution learning and property optimization), identify strengths and limitations, and suggest possible explanations and remedies that could be used to overcome these in practice.


Introduction
The efficient navigation of chemical space for the design of novel candidate molecules has long been of interest to chemists and materials scientists.With the rapid surge in interest for data-driven approaches, deep generative models have emerged as an exciting avenue for inverse molecular design.[1], [2] Progress in this field has come from the interplay between training algorithms and molecular representations.Over the last few years, approaches have used autoregressive, latent variable and reinforcement learning (RL) algorithms to generate string [3]- [7], and graph [8]- [11] representations of molecules.While fully unsupervised models can be trained on large unlabeled data (for instance the 100+ million known, individually synthesized molecules from PubChem), inverse molecular design requires some form of supervision to steer generation towards high-performance molecules at the extremes of the property distribution.[12] One of the key challenges in the applicability of such inverse design models to materials science and chemistry has been the lack of accessibility to sizeable labeled training datasets in these fields.[13] Published patents are an important source of scientific knowledge since the discovery of new materials and molecular candidates are disclosed in patents, years before their publication in scientific journals.[14], [15] Patent authorities such as the United States Patent and Trademark Office (USPTO), European Patent Office (EPO), Japanese Patent Office (JPO), and World Intellectual Property Organization (WIPO) make published patents accessible through their web interfaces.In the past decade, there has been significant progress in extracting and collating information from these sources programmatically to create large databases of chemical compounds [16], and reactions [17].This large body of extracted knowledge has immense potential in feeding 'data hungry' deep learning models, but has remained relatively untapped in the field of molecular design.
Since patents are filed seeking protection within a given application, they are thematically labeled into domains.This makes it relatively simple to extract domain-specific molecular structures.Moreover, they are likely to be high-performance since they merited the investment of a patent application, which allows us to create domain-specific generative models by training exclusively on molecules known to belong to the desired class.Our hypothesis is that training generative models on these smaller, but more meaningful datasets can automatically steer generation towards in-class high-performance molecules.
All post-2001 chemistry patents published by the USPTO contain ChemDraw CDX, MDL, and TIFF files of chemical structures, as required by the Complex Work Unit (CWU) Pilot Program.[18] This makes chemical structures more accessible in a computer readable format for large scale mining and screening efforts.In our work, we attempt to bridge the gap between these bulk data sources and data-driven chemical design, by developing an automated pipeline to isolate chemical structures from USPTO patents based on relevance to user-defined keywords, and demonstrating their utility as training data for deep generative models for molecular design.We choose three model types JTVAE [9], RNN+SELFIES [19], [20], and REINVENT+SELFIES [7] to explore a variety of representations (graph, SELFIES [21], and SELFIES respectively) and training algorithms (latent variable, autoregressive, and RL respectively), and show their applicability to learn data distributions in two patent-mined datasets that explore very different areas of the chemical space, i.e., organic photodiodes (OPD) and tyrosine kinase inhibitors (TKI).
We then test the ability of these models to perform property optimization in each of the following cases: 1) the property being optimized can be predicted accurately and cheaply, 2) Oracle property predictor is expensive, so we only have access to a proxy neural network predictor trained on oracle property data.[22]- [24] In the TKI case, we optimize for high structural similarity to held-out, FDA-approved TKI molecules.This is a means to test the ability of models to optimize a robust, well-defined objective function with a relatively narrow solution space.This is an example of case 1 since we can calculate the similarity between molecules cheaply without the need for an approximator.In the OPD case, we choose our optimization objective to be the identification of organic molecules with low optical gaps.This is an example of case 2 since we approximate expensive DFT-computed optical gaps with a neural network predictor.Materials with low optical gaps, especially those that are sensitive to wavelengths of light in the near infrared (NIR) region of the spectrum have seen a growing interest due to their ability to utilize a larger portion of the solar spectral range which was previously difficult to access.Their applications are diverse ranging from military equipment to biomedical and semi-transparent devices.[25]- [28] The key observations we make through our experiments are summarized as follows: 1) We identify that patent-mined datasets offer the ability to create focused in-domain datasets of high-performing molecular structures.Training generative models on these datasets allows us to create in-domain generators that can generate novel candidates that model property distributions of the training data well.This offers a way to bootstrap focused domains of chemical space with limited human intervention.2) Property optimization towards the edges of the training data distribution can be effective if we have access to a cheap oracle predictor, but is challenging when proxy neural network approximators are used.Proxy predictors are brittle (have the tendency to be adversarially attacked in our RL experiments), and difficult to train accurately end-to-end (learning properties from compressed latent space in JTVAE is difficult).

Pipeline overview
Our overall pipeline consists of six steps: 1) Download patents from USPTO, 2) Parse chemistry patents, 3) Shortlist patents based on keywords, 4) Standardize data and add to our in-house database, 5) Property labeling for supervised property optimization tasks (DFT calculated optical gaps for OPD, and similarity to FDA-approved drugs for TKI), and 6) Generative modeling for distribution learning (unsupervised) and property optimization (semi-supervised).Figure 1 shows a diagrammatic illustration of all steps involved.We make publicly available the code utilized in steps 1, 2, 3 and 6 along with this paper (URLs provided in Section 5).Step 4 involved storage of all data in a database, followed by de-duplication of SMILES strings [29] and simple post-processing steps as described in Section S2 in the SI.A detailed description of procedures used in step 5 are provided in Section 2.3.These steps can be replaced by any form of data storage and property labeling technique depending on the chosen domain.An open source database framework similar to the one we used can be found at [30].

Patent extraction
All granted USPTO patents from 2001 onward are available for download in weekly archives from the agency's Bulk Data Storage System (BDSS) at https://bulkdata.uspto.gov/data/patent/grant/redbook/<YEAR>/.We downloaded all of these archives from the BDSS using Python scripts by March 1, 2022.The compressed file size of all downloads was approximately 1.83 TB, including between 30 and 200 GB for each individual year.Next, we filtered out all patents that did not contain molecular structures in the form of CDX files.We encountered some difficulties in this filtering step with a subset of patent years due to inconsistent formatting and directory structures in the USPTO data (please refer to Section S1 for details).
For the remaining chemistry-related patents, we used RDKit [31] to convert MOL files to SMILES strings.
The number of new, unique SMILES strings extracted per year using this method are shown in Figure 2a.
We queried all chemistry-related patents by searching for keywords in each XML file.The TKI molecules shown in Figure 2b were found using the keywords "tyrosine kinase inhibitor", and the OPD molecules in Figure 2c are the result of querying for "organic electronic", "photodiode", and "organic photovoltaic".Any Markush structures in the dataset were filled in with ethyl groups because the particular substituents for each core molecule are not stored in a structured format that could be accessed without natural language processing; this included 17% of molecules from the OPD query and 11% of molecules from the TKI query.
Thus, we generated a list of domain-relevant SMILES strings related to each set of keywords.More details on post-processing/filtering applied to the data are provided in Section S2.

Similarity calculation for TKI
Each TKI molecule was labeled with its Tanimoto similarity to Erlotinib, a held-out FDA-approved inhibitor.The Tanimoto similarity was computed over Morgan fingerprints of size 2048 and radius 2. The implementation for similarity and fingerprinting were both obtained from RDKit.While Erlotinib is the primary running example showed in this work, we also labeled molecules with similarity to the other 26 held-out inhibitors for similar experiments involving them (for ex.see Figure S1).

Generative modeling 2.4.1 Evaluation tasks
We prepared two datasets: 1) OPD -Organic Photodiodes and 2) TKI -Tyrosine Kinase Inhibitors, covering two different chemical spaces.Models trained on these datasets were evaluated on two categories of tasks: 1) Distribution Learning -The ability of models to learn the training data distribution, and 2) Property Optimization -The ability of models to generate in-domain molecules that are optimized for a property of interest.Good performance on the latter task would require some or all of the generated samples to be superior in properties in comparison to the training data distribution.
For distribution-learning tasks, we evaluated models on the GuacaMol distribution learning benchmark metrics: Validity, Uniqueness, Novelty, KL Divergence and Frechet ChemNet Distance.[40], [41].We also visualized the ground-truth property distribution of the sampled data and compared it with that of the training data.A close match between the two is an indicator of success in learning the training data distribution.For property optimization, we performed a similar visualization.Here, a shift in distribution towards higher values of the objective function is an indicator of good performance.Finally, to test the value of domain-focused training on property optimization, we compared the patent-trained models against baseline models that were trained on the ZINC dataset [42] but optimized for OPD and TKI properties.It is considered good performance if the domain-trained models generate molecules with more optimal properties than the generic model trained on the ZINC dataset.This would suggest that the structural priors imposed on the models by training on the domain-specific patent datasets reflect in more optimal properties for that domain.More specifics on the task formulation for each dataset are given below.
For OPD tasks, the patent-mined OPD molecules were used as the training dataset.The property of interest in the distribution learning tasks was the DFT-computed optical gaps of sampled molecules.Since our aim was to generate molecular candidates with low optical gaps, the negative of the optical gaps as predicted by a proxy neural network predictor was used as the objective function which was maximized in the property-optimization tasks.
For TKI tasks, the patent-mined TKI dataset was used as the training dataset.The property of interest in the distribution learning tasks was the similarity between sampled molecules and Erlotinib, an FDAapproved inhibitor, to gauge the model's ability to optimize a robust, well-defined objective function with a relatively narrow solution space.This quantity was also used as the objective function which was maximized in the property-optimization tasks.In addition to the tasks described earlier in this section, an additional distribution learning task was introduced for this dataset.Molecules sampled from models trained on TKI and ZINC datasets, and 27 held-out FDA-approved TKI molecules were projected on a 2-dimensional space with Principal Component Analysis (PCA).Samples from TKI-trained models lying closer than the ZINCtrained samples to the held-out molecules, would indicate that the models have accurately learned information about molecular structure from the training dataset.It is a way to test the utility that training on domainfocused data (TKI-patents) has over training on publicly accessible large databases (ZINC) that have a similar chemical space (drug-like molecules) but are less-focused on the domain of interest.Morgan fingerprints of size 2048 and computed with radius 2 was the molecular representation used during PCA.

Generative models
We evaluated two categories of generative models, i.e. 1) text-based and 2) graph-based, on these tasks.
RNN+SELFIES and REINVENT+SELFIES fall under the first category while JTVAE falls under the second.RNN+SELFIES was only used for distribution learning tasks, REINVENT+SELFIES was used for only property optimization tasks, and JTVAE was used for both.SELFIES was used as the string representation of choice to ensure validity of structures generated.[21] We go over some of the implementation choices for each below.
Recurrent Neural Networks (RNNs) have been shown to be simple but powerful text-based models for distribution modeling tasks in molecules [43].They are trained using an auto-regressive training strategy where the next token is predicted at every time-step.The implementation from the MOSES Benchmarking platform [19] was used with some modifications pertaining to change in representation from SMILES to SELFIES.The trained RNN can be sampled by feeding a BOS (beginning of sentence) token, and sampling the probability distribution predicted by the model autoregressively.An LSTM [20] network with 3 hidden layers and dropout probability of 0.2 between layers was used, with a final linear layer to transform the LSTM output into the required output sequence size.All LSTM hidden layers and the final linear layer were of size 768, and a learning rate of 1e-3 was used for the Adam optimizer.
Junction Tree Variational Autoencoder (JTVAE) is a graph-based generative model that learns to sequentially decode graph substructures using Message Passing Neural Networks (MPNNs), and combine them to form complete molecular structures.[9] It maintains a vocabulary of substructures decomposed from the training data, that are used during the decoding step to ensure validity of generated molecules.The model is trained by training the encoder, decoder and property predictors end-to-end with a multi-task loss function.Once trained, the latent space can be either randomly sampled or optimized by utilizing gradients from the property predictors.In both cases, the sampled latent vectors are passed through the decoder to obtain molecular candidates.A graph Message Passing Network (MPN) with 3 layers was used in the graph encoder, and a graph GRU [44] network with 20 layers was used in the tree encoder, to form a concatenated latent vector of size 56.A learning rate of 1e-3 that was set to decay exponentially during the course of training was used for the Adam optimizer [45].More details given in Section S4.
REINVENT is a policy based Reinforcement Learning (RL) approach that learns to generate molecular structures optimized with a chosen objective function.[7] Training is performed in two steps: 1) A Prior RNN is pre-trained on a language modeling task, i.e., learning to predict the next token of the sequence by maximizing the likelihood on the training dataset.2) Then, an augmented likelihood function is defined to be the sum of the Prior likelihood and a score indicating the desirability of the sequence.The agent, which is initialized with the Prior RNN weights, is then fine-tuned to minimize the squared difference between the agent likelihood and the augmented likelihood on samples drawn from the Agent RNN.Sampling from the trained model is performed in identical fashion to RNN (described in previous paragraph).We once again use SELFIES representations of molecules.The Agent RNN was composed of three GRU cells [44], each of size 512, followed by a linear output layer.Pre-training and fine-tuning were carried out using an Adam optimizer with learning rates of 1e-4 and 5e-4 respectively.We retained the same architectural choices used by Olivecrona et al. since our task of similarity-based optimization is nearly identical to the similarity guided structure generation experiments described in their work.

Distribution learning
Table 1 compares the scores of the RNN+SELFIES and JTVAE models on the GuacaMol distribution learning benchmarks.Both models were able to generate molecules with relatively high validity, uniqueness and KL Divergence scores.We however found that JTVAE is superior to RNN+SELFIES in Novelty scores, and both models perform relatively poorly on Frechet ChemNet Distance scores.These observations may both be characteristics of the training datasets that we use being smaller and more domain-focused than the larger and more diverse drug datasets that have been benchmarked on these metrics in the past.
As can be seen in sub-figures (a), (b), (d), and (e) of Figure 3, both models generated molecules whose properties matched well with the training dataset.It can also be observed that RNN+SELFIES is able to match the distributions better than the JTVAE, which conforms with the observations made by [43].
Additionally, sub-figures (c) and (f) show that samples from TKI-trained models lie closer to held-out FDAapproved inhibitors than ZINC-trained samples, which indicates that both models have been able to learn structural information from the training datasets.
From these results, we conclude that the deep generative models explored in this work are effective tools to model property distributions of arbitrary small, chemically focused, training datasets automatically extracted from the patent literature.The models can thus sample novel, in-distribution molecular structures that resemble the training data in terms of structure and properties.Furthermore, this suggests that domainspecific, focused chemical spaces can be boostrapped automatically from the literature without user-defined heuristics for the domain, as evidenced by the GuacaMol distribution learning benchmarks in two very distinct chemical spaces.

Property optimization
We evaluated generative models trained on patent-extracted, domain-focused datasets for property optimization.We evaluated REINVENT+SELFIES, which uses reinforcement learning and a string-based representation and JTVAE, which performs optimization in the latent space and decodes locally optimal molecules, under this category of tasks.We identified that property optimization tasks towards edges of the training data distribution are challenging for a variety of reasons.We observed from our RL experiments that optimizers may push the designs out of the training domain which was particularly acute when a neural network predictor was used as a proxy for the oracle property.Here, the generative model can be thought of as performing an adversarial attack on the poorly-covered areas of the predictor.From our VAE experiments, we observed that it is sometimes challenging for proxy predictors to learn properties from compressed latent representations, and the unreliable objective function thereby leads to challenges in latent space optimization.
Both these challenges arise from coupling generation and property optimization end-to-end.By instead splitting these into two separate steps of random sampling and post-hoc filtering, we observed better shifts in property histograms.More details on the post-hoc filtering approach are provided below in Section 3.2.1.

Post-hoc filter
We use the term "post-hoc filter" to refer to a property screen conducted on molecules that were randomly sampled from trained models.It can use either the predictions of a proxy predictor when the oracle property is expensive as in OPD tasks, or the oracle itself when it is cheap to compute as in TKI tasks.The degree of the filter applied (which we chose to be top 20%) can be chosen based on the extent of screen to be performed.
As a proxy predictor for OPD tasks, we trained a Chemprop MPNN model [46] on the patent-mined OPD dataset to predict DFT-calculated optical gaps (See S3).A random train-val-test split (60:20:20) was used to train, tune and evaluate the model.

Approximate objective
All OPD optimization tasks required the use of a proxy neural network model since DFT simulations are computationally expensive and are typically not autodifferentiable, so it is not possible to train end-to-end generation and property scoring.[47], [48] In the JTVAE case, a Multi Layer Perceptron (MLP) was used as a proxy predictor to predict oracle DFT-calculated optical gaps from the latent space.As can be seen from Figure 4 (a), gradient descent over the latent space in JTVAE has almost no effect in shifting property distributions away from the OPD training data towards lower optical gaps.To improve the optimization performance, we utilized the Chemprop posthoc filter to selectively isolate decoded candidates having predicted optical gaps below the the 20th percentile.This was useful in shifting the distribution towards lower optical gaps as can be seen from Figure 4(b).The justification behind this approach was that learning properties from the latent space is a more challenging task than learning directly from the molecular graph.[49] The MLP predicting the optical gap from the latent space achieves an RMSE of 0.56 eV on the test set while the Chemprop model achieves an RMSE of 0.38 eV on the test set, which follows our intuition.The fact that JTVAE learns from a multi-task loss function composed of reconstruction and property terms, makes it a constrained optimization task that reduces the degrees of freedom of the MLP during training, and can hence make convergence more challenging.We observed similar challenges with coupling generators and property optimizers while training the REINVENT+SELFIES on the OPD dataset, where the Chemprop model de-scribed above was used as the proxy predictor modelling the reward function.Here, the generator could be thought of as performing an adversarial attack on the proxy predictor and converged at molecular candidates that optimized the proxy objective but were structurally unphysical.More details on JTVAE+MLP training are provided in Section S4 and details on REINVENT results on OPD data are provided in Section S3.2.
Apart from the described issue pertaining to the poor predictive performance of the MLP, there could be other potential reasons for the failure of gradient descent on the latent space.One possibility is the presence of cascading effects.The unreliability of the MLP could have caused the points reached by gradient descent (on the latent space) to be outside the data distribution that the decoder saw during training, causing the decoder to be unreliable and collapse to a distribution more similar to the training data.One way to investigate this failure mode in the future could be the use of decoder uncertainty estimation techniques to identify such points and restrict samples to low-uncertainty regions of the decoder.[50] Another possibility is that the latent space manifold of the trained model was "rough" with respect to the MLP-predicted property, rendering optimization techniques such as gradient descent challenging.This could be investigated in more depth by evaluating the 'roughness' of the latent space with metrics such as roughness index (ROGI).[23] Therefore, it should be noted that the coupled interactions between generators and property predictors is a complicated problem, and utilizing approaches such as the post-hoc filter could be relatively simple remedies to these pitfalls even without a detailed knowledge of the failure mode.Demonstration of post-hoc filter with another model (RNN+SELFIES) is shown in Figure 4(f), which can be used as a remedy for the adversarial attack issues observed in the REINVENT example again arising from coupling of generators (RNN) and property optimizers (RL).
Finally, Figure 4(c) is a baseline where training was performed on the ZINC dataset and post-hoc filters on the OPD target was applied.It can be clearly seen that sub-figure (b) is more shifted towards optimal properties than the ZINC baselines which suggests that the structural priors imposed by training on the domain-specific OPD patent dataset offers significant value in achieving optimal properties for that domain.For example, molecules incorporating structural priors such as conjugated rings have more potential in achieving low optical gaps than drug-like structures. (

Oracle objective
In TKI optimization tasks, the property of interest was similarity to a chosen query structure, which is a cheap and oracle property estimate that can be calculated at every step of optimization.In such cases where we have access to the oracle predictor, we observed better performance on optimization tasks.A post-hoc filter using the oracle predictor can also be utilized in this case as a way to generate a set of novel candidate molecules that are optimized in comparison to the training data (Figure 4(d) and (g) for JTVAE and RNN+SELFIES respectively).Similar to the example described in Section 3.2.2,we also compared with a ZINC-trained baseline optimized for the TKI target, and observe minor improvements in shifts for the TKI-trained model in comparison to the ZINC-trained baseline (see Figure 4(d) and (e)).This difference is not as significant as the OPD-ZINC baseline since the chemical spaces of ZINC and TKI datasets are fairly similar structurally.

An alternative interpretation
The above observations from JTVAE and REINVENT+SELFIES can also be interpreted with reference to terminology introduced by [51].While Kajino et al. primarily examine the existence of biases in Reinforcement Learning settings, the terminology can conceptually be extended to other types of generative models as well.In our tasks, both generative model and property predictor were trained on the same patent-mined dataset.This could have introduced reusing bias, which stems from effectively training and evaluating our model with information drawn from the same data source.In addition, during property optimization, the property predictor often sees unrealistic/nonphysical molecules which are far away from its training data distribution.This results in a misspecification bias, caused by the unreliability of the property predictor at points far away from the training data distribution.These two components of bias might have had a role to play in the observations we made in cases where a proxy predictor was used.Oracle property models on the other hand, are free from these two forms of bias.

Conclusions
In this work, we developed a framework to automatically extract molecular structures from the USPTO patent repository based on user-defined keyword searches, and generate datasets for machine learning in chemistry.We demonstrate the utility of the extracted datasets in training generative models for inverse molecular design tasks.We show that these datasets can be utilized to generate novel molecular structures with properties similar to the training dataset, in a completely unsupervised setting.We also evaluate model performance on supervised property optimization tasks, identify some limitations of existing models in shifting property distributions away from the training data regime, and suggest some possible explanations and remedies that could be used to overcome these in practice.The key observations we make through our experiments are summarized as follows: 1) We identify that patent-mined datasets offer the ability to create focused in-domain datasets of high-performing molecular structures and offers a way to bootstrap focused domains of chemical space with limited human intervention.2) Property optimization towards the edges of the training data distribution can be effective if we have access to a cheap oracle predictor, but is challenging when proxy neural network approximators are used.

Data and code availability
The code used to train models is publicly available.Going forward, users who do their own keyword queries with our code will be unaffected by the problems we initially encountered with certain years.

S2 Processing of patent-extracted data before model training
The goal of our pipeline is to generate structures with limited domain knowledge beyond keywords, so we kept processing/filtering to a minimum except for constraints that allowed for better computational tractability and basic filters on molecular mass.For example, we applied a 1000 g/mol maximum molecular mass cutoff on the OPD dataset primarily because JT-VAE has a sequential decoding process that enumerates combinations of fragment pairs, which scales with the size of fragments and is thus very slow for large molecules.This has the added benefit of eliminating polymers and large candidates (non-ideal for deposition techniques such as chemical vapor deposition).Similarly on the TKI dataset, we imposed maximum and minimum cutoffs of 700 g/mol and 250 g/mol respectively to eliminate candidates that are not "drug-like".We apply the minimum molecular mass constraint in the TKI case since our property optimization objective was similarity to held-out FDA approved drugs whose molecular masses typically fall above 250g/mol.
Our minimal filtering means there are some structures in our training datasets that are not domainrelevant (such as reagents or intermediates).However, the "false positives" (molecules that the model generates because it thinks they are relevant, when in reality they are not relevant) that come from this can be easily filtered out by the property labeling step.Just as a user can choose their own property-labeling method appropriate for their design task when using our code, they could also insert additional domainknowledge-based preprocessing of the training dataset.Our current work demonstrates that the approach can still be useful even without this preprocessing, but additional filtering may improve results in some domains.We have provided some options for possible filters in our PatentChem code, such as minimum and maximum molecular weight and charged/neutral molecules.

Figure 1 :
Figure 1: Diagram of the workflow.Patents are downloaded from USPTO, and chemistry patents are isolated.Keyword-based search is then performed to filter relevant patents and corresponding SMILES strings.A subset of molecules chosen based on computational budget are then labeled with properties.Generative models are trained to model the data distribution, which can be sampled to suggest novel candidates.

Figure 2 :
Figure 2: Bar charts depicting number of SMILES strings extracted as a function of publishing year.Strings extracted from patents published between 2005 and 2022 (a) before keyword-based filtering, (b) after application of TKI-based keyword search and (c) after application of OPD-based keyword search.SMILES were de-duplicated after sanitization by RDKit, such that all molecules within a given year are unique, and any molecule counted in a given year will not be counted in any future years.Years 2001-2004 are not shown and years 2008-2010 are incomplete due to inconsistencies in patent formatting (see Section S1 for details).

Figure 3 :
Figure 3: Results on distribution learning tasks.(a) and (b) show the property distributions of JTVAE-sampled molecules in comparison to training data properties, on OPD and TKI datasets respectively.(d) and (e) show the same distributions for molecules sampled from RNN+SELFIES.(c) and (f) show PCA projections of molecules randomly sampled from TKI-trained and ZINC-trained models, and held-out FDA approved inhibitors, for JTVAE and RNN+SELFIES respectively.

Figure 4 :
Figure 4: Results on property optimization tasks.(a), (b), (c), (d) and (e) show results for JTVAE, while (f) and (g) show results for RNN+SELFIES.(a) and (b) show OPD property distributions of molecules sampled by gradient descent (GD), and post-hoc filter respectively in comparison to training data properties.(c) shows the property distribution obtained by applying an OPD post-hoc filter to samples drawn from a ZINC-trained model.(d) and (e) are analogous to (b) and (c) but on TKI instead of OPD.(f) and (g) are analogous to (b) and (d) but on RNN+SELFIES instead of JTVAE.
Figure 5 (c) shows the Erlotinib similarity distribution of samples generated during training of REINVENT+SELFIES, which are clearly shifted towards higher values than the training data.(a) shows sample candidates along with their similarity scores, and (b) shows the improvement of similarity score as training progresses.

Figure 5 :
Figure 5: Results based on REINVENT+SELFIES model trained on the TKI dataset.(a) Candidates generated by REINVENT towards the end of training, with structural similarity to Erlotinib being the reward function.Similarity scores are indicated below each candidate.(b) Tanimoto similarity score computed between generated candidates and Erlotinib, as a function of training iterations.(c) Histograms showing properties of candidates sampled during agent training, in comparison with the training data distribution.

Figure S1 :Figure S2 :
Figure S1: Tanimoto similarity score computed between generated candidates and FDA approved TKI molecules, as a function of training iteration

Figure S3 :
Figure S3: Comparison between DFT-calculated optical gaps and chemproppredicted optical gaps, as calculated on the test set.RMSE on test set was 0.38 eV.

SinceFigure S4 :
Figure S4: Sample molecular structures obtained from random sampling of trained RNN+SELFIES model on a) OPD b) TKI dataset

Figure S5 :
Figure S5: Sample molecular structures obtained from random sampling of trained JTVAE model on a) OPD b) TKI dataset

Table 1 :
GuacaMol distribution learning benchmarks for 1000 samples drawn from RNN+SELFIES and JTVAE, on OPD and TKI datasets.Closer to 1.0 indicates better performance.