Camille Bilodeau a, Wengong Jin b, Hongyun Xu c, Jillian A. Emerson c, Sukrit Mukhopadhyay c, Thomas H. Kalantar c, Tommi Jaakkola b, Regina Barzilay b and Klavs F. Jensen *a
aDepartment of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139, USA. E-mail: kfjensen@mit.edu
bComputer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
cDow Chemical Company, Midland, MI 48674, USA
First published on 15th November 2021
While molecular discovery is critical for solving many scientific problems, the time and resource costs of experiments make it intractable to fully explore chemical space. Here, we present a generative modeling framework that proposes novel molecules that are 1) based on starting candidate structures and 2) optimized with respect to one or more objectives or constraints. We explore how this framework performs in an applied setting by focusing on the problem of optimizing molecules for aqueous solubility, using an experimental database containing data curated from the literature. The resulting model was capable of improving molecules with a range of starting solubilities. When synthetic feasibility was applied as a secondary optimization constraint (estimated using a combination of synthetic accessibility and retrosynthetic accessibility scores), the model generated synthetically feasible molecules 83.0% of the time (compared with 59.9% of the time without the constraint). To validate model performance experimentally, a set of candidate molecules was translated using the model and the solubilities of the candidate and generated molecules were verified experimentally. We additionally validated model performance via experimental measurements by holding out the top 100 most soluble molecules during training and showing that the model could rediscover 33 of those molecules. To determine the sensitivity of model performance to dataset size, we trained the model on different subsets of the initial training dataset. We found that model performance did not decrease significantly when the model was trained on a random 50% subset of the training data but did decrease when the model was trained on subsets containing only less soluble molecules (i.e., the bottom 50%). Overall, this framework serves as a tool for generating optimized, synthetically feasible molecules that can be applied to a range of problems in chemistry and chemical engineering.
To this end, instead of learning to traverse this vast chemical space, researchers have begun to pursue the inverse molecular design problem:4,5 that is, given a desired property (or set of properties) f(x), what is the molecule x that will optimize f(x)? In particular, researchers have focused on applying recent advances in deep generative modeling to molecular design.6–11 One caveat to these studies, however, is that they focus on optimizing properties that can be predicted, such as logP (partition coefficient) or QED (quantitative estimate of drug-likeness).12–14 While this provides a convenient way to quickly evaluate properties for large numbers of molecules, it avoids many of the practical challenges associated with applying generative modeling to real applications in chemistry and chemical engineering. Most real molecular design applications involve optimizing properties that are experimentally measured and therefore require synthesizing and experimentally evaluating candidate molecules. Experimental datasets are typically smaller and contain more error than computationally generated datasets. Additionally, in order to test newly generated molecules experimentally, it is necessary to consider their synthetic feasibility.
In this paper, we present a generative modeling framework and explore how this framework performs in an applied setting. Our framework employs graph-to-graph translation, which has been shown to outperform SMILES (simplified molecular input line entry system)-to-SMILES translation models on logP and QED optimization tasks,15 to translate a given molecular graph into an improved one by training on pairs of "less optimal" and "more optimal" molecules. Here, we insert a graph-to-graph translation model into an automated workflow in which the model is first trained and then used to translate all molecules in the initial dataset. This workflow is then repeated multiple times in order to 1) train a robust translator that can be applied to a given set of candidate molecules for lead optimization and 2) discover new, highly optimized molecules.
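As a rough illustration of this loop, the minimal sketch below chains training, translation, and dataset augmentation. The callables (build_pairs, train_translator, predict_solubility) are hypothetical placeholders supplied by the user, not components released with this work.

```python
# Hypothetical sketch of the iterative train-translate-augment workflow.
# All helper callables are user-supplied placeholders, not the paper's code.
from typing import Callable, List, Set, Tuple

def iterative_discovery(
    dataset: Set[str],                                          # SMILES strings
    build_pairs: Callable[[Set[str]], List[Tuple[str, str]]],   # pairing step (e.g., Fig. 1a)
    train_translator: Callable[[List[Tuple[str, str]]], Callable[[str], str]],
    predict_solubility: Callable[[str], float],                 # property predictor (log S)
    n_iterations: int = 3,
) -> Set[str]:
    for _ in range(n_iterations):
        translator = train_translator(build_pairs(dataset))
        improved = set()
        for parent in dataset:
            child = translator(parent)
            # keep a translation only if the predictor considers it more soluble
            if predict_solubility(child) > predict_solubility(parent):
                improved.add(child)
        dataset = dataset | improved  # augment the pool before the next iteration
    return dataset
```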
We focus on the problem of designing molecules with high aqueous solubility because this property is critical for a wide variety of chemical applications ranging from geochemistry to drug design. We train our model on a dataset of experimentally obtained aqueous solubility measurements that have non-negligible, known experimental error.16 We then evaluate our model based on its ability to discover novel molecules with high aqueous solubility that are also synthetically feasible. We assess our model's ability to improve candidate molecules in an experimental setting by measuring the solubility of 3 candidate molecules and 4 molecules generated by translating those candidates. We also evaluate our model's ability to generate molecules that are more soluble (based on experimental measurements) than any in the training data by holding out the top 100 most soluble molecules during training. Additionally, to develop an understanding of how our model performs on small datasets, we train our model on multiple subsets of the training data and evaluate its performance. Finally, we explore how chemical structure and properties change during translation and identify the types of chemical transformations that the model uses to optimize molecules. In this way, we illustrate how our approach works given the following challenges: 1) training using an experimentally obtained dataset with non-negligible error, 2) constraining the model to generate synthetically feasible molecules, 3) evaluating our model using experimental measurements, and 4) evaluating the sensitivity of model performance to the size of the training dataset. This framework serves as a general approach for generating optimized molecules that can be reapplied to discovering new molecules for a range of applications in chemistry and chemical engineering. Additionally, this work is one of the first examples of training and testing generative modeling in an experimental setting17 (with one exception18) and, to our knowledge, the only work that thoroughly explores the roles of experimental error, synthetic feasibility, and training dataset size in molecular discovery using generative modeling.
The graph-to-graph translation model learns to translate a given molecule (referred to as X) into a structurally related molecule with improved solubility (referred to as Y) by learning the relationship between the pairs of molecules, (X,Y), in the training set. The transformation learned by the model depends on how molecules are selected to construct pairs. In this work, we used the pairing algorithm described in Fig. 1a to pair molecules subject to two constraints: 1) molecules must be paired such that their difference in solubility is greater than a threshold value, log S(Y) − log S(X) > α, and 2) molecules must be paired only with structurally similar molecules.
To achieve these goals, we first grouped molecules that were structurally similar to one another based on whether or not they shared a common scaffold, identified using the Bemis–Murcko algorithm.19 For all molecules sharing a common scaffold, we assigned molecules with solubility below the median to be X molecules and molecules with solubility above the median to be Y molecules. Any molecules that did not share a scaffold with any other molecules were removed from the dataset. Then, every X molecule was paired randomly with npairing different Y molecules, subject to the constraint log S(Y) − log S(X) > α. Similarly, every Y molecule was paired randomly with npairing different X molecules, and any duplicate pairs were removed. This process had two hyperparameters, α and npairing. α enforces the minimum change in solubility for a given pair; in this work, we set it equal to the mean absolute error (MAE) of the predictor model, 0.78, to ensure that every pair of molecules had a solubility difference greater than the error of the predictive model. npairing determines how many times each molecule is paired and was set to 20 in this work. Generally, we observed that pairing with fewer molecules resulted in a smaller training set, reducing model performance, while pairing with a greater number of molecules resulted in overrepresentation of a few highly soluble molecules in the dataset, biasing the learned transformation toward those few molecules. We therefore expect that the choice of this hyperparameter may change for different applications and will be related to the distribution of molecules in the training set.
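A minimal sketch of this scaffold-based pairing step is shown below, assuming a dict mapping SMILES to experimental log S. Function and variable names are illustrative, and the molecular-weight minimum mentioned later is omitted for brevity.

```python
# Illustrative sketch of the pairing procedure (Fig. 1a), not the paper's exact code.
import random
from collections import defaultdict
from statistics import median

from rdkit.Chem.Scaffolds import MurckoScaffold

def build_pairs(log_s: dict, alpha: float = 0.78, n_pairing: int = 20):
    # group molecules by Bemis-Murcko scaffold (acyclic molecules all map to "")
    groups = defaultdict(list)
    for smiles in log_s:
        groups[MurckoScaffold.MurckoScaffoldSmiles(smiles=smiles)].append(smiles)

    pairs = set()
    for members in groups.values():
        if len(members) < 2:
            continue  # molecules sharing no scaffold with others are dropped
        cutoff = median(log_s[s] for s in members)
        xs = [s for s in members if log_s[s] < cutoff]    # less-soluble X molecules
        ys = [s for s in members if log_s[s] >= cutoff]   # more-soluble Y molecules
        for x in xs:
            # pair each X with up to n_pairing Y molecules with log S(Y) - log S(X) > alpha
            valid = [y for y in ys if log_s[y] - log_s[x] > alpha]
            pairs.update((x, y) for y in random.sample(valid, min(n_pairing, len(valid))))
        for y in ys:
            valid = [x for x in xs if log_s[y] - log_s[x] > alpha]
            pairs.update((x, y) for x in random.sample(valid, min(n_pairing, len(valid))))
    return list(pairs)  # duplicates are removed automatically by the set
```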
In order to improve the synthetic feasibility of the generated molecules, the model was trained for an additional constrained iteration. Molecules entering the pairing stage of this iteration were removed if they had a synthetic accessibility (SA) score23 of 3.5 or greater or a retrosynthetic accessibility (RA) score24 of less than 0.8. This constraint was also applied during the translation evaluation phase. The primary improvements made here to the previously published iterative algorithm were 1) an automated approach for pair list generation, 2) addition of pair list generation at the start of every iteration, 3) addition of a molecular weight minimum into the pairing process, and 4) incorporation of an additional synthesizability constraint, based on the synthetic accessibility score and the retrosynthetic accessibility score.
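A sketch of this SA/RA filtering step is shown below. The SA score ships with RDKit's contrib modules; the RA score comes from the separately distributed RAscore package (reymond-group/RAscore), whose exact import path and constructor arguments may differ from what is assumed here.

```python
# Hedged sketch of the synthesizability filter applied before the constrained iteration.
import os
import sys

from rdkit import Chem
from rdkit.Chem import RDConfig

sys.path.append(os.path.join(RDConfig.RDContribDir, "SA_Score"))
import sascorer  # RDKit contrib implementation of the SA score

from RAscore import RAscore_NN  # assumed API of the RAscore package

ra_model = RAscore_NN.RAScorerNN()

def is_synthesizable(smiles: str, sa_max: float = 3.5, ra_min: float = 0.8) -> bool:
    """Keep a molecule only if SA score < 3.5 and RA score > 0.8."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    sa = sascorer.calculateScore(mol)  # lower SA score = easier to synthesize
    ra = ra_model.predict(smiles)      # predicted probability that CASP finds a route
    return sa < sa_max and ra > ra_min

# Example: filter the dataset before pairing in the constrained iteration
# feasible = [s for s in dataset_smiles if is_synthesizable(s)]
```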
Once the solutions were prepared, the vials were loaded into an 8 × 12 plate and placed in a custom-built phase identification and characterization apparatus (PICA II).27 The formulations were imaged at 20 °C: each vial was removed from the plate by the robot arm, and a digital image was collected. After imaging, the vials were shaken at 695 rpm for 15 s, and a second image was captured post-shake. The images were automatically stored in a fully searchable database and indexed according to a specific library identification number, an automatically assigned experiment number, and the position within the plate. The post-shake images were then analyzed to classify each solution as "soluble" (clear) or "insoluble" (cloudy/phase separated), as shown in Fig. 2.
While model performance after a single iteration was relatively good on average (an average predicted solubility improvement of 1.02 log S), it was found to be sensitive to the initial solubility of the molecules. One reason for this behavior is that the decoder has a limited ability to generate molecules that differ significantly from the training data. A second, more heuristic reason is that for a molecule with a high starting solubility, there may be few valid examples of related molecules with higher solubility, making the generation problem much more difficult. These challenges are expected to arise when applying this approach to other applications as well, with the extent of model sensitivity depending on the nature of the training dataset.
To address these challenges, the graph-to-graph translator was used to augment the initial dataset for three iterations (described in Methods). As shown in Fig. 3a–c, models trained in subsequent iterations performed significantly better than the initial model. In particular, molecules near the 75th percentile of the initial solubility distribution (with an average solubility of −1.75 log S) improved in 87.2% of cases and on average by 0.99 log S units by the third iteration. Molecules in the top 25 of the starting dataset were expected to be challenging to translate because the model was not provided with any examples of molecules with a higher solubility. While after the first iteration the model on average translated these molecules in the negative direction (indicated by the light blue x on the far right in Fig. 3a), by the third iteration the model translated these molecules in the positive direction 56.4% of the time (Fig. 3b). Thus, by training the graph-to-graph translation model in an iterative workflow, model performance was improved, resulting in successful translations even for the most challenging evaluation sets.
Alternatively, a number of scores have been proposed to rapidly (and approximately) evaluate the synthetic feasibility of a given molecule. One score, referred to as the synthetic accessibility (SA) score,23,28 estimates synthetic feasibility by penalizing molecules containing uncommon structural features and prioritizing molecules containing substructures common in existing databases such as PubChem, ChEMBL, or ZINC. While this approach has been shown to estimate synthetic feasibility reasonably well, its simplicity leaves it vulnerable to being misled by complex molecules containing many reasonable-looking fragments or by simple molecules containing uncommon fragments.29 An alternative score, referred to as the retrosynthetic accessibility (RA) score,24 instead estimates synthetic feasibility using a machine-learned model trained on predictions from a CASP model (AiZynthFinder). Importantly, this model consists of a deep neural network that takes the full molecular structure as input, allowing it to learn a more nuanced definition of synthetic feasibility. One caveat to the RA score, however, is that it relies directly on a CASP model, and therefore any shortcomings of that model will also be present in the RA score.
In this study, we evaluate the synthetic feasibility of our generated molecules using a combination of the SA and RA scores, where molecules that have SA score < 3.5 and RA score > 0.8 are taken to be synthetically feasible (thresholds were estimated based on the scores corresponding to molecules in the training data). Fig. 4a shows the distribution of the SA and RA scores of the initial training data, divided into four quadrants. As expected, most of the training data (86.7%) are considered synthetically feasible with respect to both scores, while only a small subset of the data (3.6%) is considered synthetically infeasible by both scores. Additional small subsets of the training data (3.6% and 6.1%) are considered feasible by only one of the two scores.
Fig. 4b shows the distribution of the SA and RA scores of the generated molecules after three iterations of training. While the number of unique molecules in our dataset grew from 4585 molecules in the initial training dataset to 127 094 molecules after the third iteration, only 59.9% of those molecules were synthetically feasible with respect to both the SA and RA scores (Fig. 4b, lower-right quadrant). To address this challenge, we performed an additional, constrained iteration of model training by removing any molecules with SA score ≥ 3.5 or RA score < 0.8 from the dataset (described in Methods). This reduced the number of unique molecules used for training in the constrained iteration to 64 795. As shown in Fig. 4c, after the constrained iteration, the dataset (now 176 321 unique molecules) contained 83.0% feasible molecules and 3.4% infeasible molecules (evaluated with respect to both scores). 8.1% of molecules had a good RA score but a poor SA score, and 5.5% of molecules had a good SA score but a poor RA score. In this way, performing an additional constrained iteration achieves SA and RA score distributions similar to those of the initial training data. Examples of molecules in each quadrant are shown in Fig. 5.
Even after constraining the model, a non-negligible subset (13.6%) of molecules in the dataset satisfies the synthetic feasibility constraint with respect to only one of the two scores (Fig. 4c). Additionally, as described earlier, both scores have been shown to fail under different circumstances when estimating synthetic feasibility.24,29 To explore the robustness of our approach, we randomly selected 100 molecules from each of the four quadrants in the final dataset and used a CASP model called ASKCOS26 to predict retrosynthetic routes. While ASKCOS (or any CASP model) is not a true determinant of which molecules are synthetically feasible, it does offer an automated, conservative metric. Importantly, we chose a different CASP model from the one used to develop the RA score to ensure that our three approaches to measuring synthetic feasibility were independent. We found that with an expansion time of 4 minutes, ASKCOS was able to identify retrosynthetic trees for 46 of the 100 molecules with both good SA and good RA scores, 12 of the molecules with good SA and bad RA scores, 9 of the molecules with good RA and bad SA scores, and 2 of the molecules with bad SA and bad RA scores (additional details can be found in the ESI†). Thus, although ASKCOS offers a more conservative view of synthetic feasibility, it is more likely to identify synthetic pathways for molecules with both good SA and RA scores. We note that a larger number of retrosynthetic trees would likely be found with a longer expansion time.
In addition to evaluating the effectiveness of the SA/RA score constraints in improving synthetic feasibility, it is important to quantify the effect of these constraints on model performance. As shown in Fig. 4d–f, the model achieved equal or better performance after the constrained iteration (orange) compared with model performance after only three iterations (blue). This trend was consistent across each method of measuring improvement: average solubility improvement, percent improved, and percent improved beyond the MAE of the predictive model. This illustrates that introducing synthetic feasibility score-based constraints is an effective strategy for generating optimized, synthetically feasible molecules.
When searching for optimized molecules, it is often desirable to generate molecules that are diverse and novel. Here, we define diversity as:
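Assuming the convention common in prior graph-to-graph translation work (one minus the mean pairwise Tanimoto similarity over the generated set), a plausible form is

$$\mathrm{diversity} \;=\; 1 \;-\; \frac{2}{n(n-1)}\sum_{i<j}\mathrm{sim}\!\left(Y_i,\,Y_j\right),$$

where sim(Yi, Yj) is the Tanimoto similarity between the Morgan fingerprints of generated molecules Yi and Yj, and n is the number of generated molecules.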
While we expect that the diversity and novelty of molecules generated using this approach will be somewhat application-specific, we note that our model's performance with respect to these metrics is in the same range as the equivalent metrics for other recently reported generative models.31
Fig. 7d and e show the performance of the model trained for three iterations on each of the subsampled datasets relative to the full dataset. The model performed significantly worse when trained on any dataset comprising the bottom portion of the distribution (shown in red) than on random subsets of the distribution (shown in green). The model trained on the 50% randomly subsampled dataset (2110 molecules) performed as well as the model trained on the full dataset across all three metrics. The models trained on the 20% and 10% randomly subsampled datasets (818 and 495 molecules, respectively) performed worse than the model trained on the original dataset, but still better than the models trained on the bottom 50% or 20% of the data (1734 and 583 molecules, respectively). This illustrates that the performance of the translator is more sensitive to the distribution of the training data than to its size.
While the translation performance of the model is fairly insensitive to dataset size, its ability to discover new, highly soluble molecules is more sensitive. Fig. 7f shows the solubilities of the top 10 molecules discovered after three iterations by models trained on each subset of the dataset. The solubilities of the most soluble discovered molecules increased considerably with dataset size, and the top ten molecules discovered by the model trained on the full dataset are all more soluble than any of the molecules discovered by the models trained on subsets.
Table 1

| Selected molecules | Predicted solubility (log S) | AqSolDB data (log S) | Experimental solubility, low end (wt%) | Experimental solubility, high end (wt%) | Experimental solubility, low end (log S) | Experimental solubility, high end (log S) |
|---|---|---|---|---|---|---|
| Toluene* | −1.80 | −2.21 | 0.1% | 0.4% | −1.85 | −1.33 |
| 4-Fluoroaniline* | −0.34 | −0.53 | 1.0% | 2.8% | −1.06 | −0.59 |
| Dodecane | −7.02 | — | — | 0.2% | — | −1.98 |
| Methyl butyrate* | −0.71 | −0.83 | 1.1% | 3.6% | −0.95 | −0.45 |
| 4-Chlorobutyric acid | 0.09 | — | 13.8% | 32.2% | 0.06 | 0.45 |
| 2-Heptanone | −1.25 | — | 0.5% | 0.9% | −1.39 | −1.10 |
| N-Butoxymethyl acrylamide* | −0.44 | −0.65 | 1.0% | 3.0% | −1.21 | −0.72 |
| Ethyl formate | 0.25 | — | 7.1% | 14.9% | −0.02 | 0.30 |
| Octamethyltrisiloxane | −6.83 | — | — | 0.2% | — | −2.11 |
| Pinacolone* | −1.58 | −0.72 | 0.4% | 0.9% | −1.37 | −1.05 |
| Dimethyl glutarate | −0.37 | — | 2.9% | 6.5% | −0.74 | −0.39 |
| N-Methylformamide | 1.20 | — | 94.9% | — | 1.21 | — |

*Molecule present in the AqSolDB training dataset.
Solubilities of the 12 molecules (4 parent + 8 generated) were experimentally measured by formulating aqueous solutions at target solute concentrations of 0.2, 0.5, 1, 3, 7, 15, 34, and 95 wt%. The cloudiness of each solution was determined by image analysis, and each solution was assigned as clear (soluble) or cloudy/separated (insoluble) (further details can be found in the ESI†). Using this analysis, the "Low End" in Table 1 represents the highest concentration for which the solution appeared clear, and the "High End" represents the lowest concentration for which the solution appeared cloudy. To compare the experimentally measured solubilities with the predicted values in mol L−1, we converted the concentrations from wt% to mol L−1 by neglecting the volume change on mixing between water and the solute molecules. We note that the assumption of ideal mixing does not hold, especially at high concentrations; however, the deviation is not significant for solubility comparisons on a log scale, as even a 10% volume change on mixing alters the solubility by only 0.046 (|log(0.9)|) in log S.
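For concreteness, a minimal sketch of this conversion under the additive-volume (ideal mixing) assumption is shown below; the solute density and molecular weight are user-supplied inputs, and the exact convention used in the paper may differ slightly.

```python
# Sketch of the wt% -> log S conversion, assuming volume-additive (ideal) mixing.
import math

def wt_pct_to_logS(wt_pct: float, mw_solute: float, rho_solute: float,
                   rho_water: float = 0.998) -> float:
    """Convert a solubility in wt% to log10 of molar concentration (mol L^-1).

    wt_pct     : solute mass percent of the solution
    mw_solute  : solute molecular weight (g mol^-1)
    rho_solute : solute density (g mL^-1)
    rho_water  : water density at ~20 C (g mL^-1)
    """
    m_solute = wt_pct             # grams of solute per 100 g of solution
    m_water = 100.0 - wt_pct      # grams of water per 100 g of solution
    # volume-additive estimate of the solution volume, in litres
    volume_L = (m_solute / rho_solute + m_water / rho_water) / 1000.0
    conc = (m_solute / mw_solute) / volume_L  # mol L^-1
    return math.log10(conc)

# Example: a 0.1 wt% aqueous solution of toluene (MW ~92.14 g/mol, density ~0.867 g/mL)
# works out to roughly log S = -2.0.
print(wt_pct_to_logS(0.1, 92.14, 0.867))
```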
For validation of our experimental method, the solubilities of the five molecules that were present in the AqSol training set are shown in Fig. 8a. The discrepancy between the AqSol data and the predicted solubility ranges from 0.12 to 0.86 log S. In comparison, the majority of the experimentally measured solubility ranges either overlap with or are very close to the AqSol database values. Sorkun et al.16 report that the approximate average standard deviation in the AqSol database is roughly 0.50 log S; in addition, the experimental error in our solubility measurements should also be considered. For example, the chemicals were used as received, so impurities and residual moisture may have affected the results. The ambient temperature can also contribute to slight differences in solubility. Given that the discrepancy between the experimental results and the database values is not larger than the MAE of the model (0.78 log S) or the estimated deviation in the AqSol dataset (∼0.5 log S), the experimental method was found to be sufficiently robust to validate our generative model.
Fig. 8b shows the predicted and experimental solubilities for the newly generated molecules (not present in the AqSol dataset). In each case, the generated molecules were shown to have solubility ranges that were greater than (and non-overlapping with) those of their parent molecules. This illustrates that our translation model can generate molecules that are not only predicted to be more soluble, but also experimentally shown to be more soluble.
On average, after one iteration, the model learned to add hydroxyl groups and remove thiol groups in order to increase the solubility of input molecules. Interestingly, in subsequent iterations, the model relied more heavily on these chemical groups, adding hydroxyl groups and removing thiol groups with greater frequency. Similarly, the model learned to use sulfate groups to increase the solubility of molecules, but because sulfate groups were less common in the starting dataset, it took multiple iterations for the model to consistently leverage this moiety. In fact, even by the third iteration, the model on average translated the top 25 evaluation set to a set of molecules with fewer sulfate groups. Thus, it appears that by training on an augmented dataset, the model progressively learns how specific chemical groups can be used to improve solubility.
Another common approach for making a molecule more soluble is reducing its molecular weight. Fig. 9d shows the average molecular weight of molecules translated by the model after each iteration. Despite the fact that the model on average improved solubility to a greater degree in every subsequent iteration, the average molecular weight of the translated molecules increased after every iteration. This is contrary to what would be expected and suggests that after each iteration the model better learns how to balance the tradeoff between molecular weight and the addition of highly soluble chemical groups. We note that, as described earlier, to avoid the model translating molecules into small molecules that are likely to be gases at room temperature, we imposed a lower bound of 50 g mol−1 on the molecular weight of translated molecules.
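As a rough illustration of how such functional-group and molecular-weight trends can be tallied, the sketch below counts a few of the groups discussed above using approximate SMARTS patterns; the patterns and example molecules are illustrative rather than the paper's exact analysis.

```python
# Illustrative RDKit sketch: count selected functional groups (approximate SMARTS)
# and report molecular weight for a molecule before and after translation.
from rdkit import Chem
from rdkit.Chem import Descriptors

PATTERNS = {
    "hydroxyl": Chem.MolFromSmarts("[OX2H]"),
    "thiol": Chem.MolFromSmarts("[SX2H]"),
    "sulfate/sulfonate": Chem.MolFromSmarts("S(=O)(=O)[O]"),
}

def group_counts(smiles: str) -> dict:
    """Return approximate functional-group counts and the molecular weight."""
    mol = Chem.MolFromSmiles(smiles)
    counts = {name: len(mol.GetSubstructMatches(patt)) for name, patt in PATTERNS.items()}
    counts["mol_wt"] = Descriptors.MolWt(mol)
    return counts

# Example: toluene vs. benzyl alcohol (one added hydroxyl group)
print(group_counts("Cc1ccccc1"))   # toluene
print(group_counts("OCc1ccccc1"))  # benzyl alcohol
```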
To illustrate these phenomena, Fig. 10 shows an example translation of an input molecule, N-(4-bromo-2-methylphenyl)acetamide, by the model after three iterations. As can be seen in the figure, instead of reducing the size of this molecule, the model appends additional chemical groups, including carboxylates, sulfates, hydroxyl groups, and others. Thus, each of the translated molecules has a higher molecular weight than the starting molecule. Adding such groups while also trying to minimize molecular weight often results in unusual molecules that cannot be synthesized (or at least not easily). It is therefore useful to include synthetic feasibility as an additional constraint in order to guide the model toward feasible, optimized molecules.
Footnote
† Electronic supplementary information (ESI) available: Architecture and training details for the hierarchical graph neural network (HGNN) and the directed message passing neural network (DMPNN), details regarding the experimental approach used for quantifying aqueous solubility, examples of molecules with good/bad SA/RA scores, and details on representative molecules used in ASKCOS retrosynthesis. See DOI: 10.1039/d1re00315a