Kianoosh Sattari,a Dawei Li,a Bhupalee Kalita,d Yunchao Xie,a Fatemeh Barmaleki Lighvan,e Olexandr Isayev d and Jian Lin *abc
aDepartment of Mechanical and Aerospace Engineering, University of Missouri, Columbia, Missouri 65211, USA. E-mail: LinJian@missouri.edu
bDepartment of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri 65211, USA
cDepartment of Physics and Astronomy, University of Missouri, Columbia, Missouri 65211, USA
dDepartment of Chemistry, Carnegie Mellon University, Pittsburgh, PA 15213, USA
eDepartment of Biological Sciences, Southern Illinois University Edwardsville, Edwardsville, IL 62026, USA
First published on 10th January 2024
De novo design of molecules with targeted properties represents a new frontier in molecule development. Despite enormous progress, two main challenges remain: (i) generating novel molecules conditioned on targeted, continuous property values; and (ii) obtaining molecules with property values beyond the range of the training data. To tackle these challenges, we propose a reinforced regressional and conditional generative adversarial network (RRCGAN) to generate chemically valid molecules with targeted HOMO–LUMO energy gaps (ΔEH–L) as a proof-of-concept study. As validated by density functional theory (DFT) calculations, 75% of the generated molecules have a relative error (RE) of <20% with respect to the targeted ΔEH–L values. To bias the generation toward ΔEH–L values beyond the range of the original training molecules, transfer learning was applied to iteratively retrain the RRCGAN model. After just two iterations, the mean ΔEH–L of the generated molecules increases to 8.7 eV from the 5.9 eV mean of the initial training dataset. Qualitative and quantitative analyses reveal that the model has successfully captured the underlying structure–property relationship, which agrees well with established physical and chemical rules. These results present a trustworthy, purely data-driven methodology for the highly efficient generation of novel molecules with different targeted properties.
High-throughput virtual and experimental screening (HTVS and HTES) methods have emerged to accelerate molecule discovery in the past three decades.4 They iteratively generate, synthesize, and evaluate molecules from an enormous library of molecular fragments by combinatorial enumeration.4–7 Although they accelerate examination of the design space by 3–5 orders of magnitude, their coverage and success rates still fall far short of what is needed to discover a sufficient number of novel molecules.4 In addition to HTVS and HTES, global optimization (GO) strategies such as genetic algorithms have made much progress in identifying top-ranked molecules,8 since they can efficiently screen molecules with high-ranking scores from a fraction of the possible candidates. However, GO strategies require prior rules on how to transform the molecular fragments, which greatly restricts the number of molecules that can be explored. Moreover, their accuracy dramatically decreases as structural complexity increases.9 Finally, many evolution steps are required to obtain the optimal candidates, making them unsuitable for on-demand generation of novel molecules with targeted properties.
Recently, machine learning (ML) algorithms, particularly deep learning (DL), have been applied to discover novel molecules since they can learn hidden knowledge from large-scale data.10 For instance, they have been widely implemented to assist or even substitute theoretical simulations in HTVS of molecules for photovoltaics,11 photocatalysis,12 and antimicrobial applications.13 They have also been applied as generative models (GMs) for inverse molecule design. A GM-based inverse design process starts by mapping the high-dimensional representations of the molecules to reduced latent vectors, which are then used to search for or optimize new molecules. GMs can identify hidden patterns in highly complex, nonlinear data in an automatic and on-demand fashion without much prior knowledge, creating non-intuitive, even counterintuitive, molecules that outperform empirically designed ones. Thus, they are well suited for exploratory optimization problems in an unrestricted design space. For instance, variational autoencoders (VAEs),14 generative adversarial networks (GANs),15 reinforcement learning (RL),16,17 and recurrent neural networks (RNNs),18,19 or integrations of these networks into new architectures, have made inverse molecule design more and more feasible.20,21 Gómez-Bombarelli et al. employed a VAE to map discrete representations of molecules to continuous ones, making gradient-based search of the chemical space possible.2 Grammar variational autoencoders (GVAEs),22 which represent molecules as parse trees from a context-free grammar, have been applied to multi-property optimization.23,24 However, VAEs lack a mechanism for de novo generation of molecules conditioned on targeted, continuous property values: if the value of interest changes, the model must be retrained. An RNN was proposed to generate molecules with targeted bioactivities but yielded inaccurate property values compared to the targeted ones.18 Popova et al. 
proposed an RNN-based generative model within an RL framework to generate compounds with targeted melting temperatures.19 It generates compounds whose properties follow those of the training molecules; nevertheless, it still does not achieve on-demand generation for targeted property values. Similar problems exist in other proposed models for molecule design.25–28 Thus, on-demand generation algorithms that can target different values of a property of interest are highly desired.29
Two proof-of-concept GANs, ORGAN30 and ORGANIC,31 were introduced to generate novel molecules, but their generation is not conditioned on physicochemical or biological properties with quantitative, continuous labels. Instead, the property values of the molecules generated by these models only follow the distribution of the training samples; in other words, they do not correspond to targeted, specific property values. Our group recently proposed a regressional and conditional GAN (RCGAN) for the inverse design of two-dimensional materials.9 RCGAN can meet two criteria for inverse material design: (1) generating new structures that are novel compared to the training molecules; (2) generating structures conditioned on input quantitative, continuous labels. However, crafting a GM that has both regressional and conditional capabilities for molecules, with their significantly larger input dimensions, presents a greater challenge. Hong et al. introduced a framework that combined a GAN and a VAE but did not use the targeted values as input to the generator.32 They incorporated the target property information into the latent vector only during the decoding process; consequently, in the encoding phase, the latent space is not associated with the property. This approach worked for simple properties such as drug-likeness (QED) and the water–octanol partition coefficient (logP), which are over-represented in the generative literature due to their ease of calculation and data abundance. However, such a methodology may not work for more complicated properties such as the energy gap, which is not linearly related to molecular structure and needs to be obtained from quantum calculations. Additionally, all these GAN-based models generate structures with targeted properties within the range of the initial training molecules, a task known as interpolation. 
To the best of our knowledge, a GM that can perform the extrapolation task of generating molecules with targeted properties beyond the range of the training dataset has rarely, if ever, been reported.
To tackle this challenge, we propose a deep generative framework that integrates a reinforced RCGAN (RRCGAN) architecture with a transfer learning algorithm that iteratively updates RRCGAN to generate molecules with targeted quantitative, continuous property values beyond those of the initial training dataset. RRCGAN includes an autoencoder (AE), an RCGAN network, and a reinforcement center. The AE encodes the discrete representations of the molecules into continuous latent vectors, which are then fed as input to RCGAN. RCGAN includes generator, discriminator, and regressor networks. The reinforcement center biases RCGAN towards generating valid and accurate molecules, resulting in RRCGAN. During the model's training phase on the initial dataset, it learns to discern the intricate relationships that connect molecular structures to specific properties. Nonetheless, deploying such a trained model to generate molecules with extreme property values beyond the training distribution's boundaries is often impractical due to the inherent limitations posed by the small-data challenge.33 To address this challenge, we applied transfer learning to iteratively fine-tune RRCGAN to generate new molecules with increased property values compared to those of the initial training data. As a proof of concept, we employed RRCGAN to generate small molecules with a targeted gap between the highest occupied molecular orbital (HOMO) and the lowest unoccupied molecular orbital (LUMO) (ΔEH–L). Molecules with varying ΔEH–L can be tailored for applications in electronics, optoelectronics, and energy conversion and storage. In this work, we first trained RRCGAN on ∼132 thousand molecules whose ΔEH–L values are distributed from 1.05 to 10.99 eV in the PubChemQC database.34 The model was then fine-tuned to generate new molecules with much higher ΔEH–L values than the ones in the PubChemQC database. 
The novelty of this iterative generative algorithm can be summarized as follows. First, the generated molecules are novel and diverse. Second, the model is conditional and regressional, and can generate molecules with targeted, continuous labels in batch mode. Third, the generation is purely data-driven and can be extrapolated beyond the range of the initial training dataset via transfer learning.
In this work, RCGAN has a generator, a discriminator, and a regressor network. RCGAN learns the hidden relationship between the latent vectors and the properties of the training molecules to generate new latent vectors conditioned on targeted ΔEH–L values (continuous labels), which are then converted to SMILES strings using the decoder (Fig. 1b). The regressor is modified from Google Inception V2 (ref. 37) (Fig. S4†). It is built as a quantitative structure–property relationship (QSPR) model for predicting ΔEH–L. To generate a latent vector conditioned on a ΔEH–L value, the generator receives a concatenated vector (129 × 1) consisting of a targeted ΔEH–L and a randomly generated 128 × 1 noise vector z (Fig. S5†). In contrast to RNN-based models that generate one token at a time based on previously generated tokens,18,19 our generator employs a CNN architecture that generates the latent vectors in a single step. The generated latent vector has a dimension of 6 × 6 × 2 and is expected to contain the chemical information hidden in the high-dimensional training data. The discriminator is trained by comparing the input concatenated vectors of both training and generated molecules (Fig. S6†). The trained decoder converts the synthesized latent vectors to SMILES, which are then fed into the trained encoder to regenerate latent vectors that serve as input to the regressor. The regressor then predicts the ΔEH–L corresponding to the generated latent vectors. If the regressor is fed directly with the output of the generator, it achieves an R2 of 0.80 and an MAE of 0.60 eV (comparing the true and predicted values), which is worse than the performance (R2 of 0.90 and MAE of 0.45 eV) afforded by feeding the regressor with the converted latent vectors. The discriminator performs two functions. 
First, it determines whether the concatenated vector is from a real (training) molecule or a fake (generated or synthesized) one by comparing the statistical distribution of the two. Second, it tells whether a generated molecule corresponds to the targeted ΔEH–L value.
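The conditioning step described above can be sketched in a few lines. This is a minimal numpy illustration, not the paper's actual network code; the function name and signature are hypothetical, while the 128 × 1 noise vector and the scalar targeted ΔEH–L follow the text:

```python
import numpy as np

def make_generator_input(target_gap_ev, z=None, noise_dim=128, rng=None):
    """Build the 129x1 conditioning vector fed to the generator:
    a targeted HOMO-LUMO gap concatenated with a random noise vector z."""
    if rng is None:
        rng = np.random.default_rng()
    if z is None:
        z = rng.standard_normal(noise_dim)  # noise vector z, shape (128,)
    # The target value is prepended so the generator sees the condition
    # alongside the noise in a single input vector.
    return np.concatenate(([target_gap_ev], z))  # shape (129,)
```

The same concatenation pattern applies in batch mode: stacking one such vector per targeted ΔEH–L value yields the generator's input matrix.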
Finally, a reinforcement center is included in the RRCGAN framework to ensure that the generated molecules are chemically valid and accurate when the validated ΔEH–L is compared with the targeted ΔEH–L (Fig. 1c). First, the latent vectors produced by the generator are converted to SMILES by the decoder and then fed into RDKit38 to check that the SMILES are chemically valid. If a SMILES is valid, "1" is assigned to the string; otherwise, "0" is assigned. Subsequently, the relative error (RE) between a targeted ΔEH–L and the value predicted by the regressor is evaluated. If RE is less than 20%, "1" is assigned to represent an accurate sample. Only a latent vector assigned "1" on both criteria (valid and accurate molecule) is labeled as a real sample; otherwise, it is labeled as a fake one. These two constraints reinforce the discriminator to treat molecules with both high chemical validity and high accuracy as real and all others as fake. In the training process, before the reinforcement center is activated, the generator and the discriminator are trained for a few epochs. Details about the architectures of these networks and their training processes are described in ESI note 2.† Their loss functions and training processes are described as follows.
LAE = LAE1 + LAE2 | (1) |
LAE1 = −(1/N) Σi Σj tij log(t̂ij) | (2) |
LAE2 = (1/N) Σi (ŷi − yi)² | (3) |
In eqn (2), t is the true value (either 0 or 1) in the one-hot encoding vectors used for each SMILES. The predicted values t̂ can be anywhere between zero and one, but they must sum to 1 after the final SoftMax layer of the decoder. In eqn (3), ŷ is the predicted ΔEH–L, y is the true ΔEH–L, and N is the number of molecules. The decoder is conditioned on the known ΔEH–L to improve its accuracy. Eqn (3) calculates the mismatch between the ΔEH–L predicted by the decoder and the true ΔEH–L. The ΔEH–L predicted by the decoder, however, is not used in the model, as it has a lower accuracy than the regressor's prediction.
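The token-level cross-entropy of eqn (2) can be illustrated with a short numpy sketch. The function name is hypothetical and the clipping constant is an implementation convenience, not something specified in the paper:

```python
import numpy as np

def decoder_token_loss(t_onehot, t_pred, eps=1e-12):
    """Categorical cross-entropy between one-hot SMILES token targets
    and softmax outputs, averaged over tokens (cf. eqn (2)).

    t_onehot: array of shape (n_tokens, vocab), rows are one-hot.
    t_pred:   array of the same shape, rows sum to 1 (softmax output).
    """
    t_pred = np.clip(t_pred, eps, 1.0)  # guard against log(0)
    # Only the probability assigned to the true token contributes,
    # because t_onehot zeroes out all other vocabulary entries.
    return float(-np.mean(np.sum(t_onehot * np.log(t_pred), axis=-1)))
```

A perfect decoder (probability 1 on every true token) gives a loss of 0; a uniform decoder over a vocabulary of size V gives log V per token.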
The loss function of the regressor is defined as the L2 in eqn (4). It measures the difference between the predicted and true ΔEH–L.
LossR = L2[Y, R(Z)] | (4) |
L2[Y, Ŷ] = (1/N) Σi (yi − ŷi)² | (5) |
As shown in eqn (6), the loss function of the generator includes two terms. The first one is the same as the loss for the least square GAN (LSGAN),39 while the second one is the regularized loss for the regressor.
LossG = ½ Ez[(D(G(z, y), y) − 1)²] + λ L2[Y, R(G(z, y))] | (6) |
The loss function of the discriminator is the same as the one used for LSGAN (eqn (7)).39
LossD = ½ Ex[(D(x, y) − 1)²] + ½ Ez[D(G(z, y), y)²] | (7) |
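The least-squares GAN objectives of eqns (6) and (7) can be sketched directly in numpy. This is an illustrative stand-in operating on precomputed discriminator scores and regressor predictions; the function names and the λ default are assumptions, not the paper's code:

```python
import numpy as np

def lsgan_d_loss(d_real, d_fake):
    """LSGAN discriminator loss (cf. eqn (7)): push scores on real
    samples toward 1 and scores on generated samples toward 0."""
    return float(0.5 * np.mean((d_real - 1.0) ** 2)
                 + 0.5 * np.mean(d_fake ** 2))

def lsgan_g_loss(d_fake, reg_pred=None, reg_target=None, lam=1.0):
    """Generator loss (cf. eqn (6)): least-squares adversarial term,
    plus an optional regularized regressor term that penalizes the
    mismatch between predicted and targeted property values."""
    loss = 0.5 * np.mean((d_fake - 1.0) ** 2)
    if reg_pred is not None:
        loss += lam * np.mean((np.asarray(reg_pred)
                               - np.asarray(reg_target)) ** 2)
    return float(loss)
```

Replacing the binary cross-entropy of a vanilla GAN with these squared errors is the LSGAN trick: gradients stay informative even for samples the discriminator classifies confidently.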
The latent vectors outputted from the pre-trained encoder, along with the corresponding ΔEH–L were used to train the regressor. Fig. S10† shows that the loss of the regressor is stabilized after 150 epochs. As shown in Fig. S11,† the regressor affords a coefficient of determination, R-squared (R2) of 0.98, and a mean-absolute-error (MAE) of 0.19 eV for training and R2 of 0.95 and MAE of 0.33 eV for testing. Table S1† provides a comparison of the regressor's accuracy with other models. The pre-trained regressor is used to predict ΔEH–L of the synthesized molecules from the generator. It is also used in the reinforcement center to screen out the molecules with the unsatisfactory ΔEH–L accuracy.
After the AE and the regressor are pre-trained, the generator and discriminator are first trained for 5 epochs. After that, the reinforcement center is activated. Then, the generator generates 1000 latent vectors in response to the input ΔEH–L values. The reinforcement center groups the molecules based on two criteria: SMILES validity and the accuracy of the predicted ΔEH–L values relative to the targeted ones. To check the validity of the generated molecules, their latent vectors are first converted to SMILES by the decoder and then validated by RDKit. Meanwhile, these SMILES are converted to latent features and fed to the pre-trained regressor to predict ΔEH–L. The reinforcement center selects the generated molecules that are chemically valid and have a predicted ΔEH–L within an RE of 20% of the targeted values. These selected molecules are labeled as "1" and the remaining ones are labeled as "0". Then, these grouped molecules are fed to train the discriminator. The loss evolution of the generator and discriminator is shown in Fig. S12.† It shows that after the reinforcement center is activated, the loss of the generator decreases rapidly and stabilizes after 150 epochs. The low and stabilized losses of both the generator and discriminator indicate successful model training. We conducted a control experiment by disabling the reinforcement center in the training process. As shown in Fig. S13,† the losses of the generator and the discriminator without the reinforcement center do not converge after 200 epochs. Hyperparameters for these trained networks are shown in Table S2.† Evaluation metrics such as R2, mean absolute error (MAE), RMSE, MSE, and RE are defined in eqn (S1)–(S5) (ESI note 3†).
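The reinforcement center's labeling rule above reduces to two checks: a validity flag and an RE threshold. A minimal sketch follows; the actual implementation uses RDKit for validity, which is represented here by a boolean argument, and the function names are hypothetical:

```python
def relative_error(y_target, y_pred):
    """RE between the targeted and predicted property values."""
    return abs(y_pred - y_target) / abs(y_target)

def reinforcement_label(is_valid_smiles, y_target, y_pred, re_cutoff=0.20):
    """Label a generated molecule 1 ('real') only if its SMILES is
    chemically valid AND the regressor's prediction is within the RE
    cutoff of the targeted value; otherwise label it 0 ('fake')."""
    accurate = relative_error(y_target, y_pred) < re_cutoff
    return 1 if (is_valid_smiles and accurate) else 0
```

Molecules labeled 1 are presented to the discriminator as real samples, which is what biases the adversarial game toward valid and accurate generation.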
The workflow of such an iterative generative algorithm is shown in Fig. 3a. As a demo, herein, only two iterations were investigated. In the first iteration, a set of 1000 molecules with ΔEH–L values of ≥10.0 eV was used for training. Out of those, 461 molecules with ΔEH–L values of ≥10.0 eV were sourced from the PubChemQC database (Fig. 3a(ii) and Table S5†), while the remaining molecules were newly generated by the model. To generate them, we employed a multiple batch generation process, each consisting of 50 targeted ΔEH–L values uniformly sampled within the range of 8–11 eV. Subsequently, we screened the generated molecules corresponding to these targeted ΔEH–L values using the regressor model, selecting those with the predicted ΔEH–L value greater than 9.5 eV. These molecules were then subjected to DFT calculations for validation, and only those with DFT-calculated ΔEH–L values of ≥10 eV were finally selected. This batch generation process was repeated using different sampled targeted values until 539 valid, unique, and novel molecules with the DFT-validated ΔEH–L values of ≥10 eV were obtained. In the second iteration, the transferred model was fine-tuned using the generated molecules with validated ΔEH–L of ≥10.2 eV from the first transferred RRCGAN model. Fig. 3b shows the distributions of ∼132 K initial molecules used for training the initial RRCGAN model and the generated molecules in different transfer learning iterations. The ΔEH–L values of the generated molecules by the initial model are in the 2–10.15 eV range with a mean ΔEH–L of 6.33 eV, which is close to 5.94 eV, the average of the original training molecules. Only 0.5% of the outlier molecules have ΔEH–L of ≥10 eV. After the first iteration, the transferred model generates the molecules with a mean ΔEH–L of 7.4 eV and a maximum ΔEH–L of 11.6 eV. The percentage of the molecules with ΔEH–L of ≥10 eV increases to 5%. 
After the 2nd iteration, the generated molecules have a mean ΔEH–L of 8.7 eV and a maximum ΔEH–L of 12.9 eV. The percentage of the molecules with the predicted ΔEH–L of ≥10 eV increases to 16%. These results illustrate that the iterative transfer learning can push the generation toward higher ΔEH–L values and increase maximum ΔEH–L.
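The generate–screen–validate loop of each transfer-learning iteration can be expressed as a small driver function. This is a structural sketch only: `generate`, `predict`, and `dft_validate` are caller-supplied placeholders for the trained generator, the regressor, and the DFT workflow, and the cutoffs mirror the values quoted in the text:

```python
def screening_round(generate, predict, dft_validate, targets,
                    regressor_cutoff=9.5, dft_cutoff=10.0, n_needed=539):
    """One iteration of the extrapolative screening loop: generate a
    molecule per sampled target gap, keep candidates the regressor
    predicts above the screening cutoff, then retain only those whose
    DFT-validated gap clears the final cutoff, until enough unique
    molecules are collected."""
    selected = []
    while len(selected) < n_needed:
        for t in targets:  # e.g. gaps sampled uniformly in 8-11 eV
            smiles = generate(t)
            if predict(smiles) <= regressor_cutoff:
                continue  # cheap regressor screen before costly DFT
            gap = dft_validate(smiles)
            if gap >= dft_cutoff and smiles not in selected:
                selected.append(smiles)
            if len(selected) >= n_needed:
                break
    return selected
```

In practice the loop would also deduplicate against the training set and cap the number of rounds; those bookkeeping details are omitted here.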
The application of transfer learning in molecule design has been explored in other studies as well.44,45 However, our approach distinguishes itself from the method proposed by Merk et al.44 in terms of our fine-tuning strategy. While they utilized historical data featuring high experimental activities, we employed newly generated molecules as training samples. This unique approach led us to uncover a previously unexplored functional group (C–F) that exhibits a strong correlation with high ΔEH–L values. The fine-tuning process using these newly generated molecules yielded a pronounced emphasis on exploration over exploitation. Furthermore, our framework differs from the work introduced by Korshunova et al.45 Although they also employed newly generated samples for fine-tuning, their framework lacks the capability to target multiple values within the high-value region. In contrast, with sufficient fine-tuning, our framework has the potential to precisely target a range of values within the explored high-value region.
Fig. 4 Representative examples of molecules generated by the original and transferred RRCGAN models: the molecules with (a) ΔEH–L of <6.5 eV, and (b) ΔEH–L of >6.5 eV.
Comparison of the molecules with high and low ΔEH–L values highlights several key observations. Molecules featuring alternating single and multiple bonds (conjugated systems), unsaturated rings, and radical electrons tend to exhibit lower ΔEH–L values. Conversely, molecules with linear structures characterized by single bonds or saturated rings tend to display higher ΔEH–L values. Moreover, the presence of sulfur (S) and nitrogen (N) decreases ΔEH–L. This effect can be attributed to the increased extent of orbital overlap facilitated by these elements, ultimately reducing ΔEH–L.46
In addition to the structure–property relationship disclosed by the initial RRCGAN model, the transferred model reveals a different but noteworthy correlation: the presence of fluorine (F) atoms bonded to carbon (C) atoms in a molecule increases ΔEH–L. This could be because F is the most electronegative element in the periodic table. In a molecule, F exerts a strong electron-withdrawing effect, which raises the LUMO level and thus yields a higher ΔEH–L.47 This rule, however, remained undisclosed by the initial model due to the scarcity of F-containing molecules in the initial training dataset. Among the 132 K initial training molecules, only 4 contain an F atom and have a ΔEH–L of >10 eV. As depicted in Fig. 4b, the molecules generated by the transferred model reach a ΔEH–L of 13.07 eV, and they all include F–C bonds. This observation illustrates the effectiveness of the transferred model in learning a critical structural feature present in only a few samples when performing extrapolative generation. Meanwhile, these molecules all include single bonds and saturated rings, knowledge transferred from the initial model that these two features tend to increase ΔEH–L. These observations confirm that the model can effectively correlate structures with properties, in line with established chemical rules.47,48 The strong agreement between the model's predictions and established chemical principles enhances confidence in using this deep generative model for the efficient and cost-effective generation of novel molecules with desired properties. Adding objectives related to the synthetic accessibility of the generated molecules is a thoughtful approach to enhance the practical utility of the proposed generative model. 
This could include criteria such as the complexity of the chemical structure, the presence of synthetically challenging motifs, or adherence to established synthetic rules.49 Additionally, involving domain experts in the development and validation process can significantly enhance the effectiveness of the synthetic accessibility objectives in the proposed generative model.
Visualizing these representative molecules in Fig. 4 affords a qualitative correlation of the structures with their ΔEH–L. To establish a quantitative relationship, we trained an XGBoost regression model that takes 18 structural features (ESI note 4†) as input to predict ΔEH–L. From the feature importance analysis (Fig. S15†), we picked the four structural features that most affect the prediction: the number of saturated rings, the number of hydrogen-bond acceptors, the heavy-atom molecular weight, and the number of rotatable bonds. A saturated ring is a cycle composed solely of single bonds, while an aromatic ring consists of alternating single and double bonds, as exemplified by benzene. Hydrogen-bond acceptors are typically electronegative atoms with lone pairs of electrons, such as oxygen (O), nitrogen (N), and sometimes sulfur (S). Rotatable bonds are non-ring single bonds connected to non-hydrogen, non-terminal atoms; amide C–N bonds are excluded due to their high rotation barriers.50 In Fig. 5, we present the percentage distributions of these selected features within both the training and generated molecules. The feature distributions of the generated molecules differ slightly from those of the training ones, demonstrating the generator's capability to explore new design space when generating molecules with targeted ΔEH–L. Specifically, Fig. 5a reveals a higher percentage of generated molecules with a single saturated ring compared to the training molecules. Fig. 5b illustrates a decrease in the occurrence of generated molecules with 2 and 3 hydrogen-bond acceptors, while the number of molecules with more hydrogen-bond acceptors is increased. Moreover, higher heavy-atom molecular weights tend to accompany higher ΔEH–L (Fig. 5c), indicating a tendency for the model to generate larger molecules when higher ΔEH–L values are requested. Additionally, Fig. 
5d indicates that the number of rotatable bonds increases with higher ΔEH–L values. It is worth noting that these structural features were not directly used as descriptors for the RRCGAN model; such information is likely captured implicitly within the latent vectors. Furthermore, Fig. S16† presents the ranking of other features also associated with ΔEH–L. Further explanations and details regarding these features can be found in ESI note 4.†
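The feature-importance analysis above follows a standard pattern: fit a gradient-boosted regressor on structural descriptors and rank features by their importances. The sketch below uses scikit-learn's GradientBoostingRegressor as a stand-in for XGBoost and synthetic data in place of the 18 RDKit-derived descriptors, so the numbers are purely illustrative:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in data: rows are molecules, columns mimic structural
# descriptors (saturated-ring count, H-bond acceptors, heavy-atom MW,
# rotatable bonds). The toy target depends mostly on the first column.
rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = 3.0 * X[:, 0] + 0.5 * X[:, 3] + rng.normal(0.0, 0.05, 200)

model = GradientBoostingRegressor(random_state=0).fit(X, y)
importance = model.feature_importances_  # sums to 1; ranks descriptors
top = int(np.argmax(importance))         # index of the dominant feature
```

On real data, the four top-ranked entries of `importance` would correspond to the four features discussed above, and a partial-dependence plot would then show the direction of each feature's effect on ΔEH–L.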
To validate the hypothesis, we applied t-distributed stochastic neighbor embedding (t-SNE), a non-linear dimension reduction method, to project the latent vectors of both training and generated molecules (Fig. 6). First, we divided ΔEH–L into four ranges. The ranges for the training molecules are ≤4.4 eV, [4.4–5.7 eV], [5.7–7.5 eV], and >7.5 eV. The ranges for the generated molecules are ≤4.5 eV, [4.5–5.9 eV], [5.9–7.7 eV], and >7.7 eV. Each range was calculated by quantiles to have the same number of molecules. The projected latent vectors were then colored based on their ΔEH–L ranges, where the dark blue and dark red colors represent the low and high values, respectively. As shown in Fig. 6, the first component of t-SNE (t-SNE1) separates the molecules based on their ΔEH–L values. The molecules in the same ΔEH–L range are clustered into close regions in the plots. Molecules with ΔEH–L > 6 eV are in a region with t-SNE1 < 0 and vice versa. In Fig. 6a and b, molecule (7) is a representative sample with ΔEH–L of ≤4.4 eV and ≤4.5 eV for the training and generated molecules, respectively. Molecules (3) and (4) represent the ones with ΔEH–L of >7.5 eV and >7.7 eV for the training and generated molecules, respectively. Linear molecules with single bonds and fewer sulfur and nitrogen atoms are grouped in the high ΔEH–L value region, while molecules with rings, conjugated systems, and more sulfur and nitrogen atoms occupy the low ΔEH–L value regions. These results agree well with the observations from Fig. 4. Moreover, the generated molecules are clustered in the same regions as the ones for the training molecules (Fig. S17†), further validating that the generator has successfully learned the structural information from the latent vectors of the training molecules for generating novel molecules with the targeted ΔEH–L. 
As a comparison, we also performed a principal component analysis (PCA) and a spectral embedding analysis on the same molecules used for the t-SNE analysis. The results are shown in Fig. S18,† and a discussion of the PCA and spectral embedding results is provided in ESI note 5.† In conclusion, t-SNE outperforms the other two methods for data visualization in this case.
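The t-SNE projection itself is a one-call operation with scikit-learn. The sketch below substitutes random vectors for the encoder's latent vectors (the paper's 6 × 6 × 2 latents flatten to 72 dimensions; the sample count and perplexity here are illustrative):

```python
import numpy as np
from sklearn.manifold import TSNE

# Toy stand-in for the flattened latent vectors from the encoder.
rng = np.random.default_rng(0)
latents = rng.standard_normal((60, 72))

# Non-linear projection to 2-D; emb[:, 0] is the t-SNE1 axis used to
# separate molecules by their DeltaE_H-L range in Fig. 6.
emb = TSNE(n_components=2, perplexity=10,
           random_state=0).fit_transform(latents)
```

In the actual analysis, each projected point is then colored by its ΔEH–L quantile bin, which is what makes the low-to-high separation along t-SNE1 visible.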
Fig. 6 t-SNE plots of the latent vectors of the training and generated molecules output from the encoder: (a) training molecules; (b) generated molecules. Unit of ΔEH–L is eV.
We have also highlighted some molecules at the boundaries between the highest and lowest ΔEH–L ranges to show that their structures can be similar even though they fall in different ranges. When comparing molecules (1) and (2) in Fig. 6a, the conjugated system in molecule (1) lowers ΔEH–L, which agrees well with the conclusion drawn from Fig. 4. When comparing molecules (1) and (2) in Fig. 6b, the radical electrons in molecule (1) lower ΔEH–L. Compared to molecule (5), molecule (6) contains a sulfur atom, which reduces its ΔEH–L value (Fig. 6a and b). For such molecules with similar structures but different ΔEH–L values, the reduced latent space is not sufficient to distinguish them.
To ensure the reliability and efficacy of the model, the structures and the latent features of both training and generated molecules were qualitatively and quantitatively analyzed. The analyses reveal that the model has successfully captured the underlying structure–property relationship, which agrees well with the established physical and chemical rules. The model then correlates the structural features with the values of ΔEH–L for generating novel molecules with targeted ΔEH–L. The proposed RRCGAN framework would afford a trustworthy, purely data-driven methodology for the highly efficient generation of novel molecules without the need for physical or chemical inputs.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d3dd00210a
This journal is © The Royal Society of Chemistry 2024 |