Open Access Article

This Open Access Article is licensed under a Creative Commons Attribution-Non Commercial 3.0 Unported Licence

DOI: 10.1039/C9SC04503A
(Edge Article)
Chem. Sci., 2020, Advance Article

Jaechang Lim‡
^{a},
Sang-Yeon Hwang‡^{a},
Seokhyun Moon^{a},
Seungsu Kim^{b} and
Woo Youn Kim*^{ac}
^{a}Department of Chemistry, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea. E-mail: wooyoun@kaist.ac.kr
^{b}School of Computing, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea
^{c}KI for Artificial Intelligence, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea

Received
6th September 2019
, Accepted 3rd December 2019

First published on 3rd December 2019

Searching for new molecules in areas like drug discovery often starts from the core structures of known molecules. Such a method has called for a strategy of designing derivative compounds retaining a particular scaffold as a substructure. On this account, our present work proposes a graph generative model that targets its use in scaffold-based molecular design. Our model accepts a molecular scaffold as input and extends it by sequentially adding atoms and bonds. The generated molecules are then guaranteed to contain the scaffold with certainty, and their properties can be controlled by conditioning the generation process on desired properties. The learned rule of extending molecules can well generalize to arbitrary kinds of scaffolds, including those unseen during learning. In the conditional generation of molecules, our model can simultaneously control multiple chemical properties despite the search space constrained by fixing the substructure. As a demonstration, we applied our model to designing inhibitors of the epidermal growth factor receptor and show that our model can employ a simple semi-supervised extension to broaden its applicability to situations where only a small amount of data is available.

A common strategy of rational molecular design is narrowing the search space by starting from a known molecule with potential. Designing new molecules then proceeds by searching for the right combination of functional groups that optimizes the required properties, while the core structure, or scaffold, of the original molecule is purposely retained to preserve the underlying characteristics. This scaffold-based design has been one of the standard approaches in small-molecule drug discovery, where one first identifies a scaffold based on the information of the target protein and probes a library of derivative compounds to find the one showing optimum potency and selectivity.^{3–5} Molecular design in other areas such as materials chemistry also adopts similar strategies. A conspicuous example is organic electronics, where a typical set of electron-donor and -acceptor moieties is known. The process of properly selecting or combining scaffold moieties, followed by side-chain optimization, leads to designing new constituents in organic light-emitting diodes,^{6} field-effect transistors,^{7} photocatalysts^{8} and solar cells.^{9} However, although fixing the core structure hugely reduces the search space, it is most likely that, in most real-world applications, the remaining fraction still spans easily beyond the reach of what could be covered by sole human intuition.

Upon the call for more exhaustive and intelligent exploration of chemical space, recent advances in deep generative models have been more and more stimulating the use of the models in in silico molecular design.^{10,11} All the related studies share a common goal of easing the labor of discovering new molecules in practice.^{12–31} That said, until today not much attention was given to scaffold-based molecular design, despite its prevalence as described above. One exception is the work by Li et al.,^{25} where the authors use a fingerprint representation of the scaffolds contained in a molecule. The fingerprint is then used to condition the generation process so that the generated molecules tend to contain desired scaffolds. Another possible way of retaining scaffolds is to learn a continuous representation of molecules.^{15,17,24,32} Searching the vicinity of a query molecule in the resulting latent space allows generating molecules similar in shape. Generating derivatives of a scaffold can thus be possible if the distance in the learned space well correlates with core-structure similarity.

The described schemes of scaffold-based molecule generation have the following drawbacks. First, categorically representing the scaffold kind of a molecule requires a predefined vocabulary of scaffolds. This imposes a limitation on querying arbitrary kinds of scaffolds when generating molecules. Second, because of their probabilistic nature, both the conditional generation and the latent-space search permit molecules that do not contain desired scaffolds. Lastly, to the best of our knowledge, no work has been shown to control the scaffold and properties of generated molecules at the same time, which is essential in achieving the goal of scaffold-based molecular design.

We here show our development of a generative model that targets its use in scaffold-based molecular design. Our model is a variational autoencoder (VAE)^{33} that accepts a molecular scaffold as input and extends it to generate derivative molecules. Extending a scaffold is done by sequentially adding new atoms and bonds to due parts of the scaffold. We adopt graphs to represent molecular structures instead of widely used textual formats, such as the simplified molecular-input line-entry system (SMILES).^{34} This is because a string representation can substantially change throughout our process of sequential extension while graphs can more naturally express how a new atom or bond—i.e., node or edge—is added each time.^{24} Graph neural networks are used to extract the structural dependencies between nodes and edges, making every decision of adding a new element dependent on the structure of the graph being processed.^{25,35,36} The properties of generated molecules can be controlled by conditioning the generation process on desired properties. After learning, our model can accept any arbitrary scaffold and generate new molecules, with their properties controlled and core structures retained.

Ideally, generative models without any starting structure, i.e., de novo generative models for molecules, would be more desirable. However, the performance of optimizing a property in de novo generation may significantly drop when the associated structure–property relationship is hard to be learned.^{18} Such a tendency can get still worse when multiple properties are to be controlled, because acquiring a sufficient amount of data for individual properties is often impractical. By leveraging the properties of a scaffold, our scaffold-based generative model can more easily optimize target properties and preserve desirable characteristics. As an example of the latter, one can retain the synthetic accessibility of the generated molecules by using a scaffold whose synthetic routes are well established—an aspect crucial in in silico design.

Our work contributes to molecular design in two aspects. One, of more practical importance, is our proposal of the model as a tool for scaffold-based molecular design. The other, of more analytical importance, is our exploration of supergraph space. To elaborate the latter, we first note that once a scaffold is to be retained, the set of possible generations only include the molecules whose graphs are supergraphs of the scaffold graph. In other words, fixing a scaffold imposes a strong constraint on the search space, and therefore it shall be meaningful to probe how such a constraint affects the generation performance and property controllability. The first part of our experiments addresses this point. Further divided into three parts, our experiments address (i) the overall generation performance, (ii) the scaffold dependence and (iii) the property controllability of our model. In part (i), we show that our model achieves high rates of validity, uniqueness and novelty in generating molecules. In part (ii), we show that our model consistently shows good performance in extending unobserved as well as observed scaffolds, confirming that our model can build new molecules from smaller ones by learning valid chemical rules, rather than by memorizing the molecule–scaffold matching in the training data. In part (iii), we show that despite the confined search space, our model can control single or multiple properties of generated molecules, with tolerable deviations from targeted values.

Finally, returning to the more practical aspect, we applied our model to designing inhibitors of the human epidermal growth factor receptor (EGFR). We performed semi-supervised learning by incorporating a property predictor to learn from unlabeled molecules as well, which can help treat the frequently encountered problem of data deficiency. As a result, the model was able to generate new inhibitors with improved potency, where 20% of the unique generations showed more than two orders of magnitude decrease in the predicted half-maximal inhibitory concentration. This shows that our model can learn to design molecules in more realistic problems, where the deficiency of available data makes learning hard.

Despite the well-demonstrated successes of SMILES-based models, SMILES has a fundamental limitation in consistently conveying molecular similarity.^{24} In contrast to SMILES, molecular graphs as a representation can more naturally express the structural characteristics of molecules. Because learning a distribution over graphs imposes more challenging problems like low scalability and non-unique node orderings,^{25,38} it is only very recently that reports on generative models of graphs—molecular graphs, in particular—began to emerge. The majority of works adopts sequential methods in generating graphs. For instance, the models by Li et al.,^{25} You et al.^{26} and Li et al.^{27} generate a graph by predicting a sequence of graph building actions consisting of node additions and edge additions. Assouel et al.^{28} and Liu et al.^{29} similarly adopted the sequential scheme, but their graph generation proceeds through two (fully^{28} or partially^{29}) separate stages of node initialization and connection. It is also possible to coarse-grain the standard node-by-node generation into a structure-by-structure fashion. The junction tree VAE by Jin et al. generates a tree of molecular fragments and then recovers a full molecular graph through a fine-grained assembly of the fragments.^{24} Sharply contrasting to the sequential generation scheme is the single-step generation of whole graphs. The models GraphVAE by Simonovsky and Komodakis^{30} and MolGAN by De Cao and Kipf^{31} are such, where an adjacency matrix and graph attributes are predicted altogether, generating a graph at one time.

Notation | Description |
---|---|

G | An arbitrary or whole-molecule graph |

S | A molecular scaffold graph |

V(G) | The node set of a graph G |

E(G) | The edge set of a graph G |

h_{v} |
A node feature vector |

h_{uv} |
An edge feature vector |

H_{V(G)} |
{h_{v}:v ∈ V(G)} |

H_{E(G)} |
{h_{uv}:(u,v) ∈ E(G)} |

h_{G} |
A readout vector summarizing H_{V(G)} |

z | A latent vector to be decoded |

y | The vector of molecular properties of a whole-molecule |

y_{S} |
The vector of molecular properties of a scaffold |

The learning object of our model is a strategy of extending a graph to larger graphs whose distribution follows that of real molecules. We achieve this by training our model to recover the molecules in a dataset from their scaffolds. The scaffold of a molecule can be defined in a deterministic way such as that by Bemis and Murcko,^{39} which is what we used in our experiments. The construction of a target graph is done by making successive decisions of node and edge additions. The decision at each construction step is drawn from the node features and edge features of the graph at the step. The node features and edge features are recurrently updated to reflect the construction history of the previous steps. The construction process will be further detailed below.

We realized our model as a VAE,^{33} with the architecture depicted in Fig. 1. The architecture consists of an encoder q_{ϕ} and a decoder p_{θ}, parametrized by ϕ and θ, respectively. The encoder encodes a graph G to an encoding vector z and the decoder decodes z to recover G. The decoder requires a scaffold graph S as an additional input, and the actual decoding process runs by sequentially adding nodes and edges to S. The encoding vector z plays its role by consistently affecting its information in updating the node and edge features of a transient graph being processed. Similar to p(G;S), our notation p_{θ}(G;S|z) indicates that candidate generations of our decoder are always a supergraph of S. As for the encoder notation q_{ϕ}(z|G;S), we emphasize that the encoder also has a dependence on scaffolds because of the joint optimization of q_{ϕ} and p_{θ}.

As a latent variable model, a VAE may have difficulty in generating meaningful samples when the manifold of the learned latent distribution is insufficient to cover all the informative regions.^{40–42} In generative models for molecules, this difficulty has manifested itself in the form of chemically invalid generated molecules,^{15,43} and accordingly various SMILES-based^{43–46} and graph-based^{24,29,47–49} models have introduced explicit constraints or learning algorithms to enhance the validity of generations. It is possible to incorporate similar constraints or algorithms also in our model; without such, nonetheless, our model shows high chemical validity in the generated graphs (see Section 3.2 below).

In this work, we implemented the encoder q_{ϕ} as a variant of the interaction network.^{36,50} Our network's algorithm consists of a propagation phase and a readout phase, which we write as

(1) |

(2) |

The propagation phase itself consists of two stages. The first stage calculates an aggregated message between each node and its neighbors as

(3) |

(4) |

The graph propagation can be conditioned by incorporating an additional vector c in calculating aggregated messages. In such a case, the functions M and (accordingly) propagate accept c as an additional argument (i.e., they become M(·,·,·,c) and propagate(·,·,c)). When encoding input graphs, we choose c to be the concatenation of the property vectors y and y_{S} to enable property-controlled generation. During graph decoding, we use the concatenation of y, y_{S} and the latent vector z as the condition vector (see below).

Our graph decoding starts with preparing and propagating the initial node features of G_{0}. As we do for G, we prepare the initial feature vectors of G_{0} by embedding the atom types and bond types of the scaffold molecule. This initial embedding is done by the same network (embed in Fig. 1) used for whole-molecules. The initial feature vectors of G_{0} are then propagated a fixed number of times by another interaction network. As the propagation finishes, the decoder extends G_{0} by processing it through a loop of node additions and accompanying (inner) loops of edge additions. A concrete description of the process is as follows:

• Stage 1: node addition. Choose an atom type or terminate the building process with estimated probabilities. If an atom type is chosen, add a new node, say w, with the chosen type to the current transient graph G_{t} and proceed to Stage 2. Otherwise, terminate the building process and return the graph.

• Stage 2: edge addition. Given the new node, choose a bond type or return to Stage 1 with estimated probabilities. If a bond type is chosen, proceed to Stage 3.

• Stage 3: node selection. Select a node, say v, from the existing nodes except w with estimated probabilities. Then, add a new edge (v,w) to G_{t} with the bond type chosen in Stage 2. Continue the edge addition from Stage 2.

The flow of the whole process is depicted in the right side of Fig. 1. Excluded from Stages 1–3 is the final stage of selecting a suitable isomer, which we describe separately below.

In every stage, the model draws an action by estimating a probability vector on candidate actions. Depending on whether the current stage should add an atom or not (Stage 1), add an edge or not (Stage 2), or select an atom to connect (Stage 3), the probability vector is computed by the corresponding one among the following:

^{an} = addNode(H_{V(Gt)},H_{E(Gt)},z)
| (5) |

^{ae} = addEdge(H_{V(Gt)},H_{E(Gt)},z)
| (6) |

^{sn} = selectNode(H_{V(Gt)},H_{E(Gt)},z).
| (7) |

The first probability vector ^{an} is a (n_{a} + 1)-length vector, where its elements ^{an}_{1} to correspond to the probabilities on n_{a} atom types, and is the termination probability. As for ^{ae}, a vector of size n_{b} + 1, its elements ^{ae}_{1} to correspond to the probabilities on n_{b} bond types, and is the probability of stopping the edge addition. Lastly, the i-th element of the third vector ^{sn} is the probability of connecting the i-th existing node with the lastly added one.

When the model decides to add a new node, say w, a corresponding feature vector h_{w} should be added to H_{V(Gt)}. To this end, the model prepares an initial feature vector h^{0}_{w} by representing the atom type of w and then incorporates it with the existing node features in H_{V(Gt)} to compute an appropriate h_{w}. Similarly, when a new edge, say (v,w), is added, the model computes h_{vw} from h^{0}_{vw} and H_{V(Gt)} ∪ {h_{w}} to update H_{E(Gt)}, where h^{0}_{vw} represents the bond type of (v,w). The corresponding modules for initializing new nodes and edges are as follows:

h_{w} = initNode(h^{0}_{w},H_{V(Gt)})
| (8) |

h_{vw} = initEdge(h^{0}_{vw},H_{V(Gt)} ∪ h_{w}).
| (9) |

The graph building modules addNode, addEdge and selectNode include a preceding step of propagating node features. For instance, the actual operation done by addNode is

addNode(H_{V(Gt)},H_{E(Gt)},z) = f◦concat(readout◦propagate^{(k)} (H_{V(Gt)},H_{E(Gt)},z),z),
| (10) |

As shown in eqn (10), the graph propagation in addNode (and addEdge and selectNode) incorporates the latent vector z, which encodes a whole-molecule graph G. This makes our model refer to z when making graph building decisions and ultimately reconstruct G by decoding z. If the model is to be conditioned on whole-molecule properties y and scaffold properties y_{S}, one can understand eqn (5)–(7) and (10) as incorporating = concat(z,y,y_{S}) instead of z.

^{si} = selectIsomer(Ĝ,z),
| (11) |

(12) |

In actual learning, we have a scaffold dataset , and for each scaffold we have a corresponding whole-molecule dataset . Note that any set of molecules can produce a scaffold set and a collection of whole-molecule sets: once the scaffolds of all molecules in a pregiven set are defined, producing , the molecules of the pregiven set can be grouped into the collection . Using such and , our objective is to find the optimal values of the parameters ϕ and θ that maximize the right-hand side of eqn (12), hence maximizing . Here, the double expectation indicates explicitly the dependence of whole-molecule sets D(S) on scaffolds S. That is, according to our definition, defining a scaffold set is first, and then each defines a whole-molecule set .

In the ESI,† we detail our implementation of the modules and their exact operations, along with the algorithm of the full process.

Our experiments include the training and evaluation of our scaffold-based graph generative model using the stated dataset. For the conditional molecule generation, we used molecular weight (MW), topological polar surface area (TPSA) and octanol–water partition coefficient (logP). We used one, two or all of the three properties to singly or jointly condition the model. We set the learning rate to 0.0001 and trained all instances of the model up to 20 epochs. The other hyperparameters such as the layer dimensions are stated in the ESI.† We used RDKit to calculate the properties of molecules. In what follows, we will omit the units of MW (g mol^{−1}) and TPSA (Å^{2}) for simplicity.

Our source code and dataset are available at https://github.com/jaechanglim/GGM.

We define a graph to be valid if it satisfies basic chemical requirements such as valency. In practice, we use RDKit to determine the validity of generated graphs. It is particularly important for our model to check the metrics above because generating molecules from a scaffold restricts the space of candidate products. We evaluated the models that are singly conditioned on MW, TPSA or logP by randomly selecting 100 scaffolds from the dataset and generating 100 molecules from each scaffold. The target values (100 values for each property) were randomly sampled from each property's distribution over the dataset. For MW, generating molecules whose MW is smaller than the MW of its scaffold is unnatural, so we excluded those cases from our evaluation.

Table 2 summarizes the validity, uniqueness and novelty of the molecules generated by our models and the results of other molecular generative models for comparison. Note that the comparison here is only approximate because the values by the other models were taken from their respective reports. Despite the strict restriction imposed by scaffolds, our models show high validity, uniqueness and novelty that are comparable to those of the other models. The high uniqueness and novelty are particularly meaningful considering the fact that most of the scaffolds in our training set have only a few whole-molecules. For instance, among the 85318 scaffolds in the training set, 79700 scaffolds have less than ten whole-molecules. Therefore, it is unlikely that our model achieved such a high performance by simply memorizing the training set, and we can conclude that our model learns the general chemical rules for extending arbitrary scaffolds.

Fig. 2 shows the property distributions of generated molecules. We see that the property distributions are well centered around the target values. This shows that despite the narrowed search space, our model successfully generated new molecules having desired properties. To see how our model extends a given scaffold according to designated property values, we drew some of the generated molecules in Fig. 3. For the target conditions MW = 400, TPSA = 120 and logP = 7, we randomly sampled nine examples using three different scaffolds. The molecules in each row were generated from the same scaffold. We see that the model generates new molecules with designated properties by adding appropriate side chains: for instance, the model added hydrophobic groups to the scaffolds to generate high-logP molecules, while it added polar functional groups to generate high-TPSA molecules.

Fig. 3 Example molecules generated from three scaffolds. The indicated values are the target conditions of the generation and the property values of the scaffolds. |

Table 3 summarizes the validity, uniqueness and MAD of molecules generated from the seen and unseen scaffolds. Here, MAD denotes the mean absolute difference between designated property values and the property values of generated molecules. The result shows no significant difference of the three metrics between the two sets of scaffolds. This shows that our model achieves generalization over arbitrary scaffolds in generating valid molecules with controlled properties.

Property | Validity (%) | Uniqueness (%) | MAD |
---|---|---|---|

MW (seen scaffolds) | 98.4 | 88.6 | 6.72 |

MW (unseen scaffolds) | 98.4 | 83.5 | 6.09 |

TPSA (seen scaffolds) | 93.2 | 87.0 | 8.32 |

TPSA (unseen scaffolds) | 92.5 | 82.9 | 9.82 |

logP (seen scaffolds) | 98.2 | 91.1 | 0.28 |

logP (unseen scaffolds) | 97.1 | 87.0 | 0.36 |

Fig. 4 shows the result of the generations conditioned on MW and TPSA, MW and logP, and logP and TPSA. Plotted are the joint distributions of the property values over the generated molecules. Gaussian kernels were used for the kernel density estimation. We see that the modes of the distributions are well located near the point of the target values. As an exception, the distribution by the target (logP, TPSA) = (2, 50) shows a relatively long tail over larger logP and TPSA values. This is because logP and TPSA have by definition a negative correlation between each other and thus requiring a small value for both can make the generation task unphysical. Intrinsic correlations between molecular properties can even cause seemingly feasible targets to result in dispersed property distributions. An example of such can be the result of another target (logP, TPSA) = (5, 50), but we note that in the very case the outliers (in logP > 5.5 and TPSA > 65 for example) amount to only a minor portion of the total generations, as the contours show.

We further tested the conditional generation by incorporating all the three properties. We used the same target values of MW, TPSA and logP as above, resulting in total eight conditions of generation. The rest of the settings, including the scaffold set and the number of generations, were retained. The result is shown in Fig. 5, where we plotted the MW, TPSA and logP values of the generated molecules. The plot shows that the distributions from different target conditions are well separated from one another. As with the double-property result, all the distributions are well centered around their target values.

Fig. 5 Scatter plot of the property values of generated molecules. The legend lists the eight sets of property values used for the generations. |

We also compared the generation performance of our model for single- and multi-property controls in a quantitative way. Table 4 shows the performance statistics of single-, double- and triple-property controls in terms of MAD, validity and novelty. Using the same 100 scaffolds, we generated 100 molecules from each, each time under a randomly designated target condition. As the number of incorporated properties increases from one to two and to three, the overall magnitudes of the descriptors are well preserved. Regarding the slight increases in the MAD values, we attribute them to the additional confinement of chemical space forced by intrinsic correlations between multiple properties. Nevertheless, the magnitudes of the worsening are small compared to the mean values of the properties (389 for MW, 77 for TPSA and 3.6 for logP).

Properties | MW | TPSA | logP | Validity (%) | Novelty (%) |
---|---|---|---|---|---|

MAD | MAD | MAD | |||

MW | 7.99 | — | — | 98.6 | 98.7 |

TPSA | — | 8.57 | — | 93.0 | 99.1 |

logP | — | — | 0.29 | 97.8 | 99.3 |

MW & TPSA | 8.04 | 7.06 | — | 93.5 | 99.4 |

MW & logP | 11.59 | — | 0.45 | 97.0 | 99.6 |

TPSA & logP | — | 9.62 | 0.60 | 94.5 | 99.6 |

All | 16.23 | 10.95 | 0.73 | 93.9 | 99.8 |

We adopted the semi-supervised VAE^{16,54} because it can be easily implemented by adding a label predictor to a plain VAE. In semi-supervised VAE, we jointly train the predictor to predict the negative logarithm of half maximal inhibitory concentration (pIC_{50}) values against the human EGFR, together with the VAE part. For labeled molecules, we use the true pIC_{50} values to train the predictor and also use them in the VAE condition vector. For unlabeled molecules, we only train the VAE part using the pIC_{50} values predicted by the predictor. To the IBS training molecules we used above, we added 8016 molecules from the ChEMBL database^{55} to prepare our dataset for semi-supervised learning. The ChEMBL molecules were the ones having pIC_{50} values, hence labeled, whereas the IBS molecules were unlabeled. Note that each molecule is paired with its extracted scaffold and that all the scaffolds of ChEMBL as well as IBS molecules are treated unlabeled. We divided the ChEMBL molecules into 6025 training molecules and 1991 test molecules, resulting in only 1.7% label ratio in the training set. To efficiently train the predictor, we sampled labeled and unlabeled inputs with the ratio of 1:5 in every batch. After 20 epochs of learning, we chose the 100 scaffolds in the ChEMBL test set whose predicted pIC_{50} are between 5 and 6, and generated 100 molecules from each scaffold with the target pIC_{50} value of 8.

The MAD of pIC_{50} prediction on the ChEMBL test molecules was 0.58. In the total 10000 times of generation, the model showed the validity of 96.6%, uniqueness of 44.9% and novelty of 99.7%. The relatively low uniqueness was expected because using one fixed target value imposes a strict condition on the search space, increasing redundancy in generations. Fig. 6 shows the distributions of predicted pIC_{50} values of the 100 scaffolds and generated molecules. Although the distribution of generated molecules is centered on values lower than 8 and shows relatively broad dispersion, the mean improvement of pIC_{50} value compared to the scaffolds is 1.29 (only counting the unique molecules), which amounts to about 19.7 times enhancement of inhibition potency in terms of IC_{50}. Those predicted to have pIC_{50} larger than 8, which will be of more interest in practice, belong to 20% of the unique generations. These results show the applicability of our model, with minimal extension, to many of the practical situations where the preparation of data becomes problematic.

Because generating molecules from a scaffold can leverage the properties already present in it, optimizing or retaining target properties can be easier than by generating molecules de novo. Suppose a pharmacological activity is targeted to be optimized, then generating the best molecular structure from scratch is likely to be hard because of the property's intricate relationship with molecular structures. In such a case, using a scaffold having moderate activity can make the optimization more feasible. A similar advantage is expected in controlling multiple properties simultaneously. If one's objective is to acquire, for instance, synthetic accessibility of the generated molecule as well as to enhance the activity, using a well-synthesizable scaffold allows the objective to be more focused on the latter, increasing the efficiency of search.

We evaluated our model by examining the validity, uniqueness and novelty of generated molecules. Despite the constraint on the search space imposed by scaffolds, the model showed comparable results with regard to previous SMILES-based and graph-based molecular generative models. Our model consistently worked well in terms of the three metrics when new scaffolds, which were not in the training set, were given. This means that the model achieved good generalization rather than memorizing the pairings between the scaffolds and molecules in the training set. In addition, while retaining the given scaffolds, our model successfully generated new molecules having desired degrees of molecular properties such as the molecular weight, topological polar surface area and octanol–water partition coefficient. The property-controlled generation could incorporate multiple molecular properties simultaneously. We also showed that our model can incorporate semi-supervised learning in applications like designing protein inhibitors, one of the common situations where only a small amount of labeled data is available. From all the results we presented, we believe that our scaffold-based molecular graph generative model provides a practical way of optimizing the functionality of molecules with preserved core structures.

- P. G. Polishchuk, T. I. Madzhidov and A. Varnek, J. Comput.-Aided Mol. Des., 2013, 27, 675–679 CrossRef CAS PubMed.
- S. Kim, P. A. Thiessen, E. E. Bolton, J. Chen, G. Fu, A. Gindulyte, L. Han, J. He, S. He, B. A. Shoemaker, J. Wang, B. Yu, J. Zhang and S. H. Bryant, Nucleic Acids Res., 2016, 44, D1202–D1213 CrossRef CAS PubMed.
- K. H. Bleicher, Y. Wüthrich, G. Adam, T. Hoffmann and A. J. Sleight, Bioorg. Med. Chem. Lett., 2002, 12, 3073–3076 CrossRef CAS PubMed.
- G. L. Card, L. Blasdel, B. P. England, C. Zhang, Y. Suzuki, S. Gillette, D. Fong, P. N. Ibrahim, D. R. Artis, G. Bollag, M. V. Milburn, S.-H. Kim, J. Schlessinger and K. Y. J. Zhang, Nat. Biotechnol., 2005, 23, 201–207 CrossRef CAS PubMed.
- M. E. Welsch, S. A. Snyder and B. R. Stockwell, Curr. Opin. Chem. Biol., 2010, 14, 347–361 CrossRef CAS PubMed.
- Y. Im, M. Kim, Y. J. Cho, J.-A. Seo, K. S. Yook and J. Y. Lee, Chem. Mater., 2017, 29, 1946–1963 CrossRef CAS.
- J. Dhar, U. Salzner and S. Patil, J. Mater. Chem. C, 2017, 5, 7404–7430 RSC.
- A. Al Mousawi, F. Dumur, P. Garra, J. Toufaily, T. Hamieh, B. Graff, D. Gigmes, J. P. Fouassier and J. Lalevée, Macromolecules, 2017, 50, 2747–2758 CrossRef CAS.
- X. Sun, F. Wu, C. Zhong, L. Zhu and Z. Li, Chem. Sci., 2019, 10, 6899–6907 RSC.
- B. Sanchez-Lengeling and A. Aspuru-Guzik, Science, 2018, 361, 360–365 CrossRef CAS PubMed.
- H. Chen, O. Engkvist, Y. Wang, M. Olivecrona and T. Blaschke, Drug Discovery Today, 2018, 23, 1241–1250 CrossRef PubMed.
- M. H. S. Segler, T. Kogej, C. Tyrchan and M. P. Waller, ACS Cent. Sci., 2018, 4, 120–131 CrossRef CAS PubMed.
- A. Gupta, A. T. Müller, B. J. H. Huisman, J. A. Fuchs, P. Schneider and G. Schneider, Mol. Inf., 2018, 37, 1700111 CrossRef.
- E. J. Bjerrum and B. Sattarov, Biomolecules, 2018, 8(4), 131 CrossRef PubMed.
- R. Gómez-Bombarelli, J. N. Wei, D. Duvenaud, J. M. Hernández-Lobato, B. Sánchez-Lengeling, D. Sheberla, J. Aguilera-Iparraguirre, T. D. Hirzel, R. P. Adams and A. Aspuru-Guzik, ACS Cent. Sci., 2018, 4, 268–276 CrossRef PubMed.
- S. Kang and K. Cho, J. Chem. Inf. Model., 2019, 59, 43–52 CrossRef CAS PubMed.
- J. Lim, S. Ryu, J. W. Kim and W. Y. Kim, J. Cheminf., 2018, 10, 31 Search PubMed.
- D. Polykovskiy, A. Zhebrak, D. Vetrov, Y. Ivanenkov, V. Aladinskiy, P. Mamoshina, M. Bozdaganyan, A. Aliper, A. Zhavoronkov and A. Kadurin, Mol. Pharm., 2018, 15, 4398–4405 CrossRef CAS PubMed.
- G. Lima Guimaraes, B. Sanchez-Lengeling, C. Outeiral, P. L. Cunha Farias and A. Aspuru-Guzik, arXiv e-prints, arXiv:1705.10843, 2017.
- N. Jaques, S. Gu, D. Bahdanau, J. M. Hernández-Lobato, R. E. Turner and D. Eck, Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 2017, pp. 1645–1654 Search PubMed.
- M. Olivecrona, T. Blaschke, O. Engkvist and H. Chen, J. Cheminf., 2017, 9, 48 Search PubMed.
- D. Neil, M. H. S. Segler, L. Guasch, M. Ahmed, D. Plumbley, M. Sellwood and N. Brown, 6th International Conference on Learning Representations, Workshop Track Proceedings, Vancouver, BC, Canada, 2018 Search PubMed.
- M. Popova, O. Isayev and A. Tropsha, Sci. Adv., 2018, 4, eaap7885 CrossRef CAS.
- W. Jin, R. Barzilay and T. Jaakkola, Proceedings of the 35th International Conference on Machine Learning, Stockholmsmässan, Stockholm Sweden, 2018, pp. 2323–2332 Search PubMed.
- Y. Li, O. Vinyals, C. Dyer, R. Pascanu and P. Battaglia, 6th International Conference on Learning Representations, Workshop Track Proceedings, Vancouver, BC, Canada, 2018 Search PubMed.
- J. You, B. Liu, Z. Ying, V. Pande and J. Leskovec, in Advances in Neural Information Processing Systems 31, ed. S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi and R. Garnett, Curran Associates, Inc., 2018, pp. 6410–6421 Search PubMed.
- Y. Li, L. Zhang and Z. Liu, J. Cheminf., 2018, 10, 33 Search PubMed.
- R. Assouel, M. Ahmed, M. H. Segler, A. Saffari and Y. Bengio, arXiv e-prints, arXiv:1811.09766, 2018.
- Q. Liu, M. Allamanis, M. Brockschmidt and A. Gaunt, in Advances in Neural Information Processing Systems 31, ed. S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi and R. Garnett, Curran Associates, Inc., 2018, pp. 7795–7804 Search PubMed.
- M. Simonovsky and N. Komodakis, Artificial Neural Networks and Machine Learning – ICANN 2018, Cham, Switzerland, 2018, pp. 412–422 Search PubMed.
- N. De Cao and T. Kipf, arXiv e-prints, arXiv:1805.11973, 2018.
- M. Skalic, J. Jiménez, D. Sabbadin and G. De Fabritiis, J. Chem. Inf. Model., 2019, 59, 1205–1214 CrossRef CAS.
- D. P. Kingma and M. Welling, 2nd International Conference on Learning Representations, Banff, AB, Canada, 2014 Search PubMed.
- D. Weininger, J. Chem. Inf. Model., 1988, 28, 31–36 CrossRef CAS.
- Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang and P. S. Yu, arXiv e-prints, arXiv:1901.00596, 2019.
- J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals and G. E. Dahl, Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 2017, pp. 1263–1272 Search PubMed.
- D. Polykovskiy, A. Zhebrak, B. Sanchez-Lengeling, S. Golovanov, O. Tatanov, S. Belyaev, R. Kurbanov, A. Artamonov, V. Aladinskiy, M. Veselov, A. Kadurin, S. Nikolenko, A. Aspuru-Guzik and A. Zhavoronkov, arXiv e-prints, arXiv:1811.12823, 2018.
- J. You, R. Ying, X. Ren, W. Hamilton and J. Leskovec, Proceedings of the 35th International Conference on Machine Learning, Stockholmsmässan, Stockholm Sweden, 2018, pp. 5708–5717 Search PubMed.
- G. W. Bemis and M. A. Murcko, J. Med. Chem., 1996, 39, 2887–2893 CrossRef CAS PubMed.
- A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow and B. Frey, 4th International Conference on Learning Representations, Workshop Track Proceedings, San Juan, Puerto Rico, 2016 Search PubMed.
- J. He, D. Spokoyny, G. Neubig and T. Berg-Kirkpatrick, 7th International Conference on Learning Representations, Conference Track Proceedings, New Orleans, LA, USA, 2019 Search PubMed.
- I. Goodfellow, arXiv e-prints, arXiv:1701.00160, 2016.
- R.-R. Griffiths and J. M. Hernández Lobato, Chem. Sci., 2020 10.1039/C9SC04026A.
- M. J. Kusner, B. Paige and J. M. Hernández-Lobato, Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 2017, pp. 1945–1954 Search PubMed.
- H. Dai, Y. Tian, B. Dai, S. Skiena and L. Song, 6th International Conference on Learning Representations, Conference Track Proceedings, Vancouver, BC, Canada, 2018 Search PubMed.
- O. Mahmood and J. M. Hernández-Lobato, arXiv e-prints, arXiv:1905.09885, 2019.
- T. Ma, J. Chen and C. Xiao, in Advances in Neural Information Processing Systems 31, ed. S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi and R. Garnett, Curran Associates, Inc., 2018, pp. 7113–7124 Search PubMed.
- B. Samanta, A. De, G. Jana, P. K. Chattaraj, N. Ganguly and M. Gomez-Rodriguez, arXiv e-prints, arXiv:1802.05283, 2018.
- Y. Kwon, J. Yoo, Y.-S. Choi, W.-J. Son, D. Lee and S. Kang, J. Cheminf., 2019, 11, 70 Search PubMed.
- P. Battaglia, R. Pascanu, M. Lai, D. Jimenez Rezende and K. Kavukcuoglu, in Advances in Neural Information Processing Systems 29, ed. D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon and R. Garnett, Curran Associates, Inc., 2016, pp. 4502–4510 Search PubMed.
- RDKit: Open-Source Cheminformatics, http://www.rdkit.org Search PubMed.
- InterBioScreen Ltd, http://www.ibscreen.com.
- N. Brown, M. Fiscato, M. H. Segler and A. C. Vaucher, J. Chem. Inf. Model., 2019, 59, 1096–1108 CrossRef CAS PubMed.
- D. P. Kingma, S. Mohamed, D. Jimenez Rezende and M. Welling, in Advances in Neural Information Processing Systems 27, ed. Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence and K. Q. Weinberger, Curran Associates, Inc., 2014, pp. 3581–3589 Search PubMed.
- A. Gaulton, L. J. Bellis, A. P. Bento, J. Chambers, M. Davies, A. Hersey, Y. Light, S. McGlinchey, D. Michalovich, B. Al-Lazikani and J. P. Overington, Nucleic Acids Res., 2011, 40, D1100–D1107 CrossRef PubMed.

## Footnotes |

† Electronic supplementary information (ESI) available: The full algorithm and implementation of our model, and the hyperparameters used in the experiments. See DOI: 10.1039/c9sc04503a |

‡ These authors contributed equally to this work. |

This journal is © The Royal Society of Chemistry 2020 |