Rustam Zhumagambetov,a
Ferdinand Molnár,b
Vsevolod A. Peshkov*c and
Siamac Fazli*a
aDepartment of Computer Science, School of Engineering and Digital Sciences, Nazarbayev University, Nur-Sultan, Kazakhstan. E-mail: siamac.fazli@nu.edu.kz
bDepartment of Biology, School of Sciences and Humanities, Nazarbayev University, Nur-Sultan, Kazakhstan
cDepartment of Chemistry, School of Sciences and Humanities, Nazarbayev University, Nur-Sultan, Kazakhstan. E-mail: vsevolod.peshkov@nu.edu.kz
First published on 27th July 2021
Recent advances in convolutional neural networks have inspired the application of deep learning to other disciplines. Even though image processing and natural language processing have turned out to be the most successful, many other domains have also benefited; among them, life sciences in general and chemistry and drug design in particular. In line with this observation, since 2018 the scientific community has seen a surge of methodologies related to the generation of diverse molecular libraries using machine learning. However, to date, attention mechanisms have not been employed for the problem of de novo molecular generation. Here we employ a variant of transformers, an architecture recently developed for natural language processing, for this purpose. Our results indicate that the adapted Transmol model is indeed applicable to the task of generating molecular libraries and leads to statistically significant increases in some of the core metrics of the MOSES benchmark. The presented model can be tuned to either input-guided or diversity-driven generation modes by applying a standard one-seed and a novel two-seed approach, respectively. Accordingly, the one-seed approach is best suited for the targeted generation of focused libraries composed of close analogues of the seed structure, while the two-seed approach allows us to dive deeper into under-explored regions of the chemical space by attempting to generate molecules that resemble both seeds. To gain more insights into the scope of the one-seed approach, we devised a new validation workflow that involves the recreation of known ligands for an important biological target, the vitamin D receptor. To further benefit the chemical community, the Transmol algorithm has been incorporated into our cheML.io web database of ML-generated molecules as a second-generation on-demand methodology.
Medicinal chemistry is a highly interdisciplinary field of science that deals with the design, chemical synthesis, and mechanism of action of biologically active molecules as well as their development into marketed pharmaceutical agents (i.e. drugs). The creation of new drugs is an incredibly hard and arduous process, one of the key reasons being the fact that the ‘chemical space’ of all possible molecules is extremely large and intractable. On account of estimates that the chemical space of molecules with pharmacological properties is in the range of 10^23 to 10^60 compounds,1 this order of magnitude leaves the work of finding new drugs outside the reach of manual labor.
To cure a particular disease, medicinal chemists need to determine molecules that are active and selective towards specific biological targets while keeping the risks of negative side effects minimal.2–5 As the number of molecules that require testing to identify an ideal drug candidate constantly increases, it raises the overall cost of the drug discovery process.6 Therefore, the need for algorithms that are able to narrow down and streamline these efforts has recently emerged. Specifically, computer algorithms can assist with creating new virtual molecules,7,8 performing molecule conformation analysis9,10 as well as molecular docking11,12 to determine the affinity of novel and known molecules towards specific biological targets.
With respect to molecular generation, the conventional non-neural algorithms heavily rely on external expert knowledge to design candidate molecules. In this context, expert knowledge may consist of molecular fragments that can be “mixed and matched” to produce a set of novel molecules.13–18 However, the resulting molecules might be difficult to synthesize. Consequently, another type of expert knowledge i.e. known chemical reactions can then be added.19,20 In this way the “mix and match” approach will be constrained with a specific set of rules to maximize the chances that any molecule that is produced can be synthesized.21 The obvious drawback of relying on external knowledge is that it may restrict access to unknown and/or not yet populated regions of chemical space.
An alternative approach to this problem is offered by neural algorithms, which are inherently data-driven. This means that such algorithms do not rely on expert knowledge and instead derive insights from the data itself. Such approaches can be applied in supervised and unsupervised settings.22–24 Supervised algorithms use artificial neural networks for the prediction of molecular properties25,26 or reaction outputs.27 Most unsupervised algorithms are aimed at molecular generation and drug design.8,22,28–48
When automating molecular search, a natural question that arises is how molecules, a physical collection of atoms that are arranged in 3D space, can be represented. In the late 1980s, the simplified molecular input line-entry system (SMILES) emerged, which aims to create a molecular encoding that is computationally efficient to use and human readable.49 The original encoding is based on 2D molecular graphs. Intended application areas are fast and compact information retrieval and storage. With the rise of machine learning algorithms in chemistry, the SMILES representation has been widely adopted by researchers for chemical space modeling tasks.
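To illustrate why SMILES lends itself to sequence modelling, the snippet below encodes aspirin as a SMILES string and splits it into the character-level tokens a language model would consume. This is a simplified sketch of our own making (the regex-based tokenizer and the handling of two-letter elements such as Cl and Br are illustrative assumptions, not the exact preprocessing of any particular tool):

```python
import re

# Aspirin (acetylsalicylic acid) as a SMILES string: atoms, bonds,
# branches "(...)" and ring-closure digits all live in one line of text.
ASPIRIN = "CC(=O)Oc1ccccc1C(=O)O"

# Match two-letter elements first so "Cl"/"Br" stay single tokens.
TOKEN_RE = re.compile(r"Cl|Br|.")

def tokenize(smiles):
    """Split a SMILES string into the tokens a sequence model consumes."""
    return TOKEN_RE.findall(smiles)
```

Because every character is consumed by exactly one token, joining the tokens back together reproduces the original string, which makes the representation lossless for sequence models.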
The task of generating more SMILES strings given an input string can be viewed as a language modelling task. The model takes an arbitrary-length SMILES string (the seed molecule) and returns arbitrarily long SMILES strings; the general class of such models is called Seq2Seq. First used by Google,50 it is currently one of the most popular approaches for machine translation. The Seq2Seq architecture can be summarized as follows: an encoder produces a context vector that is then passed to a decoder, which generates output in an auto-regressive fashion.50
As the building blocks of the encoder and decoder could be neural networks with various architectures, researchers have experimented with a number of techniques: generative adversarial networks,30,38,44,45,51 variational autoencoders,39,52 and recurrent neural networks.53,54 However, the use of attention mechanisms has, to the best of our knowledge, so far been left unexplored. Here we aim to fill this gap and investigate the applicability of attention to molecular generation.
One of the distinctive features of attention is that, unlike for example RNN-type models, which have a fixed amount of memory, attention mechanisms allow varying the length of the latent vector so that the input information does not have to be compressed into a fixed-length vector. In other words, attention provides a varying-length vector that contains the related information regarding all input nodes (i.e. SMILES characters, atoms, rings and branching info, etc.). Thus, in the case of the standard one-seed approach, the outcome strives to replicate the input, which means that by injecting some variability into the model the algorithm can be tuned to attain a focused library of close seed analogues. In order to switch from this genuinely input-guided mode to a more diversity-driven mode, we have come up with the idea of a two-seed approach. This novel approach is not only capable of generating diversified molecular libraries by assembling structures that resemble both seeds, but it also provides a Euclidean navigation framework for the latent space and can thus lead to further insights during the exploration of chemical space. The resulting Transmol algorithm, comprising the two generation modes described above, has been incorporated into the cheml.io55 website with the aim of making these findings accessible to a wider audience.
The first dataset was used to train the model. During this stage the model learns to interpolate between molecules and constructs a latent space. This latent space acts as a proxy distribution for molecules and can be employed to sample new ones.
The testing dataset consists of molecules that are not present in the training data. It is used to evaluate the output of the generative model.
The scaffold testing set consists of scaffolds that are not present in the training and testing datasets. This dataset is used to check if the model is able to generate new scaffolds, i.e. unique molecular features, or whether the model simply reuses parts of the previously seen molecules to generate new ones.
For this work, we have employed a vanilla transformer model introduced by Vaswani et al.58 (see Fig. 1 for a pictorial representation of the architecture). All code for this study has been written in Python and is publicly available on GitLab.59 The initial transformer code was adapted from the annotated version.60 A vanilla transformer consists of two parts: an encoder and a decoder. The encoder (see the left dashed block of Fig. 1) maps the input to a latent representation z. The decoder (see the dashed block on the right of Fig. 1) accepts z as an input and produces one symbol at a time. The model is auto-regressive, i.e. to produce a new symbol it requires the previous output as an additional input.
The notable attribute of this architecture is the use of attention mechanisms throughout the whole model. While earlier models used attention only as an auxiliary layer on top of some kind of recurrent neural network (RNN), such as a gated recurrent unit (GRU) or long short-term memory (LSTM), or a convolutional neural network (CNN), the transformer consists primarily of attention layers.
The attention mechanism can be seen as a function of a query Q, a key K and a value V, where the output is computed as:

Attention(Q, K, V) = softmax(QK^T/√d_k)V    (1)

where d_k denotes the dimensionality of the keys.
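A minimal NumPy sketch of eqn (1), scaled dot-product attention, assuming queries, keys and values are stored row-wise (function names are ours):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # pairwise query-key similarities
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted mixture of value rows
```

Since each row of the weight matrix is a convex combination, every output row is a weighted average of the value rows; this is what allows the latent representation to grow with the input length instead of being squeezed into a single fixed-size vector.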
The parameter setup of the original paper has been employed here, i.e. the number of stacked encoder and decoder layers is N = 6, all sublayers produce an output of d_model = 512, the dimensionality of the inner feed-forward layer is d_ff, the number of attention heads is h = 6, and the dropout rate is d = 0.1. The learning rate follows the optimization schedule of the original manuscript with the following modified hyperparameters: warmup steps are set to 2000, the factor is 1, the batch size is 1200, and training was performed for 16 epochs. The remaining hyperparameters were adopted unchanged from the original paper and were not further optimized, owing to their large number and our limited computing resources. Please note that further hyperparameter optimization may thus improve the performance of the model.
Several metrics are provided by the MOSES benchmark. Uniqueness shows the proportion of generated molecules that are distinct from one another. Validity describes the proportion of generated molecules that are chemically sound, as checked by RDKit.67 Internal diversity measures whether the model samples from the same region of chemical space, i.e. it penalizes producing molecules that are valid and unique but differ only in a single atom. Filters measures the proportion of a generated set that passes a number of medicinal chemistry filters. Since the training set contains only molecules that pass these filters, it is an implicit constraint imposed on the algorithm.
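Once all SMILES strings are canonicalized, uniqueness and the related novelty metric reduce to simple set arithmetic. The sketch below uses toy strings standing in for canonical SMILES (MOSES itself relies on RDKit for canonicalization and validity checking, which is omitted here):

```python
def uniqueness(generated):
    """Fraction of generated molecules that are distinct from one another."""
    return len(set(generated)) / len(generated)

def novelty(generated, training):
    """Fraction of distinct generated molecules absent from the training set."""
    distinct = set(generated)
    return len(distinct - set(training)) / len(distinct)
```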
Fragment similarity (Frag) measures the similarity of the BRICS fragment distributions of the reference and generated sets. If the value is 1, then all fragments from the reference set are present in the generated one. If the value is 0, then there are no overlapping fragments between the generated and reference sets. Scaffold similarity (Scaff) is similar to Frag, but Bemis–Murcko scaffolds are used for comparison instead of BRICS fragments. The range of this metric is analogous to Frag. Similarity to the nearest neighbor (SNN) is the mean Tanimoto similarity between a molecule in the reference set and its closest neighbor in the generated set. One possible interpretation of this metric is precision: if the value is low, the algorithm generates molecules that are distant from the molecules in the reference set. The range of this metric is [0, 1]. Fréchet ChemNet Distance (FCD) is a metric that correlates with internal diversity and uniqueness. It is computed using the penultimate layer of ChemNet, a deep neural network trained for the prediction of biological activities. All four aforementioned metrics have also been computed against the scaffold test dataset.
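The SNN computation can be sketched as follows, with fingerprints represented as Python sets of "on" bit indices (real implementations use RDKit Morgan fingerprints; the set representation and function names here are our simplification):

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits:
    |intersection| / |union|, in the range [0, 1]."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def snn(reference, generated):
    """Mean similarity of each reference fingerprint to its nearest
    neighbor in the generated set."""
    return sum(max(tanimoto(r, g) for g in generated)
               for r in reference) / len(reference)
```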
VDR is a member of the nuclear receptor superfamily, a zinc-finger transcription factor and the canonical receptor for its most active secosteroid, 1α,25-dihydroxyvitamin D3 (1,25D3) (Fig. 3A). It has also been established as a sensor for steroid acids such as lithocholic acid (Fig. 3B), predominantly acting as an activator of their detoxification pathway in the human colon through the regulation of cytochrome P450 monooxygenases, e.g. CYP3A4. The classical VDR physiology involves the activation and regulation of diverse processes such as mineral homeostasis, cellular proliferation and differentiation, and the modulation of innate and adaptive immunity.68 Hence, its dysfunction may be connected to serious maladies, making VDR a suitable target for drug development.69 To date, more than 3000 vitamin D (VD) analogues have been synthesized, largely owing to the side effects of 1,25D3 such as hypercalcemia, with some of them belonging to a completely new group of analogues called non-steroidal VD analogues, of which an example is shown in Fig. 3C.70
The dataset for recreating VDR ligands was extracted from the ZINC database.71 The ZINC database is a reliable, curated database with all compounds available for purchase and testing. From this database, 418 VDR ligands were selected. After canonicalizing and stripping the stereoisomeric information from these molecules, 210 SMILES strings remained. We further divided these 210 molecules into approximately equal-sized training and test sets. The training data was used for model fine-tuning and subsequently as seed molecules to create focused libraries of potential VDR ligands. The test set was used as a reference for the comparison of Transmol-generated molecules. While the sample size of this dataset is comparatively small, it is composed of several subsets of VDR ligands, with each subset containing molecules that are structurally similar to each other. Therefore, we theorize that Transmol may be able to extract enough information from the training data to recreate previously known VDR ligands and thus mimic the process of creating new ones. To reduce the impact of randomness, we perform this procedure ten times.
A detailed workflow for examining the recreation of known VDR ligands with Transmol can be summarized in the following way:
(1) Acquire 418 known VDR ligands from the ZINC database.
(2) Canonicalize them and discard the stereoisomeric information, reducing the set to 210 molecules.
(3) Randomly sample 100 structures and use them as a training set. The remaining 110 constitute the test set.
(4) Use the training set to fine-tune Transmol. Fine-tuning is a common technique in deep learning: instead of training a new model from scratch, a set of model parameters trained on a large dataset is reused and additionally trained on a smaller dataset.
(5) Use a subset of training molecules as seeds for Transmol (30 randomly selected ligands from the training set).
(6) Compare Transmol output to test set molecules and record overlap.
(7) Repeat steps 3 to 6 ten times.
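Steps (3) and (6) of this workflow amount to a seeded random split plus a set intersection. A minimal sketch (ligand strings are placeholders, and the fine-tuning and generation steps in between are deliberately left out):

```python
import random

def split_ligands(ligands, n_train=100, seed=None):
    """Step (3): randomly partition ligands into training and test sets."""
    rng = random.Random(seed)
    shuffled = list(ligands)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

def recovered(generated, test_set):
    """Step (6): generated molecules that match held-out known ligands."""
    return set(generated) & set(test_set)
```

Repeating the split with ten different seeds reproduces the ten sampling cycles of step (7).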
| Model | Valid (↑) | Unique@1k (↑) | Unique@10k (↑) | IntDiv (↑) | IntDiv2 (↑) | Filters (↑) | Novelty (↑) |
|---|---|---|---|---|---|---|---|
| Train | 1 | 1 | 1 | 0.8567 | 0.8508 | 1 | 1 |
| HMM | 0.076 ± 0.0322 | 0.623 ± 0.1224 | 0.5671 ± 0.1424 | 0.8466 ± 0.0403 | 0.8104 ± 0.0507 | 0.9024 ± 0.0489 | 0.9994 ± 0.001 |
| NGram | 0.2376 ± 0.0025 | 0.974 ± 0.0108 | 0.9217 ± 0.0019 | 0.8738 ± 0.0002 | 0.8644 ± 0.0002 | 0.9582 ± 0.001 | 0.9694 ± 0.001 |
| Combinatorial | 1.0 ± 0.0 | 0.9983 ± 0.0015 | 0.9909 ± 0.0009 | 0.8732 ± 0.0002 | 0.8666 ± 0.0002 | 0.9557 ± 0.0018 | 0.9878 ± 0.0008 |
| CharRNN | 0.9748 ± 0.0264 | 1.0 ± 0.0 | 0.9994 ± 0.0003 | 0.8562 ± 0.0005 | 0.8503 ± 0.0005 | 0.9943 ± 0.0034 | 0.8419 ± 0.0509 |
| AAE | 0.9368 ± 0.0341 | 1.0 ± 0.0 | 0.9973 ± 0.002 | 0.8557 ± 0.0031 | 0.8499 ± 0.003 | 0.996 ± 0.0006 | 0.7931 ± 0.0285 |
| VAE | 0.9767 ± 0.0012 | 1.0 ± 0.0 | 0.9984 ± 0.0005 | 0.8558 ± 0.0004 | 0.8498 ± 0.0004 | 0.997 ± 0.0002 | 0.6949 ± 0.0069 |
| JTN-VAE | 1.0 ± 0.0 | 1.0 ± 0.0 | 0.9996 ± 0.0003 | 0.8551 ± 0.0034 | 0.8493 ± 0.0035 | 0.976 ± 0.0016 | 0.9143 ± 0.0058 |
| LatentGAN | 0.8966 ± 0.0029 | 1.0 ± 0.0 | 0.9968 ± 0.0002 | 0.8565 ± 0.0007 | 0.8505 ± 0.0006 | 0.9735 ± 0.0006 | 0.9498 ± 0.0006 |
| Transmol | 0.0694 ± 0.0004 | 0.9360 ± 0.0036 | 0.9043 ± 0.0036 | 0.8819 ± 0.0003 | 0.8708 ± 0.0002 | 0.8437 ± 0.0015 | 0.9815 ± 0.0004 |
As can be seen from Table 1, Transmol has a low validity score. One possible reason is the architecture of the network. In architectures like autoencoders or GANs, the model is evaluated on how well it replicates the input during training. Here, in contrast, the model learned the molecular space by maximizing the likelihood of the next symbol. The aim of this work was to explore the molecular space through SMILES representations, and our model is successful in this task. It is natural that some areas of this space are uninhabited, i.e. contain only invalid SMILES. One possible way to remedy the low validity would be to integrate autoencoder or GAN components into the model, since these methods explicitly reward accurate replication of the input. Another potential solution would be to optimize the model directly towards high validity.
Table 2 shows more sophisticated metrics that require the comparison of two molecular sets, reference and generated: Fréchet ChemNet Distance (FCD), similarity to the nearest neighbor (SNN), fragment similarity (Frag), and scaffold similarity (Scaff). In this table, we observe that Transmol has a relatively high FCD score compared to neural methods and a comparable or lower score relative to non-neural algorithms. This is a surprising result that we had not anticipated considering the high scores of Transmol for internal diversity and the relatively high scores for uniqueness. One probable reason is that in our work we used 8000 molecules as seeds out of ≈170,000 molecules in the MOSES test set. As a result, we created 8000 focused libraries. The individual outputs exhibit high internal diversity, since each focused set results in unique molecules. In addition, these focused sets are diverse with respect to each other. The high FCD score indicates dissimilarity between our generated set and the MOSES test set. The reason for this is that we are generating very focused libraries and therefore cannot (and do not need to) capture the whole variability of the MOSES test set. If the number of seed molecules were increased substantially, the FCD value would decrease.
| Model | FCD Test (↓) | FCD TestSF (↓) | SNN Test (↑) | SNN TestSF (↑) | Frag Test (↑) | Frag TestSF (↑) | Scaf Test (↑) | Scaf TestSF (↑) |
|---|---|---|---|---|---|---|---|---|
| Train | 0.008 | 0.4755 | 0.6419 | 0.5859 | 1 | 0.9986 | 0.9907 | 0 |
| HMM | 24.4661 ± 2.5251 | 25.4312 ± 2.5599 | 0.3876 ± 0.0107 | 0.3795 ± 0.0107 | 0.5754 ± 0.1224 | 0.5681 ± 0.1218 | 0.2065 ± 0.0481 | 0.049 ± 0.018 |
| NGram | 5.5069 ± 0.1027 | 6.2306 ± 0.0966 | 0.5209 ± 0.001 | 0.4997 ± 0.0005 | 0.9846 ± 0.0012 | 0.9815 ± 0.0012 | 0.5302 ± 0.0163 | 0.0977 ± 0.0142 |
| Combinatorial | 4.2375 ± 0.037 | 4.5113 ± 0.0274 | 0.4514 ± 0.0003 | 0.4388 ± 0.0002 | 0.9912 ± 0.0004 | 0.9904 ± 0.0003 | 0.4445 ± 0.0056 | 0.0865 ± 0.0027 |
| CharRNN | 0.0732 ± 0.0247 | 0.5204 ± 0.0379 | 0.6015 ± 0.0206 | 0.5649 ± 0.0142 | 0.9998 ± 0.0002 | 0.9983 ± 0.0003 | 0.9242 ± 0.0058 | 0.1101 ± 0.0081 |
| AAE | 0.5555 ± 0.2033 | 1.0572 ± 0.2375 | 0.6081 ± 0.0043 | 0.5677 ± 0.0045 | 0.991 ± 0.0051 | 0.9905 ± 0.0039 | 0.9022 ± 0.0375 | 0.0789 ± 0.009 |
| VAE | 0.099 ± 0.0125 | 0.567 ± 0.0338 | 0.6257 ± 0.0005 | 0.5783 ± 0.0008 | 0.9994 ± 0.0001 | 0.9984 ± 0.0003 | 0.9386 ± 0.0021 | 0.0588 ± 0.0095 |
| JTN-VAE | 0.3954 ± 0.0234 | 0.9382 ± 0.0531 | 0.5477 ± 0.0076 | 0.5194 ± 0.007 | 0.9965 ± 0.0003 | 0.9947 ± 0.0002 | 0.8964 ± 0.0039 | 0.1009 ± 0.0105 |
| LatentGAN | 0.2968 ± 0.0087 | 0.8281 ± 0.0117 | 0.5371 ± 0.0004 | 0.5132 ± 0.0002 | 0.9986 ± 0.0004 | 0.9972 ± 0.0007 | 0.8867 ± 0.0009 | 0.1072 ± 0.0098 |
| Transmol | 4.3729 ± 0.0466 | 5.3308 ± 0.0428 | 0.6160 ± 0.0005 | 0.4614 ± 0.0007 | 0.9564 ± 0.0009 | 0.9496 ± 0.0009 | 0.7394 ± 0.0009 | 0.0183 ± 0.0065 |
Another observation is that Transmol demonstrates superiority for SNN/Test and Scaf/Test compared to non-neural baselines and is comparable to other neural algorithms. Here, Test stands for the random test set and TestSF for the scaffold-split test set. For SNN/Test, Transmol is a top-2 algorithm. Another thing to note is the TestSF column. The original authors of the benchmark recommend comparing the TestSF columns when the goal is to generate molecules with scaffolds that are not present in the training set. However, this comparison should be used with caution, since the test scaffold set is not all-encompassing: it does not contain every scaffold that is absent from the training dataset. Taking into consideration that the metrics in Table 2 compute overlaps between the two sets, the TestSF part of the metrics should not be over-interpreted.
Fig. 5 demonstrates distribution plots of the baselines and the Transmol output compared to the test set. The distribution plots are similar to histograms, but instead of showing discrete bins, they smooth the observations using a Gaussian kernel. The plots compare four molecular properties: molecular weight (MW), octanol–water partition coefficient (logP), quantitative estimation of drug-likeness (QED), and synthetic accessibility score (SA). To quantify the distance between the test set and the generated set, the Wasserstein-1 distance was used (value in brackets). The results show that Transmol matches, or even improves upon, the SA distribution of the test set, while exhibiting a higher variance for the other properties. This shows that Transmol is not as close to the test set distribution as other neural algorithms, implying a higher diversity, but it is not as far from it as some of the simpler combinatorial baselines.
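For two equal-size one-dimensional samples, the empirical Wasserstein-1 distance reduces to the mean absolute difference of the sorted values. A minimal sketch under that equal-size assumption (for unequal sample sizes one would fall back to e.g. scipy.stats.wasserstein_distance):

```python
import numpy as np

def wasserstein1(x, y):
    """Empirical Wasserstein-1 distance between two equal-size 1-D
    samples: mean absolute difference of the order statistics."""
    xs = np.sort(np.asarray(x, dtype=float))
    ys = np.sort(np.asarray(y, dtype=float))
    assert xs.shape == ys.shape, "equal sample sizes assumed"
    return float(np.mean(np.abs(xs - ys)))
```

Intuitively, the distance is the average amount of "probability mass transport" needed to turn one property distribution into the other, so identical distributions score 0 and a pure shift scores the shift size.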
As can be seen in Fig. 6 Transmol has the highest proportion of molecules that satisfy the rule of 3 among the neural baselines. In addition, among non-neural algorithms, Transmol has the highest proportion of molecules that satisfy the Ghose filter and REOS.
Across the 10 sampling cycles our algorithm was able to recover 27 known VDR ligands from the employed ZINC dataset. Given that the ZINC database allowed us to extract the structures of 210 VDR ligands with stripped stereoisomeric information, we have recovered 12.9% of all structures within this dataset.
The recovery process was performed 10 times where VDR ligands were re-sampled into the training and test set randomly. The minimum and maximum number of recovered molecules per cycle were 1 and 9, respectively. On average the algorithm generated 4.1 molecules per sampling cycle.
Of the 27 recovered molecules, 56% had a secosteroid backbone (Fig. 3A) and the rest were non-steroidal VD analogues (Fig. 3C). The algorithm did not recover any steroid-backbone-based analogues (Fig. 3B). It is remarkable that the algorithm was able to recover the natural ligand 1,25D3 (Fig. 3A), as well as the first non-steroidal 3,3-diphenylpentane analogue YR301 (Fig. 3C). It should be noted that the group of non-steroidal ligands contains two subgroups: the first comprises canonical ligands that bind to the ligand-binding pocket of VDR, as secosteroids and steroid acids do; the second contains possible coactivator mimetics that irreversibly compete with the coactivator interaction on the surface of VDR. Thus, the above-described approach constitutes a newly established benchmark for the validation of generative algorithms, which examines how well a model can recreate the known ligands of a particular biological target.
While in the basic case the vector z12 is obtained by averaging the individual vectors, a more general approach is to compute a weighted sum. In the default case, the weights of the vectors z1 and z2 are both set to 0.5. However, other valid weight combinations can also be used to navigate the latent space. We have noticed that when the weights of both vectors are close to 0.5, the algorithm tends to return only very simple molecules with low molecular weight. This is especially true when both seed molecules have a molecular weight of more than 250 and contain more than 15 non-hydrogen atoms, meaning that their SMILES strings are relatively long. Therefore, we have varied the α parameter of the reward function of the beam search to adjust the length of the decoded SMILES strings and, as a result, the complexity of the resulting molecules. On the other hand, when the α parameter is set too high, the algorithm may return only a limited number of molecules. Thus, to create enough molecules of larger size and increased complexity, different combinations of the weight distribution and the sampling reward parameter α can be tested. Overall, the possibility to vary the seeds' weight distribution along with the sampling reward parameter α provides a viable means of traversal across molecular space (see ESI for specific comparisons). Of course, not all the generated structures are stable and/or synthesizable. Nonetheless, they may still be used as inspirational ligands for molecular docking.
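The weighted two-seed combination described above can be sketched as a convex combination of the two latent representations. This assumes equal-length latent vectors for both seeds; the function name and the fixed-length simplification are ours:

```python
import numpy as np

def combine_seeds(z1, z2, w=0.5):
    """Blend two seed latent vectors: w = 0.5 reproduces plain averaging,
    while other weights bias generation towards one of the seeds."""
    z1 = np.asarray(z1, dtype=float)
    z2 = np.asarray(z2, dtype=float)
    return w * z1 + (1.0 - w) * z2
```

Sweeping w from 0 to 1 then traces a straight line between the two seeds in latent space, which is what gives the approach its Euclidean navigation flavor.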
Fig. 8 illustrates how molecular sampling from the latent distribution using two seed molecules can help with expanding the scaffold diversity. The resulting molecular library demonstrates a diversity of structural features that would be unattainable through simple fragment substitution or through the application of the alternative one-seed approach.
Both the one-seed and two-seed modes of Transmol are incorporated into the cheml.io55 website as one of the methods allowing the generation of molecules on demand. For the two-seed approach, the user can specify both seeds, the weight distribution between them, and the sampling reward parameter α that influences the length of the decoded SMILES string.
Footnote
† Electronic supplementary information (ESI) available. See DOI: 10.1039/d1ra03086h
This journal is © The Royal Society of Chemistry 2021 |