Open Access Article
AkshatKumar
Nigam
ab,
Robert
Pollice
ab,
Mario
Krenn
abc,
Gabriel dos Passos
Gomes
ab and
Alán
Aspuru-Guzik
*abcd
aDepartment of Computer Science, University of Toronto, Canada. E-mail: alan@aspuru.com
bDepartment of Chemistry, University of Toronto, Canada
cVector Institute for Artificial Intelligence, Toronto, Canada
dLebovic Fellow, Canadian Institute for Advanced Research (CIFAR), 661 University Ave, Toronto, Ontario M5G, Canada
First published on 20th April 2021
Inverse design allows the generation of molecules with desirable physical quantities using property optimization. Deep generative models have recently been applied to tackle inverse design, as they possess the ability to optimize molecular properties directly through structure modification using gradients. While the ability to carry out direct property optimizations is promising, the use of generative deep learning models to solve practical problems requires large amounts of data and is very time-consuming. In this work, we propose STONED – a simple and efficient algorithm to perform interpolation and exploration in the chemical space, comparable to deep generative models. STONED bypasses the need for large amounts of data and training times by using string modifications in the SELFIES molecular representation. First, we achieve non-trivial performance on typical benchmarks for generative models without any training. Additionally, we demonstrate applications in high-throughput virtual screening for the design of drugs, photovoltaics, and the construction of chemical paths, allowing for both property and structure-based interpolation in the chemical space. Overall, we anticipate our results to be a stepping stone for developing more sophisticated inverse design models and benchmarking tools, ultimately helping generative models achieve wider adoption.
In this work, using SELFIES as a robust molecular representation, we propose an efficient set of algorithms (STONED) to perform exploration and interpolation in the chemical space (Section II A). These tasks are commonly addressable by expensive deep generative models or stochastic optimization approaches like evolutionary methods.24 Our algorithm avoids the need for extensive training times, large datasets, and hand-crafted rules for obtaining novel molecules, and allows to interpolate deterministically between molecules. We achieve this via string manipulations of SELFIES and demonstrate the ability to form local chemical subspaces (Section II B), allowing for local optimization, and obtain chemical paths (Section II C), enabling interpolation between structures. Additionally, we demonstrate applications in designing molecules for material science (Section II D) and drug development (Section II C 2). On established benchmarks, our algorithm achieves non-trivial results despite not using any sophisticated optimization engines and is comparable in its capabilities to the state of the art in generative modeling. The ease of obtaining molecules for local optimization and interpolation via chemical paths allows for our methods to be used in high-throughput virtual screening for materials science,25 catalysis,26 and drug design.27 Ultimately, we anticipate that our results will stimulate more powerful models, more meaningful benchmarks, and more widespread use of generative models in general.
We started this work by performing point mutations of the molecules aripiprazole, albuterol, and mestranol32 in the SELFIES representation to generate local chemical subspaces. A point mutation in the SELFIES representation corresponds to a single character addition, deletion or replacement. As delineated in Table 1, STONED is able to generate vast local chemical subspaces requiring only one data point as a seed. Additionally, in comparison to the state of the art in deep generative modeling for molecular design, our algorithm is an order of magnitude faster. Notably, for each of these experiments, the respective fingerprints suggested in the analogous GuacaMol benchmarks were used. Fig. 2 illustrates the ability of our algorithm to generate diverse structures in the neighborhood of the known drug celecoxib.34 As expected, we observe that the fraction of unique molecules obtained decreases with more stringent structure-based fingerprint similarity requirements. Importantly, this is a general feature of the SELFIES representation. As depicted in Fig. S2 (left),† mutating molecules randomly in the SELFIES representation rarely preserves high molecular similarity. Additionally, molecular similarity to the initial structure, on average, decreases with the number of mutations performed which is intuitive.
000 random string mutations of the starting structures. Additionally, for celecoxib, we also formed the local chemical space with a scaffold constraint
| Starting structure (method) | Fingerprint | Number of molecules (and percentage) | ||
|---|---|---|---|---|
| δ > 0.75 | δ > 0.60 | δ > 0.40 | ||
| Aripirazole (SELFIES, random) | ECFP4 | 513 (0.25%) | 4206 (2.15%) | 34 416 (17.66%) |
| Albuterol (SELFIES, random) | FCFP4 | 587 (0.32%) | 4156 (2.33%) | 16 977 (9.35%) |
| Mestranol (SELFIES, random) | AP | 478 (0.22%) | 4079 (1.90%) | 45 594 (21.66%) |
| Celecoxib (SELFIES, random) | ECFP4 | 198 (0.10%) | 1925 (1.00%) | 18 045 (9.44%) |
| Celecoxib (SELFIES, terminal 10%) | ECFP4 | 864 (2.02%) | 9407 (21.99%) | 34 187 (79.91%) |
| Celecoxib (SELFIES, central 10%) | ECFP4 | 111 (0.08%) | 1767 (1.32%) | 15 348 (11.45%) |
| Celecoxib (SELFIES, initial 10%) | ECFP4 | 368 (0.53%) | 7345 (10.53%) | 34 702 (49.74%) |
| Celecoxib (SMILES, random) | ECFP4 | 122 (18.43%) | 515 (77.49%) | 662 (100.00%) |
| Celecoxib (SMILES, terminal 10%) | ECFP4 | 90 (20.79%) | 368 (84.99%) | 433 (100.00%) |
| Celecoxib (SMILES, central 10%) | ECFP4 | 114 (22.18%) | 419 (81.52%) | 514 (100.00%) |
| Celecoxib (SMILES, initial 10%) | ECFP4 | 122 (19.71%) | 490 (79.16%) | 619 (100.00%) |
| Celecoxib (DeepSMILES, random) | ECFP4 | 132 (4.43%) | 953 (31.99%) | 2793 (93.76%) |
| Celecoxib (DeepSMILES, terminal 10%) | ECFP4 | 106 (9.73%) | 513 (47.11%) | 1083 (99.45%) |
| Celecoxib (DeepSMILES, central 10%) | ECFP4 | 53 (6.54%) | 162 (19.98%) | 658 (81.13%) |
| Celecoxib (DeepSMILES, initial 10%) | ECFP4 | 105 (9.28%) | 609 (53.80%) | 1106 (97.70%) |
| Celecoxib (SELFIES, scaffold constraint) | ECFP4 | 354 (0.44%) | 6311 (7.79%) | 53 479 (66.07%) |
| Celecoxib (CReM, ChEMBL: SCScore ≤ 2.5) | ECFP4 | 239 (0.58%) | 5547 (13.47%) | 14 887 (36.14%) |
While the success rate of mutations leading to structurally similar molecules is relatively low (Table 1), our approach is extremely efficient, with the entire experiment running in just a few minutes on an ordinary laptop at the time of writing (Intel i7-8750H CPU, 2.20 GHz). In particular, the most time-consuming benchmark in Table 1 was the formation of the subspace of aripiprazole, completing in 500 seconds. The most expensive step in this experiment involved performing multiple SELFIES mutations and subsequently converting all mutated strings into SMILES, taking 400 seconds. Importantly, this step can be made more efficient by conducting mutations on different strings using parallel workers. Hence, this algorithm possesses extensive parallelizability. For comparison, using the same setup, we also formed the local chemical subspace of celecoxib using either SMILES or DeepSMILES. For SMILES, merely 0.30% of the mutated structures corresponded to valid molecules. With DeepSMILES, merely 1.44% of the mutated structures were valid. In addition, we observed that random mutations within SMILES and DeepSMILES led to lower structural diversity compared to SELFIES (see Table 1). Additionally, we also formed the chemical subspace of celecoxib while preserving a pre-selected scaffold (see celecoxib (SELFIES, scaffold constraint) in Table 1). Discarding all mutated strings that do not contain the scaffold, i.e., keeping only 2.8% of all mutated strings, STONED readily proposed a large number of structures in the neighborhood of celecoxib. Overall, the speed and scalability of our methods suggest that it can be readily applied to extend datasets used in machine learning for creating more robust generalizable models.
Importantly, we also found that a general strategy for preserving molecular similarity during random SELFIES mutations of the starting structure is to restrict the location of the SELFIES changes. Restricting the mutations to either the initial or the terminal characters yields mutated structures that are more similar to the initial structure than when the mutation position is either chosen randomly or restricted to the middle characters (see Table 1 and Fig. S3;† initial, central or terminal 10%). It should be emphasized that this is not just a curious finding but can be used systematically to choose between exploration and exploitation for the structural space generated using STONED. In addition, it can be employed in conjunction with scaffold constraints as restricting the mutations to the terminal 10% of the SELFIES also increases the probability to retain scaffolds. We repeated the scaffold constraint experiment from above but restricted the mutations to the terminal 10% and found that 36.3% of all mutated strings retained the scaffold, which corresponds to a more than 10-fold increase in the scaffold retention rate. Notably, trying to use the same type of character mutation restriction for SMILES or DeepSMILES does not provide the same kind of tunability between exploitation and exploration of the generated structures. As additional comparison to alternative methods, we also generated the local chemical subspace of celecoxib using the recently developed expert system CReM, a fragment-based approach.35 Taking fragments and mutation rules from a subset of ChEMBL36,37 with an SCScore ≤ 2.5, CReM generates significantly more structures in the structural neighborhood than fully random SELFIES mutations but less than when SELFIES mutations are restricted to the terminal characters. This shows that STONED is comparable in performance to expert systems like CReM.
Notably, in the experiments described above, we performed mutations solely on the starting structure. A natural extension is to repeat the procedure on all distinct neighbours, i.e., molecules produced by point mutations from the initial structure, thereby extending the subspace search significantly. To demonstrate the power of this approach, starting from the randomly mutated structures of celecoxib, we repeated the random mutations on all unique molecules obtained in the first step. Consequently, we generated more than 17 million unique molecules, 120 thousand of which have a similarity greater than 0.4 with respect to celecoxib (see Fig. S4†) showing that the structural coverage of the local subspace can be increased immensely by including structural next nearest neighbors of the initial seed molecule.
Furthermore, we wanted to demonstrate the full potential of the chemical subspace exploration by replacing the ECFP4 fingerprint similarity with 3D fingerprints to form geometry-based chemical subspaces. To do that, we generated conformers of celecoxib and 2350 of its mutants with RDKit using the implemented conformer ensemble routine. The lowest-energy conformer was selected and the 3D similarity between the structures of celecoxib and its mutants was estimated using the E3FP similarity metric.38 Consequently, we found 206 structures with an E3FP similarity larger than 0.2, 31 of which were even larger than 0.3. Selected structures are depicted in Fig. 3 with an overlay of the corresponding conformers with the structure of celecoxib. This shows that generating the 3D similarity space with STONED and E3FP similarity is straightforward allowing it to be applied to structure-based inverse design. Notably, we hypothesize that the 2D or 3D structure-based fingerprints can also be replaced with efficient property-based molecular descriptors39–41 for systematic property space exploration in an analogous way.
Importantly, the similarity of proposed median structures to the references can be gauged via structure-based fingerprint similarity measures. In GuacaMol, a median molecule (i.e., m) of two known structures (i.e., m′, m′′) is assessed based on the geometric mean of the respective fingerprint similarities to the two reference structures. The higher the geometric mean, the better the median molecule. However, we observe that maximizing the geometric mean of fingerprint similarities does not capture joint molecular similarity satisfactorily. The metric can return large values despite possessing high similarity only to one structure (see Section S2†). Therefore, we propose to redefine joint similarity F(m) for an arbitrary number of reference molecules M = {m1, m2, …}; n = |M|, which is discussed in detail in the ESI (Section S2),† as follows, to penalize higher similarities to only a subset of the reference molecules more severely:
![]() | (1) |
In the subsequent sections, we investigate the behaviour of this joint molecular similarity along chemical paths between molecules which inadvertently leads to the generation of median molecules.
Importantly, while a monotonically increasing fingerprint similarity score is not observed along paths generated deterministically between two SELFIES, one can extract chemical paths by requiring fingerprint similarities to increase and removing all structures that lead to similarity drops. Compared to generating chemical paths using SELFIES stochastically, by making use of evolutionary algorithms, our deterministic approach leads to a speedup of more than one order of magnitude. To avoid holes in the beginning of the chemical paths, we imposed the requirement for increasing fingerprint similarities only after the first point mutation of the starting structure. Because of the speed and parallelizability of this chemical path generation method, motivated by the idea that similarity in structure can correspond to similarity in properties, we looked into properties of molecules along chemical paths. As an initial test, we considered the water–octanol partition coefficient (log
P)46 and the quantitative estimate of drug-likeness (QED)47 in paths between the known drugs tadalafil and sildenafil (Fig. 4(a)), as estimated using RDKit.48 One of these chemical paths is shown in Fig. 5 (top), and the similarities to the starting and target structures along the path as well as the comparison of the corresponding geometric mean joint similarities and our newly defined joint similarities are illustrated in Fig. S6 (top).† These results demonstrate that the redefined joint similarity is more reliable for indicating molecules that are similar to several reference structures simultaneously.
Moreover, we analyzed the binding affinity estimated via docking49 in chemical paths between dihydroergotamine and prinomastat as a more challenging type of property to optimize (Fig. 4(b)). Dihydroergotamine and prinomastat have been discussed in the literature as potential inhibitors for the protein structures of serotonin (5-HT1B)50 and P450 2D6 (CYP2D6).51 The 5-HT1B receptor is a target for antimigraine drugs such as ergotamine and dihydroergotamine.50 P450 2D6, on the other hand, contributes to the metabolism and elimination of more than 15% of the drugs used in clinical practice. Among individuals, considerable variations exist in the efficacy and amount of CYP2D6 enzyme production. As a result, a clinical drug dose may need to be altered to account for the metabolization speed of CYP2D6.52 Prinomastat, as an inhibitor, decreases enzyme production, thereby allowing increased efficacy of certain drugs. Our goal in this experiment is to find molecules encountered along the paths between dihydroergotamine and prinomastat that can simultaneously bind (i.e., possess negative binding affinities large in magnitude) to both proteins (see Fig. 4(b)). One selected chemical path is depicted in Fig. 5 (bottom). Moreover, we also compared both the similarities to the reference molecules along the path and our redefined joint similarity metric to the corresponding geometric mean joint similarities (see Fig. S6 (bottom)†). These diagrams show again that the joint similarity introduced in this work avoids molecules that are similar to only one of the reference structures. For the selected path, the docked molecules are depicted in Fig. S7.† Notably, there are significant structural jumps along these chemical paths, i.e., successive molecules show relatively large structure changes. They largely stem from the condition that every molecule along a chemical path needs to increase in similarity to the target. Accordingly, molecules along the full path that led to a decrease in similarity to the target after the first mutation were removed causing these large changes in the remaining molecules. Notably, some of the molecules obtained have unstable functional groups or would tautomerize in solution to a different structure. To improve both their stability and synthetic feasibility, rules based on domain knowledge can be implemented to filter out structures that are infeasible.
Importantly, this experiment demonstrates the ability to perform efficient structural interpolation between molecules without the need to form continuous representations within deep generative models. Our simple algorithm for obtaining chemical paths possesses considerable potential for parallelization and does not need a large number of data points as input. Particularly, Cieplinski et al.53 noted that with realistic training set sizes (i.e., consisting of a few thousand points), deep generative models have difficulty optimizing docking scores. In contrast, as illustrated in Fig. 4(b), our approach to forming chemical paths between two known ligands yields several structures with non-trivial binding affinities to both proteins without any optimization routine.
We compared the ability of the obtained median molecules to resemble the triplet references in structure (Fig. 6 (left)) and property (Fig. 6 (right)). A selection of the generated median molecules is shown in Fig. S9.† Notably, higher joint similarities indicate that the median molecules resemble the triplets more closely in structure. However, low values of the normalized property distance indicate that the median molecules have properties closer to the respective reference structures. For each triplet identified from the HCE database, we used the 100 median molecules with the highest joint similarities to the reference structures from chemical paths between the three reference structures (Unfiltered Medians). We observed that many of these median molecules possessed bridgehead atoms with double bonds, a very unstable structural feature.56 To remedy this problem, we added a simple filter discarding molecules with bridgehead atoms in general (Filtered Medians).
In Fig. 6, Random HCE refers to sampling 100 random structures from the HCE database for each triplet, while Best HCE refers to the 100 molecules with the highest joint similarities to the reference structures available within the database. Importantly, we found that the median molecules are significantly closer in both structure and target properties to the respective triplets compared to Random HCE. In addition, they are also closer to the respective triplets in the investigated properties compared to Best HCE showing that generating median molecules can be an effective strategy for performing multi-objective property optimization (see Fig. S8 and Table S3† for detailed statistics). Importantly, this task is a complicated multi-objective optimization in a chemical subspace tailored for a very specific application. Our method is able to produce molecules that are similar in structure to three molecules simultaneously. In that regard, our method produces structures similar in both structural similarity and property compared to a database of molecules obtained using a building block approach based on expert knowledge. Hence, our results are very promising for fully automated exploration of chemical subspaces based on a few reference structures without defining building blocks and rules to construct molecules.
Expert rules-based systems can yield median molecules as well,35,42,57,58 but their use can be application-dependent. For example, a potential algorithm could disassemble the reference structures into fragments by breaking rotatable bonds and then recombine the fragments in a building block approach, akin to the design of CReM.35 However, this technique would not be generalizable to molecules without rotatable bonds, such as fused polyaromatics, and more sophisticated algorithms would be required. Our method differs in that it requires no expert knowledge and relies solely on the graph representation of molecules within SELFIES and necessarily leads to a median molecule. Deep generative models can be used to avoid such problems, with expert knowledge being derived solely from a known dataset. However, they require many training examples, potentially even labeled training data depending on the specific task at hand. In contrast, our approach is both rules-free and training-free.
Finally, we compared the capabilities of STONED with established algorithms for generating molecules (Table 2). Similar to VAEs, GANs and RL approaches, STONED relies on random changes of molecules within a given representation superseding hard-coded expert rules.14,16 In contrast, expert systems (ES) typically incorporate fragment combination rules and heuristic synthesizability and stability checks.16,24,31 Moreover, as SELFIES covers the entire molecular space representable by molecular graphs, STONED allows the systematic exploration and generation of all these compounds. Importantly, neither of the alternative methods considered offer a comparable structure coverage. The hardcoded rules of ES tend to limit exploration, and within VAEs, GANs and RL the generated molecules have not been found to stray too far from the training set. Another important property we considered is interpolatability, i.e. the possibility to interpolate between two molecules deterministically. Interpolation in STONED is constrained by the number of distinct characters in the SELFIE string. VAEs and GANs can use geometric interpolation in the latent space. ES such as Molpher16 and the chemical space travel algorithm24 perform exploration and interpolation stochastically similar to a GA. RL techniques typically do not form a continuous representation, which limits their possibility for deterministic interpolation. Furthermore, VAEs, GANs and RL techniques are capable of property-based navigation, i.e., selecting structural modifications that are likely to improve properties. This is often achieved via property estimators such as neural networks. In VAEs, prediction networks are often employed for arranging latent representations based on properties allowing gradient-based navigation in the property space. Both STONED and ES can be used in GAs for property-based navigation, but only in a stochastic way. Additionally, VAEs, GANs and RL models require training which can be prohibitive due to the potential need for multiple GPUs. STONED and ES, in comparison, do not require any training. Finally, VAEs, GANs and RL require considerably large training datasets. Contrarily, STONED and ES require very few, if any, reference points. To summarize, STONED combines the merits of both classical ES and more sophisticated ML methods for molecule generation closing a gap in the available methods to navigate the chemical space.
| Feature | ES | VAE | GAN | RL | STONED |
|---|---|---|---|---|---|
| Expert rule-free | ✗ | ✓ | ✓ | ✓ | ✓ |
| Structure coverage | ∼ | ∼ | ∼ | ∼ | ✓ |
| Interpolatability | ✗ | ✓ | ✓ | ✗ | ✓ |
| Property-based navigation | ∼ | ✓ | ✓ | ✓ | ∼ |
| Training-free | ✓ | ✗ | ✗ | ✗ | ✓ |
| Data independence | ✓ | ✗ | ✗ | ✗ | ✓ |
The speed, parallelizability, and performance of STONED suggests that it can be used for practical tasks such as high-throughput virtual screening.59 In optimization algorithms such as genetic algorithms, we believe that median molecule generation through our approach can be used as a general crossover rule. The current evaluation standard for deep generative modeling includes producing valid molecules that resemble specific datasets.32,33 With the guarantee of molecular validity in SELFIES by design, perfect results in the validity benchmark can be trivially achieved. Furthermore, we demonstrate the simplicity of generating multiple structures that resemble a known set of molecules. Among other benchmarks, properties such as penalized log
P and QED do not represent the complexity of molecular design, making them an insignificant benchmarking objective. Accordingly, we also demonstrated application to more complicated multi-objective property optimizations including protein docking, dipole moments, LUMO energies and HOMO–LUMO gaps as target properties. By introducing STONED, a fast suite of algorithms that can compete reasonably with deep generative models on several recently introduced benchmarks, we believe that we provide a stepping stone to improve generative modeling for molecular design and its benchmarking.60
000 SMILES orderings representing the same structure, convert all of them to the SELFIES representation, and perform 1–5 point mutations. Hence, a total of 250
000 strings are generated per experiment. A single mutation consists of a SELFIES character replacement, deletion, and addition at random positions of the string. This process is repeated to perform multiple mutated structures. All the mutants are subsequently converted back to SMILES for calculating their similarity to the original molecule based on various fingerprint similarity measures. Within Table 1, this process is repeated for: aripirazole, albuterol, mestranol, celecoxib (SELFIES, random), celecoxib (SELFIES, terminal 10%), celecoxib (SELFIES, central 10%), celecoxib (SELFIES, initial 10%) and celecoxib (SELFIES, scaffold constraint). For celecoxib (SMILES, random) and celecoxib (DeepSMILES, random), up to 5 random mutations are performed within the corresponding representations on 50
000 randomly ordered strings. For celecoxib (CReM, ChEMBL: SCScore ≤ 2.5), we performed the MUTATE and GROW operations, using a database of fragments provided in the CreM GitHub repository (vert replacements02_sc2.5.dbvert).61 The mutate and grow operations were applied to celecoxib both with and without explicit hydrogen atoms, with the parameters vert max_sizevert and vert max_atomsvert set to 100.
In Section II C 2, within a path, we randomly sampled molecules that necessarily increase fingerprint similarity allowing for the formation of a chemical path. log
P and QED values of molecules in a path were estimated using RDKit.48 The docking scores were estimated with the SMINA open-source software62 using the setup proposed previously in the literature.53 Namely, the crystal structures for docking to 5-HT1B and CYP3D6 were obtained from the Protein Data Bank (PDB) (entry codes 4IAQ and 3QM4), the binding sites were selected manually, and the scores of the top 5 best-scoring binding poses were averaged to maximize consistency of the results. In both experiments, we considered different SMILES orderings of the starting and target molecules, respectively, and, between each pair, repeated the experiment several times, leading to different results, such that approximately 800 unique molecules from the paths were obtained. For path and chemical path formation between two SELFIES, we padded the string to the same length with a dummy character. The dummy character was removed from the SELFIES before converting to SMILES.
000 paths were obtained between randomized orderings of the respective SMILES string. We ran semiempirical calculations to obtain the dipole moments, LUMO energies and HOMO–LUMO gaps for the HCE database and the top-100 unique median structures using GFN2-xTB.55 Random SELFIES for the experiment were generated via random combinations of the 34 SELFIEcharacters part of the semantically robust alphabet. The length of the generated random SELFIES was restricted to the largest number of characters within the SELFIES representations of the three reference molecules.
Footnote |
| † Electronic supplementary information (ESI) available. See DOI: 10.1039/d1sc00231g |
| This journal is © The Royal Society of Chemistry 2021 |