Jan H. Jensen
Department of Chemistry, University of Copenhagen, Copenhagen, Denmark. E-mail: jhjensen@chem.ku.dk; Web: http://www.twitter.com/janhjensen
First published on 11th February 2019
This paper presents a comparison of a graph-based genetic algorithm (GB-GA) and machine learning (ML) results for the optimization of logP values with a constraint for synthetic accessibility and shows that the GA is as good as or better than the ML approaches for this particular property. The molecules found by the GB-GA bear little resemblance to the molecules used to construct the initial mating pool, indicating that the GB-GA approach can traverse a relatively large distance in chemical space using relatively few (50) generations. The paper also introduces a new non-ML graph-based generative model (GB-GM) that can be parameterized using very small data sets and combined with a Monte Carlo tree search (MCTS) algorithm. The results are comparable to previously published results (Sci. Technol. Adv. Mater., 2017, 18, 972–976) using a recurrent neural network (RNN) generative model, and the GB-GM-based method is several orders of magnitude faster. The MCTS results seem more dependent on the composition of the training set than the GA approach for this particular property. Our results suggest that the performance of new ML-based generative models should be compared to that of more traditional, and often simpler, approaches such a GA.
In this paper I present a comparison of graph-based GA and ML results for the optimization of logP values with a constraint for synthetic accessibility and show that the GA is as good as or better than the ML approaches for this particular property. I also introduce a new non-ML graph-based generative model that can be parameterized using very small data sets and combined with a Monte Carlo tree search algorithm. The implementation of both methods relies on the open source RDKit cheminformatics package and the codes are made freely available.
In order to create realistic-looking molecules, such as those in the ZINC data set, the mutations and the choice of element are made with probabilities obtained from an analysis of a subset of the molecules in the ZINC data set. An analysis of the first 1000 molecules in the ZINC data set shows that 63% of the atoms are ring atoms, so the ring-creation or ring-insertion mutation is chosen 63% of the time. The most common 3-atom combination in rings is C=C–C, which accounts for 45% of all 3-atom combinations in rings (Table 1), so a C∼C → C=C–C mutation is chosen 45% of the time, and similarly for the 34 other X∼Y∼Z combinations found in the data set. The site of insertion/creation is chosen randomly and excludes aromatic and six-membered rings. Similarly, for addition the most common bond involving at least one non-ring atom is C–C, so a C → C–C mutation is chosen more often (see Table S1† for more details). A more specific bonding analysis, e.g. addition of C to C=C vs. C–C, was considered, but then the most probable bonding patterns are often not found in the early stages of molecule growth and the growth process effectively stops.
| Bonding | ZINC | GB-GM (62%) | GB-GM (80%) |
|---|---|---|---|
| C=C–C | 45% | 41% | 53% |
| C–C–C | 15% | 23% | 21% |
| C–C–N | 9% | 9% | 6% |
| C–N–C | 6% | 7% | 5% |
| C=C–N | 4% | 6% | 4% |
| N=C–C | 3% | 2% | 2% |
| C=N–C | 2% | 2% | 1% |
| C–C–O | 2% | 2% | 2% |
| N=C–N | 2% | 0% | 0% |
| C=N–N | 2% | 0% | 0% |
| C–O–C | 1% | 1% | 1% |
| C–N–N | 1% | 1% | 1% |
| C=C–S | 1% | 1% | 0% |
| C–S–C | 1% | 1% | 1% |
| C=C–O | 1% | 1% | 1% |
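The probability-weighted choice of mutation described above can be sketched in Python. The probabilities are the ZINC-derived values from Table 1, truncated here to the most common patterns; `pick_mutation` and `pick_bonding_pattern` are illustrative helper names, not the actual GB-GM code:

```python
import random

# Fraction of ring atoms in the first 1000 ZINC molecules:
# ring-creation/insertion is chosen with this probability.
P_RING_MUTATION = 0.63

# Most common X~Y~Z bonding patterns in rings (Table 1, truncated);
# the remaining probability mass is spread over the rarer patterns.
RING_PATTERNS = {
    "C=C-C": 0.45, "C-C-C": 0.15, "C-C-N": 0.09,
    "C-N-C": 0.06, "C=C-N": 0.04, "N=C-C": 0.03,
}

def pick_bonding_pattern(rng=random):
    """Pick a ring bonding pattern with ZINC-derived probabilities."""
    patterns = list(RING_PATTERNS)
    weights = [RING_PATTERNS[p] for p in patterns]
    return rng.choices(patterns, weights=weights, k=1)[0]

def pick_mutation(rng=random):
    """Choose between a ring mutation and an atom addition."""
    if rng.random() < P_RING_MUTATION:
        return ("ring", pick_bonding_pattern(rng))
    return ("addition", None)
```

In the actual implementation the chosen pattern is then applied to the molecular graph with RDKit.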
The GB-GM-MCTS code is a version of the mcts.py code14 modified for leaf parallelization with a maximum of 25 leaf nodes. The exploration factor is 1/√2, and rollout is terminated if the molecule exceeds the target size, as described for the GB-GA. Any three- or four-membered alkene rings are subsequently expanded to five-membered rings by inserting C atoms. The reward function is 1 if the predicted J(m) value (see below) is larger than the largest value found so far and 0 otherwise. This reward function was found to work slightly better than the one used by Yang et al.2
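A minimal sketch of the node selection and reward logic described above, following the standard UCT formula used in the mcts.py code (the `best_so_far` bookkeeping is an illustrative simplification of how the reward is computed during search):

```python
from math import log, sqrt

SCALAR = 1 / sqrt(2)  # UCT exploration factor

def uct_score(child_reward, child_visits, parent_visits, scalar=SCALAR):
    """UCT: average reward of a child plus an exploration bonus."""
    exploit = child_reward / child_visits
    explore = sqrt(2.0 * log(parent_visits) / child_visits)
    return exploit + scalar * explore

def reward(j_value, best_so_far):
    """1 if this rollout beats the best J(m) found so far, else 0."""
    return 1.0 if j_value > best_so_far else 0.0
```

With this reward, a child node only accumulates credit when its rollouts set a new record, which biases the search towards branches that keep improving the score.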
J(m) = logP(m) − SA(m) − RingPenalty(m)   (1)
J(m) depends both on the number and types of atoms and can be made arbitrarily large by increasing the molecular size. Therefore it is important to limit the molecular size in order to make a fair comparison to previous studies. Yang et al.2 predict SMILES strings with a maximum length of 81 characters, but it is difficult to translate that restriction directly to molecular size since many of the characters do not refer to atoms. Instead, the 20 molecules found by Yang et al.2 are used to determine an average target size of 39.15 ± 3.50 non-H atoms.
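The score in eqn (1) can be sketched as follows. The ring penalty here follows the common penalized-logP convention of penalizing rings with more than six atoms, which is an assumption on my part; the logP and SA values are taken as precomputed inputs, since computing them from a molecule requires RDKit:

```python
def ring_penalty(ring_sizes):
    """Penalty for the largest ring with more than six atoms (0 if none).
    Assumes the common penalized-logP convention."""
    largest = max(ring_sizes, default=0)
    return max(largest - 6, 0)

def j_score(logp, sa, ring_sizes):
    """Eqn (1): J(m) = logP(m) - SA(m) - RingPenalty(m)."""
    return logp - sa - ring_penalty(ring_sizes)
```

Because logP grows with the number of hydrophobic atoms, this score is unbounded in molecular size, which is why the target-size constraint above is needed for a fair comparison.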
The average maximum J(m)-score for the GA is 6.8 ± 0.7 and 7.4 ± 0.9 using a 50% and 1% mutation rate, respectively (Table 2). For comparison, the best average maximum value found by Yang et al.2 is 5.6 ± 0.5, which required close to 20000 J(m)-score evaluations. The latter calculations took 8 hours each, while the GB-GA calculations took 30 seconds each on a laptop. These J(m)-scores are also significantly larger than those of the other ML-based methods tried by Yang et al.2 (Table 2): a recurrent neural network (RNN) with and without Bayesian optimization (BO), the continuous variational autoencoder4 (CVAE), and the grammar variational autoencoder3 (GVAE).
| Method | Average J(m) | No. molecules | CPU time |
|---|---|---|---|
| GB-GA (50%) | 6.8 ± 0.7 | 1000 | 30 seconds |
| GB-GA (1%) | 7.4 ± 0.9 | 1000 | 30 seconds |
| GB-GM-MCTS (62%) | 2.6 ± 0.6 | 1000 | 90 seconds |
| GB-GM-MCTS (80%) | 3.4 ± 0.6 | 1000 | 90 seconds |
| GB-GM-MCTS (80%) | 4.3 ± 0.6 | 5000 | 9 minutes |
| ChemTS | 4.9 ± 0.5 | ∼5000 | 2 hours |
| ChemTS | 5.6 ± 0.5 | ∼20000 | 8 hours |
| RNN + BO | 4.5 ± 0.2 | ∼4000 | 8 hours |
| Only RNN | 4.8 ± 0.2 | ∼20000 | 8 hours |
| CVAE + BO | 0.0 ± 0.9 | ∼100 | 8 hours |
| GVAE + BO | 0.2 ± 1.3 | ∼1000 | 8 hours |
Fig. 3a and b show the molecules with the two highest J(m)-scores found by the GB-GA. These scores, 8.8 and 8.5, are slightly larger than the three largest values (7.8–8.0) obtained by You et al.6 using a graph convolutional policy network approach. The two molecules bear little resemblance to the molecules used to construct the initial mating pool. The molecules in the ZINC data set that are most similar to these two molecules have respective Tanimoto similarity scores of just 0.27 and 0.12 and corresponding J(m) values of 0.9 and −2.4 (Fig. S1†). This indicates that the GB-GA approach can traverse a relatively large distance in chemical space using relatively few (50) generations.
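The Tanimoto similarity used for this comparison is, in terms of fingerprint bit sets, the size of the intersection divided by the size of the union. A minimal sketch, with plain Python sets standing in for RDKit fingerprints:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprint bit sets:
    |A ∩ B| / |A ∪ B|."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)
```

A score of 0.12 thus means the two molecules share almost no fingerprint bits, i.e. they are structurally very dissimilar.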
Fig. 4 shows how the maximum J(m) value evolves with each generation for 10 different GB-GA runs. While most of the runs have largely peaked after about 20 generations, the three best performing runs (runs 1, 6, and 10) show significant improvements between generations 30 and 40, so running fewer than 50 generations cannot be recommended for J(m) maximisation. None of the runs increased J(m) significantly after a period of 20 generations with little or no change in J(m) (with the possible exception of run 7). So a good strategy may be to terminate a GB-GA run if the J(m) value has not changed for more than 20 generations (at least for J(m) maximisation).
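The stopping rule suggested above (terminate once the best J(m) has not improved for 20 generations) can be sketched as follows; `patience` and `tol` are illustrative parameter names, not part of the GB-GA code:

```python
def should_stop(best_per_generation, patience=20, tol=1e-6):
    """True if the best J(m) has not improved over the last
    `patience` generations (within tolerance `tol`)."""
    if len(best_per_generation) <= patience:
        return False
    recent_best = max(best_per_generation[-patience:])
    earlier_best = max(best_per_generation[:-patience])
    return recent_best <= earlier_best + tol
```

Calling this after every generation would have stopped most of the runs in Fig. 4 early without missing their final scores, though run 7 shows this heuristic is not entirely risk-free.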
Fig. 4 Plot of the highest J(m) value found as a function of generation for 10 different GB-GA runs with a mutation rate of 1%.
Not surprisingly, there are bigger differences for larger-scale features not specifically encoded in the rules, such as the type of ring (Table 3). For example, there are many fewer benzene-type rings ([*]1=[*]–[*]=[*]–[*]=[*]1) as well as cyclopentane ([*]1–[*]–[*]–[*]–[*]1) and cyclohexane ([*]1–[*]–[*]–[*]–[*]–[*]1) type rings, while there are more of most of the other types compared to the reference set. The reason is that the molecules in the ZINC data set are not made by randomly adding atoms but by assembling larger functional groups such as aliphatic and aromatic rings. As a result, the average ring composition does not reflect the most likely ring compositions. It is possible to scale the probabilities to skew the results towards one ring-type or another. For example, in the last column the probabilities are scaled such that the probability of X=Y–Z is 80% rather than the 62% found in the reference set, which increases the number of benzene-like rings from 479 to 850 at the expense of the aliphatic rings.
| Ring-type | ZINC | GM (62%) | GM (80%) |
|---|---|---|---|
| [*]1–[*]–[*]1 | 57 | 104 | 57 |
| [*]1–[*]–[*]–[*]1 | 17 | 33 | 10 |
| [*]1–[*]–[*]–[*]–[*]1 | 280 | 15 | 4 |
| [*]1=[*]–[*]–[*]–[*]1 | 120 | 396 | 408 |
| [*]1=[*]–[*]=[*]–[*]1 | 470 | 132 | 221 |
| [*]1–[*]–[*]–[*]–[*]–[*]1 | 409 | 64 | 2 |
| [*]1=[*]–[*]–[*]–[*]–[*]1 | 77 | 363 | 104 |
| [*]1=[*]–[*]=[*]–[*]–[*]1 | 100 | 591 | 405 |
| [*]1=[*]–[*]–[*]=[*]–[*]1 | 7 | 321 | 202 |
| [*]1=[*]–[*]=[*]–[*]=[*]1 | 1206 | 479 | 850 |
| 7-Membered ring | 24 | 0 | 0 |
| 8-Membered ring | 1 | 0 | 0 |
| Total | 2768 | 2498 | 2263 |
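The rescaling described above, raising the total probability of the double-bond-containing X=Y–Z patterns from 62% to 80%, amounts to scaling that subset of probabilities up and renormalizing the rest so the distribution still sums to one. A sketch, with illustrative pattern names and simplified probabilities:

```python
def rescale(probs, subset, target_mass):
    """Scale the probabilities of the keys in `subset` so they sum to
    `target_mass`; renormalize the remaining keys to 1 - target_mass."""
    current = sum(probs[k] for k in subset)
    rest = 1.0 - current
    scaled = {}
    for k, p in probs.items():
        if k in subset:
            scaled[k] = p * target_mass / current
        else:
            scaled[k] = p * (1.0 - target_mass) / rest
    return scaled
```

The relative probabilities within each group are preserved; only the split between double-bond and single-bond patterns changes.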
Fig. 3c and d show the highest scoring molecules found using 1000 and 5000 tree traversals, respectively. The molecules are similar to those found by Yang et al.2 in that they consist mostly of benzene rings, the most common hydrophobic structural motif in the ZINC data set. The GB-GA results show that higher J(m)-scores can be obtained using long aliphatic chains, but this structural motif is relatively rare in the ZINC data set and is therefore rarely suggested by the generative models.
The paper also introduces a new non-ML graph-based generative model (GB-GM) that can be parameterized using very small data sets and combined with a Monte Carlo tree search (MCTS) algorithm such as the one used by Yang et al.2 The results are comparable to the results obtained by Yang et al.2 using a recurrent neural network (RNN) generative model, with maximum J(m)-values of 4.3 ± 0.6 compared to 4.9 ± 0.5 found using 5000 property evaluations. While the results are slightly worse than the RNN results, the GB-GM-based method is several orders of magnitude faster. In both cases the MCTS approach essentially extracts the main hydrophobic structural motif (a benzene ring) found in the training set. The MCTS results thus seem more dependent on the composition of the training set than the GA approach for this particular property.
While the results are quite encouraging, it is important to perform similar comparisons for other properties before drawing general conclusions. In a very recent study Brown et al.16 have compared GB-GA and GB-GM-MCTS methods to SMILES-based ML and GA methods on 20 different optimization problems and conclude that “In terms of optimization performance, the best model is a genetic algorithm based on a graph representation of molecules”. Our and their results strongly suggest that the performance of new ML-based generative models should be compared to that of more traditional, and often simpler, approaches such as GAs. Both the GB-GA and GB-GM-MCTS codes used in this study are freely available with the open source RDKit toolkit as the only dependency.
Footnote
† Electronic supplementary information (ESI) available: The codes used in this study can be found on GitHub: github.com/jensengroup/GB-GA/tree/v0.0 and github.com/jensengroup/GB-GM/tree/v0.0. See DOI: 10.1039/c8sc05372c |
This journal is © The Royal Society of Chemistry 2019 |