A graph-based genetic algorithm and generative model/Monte Carlo tree search for the exploration of chemical space

This paper presents a comparison of a graph-based genetic algorithm (GB-GA) and machine learning (ML) results for the optimization of log P values with a constraint for synthetic accessibility and shows that the GA is as good as or better than the ML approaches for this particular property.


INTRODUCTION
Within the last few years several papers have been published on using machine learning (ML) to generate molecules with the aim of optimizing their properties. [1;2;3;4;5;6;7;8] While the performance of these generative models have been impressive, they are usually not compared to more traditional approaches such as genetic algorithms (GAs), [9;10;11;12] and recent work by Tsuda and co-workers [13] suggest that non-ML approaches may be competitive. The lack of comparison is perhaps in part due to the fact that there are no free or open source versions of these GA implementation available (ACSESS [11] uses the proprietary OpenEye cheminformatics toolkits).
In this paper I present a comparison of graph-based GA and ML results for the optimization of logP values with a constraint for synthetic accessibility and show that GA is as good or better than the ML approaches for this particular property. I also introduce a new non-ML graph-based generative model that can be parameterized using very small data sets and combined with a Monte Carlo tree search algorithm. The implementation of both methods rely on the opens source RDKit cheminformatics package and are made freely available. a graph representation of the molecules as opposed to, say, a string based representation such as SMILES. The crossovers and mutations are defined using RDKit's reaction SMILES. Following Brown et al. [9] a crossover can occur both at non-ring (Figure 1a-c) and ring bonds (Figure 1d-f), with equal probability and the position of the cuts are chosen randomly. Molecules with macrocycles (seven atoms or more - Figure  1f), allene centers in rings, fewer than five heavy atoms (Figure 1c), incorrect valences as determined by RDKit, and more non-H atoms than the target size are discarded. The target size is a random number sampled from a distribution with a mean 39.15 non-H atoms and a standard deviation of 3.50 non-H atoms as discussed later in the paper. The mutation operations to those used by Virshup et al. [11] ( Figure  2). The initial mating pool is drawn from a subset of the ZINC dataset used in previous studies, [2;4;6] as described below and molecules are sampled from the mating pool based on their normalized logP scores. show two examples of children made by the mating process. Methylflouride is discarded because it is too small and the cycloheptene ring is discarded because the ring is too large Figure 2. Overview of mutation operations and their associated probabilities, e.g. if an "append atom" mutation is chosen then a single bond is added 60% of the time. "∼" indicates an arbitrary bond order Table 1. Probability of the 15 most common 3-atom combination in rings in the first 1000 structures of the ZINC data set ("ZINC"), and in 1000 structures generate by the GB-GM method using the ZINC probabilities ("GB-GM(62%)" and a probability set where the probability of [*]=[*]-[*] type bonding is increased to 80% ("GB-GM(80%)") bonding ZINC GB-GM(62%) GB-GM(80%)

Graph-based generative model with Monte Carlo tree search (GB-GM-MCTS)
Tsuda and coworkers [2;5] have combined the text-based generative model developed by Segler et al. [1] with a Monte Carlo tree search (MCTS) algorithm to optimize molecular properties. In this approach each character in a SMILES string corresponds to a node in a tree network and each node is selected sequentially using the MCTS approach. Inspired by Virshup et al. [11] we developed a graph-based generative model (GB-GM) that grows the molecule one atom at a time and that can be combined with a MCTS. GB-GM uses the "append atom" and "insert atom" mutations ( Figure 2) for atom addition to non-ring and ring atoms, respectively, but with different probabilities as described below. In addition to these two mutation-types, a new tricyclic ring-creation mutation is used: X∼Y → X1∼Z∼Y1 where "1" indicates that X and Y are bonded and "∼" indicates an arbitrary bond order.
In order to create realistic looking molecules, such as those in the ZINC data set, the mutations and choice of element are chosen with probabilities obtained by an analysis of a subset of the molecules in the ZINC dataset. An analysis of first 1000 molecules in the ZINC dataset shows that 63% of the atoms are ring atoms, so the the ring-creation or ring-insertion mutation is chosen 63% of the time. The most common 3-atom combination in rings is C=C-C, which accounts for 45% of all 3-atom combinations in rings (Table 1), so a C∼C → C=C-C mutation is chosen 45% of the time, and similarly for the 34 other X∼Y∼Z combinations found in the data set. The site of insertion/creation is chosen randomly and excludes aromatic and six-membered rings. Similarly, for addition the most common bond involving at least one non-ring atom is C-C, so a C → C-C mutation is chosen more often (see Table S1 for more details). A more specific bonding analysis, e.g. addition of C to C=C vs C-C, was considered but then the most probable bonding patterns are often not found in the early stages of molecule growth and the growth process effectively stops.
The GB-GM-MCTS code is a modified version of the mcts.py code [14] modified for leaf parallelization with a maximum of 25 leaf nodes. The exploration factor is 1/ √ 2 and rollout is terminated if the molecule exceeds the target size as described for GB-GA. Any three-or four-membered alkene rings are subsequently expanded to five-membered rings by inserting C atoms. The reward function is 1 if the predicted J(m) value (see below) is larger than the largest value found so far and 0 otherwise. This reward function was found to work slightly better than the one used by Yang et al. [2] . Table 2. Maximum J(m) score averaged over 10 runs, the number of molecules evaluated per run, and the required CPU time per run. See text for an explanation of the methods. Results for the non-GB methods are talen from Yang et al. [2] where the number of molecules evaluated per run is estimated based on the average number of molecules generated per minute and the CPU time. The penalized logP score Following Gómez-Bombarelli et al. [4] and Yang et al. [2] the goal is to maximize J(m): where logP(m) is the octanol-water partition coefficient for molecule (m) as predicted by RDKit, SA(m) is a synthetic accessibility score, [15] and RingPenalty(m) penalizes unrealistically large rings by reducing the score by RingSize − 6 where RingSize is the number of atoms in a ring. Following previous studies, each property is normalized to have zero mean and unit standard deviation across the ZINC dataset.
J(m) depends both on the number and types of atoms and can be made arbitrarily large by increasing the molecular size. Therefore it is important to limit the molecular size in order to make a fair comparison to previous studies. Yang et al. [2] predicts SMILES strings with a maximum length of 81 characters, but it is difficult to translate that restriction directly to molecular size since many of the characters do not refer to atoms. Instead, the 20 molecules found by Yang et al. [2] are used to determine an average target size of 39.15±3.50 non-H atoms.

RESULTS AND DISCUSSION
GB-GA Ten GA simulations are performed and the maximum J(m) scores for each simulation is averaged. The population size is 20 and and 50 generations are used (i.e. 1000 J(m) evaluations per run). The initial mating pool is 20 random molecules sampled from the first 1000 molecules in the ZINC data set. The mean J(m)-score for this set is 0.2 and the maximum value is 3.6.
The average maximum J(m)-score for GA is 6.8±0.7 and 7.4±0.9 using a 50% and 1% mutation rate, respectively ( Table 2). For comparison the best average maximum value found by Yang et al. [2] is 5.6±0.5, which required close to 20,000 J(m)-score evaluations. The latter took 8 hours each, while the GB-GA calculations took 30 seconds each on a laptop. These J(m)-scores are also significantly larger than the other ML-based methods tried by Yang et al. [2] (Table 2): recurrent neutral network (RNN) with and without Bayesian optimization (BO), the continuous variational autoencoder [4] (CVAE), and the grammar variational autoencoder [3] (GVAE). graph convolutional policy network approach. The two molecules bear little resemblance to the molecules used to construct the initial mating pool. The molecules in the ZINC data set that are most similar to these two molecules have respective Tanimoto similarity scores of just 0.27 and 0.12 and corresponding J(m) values of 0.9 and -2.4 ( Figure S1). This indicates that the GB-GA approach can traverse a relatively large distance in chemical space using relatively few (50) generations. . So a good strategy may be to terminate the GB-GA run if the J(m) value has not changed for more than 20 generations (at least for J(m) maximisation).

GB-GM
As described in the Computational Methodology section, the GB-GM method grows a molecule one atom at a time where the choice of bond order and atom type is chosen probabilistically based on a bonding analysis of the first 1000 molecules in the ZINC dataset (referred to hereafter as the reference set). The GB-GM model is tested by generating 1000 molecules using ethane as the "seed" molecule (which takes about 30 seconds on a laptop) and repeating the statistical bonding analysis. The average molecular size in the new data set is 23.2±4.1 atoms, which is nearly identical to the that of the training set: 23.2±4.4 atoms. There are 2498 rings compared to 2768 in the reference set and 59% of the atoms are in rings, which also is close to 63% in the reference set. 41% of the 3-atom combinations in rings is C=C-C, which is reasonably close to the 45% in the reference set (Table 1) This difference is probably due to the fact that at least one of the two carbons must be secondary to "accept" the double bond, and if such a bonding pattern is not present then no "=C-" atom will be added. It is thus a little bit more likely that a "-C-" atom will be "accepted", compared to a "=C-" atom.
Not surprisingly, there are bigger differences for the larger scale features not specifically encoded in the rules such as the type of ring (Table 3). For example, there are many fewer benzene-type rings 5/9 The reason is that the molecules in the ZINC data set are not made by randomly adding atoms, but by assembling larger functional groups such as aliphatic and aromatic rings. As a result the average ring composition does not reflect the most likely ring compositions. It is possible to scale the probabilities to skew the results towards one or the other ring-type. For example in the last column the probabilities are scaled such that the probability of X=Z-Y is 80% rather than the 62% in the reference set, which increases the number of benzene-like rings from 479 to 850 at the expense of the aliphatic rings.

GB-GM-MCTS
Ten GB-GM-MCTS simulations are performed using ethane as a seed molecule and the maximum J(m)-scores for each simulation is averaged. The tree is traversed 1000 times, i.e. there are 1000 J(m) evaluations per run. For GB-GM-MCTS the average maximum J(m)-score value is 2.6±0.6, which is significantly lower than the lowest value in the Yang et al. [2] study, 4.9±0.4 ( Table 2). The most likely explanation is the GB-GM makes relatively few benzene rings (as discussed above), which, together with Cl atoms, is the defining structural feature of the high scoring molecules found by Yang et al. [2] . Indeed, if the probability of generating C=C-C containing rings is increased from 62% to 80% then the average maximum J(m)-score increases to 3.4±0.6. Increasing the tree traversal to 5000 increases the value to 4.3±0.6, which is similar to 4.9±0.4, obtained by Yang et al. [2] using as similar number of tree traversals. The latter run takes about 9 minutes, compared to 2 hrs in the Yang et al. [2] study.
Figures 3c-d shows the highest scoring molecule found using 1000 and 5000 tree traversals. The molecules are similar to those found by Yang et al. [2] in that they consist mostly of benzene rings, which is the most common hydrophobic structural motif in the ZINC data set. The GB-GA results show that higher J(m)-scores can be obtained using long aliphatic chains, but this structural motif is relatively rare in the ZINC data set and therefore rarely suggested by the generative models.

CONCLUSION AND OUTLOOK
This paper presents a comparison of a graph-based genetic algorithm (GB-GA) and machine learning (ML) results, compiled by Yang et al. [2] , for the optimization of logP values with a constraint for synthetic accessibility (J(m)) and shows that GA is as good or better than the ML-based approaches for this particular property. GB-GA predicts maximum J(m)-values that, on average, are 1.3 -1.8 units higher than the best  ML-based results reported by Yang et al. [2] , while also being orders of magnitude computationally more efficient. Similarly, the GB-GA method finds molecules with maximum J(m)-scores that, depending on the method, often are several units larger than those found with ML-based approaches. These molecules bear little resemblance to the molecules used to construct the initial mating pool, indicating that the GB-GA approach can traverse a relatively large distance in chemical space using relatively few (50) generations.
The paper also introduces a new non-ML graph-based generative model (GB-GM) that can be parameterized using very small data sets and combined with a Monte Carlo tree search (MCTS) algorithm such as the one used by Yang et al. [2] . The results are comparable to the results obtained by Yang et al. [2] using a recurrent neural network (RNN) generative model, with maximum J(m)-values of 4.3±0.6 compared to 4.9±0.6 found using 5000 property evaluation. While the results are slightly worse than the RNN results, the GB-GM-based method is orders of magnitude faster. In both cases the MCTS approach essentially extracts the main hydrophopic structural motif (a benzene ring) found in the training set. The MCTS results thus seem more dependent on the composition of the training set than the GA approach for this particular property.
While the results are quite encouraging, it is important to perform similar comparisons for other properties before drawing general conclusions. In a very recent study Brown et al. [16] have compared GB-GA and GB-GM-MCTS to SMILES-based ML and GA methods on 20 different optimization problems and conclude that "In terms of optimization performance, the best model is a genetic algorithm based on a graph representation of molecules." Our and their results strongly suggest that the performance of new ML-based generative models should be compared to more traditional, and often simpler, approaches such as GAs. The both the GB-GA and GB-GM-MCTS codes used in this study are freely available with the open source RDKit toolkit as the only dependence.  Figure S1. Molecules from the ZINC data set with the highest Tanimoto similarity to the molecules shown in Figure 3, together with their J(m) values. The Tanimoto score is calculated using ECFP4 fingerprints.