 Open Access Article
      
        
          
Xiaoxue Wang ac, Yujie Qian b, Hanyu Gao a, Connor W. Coley a, Yiming Mo a, Regina Barzilay b and Klavs F. Jensen *a
aDepartment of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA. E-mail: kfjensen@mit.edu
      
bDepartment of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
      
cDepartment of Chemical and Biomolecular Engineering, The Ohio State University, Columbus, Ohio 43210, USA
    
First published on 14th September 2020
Computer aided synthesis planning of synthetic pathways with green process conditions has become of increasing importance in organic chemistry, but the large search space inherent in synthesis planning and the difficulty in predicting reaction conditions make it a significant challenge. We introduce a new Monte Carlo Tree Search (MCTS) variant that promotes balance between exploration and exploitation across the synthesis space. Together with a value network trained from reinforcement learning and a solvent-prediction neural network, our algorithm is comparable to the best MCTS variant (PUCT, similar to Google's Alpha Go) in finding valid synthesis pathways within a fixed searching time, and superior in identifying shorter routes with greener solvents under the same search conditions. In addition, with the same root compound visit count, our algorithm outperforms the PUCT MCTS by 16% in terms of determining successful routes. Overall the success rate is improved by 19.7% compared to the upper confidence bound applied to trees (UCT) MCTS method. Moreover, we improve 71.4% of the routes proposed by the PUCT MCTS variant in pathway length and choices of green solvents. The approach generally enables including Green Chemistry considerations in computer aided synthesis planning with potential applications in process development for fine chemicals or pharmaceuticals.
One major challenge for an efficient CASP algorithm to include green chemistry considerations is limitations in chemical space search algorithms. The search algorithm is the core of synthesis planning as it determines the success rate and the route quality of the proposed synthesis pathways. Heuristic best first searches (BFS) or depth first searches (DFS)5 use hand-written heuristic functions to evaluate positions in the search tree as they explore the synthesis planning search space.1,5 However, given the large search space5,20 of chemical reactions, the searching efficiency of BFS/DFS is not high enough for good CASP performance.5 In addition, it is difficult to define strong heuristics for BFS/DFS CASP.1 To tackle this challenge, Segler et al.1 showed that Monte Carlo Tree Search (MCTS), a search scheme that demonstrated superhuman performance in playing the game of Go,21,22 is able to find buyable pathways with much higher success rate than traditional heuristic BFS.1 Segler et al.1,23,24 also innovatively demonstrated that it was possible to learn the rules of chemical synthesis for a specific domain by using data. Additionally, Kishimoto et al.25 demonstrated that the MCTS variant used by Segler et al.1 significantly outperformed depth-first proof-number (DFPN) search in terms of success rate in the domain of retrosynthesis. More recently, Chen et al.26 introduced Retro*, a neural-based tree search method also demonstrating advanced performance.
The MCTS variant used by Segler et al. is similar to Google's Alpha Go algorithm.22 Although it achieves a high search success rate, the method does not take green chemistry into account. Schreck et al.3,27 used another variant of MCTS, Upper Confidence bound applied to Trees (UCT),28–30 in a reinforcement learning approach to find synthesis pathways with as few buyable precursors as possible. However, the UCT method tends to have a lower success rate in finding pathways than the Alpha Go-like PUCT (predictor + UCT)22,31 MCTS variant used by Segler et al.1,3,24 An efficient search scheme that maintains a high search success rate and favors green chemistry routes is still missing.
A critical challenge for green chemistry CASP is the difficulty of evaluating chemical compounds and reactions in the search tree. Evaluating compounds and reactions in the sense of green chemistry requires a fast model to predict the conditions of a given reaction, such as solvents, catalysts and reaction temperature. Struebing et al.32 used quantum mechanical (QM) calculations to effectively find solvents that can accelerate certain reactions, but the high computational cost of QM calculations makes such an approach infeasible for CASP tree search. Recent data-driven condition prediction models could potentially be sufficiently fast for CASP. As examples, Marcou et al.33 demonstrated an expert system to predict the catalysts and solvents for Michael additions, and Gao et al.34 developed a data-driven neural network model to predict reaction conditions with high accuracy. So far, such models have not been incorporated into tree search algorithms to guide and evaluate the searches.
Additionally, the evaluation of the compounds in a CASP search tree could be facilitated by reinforcement learning (RL). RL has been frequently used in solving games and evaluating game positions.21,22 The similarity between the synthesis planning problem and strategic games1,5 suggests that RL would be a suitable method for evaluating chemical compounds, as demonstrated by Schreck et al.3 in their application of RL to assess the cost of compounds in retrosynthetic analysis.
In this work, we tackle the aforementioned challenges by proposing a more efficient MCTS variant that incorporates the condition prediction model and RL. This MCTS variant, modified UCT with “dynamic c”, enables tuning the balance between exploring new reaction templates and exploiting known templates dynamically during tree search. Specifically, we promote an active search by adjusting the “c” coefficient28,29 along with the tree expansion. Because of the forced exploration by our method, the modified UCT with dynamic c algorithm is able to significantly boost the success rate compared to the original UCT algorithm proposed by Kocsis et al.28,29
A value network trained by MCTS self-play RL is used explicitly as the look-ahead mechanism to evaluate the “synthesis easiness” of each compound in order to steer the tree expander towards short route length and high success rate. A policy network is applied implicitly to narrow down the beam of search. However, the exact value of the policy function22 is not used in the UCT formulation of MCTS,29 making our algorithm more robust to an imperfect policy network. In addition, each reaction in the proposed routes is evaluated by the “greenness” of neural network predicted solvents.34
As in other retrosynthesis efforts,5,17,23,35 we use automatically extracted templates (patterns of reaction rules) to generate a full buyable path with multistep reactions.1,3,8,36 Although many recent efforts have focused on developing template-free synthesis planning,37 they primarily address single-step conversion,20,38,39 and template-free full synthesis planning is still emerging.37
We show that our new MCTS variant is able to suggest shorter pathways with greener solvents than the suggestions by the PUCT MCTS without sacrificing the success rate of route searching. At the same time, the new MCTS also demonstrates significantly higher success rate compared to the original UCT MCTS. Although we only consider route length and solvent greenness in this work, the method is generally applicable for other green chemistry considerations, e.g. milder reaction temperature and more economical catalysts.
The reward function R(s,a,s′) ∈ ℝ describes the reward received by taking action a from state s and reaching new state s′, and finally a discount factor γ ∈ (0,1)28,40 reflects a preference for shorter pathways. In the MDP,22,40 the policy function p(a|s) is the probability distribution over the allowed actions a ∈ A(s), and a value function v(s) is defined for each state, which can be tailored for different optimization purposes such as high success rate, low reaction cost and high process greenness. As mentioned above, we use templates to implement the retrosynthetic disconnection action, a.
The retrosynthesis process can be formulated as a tree search problem in an MDP environment. Given the large search width and depth of chemical retrosynthesis planning, the traditional depth/breadth first search is not efficient.1 Here a more efficient search scheme (MCTS, Fig. 1) is used. The detailed description of the MCTS process can be found in the Methods section. The core of MCTS is using an “upper confidence bound” (UCB) to prioritize the templates. Different UCB equations define different MCTS variants, as shown in Table 1. The first three rows are UCT29 type MCTS variants, and the last row is PUCT22,24,31 type MCTS. The modified UCT and the “dynamic c” tuning (mUCT-dc) in Table 1 are described in detail in the Methods section. In short, we (1) modify the UCB equation of the original UCT to a hybrid form that combines the UCT and PUCT types of MCTS, in order to save the time required by the mandatory ergodic step at the start of a search with the original UCT,29 and (2) introduce a “dynamic c” trick to tune the value of the coefficient c (Table 1), which forces the tree expander to dynamically explore templates that are ranked low by the policy network and thereby balances exploration and exploitation better than the original UCT and PUCT MCTS.
Fig. 1 The process of Monte Carlo Tree Search in synthesis planning. Following the notations of MDP, a molecule (or state) is denoted as s, and a template (or retrosynthetic disconnection action) is denoted as a. In the selection phase, starting from the target molecule, the most “promising” template is recursively chosen by selecting the template with the highest upper confidence bound (UCB(s,a)) value until a leaf node is reached. A policy network is used to narrow down the search beam in each template selection step. In the expansion phase, the leaf node is expanded by applying the selected template. New leaf nodes (precursors) that have not been visited by the tree expander before are generated. Once the new leaf nodes are encountered, in the evaluation step, a value network is used to evaluate the values of the leaf nodes (if the node is buyable, the value is set to 1). Then in the backpropagation step, moving upward along the tree, the visit count N(s,a) of each compound-template (s,a) pair, or edge, is updated. The Q(s,a) value (see Table 1) is recalculated as well and used to recompute UCB(s,a) values in the next selection step. With the updated values, the tree expander goes back to the selection phase and again selects the most promising template for the target molecule (root node). Here circles denote compounds. (Blue) not commercially available; (Green) commercially available.
∑bN(s,b) denotes the summation of N(s,b) over all allowable templates b available for state s. v(s) is the value of s and the output of the value network. s,a → s′ indicates that s′ is eventually reached after taking action a from position s.
In addition, instead of using a Monte Carlo roll-out1,28,29 as the look-ahead mechanism in the “Evaluation” step, we use a value network to generate the value of a compound, similar to Alpha Go by Silver et al.22 and envisioned by Segler et al.1 If the compound is in our buyable catalog, the value will be set to 1 and overwrite the value given by the value network. The value network is trained through RL self-play. The self-play process (bootstrapping) and the definition of values are discussed in the following part. The value network trained in this way is therefore dependent on the buyable catalogue used.
As shown in Table 1, we study six MCTS variants in total: modified UCT with dynamic c and value network (mUCT-dc-V); original UCT with value network (UCT-V); modified UCT without dynamic c but with value network (mUCT-V); PUCT22,31 MCTS with value network (PUCT-V, which is very similar to the MCTS used by Segler et al.1); PUCT MCTS without value network (PUCT-bootstrapping); and modified UCT with dynamic c but without value network (mUCT-dc-bootstrapping).
We only consider the deterministic case where T(s,a,s′) = 1 if s′ is the result of the retrosynthetic disconnection action a (or template) applied to compound s, and T(s,a,s′) = 0 otherwise; and we set R(s,a,s′) to 0 for all reactions. If s′ is buyable, v*(s′) = 1. Therefore

$$v^*(s) = \gamma \max_{a \in A(s)} v^*(s')$$

Here s′ is the next state of s so that T(s,a,s′) = 1. A(s) is the action space, which defines all the allowable retrosynthetic disconnection actions for s. Note that the definition of v*(s) is recursive and depends on an exhaustive (ergodic) search of the action space, which is not possible, so obtaining the exact value of v*(s) analytically is prohibitive. Instead, we can empirically approximate v*(s) by running MCTS on molecules randomly sampled from the chemical space. The resulting approximation of v*(s) through MCTS sampling is z(s), and the output of the value network trained using z(s) is defined as vθ(s), so that vθ(s) ≈ z(s) ≈ v*(s).
Since one compound s can be split into multiple compounds, defining z(s) is complex. In order to bootstrap the process, we define the z(s) value for each compound in the tree as shown in Fig. 2a. The z(s) value is defined similarly to the Bellman equation of v*(s)40 with a discount factor of γ. For a reaction ((s,a) pair) in the tree, if all of its child compounds (s′) have a non-zero z value, then a Z(s,a) value for the reaction is defined as the average z value of its children, and the z value of the parent compound, s, is defined as the maximum Z(s,a) value among all actions available to s in the tree, multiplied by γ.
At initialization, the z values of all compounds are zero. During the tree search, encountered compounds that are in the buyable catalog are assigned a z(s) of 1, as shown in Fig. 2a. In the bootstrapping phase, no value network is used. The Q(s,a) value, which is important in the UCB equations, is calculated using z(s) values in the bootstrapping phase as shown in Table 1. Therefore, in this process a strong preference is given to actions that lead quickly to buyable compounds.
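The z(s) bookkeeping described above can be illustrated with a minimal sketch. It assumes a finished, acyclic search tree stored as a plain dictionary; the names `compute_z`, `reactions` and `buyable` are hypothetical and are not taken from the authors' code. The default γ = 0.9 is the value quoted for Round 1 RL.

```python
from typing import Dict, List, Optional, Set


def compute_z(compound: str,
              reactions: Dict[str, List[List[str]]],
              buyable: Set[str],
              gamma: float = 0.9,
              memo: Optional[Dict[str, float]] = None) -> float:
    """Return z(compound) for an acyclic tree (hypothetical data layout).

    reactions maps a compound to the reactions found for it in the tree;
    each reaction is the list of its child compounds (precursors).
    """
    if memo is None:
        memo = {}
    if compound in memo:
        return memo[compound]
    if compound in buyable:
        memo[compound] = 1.0  # buyable compounds are assigned z = 1
        return 1.0
    best_reaction_value = 0.0  # z stays 0 if no reaction is fully solved
    for children in reactions.get(compound, []):
        child_z = [compute_z(c, reactions, buyable, gamma, memo)
                   for c in children]
        # Z(s,a) is defined only if every child has a non-zero z value
        if child_z and all(z > 0.0 for z in child_z):
            reaction_value = sum(child_z) / len(child_z)
            best_reaction_value = max(best_reaction_value, reaction_value)
    memo[compound] = gamma * best_reaction_value
    return memo[compound]


if __name__ == "__main__":
    tree = {"T": [["A", "B"], ["D"]], "B": [["C"]]}
    print(compute_z("T", tree, buyable={"A", "C"}))
```

In this toy tree, B is one step from the buyable compound C, so z(B) = 0.9; the first reaction of T averages z(A) = 1 and z(B) = 0.9, and the unsolved reaction through D contributes nothing.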
The z(s) values are obtained retrospectively, i.e., they can only be calculated after running the MCTS. In order to guide the retrosynthesis, we need to evaluate the value prospectively, so we train a value function vθ(s) that takes only the molecular structure as input to predict z(s) (Fig. 2b). The value network can be further updated as more z(s) data accumulate during the training of the RL algorithm. In this work, we discuss the value network obtained by the Round 1 RL (trained with z(s) values generated from bootstrapping MCTS) and the Round 2 RL (trained with z(s) values generated from MCTS with the Round 1 RL value network). The bootstrapping MCTS variant used here is the mUCT-dc-bootstrapping method. The details of the RL value network training can be found in the Methods section.
Among all the MCTS variants, for our test set of 1000 compounds, the mUCT-dc-V and PUCT-V algorithms outperform the other variants with a fixed expansion time of 30 s (Fig. 3a). In particular, compared to the original UCT method (UCT-V), mUCT-dc-V improves the success rate by 19.7% with the same value network and the same initialization of the c value. This is a result of the active exploration promoted by the dynamic c tuning. Without the dynamic c, the performance of the modified UCT (mUCT-V) and the original UCT method (UCT-V) are very close. In addition, mUCT-dc-V is able to find synthesis pathways in 30 s for many target compounds that cannot be solved by the PUCT-V method, providing a new effective approach to finding synthetic pathways connecting targets to buyable chemicals (cf. Fig. 4).
Fig. 4 shows the probability P(s,a) of each step (s,a) given by the policy network together with the ranking of the template among the top 50 templates suggested by the policy network. The success of the mUCT-dc-V method is attributed to the fact that templates with small P(s,a) and low ranking can still be effectively explored, owing to the dynamic tuning of the exploration coefficient c in the UCB equation. In contrast, the PUCT-V tree expander is trapped by high-ranking templates that eventually fail to form a valid synthetic route, as a result of the greedier strategy in the definition of its UCB equation.
Traditionally, since PUCT utilizes the information of the prior probabilities of different templates explicitly, it does not need to compute the logarithmic function (Table 1) and will search more quickly. As a result, the PUCT algorithm outperforms the original UCT method. However, with the dynamic c to promote active exploration, the mUCT-dc-V method even outperforms the PUCT algorithm in both test sets and training sets. In addition, since the policy network trained on the Reaxys database focuses only on single-step transformations and does not necessarily reflect the best prior probability P(s,a) for maximizing the success rate of multi-step retrosynthesis search, it may mislead the PUCT algorithm. In contrast, UCT type MCTS variants do not explicitly include P(s,a) in the equation for UCB (Table 1). Instead, they only use the results of the policy network to shrink the width of the search. Therefore, modified UCT with the dynamic c trick (mUCT-dc-V) is more robust against an imperfect policy network, and the forced exploration guarantees a broader selection of templates and more diverse synthesis routes, which leads to greener pathways, as we show in the following parts.
When the termination criterion of tree expansion is switched to a fixed root visit count, the advantage of the mUCT-dc-V algorithm is even more obvious (Fig. 3(b)). The dynamic c makes the modified UCT method outperform the PUCT-V method by ∼16%. Since the UCT method requires the computation of the logarithmic function and forces exploration of low-ranked templates, UCT searches much more slowly than the PUCT method. With 30 s as the time limit, UCT actually visits the root far fewer times than the PUCT method. However, with the dynamic c trick, our novel UCT is able to search more efficiently and achieves a slightly higher success rate than PUCT. When both methods are given the same root visit count limit, the mUCT-dc-V method largely outperforms the PUCT method due to the more efficient searching.
The value network used in Fig. 3 is the value network from Round 1 RL. The results of the Round 2 RL value network can be found in Table S1 in the ESI† and are not significantly different from the results of the Round 1 RL value network. The c value is 0.1 for UCT type MCTS variants, and is set to 1 for PUCT type MCTS variants. The effect of different c values is studied in Table 2, and it does not significantly affect the performance. Here the values for buyable compounds are set to 1 during the tree search regardless of the prediction of the value network. The results of the tree expansion without the prior knowledge of buyable compounds during value setting are shown in Fig. S1 and Table S2 in the ESI.† In general, the performance is much worse than the case in Fig. 3, implying that the MCTS tree expander favors pathways with strong incentives such as a high weight for buyable compounds.
We assign different scores to different solvents in our solvent library based on biosafety and flammability as suggested by Byrne et al.41 “Green” solvents, such as ethanol and water, are assigned the score of 1. “Mediocre” solvents, such as methyl isobutyl ketone (MIBK) and toluene, are given the score of 0. “Non-green” solvents, such as tetrachloromethane and 1,2-dimethoxyethane, are allotted a negative score of −1 (see Fig. 5a). Gao et al.34 proposed a neural network model that, given a target reaction, predicts a list of suitable solvents with different probabilities. Based on this result, we define the reaction solvent score as the probability-weighted average of the scores of the solvents suggested by the solvent model (Fig. 5b). This probability-weighted greenness is akin to Li and Eastgate's work on incorporating ligand information into process mass intensity (PMI).42 With the idea of the reaction solvent score, a compound solvent score can be defined. We first convert the reaction solvent score (RSS ∈ [−1,1]) to a reaction solvent penalty (RSP ∈ [−1,−0.1]) (Fig. 5c). The idea of the reaction solvent penalty is that any additional reaction is “bad” for the goal of finding the shortest possible synthetic routes, so all reactions should be penalized. Yet reactions using “green” solvents should be penalized less than non-green reactions. Therefore, reactions with an RSS of 1 receive only a small penalty of −0.1, while reactions using very toxic or flammable solvents (RSS = −1) are given the most negative penalty of −1. Then we are able to define the compound solvent score (CSS).
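As a rough illustration of the reaction-level scoring, the sketch below computes an RSS as a probability-weighted average and maps it to an RSP. The solvent scores are the examples named in the text; the linear interpolation between the two stated endpoints (RSS = 1 → −0.1, RSS = −1 → −1) is an assumption, since the exact conversion is given only in Fig. 5c. The function names and the example probabilities are hypothetical.

```python
from typing import Dict

# Example scores named in the text; a real solvent library would be larger.
SOLVENT_SCORE: Dict[str, int] = {
    "ethanol": 1,
    "water": 1,
    "MIBK": 0,
    "toluene": 0,
    "tetrachloromethane": -1,
    "1,2-dimethoxyethane": -1,
}


def reaction_solvent_score(predicted: Dict[str, float]) -> float:
    """Probability-weighted average solvent score (RSS in [-1, 1])."""
    total = sum(predicted.values())
    if total <= 0:
        raise ValueError("predicted probabilities must sum to a positive value")
    weighted = sum(SOLVENT_SCORE[name] * p for name, p in predicted.items())
    return weighted / total


def reaction_solvent_penalty(rss: float) -> float:
    """Map RSS in [-1, 1] to RSP in [-1, -0.1].

    ASSUMPTION: linear interpolation between the two endpoints stated in
    the text; the actual curve is the one shown in Fig. 5c.
    """
    return -0.55 + 0.45 * rss


if __name__ == "__main__":
    # Hypothetical solvent-model output for one reaction
    suggestion = {"ethanol": 0.5, "toluene": 0.3, "tetrachloromethane": 0.2}
    rss = reaction_solvent_score(suggestion)
    print(rss, reaction_solvent_penalty(rss))
```

Every reaction receives a negative penalty under this mapping, so longer routes accumulate more penalty even when all of their solvents are green.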
The optimization problem to maximize the route greenness boils down to the problem of finding the greenest valid route, $\arg\max_{\text{path}} \sum_{R \in \text{path}} \mathrm{RSP}(R)$, for the root compound, where the sum runs over the reaction solvent penalties of the reactions R in the path. We use the CSS of the root compound as the greenness of the synthetic route in this work. Moreover, the overall goal is to optimize the compound solvent scores of the root compounds (or route greenness) by developing efficient MCTS tree expanders.
The reason that mUCT-dc-V generates greener synthetic routes lies in the intrinsic difference between the PUCT MCTS and the modified UCT with dynamic c MCTS. In the UCB equation of PUCT-V (Table 1), the output of the policy network P(s,a) is explicitly used in the exploration term to steer the tree expander to select templates with higher prior probability. However, the policy network trained on the Reaxys database does not necessarily reflect the true value of the prior probability, although the relative ranking given by the policy network is meaningful. Therefore, with an inaccurate prior probability value, the PUCT MCTS can be misled by the policy network. In contrast, the forced exploration by the dynamic c tuning in the modified UCT MCTS provides a broader vision to explore templates that are underestimated by the policy network, since only the ranking, and not the value of the prior probability, is used in the UCB equation. In addition, since P(s,a) for many low-ranked templates is extremely small as a result of the imperfect policy network, these templates are not effectively explored by the PUCT MCTS. As a consequence, the PUCT MCTS tree expander prefers exploring edges that the policy network favors instead of exploring templates that might be ranked lower but can eventually yield shorter routes.
We also trained a value network using the CSS values generated from MCTS experiments. The resulting route greenness is worse than that obtained with value networks considering only synthetic “easiness” (i.e. the Round 1 RL value network). The main reason is that the CSS-based value network sacrifices the success rate of the synthesis planning. Consequently, with fewer routes to choose from, the CSS value network is worse at finding green pathways. The loss–iteration curve of the CSS value network can be found in Fig. S2.† The comparison of the performances of the CSS value network and the Round 1 RL value network can be found in Table S3.†
We have also compared the performance of the PUCT-bootstrapping method with mUCT-dc-V using the Round 1 RL value network. The novel UCT method again outperforms the PUCT-bootstrapping algorithm in terms of the solvent greenness of the generated synthetic routes. The performance can be found in Table S4 in the ESI.†
The most promising template selected for each compound s is the template with the highest UCB(s,a) value. This template selection process takes place recursively in the tree until a leaf node (nodes without a child along the edge of the most promising template) is encountered. The tree expander will expand the leaf node by applying the selected template to it. In this way, new leaf nodes (new precursors) are added to the tree. Once the new leaf nodes are generated, a second neural network, value network, is used to evaluate the novel compounds. We check the commercial availability of the leaf nodes and once the leaf nodes are found to be in our catalogue of purchasable compounds, the value of this compound is set to 1 (overwriting the value given by the value network) and this node will not undergo an expansion process any more. The tree expansion settings are: top 50 templates given by policy network are considered, maximum depth is 10, and minimum plausibility is 0.75. The minimum plausibility is the output of a fast filter based on a model predicting the likelihood of a reaction being plausible.17
Once we get the values of the leaf compounds, a backpropagation process takes place. In the backpropagation step, the visit counts (N(s,a)) and template Q values (Q(s,a)) are updated for all the (s,a) pairs (compound-template pairs) upward on the tree as shown in Fig. 1. The equation to calculate Q(s,a) is shown in Table 1. In practice, we also update the z(s) value (which is explained in the bootstrapping part) along the MCTS process in the bootstrapping phase. After the backpropagation, the tree expander again selects the most promising template with the highest UCB(s,a), starting from the target molecule (root node), using the updated template visit counts (N(s,a)) and template Q values (Q(s,a)).
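The bookkeeping in the backpropagation step can be sketched as follows. This is only an illustration: it assumes that Q(s,a) is the running mean of the leaf values backed up through the edge, which is a common choice in Alpha Go-style MCTS, whereas the expression actually used is the one given in Table 1. The `Edge` class and `backpropagate` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Edge:
    """Statistics stored for one compound-template pair (s, a)."""
    visits: int = 0          # N(s,a)
    value_sum: float = 0.0   # sum of leaf values backed up through this edge

    @property
    def q(self) -> float:
        # ASSUMPTION: Q(s,a) is the mean backed-up value; 0 before any visit
        return self.value_sum / self.visits if self.visits else 0.0


def backpropagate(path: List[Edge], leaf_value: float) -> None:
    """Update N and Q for every edge from the expanded leaf up to the root."""
    for edge in reversed(path):
        edge.visits += 1
        edge.value_sum += leaf_value


if __name__ == "__main__":
    root_edge, child_edge = Edge(), Edge()
    backpropagate([root_edge, child_edge], leaf_value=1.0)  # buyable leaf
    backpropagate([root_edge], leaf_value=0.5)              # value-network estimate
    print(root_edge.visits, root_edge.q, child_edge.visits, child_edge.q)
```

Storing the sum rather than the mean keeps each update to two additions per edge.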
A key issue of the modified UCB1 (see the ESI†) and the original UCB1 in MCTS is that, when the parameter c is not appropriate, the tree expander tends to visit the known best action too many times and does not sample other options enough. Additionally, when the value network cannot evaluate the states perfectly, it is important to explore the actions efficiently in a dynamic way. Therefore, we propose the dynamic c trick to address these challenges, and as shown in the main text, it significantly improves the performance of the MCTS algorithm.
First of all, we have a policy network that ranks the probability to apply each template (or action). Therefore we always start from the highest ranked template and then visit the second ranked and so on. The event we want to investigate is the case when the tree expander switches from the highest ranked template to the second highest ranked template when using modified UCB1 equation.
Suppose that in the last round the expanded template (action a) has the value UCB(s,a) and the next unexpanded template b has the value UCB(s,b). The unexpanded template will not be visited in the current round unless UCB(s,b) > UCB(s,a).
Specifically, if the switching is between the highest ranked template and the second highest ranked template, ∑bN(s,b) = N(s,a), since only the highest ranked template a has been visited. Therefore the minimum N(s,a) required for the highest ranked template a can be solved from the resulting inequality in N(s,a).
Therefore the minimum required visit count N(s,a) depends on the ratio of Q(s,a) to c. The solution of this inequality as a function of Q/c is shown in Fig. 8 and Table 3. As the Q/c value increases, the minimum required N(s,a) increases very rapidly. Therefore, if we do not choose c carefully, the tree expander will search the highest ranked template almost forever. If the highest ranked template cannot lead to a buyable pathway, the tree expander is “trapped” in this suboptimal branch. The behavior shown in Fig. 8 and Table 3 is also consistent with the idea that c promotes “exploration”, in the sense that the extent to which c promotes exploration depends on its value relative to Q.
| Q/c value | min N(s,a) | 
|---|---|
| 4 | 3879 | 
| 5 | 281326 | 
| 9 | 3.88 × 10^17 |
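The rapid growth of the minimum visit count can be reproduced numerically under one assumption about the modified UCB1, whose exact expression is given in Table 1 and the ESI rather than in the text here. The sketch below assumes UCB(s,a) = Q(s,a) + c·sqrt(2 ln ∑bN(s,b)/(1 + N(s,a))) with Q = 0 for an unvisited template, so that the switching condition becomes sqrt(2 ln N) − sqrt(2 ln N/(1 + N)) ≥ Q/c. This form was chosen because it reproduces the visit counts quoted in the text; the function names are hypothetical.

```python
import math


def exploration_gap(n: int) -> float:
    """Left-hand side of the assumed switching inequality, in units of c.

    ASSUMED form: sqrt(2 ln N) - sqrt(2 ln N / (1 + N)).
    """
    return math.sqrt(2.0 * math.log(n)) * (1.0 - 1.0 / math.sqrt(1.0 + n))


def min_visits_before_switch(q_over_c: float) -> int:
    """Smallest N(s,a) with exploration_gap(N) >= Q/c, for Q/c > 0."""
    if q_over_c <= 0:
        raise ValueError("q_over_c must be positive")
    # exploration_gap is increasing in N, so bracket by doubling, then bisect
    lo, hi = 1, 2
    while exploration_gap(hi) < q_over_c:
        lo, hi = hi, hi * 2
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if exploration_gap(mid) >= q_over_c:
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    for ratio in (2, 4, 5):
        print(ratio, min_visits_before_switch(ratio))
```

Under this assumed form the function returns 24, 3879 and 281326 for Q/c = 2, 4 and 5, matching the values quoted in the text and in Table 3. For much larger ratios such as Q/c = 9, double-precision arithmetic can only give the order of magnitude.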
In practice, we choose Q/c = 2 as the standard. Therefore the first template switch will happen when the first template has been visited 24 times. If we set c = max_b Q(s,b)/2 over all visited Q(s,b) values and we assume that the random variables Q(s,b) ∈ (0,1) (this is the requirement of the Chernoff–Hoeffding bound30 and is the actual case in this paper) do not change with tree expansion, the average and maximal visit count for each template before all templates are visited can be calculated as shown in Fig. 9 (a and b) using 10^4 random simulations. In fact, this trick allows visiting all the templates, with visit counts that decay according to their ranking given by the policy network.
One issue with the aforementioned strategy is that the Q(s,b) values actually change as the tree expands. Therefore we propose a method to tune c on the go. The idea is shown in Fig. 10. The value of c is set to half of the currently observed maximum Q as the tree expands. The benefit of this promoted exploration of low-ranked templates can be seen in Fig. 4, where low-ranked templates lead to successful synthetic routes while the PUCT-V method is trapped in the high-ranked templates, which eventually fail to form a valid synthetic route.
Fig. 10 The dynamic method to decide the value of c. We define c as half of the current max Q(s,b) value during the tree expansion process during which the visit count of the compound s increases.
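A selection step with the dynamic c rule might look like the following sketch. As before, the UCB expression Q + c·sqrt(2 ln ∑bN(s,b)/(1 + N)) is an assumed form of the modified UCB1, not a transcription of Table 1; the fallback `default_c = 0.1` is the c value quoted for UCT type variants, and `select_template` is a hypothetical name. Templates are assumed to be listed in policy-network rank order.

```python
import math
from typing import Sequence


def select_template(visits: Sequence[int],
                    q_values: Sequence[float],
                    default_c: float = 0.1) -> int:
    """Return the index of the template with the highest assumed UCB.

    visits[i] is N(s, a_i) and q_values[i] is Q(s, a_i), with templates
    ordered by policy-network rank (index 0 is the highest ranked).
    """
    total = sum(visits)
    if total == 0:
        return 0  # nothing visited yet: start from the highest ranked template
    # Dynamic c: half of the largest Q observed so far among visited templates
    max_q = max(q for n, q in zip(visits, q_values) if n > 0)
    c = max_q / 2.0 if max_q > 0 else default_c
    log_total = math.log(total)

    def ucb(i: int) -> float:
        q = q_values[i] if visits[i] > 0 else 0.0  # unvisited templates have Q = 0
        return q + c * math.sqrt(2.0 * log_total / (1 + visits[i]))

    # max() keeps the first maximum, so ties go to the higher ranked template
    return max(range(len(visits)), key=ucb)


if __name__ == "__main__":
    print(select_template([23, 0, 0], [0.8, 0.0, 0.0]))  # stays on template 0
    print(select_template([24, 0, 0], [0.8, 0.0, 0.0]))  # switches to template 1
```

Because c is recomputed from the current maximum Q at every call, the ratio Q/c for the best template stays at 2, so the expander moves on to the next ranked template after a bounded number of visits even when the Q values drift during the search.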
… 112 000 product molecules randomly selected from the Reaxys database.
The tree expansion generates z(s) values for all compounds generated in the 112 000 trees in the way shown in Fig. 2a. In Round 1 RL, γ = 0.9. Since the z(s) value of a given compound s can depend on when it appears in the trees, we choose z(s) as the maximum of z(s) over all trees in which compound s appears. The data are then scaled to [0.2, 1] based on the equation:
… 112 000 trees, there are 695 574 z′(s) values generated. We then randomly split the data set in an 8:2 ratio to form the training set and test set for the value network.
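The two bookkeeping steps described above, keeping the maximum z(s) over all trees and splitting the result randomly 8:2, can be sketched as follows. The function names, the fixed seed, and the toy records are assumptions for illustration; the rescaling to [0.2, 1] is left out because its equation is not reproduced here.

```python
import random
from typing import Dict, Iterable, List, Tuple


def max_z_over_trees(records: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Keep, for each compound, the largest z(s) seen in any tree."""
    best: Dict[str, float] = {}
    for compound, z in records:
        if z > best.get(compound, float("-inf")):
            best[compound] = z
    return best


def random_split(keys: Iterable[str], seed: int = 0,
                 train_parts: int = 8, test_parts: int = 2
                 ) -> Tuple[List[str], List[str]]:
    """Shuffle the keys reproducibly and cut them train_parts:test_parts."""
    shuffled = sorted(keys)  # sort first so the shuffle depends only on the seed
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * train_parts // (train_parts + test_parts)
    return shuffled[:cut], shuffled[cut:]


# Made-up (compound, z) pairs; a compound may occur in several trees.
records = [("A", 0.5), ("B", 0.9), ("A", 0.81), ("C", 0.3),
           ("B", 0.7), ("D", 1.0), ("E", 0.6)]
z_max = max_z_over_trees(records)
train, test = random_split(z_max)
print(len(z_max), len(train), len(test))
```

Splitting on compounds after aggregation, rather than on raw per-tree records, keeps the same compound from appearing in both the training set and the test set.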
The input to the value network is the Morgan fingerprint of the compound, computed with RDKit. The architecture of the value network is shown in Fig. 11.
The loss–iteration curve can be found in Fig. S2.† We also evaluated the mean squared error (MSE) on the test set: it is ∼0.031 for the Round 1 RL value network and ∼0.02 for the Round 2 RL value network. For comparison, the MSE of the value network in DeepMind's AlphaGo algorithm22 is ∼0.226 on the training set and ∼0.234 on the test set, so our value networks' MSEs are much lower. In both cases the value v(s) is bounded within (0,1], which is a requirement of the Chernoff–Hoeffding bound.30
… 000 compounds with a list price of under $100 per g from SigmaAldrich and eMolecules, with salts removed.17 The lookup function is part of the open-source ASKCOS website.43
      
      
… 62 000 compounds among the 112 000 compounds used in Round 1 RL value network training, in order to reduce the workload. Here the value network used for MCTS tree expansion is the one trained in Round 1 RL. γ is set to 0.7 to enhance the differentiation of synthesis “easiness”. We again choose z(s) as the maximum of z(s) over all trees in which compound s appears. After adding z(s) = 1 for the buyable compounds encountered in the 62 000 trees, we obtain 321 395 z(s) values. We split them in an 8:2 ratio to obtain a training set of 257 116 compounds and a test set of 64 279 compounds.
      
      
… 000 compounds. In total, we collected CSS(s) values for 282 398 compounds successfully solved in the tree expansion. We then split them in an 8:2 ratio to obtain a training set of 225 918 compounds and a test set of 56 480 compounds.
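The reported set sizes for the Round 2 z(s) data and the CSS(s) data are consistent with an 8:2 split in which the training share is rounded down. The short check below assumes that rounding convention; the helper name `split_sizes` is hypothetical.

```python
from typing import Tuple


def split_sizes(total: int, train_parts: int = 8, test_parts: int = 2) -> Tuple[int, int]:
    """Return (train, test) sizes, rounding the training share down."""
    train = total * train_parts // (train_parts + test_parts)
    return train, total - train


print(split_sizes(321395))  # (257116, 64279): Round 2 z(s) values
print(split_sizes(282398))  # (225918, 56480): CSS(s) values
```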
      
      
… 000 compounds used in the bootstrapping phase. We randomly choose 1000 compounds from the training set of the value network for the MCTS success rate experiments. These 1000 training-set compounds are the same for all MCTS variants in Fig. 3.
The test set for the MCTS experiments in Fig. 3 is different from the test set used for value network validation. We choose compounds in the Reaxys database that are neither in the training sets used to train the value networks nor in the “training set” of the MCTS success rate experiments. In total, 1 673 879 compounds in Reaxys were not seen in the value network training process, and we use these compounds as the test set for MCTS success rate testing. For the MCTS success rate experiments shown in Fig. 3, we randomly choose 1000 compounds from this test set; the same 1000 compounds are used for all MCTS variants in Fig. 3.
The test set for the greenness tests in Fig. 6 consists of 2000 compounds randomly selected from the test set, and the test set for Tables S3 and S4† consists of 500 compounds randomly selected from the 1000 tested compounds used in the success rate experiments.
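One simple way to guarantee that every MCTS variant is evaluated on the same compounds is to draw the benchmark once from the held-out pool with a fixed random seed. The sketch below illustrates that idea under stated assumptions: the function names, the seeds, and the placeholder identifiers are hypothetical, and the text does not say how the random selection was actually implemented.

```python
import random
from typing import Iterable, List, Set


def held_out_pool(database: Iterable[str], seen: Set[str]) -> List[str]:
    """Compounds in the database that never appeared in training."""
    return sorted(set(database) - seen)


def fixed_sample(pool: List[str], k: int, seed: int = 0) -> List[str]:
    """Draw k distinct compounds; the same seed always gives the same draw."""
    return random.Random(seed).sample(pool, k)


database = [f"cpd{i}" for i in range(50)]  # placeholder identifiers
seen = {f"cpd{i}" for i in range(10)}      # placeholder training compounds
pool = held_out_pool(database, seen)

benchmark = fixed_sample(pool, 10)           # stands in for the 1000 test compounds
subset = fixed_sample(benchmark, 5, seed=1)  # stands in for the 500-compound subset
print(len(pool), len(benchmark), len(subset))
```

Sorting the pool before sampling makes the draw depend only on the seed, not on the iteration order of the underlying set.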
| Footnote | 
| † Electronic supplementary information (ESI) available. See DOI: 10.1039/d0sc04184j | 
| This journal is © The Royal Society of Chemistry 2020 |