Open Access Article
Kangjie
Lin‡
a,
Youjun
Xu‡
a,
Jianfeng
Pei
*b and
Luhua
Lai
*ab
aBNLMS, Peking-Tsinghua Center for Life Sciences at the College of Chemistry and Molecular Engineering, Peking University, Beijing, 100871, PR China. E-mail: lhlai@pku.edu.cn
bCenter for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, 100871, PR China. E-mail: jfpei@pku.edu.cn
First published on 3rd March 2020
Retrosynthetic route planning can be considered a rule-based reasoning procedure. The possibilities for each transformation are generated based on collected reaction rules, and then potential reaction routes are recommended by various optimization algorithms. Although there has been much progress in computer-assisted retrosynthetic route planning and reaction prediction, fully data-driven automatic retrosynthetic route planning remains challenging. Here we present a template-free approach that is independent of reaction templates, rules, or atom mapping, to implement automatic retrosynthetic route planning. We treated each reaction prediction task as a data-driven sequence-to-sequence problem using the multi-head attention-based Transformer architecture, which has demonstrated power in machine translation tasks. Using reactions from the United States patent literature, our end-to-end models naturally incorporate the global chemical environments of molecules and achieve remarkable performance in top-1 predictive accuracy (63.0%, with the reaction class provided) and top-1 molecular validity (99.6%) in one-step retrosynthetic tasks. Inspired by the success rate of the one-step reaction prediction, we further carried out iterative, multi-step retrosynthetic route planning for four case products, which was successful. We then constructed an automatic data-driven end-to-end retrosynthetic route planning system (AutoSynRoute) using Monte Carlo tree search with a heuristic scoring function. AutoSynRoute successfully reproduced published synthesis routes for the four case products. The end-to-end model for reaction task prediction can be easily extended to larger or customer-requested reaction databases. Our study presents an important step in realizing automatic retrosynthetic route planning.
Since the 1960s, computer-aided retrosynthetic analysis tools have attracted much attention, with the earliest retrosynthesis program likely being the early Logic and Heuristics Applied to Synthetic Analysis (LHASA) work of E. J. Corey.4 Computer-aided synthesis planning has been well-reviewed over the past years.5–11 According to a recent review,11 computer-aided retrosynthetic route planning strategies can be clustered into two main categories: template-based and template-free methods. Template-based methods, including the LHASA software,4,12 have been applied since the philosophy of retrosynthetic analysis was put forward by E. J. Corey. These methods can also be categorized as using either a manual encoding approach or an automated extraction approach. Synthia (formerly Chematica), one of the most well-known, expert-encoded, template-based retrosynthetic analysis tools, is a commercial program developed by Grzybowski and co-workers.9,13–18 This tool uses a manually collected knowledge database containing about 70
000 hand-encoded reaction transformation rules.18 Based on human knowledge of organic synthesis and the encoding of organic rules over a period of more than 15 years, Synthia has been validated experimentally as an efficient toolkit for complex products recently.16 However, it would not be practical to manually collect all the knowledge of organic synthesis considering the exponential growth rate of the number of published reactions.19 Another straightforward strategy for a template-based method, the ReactionPredictor from Baldi's group,20–22 is based on mechanistic views. This method considers the reactions between reactants as electron sinks and sources, and ranks the interactions using approximate molecular orbitals. Although these approaches are logical and interpretable for chemists, the manual encoding of mechanistic rules cannot be avoided and the mechanisms outside the knowledge database cannot be predicted.
In addition to manual rules, automated reaction templates have been extracted by several groups. Based on the algorithms described by Law et al.23 and Bogevig et al.,24 Segler and Waller employed a neural network to score templates and perform retrosynthesis and reaction prediction.25,26 Coupling this method with Monte Carlo Tree Search (MCTS), they built a novel method for synthetic pathway planning.19 Later, Coley et al.27 used automatically extracted templates to perform retrosynthesis analysis based on molecular similarity, where they considered the similarity of both products and reactants to score and rank the templates. Recently, Baylon and coworkers28 applied a multiscale approach based on deep highway networks (DHN) and reaction rule classification for retrosynthetic reaction prediction. Their approach achieved better performance than other previous methods based on automated extraction of reaction templates. However, there are two unavoidable limitations when using automatically extracted reaction templates. First, there is an inevitable trade-off between generalization and specificity in template-based methods. Second, current template extraction algorithms consider reaction centers and their neighboring atoms, but not the global chemical environment of molecules. Moreover, mapping the atoms between products and reactants remains a nontrivial problem for all template-based methods.29
Recently, template-free models have emerged as a promising strategy to predict reactions and retrosynthetic transformations. With the pioneering work using neural networks to generate SMILES30 by Aspuru-Guzik et al.31 and Segler et al.,32 sequence-to-sequence (seq2seq) models have been gradually applied as an important template-free model in reaction outcome prediction and retrosynthetic analysis. The first template-free model in retrosynthetic analysis was proposed by Liu and co-workers,29 who used a seq2seq model to predict SMILES30 strings for reactants of a single product. They used a neural network architecture that involves bidirectional long short-term memory (LSTM) cells with an additive attention mechanism. The seq2seq model performed comparably to its template-based baseline (37.4% versus 35.4% in top-1 accuracy). However, the invalidity rate of the top-10 predicted SMILES strings was greater than 20%, which restricts its potential in further synthetic pathway planning.
In 2017, Vaswani et al.33 proposed a multi-head attention-based Transformer model in machine translation tasks that achieves state-of-the-art performance. Later, two studies used this model to predict the reaction outcome and reactants for single-step retrosynthetic analysis.34,35 Herein, we present a novel template-free strategy for automatic retrosynthetic route planning. The workflow is depicted in Fig. 1. We first trained an end-to-end model for single-step retrosynthetic task prediction using the Transformer architecture on the reactions from the United States patent literature. Our best model achieved a top-1 prediction accuracy of 63.0% using USPTO_MIT (without chiral species) with reaction classification, which exceeds that of the previous similarity-based27 or LSTM-based seq2seq models.29 This model generated fewer invalidity errors of SMILES (99.6% validity rate) compared to the previously reported seq2seq model. When applied recursively, our model successfully performed multi-step retrosynthetic route planning for four case products. It should be noted that none of the products or intermediates appears in the training dataset. We further developed an automatic data-driven end-to-end retrosynthetic route planning system (AutoSynRoute) using MCTS with a heuristic scoring function. AutoSynRoute successfully reproduced the published pathways for the four case products, demonstrating its potential for retrosynthetic pathway planning.
![]() | ||
| Fig. 2 A schematic diagram of the Transformer architecture. Redrawn from ref. 33. | ||
000 reactions (USPTO_50K) extracted from the United States patent literature, which was previously used by Liu et al.29 and Coley et al.27 The reaction classes in the dataset were labeled by Schneider and co-workers37 as described in Table 1. Based on the study by Liu et al.,29 we used a 90%/10% training/testing split, and the validation set was randomly sampled from training sets (10%). To develop a more powerful model, we also used a much larger dataset called USPTO_MIT38 from the USPTO,39 with preprocessed training, validation, and testing sets of 424
573, 42
457 (randomly sampled from training sets), and 38
648 reactions, respectively.
| Reaction class | Reaction name | Fraction of USPTO_50k (%) | Fraction of USPTO_MIT (%) |
|---|---|---|---|
| 1 | Heteroatom alkylation and arylation | 30.3 | 29.9 |
| 2 | Acylation and related processes | 23.8 | 24.9 |
| 3 | C–C bond formation | 11.3 | 13.4 |
| 4 | Heterocycle formation | 1.8 | 0.7 |
| 5 | Protections | 1.3 | 0.3 |
| 6 | Deprotections | 16.5 | 14.1 |
| 7 | Reductions | 9.2 | 9.4 |
| 8 | Oxidations | 1.6 | 2.0 |
| 9 | Functional group interconversion (FGI) | 3.7 | 5.0 |
| 10 | Functional group addition (FGA) | 0.5 | 0.2 |
Inspired by Schneider et al.,37,40 the original USPTO_MIT dataset was preprocessed to extract the reactants and products of each reaction. We classified the reactions using a machine learning method based on reaction fingerprints and agent features. Moreover, we tried both token- and character-based methods to tokenize the SMILES strings as model inputs. The details of the reaction classification algorithm and results and the difference between token- and character-based preprocessing are described in the ESI.†
The source code is available online at http://https://github.com/PKUMDL-AI/AutoSynRoute. All program scripts were written in Python (version 3.6), and the open source RDKit (version 2018.09.02)44 was used for reaction preprocessing and SMILES validation. Our seq2seq model was built with TensorFlow (version 1.12.0),45 and the details of key hyperparameter settings of our models are available in the ESI.†
| top-n accuracy (%), n= | ||||
|---|---|---|---|---|
| Model (dataset) | 1 | 3 | 5 | 10 |
| a Key: “+class” means that reaction class information is added to the model; “+token” means that token-based preprocessing is applied; “+char” means that char-based preprocessing is applied. b The results are implemented by us using Liu et al.'s public code. | ||||
| Liu et al. template +class (USPTO_50K)29,b | 35.4 | 52.3 | 59.1 | 65.1 |
| Liu et al. LSTM +class (USPTO_50K)29,b | 37.4 | 52.4 | 57.0 | 61.7 |
| Coley et al. similarity +class (USPTO_50K)27 | 52.9 | 73.8 | 81.2 | 88.1 |
| Our Transformer +token +class (USPTO_50K) | 54.3 | 74.1 | 79.2 | 84.4 |
| Our Transformer +char +class (USPTO_50K) | 54.6 | 74.8 | 80.2 | 84.9 |
| Liu et al. LSTM +class (USPTO_MIT)29,b | 56.1 | 69.9 | 73.6 | 77.3 |
| Our Transformer +char +class (USPTO_MIT) | 63.0 | 79.2 | 83.4 | 86.8 |
| top-n accuracy (%), n= | ||||
|---|---|---|---|---|
| Model (dataset) | 1 | 3 | 5 | 10 |
| a Key: “+token” means that token-based preprocessing is applied; “+char” means that char-based preprocessing is applied. b The results are implemented by us using Liu et al.'s public code. c The results are implemented by us using Coley's reproduced code (https://github.com/connorcoley/retrotemp). | ||||
| Liu et al. LSTM (USPTO_50K)29,b | 28.3 | 42.8 | 47.3 | 52.8 |
| Liu et al. LSTM (USPTO_MIT)29,b | 46.9 | 61.6 | 66.3 | 70.8 |
| Coley et al. Similarity (USPTO_50K)27 | 37.3 | 54.7 | 63.3 | 74.1 |
| Segler–Coley-retrained (USPTO_50K)25,c | 38.7 | 56.2 | 62.2 | 69.2 |
| Segler–Coley-retrained (USPTO_MIT)25,c | 47.8 | 67.6 | 74.1 | 80.2 |
| Karpov et al. Transformer (USPTO_50K)35 | 42.7 | 63.9 | 69.8 | — |
| Our Transformer +token (USPTO_50K) | 42.0 | 64.0 | 71.3 | 77.6 |
| Our Transformer +char (USPTO_50K) | 43.1 | 64.6 | 71.8 | 78.7 |
| Our Transformer +char (USPTO_MIT) | 54.1 | 71.8 | 76.9 | 81.8 |
The ratio of invalid SMILES strings produced by our model is much lower than that of the previous LSTM-based model, which means that our model has a powerful ability to capture the grammar of SMILES representations. As shown in Table 4, the top-10 invalidity error of our model is 12.6%, which is close to the top-1 invalidity error of Liu's model. When we trained our model on the large-volume USPTO_MIT dataset, the top-1 accuracy increased to 63.0%, which shows the generality ability of our model by increasing the chemical knowledge base. Meanwhile, the error rate of SMILES strings decreases to 0.4% in the top-1 prediction.
| Invalid SMILES rate (%) | ||||
|---|---|---|---|---|
| Model (dataset) | 1 | 3 | 5 | 10 |
| a Key: “+class” means that reaction class information is added to the model; “+token” means that token-based preprocessing is applied; “+char” means that char-based preprocessing is applied. | ||||
| Liu et al. LSTM +class (USPTO_50K)29 | 12.2 | 15.3 | 18.4 | 22 |
| Our Transformer +token (USPTO_50K) | 2.2 | 3.7 | 4.8 | 7.8 |
| Our Transformer +token +class (USPTO_50K) | 2.3 | 4.9 | 7.0 | 12.1 |
| Our Transformer +char (USPTO_50K) | 2.1 | 3.5 | 4.7 | 8.3 |
| Our Transformer +char +class (USPTO_50K) | 2.4 | 4.4 | 6.4 | 12.6 |
| Our Transformer +char +class (USPTO_MIT) | 0.4 | 1.5 | 2.9 | 8.6 |
A comparison of the top-10 accuracies across all classes of our model with those of the previous studies on USPTO_50K and USPTO_MIT datasets is shown in Table S1.† The performance of our model was much better than that of the seq2seq model of Liu et al.29 across all reaction categories. However, our model performed just slightly better or comparably to the similarity-based model in reaction categories 3, 7 and 9.
As shown in Fig. 4, we used the top-5 retrosynthetic disconnections of a compound in a test set as an example to analyze the specificity and generality of our model. We chose a compound in class 1 as an example, in which the ground truth prediction ranks first and the other predicted reactions are also chemically plausible. The results show that our model is able to give reasonably diverse disconnections and the top-5 disconnections comply with the reaction class of heteroatom alkylation. Additional results regarding single-step retrosynthetic disconnections within each reaction class can be found in the ESI (Fig. S1–S10).†
For the first example of retrosynthesis pathway planning, Rufinamide (shown in Fig. 5a), the reported first step is the formation of an amide bond, ranking first in reaction class 9 (functional group interconversion). The subsequent step is also found to rank top-1 in class 4 (heterocycle formation), consistent with the mechanistic view. This is followed by another functional group interconversion (FGI) step, the final step, predicted precisely as top-1 in class 9. It is worth mentioning that different reaction classes may have the same disconnections and thus result in the same reactants. For example, the third step of the aforementioned route also ranks first in class 1 (heteroatom alkylation), which is also plausible.
![]() | ||
| Fig. 5 Iterative multi-step pathway generation. The routes are constructed by iteratively applying single-step retrosynthetic methodology to (a) Rufinamide, (b) an antagonist of the interaction between WDR5 and MLL1, from the example of Grzybowski et al.,16 (c) an allosteric activator for GPX4, and (d) an intermediate of a drug candidate from the example of Segler et al.19 The suggested disconnections are consistent with published pathways. The number before the “.” indicates the reaction class, and the number after the “.” indicates the ranking in the top-10 prediction. | ||
As shown in Fig. 5b, the second example comes from the previous work of Grzybowski et al.,16 which was the retrosynthesis pathway planning of an antagonist of the interaction between WD repeat-containing protein 5 (WDR5) and mixed-lineage leukemia 1 (MLL1).48 Our model could recover the route suggested by the commercial program Synthia. The first step is a FGI predicted as top-2 by our model. The next step is a common amide formation. The final step is a C–C bond formation, which was also predicted by our model as top-1 with the correct reaction class.
The aforementioned two routes are predicted by our model trained on the USPTO_50K dataset. However, another two more challenging routes cannot be completely predicted due to less coverage of chemical space. Remarkably, using the USPTO_MIT dataset (without stereochemistry information), our trained model could completely reproduce the following two routes in our top-10 predictions, suggesting the importance of training on enlarging coverage of the chemical knowledge space.
The third example is the retrosynthesis pathway planning of the GPX4 activator compound, as depicted in Fig. 5c. The published first step ranks first in class 6 (deprotection). The second step could be regarded as acylation and related processes and it is predicted correctly as top-8 by our model. The ground truth of the third step ranks top-1 in class 9, followed by a deprotection step in top-3, and a final acylation and related processes in top-1.
The fourth example, described in Fig. 5d, is the retrosynthesis pathway planning of an intermediate of a drug candidate from the example of Segler and co-workers.19 The first, second, and third steps can be easily reproduced by our model as top-1 or top-2 with the right class. The fourth step is a common functional group addition, followed by an uncommon reduction of a carbonyl group. After the final step of heteroatom alkylation, our model was shown to reproduce the steps predicted by the former template-based method.
000
000 candidate pathways assuming all of the output SMILES strings are valid. To make our model applicable for retrosynthetic pathway planning, we needed to achieve efficient automatic pathway searching and ranking. We used a MCTS algorithm combined with a heuristic scoring function to achieve this purpose. Our heuristic scoring function was inspired by Synthia's Chemical Scoring Function (CSF).9 We considered the Scoremodel produced by our model (representing the decoding log probability from the beam search), the changed SMILES length from the target to the reactants, and the changed number of rings from the target to the reactants. To scale the heuristic scoring function in a comparable range, we presented the scoring function in the formula| Scorestep = a × exp(Scoremodel) − (b × RINGSchanged + SMILESchanged) | (1) |
To make the model applicable, we also needed to define the terminal nodes or reactants, which means the commercially available molecules. We used a dataset containing 84
807 building blocks from a chemical supplier (Sigma Aldrich), obtained from the ZINC15 database (http://zinc15.docking.org/) and 17
182 molecules from the USPTO_MIT database. The data were used as reactants at least five times as terminal nodes (a building block database of 93
563 molecules after removing redundant ones) for searching. Users can also use any specific building block database as a terminal reactant database.
Using our automatic retrosynthetic pathway planning strategy, most of the aforementioned steps in the four examples can be found and ranked in top-10 except for the fourth step of example 3 (ranks top-12) and fifth step of example 4 (ranks top-25). The overall pathway ranking results of the four examples can be found in the ESI (Fig. S11–S14†). Though our heuristic scoring function is simple, these results are impressive. We demonstrated the potential ability of our template-free model to plan the automatic retrosynthetic pathway in a new way other than by using current template-based methods.
Our approach can be further developed with larger and more diverse chemical knowledge bases for training. Currently, the information regarding reaction conditions like catalysts, solvents, and reagents is missing because of the database used. These conditions can be introduced in the future by using more comprehensive datasets, like the Reaxys database or in-house data. Future work will also tackle problems like SMILES′ poor representations of stereochemistry and tautomerization. Finally, we envision that automatic retrosynthetic route planning will play more important roles in real-world automated synthesis of molecules50 and in de novo molecular design.51
Footnotes |
| † Electronic supplementary information (ESI) available. See DOI: 10.1039/c9sc03666k |
| ‡ These authors contributed equally. |
| This journal is © The Royal Society of Chemistry 2020 |