Open Access Article
Philippe Schwaller‡*, Théophile Gaudin‡, Dávid Lányi, Costas Bekas and Teodoro Laino
IBM Research, Zurich, Switzerland. E-mail: {phs,tga,dla,bek,teo}@zurich.ibm.com
First published on 22nd June 2018
There is an intuitive analogy between an organic chemist's understanding of a compound and a language speaker's understanding of a word. Based on this analogy, it is possible to introduce the basic concepts of linguistic analysis and to assess their potential impact on the world of organic chemistry. In this work, we cast the reaction prediction task as a translation problem by introducing a template-free sequence-to-sequence model, trained end-to-end and fully data-driven. We propose a tokenization that is arbitrarily extensible with reaction information. Using an attention-based model borrowed from human language translation, we improve on the state-of-the-art top-1 accuracy in reaction prediction, achieving 80.3% without relying on auxiliary knowledge such as reaction templates or explicit atomic features. In addition, a top-1 accuracy of 65.4% is reached on a larger and noisier dataset.
Multiple efforts have been made in the past 50 years to rationalize the large number of chemical compounds and reactions identified, which form the large knowledge bases for solving synthetic problems. In 1969, Corey and Wipke1 demonstrated that both synthesis and retrosynthesis could be performed by a machine. Their pioneering contribution involved the use of handcrafted rules made by experts, which are commonly known as reaction templates. The templates encode the local changes to the atoms' connectivity under certain conditions, accounting for various subtleties of retrosynthesis. A similar algorithm emerged in the late 1970s2 which also requires a set of expert rules. Unfortunately, rule writing is a tedious task, both time- and labor-intensive, and may not cover the entire domain for complex organic chemistry problems. In such cases, profound chemical expertise is still required, and the solutions are usually developed by trained organic chemists. However, it can be extremely challenging even for them to synthesize a relatively complex molecule, which may take several reaction steps to construct. In fact, navigating the chemical space of drug-related compounds by relying only on intuition may turn a synthesis into a nearly impossible task, especially if the problem is slightly outside the expert's knowledge.
Other approaches extract reaction templates directly from data.3–6 In this specific context, candidate products are generated from the templates and then ranked according to their likelihood. Satoh and Funatsu3,4 used various hard-coded criteria to perform the ranking, whereas more recent approaches5,6 used a deep neural network. However, these types of approaches are fundamentally dependent on the rule-based system component and thus inherit some of its major limitations. In particular, these approaches do not produce sufficiently accurate predictions outside of the training domain.
Nevertheless, the class of algorithms1–6 that is based on rules manually encoded by human experts or automatically derived from a reaction database is not the only way to approach the problem of organic synthesis. A second approach for predicting chemical reactions exploits the advancements in computational chemistry to evaluate the energy barriers of a reaction, based on first-principle calculations.7–9 Although it is possible to reach very accurate levels of prediction for small systems (chemical reactions involving a few hundred atoms), it is still a very computationally daunting task, which limits, among other things, the sampling of the solvent degrees of freedom, possibly resulting in unrealistic entropy contributions. Therefore, while computational chemistry may intrinsically solve the problem of reaction prediction, its prohibitive cost prevents the systematic treatment of all those degrees of freedom that may drive the chemical reaction along a specific route. For such reasons, its current field of applicability in industry is mainly limited to problems of purely academic interest.
One way to view the reaction prediction task is to cast it as a translation problem, where the objective is to map a text sequence that represents the reactants to a text sequence representing the product. Molecules can equivalently be expressed as text sequences in line notation format, such as the simplified molecular-input line-entry system (SMILES).10 Intuitively, there is an analogy between a chemist's understanding of a compound and a language speaker's understanding of a word. No matter how imaginative such an analogy may be, it was formally verified only very recently.11 Cadeddu et al.11 showed that organic molecules contain fragments whose rank distribution is essentially identical to that of sentence fragments. Moreover, text representations of molecules have already been shown to be effective in chemoinformatics.12–16 This has strengthened our belief that the methods of computational linguistics can have an immense impact on the analysis of organic molecules and reactions.
In this work, we build on the idea of relating organic chemistry to a language and explore the application of state-of-the-art neural machine translation methods, which are sequence-to-sequence (seq2seq) models. We intend to solve the forward-reaction prediction problem, where the starting materials are known and the interest is in generating the products. This approach was first pioneered by Nam and Kim.17 Here, we propose a model with higher capacity and a different attention mechanism, which better captures the relation between reactants and products. For the tokenization, we combine an atom-wise tokenization of the reactants, similar to the work of Nam and Kim,17 with a one-hot reagent tokenization suggested by Schneider et al.18 Were training data for reaction conditions available, the tokenization could be arbitrarily extended with tokens describing those conditions. In this work, we only use a set of the most common reagents.19 The overall network architecture is simple, and the model is trained end-to-end, fully data-driven and without additional external information. With this approach, we improved the top-1 accuracy by 0.7% compared to current template-free solutions, achieving a value of 80.3% using their own training and test data sets.20 The model presented here also sets a first benchmark of 65.4% on a noisy single-product reaction dataset extracted from US patents.
A closely related, template-free approach is that of Jin et al.,20 who used Weisfeiler–Lehman Networks (WLNs) trained on reactions extracted from US patents. The first WLN scored the reactivity between atom pairs and predicted the reaction center. All possible bond configuration changes were enumerated to generate product candidates. The candidates that were not removed by hard-coded valence and connectivity rules were then ranked by a Weisfeiler–Lehman Difference Network (WLDN). Their method achieved a top-1 accuracy of 79.6% on a test set of 40 000 reactions. Jin et al.20 claimed to outperform template-based approaches by a margin of 10% after augmenting the model with the unknown products of the initial prediction to reach a product coverage of 100% on the test set. The dataset with the exact training, validation and test split has been released.§ The complexity of the reaction prediction problem was significantly reduced by removing the stereochemical information.
Retrosynthesis is the opposite of reaction prediction. Given a product molecule, the goal is to find possible reactants. In contrast to major-product prediction, in retrosynthesis more than one target string might be correct, e.g. a product could be the result of two different reactant pairs. Without a distinct target, the training of a seq2seq model can be more difficult. The first attempt to use a seq2seq model for retrosynthesis was made by Liu et al.32 They used a set of 50 000 reactions extracted and curated by Schneider et al.19 The reactions from that set include stereochemical information and are classified into ten different reaction classes. Overall, none of the previous works was able to demonstrate the full potential of seq2seq models.
Lowe's patent reaction dataset33 contains 1 808 938 reactions, which are described using SMILES.10
Looking at the original patent data, it is surprising that a complex chemical synthesis process consisting of multiple steps, performed over hours or days, can be summarized in a simple string. Such reaction strings are composed of three groups of molecules: the reactants, the reagents, and the products, which are separated by a ‘>’ sign. The process actions and reaction conditions, for example, have been neglected so far.
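As a minimal illustration, such a reaction string (here, the canonicalized example from Table 2, step 3) can be split into its three groups of molecules with plain string operations:

```python
# Splitting a reaction SMILES of the form "reactants>reagents>products".
# The example string is the canonicalized reaction from Table 2 (step 3).
rxn = "Cc1cc(Cl)n(C)n1.O=[N+]([O-])O>O=S(=O)(O)O>Cc1nn(C)c(Cl)c1[N+](=O)[O-]"

reactants, reagents, products = rxn.split(">")
print(reactants.split("."))  # ['Cc1cc(Cl)n(C)n1', 'O=[N+]([O-])O']
print(reagents.split("."))   # ['O=S(=O)(O)O']
print(products.split("."))   # ['Cc1nn(C)c(Cl)c1[N+](=O)[O-]']
```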
To date, there is no standard way of filtering duplicate, incomplete or erroneous reactions in Lowe's dataset. We kept the filtering to a minimum to show that our network is able to handle noisy data. We removed 720 768 duplicates by comparing reaction strings without atom mapping, and an additional 780 reactions because the SMILES string could not be canonicalized with RDKit,34 as the explicit number of valence electrons for one of the atoms was greater than permitted. We took only single-product reactions, corresponding to 92% of the dataset, to have distinct prediction targets. Although this is a current limitation of the training procedure of our model, it could easily be overcome in the future, for example by defining a specific order for the product molecules. Finally, the dataset was randomly split into training, validation and test sets (18 : 1 : 1).¶ Reactions with the same reactants but different reagents and products were kept in the same set.
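A minimal sketch of this filtering, using RDKit for canonicalization, is given below; the helper names are illustrative and the exact criteria applied to the dataset may differ in detail.

```python
from rdkit import Chem

def strip_atom_mapping(smiles):
    """Remove atom-map numbers and return the canonical SMILES, or None if invalid."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None  # e.g. a valence greater than permitted, rejected by RDKit
    for atom in mol.GetAtoms():
        atom.SetAtomMapNum(0)
    return Chem.MolToSmiles(mol)

def filter_reactions(reaction_smiles_list):
    """Keep canonicalizable, de-duplicated, single-product reactions (simplified sketch)."""
    seen, kept = set(), []
    for rxn in reaction_smiles_list:
        reactants, reagents, products = rxn.split(">")
        if len(products.split(".")) != 1:
            continue  # keep only single-product reactions
        parts = [strip_atom_mapping(s) for s in (reactants, reagents, products)]
        if any(p is None for p in parts):
            continue  # drop reactions that cannot be canonicalized
        key = ">".join(parts)  # reaction string without atom mapping
        if key in seen:
            continue  # drop duplicates
        seen.add(key)
        kept.append(key)
    return kept
```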
To compare our model and results with the current state of the art, we used the USPTO set recently published by Jin et al.20 It was extracted from Lowe's grants dataset33 and contains 479 035 atom-mapped reactions without stereochemical information. We restricted ourselves to single-product reactions, corresponding to 97% of the reactions in Jin's USPTO set. An overview of the datasets taken as ground truths for this work is shown in Table 1.
In general, we observe that if reactions that do not work well with the model are removed under the assumption that they are erroneous, the model's accuracy will improve, suggesting a tradeoff between coverage and accuracy. This calls for open datasets. The only fair way to compare models is to use datasets to which identical filtering was applied, or to count the reactions that a model is unable to predict as false predictions.
The SMILES strings of the reactants and products were tokenized atom-wise with the following regular expression:
| token_regex = "(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])" |
As reagent atoms are never mapped into product atoms, we employed a reagent-wise tokenization using a set of the 76 most common reagents, according to the analysis in ref. 19. Reagents belonging to this set were added as distinct tokens after the first ‘>’ sign, ordered by occurrence. Other reagents, which were not in the set, were neglected and removed completely from the reaction string. The separate tokenization would allow us to extend the reaction information and add tokens for reaction conditions without changing the model architecture. The final source sequences were made up of tokenized “reactants > common reagents” and the target sequence of a tokenized “product”. The tokens were separated by space characters. The preprocessing steps together with examples are summarized in Table 2. The same preprocessing steps were applied to all datasets.
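The sketch below shows how such an atom-wise tokenization can be applied with Python's re module; the regular expression is the one given above (written as a raw string) and tokenize_smiles is an illustrative helper name. With this expression, multi-character tokens such as Cl and bracketed atoms stay intact.

```python
import re

# Atom-wise SMILES tokenization with the regular expression given above (raw-string form).
TOKEN_REGEX = (
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\."
    r"|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles):
    tokens = re.findall(TOKEN_REGEX, smiles)
    assert "".join(tokens) == smiles, "tokenization should be lossless"
    return tokens

# Reactant side of the example in Table 2, joined with spaces as in the source sequences.
src = "Cc1cc(Cl)n(C)n1.O=[N+]([O-])O"
print(" ".join(tokenize_smiles(src)))
# C c 1 c c ( Cl ) n ( C ) n 1 . O = [N+] ( [O-] ) O
```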
| Step | | Example (entry 23 738, Jin's USPTO test set20): reactants > reagents > products |
|---|---|---|
| (1) | Original string | [Cl:1][c:2]1[cH:3][c:4]([CH3:8])[n:5][n:6]1[CH3:7].[OH:14][N+:15]([O-:16])=[O:17].[S:9](=[O:10])(=[O:11])([OH:12])[OH:13]>>[Cl:1][c:2]1[c:3]([N+:15](=[O:14])[O-:16])[c:4]([CH3:8])[n:5][n:6]1[CH3:7] |
| (2) | Reactant and reagent separation | [Cl:1][c:2]1[cH:3][c:4]([CH3:8])[n:5][n:6]1[CH3:7].[OH:14][N+:15]([O-:16])=[O:17]>[S:9](=[O:10])(=[O:11])([OH:12])[OH:13]>[Cl:1][c:2]1[c:3]([N+:15](=[O:14])[O-:16])[c:4]([CH3:8])[n:5][n:6]1[CH3:7] |
| (3) | Atom-mapping removal and canonicalization | Cc1cc(Cl)n(C)n1.O=[N+]([O-])O>O=S(=O)(O)O>Cc1nn(C)c(Cl)c1[N+](=O)[O-] |
| (4) | Tokenization | C c 1 c c ( C l ) n ( C ) n 1 . O = [ N+ ] ( [ O- ] ) O > A1 > C c 1 n n ( C ) c ( C l ) c 1 [ N+ ] ( = O ) [ O- ] |
| Source | | C c 1 c c ( C l ) n ( C ) n 1 . O = [ N+ ] ( [ O- ] ) O > A1 |
| Target | | C c 1 n n ( C ) c ( C l ) c 1 [ N+ ] ( = O ) [ O- ] |
| $i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$ | (1) |
| $f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$ | (2) |
| $o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$ | (3) |
| $c_t = f_t \otimes c_{t-1} + i_t \otimes \tanh(W_c x_t + U_c h_{t-1} + b_c)$ | (4) |
| $h_t = o_t \otimes \tanh(c_t)$ | (5) |
where $i_t$, $f_t$ and $o_t$ are the input, forget and output gates, $c_t$ is the cell state, $\sigma$ is the sigmoid function and $\otimes$ denotes element-wise multiplication, evaluated for each time step. The hidden states of a BLSTM are defined as the concatenation of the forward and backward hidden states:
| $h_t = [\overrightarrow{h_t};\; \overleftarrow{h_t}]$ | (6) |
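For concreteness, eqns (1)–(6) can be written out directly. The sketch below uses NumPy with illustrative parameter containers; the actual model relies on a deep-learning framework's LSTM implementation rather than this hand-rolled version.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step following eqns (1)-(5); W, U, b are dicts of gate parameters."""
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # input gate,  eqn (1)
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # forget gate, eqn (2)
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # output gate, eqn (3)
    c_t = f_t * c_prev + i_t * np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # eqn (4)
    h_t = o_t * np.tanh(c_t)                                 # hidden state, eqn (5)
    return h_t, c_t

def blstm_states(xs, fwd_params, bwd_params, n_hidden):
    """BLSTM hidden states, eqn (6): concatenate a forward and a backward LSTM pass."""
    def run(seq, params):
        h, c, states = np.zeros(n_hidden), np.zeros(n_hidden), []
        for x_t in seq:
            h, c = lstm_step(x_t, h, c, *params)
            states.append(h)
        return states
    forward = run(xs, fwd_params)
    backward = run(xs[::-1], bwd_params)[::-1]
    return [np.concatenate([hf, hb]) for hf, hb in zip(forward, backward)]
```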
Thus we can formalize our encoder as
| $C = f(W_e x_t, h_{t-1}),$ | (7) |

where $h_t \in \mathbb{R}^n$ are the hidden states at time $t$; $x_t$ is an element of an input sequence $x = \{x_0, \ldots, x_T\}$, which is a one-hot encoding of our vocabulary; and $W_e$ are the learned embedding weights. Generally, $C$ is simply the last of the encoder's hidden states:
| $C = h_T$ | (8) |
The second part of the model – the decoder – predicts the probability of observing a product $\hat{y} = \{\hat{y}_1, \ldots, \hat{y}_M\}$:
| $p(\hat{y}) = \prod_{i=1}^{M} p(\hat{y}_i \mid \{\hat{y}_1, \ldots, \hat{y}_{i-1}\}, c_i),$ | (9) |
| $p(\hat{y}_i \mid \{\hat{y}_1, \ldots, \hat{y}_{i-1}\}, c_i) = g(\hat{y}_{i-1}, s_i, c_i),$ | (10) |
where $g$ is a non-linear function, $s_i$ is the decoder hidden state and $c_i$ is a context vector computed as a weighted sum of the encoder hidden states,
| $c_i = \sum_{j} \alpha_{ij} h_j,$ | (11) |

with attention weights obtained by normalizing the alignment scores between the decoder state and each encoder hidden state:
| $\alpha_{ij} = \dfrac{\exp(s_i^{\top} W_{\alpha} h_j)}{\sum_{k} \exp(s_i^{\top} W_{\alpha} h_k)}.$ | (12) |
The attention vector is then defined by
| $a_i = \tanh(W_a [c_i; s_i]).$ | (13) |
Both $W_\alpha$ and $W_a$ are learned weights. Then $a_i$ can be used to compute the probability of a particular token:
| $p(y_i \mid \{y_1, \ldots, y_{i-1}\}, c_i) = \operatorname{softmax}(W_p a_i),$ | (14) |
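A compact sketch of one attention-augmented decoding step, eqns (11)–(14), is given below. The multiplicative alignment score and the NumPy shapes are assumptions made for illustration, not a description of the exact implementation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_decode_step(s_i, H, W_alpha, W_a, W_p):
    """One decoding step with attention.
    H has shape (T, n) and stores the encoder hidden states h_1..h_T."""
    scores = H @ (W_alpha @ s_i)                     # alignment scores (multiplicative form, assumed)
    alpha_i = softmax(scores)                        # attention weights,  eqn (12)
    c_i = alpha_i @ H                                # context vector,     eqn (11)
    a_i = np.tanh(W_a @ np.concatenate([c_i, s_i]))  # attention vector,   eqn (13)
    p_i = softmax(W_p @ a_i)                         # token probabilities, eqn (14)
    return p_i, alpha_i
```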
| Parameter | Possible values |
|---|---|
| Number of units | 128, 256, 512 or 1024 |
| Number of layers | 2, 4 or 6 |
| Type of encoder | LSTM, BLSTM |
| Output dropout | 0–0.9 (0.7676) |
| State dropout | 0–0.9 (0.5374) |
| Learning rate | 0.1–5 (0.355) |
| Decay factor | 0.85–0.99 (0.854) |
| Type of attention | “Luong” or “Bahdanau” |
We trained the model on the single-product reactions in Jin's USPTO train set and tested the fully trained model on Jin's USPTO test set. Additionally, we trained a second model with the same hyperparameters on 902 581 randomly chosen single-product reactions from the more complex and noisy Lowe dataset and tested it on a set of 50 258 reactions. As molecules are discrete data, changing a single character, such as in source code or arithmetic expressions, can lead to a completely different meaning or even invalidate the entire string. Therefore we use full-sequence accuracy, the strictest criterion possible, as our evaluation metric, by which a test prediction is considered correct only if all tokens are identical to the ground truth.
The network had to solve three major challenges. First, it had to memorize the SMILES grammar to predict syntactically correct sequences. Second, because we trained it on canonicalized molecules, the network had to learn the canonical representation. Third, the network had to map the reactants-plus-reagents space to the product space.
Although the training was performed without a beam search, we used a beam width of 10 without length penalty for the inference. Therefore the 10 most probable sequences were kept at every time step. This allowed us to know what probability the network assigned to each of the sequences. We used the top-1 probabilities to analyze the prediction confidence of the network.
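A minimal beam-search sketch is shown below; next_log_probs stands in for the trained decoder and is a hypothetical callable, and, as in our inference setup, no length penalty is applied.

```python
import math

def beam_search(start_token, end_token, next_log_probs, beam_width=10, max_len=200):
    """Keep the `beam_width` most probable partial sequences at each step.
    `next_log_probs(seq)` is assumed to return {token: log_prob} for the next position."""
    beams = [(0.0, [start_token])]  # (cumulative log-probability, sequence)
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, seq in beams:
            if seq[-1] == end_token:
                finished.append((logp, seq))  # completed hypotheses are set aside
                continue
            for tok, lp in next_log_probs(seq).items():
                candidates.append((logp + lp, seq + [tok]))
        if not candidates:
            break
        # no length penalty: rank purely by cumulative log-probability
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    finished.extend(beams)
    best_logp, best_seq = max(finished, key=lambda c: c[0])
    return best_seq, math.exp(best_logp)  # top-1 sequence and its assigned probability
```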
The final step was to canonicalize the network output. This simple and deterministic reordering of the tokens improved the accuracy by 1.5%. Thus, molecules that were correctly predicted, but whose tokens were not enumerated in the canonical order, were still counted as correct. The prediction accuracies of our model on different datasets are reported in Table 4. For single product reactions, we achieved an accuracy of 83.2% on Jin's USPTO test dataset and 65.4% on Lowe's test set.
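The evaluation can be sketched as follows, assuming the predicted and ground-truth SMILES are available as plain strings and using RDKit for the canonicalization; the function names are illustrative.

```python
from rdkit import Chem

def canonicalize(smiles):
    """Return the canonical SMILES, or None when the prediction is not valid SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None

def top1_accuracy(predictions, targets):
    """Full-sequence accuracy: a prediction counts only if it matches the ground truth
    exactly after canonicalization (targets are assumed to be canonical already)."""
    correct = sum(
        1 for pred, tgt in zip(predictions, targets) if canonicalize(pred) == tgt
    )
    return correct / len(targets)
```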
An additional validation can be found in the ESI,† where we used the model trained on Lowe's dataset to predict reactions from Pistachio,45 a commercial database of chemical reactions extracted from the patent literature. Because the Lowe dataset used to train the model contained reactions only up to September 2016 and we predicted only reactions from 2017, this constitutes a time-split validation.
Predicting the single-product reactions of the USPTO test set takes on average 25 ms per reaction, inferred with a beam search. Our model can therefore compete with the state of the art.
| Fig. 3 Attention weights of reaction 120 from Jin's USPTO test set. The atom mapping between reactants and product is highlighted. SMILES: Brc1cncc(Br)c1.C[O−]>CN(C)C=O.[Na+]>COc1cncc(Br)c1. Reaction plotted with RDKit.34 | ||
Another limitation of the training procedure is multiple-product reactions. In contrast to words in a sentence, the exact order in which the molecules in the target string are enumerated does not matter. A viable option would be to include in the training set all possible permutations of the product molecules.
Our hyperparameter space during optimization was restricted to a maximum of 1024 units for the encoder. Using more units could have led to improvements. On Jin's USPTO dataset, the training plateaued because an accuracy of 99.9% was reached and the network had memorized almost the entire training set. Even on Lowe's noisier dataset, a training accuracy of 94.5% was observed. A hyperparameter optimization could be performed on Lowe's dataset to improve the prediction accuracy.
Footnotes
| † Electronic supplementary information (ESI) available: Time-split test set and example predictions, together with attention weights, confidence and token probabilities. See DOI: 10.1039/c8sc02339e |
| ‡ P. S. and T. G. contributed equally to this work. |
| § https://github.com/wengong-jin/nips17-rexgen. |
| ¶ https://ibm.box.com/v/ReactionSeq2SeqDataset. |