Mikhail Andronov *ae, Varvara Voinarovska b, Natalia Andronova c, Michael Wand ad, Djork-Arné Clevert e and Jürgen Schmidhuber f
aIDSIA, USI, SUPSI, 6900 Lugano, Switzerland. E-mail: mikhail.andronov@idsia.ch
bInstitute of Structural Biology, Molecular Targets and Therapeutics Center, Helmholtz Munich – Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH), 85764 Neuherberg, Germany
cVia Berna 9, 6900 Lugano, Switzerland
dInstitute for Digital Technologies for Personalized Healthcare, SUPSI, 6900 Lugano, Switzerland
eMachine Learning Research, Pfizer Worldwide Research Development and Medical, Linkstr. 10, Berlin, Germany
fAI Initiative, KAUST, 23955 Thuwal, Saudi Arabia
First published on 1st March 2023
Automated synthesis planning is key for efficient generative chemistry. Since reactions of given reactants may yield different products depending on conditions such as the chemical context imposed by specific reagents, computer-aided synthesis planning should benefit from recommendations of reaction conditions. Traditional synthesis planning software, however, typically proposes reactions without specifying such conditions, relying on human organic chemists who know the conditions to carry out suggested reactions. In particular, reagent prediction for arbitrary reactions, a crucial aspect of condition recommendation, has been largely overlooked in cheminformatics until recently. Here we employ the Molecular Transformer, a state-of-the-art model for reaction prediction and single-step retrosynthesis, to tackle this problem. We train the model on the US patents dataset (USPTO) and test it on Reaxys to demonstrate its out-of-distribution generalization capabilities. Our reagent prediction model also improves the quality of product prediction: the Molecular Transformer is able to substitute the reagents in the noisy USPTO data with reagents that enable product prediction models to outperform those trained on plain USPTO. This makes it possible to improve upon the state-of-the-art in reaction product prediction on the USPTO MIT benchmark.
However, there are caveats that are crucial for successful synthesis planning and often unaccounted for in CASP systems. Most importantly, one would like to take into account as much information about a chemical reaction as possible when modeling it. Among the most important aspects of a reaction besides reactants and products are reaction conditions (Fig. 1). The conditions comprise temperature, pressure, and other physical parameters as well as the chemical environment imposed by reagents, which are catalysts, solvents, and other molecules necessary for a reaction to occur. The conditions and reagents are integral to any reaction. The same reaction under different conditions can result in different outcomes.
In cheminformatics, all reaction-modeling tasks rely heavily on datasets of known organic reactions. One of the most important of them at the moment is the USPTO dataset,9 which is publicly available and consists of reactions obtained by text mining open-access texts from the United States patent database. The reagent annotations in this dataset are noisy: reagents are often unspecified or incorrect, and the data do not report any temperature or pressure conditions. Therefore, models for reaction product prediction or retrosynthesis developed on USPTO cannot rely on complete and reliable reaction information.
The reagent information is often overlooked in models for single-step retrosynthesis: even though some systems allow the prediction of retrosynthetic steps encompassing reagents,7 most of them only suggest reactants,10,11 leaving the chemist wondering about the actual procedure needed to conduct the proposed reaction. A separate reagent prediction model may help in this case.
There have been substantial efforts to predict suitable reaction conditions or chemical contexts in various settings. For example, there are reports on using DFT to select suitable solvents for a reaction,12 or thermodynamic calculations to choose heterogeneous catalysts.13 Other studies focused on optimizing reaction conditions for particular reaction types using an expert system14 or machine learning.15–18 Gao et al.19 used a fully connected neural network trained on Reaxys20 reaction fingerprints to predict reagents in a supervised classification setting. Ryou et al.21 extended this approach by encoding reactions with a graph neural network instead of reaction fingerprints, again using Reaxys and treating reagent prediction as supervised classification.
The broad task of organic reaction modeling, whether it is reagent prediction, single-step retrosynthesis or product prediction, can be formulated as a sequence-to-sequence translation if the reactions are represented as reaction SMILES.22 The first attempts to use deep learning to predict reaction outcomes were made by Nam and Kim23 and Schwaller et al.24 This approach experienced significant advances with the adoption of the transformer25 as the deep learning model (transformers with “linearized self-attention” date back to 1992 (ref. 26)). The transformer performed very well in reaction prediction5,27 and single-step retrosynthesis7,28 and established state-of-the-art results in both these fields. A trained large-scale transformer can perform both single-step retrosynthesis and product prediction with impressive accuracy.29 The transformer has also been demonstrated to be suitable for multi-task reaction modeling: when trained in a BERT-like30 fashion to predict any masked tokens in a reaction, it can do both forward prediction and single-step retrosynthesis, as well as reagent prediction.31,32
Many machine learning tools for reaction modeling are trained on preprocessed subsets of USPTO because it is an open dataset. Alternatively, there are proprietary reaction datasets; one of them, Reaxys, contains about 56 million hand-curated reactions. While researchers also use it to train ML models,19,34 its drawback is that models trained on it and the underlying data subsets cannot be publicly shared.
We use the whole USPTO as the training dataset for the reagent prediction model. To train and test the product prediction model, we use the subset of USPTO called USPTO MIT or USPTO 480K, which is a common benchmark for reaction product prediction.5,6 However, we do not use any subset of USPTO as a test dataset for the reagent prediction model. The problem with USPTO is that it is assembled by text mining, so its reactions are recorded with a significant amount of noise. For example, different instances of the same reaction type may be written with different amounts of detail.35 Fig. 2A shows examples of Suzuki coupling with the year and patent number; this reaction type makes up a significant portion of USPTO reactions.36 The Suzuki coupling generally requires a palladium catalyst, a base, and a suitable solvent, yet in many reactions of this type in USPTO the necessary reagents are not specified. The same is observed for other reaction types. In addition, USPTO may contain reaction SMILES involving nonsensical molecules (Fig. 2B).
If such noisy reactions end up in the test dataset, we cannot correctly evaluate the performance of the reagent model. In preliminary experiments in which we tested the reagent model on USPTO, the model's prediction often failed to match the ground-truth sequence even when the prediction was chemically more sensible than the recorded reagents. To overcome this problem, we assembled our test set from reactions in the Reaxys database. Since Reaxys comprises reactions manually extracted from scientific papers, we assume that the quality of its reagent information is ensured by human experts.
While gathering the Reaxys subset, we aimed to make its distribution of reaction types resemble that of the USPTO 50K37 dataset. The purpose of this design is to keep the test distribution close to the training distribution; we use USPTO 50K as a proxy for the training set because it is the only open subset of USPTO that contains reaction class labels. The Reaxys subset used as the test set comprises 96 729 reactions of 10 broad classes, whose distribution is displayed in Table 1; we aimed to make the class proportions in USPTO 50K and the Reaxys test set similar. The classes were determined using the NameRXN software.38
Reaction class | Proportion, USPTO 50K (%) | Proportion, Reaxys test (%) |
---|---|---|
Heteroatom alkylation and arylation | 28.73 | 28.11 |
Acylation and related processes | 24.29 | 18.15 |
Deprotections | 16.97 | 17.37 |
C–C bond formation | 11.52 | 11.95 |
Reductions | 9.72 | 11.10 |
Protections | 1.35 | 4.53 |
Functional group interconversion (FGI) | 3.89 | 4.13 |
Functional group addition (FGA) | 0.51 | 2.46 |
Oxidations | 1.70 | 2.10 |
Heterocycle formation | 1.31 | 0.09 |
To further investigate the similarity of USPTO and our Reaxys subset, we employ parametric t-SNE, a dimensionality reduction method that aims to preserve the closeness of points from the higher-dimensional space. We represent each reaction as a 2048-bit Morgan difference reaction fingerprint39 with both the reagent and non-reagent weights equal to 1. Using the parametric t-SNE implementation from openTSNE,40 we project the fingerprint vectors of USPTO 50K reactions onto the 2D plane and then use the same t-SNE model to obtain projections of the Reaxys reactions. The absolute values of the t-SNE coordinates bear no physical meaning; what matters is that points that are close in 2D correspond to reactions that are close in fingerprint space, i.e. similar reactions lie close together. With the parametric variant, this closeness is preserved across datasets, not only within one dataset: the 2D embeddings of Reaxys reactions land close to those of similar USPTO 50K reactions, since the local structure of the second dataset is preserved while the absolute coordinates of its points are determined by the points of the first dataset.
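The projection pipeline can be sketched as follows. This is a simplified stand-in rather than the exact code used here: the difference fingerprint is assembled by hand from hashed Morgan count fingerprints (products minus reactants, reagents contributing with weight 1), and openTSNE's ability to embed new points into an existing embedding replaces the parametric variant; `uspto_50k_rxns` and `reaxys_rxns` are assumed lists of reaction SMILES.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from openTSNE import TSNE

def morgan_counts(molecules, n_bits=2048, radius=2):
    """Sum of hashed Morgan count fingerprints over a dot-separated molecule list."""
    vec = np.zeros(n_bits)
    for smi in molecules.split("."):
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue
        fp = AllChem.GetHashedMorganFingerprint(mol, radius, nBits=n_bits)
        for bit, count in fp.GetNonzeroElements().items():
            vec[bit] += count
    return vec

def difference_fingerprint(rxn_smiles, n_bits=2048):
    """Hand-rolled approximation of a reaction difference fingerprint:
    products minus reactants, with reagents (agents) weighted by 1."""
    reactants, reagents, products = rxn_smiles.split(">")
    fp = morgan_counts(products, n_bits) - morgan_counts(reactants, n_bits)
    if reagents:
        fp += morgan_counts(reagents, n_bits)  # agent weight = 1; sign convention may differ from RDKit's
    return fp

# uspto_50k_rxns, reaxys_rxns: assumed lists of reaction SMILES (reactants>reagents>products)
# embedding = TSNE(n_components=2).fit(np.stack([difference_fingerprint(r) for r in uspto_50k_rxns]))
# reaxys_2d = embedding.transform(np.stack([difference_fingerprint(r) for r in reaxys_rxns]))
```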
Fig. S1† in the ESI shows the t-SNE maps of USPTO 50K and our Reaxys test set. The maps for the individual datasets are shown at the top of the figure, and the bottom panel demonstrates their overlap. The local structures of the two datasets are similar: there is a significant overlap between the clusters of points. Fig. 3 shows the overlapping t-SNE maps for individual reaction classes; they also demonstrate a noticeable, although not perfect, overlap.
The transformer is an encoder–decoder neural network architecture. The encoder is built of several layers, each essentially consisting of a multi-head attention block and a feed-forward layer. The multi-head attention updates the representation of every token in a sequence according to eqn (1).
$$X_{\mathrm{new}} = \operatorname{softmax}\!\left(\frac{(XW_Q)(XW_K)^{\top}}{\sqrt{d_k}}\right) XW_V \tag{1}$$
Here X is the matrix of token embeddings, X_new is the matrix of token embeddings after a multi-head attention layer, W_Q, W_K, and W_V are matrices of trainable parameters, and d_k is the number of columns of W_K. This mechanism resembles the update used in graph neural networks if we treat the tokens of a sequence as nodes of a fully connected graph.42 The decoder has a similar structure and learns the embeddings of the tokens in the target sequence. Ultimately, the model uses the representations of all tokens in both the input and the output sequence to predict the next token in the output sequence; decoding stops when the model predicts the special "end-of-sequence" token. The ordering of the tokens is imposed by adding positional encodings (special periodic signals) to the token representations at the model input.25 The model produces the output sequentially, token by token, treating the choice of each token as a multi-class classification conditioned on the input sequence and the tokens already decoded for that input. By using beam search, one can obtain several translations of an input sequence ordered by probability.
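As a minimal illustration of eqn (1), a single attention head can be written in a few lines of NumPy; the multi-head version runs several such heads in parallel and concatenates their outputs. The dimensions below are toy values.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_Q, W_K, W_V):
    """Single attention head implementing eqn (1).
    X: (n_tokens, d_model); W_Q, W_K: (d_model, d_k); W_V: (d_model, d_v)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = W_K.shape[1]                         # number of columns of W_K
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (n_tokens, n_tokens) attention weights
    return weights @ V                         # X_new

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                   # 5 tokens, model dimension 16
W_Q, W_K, W_V = (rng.normal(size=(16, 8)) for _ in range(3))
X_new = attention_head(X, W_Q, W_K, W_V)       # shape (5, 8)
```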
In our experiments, we used the OpenNMT43 implementation of the transformer for both the reagent and product prediction. We chose to use this particular solution to be consistent with Schwaller et al.5
To train the reagent model, we take the copy of USPTO kindly provided by Schwaller et al.,5 but preprocess it to fit our setting. Instead of using the original train–validation–test split, we unite all the subsets and randomly choose 5000 reactions from the combined set for validation.
Our preprocessing pipeline for reagent prediction is as follows. For each reaction, we first remove all the auxiliary information written in the form of ChemAxon extended SMILES (CXSMILES). Then we mix together all the molecules that are not products. The procedure used to extract the original data from patents included detecting catalysts and solvents and placing them in the reagent section; however, we do not rely on it, since it is quite imprecise and does not account for other possible reagent types. Instead, we separate reactants from reagents with the fingerprint-based algorithm described by Schneider et al.37 and implemented in RDKit. A small fraction of reactions end up with no reactants at all in this procedure; for them, we decide on the separation based on the original atom mapping in USPTO: reactants are the molecules whose atom-map labels also appear in the products. We avoid using this approach for all reactions because the default atom mapping in USPTO is not reliable enough. We then canonicalize the SMILES of all molecules in a reaction, drop the atom mapping, and remove isotope information if present. Finally, we order all molecules in the reaction: the molecules with the longest SMILES strings come first, and strings of equal length are ordered alphabetically. We also tried a step in which all duplicate molecules in a reaction are removed; this makes the reactions unbalanced but shortens the reaction SMILES while preserving the chemical context. However, this step did not prove useful, and we did not include it in the final pipeline.
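A simplified sketch of the final canonicalization and ordering steps is given below, assuming reaction SMILES of the form reactants>reagents>products; the CXSMILES stripping and the fingerprint-based reactant/reagent re-assignment are omitted.

```python
from rdkit import Chem

def clean_molecule(smi):
    """Canonicalize one SMILES, dropping atom mapping and isotope labels."""
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        return None
    for atom in mol.GetAtoms():
        atom.SetAtomMapNum(0)
        atom.SetIsotope(0)
    return Chem.MolToSmiles(mol)

def order_molecules(smiles_list):
    """Longest SMILES first; ties broken alphabetically."""
    return sorted(smiles_list, key=lambda s: (-len(s), s))

def preprocess_reaction(rxn_smiles):
    parts = []
    for part in rxn_smiles.split(">"):
        mols = [clean_molecule(s) for s in part.split(".") if s]
        parts.append(order_molecules([m for m in mols if m]))
    return ">".join(".".join(p) for p in parts)

# hypothetical example with an atom map on the aniline nitrogen
print(preprocess_reaction("CC(=O)Cl.[NH2:1]c1ccccc1>ClCCl>CC(=O)[NH:1]c1ccccc1"))
```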
After processing every reaction in this fashion, we remove rare reagents from the data, i.e. reagents that appear in the training data fewer than 20 times. This lowers the number of unique reagents in the training set from 37602 (of which 26431 were encountered only once) to 1314. The model is unlikely to learn to predict a reagent from so few examples even if they are correct; moreover, visual analysis shows that such rare "reagents" are often in fact reactants in poorly specified reactions. For example, if a reaction includes several isomers of one reactant, only one of which is reported to become a product, all the other isomers are recognized as reagents. Removing rare reagents alleviates this problem. Finally, we drop duplicate reactions and reactions in which a product appears among the reactants or reagents. In accordance with the common practice of data augmentation in reaction modeling, we employ SMILES augmentation as implemented in the PySMILESutils Python package.44 In the case of reagent prediction, only the SMILES of reactants and products are augmented. In addition, we use "role augmentations": in an augmented example, some molecules from the reagent side have a chance to be moved to the reactant side.
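The two augmentations could look roughly as follows; RDKit's randomized SMILES output is used here as a stand-in for the PySMILESutils augmenter, and the 10% move probability is purely illustrative.

```python
import random
from rdkit import Chem

def randomize_smiles(smi):
    """Return a random (non-canonical) SMILES for one molecule."""
    mol = Chem.MolFromSmiles(smi)
    return Chem.MolToSmiles(mol, canonical=False, doRandom=True) if mol else smi

def augment_reagent_example(rxn_smiles, move_prob=0.1, rng=random):
    """SMILES augmentation of reactants and products plus 'role augmentation':
    each reagent may be moved to the reactant (input) side with probability move_prob."""
    reactants, reagents, products = (p.split(".") if p else [] for p in rxn_smiles.split(">"))
    kept_reagents = []
    for r in reagents:
        (reactants if rng.random() < move_prob else kept_reagents).append(r)
    source = (".".join(randomize_smiles(s) for s in reactants)
              + ">>"
              + ".".join(randomize_smiles(s) for s in products))
    target = ".".join(kept_reagents)   # the target reagents stay canonical
    return source, target
```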
All molecules in the target sequences are canonicalized and ordered by their detailed roles: catalysts come first, then redox agents, then acids and bases, then any other molecules and ions, and solvents come last. In this way, we exploit the model's autoregressive nature to predict the most important reagents first, based solely on the input, and the more interchangeable reagents afterwards, based both on the input and on the reagents suggested so far. A similar ordering of reagents by role was used by Gao et al.,19 and it generally follows the line of thought of a chemist suggesting reagents for a reaction from experience. The roles of the molecules are determined using the following heuristics (a simplified code sketch is given after the list):
(1) Every molecule in a SMILES string of reagents is assigned a role in the following order of decreasing priority: solvent, catalyst, oxidizing agent, reducing agent, acid, base, unspecified role.
(2) A molecule is a solvent if it is one of 46 standard solvent molecules, such as THF, hexane, or benzene.
(3) A molecule is a catalyst if
(a) it is a free metal.
(b) it contains a cycle together with a metal or phosphorus atom.
(c) it is a metal halide.
(4) A molecule is an oxidizing agent if
(a) it contains a peroxide group.
(b) it contains at least two oxygen atoms and a transition metal or iodine atom.
(c) it is a standard oxidizing agent like free halogens.
(d) it is a standard halogenating agent.
(e) it contains both a positively charged atom and a negatively charged oxygen but is not a nitrate anion.
(5) A molecule is a reducing agent if it is one of the standard reducing agents or a hydride of boron, silicon or aluminium.
(6) A molecule is an acid if
(a) it is a derivative of sulphuric, sulfamic or phosphoric acid with the acidic –OH group intact.
(b) it is a carboxylic acid.
(c) it is a hydrohalic acid or a common Lewis acid like aluminium chloride.
(7) A molecule is a base if
(a) it is a tertiary or secondary amine.
(b) it contains a negatively charged oxygen atom and consists of only C, O, S, and P atoms.
(c) it consists only of lithium and carbon.
(d) it is the hydride ion or the hydroxide ion.
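The sketch below illustrates the flavour of these rules; the SMARTS patterns, the solvent list, and the thresholds are illustrative simplifications, not the full rule set above.

```python
from rdkit import Chem

# Illustrative solvent list (canonical SMILES): THF, hexane, benzene, methanol, ethanol, DCM
COMMON_SOLVENTS = {"C1CCOC1", "CCCCCC", "c1ccccc1", "CO", "CCO", "ClCCl"}
# Output ordering of roles in the target sequence: catalysts first, solvents last
ROLE_ORDER = ["catalyst", "oxidizing agent", "reducing agent", "acid", "base", "unspecified", "solvent"]

PEROXIDE = Chem.MolFromSmarts("[OX2][OX2]")
CARBOXYLIC_ACID = Chem.MolFromSmarts("C(=O)[OX2H1]")
AMINE = Chem.MolFromSmarts("[NX3;H0,H1;!$(N=*)]")
METALS = set(range(21, 31)) | set(range(39, 49)) | set(range(72, 81)) | {3, 11, 12, 13, 19, 20}

def assign_role(smi):
    """Very rough role assignment; the real rules above are far more detailed."""
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        return "unspecified"
    if Chem.MolToSmiles(mol) in COMMON_SOLVENTS:
        return "solvent"
    atom_nums = {a.GetAtomicNum() for a in mol.GetAtoms()}
    if atom_nums & METALS and (mol.GetRingInfo().NumRings() > 0 or mol.GetNumAtoms() == 1):
        return "catalyst"                      # e.g. Pd complexes or free metals
    if mol.HasSubstructMatch(PEROXIDE):
        return "oxidizing agent"
    if any(a.GetAtomicNum() in (5, 13, 14) and a.GetTotalNumHs() > 0 for a in mol.GetAtoms()):
        return "reducing agent"                # crude boron/aluminium/silicon hydride check
    if mol.HasSubstructMatch(CARBOXYLIC_ACID):
        return "acid"
    if mol.HasSubstructMatch(AMINE):
        return "base"                          # rough secondary/tertiary amine check
    return "unspecified"

def order_reagents(reagent_smiles):
    """Order reagent molecules by role for the target sequence."""
    return sorted(reagent_smiles, key=lambda s: ROLE_ORDER.index(assign_role(s)))

print(order_reagents(["ClCCl", "CCN(CC)CC", "[Pd]"]))  # -> ['[Pd]', 'CCN(CC)CC', 'ClCCl']
```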
To tokenize our sequences, we employ the standard atomwise tokenization scheme.5 Additionally, we experimented with a scheme in which entire molecules, namely all solvents and some common reagents, get their own tokens. However, this did not improve the quality of the trained reagent prediction model, so we use the standard atomwise tokenization in our final model.
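For reference, atomwise tokenization is commonly implemented with a regular expression; the pattern below is our transcription in the spirit of the tokenizer published with the Molecular Transformer and keeps multi-character tokens (bracket atoms, Cl, Br, two-digit ring closures) intact.

```python
import re

SMI_REGEX = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize(smiles):
    """Split a SMILES string into atomwise tokens; the assertion guards against lossy splits."""
    tokens = SMI_REGEX.findall(smiles)
    assert smiles == "".join(tokens), "tokenization must be lossless"
    return tokens

print(tokenize("CC(=O)Nc1ccccc1"))
# ['C', 'C', '(', '=', 'O', ')', 'N', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1']
```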
For product prediction, we employ the same procedure as Schwaller et al.5 The tokenization is atomwise as well. Some current deep learning reaction prediction methods use explicit reagent information to make predictions,24,45,46 and some allow mixing all the precursors together, but separating reactant and reagent information improves the performance of such models.5,47,48 We trained product prediction models both in the separated setting (reactants and reagents are separated by the ">" token in the input sequences) and in the mixed setting (all molecules are separated by dots in the input sequences). To evaluate the models, we used both the USPTO MIT test set and the Reaxys test set on which we tested the reagent prediction model. We did not use SMILES augmentations for product prediction.
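To make the two input formats concrete, the sketch below turns one preprocessed reaction SMILES into a (source, target) pair for each setting; the exact file layout expected by OpenNMT is omitted, and the example reaction is hypothetical.

```python
def to_forward_example(rxn_smiles, separated=True):
    """Build a (source, target) pair for product prediction from reactants>reagents>products."""
    reactants, reagents, products = rxn_smiles.split(">")
    if separated:
        source = f"{reactants}>{reagents}"                        # "separated": reagents after '>'
    else:
        source = ".".join(p for p in (reactants, reagents) if p)  # "mixed": everything dot-separated
    return source, products

rxn = "CC(=O)Cl.Nc1ccccc1>ClCCl.c1ccncc1>CC(=O)Nc1ccccc1"
print(to_forward_example(rxn, separated=True))
print(to_forward_example(rxn, separated=False))
```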
We evaluate the quality of the reagent prediction model with the following metrics (a code sketch follows the list):
(1) Exact match accuracy: the prediction of the model is considered correct if the symmetric difference between the set of predicted molecules and the set of ground-truth molecules is empty.
Example: A.C.B is an exact match to A.B.C.
(2) Partial match accuracy: the prediction counts as correct if the ground truth contains at least one of the predicted molecules.
Example: A.B is a partial match to A.C.D.
(3) Recall: the number of the correctly predicted molecules divided by the number of molecules in the ground truth.
Example: A.B.C has 100% recall of A.C, A.D has 50% recall of A.B.C.D.
Here A, B, C, and D denote the SMILES strings of some molecules.
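The three metrics reduce to simple set operations over the dot-separated molecule lists (in practice the SMILES should be canonicalized before comparison); the assertions reproduce the examples above.

```python
def molecule_set(smiles):
    """Split a prediction or ground-truth string into a set of molecule SMILES."""
    return {m for m in smiles.split(".") if m}

def exact_match(pred, truth):
    return molecule_set(pred) == molecule_set(truth)       # empty symmetric difference

def partial_match(pred, truth):
    return bool(molecule_set(pred) & molecule_set(truth))  # at least one molecule in common

def recall(pred, truth):
    truth_set = molecule_set(truth)
    return len(molecule_set(pred) & truth_set) / len(truth_set) if truth_set else 0.0

# the examples from the text (A, B, C and D stand for molecule SMILES)
assert exact_match("A.C.B", "A.B.C")
assert partial_match("A.B", "A.C.D")
assert recall("A.B.C", "A.C") == 1.0
assert recall("A.D", "A.B.C.D") == 0.5
```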
We use beam search with a beam size of 5 to obtain predictions from the transformer; all performance metrics are therefore reported for top-N predictions with N from 1 to 5. A correct top-N prediction means that the correct answer appears among the first N sequences decoded with the beam search.
Additionally, our test set contains duplicate reactions with different reported reagent sets. While gathering performance statistics, we group predictions by unique reactions: if the model correctly predicts reagents for one of the duplicates, we count the reaction as correctly predicted.
The performance on the test set is summarized in Table 2.
Metric | Top-1 | Top-2 | Top-3 | Top-4 | Top-5 |
---|---|---|---|---|---|
Exact match accuracy | 17.0 | 24.7 | 29.2 | 31.8 | 33.5 |
Full recall | 19.2 | 28.4 | 35.1 | 39.3 | 42.8 |
Partial match accuracy | 70.9 | 80.5 | 84.9 | 87.3 | 88.9 |
The model performs quite well on the test dataset: for every test reaction, each of the top-5 predictions is a valid SMILES string. An exact match between the prediction and the ground-truth sequence is observed in 17.0% of cases for top-1, 29.2% for top-3, and 33.5% for top-5. Full recall is higher, at 19.2%, 35.1%, and 42.8%, respectively. The partial match between predictions and ground-truth sequences is much higher still, at 70.9%, 84.9%, and 88.9% in the top-1, top-3, and top-5 cases, respectively; thus, the model fails to predict even a single correct reagent for only 11.1% of the reactions in the test dataset. In our evaluation, we do not use any methods to assess the plausibility of an incorrect prediction, such as similarity metrics for interchangeable solvents, so the reported scores should be considered underestimates.
Fig. 5 shows both the exact and partial match accuracy of our reagent model on reactions published in every year between 1980 and 2022. Solid lines show the five-year moving average of accuracy with a centered window. Top-1 exact match accuracy tends to increase on average from 1980 to 1998, decreases until 2005, and plateaus after that; the picture is similar for top-2 through top-5 exact match accuracy, and full recall follows a similar trend. Top-1 partial match accuracy tends to increase on average from 1980 to 2008 and stagnates after that, with top-2 to top-5 accuracies behaving similarly.
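The smoothing used for the solid lines corresponds to a standard centered rolling mean; with per-year accuracies stored in a pandas Series indexed by publication year (an assumption about the data layout), it can be reproduced as follows.

```python
import pandas as pd

def smooth_accuracy(per_year_acc: pd.Series) -> pd.Series:
    """Five-year moving average with a centered window, as used for the solid lines in Fig. 5.
    per_year_acc is assumed to hold one accuracy value per publication year."""
    return per_year_acc.rolling(window=5, center=True).mean()
```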
Table 3 breaks down the performance by the frequency of reaction types in the test set (N denotes the number of test reactions of a given type).
Occurrence | Types | Criterion (N = reactions per type) | Reactions | Top-1 exact acc., % | Top-1 partial acc., % |
---|---|---|---|---|---|
Common | 20 | N > 1000 | 67633 | 17.0 | 73.5 |
Frequent | 67 | 100 < N ≤ 1000 | 23685 | 16.2 | 65.1 |
Rare | 120 | 10 < N ≤ 100 | 4219 | 13.5 | 60.3 |
Very rare | 232 | 1 < N ≤ 10 | 947 | 14.2 | 53.1 |
Singular | 245 | N = 1 | 245 | 13.5 | 62.4 |
In the “common” and “frequent” bins, there are no reaction types with only a single possible reagent in any role. In the “rare” and “very rare” bins, there is a small number of types for which this is the case. However, the analysis is limited by NameRXN and by the heuristics of role classification. If the type label is a name reaction, which is infrequent, the reactions with that label may have a single option for one of the roles. For example, all instances of the rare “8.1.24 Ketone Swern oxidation” type share the same oxidizing agent, and the instances of the rare “9.1.2 Appel chlorination” and the very rare “1.7.8 Ullmann condensation” types each have their one specific catalyst. Some “very rare” types may also have only one option even for solvents in the test set, but this may be an accident of under-representation. The reagent model may or may not give a perfect top-1 prediction for all reactions of such types.
We then use the trained reagent model to replace reagents in the USPTO MIT data with its predictions. There can be various strategies for such a replacement; we apply the following rule: if the top-1 prediction of the model contains more molecules than the original reagent string, the prediction replaces that string. In this way, we can improve reactions with missing reagents without corrupting the good ones. We are aware that this strategy is not ideal and are convinced that better ones are possible. Some examples of reactions whose reagents were improved by the reagent model are shown in Fig. 8.
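The replacement rule amounts to a comparison of molecule counts; a sketch, assuming reactions stored as reactants>reagents>products strings and the model's top-1 predictions already computed:

```python
def maybe_replace_reagents(rxn_smiles, predicted_reagents):
    """Replace the reagent string with the model's top-1 prediction only if the
    prediction contains more molecules than the original reagent string."""
    reactants, reagents, products = rxn_smiles.split(">")
    n_orig = len([m for m in reagents.split(".") if m])
    n_pred = len([m for m in predicted_reagents.split(".") if m])
    if n_pred > n_orig:
        reagents = predicted_reagents
    return f"{reactants}>{reagents}>{products}"
```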
The first reaction is an example of peptide coupling; the typical reagents comprise HOBt or its analogues (e.g. HATU), usually used together with Hünig's base (DIPEA), and the reagent model reintroduces the missing reagents. The second is an example of reductive amination, for which information about the solvent alone is not enough; the model proposes a suitable reducing agent, sodium triacetoxyborohydride. The last two reactions are a Suzuki coupling and a Sonogashira reaction, respectively, and the model suggests the standard reagents that define these reaction types.
We compared two models, both standard Molecular Transformers with the same hyperparameters but trained on different data. The first model, denoted “MT base”, was trained on the standard USPTO MIT; this is the model from the original Molecular Transformer paper.5 The second model, denoted “MT new”, was trained on USPTO MIT in which some reactions had their reagents replaced according to the procedure described above. We chose top-1 exact match accuracy as the quality metric. Before testing on Reaxys, we reassigned the reactant–reagent partition in every test reaction with the role assignment algorithm,37 to be consistent with the inference procedure, in which the reagents to be replaced are those determined by this algorithm. Additionally, we trained another Molecular Transformer for product prediction without any reagents in the source sequences. The performance of the models is summarized in Table 4.
Model | Reaxys, top-1 acc. (%) | USPTO MIT, top-1 acc. (%) |
---|---|---|
MT, no reagents | 77.3 | 84.0 |
MT base, mixed | 82.0 | 87.7 |
MT new, mixed | 83.0 | 88.3 |
MT base, separated | 84.3 | 89.2 |
MT new, separated | 84.6 | 89.6 |
The results of the base model are reproduced as described by Schwaller et al.5 without SMILES augmentations, checkpoint averaging, or model ensembling. The new model performs better than the base model on both Reaxys and USPTO, in both the separated and the mixed setting. This improvement is statistically significant according to McNemar's test;51 the details are provided in the ESI.† The performance on Reaxys is worse than on USPTO because the Reaxys data distribution differs from the training distribution more than the USPTO test set does. It is important to emphasize, however, that the USPTO test set also underwent the reagent replacement when testing the new model. The performance in the mixed setting is slightly worse than in the separated setting, as expected.5 The score of the model trained without reagents is expectedly the lowest on both Reaxys and USPTO; surprisingly, however, it is only 4.7 percentage points below the mixed base model's score on Reaxys and 3.7 percentage points below it on USPTO. We therefore conclude that although reagent information helps product models trained on USPTO, as it should from a chemical perspective, the effect is not that drastic. We conjecture that the improvement would be much more noticeable on a larger scale, e.g. if all the models were trained on the entire Reaxys. We suggest that this is a manifestation of USPTO's flaws: the dataset does not contain many reactions in which the same reactants transform into different products under different conditions.
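A sketch of how such a significance test could be run with statsmodels, assuming boolean per-reaction correctness vectors for the two models on the same test set (variable names are illustrative):

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def compare_models(base_correct, new_correct):
    """McNemar's test on paired per-reaction top-1 correctness of two product models."""
    base = np.asarray(base_correct, dtype=bool)
    new = np.asarray(new_correct, dtype=bool)
    # 2x2 table of (dis)agreements on the same test reactions
    table = [
        [np.sum(base & new), np.sum(base & ~new)],
        [np.sum(~base & new), np.sum(~base & ~new)],
    ]
    return mcnemar(table, exact=False, correction=True)

# result = compare_models(mt_base_is_correct, mt_new_is_correct)
# print(result.statistic, result.pvalue)
```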
Footnotes
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d2sc06798f |