David Fooshee,a Aaron Mood,b Eugene Gutman,b Mohammadamin Tavakoli,a Gregor Urban,a Frances Liu,b Nancy Huynh,b David Van Vrankenb and Pierre Baldi*a

a Department of Computer Science, University of California, Irvine, Irvine, CA, USA. E-mail: pfbaldi@ics.uci.edu
b Department of Chemistry, University of California, Irvine, Irvine, CA, USA
First published on 1st December 2017
Reaction predictor is an application for predicting chemical reactions and reaction pathways. It uses deep learning to predict and rank elementary reactions by first identifying electron sources and sinks, pairing those sources and sinks to propose elementary reactions, and finally ranking the reactions by favorability. Global reactions can be identified by chaining together these elementary reaction predictions. We carefully curated a data set consisting of over 11000 elementary reactions, covering a broad range of advanced organic chemistry. Using this data for training, we demonstrate an 80% top-5 recovery rate on a separate, challenging benchmark set of reactions drawn from modern organic chemistry literature. A fundamental problem of synthetic chemistry is the identification of unknown products observed via mass spectrometry. Reaction predictor includes a pathway search feature that can help identify such products through multi-target mass search. Finally, we discuss an alternative approach to predicting electron sources and sinks using recurrent neural networks, specifically long short-term memory (LSTM) architectures, operating directly on SMILES strings. This approach has shown promising preliminary results.
Design, System, Application
We present reaction predictor, a system for predicting chemical reactions using deep learning. Reactions are predicted at the level of elementary mechanistic steps, and these predictions can be chained together to yield complex global reactions. Additionally, when given a set of starting materials and target masses, the system can search for unidentified products and suggest possible structures. Reaction predictor's design is modular in such a way that its predictions are readily interpretable. We can tell how it has made a prediction, and why, by following the sequence of elementary reaction steps involved. Reaction predictor uses our own carefully developed data set for training, and continues to improve as more data becomes available.
Rule-based approaches to reaction prediction1–3 can be fast, but the requisite systems of manually-implemented rules and exceptions require painstaking maintenance. While they may provide good results within a limited chemical domain, rule-based systems are constrained by the extent to which a human expert has defined the underlying rules, and no comprehensive system of rules covering all of chemistry currently exists. Rule-based systems do not scale well over the long term as new areas of chemistry are added. Furthermore, these systems typically make predictions at the level of overall chemical transformations. Multi-step reactions are condensed into a single transformation, and information about the elementary arrow-pushing steps comprising the multi-step reaction is not available. Yet these elementary steps are the building blocks for predicting novel multi-step global reactions and identifying side products.
Quantum-mechanical (QM) approaches may theoretically yield accurate results based on physical first principles, but in practice are highly sensitive to operator set-up, and are computationally expensive. Many recent studies involving QM-based prediction of reaction pathways are narrowly limited to a single chemical system.4–8 A clear benefit of these methods, when successful, is their ability to quantitatively predict important reaction parameters such as free energies, energy barriers, transition states, and reaction rates. Still, they require considerable human intervention and are not suitable for making high-throughput predictions.
Machine learning (ML) approaches9–11 are fast and scalable, but ideally require large data sets from which to learn. Obtaining such chemical data sets is a significant challenge, as many are proprietary and not readily available for academic use. Furthermore, reactions may be unbalanced or not atom-mapped, complicating attempts at statistical learning. Nonetheless, the ML approach to reaction prediction remains promising, and recent advances in deep learning have opened the door to further performance gains.
We also note that these three approaches can be complementary to one another. Rule-based systems have been used to provide training examples for ML-based systems.9 Similarly, ML-based systems can benefit from the implementation of a small number of carefully selected rules. In the case of reaction predictor, such rules are designed to address corner cases or express strong priors about specific products or resonance structures that may be problematic or redundant. Other work has explored using ML-based approaches to predict traditional QM observables like molecular atomization energies and electronic structure properties.12–14 Recent work in our group demonstrated synergies between the QM- and ML-based approaches.15 We showed that results from QM modeling can be used to derive new training examples for ML algorithms, creating a closed and automated positive feedback loop that improves the ML system.
Reaction predictor is a deep learning-based approach to reaction prediction that operates at the level of elementary reactions. This is a fundamental design choice that reflects how human experts think about chemical reactions. Each elementary step involves the movement of electrons from an electron source to an electron sink. In summary, given a set of input reactants, the reaction predictor pipeline operates in the following multi-step fashion (a schematic code sketch follows the list):
1. Enumerate all possible electron sources and electron sinks within the input reactant molecules.
2. Filter the list of candidate sources and sinks, predicting a smaller list containing only the most reactive sources and sinks.
3. Propose reactions by enumerating all combinations of source-sink pairings.
4. Rank the proposed reactions by favorability.
5. Iterate the above process to identify global reactions, or search for unidentified products.
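For illustration, the pipeline can be sketched in code as follows. This is a schematic sketch only; every function name in it is a hypothetical stand-in for the corresponding component described above, not our actual implementation.

```python
# Schematic sketch of the reaction predictor pipeline. All function names
# here are hypothetical stand-ins for the components described in the text.
def predict_reactions(reactants):
    # 1. enumerate all candidate electron sources and sinks
    sources, sinks = enumerate_sources_sinks(reactants)
    # 2. filter to only the most reactive candidates (MLP, described below)
    sources, sinks = filter_candidates(sources, sinks)
    # 3. propose elementary reactions from all source-sink pairings
    proposals = [propose_reaction(src, snk)
                 for src in sources for snk in sinks]
    # 4. rank the proposals by favorability (Siamese network, described below)
    return rank_by_favorability(proposals)

# 5. iterating predict_reactions on its own predicted products yields
#    global reactions, or a search for unidentified product masses
```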
Here we describe the design and methodology underpinning reaction predictor's ML-based predictions. To train our models, we carefully curated a training data set consisting of over 11000 elementary reactions, covering a broad range of advanced organic chemistry. We tested reaction predictor's performance on a benchmark data set of challenging real-world reactions, and demonstrate a high degree of accuracy. We also compare these results with the performance of an early prototype system that was developed in our group.10 Finally, we demonstrate that an alternative approach to predicting electron sources and sinks directly from SMILES strings, using a long short-term memory (LSTM) architecture,16,17 shows promising results.
We used a benchmark data set of 289 elementary reactions taken from challenging multi-step transformations to test the performance of our system. These reactions were drawn from the literature and inferred from reactions in Strategic Applications of Named Reactions in Organic Synthesis,22 covering a broad range of advanced organic chemistry intended to test the system's ability to generalize to real-world reactions. They are not a subset of the training data.
Using this cleaned data set as our starting point, we have added 6361 high-quality hand-curated reactions: (1) we added about two thousand additional elementary training reactions to cover gaps in various areas of chemistry: phosphorus functional groups, sulfur functional groups, silicon functional groups, tetrahedral intermediates, proton transfers, and migratory displacements. (2) About 1500 additional polar elementary reaction steps were chosen from reaction pathways covered in a first-year graduate course in organic reaction mechanisms, including, for example: carbonyl chemistry, enolates, enols, enamines, phosphorus chemistry, sulfur chemistry, allylsilanes, vinylsilanes, strained rings, non-classical carbocations, and others. (3) About a thousand (1004) polar mechanistic steps were sampled from multi-step reaction mechanisms in a well-known book on named reactions.22 (4) About two thousand additional reactions were drawn randomly from the literature and organic chemistry research presentations to further complement the data set. Some examples of notable new reaction types added to the data set are shown in Fig. 1. Fig. 1a shows an MPV reduction involving 12 mechanistic steps.23 Fig. 1b represents a Mitsunobu reaction with seven mechanistic steps.24 Fig. 1c shows an eight-step Stetter reaction.25 For each of these examples, critical steps in the reaction mechanism required elementary reactions that were not originally represented in the data set, but have since been added.
Fig. 1 Examples of complex transformations requiring chemistry that was not represented in the original data set. Hand-curated elementary reactions were added to the training data to address these reaction types, and many more, greatly improving predictive capability. Reaction (a) is a 12-step MPV reduction;23 (b) is a seven-step Mitsunobu reaction;24 and (c) is an eight-step Stetter reaction.25
At time of writing, the total number of elementary reactions in our data set is 11028. We have observed that the system improves as we grow the data set.
For source/sink filtering, we experimented with a variety of architectures, and obtained best results using a single fully-connected feedforward neural network (also called a multilayer perceptron, or MLP) with 1500 inputs, three hidden layers of 200 rectified linear units, and two independent sigmoid output units corresponding to a source prediction and a sink prediction. To avoid overfitting, we use a number of methods including 50% dropout applied to each hidden layer,26,27 and early stopping. To train the model, weights were initialized as described in Glorot and Bengio,28 and updated using the Adam optimizer29 on mini-batches of 64 examples. An exponentially decaying learning rate, and early stopping based on a validation set of 10% of the training set were used. Models were implemented using Keras and TensorFlow, and training was performed on an NVIDIA Titan X GPU.
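As an illustration, a minimal Keras sketch of this filtering network might look as follows. The architecture and training settings follow the text; the initial learning rate, decay schedule, and early-stopping patience are assumptions not stated in the paper.

```python
# Hedged sketch of the source/sink filtering MLP described above.
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(1500,))        # top-1500 atom-level features
x = inputs
for _ in range(3):                         # three hidden layers
    x = layers.Dense(200, activation="relu",
                     kernel_initializer="glorot_uniform")(x)
    x = layers.Dropout(0.5)(x)             # 50% dropout on each hidden layer
# two independent sigmoid outputs: P(source) and P(sink)
outputs = layers.Dense(2, activation="sigmoid")(x)

model = keras.Model(inputs, outputs)
lr = keras.optimizers.schedules.ExponentialDecay(      # assumed schedule
    initial_learning_rate=1e-3, decay_steps=1000, decay_rate=0.96)
model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr),
              loss="binary_crossentropy")

early_stop = keras.callbacks.EarlyStopping(patience=5,  # assumed patience
                                           restore_best_weights=True)
# training: mini-batches of 64, with 10% of the training set held out
# model.fit(X, y, batch_size=64, validation_split=0.1, callbacks=[early_stop])
```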
We built the training data for our source/sink filtering network as follows. For each elementary reaction in the database, we extracted four training examples of atom reactivity: (1) the labeled source, (2) the labeled sink, and (3, 4) two randomly sampled non-source, non-sink examples. This is illustrated in Fig. 3. There are two advantages to this method compared with the original method of extracting two positively labeled examples (true source and true sink) plus all remaining negative examples. First, we avoid the significant imbalance inherent in the data, as we observe approximately 22 times more negative examples than positive examples; previously this was addressed by oversampling the positive examples. Second, we avoid adding potentially misleading examples to our training data. Specifically, we avoid negatively labeled examples that should actually be considered secondary sources or sinks. For a given elementary reaction, the set of atoms not explicitly labeled source or sink contains mostly poor sources and poor sinks. However, some could be considered “second-tier” sources/sinks, either on their own or within a different molecular context. By randomly selecting from these atoms to generate our negative examples, we obtain a representative sample of the non-source, non-sink examples in our data, while avoiding labeling all potential second-tier sources/sinks as negative examples. Extracting the data as described above, our training set for source/sink filtering consisted of 23850 examples, half of which were positive examples.
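A minimal sketch of this four-examples-per-reaction extraction scheme is shown below; the `reaction` object and its attributes are a hypothetical API, not our actual data structures.

```python
# Sketch of the extraction scheme described above. The `reaction` object
# and its attributes (source_atom, sink_atom, atoms) are hypothetical.
import random

def extract_examples(reaction):
    examples = [
        (reaction.source_atom, (1, 0)),    # labeled source: (source, sink)
        (reaction.sink_atom,   (0, 1)),    # labeled sink
    ]
    # two negatives sampled at random from the remaining atoms, rather
    # than labeling every other atom as a negative example
    others = [a for a in reaction.atoms
              if a not in (reaction.source_atom, reaction.sink_atom)]
    for atom in random.sample(others, 2):
        examples.append((atom, (0, 0)))    # non-source, non-sink
    return examples
```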
After identifying sources and sinks, and pairing them together, we must rank the resulting set of proposed reactions. To do so, we train a deep Siamese architecture neural network30,31 to compute a reaction favorability score. Fig. 4 illustrates this architecture. Training examples consist of ordered reaction pairs (Rfavorable, Runfavorable), where the favorable reaction is always presented to the left instance of the shared-weight neural network. Fixed weights of +1 and −1 for left and right outputs are connected to a final sigmoid unit. Thus the final output y approaches 1 if the left reaction was scored higher than the right reaction, and 0 otherwise. After model training, we use one instance of the shared-weight network to compute favorability scores for all reactions, and rank them based on those scores.
For the shared-weight network, we used two hidden layers of 300 tanh units, and a sigmoid output. Initialization and training proceeded as described above for the source/sink models. We generated training examples (Rfavorable, Runfavorable) as follows. For each elementary reaction in the data set, we have one reaction, Rfavorable, formed by pairing the labeled electron source with the labeled electron sink. We can propose many additional unfavorable reactions by pairing the labeled source with all non-sinks, and all non-sources with the labeled sink, within the constraints of chemical feasibility. We use this set of unfavorable reactions to create additional training pairs (Rfavorable, Runfavorable), yielding 387744 total training examples.
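A minimal Keras sketch of this Siamese ranking network follows. The shared scoring network matches the text (two hidden layers of 300 tanh units and a sigmoid output); the 2000-dimensional input corresponds to the reaction-level features described below, and all training targets are 1 because the favorable reaction is always presented on the left.

```python
# Hedged sketch of the Siamese ranking network (Fig. 4).
from tensorflow import keras
from tensorflow.keras import layers

scorer = keras.Sequential([                 # shared-weight scoring network
    keras.Input(shape=(2000,)),             # reaction-level features
    layers.Dense(300, activation="tanh"),
    layers.Dense(300, activation="tanh"),
    layers.Dense(1, activation="sigmoid"),  # reaction favorability score
])

left = keras.Input(shape=(2000,))           # R_favorable (always on the left)
right = keras.Input(shape=(2000,))          # R_unfavorable
# fixed +1/-1 weights into a final sigmoid: y -> 1 when left outranks right
diff = layers.Subtract()([scorer(left), scorer(right)])
y = layers.Activation("sigmoid")(diff)

model = keras.Model([left, right], y)
model.compile(optimizer="adam", loss="binary_crossentropy")
# after training, scorer(...) alone produces the favorability scores used
# to rank all proposed reactions
```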
Then, to accurately rank reactions by favorability, we must extract reaction-level information that captures both the source/sink interaction and the overall molecular change(s) occurring in the elementary reaction. These reaction-level features include (1) a combination of concatenated source and sink atom-level features, (2) features describing the type of orbitals involved, and (3) net change features, which are created by computing molecular fingerprints for both the reactants and products, then subtracting the reactants fingerprint from the products fingerprint to capture the net changes that occurred during the reaction, e.g. which functional groups and motifs were formed or destroyed.
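As an illustration of the net change features, the following sketch computes a fingerprint difference with RDKit. The specific fingerprint type (hashed Morgan counts) is an assumption, as the text does not specify one.

```python
# Illustrative net change features via fingerprint subtraction. The hashed
# Morgan count fingerprint is an assumed choice. Multiple species can be
# joined with '.' in a single SMILES string.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def count_fingerprint(smiles, n_bits=2048):
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetHashedMorganFingerprint(mol, radius=2, nBits=n_bits)
    arr = np.zeros(n_bits)
    for idx, count in fp.GetNonzeroElements().items():
        arr[idx] = count
    return arr

def net_change_features(reactants_smiles, products_smiles):
    # positive entries: motifs formed; negative entries: motifs destroyed
    return (count_fingerprint(products_smiles)
            - count_fingerprint(reactants_smiles))
```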
Extracting the features described above for all examples in our data set yields 293046 atom-level features, and 62560 reaction-level features. We select the top 1500 atom-level and top 2000 reaction-level features with the highest mutual information and use those features as the input representations for our source/sink filtering and ranking models.33,34
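The selection step itself is a standard mutual-information ranking; a sketch using scikit-learn (an assumed library choice, with placeholder data standing in for the full feature matrix) follows.

```python
# Sketch of mutual-information feature selection with scikit-learn (an
# assumed library choice; the paper's own selection code is not shown).
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X = np.random.randint(0, 2, size=(1000, 5000))  # placeholder feature matrix
y = np.random.randint(0, 2, size=1000)          # placeholder atom labels

selector = SelectKBest(mutual_info_classif, k=1500)  # top-1500 features by MI
X_selected = selector.fit_transform(X, y)
```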
Fig. 6 Initially, reaction predictor only modeled lithium to include at most one bond. Now the system is capable of correctly predicting this LDA deprotonation to make an enolate via an Ireland transition state.38
Fig. 7 Reaction predictor can predict this intermolecular Prins mechanistic step because it can model the Ti atom in TiCl4.39
Fig. 8 Examples of complex transformations requiring multiple reactant equivalents. In (a), two equivalents of chloroacetyl chloride and pyridine are necessary to complete the double acylation reaction involving six mechanistic steps. In (b), one equivalent of a Grignard reagent adds to the ester, and a second equivalent is needed to add to the ketone that is generated in situ. Reaction predictor is able to predict both of these products from the literature by adding additional reactant equivalents.40,41
Table 1 Benchmark results for reaction predictor (RP) vs. the early prototype

| Metric | RP | Prototype |
| --- | --- | --- |
| Products recovered | 83.0% | 58.1% |
| Products ranked in top-5 | 80.0% | 76.8% |
| Mean time per reaction (s) | 8.3 | 91.8 |
| Total reactions proposed | 20812 | 92158 |
First, we note that the early prototype initially failed to run the benchmark data set due to its inability to handle certain atom types, as described above. We incorporated the necessary updated library into that version so that it could complete the test.
We observe that reaction predictor recovers 83.0% of the expected products anywhere in the ranked list, vs. 58.1% for the prototype. This difference is particularly striking when we consider that the prototype version is biased towards predicting more false positive sources/sinks, in exchange for, ideally, improved recall. We see this reflected in the total number of reactions proposed by each version, with the prototype proposing 92158 reactions, while our current version proposes 20812. Yet, even though the prototype predicted so many more reactions, it still only recovered 58.1% of the expected products. That reaction predictor can predict nearly 72000 fewer reactions while recovering significantly more of the products (83.0%) indicates a dramatic improvement in source/sink filtering capability. We believe this is a result of two factors: the improved breadth and quality of the expanded training data, and the application of deep learning to better learn from that data.
Reaction predictor also performs well in terms of how many correct products are ranked within the top-5 of the ordered reaction list, with 80.0% of correct reactions ranked within the top-5. We have observed that the ranking performance was slower to improve as we grew the training data, compared with source/sink accuracy. Being able to generate hundreds of thousands of training examples for the ranking model, even with only several thousand elementary reactions in the training set, seems to convey enough information to yield a highly accurate trained ranking model. Our observation that doubling the number of elementary reactions in the training set led to a dramatic improvement in source/sink identification, but a relatively minor improvement in ranking performance, seems to imply that the task of source/sink prediction is more difficult than the ranking task. How much of this apparent imbalance is a reflection of chemical reality, versus how much may be an artifact of our particular computational approach and ML design, is unclear. From the perspective of a human chemist, it seems surprising that the source/sink prediction task should be significantly more challenging. We suspect that ultimately the problem is the scarcity of available training data. As more data becomes available, we expect the source/sink predictions to become increasingly accurate.
We also note the significant improvement in average runtime per reaction. For the early prototype, each prediction took 91.8 seconds on average, while reaction predictor is more than ten times faster, averaging 8.3 seconds per reaction. This is a major boon for offline pathway searches, which benefit greatly from faster prediction speed. Results that would have previously taken 10 days can now be processed overnight.
We hypothesize that the homogeneity of the combinatorially-generated reactions introduced biases that degraded predictive performance. We attempted to counteract this by randomly sampling a small fraction of the total reactions generated from each of our combinatorial reaction templates and including only those in the training set. Nonetheless, we observed degraded performance even when the smaller samples of reactions were used in the training set. We believe there is likely value in using combinatorial reaction generation, if done in such a way as to closely mimic the molecular variety observed in real-world reactions. However, the time required to carefully design and validate the necessary templates and constraints, while also aiming to simulate a realistic variety of molecular contexts, is by our estimation better spent simply writing high-quality elementary reactions by hand. For now, we leave combinatorial reaction generation aside for potential future work.
We submitted a number of pathway search jobs based on actual mass spectrometry data to test the feature's ability to identify unknown peaks. Fig. 9 and 10 illustrate typical instances where pathway search suggested plausible product pathways or structures for unidentified target masses. In Fig. 9, an unexpected product was found after fluoride deprotection.42 Pathway search suggested 23 pathways to generate this mass beginning from the actual starting materials used, including the highly plausible imidazolidine-2,4-dione shown. Fig. 10 illustrates a malonate alkylation that generated the desired product in low yield. Pathway search identified a plausible structure corresponding to over-alkylation of the reactant, in effect “troubleshooting” the reaction by suggesting an explanation for the low yield observed.
Other times, pathway search may not find a match for a target of interest, or it may find matches that are implausible or were arrived at via implausible mechanisms. These failure modes can be alleviated, to some extent, by adjusting the parameters of the pathway search, which are fully customizable. Users can control the branching factor – how many of the top-ranked reactions are pursued at each search step – as well as the maximum search depth. Even implausible structures matching a target mass may spark ideas or suggest paths to identifying plausible alternatives. Continued improvements to source/sink prediction and ranking accuracy are ultimately the most important factors contributing to pathway search efficacy.
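A hypothetical sketch of this search loop is shown below; `predict_and_rank` stands in for the full prediction pipeline described earlier, and the parameter names and defaults are illustrative only.

```python
# Hypothetical sketch of pathway search with a branching factor and a
# maximum depth. `predict_and_rank` and the `rxn.products` attribute are
# stand-ins, not the actual implementation.
from rdkit import Chem
from rdkit.Chem.Descriptors import ExactMolWt

def pathway_search(reactants, target_masses, branching_factor=3,
                   max_depth=5, tol=0.01, path=()):
    hits = []
    # pursue only the top-ranked elementary reactions at each step
    for rxn in predict_and_rank(reactants)[:branching_factor]:
        for smiles in rxn.products:
            mass = ExactMolWt(Chem.MolFromSmiles(smiles))
            if any(abs(mass - t) <= tol for t in target_masses):
                hits.append(path + (rxn,))   # pathway reaching a target mass
        if len(path) + 1 < max_depth:
            hits += pathway_search(rxn.products, target_masses,
                                   branching_factor, max_depth, tol,
                                   path + (rxn,))
    return hits
```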
Our MLP-based source and sink predictor has some inherent limitations. The input features only cover a limited amount of context for any given molecule, e.g. neighborhoods of atoms within six bonds. Furthermore, these features do not contain any information about other reactants in a given reaction, which could render the prediction task virtually impossible in some cases, where a part of a molecule could act as a sink or a source depending on which other reactants are present.
We therefore explore a fundamentally different model to overcome these limitations. At a high level, this model is based on recurrent neural networks that operate on the canonicalized SMILES strings of all reactants. It is able to learn and use features that encode the context of the entire reaction when making predictions for the locations of sinks or sources. This model can operate on an arbitrary number of reactants of arbitrary size/length† and is invariant to the order in which reactants are presented. We achieve this invariance by operating on reactants in parallel, and extracting fixed-length feature vectors for all reactants, which are merged into groups of reaction-level features by averaging. A different part of our network uses these reaction-level, order-invariant features as context when it predicts sources and sinks for a given molecule.
A more technical description of the model and operating procedure is as follows: SMILES representations of reactants are canonicalized using an implicit representation of hydrogen atoms. We then remove all information that could trivialize the source/sink prediction task. For example, in our data set all potential hydrogen sinks (or sources) appear as the first element in the SMILES string. Thus we remove these and add a ‘start-of-sequence’ character to every SMILES string. The training objective of our model is to predict, for every character in the input SMILES strings of all reactants, whether it corresponds to an atom (or bond) that will act as a sink, source, or both. We further replace multi-letter atoms (e.g. ‘Cu’, ‘Al’) with new unique single characters.
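A sketch of this preprocessing is shown below; the start-of-sequence symbol and the specific single-character substitutions are assumptions, as the paper specifies the steps but not the exact characters.

```python
# Sketch of the SMILES preprocessing described above. The symbols chosen
# for the substitutions and the start-of-sequence marker are assumptions.
from rdkit import Chem

SINGLE_CHAR = {"Cl": "X", "Br": "Y", "Cu": "U", "Al": "A"}  # examples only
START = "^"                                 # start-of-sequence character

def preprocess(smiles):
    # canonicalize with implicit hydrogens (RDKit's default behavior)
    can = Chem.MolToSmiles(Chem.MolFromSmiles(smiles))
    # replace multi-letter atom symbols with unique single characters
    for sym, ch in SINGLE_CHAR.items():
        can = can.replace(sym, ch)
    return START + can
```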
The neural network model conceptually consists of eight elements, shown in Fig. 11; a simplified code sketch follows the list below. The layers of this model operate on all reactants in parallel, and receive information about the operations on other reactants of the reaction only at the reaction-level-feature merging step (see step 4). The input consists of a variable number (≈ number of reactants) of character strings of variable length (≈ length of individual reactant SMILES). We therefore use zero-padding for mini-batches to speed up training.
1. The first layer of the neural network maps the input characters into a learned set of vectors (also called embedding), where the number of vectors equals the number of unique characters in all SMILES.
2. This layer is followed by a one-dimensional convolutional neural network (CNN),44 and then further by a recurrent bi-directional LSTM.16,17
3. The layers of the bi-directional LSTM traverse the sequences that correspond to a processed version of the original SMILES strings in the forwards and backwards directions. By doing so they accumulate and compute a fixed-length vector representation of the individual reactants, i.e. learned fingerprint vectors for the reactants.
4. These representation vectors are then redistributed across all processing streams of all reactants: a given stream receives the fingerprint of its own molecule and the average of fingerprints of the other molecules/reactants. This way we achieve commutativity of reactants while not mixing contexts of “this” and “other” reactants.
5. These two context vectors are concatenated and replicated across the output of the convolutional layer (2). We thereby re-use the character-level representations that are produced by the convolutional network (for efficiency and to promote weight sharing), while also augmenting them with reaction-level information about all reactants.
6. A bi-directional LSTM, separate from and unrelated to (3), then operates on this augmented stream of vectors and produces one output for each vector. These vectors have a one-to-one correspondence to characters in the input SMILES.
7. The LSTM outputs are then (optionally) processed by a CNN that can refine and sharpen predictions on a local level.
8. A final sequence-distributed MLP then computes the class-probability predictions for each element of a given reactant. These are the character-level predictions of sources/sinks for every atom (or non-atom symbol) in the input SMILES strings, computed for all reactants in parallel.
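The following is a highly simplified single-stream Keras sketch of elements (1) through (8). All dimensions, filter widths, and the per-character output scheme are assumptions, and the reaction-level merging across parallel streams (element 4) is represented here by a precomputed input.

```python
# Simplified single-reactant-stream sketch of the eight-element
# architecture (Fig. 11). Sizes and the label scheme are assumptions.
from tensorflow import keras
from tensorflow.keras import layers

VOCAB, EMBED, HIDDEN, MAXLEN = 64, 32, 64, 120  # assumed sizes

chars = keras.Input(shape=(MAXLEN,))                        # padded SMILES
x = layers.Embedding(VOCAB, EMBED)(chars)                   # (1) embedding
conv = layers.Conv1D(HIDDEN, 3, padding="same",
                     activation="relu")(x)                  # (2) 1-D CNN
own_fp = layers.Bidirectional(layers.LSTM(HIDDEN))(conv)    # (3) fingerprint

# (4) reaction-level context: the mean fingerprint of the *other*
# reactants, computed by the parallel streams and supplied here as input
other_fp = keras.Input(shape=(2 * HIDDEN,))
context = layers.Concatenate()([own_fp, other_fp])          # own + others
rep = layers.RepeatVector(MAXLEN)(context)                  # (5) replicate
aug = layers.Concatenate()([conv, rep])                     # across conv out

seq = layers.Bidirectional(
    layers.LSTM(HIDDEN, return_sequences=True))(aug)        # (6) second LSTM
seq = layers.Conv1D(HIDDEN, 3, padding="same",
                    activation="relu")(seq)                 # (7) optional CNN
probs = layers.TimeDistributed(
    layers.Dense(2, activation="sigmoid"))(seq)             # (8) per-character
                                                            # source/sink probs
model = keras.Model([chars, other_fp], probs)
model.compile(optimizer="adam", loss="binary_crossentropy")
```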
This model can predict multiple sources/sinks for one reaction or even reactant. We therefore post-process the model's predictions for all reactions by ranking the predicted probabilities, and only using the single most confident source and sink prediction for any given reaction.
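A sketch of this post-processing step, assuming per-character probabilities of the form produced by the sketch above:

```python
# Keep only the most confident source and the most confident sink
# prediction per reaction.
import numpy as np

def best_source_sink(probs):
    # probs: shape (n_characters, 2) -> columns are P(source), P(sink)
    return int(np.argmax(probs[:, 0])), int(np.argmax(probs[:, 1]))
```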
The LSTM architecture uses rectified linear units for its hidden layers and sigmoid outputs. It was trained for 60 epochs on a set of 10052 labeled SMILES strings. Weights were initialized according to Glorot and Bengio,28 and updated using the Adam optimizer29 with an exponentially decaying learning rate. Training was performed on an NVIDIA Titan X GPU.
To gauge the performance of the LSTM on challenging reactions, we checked its source/sink prediction accuracy on the benchmark data set of 289 reactions, and compared those results with the MLP-based predictor's performance. Table 2 summarizes these results. When we considered only the top-1 highest-scored source and sink, the MLP recovered both correctly in 31.5% of the reactions, while the LSTM recovered both correctly for 29.7% of the reactions. If we consider the top-5 predictions, the MLP recovers both the correct source and the correct sink for 90.0% of the reactions, and the LSTM for 72.4% of them. Similarly, top-10 accuracy for the MLP was 96.9%, and 86.4% for the LSTM. Finally, top-20 accuracy for the MLP was 99.0%, versus 94.6% for the LSTM.
Table 2 Top-N accuracy for recovering both the labeled source and sink: MLP vs. LSTM

| Top-N accuracy (source & sink) | MLP | LSTM |
| --- | --- | --- |
| Top-1 | 31.5% | 29.7% |
| Top-2 | 60.2% | 51.6% |
| Top-3 | 75.8% | 63.4% |
| Top-4 | 85.8% | 68.8% |
| Top-5 | 90.0% | 72.4% |
| Top-10 | 96.9% | 86.4% |
| Top-20 | 99.0% | 94.6% |
The top-1 score of 29.7% for the LSTM is a promising result, and is particularly impressive as it is comparable to the MLP's top-1 performance of 31.5%. But this result is tempered by the LSTM's lower performance at recovering the labeled sources/sinks within top-5 and top-10 constraints. The crucial job of the source/sink filtering model is to recover all reactive electron sources and sinks, while minimizing the prediction of false positives. Table 2 shows that while the MLP is comparably accurate at predicting a single best source and sink, it is considerably better at recovering the correct sources/sinks, with fewer false positives, when multiple source/sink predictions are allowed. Specifically, the top-2 through top-5 scores are 60.2%, 75.8%, 85.8%, and 90.0% for the MLP, vs. 51.6%, 63.4%, 68.8%, and 72.4% for the LSTM. In reality, we expect that a complex chemical reactant will have multiple potentially reactive sources and sinks, and the MLP recovers those more efficiently than the LSTM. In fact, even if we allow the LSTM to pick its top-10 proposed sources and sinks, its reaction-level accuracy of 86.4% is still lower than the top-5 MLP result of 90.0%. Nonetheless, these results are very promising and indicate potential for using an LSTM architecture for effective source/sink prediction. We emphasize that one advantage of our LSTM approach is its ability to consider the entire context of the reactants when making its source/sink predictions. This capability fundamentally shifts our expectations of what a good source/sink filter should do. Whereas the MLP predicts a set of potentially many reactive sources and sinks, but is necessarily less accurate in choosing the single best ones because it lacks contextual information, the LSTM can see the entire reaction context and, at least in theory, predict the best source/sink pair given that complete information.
The design of reaction predictor significantly mitigates the “black box” problem of deep learning systems. This refers to the observation that predictions from these systems can be difficult to interpret, if not inscrutable. A hypothetical LSTM making global reaction predictions, for example, has the problem that the “logic” behind any given prediction is obscured, and mistakes are opaque. In contrast, because reaction predictor's design is modular and operates at the level of elementary reactions, we can tell how it has made a prediction, and why, simply by following the sequence of elementary reaction steps involved. Specifically, if reaction predictor makes a mistake in a global reaction, we can look at the underlying elementary steps and identify which step(s) contains an error. And for any erroneous step, we can look at the specific electron sources and sinks that were selected and their rankings in order to narrow down the error and take specific measures to fix it, for instance by adding corrective examples to the training set.
Finally, we demonstrate a promising LSTM-based approach to predicting reactive sites based solely on SMILES strings. This could be used in future work to complement and improve the existing MLP-based source/sink filters. Ultimately we expect reaction predictor will continue to improve over time as new opportunities for refinement are identified, and as more training data becomes available.
Reaction predictor will be made publicly available via http://cdb.ics.uci.edu upon acceptance of the paper.
Footnote
† However, we expect that the accuracy of predictions will degrade for reactions with very large molecules.