A graph-convolutional neural network model for the prediction of chemical reactivity

We present a supervised learning approach to predict the products of organic reactions given their reactants, reagents, and solvent(s).


S2.2 Weisfeiler-Lehman Network (WLN)
The Weisfeiler-Lehman network (WLN) 32 is a type of graph convolutional network derived from the Weisfeiler-Lehman (WL) graph kernel 38. The architecture is designed to embed the computations inherent in the WL graph kernel so as to learn an isomorphism-invariant representation of atoms. The atom representation is computed by iteratively augmenting each atom's representation with those of adjacent atoms. Specifically, each atom v is initialized with a feature vector f_v indicating its atomic number, formal charge, degree of connectivity, explicit and implicit valence, and aromaticity. Each bond (u, v) is associated with a feature vector f_{uv} indicating its bond order and ring status. In each iteration, we update atom representations as follows:

f_v^{(l)} = \tau\left( U_1 f_v^{(l-1)} + U_2 \sum_{u \in N(v)} \tau\left( V_1 f_u^{(l-1)} + V_2 f_{uv} \right) \right)

where f_v^{(l)} is the atom representation at the l-th iteration, initialized with the atom features f_v^{(0)} = f_v, N(v) is the set of atoms adjacent to v, and \tau(\cdot) is a nonlinear activation. U_1, U_2, V_1, V_2 are model parameters to be learned, shared across all L iterations. The final local atom representations are computed as

c_v = \sum_{u \in N(v)} W_1 f_u^{(L)} \odot W_2 f_{uv} \odot W_3 f_v^{(L)}

where \odot denotes element-wise multiplication and W_1, W_2, W_3 are learned matrices. We refer the reader to ref. 32 for more details about the mathematical intuition and justification of the WLN.
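The iterative update above can be sketched in a few lines of plain Python. This is a minimal numerical illustration only: \tau is taken to be ReLU, the feature dimensions are toy-sized, and the parameter names U1, U2, V1, V2 mirror the text but their shapes are assumptions.

```python
def relu(x):
    return [max(0.0, xi) for xi in x]

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def vadd(a, b):
    return [p + q for p, q in zip(a, b)]

def wln_update(f, f_bond, neighbors, U1, U2, V1, V2, L):
    """L iterations of the WLN update: each atom's vector is augmented
    with messages tau(V1 f_u + V2 f_uv) aggregated from its neighbors.

    f         : {atom: feature vector}, the initial f_v^(0)
    f_bond    : {(u, v): bond feature vector}, keyed by sorted atom pair
    neighbors : {atom: list of adjacent atoms}
    U1, U2, V1, V2 are shared across all L iterations, as in the text.
    """
    for _ in range(L):
        new_f = {}
        for v, nbrs in neighbors.items():
            msgs = [relu(vadd(matvec(V1, f[u]),
                              matvec(V2, f_bond[tuple(sorted((u, v)))])))
                    for u in nbrs]
            # Sum the neighbor messages (zero vector if no neighbors).
            agg = [sum(xs) for xs in zip(*msgs)] if msgs else [0.0] * len(f[v])
            new_f[v] = relu(vadd(matvec(U1, f[v]), matvec(U2, agg)))
        f = new_f
    return f
```

For a two-atom "molecule" with identity parameter matrices, one iteration mixes each atom's features with its neighbor's features and the shared bond features, which is exactly the locality the next section argues is insufficient on its own.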

S2.3 Attention Mechanism
The atom embedding c_v only records the local chemical environment, namely atoms and bonds accessible within L steps from atom v. Even if L were very large, c_v could not encode any information about other reactant molecules, as information cannot be propagated between two disconnected reactant molecules. We argue that it is important to enable information to flow between distant or disconnected atoms. For example, the reaction center may be influenced by certain reagents that are disconnected from the reactant molecules; in this case, it is necessary for the atom representation c_v to encode such distal chemical effects. Therefore, we enhance the model from the previous section with an attention mechanism 39. Specifically, let \alpha_{vz} be the attention score of atom v on atom z. The "global" atom representation \tilde{c}_v of atom v is calculated as the weighted sum over all reactant atoms, where the weights come from the attention module:

\tilde{c}_v = \sum_z \alpha_{vz} c_z, \qquad \alpha_{vz} = \sigma\left( u^\top \tau( P_1 c_v + P_2 c_z ) \right)

The attention score is computed based on the "local" atom representations c_v from the WLN; \sigma(\cdot) denotes the sigmoid function and u, P_1, P_2 are learned parameters.
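The attention pooling can be sketched as a double loop over atoms. This is an illustrative toy, not the authors' implementation: \tau is ReLU, the score is a sigmoid over a learned projection, and the parameter names u, P1, P2 are assumptions mirroring the text.

```python
import math

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def relu(x):
    return [max(0.0, t) for t in x]

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def attend(c, u, P1, P2):
    """Global atom vectors c~_v = sum_z alpha_vz c_z, summed over ALL
    reactant atoms z, so information flows between disconnected molecules.

    c          : {atom: local WLN representation c_v}
    u, P1, P2  : attention parameters (illustrative shapes)
    """
    c_tilde = {}
    for v, cv in c.items():
        out = [0.0] * len(cv)
        for z, cz in c.items():
            # alpha_vz = sigmoid(u . relu(P1 c_v + P2 c_z))
            alpha = sigmoid(sum(a * b for a, b in zip(
                u, relu([p + q for p, q in
                         zip(matvec(P1, cv), matvec(P2, cz))]))))
            out = [o + alpha * czj for o, czj in zip(out, cz)]
        c_tilde[v] = out
    return c_tilde
```

Because the sum runs over every reactant atom z, including atoms in other (disconnected) molecules, \tilde{c}_v can pick up the distal reagent effects described above.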

S2.4 Reaction Center Prediction
The WLN is trained to predict the reaction center, a set of changes in graph connectivity that describes the difference between the reactant molecules and the major product. Mathematically, a reaction center is a set {(u, v, b)}, where (u, v) is a pair of atoms whose connecting bond has changed to type b. We predict the likelihood of (u, v, b) being in the reaction center by passing atom representations from the WLN through another neural network:

s_{u,v,b} = \sigma\left( u_b^\top \tau( M_1 \tilde{c}_u + M_1 \tilde{c}_v + M_2 f_{uv} ) \right)

This network is jointly optimized with the WLN to minimize the cross-entropy loss

\mathcal{L} = - \sum_{(u,v)} \sum_b \left[ y_{u,v,b} \log s_{u,v,b} + (1 - y_{u,v,b}) \log(1 - s_{u,v,b}) \right]

where y_{u,v,b} = 1 iff (u, v, b) is in the reaction center, and the loss sweeps over every pair of atoms and every bond type (including "no bond").
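The loss is an ordinary cross entropy summed over all (atom pair, bond type) triples. A minimal sketch, with scores and labels held in plain dictionaries (the data layout is an assumption for illustration):

```python
import math

def reaction_center_loss(scores, labels):
    """Cross entropy summed over every atom pair (u, v) and bond type b,
    including "no bond"; y_uvb = 1 iff (u, v, b) is in the reaction center.

    scores : {(u, v, b): predicted likelihood s_uvb in (0, 1)}
    labels : set of (u, v, b) triples in the true reaction center
    """
    loss = 0.0
    for key, s in scores.items():
        y = 1.0 if key in labels else 0.0
        loss -= y * math.log(s) + (1.0 - y) * math.log(1.0 - s)
    return loss
```

An uninformative score of 0.5 on a true triple contributes log 2 to the loss; confident correct scores drive the sum toward zero.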

S2.5 Candidate Ranking via Weisfeiler-Lehman Difference Network (WLDN)
At the stage of candidate reaction evaluation, we have a list of candidate products {p 0 , p 1 , · · · , p m } given a set of reactant molecules r. The goal is to learn a scoring function that ranks the true product p 0 to be the highest. The challenge in ranking candidate products is again representational. We must learn to represent (r, p) in a manner that can focus on the key difference between the reactants r and products p, while also incorporating the necessary chemical contexts surrounding the changes.
The architecture of the WLDN is designed to highlight such differences. It has two components. The first is a Siamese WLN that learns atom representations of the reactants r and each candidate product p_i. Let c_v^{(p_i)} be the learned representation of atom v in candidate product p_i and c_v^{(r)} its representation in the reactants; because the reactants and products are atom-mapped, we can use v to refer to the same atom in different molecules. We define the difference vector pertaining to atom v as

d_v^{(p_i)} = c_v^{(p_i)} - c_v^{(r)}

The second component of the WLDN is another WLN that operates on the difference graph between reactants and products. A difference graph D(r, p_i) is defined as a molecular graph with the same graph structure as p_i, but with atom v's feature vector replaced by d_v^{(p_i)}. Operating on the difference graph has several benefits. First, in D(r, p_i), atom v's feature vector deviates from zero only if it is close to the reaction center, thus focusing the processing on the reaction center and its immediate context. Second, D(r, p_i) makes explicit the neighbor dependencies between difference vectors. The WLDN maps this graph-based representation into a fixed-length vector by applying the second WLN on top of D(r, p_i). Note that although we use the same notation, the matrices U_*, V_*, W_* of this second WLN are parameters distinct from those of the WLN used in reaction center prediction; we reuse the symbols for notational convenience.
Let RC(p_i) = {(u_i, v_i, b_i)} be the set of bonds that change from the reactants r to product p_i. The final score of candidate p_i is

s(p_i) = u^\top \sum_{v} d_v^{(L)} + \sum_{(u_i, v_i, b_i) \in RC(p_i)} s_{u_i, v_i, b_i}

where d_v^{(L)} are the atom representations produced by the second WLN on D(r, p_i) and u is a learned pooling vector. Compared to ref. 29, we augment the WLDN with the quantitative scores s_{u,v,b} for each bond change from the reaction center prediction. This is beneficial because candidate outcomes produced by combinations of more likely bond changes are themselves more likely to be the true outcome.
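The scoring pipeline above can be sketched end to end: form difference vectors, pool them, and add the bond-change scores. This is a simplified illustration (it omits the second WLN pass over the difference graph, noted in a comment), and all argument names are assumptions.

```python
def wldn_score(c_react, c_prod, s_bond, rc, pool_u):
    """Score one candidate product p_i (a sketch; the real WLDN runs a
    second WLN over the difference graph before pooling).

    c_react, c_prod : {atom: representation} from the Siamese WLN; the
                      atom mapping lets one key refer to the same atom.
    s_bond          : {(u, v, b): reactivity score s_uvb}
    rc              : RC(p_i), the set of (u, v, b) bond changes
    pool_u          : pooling vector u (illustrative)
    """
    # Difference vectors d_v = c_v^(p_i) - c_v^(r); nonzero only near
    # the reaction center.
    d = {v: [p - r for p, r in zip(c_prod[v], c_react[v])] for v in c_prod}
    # (A second WLN would be applied to the difference graph D(r, p_i)
    # here; we pool the raw difference vectors directly for brevity.)
    pooled = [sum(dv[j] for dv in d.values()) for j in range(len(pool_u))]
    score = sum(a * b for a, b in zip(pool_u, pooled))
    # Augment with the quantitative scores of the proposed bond changes.
    return score + sum(s_bond[t] for t in rc)
```

Candidates are then ranked by this score, with the true product ideally ranked highest.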

S3.1 Number of bond changes per reaction
The combinatorics of the enumeration scales poorly with the number of bond changes allowed per reaction. As stated in the main text, the number of candidates per reaction is bounded by

\sum_{i=1}^{5} \binom{K}{i}

where we have allowed up to 5 simultaneous bond changes and K = 16 (i.e., we select up to 5 bond changes from the 16 most likely bond changes) in our later evaluations. The choice of 5 was motivated by an analysis of the number of bond changes in training and validation examples, shown in Table S1. We sacrifice a 0.1-0.2% loss in maximum possible predictive accuracy through this limitation, but significantly restrict the number of candidates that must be ranked. To allow predictions of the remaining 0.1-0.2%, it would be possible to use a dynamic upper limit on the number of simultaneous bond changes that takes into account what those bond changes are (e.g., to allow complex pericyclic reactions that may have many bond rearrangements).
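The bound is easy to evaluate concretely. With the stated settings it comes to a few thousand candidates per reaction:

```python
from math import comb

def max_candidates(K=16, max_changes=5):
    """Upper bound on enumerated candidates per reaction: choose up to
    `max_changes` bond changes from the K most likely ones."""
    return sum(comb(K, i) for i in range(1, max_changes + 1))

# With K = 16 and up to 5 simultaneous bond changes:
# 16 + 120 + 560 + 1820 + 4368 = 6884 candidates at most.
```

Raising the limit from 5 to 6 would add a further C(16, 6) = 8008 candidates, more than doubling the bound, which illustrates why the cap matters for ranking cost.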

S3.2 Computational cost of training and prediction
An important aspect of predictive model performance is speed, or computational cost, particularly in cases where the model may be applied in a high-throughput virtual screen. All experiments were run using a single NVIDIA Titan X graphics card. Training of the reaction center prediction model (WLN) completed after 19 hours (140,000 minibatches of 20 examples, an average of 24 ms per example). Training of the candidate ranking model (WLDN) completed after 72 hours (2,400,000 minibatches of a single reaction and its candidate outcomes, an average of 108 ms per example). Prediction times for the 40,000 test examples were 28.5 minutes and 141 minutes, respectively, using a single Titan X GPU and a single data preprocessing thread; this translates to throughputs of 43 ms/example and 212 ms/example. It is important to note that when making these predictions, preprocessing occurs on a single thread (i.e., converting a SMILES string to the reactant graph, combinatorially enumerating candidate outcomes, and determining their validity). The throughput for testing, as implemented, is currently limited by these CPU-bound tasks, particularly for the candidate ranking model. These preprocessing steps could be trivially parallelized at the reaction level, as was done during training, and would enable inference times below training times (i.e., below 24 and 108 ms per example for each model).
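The per-example figures quoted above follow directly from the reported totals:

```python
def ms_per_example(total_minutes, n_examples):
    """Average wall-clock cost per example, in milliseconds."""
    return total_minutes * 60_000 / n_examples

# Training: 19 h over 140,000 minibatches of 20 examples each.
wln_train = ms_per_example(19 * 60, 140_000 * 20)   # ~24 ms/example
# Training: 72 h over 2,400,000 single-reaction minibatches.
wldn_train = ms_per_example(72 * 60, 2_400_000)     # 108 ms/example
# Prediction over the 40,000 test examples.
wln_test = ms_per_example(28.5, 40_000)             # ~43 ms/example
wldn_test = ms_per_example(141, 40_000)             # ~212 ms/example
```

The gap between the ~24/108 ms training costs and the ~43/212 ms prediction throughputs is the single-threaded CPU preprocessing discussed above.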

S3.3 Reaction prediction performance compared to Schwaller et al.'s single product subset
In ref. 28, Schwaller et al. report their model's performance on single-product reactions in addition to the full data set. Only 3.4% of test examples in the Jin et al. data set have multiple species in the products (e.g., counterions, amine salts), but their neural translation model is currently not designed to predict multiple species separated by a "." SMILES token. Table S2 shows the comparison using the subset of 38,648 single-product examples. While the difference in performance is less significant than with the full test set, our graph-based model still achieves several percent higher accuracy.

The ELECTRO model of Bradshaw et al. describes reactions as linear sequences of electron movements; the 10,640 test reactions excluded from their subset are impossible for their model to predict because they do not fit within this linear framework. The first five test reactions that are not in their subset are a reductive amination, a total deprotection of a tertiary amine to a primary amine, a thioether oxidation to a sulfoxide, a thiourea addition to an alkyl iodide, and an alkene ozonolysis to an aldehyde. These are reaction types that one should expect these models to be able to predict. Our formulation of reactions as sets of bond order changes allows 98.6% of test examples to be represented and reconstructed (note: we consider the remaining 1.4% as failed predictions in all evaluations). In their preprint, Bradshaw et al. show a comparison between ELECTRO and the previous models of Jin et al. and Schwaller et al. However, a very important point must be made: the focus on reactions that can be described as sequential electron movements represents a significant restriction on possible outcomes. This added problem structure simplifies the prediction task and, in the context of our model, would restrict the number of valid enumerated candidates. In general, one should expect it to be easier to perform well on a narrower task with a model tailored to that task's scope.
We make a comparison using the test subset of 29,360 reactions provided by Bradshaw et al. For the sake of this evaluation, because the exact training data is not available, we use the trained models designed for the broader prediction task over the whole data set. While it would be possible to restrict candidate enumeration to only include products consistent with the restricted reaction scope, we have not done so in this comparison. The results are shown in Table S3. Although the model described in this study is designed for a more general prediction task, it still outperforms the ELECTRO model in top-1 and top-2 accuracy by a small margin.

Table S3
Performance in reaction prediction when only testing on the 29,360/40,000 reactions able to be predicted by the ELECTRO model; our model was not designed to take advantage of the narrower prediction scope but still achieves slightly higher top-1 and top-2 accuracies.


Supporting information file "Human benchmarking (80 examples) and answers" contains all 80 questions from the human benchmarking test and their answers. Recorded reactants and products are shown in black.
For reactions where the model did not predict the true (recorded) product as its top prediction, the predicted outcome is shown in red. Reactions from the test set of 40,000 reactions were divided into 8 categories based on the rarity of the retrosynthetic reaction template required to reproduce the example using a template-based method. Ten reactions were randomly selected from each of these eight categories to cover both common and rare reactions. Human performers were given depictions of the reactant molecules and asked to draw or otherwise indicate the expected major product, which exactly matches the model's prediction task; no explicit time limit was provided. To evaluate the model on the same data set of 80 test reactions, the model was not given explicit information about which species are known a priori to be reagents, to make it a fair comparison.
The comparison between the model and human performers serves to indicate that this is a nontrivial prediction task and that the model is able to perform at the level of an expert human. A more tightly controlled, larger-scale study would be required for a more rigorous comparison.

S3.7 Near-miss predictions
Fourteen mispredictions where the model predicts the recorded outcome with rank 2 are shown in Fig. S2 and Fig. S3; the rank-1 prediction is shown in red. In all cases, the model makes reasonable predictions given the problem formulation and the information it has access to. Fig. S2A shows an example of regioisomerism, where the model predicts an isopropylation at the 1 position of the indazole, but the recorded product is substituted at the 2 position. Fig. S2B is an example where the recorded reaction is the neutralization of a deprotonated carboxylic acid; under our formalization of reactions as changes in bond order, this would be seen as "no reaction", an outcome that the model is not allowed to predict. Given that the model must predict a product with modified bond orders, it defaults to a fairly common reaction from the database, a deprotection, although Cbz is generally stable to base. Fig. S2C is another case of regioisomerism, where condensation can occur on either side of the methyl butyl ketone. Fig. S2D is a case of complex regioselectivity, where the model identifies the recorded iodination as likely but believes a different site to be more so; one would expect both products to be observed experimentally. The model prediction for Fig. S2E can be seen as the single-step intermediate prior to a second amidation that gives the recorded outcome; likewise, the model prediction for Fig. S2F is an intermediate prior to elimination to form the recorded product. Fig. S2G is a C-N bond forming reaction where the rank-1 and rank-2 predictions correspond to the two most likely sites of addition; the distinction between the two is subtle, resulting from a distant bromine that breaks the molecule's symmetry, and the site that the model predicts as more likely does not match what was recorded. Fig. S3A shows the acylation of an alkene mispredicted as the esterification of an enol. Fig. S3B is a case of regioisomerism similar to Fig. S2G, where the asymmetry between the two reactive sites is subtle. The model's rank-1 prediction for Fig. S3C is the singly-alkylated intermediate to the recorded and rank-2 prediction, the doubly-alkylated ethanolamine. The presence of both HF and pyridine for the epoxide opening in Fig. S3D leads to the prediction of both regioisomers, whereas only one is recorded. The rank-1 and rank-2 predictions of Fig. S3E are tautomers, highlighting the fact that while the SMILES strings of each species are sanitized and canonicalized by RDKit, there are additional standardization steps that might reveal some products to be equivalent. The recorded reaction of Fig. S3F suggests that the example may be missing a reagent; the model, required to make a prediction and in the absence of any better candidates, suggests an unlikely SNAr between the chloropyridine and triethylamine. Indeed, the true reaction example includes potassium trifluoro(vinyl)borate (PFIZER LIMITED, US2007/197478, 2007). The recorded Chan-Lam coupling in Fig. S3G is mispredicted as a Suzuki coupling; despite the absence of any Pd catalyst, the model has learned that this is a very likely outcome and scores it higher than the true product.

Fig. S1 Example of impurity prediction for the preparation of an isocyanate. The model correctly predicts the recorded product (rank 1, >99% confidence), but also suggests several minor side products (only ranks 2-6 shown) as potential outcomes.

S3.8 Complete-miss predictions
Nine mispredictions where the model does not predict the recorded outcome among its top ten predictions are shown in Fig. S4; the rank-1 prediction is shown in red. In most cases, the model makes reasonable predictions given the problem formulation and the information it has access to. Fig. S4A shows a two-step azidation followed by a reduction with sodium borohydride, where the model only predicts the azide product and does not continue to the aniline. Fig. S4B shows a chlorination reaction where the recorded outcome is a substitution at the aryl nitro group; the model instead predicts that the phthalic anhydride will open to form the acid chloride. Fig. S4C is a case where the recorded outcome does not make physical sense, due to the presence of an additional carbon atom unlikely to be contributed by any of the reagents. Fig. S4D is similar, where the n-propylation is challenging to explain based on the recorded reactant/reagent species. In Fig. S4E, it is unclear what the sources of the carbonyl carbon and oxygen in the recorded outcome species are, but the predicted outcome is quite reasonable. The recorded and predicted outcomes in Fig. S4F are tautomers, yet this is considered a misprediction because they have distinct SMILES representations. The misprediction in Fig. S4G is a legitimate one, where the model perhaps does not understand the role of mercury oxide and falls back on predicting an alkyne reduction; the spiroether motif is not terribly common in the patent set, so the model has likely not seen enough such examples to confidently predict this type of reaction. The recorded outcome in Fig. S4H is essentially "no reaction", and the model's prediction of the acid chloride makes more physical sense. The final example, Fig. S4I, appears to be a simple N-alkylation with an alkyl bromide, but the recorded product is the imine rather than the amine.

Fig. S4
Miscellaneous "complete-miss" mispredictions from the test set where the model does not propose the recorded outcome (black) in any of its top-10 predictions; the incorrect rank-1 prediction is shown in red. (A) azidation predicted, two-step azidation and reduction recorded; (B) ring-opening chlorination of anhydride predicted, nitro substitution recorded; (C) alpha acylation predicted, unexplainable alpha arylation recorded with additional carbon; (D) ester cleavage predicted, unexplainable n-propylation recorded; (E) esterification predicted, unexplainable amidation recorded with additional carbonyl; (F) amidation predicted, equivalent amidation tautomer recorded; (G) alkyne reduction predicted, spiroether formation recorded; (H) chlorination of carboxylic acid predicted, dissociated chloride salt recorded; (I) N-alkylation predicted, unexplainable N-alkylation to imine recorded.

Fig. S5
Analysis of the USPTO data set used in this study in terms of the number of reactant fragments present in each reaction SMILES as determined by the presence of the period (".") symbol.

Fig. S6
Analysis of the USPTO data set used in this study in terms of the number of product fragments present in each reaction SMILES as determined by the presence of the period (".") symbol. The relatively few examples with multiple product species are primarily salts (e.g., amine HCl salts) or counterions.
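The fragment counts in Figs. S5 and S6 come from splitting each reaction SMILES on the "." separator. A minimal sketch of that analysis, assuming the standard reactants>reagents>products reaction SMILES layout (the function name and data layout are illustrative):

```python
def fragment_counts(rxn_smiles_list, side):
    """Histogram of reactant or product fragments per reaction SMILES,
    where species within one side are separated by the "." symbol and
    the sides are separated by ">" (reactants>reagents>products).
    """
    counts = {}
    for rxn in rxn_smiles_list:
        reactants, _reagents, products = rxn.split(">")
        part = reactants if side == "reactants" else products
        n = part.count(".") + 1  # k dots separate k+1 fragments
        counts[n] = counts.get(n, 0) + 1
    return counts
```

For example, an esterification written as "CCO.CC(=O)O>>CC(=O)OCC" contributes two reactant fragments and one product fragment to the histograms.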

Fig. S7
Analysis of the USPTO data set used in this study in terms of the diversity of atomic identities appearing in the reactant species. Note the log scale of the y axis.

Fig. S8
Analysis of the USPTO data set used in this study in terms of the diversity of atomic identities appearing in the product species. Note the log scale of the y axis. Comparison to Fig. S7 shows that fewer distinct heavy atoms appear in products than in reactants; many of these elements appear predominantly in reagents or catalysts.