Generating transition states of isomerization reactions with deep learning

Lack of quality data and diﬃculty generating these data hinder quantitative understanding of reaction kinetics. Specifically, conventional methods to generate transition state structures are deficient in speed, accuracy, or scope. We describe a novel method to generate three-dimensional transition state structures for isomerization reactions using reactant and product geometries. Our approach relies on a graph neural network to predict the transition state distance matrix and a least squares optimization to reconstruct the coordinates based on which entries of the distance matrix the model perceives to be important. We feed the structures generated by our algorithm through a rigorous quantum mechanics workflow to ensure the predicted transition state corresponds to the ground truth reactant and product. In both generating viable geometries and predicting accurate transition states, our method achieves excellent results. We envision workflows like this, which combine neural networks and quantum chemistry calculations, will become the preferred methods for computing chemical reactions.


Introduction
Computational methods that provide detailed knowledge about the reactivity of chemical species are becoming increasingly reliant on large datasets. For example, quantitative modeling of gas-phase systems such as combustion and pyrolysis require thousands of thermodynamics and kinetics data points to produce a mechanism used to predict just a few observables. [1][2][3] Spaces where quantitative data are lacking, such as organic reaction prediction and retrosynthesis, use databases populated with millions of qualitative reaction examples (i.e., major product only). 4,5 Because there are insufficient high-quality quantitative reaction data, these studies are typically either supplemented by experimental studies 6,7 and/or are severely limited in scope. 8,9 Indeed, creating and expanding databases with reaction rates and energetic barriers remains challenging, especially since such a task necessitates the generation of transition state (TS) structures for each reaction.
A breadth of literature describes TS structure generation as a bottleneck issue in kinetic modeling. Optimization methods to converge initial guesses for TS structures to saddle points on a potential energy surface (PES) are well-established, but producing viable initial guesses remains difficult. [10][11][12][13] Although researchers continue to make such guesses manually by using expert-guided decisions to position atoms by hand, automated workflows do exist. Linear or quadratic interpolations between reaction and product structures are quick but error-prone methods that provide an automated method to produce a viable TS guess. 14,15 Force field methods to generate TS guesses also exist, and while these are effective, they must be highly parameterized to a specific reaction type. 16 Advanced double-ended methods, which begin with optimized reactants and products, attempt to connect these ground state structures through snapshots of atomic configurations, which are then reoptimized to converge to a TS. [17][18][19][20][21][22] Single-ended methods, which begin with only the reactant structures, iteratively or systematically alter the input to reach a TS. [23][24][25][26][27][28][29][30] These algorithms, which rely on expensive ab initio or density functional theory (DFT) calculations, require significant computation to arrive at an initial guess, which often fails with subsequent optimization. 31 Heuristic approaches exist to avoid this computational expense, such as KinBot 32,33 and AutoTST, 34,35 which reposition reactant atoms based on expert-defined templates or molecular group contributions. While these heuristic methods address TS generation in a high-throughput sense, they are limited in scope; users must explicitly add transformation classes before the algorithm can make predictions for new chemistries. Advantages and drawbacks of these various methods are discussed in depth elsewhere. 31,[36][37][38][39] Recently, deep learning methods have advanced workflows in chemistry relating to organic synthesis planning, [40][41][42][43][44] forward reaction prediction, [45][46][47][48][49] property prediction, 50-55 and generative optimization. [56][57][58][59][60][61][62] However, few approaches tackle three-dimensional structure generation. Most research in this space focuses on protein structure prediction, which benefits from their linear and constrained arrangement. [63][64][65][66] The few studies that investigate small molecules target equilibrium geometries and conformer generation. [67][68][69][70][71] Notably, several groups use machine learning (ML) to train force fields at ab initio accuracy. [72][73][74] Training an ML force field to generate TS structures is an enticing approach but requires significant amounts of offequilibrium data for generalizability. Gerrits et al., used this approach to study the surface reactivity of CHD 3 on Cu(111) and required 38000 data points to train their neural network potential. 75 Thus, here we aim to directly predict elements of the TS structure rather than trying to predict the full PES.
Specifically, we describe a novel method to generate 3D TS structures in a data-driven fashion. Our automated method requires optimized reactants and products as inputs and produces coordinates of the TS as the output. The key innovations in our method are its end-to-end differentiability combined with its power to both learn the TS distance matrix and weight the importance of specific distances during structure generation. Our goal is to produce viable initial guesses for subsequent eigenvector-following optimization methods faster than traditional quantum chemical methods and in a lessrestricted manner than the available heuristic options.

Overview
The workflow to generate transition state geometries involves several steps, which we roughly divide into featurization, prediction, and optimization. We first generate attributed graph structures of the reactant and product and effectively average them to create a graph structure representative of the TS. We feed this TS graph through a fully-connected graph neural network (GNN), which generates an updated graph representation of the TS. 76,77 An additional dense layer receives the updated TS graph and predicts values for two matrices necessary for a nonlinear least squares (NLS) optimization. The first matrix, D init , is the network prediction of the TS distance matrix, which represents Euclidean distances between all pairs of atoms. The second matrix, W, is a set of weights used for the NLS optimization. Each entry in W corresponds to a pairwise distance in D init and denotes the importance of each distance during reconstruction of the TS coordinates. The final coordinates of the TS, X, are the result of the NLS optimization, which minimizes the weighted residuals between D init and pairwise distances in X. Once trained, our method generates initial guesses for isomerization TS structures within seconds. Fig. 1 highlights the key steps in the method, while the ESI † provides further details. Our code is available on GitHub at https://github.com/PattanaikL/ts_gen.

Dataset
We train the model on a recently published dataset of gas phase organic chemistry reactions. 78 These data were generated with the single-ended growing string method, so the reactions correspond to elementary steps through a single saddle point on the PES. All reactions involve anywhere from 3-7 heavy atoms including carbon, oxygen, and nitrogen (4-21 total atoms including hydrogens). We filter the data to reactions that contain a single product to include only isomerization reactions. Ref. 78 provides the data at two levels of DFT, but we limit our focus to the high level (oB97X-D3 with a def2-TZVP basis set). Our final dataset consists of approximately 1000 unique reactants and 8500 reactions, which we divide randomly in an 80 : 10 : 10 split for training, validation, and testing.

Featurization
To train and test the model, we first transform individual log files from ref. 78 into structure-data files (SDFs) for the reactant, product, and TS. Our workflow uses reactant and product data to create an input attributed graph G with vertices (atoms), edges (bonds), and corresponding initial features for atoms and bonds. Note that since both the reactant and product contain the same number of atoms, it is straightforward to build G as an averaged representation for the TS. Importantly, the edge features include the exponential of the averaged distance between atoms i and j between the reactant and product along with whether or not a bond is broken or formed and if the bond is aromatic. The atom features only include the identity of the atom encoded numerically via atomic number and with a one-hot vector.

Prediction
The attributed 2D graph representation serves as the input to the GNN, which follows an update procedure for nodes and edges similar to ref. 77. First, we feed both the atom and bond features through dense layers to initialize the network. Next, the GNN iteratively updates the bond and atom features, and since this network is fully connected, it updates each edge and vertex with information from all other edges and vertices. After three iterations, we feed the final edge representation to another dense layer and symmetrize the output. The corresponding tensors represent the model's prediction of the TS distance matrix (D init ) and a weight matrix (W) for the forthcoming optimization. We detail the mathematical formalization of this network in the ESI. †

Optimization
We generate an initialization of the TS geometry (X init in Fig. 1) for the NLS optimization in two steps using multidimensional scaling theory. First, we calculate the Gram matrix with the predicted D init matrix. Then, we perform an eigendecomposition of the Gram matrix to calculate X init by concatenating the eigenvectors corresponding to the three largest eigenvalues. To recover the final TS coordinates, we optimize this initialization with an NLS algorithm defined by the following: The desired TS geometry is a result of minimizing the weighted residuals between D init and pairwise distances in X. When training the network, we calculate network loss by first recalculating the distance matrix from X (which we denote as D) and computing the absolute difference between D and the ground truth distance matrix given by ref. 78. We backpropagate this loss through the coordinate recovery procedure and the GNN in an end-to-end manner. This way, we not only learn the TS distance matrix prediction (D init ), but we also learn to weight the distance prediction appropriately (W). The end-toend differentiability of our network combined with its ability to both learn and weight the TS distance matrix prediction are the crucial aspects of our method that make it successful. The ESI † provides further details regarding the mathematics.

Verification
We employ a TS verification method similar to ref. 34. The automated quantum mechanics (QM) workflow first uses the TS geometry generated by the model as an initial guess in a TS optimization with the Berny method 10 and verifies the presence of a first-order saddle point by checking for exactly one imaginary frequency. It then feeds the optimized transition state to two intrinsic reaction coordinate (IRC) calculations, which link the saddle point to its corresponding reactant and product PES minima by a mass-weighted downhill optimization. 79 Thus, we spawn forward and reverse IRCs on the optimized TS to generate the corresponding reactant and product, which we further verify with a stationary point optimization and frequency calculation, ensuring all frequencies are real. Our workflow feeds these generated reactant and product geometries to OpenBabel, which infers connectivity using its ''connect-thedots'' method by adding bonds to atoms closer than their combined covalent radii while maintaining minimum distance and valence constraints. 80 We use these connections to compare graph connectivity between IRC reactants and products and original reactants and products. Matches between both sets for both molecules correspond to a successful TS generation.

Model structures before QM optimization
Before submitting the structures generated by the model to the verification workflow, we ensure that our method predicts valid geometries. That is, we evaluate whether the method in Fig. 1 independently produces realistic and realizable distance matrices. To check this, we plot the distribution of normalized distances generated by our model before and after the NLS optimization (D init and D, respectively) and compare it to the ground truth distribution of normalized distances reported by ref. 78, shown in Fig. 2. The dashed lines correspond to the model predictions, while the solid line corresponds to the ground truth distribution. Note that we limit these distributions to the reaction core (i.e., bonds that changed during reaction) to further highlight differences. Even before optimization, the D init distribution closely resembles the ground truth distribution; after optimization, the D and ground truth distributions are nearly identical, emphasizing the benefits of the Fig. 1 Illustration of workflow used to generate TS structures. Given input 3D reactant and product structures, we generate an initial representation of the TS graph by calculating averaged atom and bond features. We feed this input to a GNN, which generates an updated representation of the TS graph. We then pass this updated representation through a dense layer to predict the TS distance matrix, D init , and the NLS weight matrix, W. Finally, we perform an NLS optimization to recover the TS coordinates, X.
NLS procedure. We additionally compare against a fourth distribution of distances generated by using the average distance between reactant and product, shown as the dotted line in Fig. 2. This distribution is significantly different from D init , demonstrating that the GNN produces an effective initialization and further verifying that our model generates reasonable geometries to use as initial guesses.

Model importance weighting
To understand why the model produces valid geometries, we investigate the weight matrices predicted by the model, which emphasize which bonds are considered important during reconstruction of the final TS guess geometry (step 3 in Fig. 1). We extract these weights from our network and normalize them by the maximum weight for each molecule. We then bin the weights based on the average topological distance between reactant and product (i.e., the average number of bonds separating the two atoms to which the weight corresponds), shown in Fig. 3, along with counts for each bin. As topological distance between atoms increases, the value of the weight assigned to this distance decreases. That is, the model prioritizes preserving distances of nearby atoms when reconstructing coordinates from the predicted distance matrix. Again, this reinforces that the model learns which distances are important and focuses on retaining these local distances during generation of the TS structure.

Model performance on similar reactions
We evaluate overall performance on a held-out test set on which the model performs well, achieving an average test loss of 0.11 angstroms (average atomic difference in distance between model and ground truth structures per atom) before QM optimization. After investigating several trends, we report a noteworthy correlation, plotted in Fig. 4. To generate this plot, we extract learned embeddings of each reaction as the output of the GNN in both the training and test sets. We calculate pairwise cosine distances between each point in the two sets and plot the minimum cosine distance value for each test data point against its loss from the network. Thus, the plot claims that network loss increases as the similarity between the test data point and training data decreases. While we expect this trend, its proof allows us to add a pseudo-confidence metric to each new TS prediction: any new reaction (within the interpolation space of the training data) that is similar to data in the training set (as defined by a GNN embedding similarity less than a cutoff value of 0.3 Â 10 À4 ) should generally receive a reliable prediction from the network. The decreased loss above a cosine distance of 0.8 Â 10 À4 is likely an artifact of the few examples that have a high dissimilarity from the training set (n o 5 for each of the last two bins).

Correlation between difference success criteria
It is important to recognize the multiple success criteria we use to assess our results. While the deep learning method generates    4 Minimum cosine distance of test set from training set using learned embeddings from the GNN. Network loss increases as the distance between the test data point and training data increases. Each full bar corresponds to a single standard deviation, and the gray shading corresponds to a full standard deviation with respect to the average loss.
TS geometries based on difference in distances from the ground truth (network loss), our true measure of success is defined by the verification workflow. We employ this workflow on the test set of 834 reactions and achieve a 71% success rate, correctly following 595 initial guesses generated by our model to the ground truth reactants and products. Because the dataset contains a number of uncommon reactions, we check performance on the subset of reactions that match reaction families defined in the Reaction Mechanism Generator (RMG). 3 On this set of well-known gas-phase reaction types, our model achieves a success rate of 89%. These success rates, while short of ideal production percentages, represent better success rates than traditional string and expert-guided methods. 31 Furthermore, of the TS optimizations that succeeded, they required an average of only 20 Berny iterations to converge, showing the success of the algorithm at generating initial guess geometries.
We also assess the various failure modes: 41% of failures involve failed TS optimizations, 6% involve failed IRCs, 21% involve failed ground state optimizations, and the final 32% of failures are mismatches between the calculated ground state structures and the reported reactant and product structures. Of the TS optimization failures, 83% involve poor geometry failures (ex. exceeding the maximum number of steps allowed), 10% involve symmetry issues related to the generated internal coordinates, and 7% identify the incorrect number of imaginary frequencies. Clearly, a plurality of the failures involve inadequate initial guesses, motivating further optimization of our network.
We ensure the deep learning method is compatible with our true measure of success by plotting the loss distributions of the successes and failures as defined by the verification workflow, shown in Fig. 5. While the distributions are clearly different, the overlap suggests that the naïve network loss function (difference in distances) we use could be improved. We further comment on this issue after investigating specific examples. Fig. 6 illustrates several examples of reactions fed to our model; each illustration positions the reactant and product at the beginning and end of the arrow, respectively, and indicates the ground truth TS with a blue highlight (left) and the modelgenerated TS with a green highlight (right). A.1 shows a simple 1,3-migration reaction, predicted correctly by the linear approximation method as well. Our network correctly predicts many similar simple reactions. A.2 shows a more complicated example, where the geometry of the migrating group changes during the transformation. While the linear approximation method does not correctly predict this TS structure, our method effectively learns the geometric changes necessary to generate this TS. Note that, while our method slightly rotates the oxygenated group, the guess geometry is adequate enough to converge to the saddle point structure with the subsequent Berny optimization. In other words, many of the guesses generated by our method are ''close enough'' to the target geometry to pass the subsequent verification workflow.

Negative examples
Of the reactions that fail verification, many mirror example B.1. Here, several bonds break and form at once, as the linear oxocarbon group rearranges its connection to the aziridine group, which also simultaneously breaks apart. The model does not capture the orientation between the two groups well enough to pass a Berny optimization. Similar complicated reactions with large distance changes between reactant and product structures compose the failures (Fig. S3, ESI †). B.2 shows an example of a different but more significant failure mode. While the model TS structure correctly captures the transformation, it is nearly a mirror image of the reported transition state. Because we use the distance matrix to generate a TS structure, the target geometry is not entirely unique, and in a few cases, this fact results in a converged TS guess that contains one or more planes of symmetry with the ground truth geometry. Since our goal is to generate TS geometries to calculate rate constants, this is not a significant issue; a mirror image-converged TS produces the same numerical values for a rate constant. B.2, however, is an example where the mirror TS fails subsequent Berny optimization, which is problematic.

False positives
Examples C.1 and C.2 show similar TS structures generated by our model that are near mirror images of the reported ground truth but pass the verification workflow. Again, while this is not always an issue, we designate these examples as ''false positives'' and are actively working on modifications to relieve this issue. For database construction, we suggest using the IRC-optimized reactants and products.

False negatives
While investigating specific examples, we also discover failed structures we designated as ''false negatives.'' These examples are a direct result of our multiple criteria for success discussed earlier: loss as calculated by the network and success as defined by the verification workflow. Example D.1 shows a structure that registers a large loss value from the network but passes the verification workflow. In this example, the model TS methyl and aldehyde groups are rotated from the ground truth. While this difference in distances triggers a large loss, it is easily adjusted during subsequent optimization, further suggesting that using absolute difference of distances as a loss function for the network is a naïve option that should be improved. Conversely, D.2 shows a structure that registers a near-zero network loss but fails the verification workflow. Initial glance at the structure shows little difference between model and ground truth geometries, suggesting an unknown job error. We run each quantum calculation at a set of default parameters without further refinement for failed jobs; we are also implementing an automated strategy to restart failed jobs.

Limitations
The incapability of the current model to capture orientation constraints represents a significant limitation. Again, while this issue-which manifests itself with symmetric structures from the reported ground truth-does not usually affect any derived rate constants, it may affect generation of complex TS 2) A reaction with a near zero neural network loss that surprisingly fails the verification workflow. We render all 3D structures with PyMol. 81 structures in the future, such as those for metal-catalyzed reactions with intricate ligand interactions or generally for enantioselective reactions. Traditional methods that use distance geometry to generate structures use orientation constraints defined for chiral atoms based on signed volumes. Structure optimization methods use these constraints as a part of the energy function, usually within a simulated annealing framework (that is, the orientation constraints are only used to refine the structure, not to generate an initial guess). Replicating such an approach is possible to alleviate the current model limitations, but the orientation constraints for a TS must be learned since they are not known a priori.
The model design also prevents TS generation for multireactant reactions. Such a model would need to orient multiple reactants together before using the current method to generate a TS structure; we leave this task for future models. For large structures with many TS conformations, this method should be paired with more developed theories which consider TS conformations. That is, upon finding a single TS structure, one must explore other low energy TS structures to accurately calculate kinetic rates. If the identity of the reactive species is different than the structure provided by the user, (e.g., for reactions with specific reactive protonation states or tautomers), the current algorithm will not find an appropriate TS. Finally, input reactions to the model must indicate the atom-mapping between reactants and products. While this is the case for most double-ended methods, it represents a barrier towards a fully-automated TS search algorithm. Fortunately, automated atom-mapping algorithms do exist and researchers continue to refine and improve them. [82][83][84][85]

Conclusions
The availability of high-quality data and the recent prevalence of deep learning in the natural sciences have combined to usher in a new era of predictive chemistry. Computation especially offers a platform to both generate large quantities of data and extract trends generally applicable to out-of-domain samples. Here, we developed a deep learning model to generate three-dimensional transition state structures given the geometries of the reactant and product trained on a diverse dataset of gas-phase organic reactions. Our method produced viable geometries before QM optimization and achieved excellent results on an external test set via a rigorous verification workflow. We hope that users adopt our method to generate isomerization transition state structures relevant for their research and use our model as a basis for architectures involving even more complex chemical transformations.

Conflicts of interest
There are no conflicts to declare.