Open Access Article
Leonard
Galustian
,
Konstantin
Mark
,
Johannes
Karwounopoulos
,
Maximilian P.-P.
Kovar
and
Esther
Heid
*
Institute of Materials Chemistry, TU Wien, A-1060 Vienna, Austria. E-mail: esther.heid@tuwien.ac.at
First published on 21st October 2025
Transition state (TS) geometries of chemical reactions are key to understanding reaction mechanisms and estimating kinetic properties. Inferring these directly from 2D reaction graphs offers chemists a powerful tool for rapid and accessible reaction analysis. Quantum chemical methods for computing TSs are computationally intensive and often infeasible for larger molecular systems. Recently, deep learning-based diffusion models have shown promise in generating TSs from 2D reaction graphs for single-step reactions. However, framing TS generation as a diffusion process, by design, requires a prohibitively large number of sampling steps during inference. Here we show that modeling TS generation as an optimal transport flow problem, solved via E(3)-equivariant flow matching with geometric tensor networks, achieves over a hundredfold speedup in inference while improving geometric accuracy compared to the state-of-the-art. This breakthrough increase in sampling efficiency and predictive accuracy enables the practical use of deep learning-based TS generators in high-throughput settings for larger and more complex chemical systems. Our method, GoFlow, thus represents a significant methodological advancement in machine learning-based TS generation, bringing it closer to widespread use in computational chemistry workflows.
TSs are high-energy structures that are extremely short-lived and exist only on the order of femtoseconds.3 To better understand chemical reactions, for decades, quantum mechanical calculations have been the only way of accessing TS structures. Density functional theory (DFT) has commonly been used as the method of choice.4 TS search algorithms generally fall into two categories: single-ended methods mostly start from the 3D geometries of the reactants,5 while double-ended methods utilize both reactant and product geometries for determining the TS.6 These quantum chemical methods are, however, computationally highly demanding, for example, using ≈35% of the computational resources of the Swiss National Supercomputing Center in 2017.7 They also suffer from convergence difficulties, often yielding TSs that do not lie on the minimum energy pathway.8
Several reaction datasets containing optimized TS geometries have been developed in recent years.8–13 One notable example is RDB7,9 which builds on the work of Grambow et al.10 In their approach, small organic molecules were sampled as reactants from GDB-7,14 and their optimized geometries were used as starting points for a growing string method and subsequent saddle point optimization to identify TSs. The main aim of such reaction datasets is to enable researchers to develop machine learning (ML) based algorithms for estimating TS geometries and barrier heights. ML algorithms identify patterns from thousands of chemical reactions that are predictive of the TS geometries, and thus avoid running expensive quantum chemical calculations.
Multiple deep learning (DL) based algorithms have been proposed for predicting TS structures. Some of them rely on the 3D geometry of the reactant and product as input, in addition to the 2D graphs,15,16 while others take only the 2D graph as input.17,18 Generating 3D geometries of reactants and products can be computationally costly, with current studies relying on external quantum chemistry calculations that provide relaxed 3D coordinates.15,16,19 Furthermore, current methods perform post-processing using quantum chemical methods on the generated TS structures.15,17,18 In both cases, high-throughput applications cannot rely on these expensive calculations.
Kim et al.18 recently introduced TsDiff, the current state-of-the-art (SOTA) approach for generative modeling of TS geometries leveraging only the 2D reaction graph as input, encoded as a condensed graph of reaction.20 TsDiff relies on diffusion modeling21 of the distribution of possible TS geometries with an E(3)-invariant, distance-based graph neural network (GNN) as encoder.18 While achieving remarkable results, framing the problem as a diffusion process requires a large number of denoising steps during inference. TsDiff in particular uses annealed Langevin dynamics for inference, requiring 5000 sampling steps per reaction, considerably slowing down inference. Duan et al.16 have recently shown improved TS prediction performance on the Transition1X (T1X) dataset12 using a deterministic optimal transport based model that, given a TS structure guess, produces a single TS prediction. However, deterministic methods require knowledge of the reactant and product geometries, which correspond to a particular TS, which we do not assume.
Another important drawback of current deep learning approaches for the generation of TS coordinates arises from overfitting and lack of generalization capabilities. Previous studies found GNNs and, in general, DL-based reaction prediction models to consistently generalize poorly,8,22 or perform significantly worse on several possible out-of-distribution dataset splits,8,19 even though the test reactions shared many common reaction mechanisms with the training reactions.8,23
In this paper, we address the current limitations of efficiency, necessity of additional quantum-mechanical calculations, and missing out-of-distribution generalization ability. Specifically, we frame the TS structure generation problem as an optimal transport flow process with an E(3)-equivariant geometric tensor network24 as graph encoder, taking as input only the 2D reaction graphs. Using a flow matching-based algorithm25 for fitting the velocity field, and the efficiency of geometric tensor networks, increases sampling efficiency during inference by more than a hundredfold, compared to the SOTA. Second, we analyze out-of-distribution generalization capabilities of our generative model using splits based on the clustered reaction cores and the magnitude of the barrier height, in addition to random splitting. Current literature for TS prediction using DL has so far only used random splits.15–18 We demonstrate that our approach, GoFlow, outperforms previous methods in terms of generalization capability. Together with its improved in-distribution performance at a fraction of the cost of current approaches, we establish GoFlow as the new state-of-the-art in generative TS models.
Initial atom and bond features ({hi}i,{ej}j) are extracted using RDKit.26 The model outputs cartesian coordinates
per atom i.
It has been shown to be highly effective for both TS geometry prediction,18 and reaction property prediction.20,29
, and p1(x) be the target data distribution that we want to model. Flow matching considers a continuous path of probability distributions pt(x) for t ∈ [0, 1] such that pt smoothly interpolates between p0 and p1. This path is induced by a time-dependent vector field vt(x). Samples xt evolving along this path follow the ordinary differential equation (ODE):30| ∂tpt(x) = −∇ × (pt(x)vt(x)). |
The goal is to train a parameterized neural network uθ(x, t) to approximate the true, often intractable, vector field vt(x).
A key challenge is that vt(x) depends on the marginal probability path pt(x), which is typically unknown. Instead of directly matching the marginal vector field vt(x), conditional flow matching (CFM) defines a simpler target vector field based on conditional paths.30
Consider a specific pair of samples x0 ∼ p0 and x1 ∼ p1. We can define a path xt connecting them, for instance, a simple linear interpolation, also called the optimal transport (OT) path:
| xt = (1 − t)x0 + tx1. |
The vector field corresponding to this specific conditional path is simply the time derivative
Crucially, this target vector field (x1 − x0) is independent of t and xt along the path and does not require knowledge of pt(x).
The OT CFM objective trains the model uθ(x, t) by minimizing the expected squared error against this conditional target vector field, averaged over time t and pairs (x0, x1):
. Minimizing this objective results in uθ(x, t) approximating the marginal vector field vt(x).25
Once the model uθ(x, t) is trained to approximate the vector field vt(x), it can be used for generating new samples from the target distribution p1(x). This is achieved by numerically simulating the probability flow ODE. Starting with an initial sample x0 drawn from the base distribution p0(x), we integrate the differential equation
Let X = ({ri}i,{hi}i) represent the input molecule, with coordinates
and initial features hi. Let Φ be the network function mapping X to output features F = Φ(X). An E(3) transformation g ∈ E(3) acts on the coordinates as g × ri. We denote the transformed input as g × X=({g × ri}i,{hi}i).
Let V be the space of output features F and GL(3) the group of invertible linear transformations on it. The network Φ is E(3)-equivariant if there exists a representation ρ: E(3) → GL(V) such that for all g ∈ E(3) and all inputs X:
| Φ(g × X) = ρ(g)Φ(X). |
This means that transforming the input geometry by g results in a predictable transformation ρ(g) of the output features. For instance, rotating the input molecule causes the predicted force vectors to rotate accordingly.
Scalar outputs, such as energy, must be E(3)-invariant, a special case where ρ(g) is the identity transformation, ρ(g) = I, for all g.
Implementations often use spherical harmonics (e.g. up to L = 2) as bases for features, which transform via Wigner D-matrices under O(3) rotations. Tensor products are key operations for maintaining equivariance. Examples include SE(3)-Transformers, NequIP, and MACE,31,34–36 known for their expressiveness but potentially high computational cost.
In this work, we adapt and use the E(3)-equivariant Geometric Tensor Network (GotenNet) architecture.24 GotenNet aims to bridge this gap between expressiveness and efficiency, particularly addressing the computational overhead associated with traditional tensor product based methods. It achieves E(3)-equivariance without explicitly relying on tensor products with Clebsch–Gordan coefficients for its core message passing.
GotenNet works with different types of features, capturing geometric information. Nodes have invariant scalar features hi and steerable features
(l)i that behave as spherical harmonics up to a degree Lmax. Edges also have invariant scalar features tij and initial geometric tensors
ij derived directly from the relative positions of connected atoms using spherical harmonics.
Specifically, the authors introduce multiple equivariant modules, such as geometry-aware tensor attention and hierarchical tensor refinement. They modify transformer-based architectures by refining edge representations through high-degree steerable features, which enable the attention mechanism to leverage geometric relationships in determining node interactions. For details, we refer the reader to the paper.24 The key point is that GotenNet does not use high-degree tensor product operations, thus improving efficiency, while still capturing essential geometric information.
During inference, we sample multiple TS geometries for each reaction. Let S be the number of samples and let
denote the matrix of atomic coordinates for the final geometry of the s-th sample, where N is the number of atoms. The coordinates for the i-th atom in sample s are
.
To choose the final prediction, we first compute the median atomic position
i for each atom i across the S samples, where atoms are identified by their atom mapping number:
These median positions form the aggregate median geometry
.
Finally, we choose the sample
whose geometry is closest to the median geometry
. The distance ds for each sample is calculated as the Frobenius norm of the difference between the sample's coordinates and the median coordinates:
The index s* of the best sample is found by minimizing this distance:
| s* = argminsds |
The final predicted geometry Rfinal is then the geometry of the sample with index s*:
| Rfinal = Rs* |
Note that we do not access the ground truth TS geometry for selecting the final sample out of the ensemble. This procedure is named AggregateSamples in Algorithm 1. We choose median over mean aggregation, to omit sampling low probability conformers in case of multimodal distributions.
(1) To incorporate time awareness into the GotenNet architecture for flow matching, we add sinusoidal time embeddings to the initial node and edge features.
(2) To obtain an optimal transport path, we first align the randomly initialized atomic positions with the ground truth positions. We align their center of mass (CoM) and rotationally align the positions using the Kabsch algorithm to find the optimal rotation matrix.37 These steps are performed in the Align function of Algorithm 1.
(3) We employ the previously proposed median sample aggregation method during inference.
(4) We initialize node and edge features using CGR-based embeddings. This procedure is described in detail in Algorithm 2.
(5) In an additional experiment we initialize the atomic positions at t = 0 with the reactant geometry plus Gaussian noise (μ = 0, σ2 = 0.25). With this modification, we expect to better model chiral TSs.
We perform ablation studies on the number of ODE integration steps, the number of samples to be aggregated during inference, the number of trainable parameters, and three different dataset splitting strategies.
We also evaluate the median absolute deviation in the performance metrics, using an ensemble of 8 models, each trained and evaluated separately using different initialization seeds.
926 gas-phase reactions involving H, C, N, and O with molecules containing up to seven heavy atoms. An evaluation of GoFlow on T1X dataset,12 a recomputation of RDB7, is also performed and results reported in detail in the SI. Geometries and vibrational frequencies for RDB7 were obtained at the B97-D3/def2-mSVP and ωB97X-D3/def2-TZVP levels of theory. We split the dataset into training, validation, and test sets in an 80%, 10%, and 10% ratio using different splitting strategies (see Section 2.8). For T1X we use the official split. We use RDKit to extract the following atomic features from SMILES strings: aromaticity, formal charge, hybridization, number of bonds per atom, degree, and ring membership. We adopt the same pre-processing pipeline as Kim et al.18
p(Ct|Grxn). To compare it to our method, we trained it on RDB7.9 Contrary to the authors,18 we avoid data augmentation in our work.
Firstly, it does not account for extrapolation capabilities to out-of-domain samples, which might vary among different model classes, such as when comparing equivariant to invariant models or different generative methods. Secondly, the reactants in the dataset were generated using graph enumeration,14 which can result in highly similar reactions ending up in both the training and validation/test sets.
For our ablation studies, we train and test using the random split strategy only, assuming it to be a sufficiently good heuristic for evaluating single-parameter changes on our model. In the following, we summarize the three splitting strategies employed.
(1) Random split. Randomly assign reactions to training, validation, or test set.
(2) Reaction core split. Extract the reaction core (i.e., template), the set of atoms for which adjacent bond types are changed during the reaction, and group all reactions by their common core. Randomly assign a core to the training, validation, or test set. Thus, different sets do not contain reactions of the same core.
(3) Barrier height split. Add reactions in the upper and lower 10% of the barrier heights to the validation or test sets. The rest is added to the training set.
Compared to previous work, we do not report minimum-over-samples (MOS) metrics such as the matching score or the average minimum RMSD (AMR)18,38–40 for the evaluation of model performance on RDB7. However we perform an evaluation using a MOS approach in the SI to showcase the future potential of steering-based approaches.41 MOS metrics take several independently generated candidate geometries for the same reaction and report the minimum RMSD (or D-MAE) to the ground truth among them. While we evaluate our predictions against the ground truth TS geometry on the test set, we avoid minimum-over-samples metrics in our main evaluation on RDB7 for two reasons:
First, in a real application, the ground truth TS geometry is unknown, so the user cannot identify the “best” sample post hoc. Second, they can give a misleadingly optimistic picture of model accuracy – for example, even a poor model (or random guessing) can achieve an artificially low AMR if enough samples are generated, simply by chance.
Instead, we report metrics computed on the single geometry that would actually be returned to the user in practice, such as the median prediction from multiple samples of the model. This yields error estimates that better reflect the accuracy a chemist could expect when using the model prospectively, without the benefit of knowing which sample is closest to the truth.
Furthermore, we introduce a metric, called steric clash error, with which we aim to identify gross structural deviations in nonbonded interactions. Steric clashes, while barely affecting the D-MAE when most of the molecule's geometry is predicted correctly, result in unrealistically high repulsive energies and thus unrealistic activation energies.42 Vost et al. demonstrated in recent work43 that conditioning diffusion models on conformer quality significantly improves steric clash test results in generated molecules. The recent Boltz-1 architecture for biomolecular structure prediction also uses physical constraint potentials, including steric clash constraints, during inference.44 In this work, we use the steric clash error to analyze the results only.
To accomplish this, we use a simplified Lennard-Jones (LJ) interaction potential, where we omitted the London dispersion force term, set ε = 0.25 kcal mol−1, σ = 0.7
Å, and are thus left with
.12 We set the steric clash of edges with distances greater than 0.7 Å to zero. Thus, we only consider interatomic distances close to or smaller than the shortest bond lengths (of hydrogen molecules) in our dataset.
| Method | D-MAE (Å) | RMSD (Å) | Angle (°) | Runtime (ms) |
|---|---|---|---|---|
| GoFlow-1 | 0.118 | 0.20 | 3.65 | 10 |
| GoFlow-25 | 0.108 ± 0.006 | 0.18 ± 0.005 | 3.63 ± 0.28 | 125 ± 0 |
| GoFlow-25-R | 0.104 ± 0.002 | 0.17 ± 0.002 | 3.56 ± 0.08 | 130 ± 0 |
| TsDiff | 0.164 | 0.29 | 4.77 | 1544 |
We also plot the distribution of the obtained D-MAEs on the random split test set in Fig. 1. The bell-shaped distribution features a tail to the right, indicative of outliers with high error that increase the overall D-MAE, and a low peak close to 0.05 Å. In comparison, the low peak of TsDiff is shifted to the right, and the tail is much more pronounced, producing more predictions with a high D-MAE. Fig. 1 thus highlights the improved prediction accuracy reported in the current study, with both a lower number of high D-MAE (low-quality) predictions, and a lower D-MAE for high-quality predictions.
![]() | ||
| Fig. 1 Distributions of the mean absolute error of interatomic distances (D-MAE) in angstroms for GoFlow and TsDiff. Inference performed with 25 ODE solver steps and 25 samples per run. | ||
Evaluations of GoFlow on T1X are reported in the SI.
The results in Table 2 show that increasing the number of ODE steps and keeping the number of samples at 25, reduces the D-MAE, the angle error, and steric clashes. At 25 ODE steps, the model achieves the lowest D-MAE (0.107 Å), the smallest angle error (3.68°), and a low steric clash error (14 kcal mol−1), indicating an optimal balance. Increasing the number of samples and keeping the ODE steps at 25 (Table 3) shows the lowest angle error when using 10 samples and the lowest D-MAE when using 50 samples, while the steric clash remains equally low in both cases. This shows that the ODE steps are crucial to reducing steric clashes.
| # ODE steps | D-MAE (Å) | Angle (°) | Steric clash (kcal mol−1) |
|---|---|---|---|
| 1 | 1.263 | 52.56 | 9351 |
| 3 | 0.182 | 9.37 | 2815 |
| 5 | 0.125 | 5.02 | 362 |
| 10 | 0.111 | 3.94 | 59 |
| 25 | 0.107 | 3.68 | 14 |
| 50 | 0.107 | 3.70 | 7 |
| # Samples | D-MAE (Å) | Angle (°) | Steric clash (kcal mol−1) |
|---|---|---|---|
| 1 | 0.119 | 3.77 | 4 |
| 3 | 0.117 | 3.76 | 12 |
| 5 | 0.113 | 3.66 | 22 |
| 10 | 0.107 | 3.64 | 21 |
| 25 | 0.107 | 3.68 | 14 |
| 50 | 0.105 | 3.70 | 14 |
Furthermore, we conducted ablation studies on the number of trainable parameters of the model by increasing the dimensionality of the atom basis latent space. The results are shown in Table 4. We found the model with 5.2 M parameters to yield the best results. However, the model with 1.4 M parameters still achieved a D-MAE of 0.124, compared to the 0.164 D-MAE of TsDiff, which has 2.7 M trainable parameters.
| # Parameters | Atom basis | D-MAE (Å) | Angle (°) |
|---|---|---|---|
| 0.4 M | 64 | 0.163 | 5.94 |
| 1.4 M | 128 | 0.124 | 4.26 |
| 5.2 M | 256 | 0.102 | 3.49 |
| 9.3 M | 344 | 0.105 | 3.56 |
| 20.4 M | 512 | 0.114 | 3.90 |
| Method | Split | D-MAE (Å) | RMSD (Å) | Angle (°) |
|---|---|---|---|---|
| GoFlow | Random | 0.108 | 0.18 | 3.63 |
| Reaction core | 0.138 | 0.22 | 5.00 | |
| Barrier height | 0.149 | 0.22 | 5.63 | |
| TsDiff | Random | 0.164 | 0.29 | 4.77 |
| Reaction core | 0.174 | 0.30 | 5.51 | |
| Barrier height | 0.191 | 0.32 | 6.39 |
In particular, these errors show that relying solely on metrics such as the D-MAE for performance evaluation, which is currently common practice,15,18 is insufficient. D-MAE is insensitive to steric clashes, and because distances are preserved under reflections, it also fails to capture chirality errors.
Moreover, the identified error modalities appear to occur more frequently in reactions that trained chemists would classify as unlikely or unphysical. Examples include reactions involving unstabilized carbenes (Fig. 3(a) and (b)), energetically unfavorable polycycles (Fig. 3(c)), or zwitterions, the latter being present in unusually many products in the dataset. This observation highlights the importance of chemically informed dataset curation for developing robust TS prediction models.
The optimized geometries of GoFlow were significantly lower in energy, with the mean of individual differences in the TS energy being 4.76 kcal mol−1, and requiring less optimization steps to reach the TS (median of 26 for GoFlow versus 35 for TsDiff). In subsequent intrinsic reaction coordinate (IRC) calculations, 225 of the 280 optimized geometries for GoFlow converged to the correct reactants and products, and 216 of 269 for TsDiff. Matches in those reactants and products were determined by converting the IRC geometries to SMILES strings and comparing those. These results, summarized in Table 6, indicate that GoFlow provides effective starting points for quantum mechanical TS optimization, and outperforms TsDiff.
| Method | Success rate | IRC match | Optimization cycles |
|---|---|---|---|
| GoFlow | 94% | 80.4% | 25 |
| TsDiff | 92% | 80.3% | 35 |
Moreover, out of the 225 IRC validated geometries by GoFlow, 136 had lower single-point energies than their reference structure. Of those 136, 21 had an energy that was more than 0.1 kcal mol−1 lower than the reference energy, and for 13 it was more than 1 kcal mol−1 lower. In the SI we show one such example, as well as the distributions of atomic force magnitudes.
The distribution of atomic force magnitudes of the non-optimized TS structures is highly similar for both methods. GoFlow has slightly more atoms at the low-force end of the distribution. This is consistent with GoFlow reaching converged TSs slightly more often and with fewer optimization cycles.
Supplementary information: results for Transition1X, details on the metrics, quantum mechanical analysis, details on hyperparameters. See DOI: https://doi.org/10.1039/d5dd00283d.
| This journal is © The Royal Society of Chemistry 2025 |