Open Access Article
Joonyoung F. Joung
*ab,
Nicholas Casetti
a,
Priyanka Raghavan
a and
Connor W. Coley
*ac
aDepartment of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA. E-mail: ccoley@mit.edu
bDepartment of Chemistry, Kookmin University, 77 Jeongneung-ro, Seongbuk-gu, Seoul, 02707, Republic of Korea. E-mail: jjoung@kookmin.ac.kr
cDepartment of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
First published on 19th May 2026
The prediction of reaction outcomes is a longstanding challenge in chemistry, with the ability to do so serving as a direct reflection of our understanding of chemical reactivity. Accurately predicting reaction products is crucial not only for synthetic planning but also for designing reaction pathways and experiments in silico. This review explores the diverse methodologies used to predict reaction outcomes, which can be broadly divided into two main categories. Some approaches predict reaction products and their likelihoods in a single step, while others break the task into two distinct parts: candidate enumeration and the subsequent prediction of product likelihoods. We examine both data-driven methods, such as graph-based and sequence-generation models, and physics-based methods, including potential energy surface exploration and reactive molecular dynamics. In addition, we discuss quantitative predictions of reaction selectivity, regioselectivity, stereoselectivity, and yield. This review summarizes trends and advances in reaction outcome prediction and briefly outlines future directions for the field.
However, chemical reactivity is inherently complex. While some transformations can be reliably predicted based on well-established heuristics and rules, accurate prediction of others remains elusive, requiring advanced reasoning and extensive experience.2 It is not uncommon for experts to encounter reactions where outcomes deviate from expectations, reflecting the unpredictability of many chemical systems.3–5 As such, computational models of outcome prediction are not only predictive tools but also reflections of the underlying logic of chemical reactivity that may elucidate yet-unknown trends.
The diversity of factors relevant to chemical reactions further complicates the task of predicting outcomes. Reactivity is influenced by electronic effects, steric hindrance, solvent environments, and diverse reaction conditions including photochemistry,6–8 electrochemistry,9–11 and high pressure,12 as well as more ubiquitous factors such as temperature, concentration, and time, which are often not explicitly incorporated in current machine-learning models for reaction prediction. Under these more activating conditions, the intuition that many chemists rely on for conventional thermal reactions can fall short—as can many models. Unlike well-studied counterparts, these environments can involve excited states, electron transfer, or non-equilibrium dynamics, making it much harder to predict outcomes using established methods. The fact that chemical transformations become highly context-dependent motivates the development of new computational and theoretical frameworks to unravel their complex and diverse reactivity.
The term reaction outcome prediction can refer to a variety of modeling tasks, each pursuing different task formulations. Some approaches aim to predict the most likely product(s) and their associated likelihoods directly in a single step (Fig. 1a), while others simulate full mechanistic pathways using first-principles methods (Fig. 1b). In contrast, two-step approaches decouple the task of outcome prediction into first enumerating plausible candidate products and then scoring them according to their likelihoods (Fig. 1c). Other common formulations include predicting selectivity (Fig. 1d), or reaction yields under specified conditions (Fig. 1e); less common formulations may seek to predict reaction rates.
These different instantiations of the reaction outcome prediction task place more or less emphasis on the scope of potential transformation products relative to their relative likelihoods. For example, one might consider reaction yield prediction as a task where the set of relevant possible outcomes has been narrowed down to a single product and the challenge is exclusively on scoring in absolute terms; site selectivity prediction as a task where the set of possible outcomes is also relatively unambiguous; major product prediction as a task where the second stage of scoring need not be as quantitative; mechanism generation and simulation as a task where both defining the scope of possible transformations and quantitatively anticipating their rates can prove very challenging.
This review draws on that framing to clarify how different methods—whether rule-based, physics-driven, or data-driven-approach these tasks, and how each method implicitly defines what constitutes a “reaction outcome” in the first place. We pay close attention to the problem formulation, i.e., how different tasks are posed. How are potential products defined? How are they scored? Is this approached as one simultaneous task or in two distinct parts? For most data-driven formulations of the problem, there are only weak distinctions between the settings of predicting “overall” transformations and predicting mechanistic transformations; the greatest differences are observed when considering the role of physics-based methods that use knowledge of the potential energy landscape to propose products, score products, or both. This distinction between one-step and two-step formulations also serves as a central organizational principle for the structure of this review. Rather than grouping methods solely by algorithmic paradigm (e.g., physics-based or data-driven), we organize our discussion according to how each method conceptualizes the task of reaction outcome prediction. Specifically, we examine whether candidate outcomes are generated and scored simultaneously, or treated as distinct stages—an axis along which many of the most meaningful differences in methodology and assumptions emerge.
We first provide a brief background on the data and physical principles that underpin modern approaches to reaction outcome prediction. We then examine methods that perform joint prediction of outcomes and likelihoods, followed by approaches that decouple the task into separate stages of candidate enumeration and scoring. In the literature, the term “reaction” may refer either to an overall transformation from reactants to products or to an individual mechanistic step conventionally drawn as arrow-pushing diagrams. We then discuss models that aim to quantitatively predict other aspects of reaction outcomes, such as site selectivity and product yield. Finally, we highlight a select number of examples of reaction prediction tools and their application to advancing scientific understanding and discovery.
| Name | Size | Description |
|---|---|---|
| a Reactions extracted from USPTO patents (2008–2011).b Reactions extracted from USPTO patents (1976–2016). | ||
| USPTO14,15 | 424 621 reactionsa |
A curated collection of chemical reactions extracted from United States patents, constructed by Lowe14 through systematic text-mining, normalization, and atom-mapping procedures applied to the USPTO corpus. |
1 808 937 reactionb |
||
| USPTO-50k19 | 50 000 reactions |
A subset pseudorandomly selected from the USPTO corpus14,15 from reactions able to be classified or mapped by NameRxn20 and Indigo. |
| USPTO-480k (USPTO-MIT)21 | 480 000 reactions |
A dataset derived from the USPTO corpus.14,15 This dataset was filtered to exclude stereochemical information. |
| USPTO-LEF22 | 349 898 reactions |
This dataset consists of reactions from the USPTO-480k, filtered to focus on reactions that involve linear electron flow (LEF). |
| USPTO-STEREO23 | 1 002 970 reactions |
A dataset derived from the USPTO corpus,14,15 that retains stereochemistry information for reactions. It includes details about chiral centers and their transformations. |
| USPTO-Full24 | 1 100 105 reactions |
A dataset derived from the USPTO corpus14,15 and filtered by Dai et al.,24 containing a wide variety of reaction types. |
| Pistachio17 | Over 9 million reactions | A dataset of over 9 million reactions extracted from US & EPO patents, containing compound information (SMILES, trivial names), reaction details (reaction types, yields), and document info (publication date). |
| Ahneman et al.'s Buchwald–Hartwig amination25 | 4312 reactions | A high-throughput experimental dataset focused on the palladium-catalyzed Buchwald–Hartwig amination of aryl halides with 4-methylaniline, including yield, reaction conditions, and additives. |
| Perera et al.'s Suzuki cross-coupling26 | 5760 reactions | A high-throughput experimental dataset focused on the palladium-catalyzed Suzuki-Miyaura cross-coupling of aryl halides with boronic acids, including yield, reaction conditions, and additives. |
| HiTEA27 | 3083 Buchwald–Hartwig | High-throughput experimental datasets collected over a decade from pharmaceutical companies' internal screening platforms, covering four reaction classes with detailed records of yields, conditions, and reagents. |
| 2908 hetero hydrogenation | ||
| 2567 homo hydrogenation | ||
| 1575 Ullmann coupling | ||
The USPTO-derived reaction dataset is worth special mention due to its prominence as a fully open CC0-licensed dataset and the source of many data subsets used for model evaluation. It is built from the patent text-mining work of Daniel M. Lowe and contains data algorithmically extracted from U.S. patents dating from 1976 to September 2016.14,15 Since the full dataset may contain reactions with missing components, questionable atom mapping, or other potential quality concerns, several versions of the USPTO dataset have been created based on different filtering methods.
The USPTO-15k subset consists of 15
000 curated reactions, originally used by Coley et al.28 for reaction product prediction. The filtering process involved selecting reactions that could be mapped to a common set of reaction templates, with special care taken to eliminate data with unreliable atom mappings or incomplete reaction information. This dataset primarily focuses on reactions with well-defined and easily distinguishable reactants and products.
The USPTO-50k subset is a larger collection of 50
000 reactions able to be assigned a reaction class from the USPTO,19 offering a sampling of reactions encountered in patent literature. The filtering process here focused on ensuring that only reactions with valid atom mappings and complete reaction schemes were included. This dataset was originally released in the context of reaction classification, but quickly became popular in retrosynthesis studies due to its curation and manageable size. The USPTO-480k dataset (sometimes referred to as USPTO-MIT) represents a larger and more specialized subset, containing 480
000 reactions that were created by Jin et al.21 A key feature of this dataset is the requirement for reactive atoms to form a continuous reaction center, meaning that reactive atoms must form direct bonds between the reactants and as the product. From USPTO-480k, Bradshaw et al.22 curated USPTO-LEF by further filtering out reactions that do not follow linear electron flow, resulting in 73% of the reactions. Neither dataset contains stereochemical information or reactions that alter aromatic bonds.
The USPTO-Full dataset is the most commonly used dataset that is intended to be a maximally-inclusive version of Lowe's USPTO dataset, with a cleaned version containing approximately 1
100
105 unique reactions after preprocessing to remove duplicates and reactions with incorrect atom mappings.24 As the dataset with the largest number of reactions among the USPTO derivatives, it includes a relatively wide variety of reaction types covering standard chemical transformations from patent literature.
The USPTO-STEREO dataset is another important derivative, which specifically retains stereochemical information for reactions.23 It contains 1
002
970 reactions, with single-product reactions constituting 92% of the dataset. The cleaning process involved removing 720
768 duplicates from Lowe's dataset by comparing reaction strings without atom mapping. Additionally, 780 reactions were removed because the SMILES strings could not be canonicalized with RDKit due to issues like invalid valence electron counts.
It is important to pay careful attention to what dataset is being described even if the name sounds familiar; for example, the USPTO-Full used by Tetko et al.29 actually contains 4% fewer reactions than the USPTO-Full used by Dai et al.24 due to additional “cleaning” by the authors. Even if this data cleaning is fully justified, one must be hesitant to draw direct comparisons between performance metrics across datasets just because the dataset is referred to by the same name.
Reaction datasets minimally include the identities of reactants, agents, and products. These can be represented in two ways: either by combining the reactants and agents or by keeping them separate. In SMILES notation, the separated format is represented as reactants > agents > products, while the combined format is represented as reactants.agents ≫ products. The former is referred to as “separated”, and the latter as “mixed” in some evaluations. Reactants are the compounds that contribute to the structure of the product, and when atom mapping is available, the atoms in the reactants that correspond to those in the product are identified. Atom mapping is a technique used to track the movement of atoms from reactants to products during a chemical reaction.20,30–34 It assigns a unique identifier to each atom in the reactant molecules and ensures that these identifiers are preserved and mapped to the corresponding atoms in the product. Agents, by convention, are compounds such as solvents or catalysts that do not directly contribute to the product's structure. However, there are cases where small functional groups (e.g., in the case of methylation, solvolysis) or even a single atom (e.g., in the case of halogenations) may contribute to the product's structure, yet the compound is still classified by the dataset as an agent. In the “separated” format, the reaction prediction task is simplified by implicitly restricting possible outcomes to those that do not involve atoms from agents. “Mixed” is the more realistic and practical evaluation, but also more challenging from a statistical learning perspective, as the model must infer which of the chemical species are most likely to contribute to the product structure(s) from a larger pool.
In contrast to literature-derived reaction datasets, which typically report only successful reactions that yield an identifiable product, high-throughput experimentation datasets often include measurements across a wide range of reaction conditions, including low-yielding or unproductive outcomes.27,43,44 Because these experiments systematically explore combinatorial reaction spaces, they can capture both successful and unsuccessful conditions within the same dataset. Such data provide valuable information about reaction feasibility and can help mitigate survivor bias that arises when models are trained exclusively on successful transformations, thereby improving the robustness of machine learning models trained on these datasets.
| Name | Size | Description |
|---|---|---|
| RMechDB46 | 5300 elementary steps | A public database of elementary radical reaction steps derived from textbooks and research literature, containing curated mechanistic pathways, transition states, and electron flow annotations for radical reactions. |
| PMechDB47 | 108 987 elementary steps |
A publicly accessible database of elementary polar reaction steps, derived from curated literature and combinatorial methods, containing arrow-pushing mechanisms, and balanced reactions for various polar mechanisms. |
| Joung et al. 202448 | 898 155 elementary steps |
A large-scale dataset of elementary reactions derived from the USPTO-Full dataset using expert-curated reaction templates. This dataset includes detailed mechanistic steps, imputed intermediates, and side products. |
| FlowER dataset49 | 1 445 189 elementary steps |
A fully balanced, large-scale mechanistic dataset derived from the USPTO-Full dataset. It includes 1220 expert-curated reaction templates across 252 reaction classes. The dataset ensures broad coverage of organic transformations and is designed to maintain mass and electron conservation. |
| Rad-650 | 32 515 reactions |
A dataset of reactions for open- and closed-shell organic molecules, derived from DFT calculations. It includes bond-breaking reactions and molecules with up to six non-hydrogen atoms. |
| mech-USPTO-31k51 | 31 364 reactions |
A dataset derived from the USPTO corpus,14,15 containing organic reactions annotated with 100 hand-encoded reaction types and 63 mechanistic classes. Each reaction is labeled with mechanistic steps in the form of arrow-pushing diagrams. |
| Reactron dataset55,56 | ca. 2 850 000 elementary steps |
A dataset derived from USPTO-480K,21 containing reaction mechanism annotations generated by MechFinder.51 Each reaction is labeled with mechanistic steps in the form of arrow-pushing diagrams. The dataset is currently partially available, with the full dataset comprising approximately 2.85M reaction mechanisms. |
| RGD-152 | 176 992 reactions |
Organic reactions of 413 519 molecules involving C, H, O, and N atoms, containing transition states, activation energies, and enthalpies of reaction. |
| Grambow et al.53 | 16 365 reactions |
Organic reactions derived from an automated potential energy surface exploration method. |
| Transition1x54 | 10 073 reactions |
A NEB/CINEB-based reaction-path dataset containing reactant, transition-state, and product geometries produced from automated PES exploration. |
One method to generate mechanistic datasets is to do so through the application of expert-written mechanistic templates to different substrate combinations. Baldi has championed this approach of manually curating plausible elementary steps from organic chemistry textbooks and advanced organic chemistry books for over a decade, with recent releases of RMechDB and PMechDB.46,47 RMechDB contains over 5300 elementary radical reaction steps, each carefully curated to represent plausible reaction pathways. On the other hand, PMechDB focuses on polar reactions, specifically those involving heterolytic bond cleavage and charged intermediates, and includes over 100
000 elementary reaction steps. While RMechDB primarily consists of manually curated data, PMechDB combines manually curated entries with computationally generated steps, expanding its coverage and diversity. Both databases use SMIRKS notation to represent the transformations.
A similar approach to generating mechanistic datasets is through imputation of overall reactions. This also involves constructing reaction templates that dictate transformations at the level of elementary steps, but rather than applying them solely in a forward enumerative sense, they are applied to full reactions (with defined reactants and products) to impute the intermediates.48,49,51 These reaction templates are intended to cover widely agreed-upon and plausible reaction mechanisms, ensuring that they accurately reflect expert knowledge. Two mechanistic datasets have been made publicly available: Joung et al.48 created a dataset by considering 175 reaction conditions across 86 reaction types. They applied templates in a designated order according to each reaction type to generate the mechanistic dataset. In contrast, Chen et al.51 developed a mechanistic dataset using MechFinder, based on 63 general templates with high coverage, which were applied based on a lookup table to impute a mechanistic dataset for 31
364 overall reactions from the USPTO-50k dataset, excluding radical and organometallic reactions. This line of work has been further extended in the Reactron dataset,51,55,56 which scales up mechanistic annotation to a larger set of reactions derived from USPTO-Full.21 Contemporaneously, the FlowER dataset49 was introduced, emphasizing fully balanced reactions that strictly adhere to mass, hydrogen, and electron conservation principles. Unlike the earlier dataset by Joung et al.,48 FlowER explicitly considers acid-base reactions, significantly enriching its mechanistic diversity. This approach ensures comprehensive electron accounting, providing more chemically consistent mechanistic pathways suitable for advanced generative modeling. Similar adherence to conservation law is also enforced in datasets using MechFinder-based approaches.51,56 Beyond these publicly available mechanistic datasets, several groups have developed large sets of mechanistic transforms that are used in rule-based reaction prediction and reaction discovery.57–59 Because these collections are not released as standalone datasets, we discuss them separately in Section 4.1.2.
The Rad-6 database was created to explore the space of radical reactions, specifically focusing on organic molecules involving both closed- and open-shell systems.50 It contains 10
712 molecules, primarily selected based on a graph-based enumeration of radical and non-radical systems.60 These molecules, containing hydrogen, carbon, and oxygen atoms, were optimized using DFT calculations with the PBE0 functional and Tkatchenko-Scheffler dispersion corrections. The database is rich in radical fragments and unconventional motifs, such as poly-radicals, and includes molecules up to six non-hydrogen atoms. In addition to the molecular data, the Rad-6-RE dataset provides reaction energies calculated from the atomization energies of the molecules.
The Reaction Graph Depth 1 (RGD-1)52 dataset was developed to comprehensively explore reaction spaces for organic reactions involving C, H, O, and N atoms. The dataset contains 176
992 reactions, with transition states, activation energies, and enthalpies of reaction. The reactions are derived from 413
519 molecules curated from PubChem, all having no more than 10 heavy atoms. The dataset was generated using a graphically-defined elementary reaction step (ERS) that applies a systematic enumeration process, allowing the identification of a wide range of reactions. Conformational sampling was used to generate diverse reaction pathways, followed by transition state localization using Yet Another Reaction Program (YARP)61 and verified using intrinsic reaction coordinate calculations.
Grambow et al.53 created a large-scale dataset of elementary chemical reactions, focusing on the reactants, products, and transition states based on quantum chemical calculations. The dataset consists of 12
000 reactions computed using the ωB97X-D3/def2-TZVP level of theory and 16
365 reactions computed at the B97-D3/def2-mSVP level. These reactions include H, C, N, and O atoms, and the data was derived from an automated potential energy surface exploration method, using the single-ended growing string method62,63 to optimize reaction paths and transition states. The results provide critical information such as atom-mapped SMILES, activation energies, and enthalpies of reaction.
Building on similar ideas of automated PES exploration, the Transition1x dataset54 provides dense sampling of reaction pathways rather than focusing solely on discrete transition states. The dataset contains 10
073 reactions generated using nudged elastic band (NEB) and climbing-NEB (CINEB) calculations initiated from 11
961 reactant-product pairs. Each pathway includes reactant, product, and transition-state structures as well as intermediate images along the reaction coordinate, all computed at the ωB97X/6-31G(d) level of theory.
Transition state theory relies on several key assumptions that connect the potential energy surface to reaction kinetics. It assumes that the reactant ensemble is in thermal equilibrium and forms a quasi-equilibrium with the transition state, allowing the relative population of the transition state to be expressed through Gibbs free energies. Nuclear motion is treated classically, and the reaction is assumed to proceed along a single well-defined minimum-energy path connecting reactants and products, without contributions from alternative pathways. A central requirement is the no-recrossing assumption, which states that trajectories crossing the dividing surface at the transition state do not return to the reactant region. These assumptions enable the derivation of the typical TST rate expression,
![]() | (1) |
However, the simplifying assumptions also define the situations in which the theory may break down. For example, if there are large energetic releases from reactions that confound the dynamics,65 trajectories are produced that do not remain on the minimum-energy path. The classical treatment of nuclear motion also neglects quantum tunneling, which allows particles to traverse barriers even when they lack sufficient classical energy.66 In addition, the assumption that trajectories cross the dividing surface only once is often violated, and recrossing events can return flux to the reactant side and reduce the effective rate.67 There are formulations of TST that seek to address some of the simplifications associated with the original theory. Variational transition state theory accounts for recrossing effects by redefining the dividing surface.68 Multipath69 and multistructure70 variational TST account for recrossing and contributions from several reactive pathways and structures respectively. Although these formulations improve TST's modeling capability, conventional TST remains widely used for evaluating reaction rates from computed transition states.
Because transition states determine the activation free energy of an elementary step, locating the saddle point on the potential energy surface is essential for connecting molecular structure to reaction kinetics. Unlike reactant and product minima, transition states correspond to isolated first-order saddle points, and small deviations in geometry can drive the system toward either well. As a result, identifying a molecular configuration that lies sufficiently close to the saddle point is often challenging, and practical TS optimization requires an initial guess structure that captures the correct chemical transformation. These considerations motivate the development of algorithms that construct or refine such guesses, which are discussed in the following Section 4.2.2.
![]() | ||
| Fig. 2 Visualization of the many possible conformers available to a molecule. The number of accessible conformations combinatorially increases with the number of rotatable bonds. Reproduced with permission from ref. 72. | ||
Given these challenges, conformer generation methods algorithmically construct three-dimensional geometries by sampling torsional degrees of freedom according to chemically reasonable constraints. Existing approaches fall broadly into systematic, stochastic, and machine-learned strategies, each differing in how torsion angles are selected and how the resulting geometries are filtered to produce a representative ensemble.
Systematic conformer generation methods enumerate conformational space by explicitly rotating each rotatable bond according to a set of allowed torsion angles derived from empirical structural data. These approaches rely on knowledge-based torsion libraries constructed from crystallographic or protein-bound ligand datasets, which encode the preferred low-energy torsional states observed in real molecules.73–76 Conformers are assembled incrementally by building the molecular structure from a central fragment outward and applying all feasible torsion-angle assignments in either a depth-first or combined depth-first and breadth-first manner. Throughout this construction process, chemically invalid structures are removed through checks on bond lengths, bond angles, and steric clashes, often supported by VSEPR-derived ideal geometries and covalent radii. Many systematic generators incorporate specialized handling of ring systems or macrocycles, either through ring template libraries or through numerical optimization methods that rebuild macrocyclic torsions after enumeration.75 Because systematic enumeration rapidly produces large numbers of candidate conformers, most algorithms apply RMSD-based clustering to identify a diverse and representative ensemble that remains within user-specified size limits.
Stochastic conformer generation methods explore conformational space by combining random coordinate perturbations with local energy minimization, rather than exhaustively enumerating torsional states. Distance-geometry approaches construct candidate geometries by sampling distance matrices that satisfy chemically reasonable upper and lower bounds, followed by force-field refinement.77,78 Other stochastic strategies operate directly in internal-coordinate space by assigning random torsion angles and optimizing the resulting structures.79–81 Energy-directed stochastic methods use molecular-dynamics trajectories perturbed along low-frequency vibrational modes to preferentially access low-strain conformations that may be difficult to locate through unbiased sampling.82 These stochastic approaches provide a flexible framework for conformer generation in systems where combinatorial torsion enumeration becomes intractable.
Machine-learned conformer generation methods use generative models to predict three-dimensional molecular geometries directly from the molecular graph or from torsional degrees of freedom. Unlike systematic or stochastic approaches, which rely on predefined torsion libraries or random perturbations, machine-learned methods train on large datasets of experimentally derived or simulation-generated conformers to learn distributions over 3D structures. Graph neural networks can be used to predict local atomic frames and torsion angles conditioned on the molecular graph,72,83,84 while diffusion-based approaches operate either in full Cartesian coordinate space or in torsion angle space to sample conformers through learned denoising trajectories.85,86 Other models explicitly target the Boltzmann distribution by incorporating energy-based rewards into their training objectives.87,88 These approaches have been shown to excel at generating low energy conformations however are known to struggle to recover ensembles that are distributed differently than their underlying reference structures (e.g., bioactive conformations vs. low energy conformations).89
![]() | ||
| Fig. 3 Cost-accuracy tradeoff associated with methods used to evaluate the energy of a 3D geometry of a molecule or atomistic system. More accurate methods yield poorer scaling which leads to higher relative compute times. Reproduced with permission from ref. 101. Copyright 2023 John Wiley and Sons. | ||
In practice, accurate energetic evaluation for reaction modeling involves several additional challenges beyond the choice of electronic structure method. Many reactions proceed on multiple potential energy surfaces, requiring the treatment of excited states, nonadiabatic effects, or spin-state crossings, particularly in photochemical systems and transition-metal catalysis.102,103 Open-shell species, near-degenerate electronic configurations, and multireference character can further complicate the reliable description of reaction intermediates and transition states.104 Environmental effects such as solvation and conformational flexibility introduce additional sources of uncertainty, as they can substantially alter relative energetics and barrier heights.105 These factors make the construction of chemically accurate reaction energy profiles a nontrivial task, even when high-level electronic structure methods are employed.106
These methods estimate the wave function to determine the energy of a given molecular system. The Hartree–Fock method is one of the original methods to perform this calculation; it assumes the wave function can be approximated by a single Slater determinant.107 Although Hartree–Fock has practical utility, it neglects electron correlation which limits its accuracy. There are several methods, post-Hartree–Fock methods, that build upon Hartree–Fock and address electron correlation. Full configuration interaction (CI) is an example of this that yields the exact solution to the non-relativistic Schrödinger equation, however is not practical for molecules with more than just a few atoms due to its exponential scaling.108–110 Other post-Hartree–Fock methods employ estimations for electron correlation. For example, Møller–Plesset perturbation methods add electron correlation using Rayleigh–Schrödinger perturbation,111 and coupled cluster methods use the exponential cluster operator to include electron correlation.112,113 These methods trade achieving the exact solution for more favorable scaling (O(N5) and O(N7) respectively) allowing them to be used on larger molecules when compared to full CI.
The more common way to estimate the wave function is with density functional theory (DFT) calculations. The theory behind DFT are the Hohenberg–Kohn theorems which state that the ground state energy of a many electron system is uniquely defined by the electron density, and that an energy functional exists that is minimized at the ground state density.114 By using the electron density, DFT can achieve better scaling than post-Hartree–Fock methods (O(N3)) while still accounting for electron correlation. However, since the exact energy functional is not known, DFT remains an approximation and is generally less accurate than post-Hartree–Fock methods. Within DFT, there is a cost-accuracy trade-off where more accurate functionals and larger basis sets require more computational power but provide a more accurate description of the physics governing the system. For example, moving up Jacob's ladder to more accurate functionals involves an increasingly nonlocal description of exchange-correlation energy which provides a more exact description of electron density while necessitating an increased number of integral evaluations.115 DFT has been used extensively in the characterization of reactive systems,116,117 and is more often employed for providing post hoc insight into the results of experimentally performed reactions than for providing true predictive ability. It is worth noting too that “more accurate” methods are still context-dependent, and different chemical systems benefit from different design choices.118
Semi-empirical methods have emerged as a class of methods that reduce computational expense through the use of a less complete approximation for electronic integrals.119–123 To mitigate the loss of fidelity, these methods use adjustable empirical parameters to correct energies based on experimental data or accurate theoretical data.124 Semi-empirical methods can be several orders of magnitude faster than DFT125 and have been shown to output reliable ground state and transition state geometries for certain reactions.126–128 Although the accuracies of energies estimated by semi-empirical methods are often regarded as insufficient for reaction modeling,126 the recently developed method g-xTB has demonstrated substantially more promising accuracy than its predecessors when estimating reaction barriers.121
Interatomic potentials are a set of methods that directly predict the energy of a molecular system from its geometry in order to reduce cost. Much of the computational expense associated with the above methods is the necessity of a self-consistent field method for calculating the wave function. Interatomic potentials circumvent this need. Force fields are the prototypical category of interatomic potentials parameterized by experimental data or quantum chemical calculations, but there are few force fields specifically parameterized for reactive geometries which allow them probe kinetics of reactions directly. Two such examples, ReaxFF129,130 and the charge-optimized many-body potential (COMB),131,132 were initially developed to handle hydrocarbons but were eventually expanded to handle more organic elements and metals. One key difference between the two is ReaxFF is parameterized to reproduce reactive barrier heights whereas COMB is parameterized to reproduce elastic properties of materials.130
The traditional means of defining and parameterizing empirical force fields has been revisited in recent years through the lens of machine learning. Specifically, machine-learned interatomic potentials (or neural network potentials) have emerged as a new category of interatomic potentials parameterized by a neural network trained on quantum chemical calculations. This allows machine-learned interatomic potentials to, in principle, overcome the cost-accuracy trade-off by providing quantum chemistry accuracy at the cost of a single model inference pass;133–135 their accuracy relies on having adequate training data. Certain machine-learned interatomic potentials are trained on both equilibrium and transition state-like geometries which can allow them to characterize geometries associated with reactions. Due to the millions of CPU hours it can take to generate training data, most reactive machine-learned interatomic potentials have been limited to narrow chemical spaces (i.e. small number of elements and reaction types) but still have been used to computationally explore chemical reactions.136–141 One strategy to improve the performance of these potentials for a specific reaction is active learning. This entails sampling points in the chemical regime of interest where the model is uncertain with quantum chemical calculations. This minimizes the amount of training data necessary to achieve parity with a reference functional for specific reactions.142–146 Furthermore, recent developments have resulted in increasingly general reactive machine-learned interatomic potentials. The dataset Open Molecules 2025 (OMol25) includes over 10 million DFT snapshots along reaction pathways with samples from organic and organometallic reactions; neutral, charged and radical reactions; and electrolyte reactions.147 Models trained on this dataset should be expected to exhibit a broader domain of applicability. More targeted efforts have resulted in the development of reactive AIMNet models capable of modeling neutral reactions (AIMNet2-rxn),148 Pd-catalyzed reactions (AIMNet2-Pd),149 and radical reactions (AIMNet2-NSE).150
This category also includes certain physics-based methods that perform simultaneous product prediction, for instance, by exploring potential energy surfaces and identifying both feasible products and their corresponding energetics. These two methodologies, data-driven and physics-based, will both be discussed in this section. Each method has its own way of defining and computing the “likelihood” of an outcome. For data-driven models, this often corresponds to predicted probabilities or ranked scores inferred from statistical patterns in historical data. For physics-based methods, likelihood is typically evaluated using computed activation energies or thermodynamic quantities such as Gibbs free energy, which reflect the kinetic or thermodynamic feasibility of a transformation. An illustrative comparison between these two paradigms is shown in Fig. 4. While these likelihoods are not directly comparable, treating them as instances of the same overarching task of assessing the favorability of possible outcomes helps to conceptually unify methods that are developed from different principles.
Directly learning a normalized likelihood p(P|R) is challenging due to the combinatorially large and discrete nature of the product space. Furthermore, product molecules are structured outputs such as graphs or sequences, making it difficult to define and estimate probability densities. As a result, most approaches will train a model to sample products from this distribution,
∼ p(P|R), rather than actually learning a normalized likelihood p(P|R).49,152,153
After sampling one or more products, different models lend themselves to different ways of then estimating a likelihood that resembles or correlates with p(P|R). A primary axis of organization is how one defines the model and representation of product molecules in order to perform this sampling.
These models differ in how they represent molecules—either as graphs154,155 or as sequences29,152—and accordingly adopt different neural architectures and modeling assumptions. Because these representations encode different aspects of chemical structure and syntax, they lead to distinct modeling strategies, each with trade-offs in capturing topological detail, chemical validity, and predictive uncertainty.
The performance of various data-driven models for predicting reaction outcomes is summarized in Table 3. This table contains the vast majority of methods or models that predict reaction products from reactants in a data-driven manner, including both models that directly generate reaction products along with their associated likelihoods—discussed in Sections 3.1.1 and 3.1.2—and two-step models that decouple product enumeration and scoring, as described in Section 4. Most models represent molecules as either molecular graphs or SMILES sequences (or other line notations analogous to SMILES,156,157) though some adopt alternative encodings such as fingerprints. Model performance is most often reported as the top-k accuracy, defined as the proportion of test instances for which the correct product appears within the model's top k ranked predictions.
| Year | Model | Dataset | Setting | Representation | Top-1 | Top-2 | Top-3 | Top-5 | Top-10 |
|---|---|---|---|---|---|---|---|---|---|
| a Datasets with the same name may differ in preprocessing or splits (e.g., train/validation/test), making top-k accuracy values not directly comparable.b In mechanism prediction, multiple valid outcomes may exist, but models can assign only one per rank, imposing an upper bound on top-k accuracy. | |||||||||
| Graph generation models (Section 3.1.1) | |||||||||
| 2019 | GTPN154 | USPTO-15k | Separated | Graph | 82.4 | 85.7 | 85.8 | ||
| 2020 | MPNN + ILP158 | USPTO-480k | Separated | Graph | 90.4 | 93.2 | 94.1 | 95.0 | |
| 2021 | MEGAN159 | USPTO-480k | Separated | Graph | 89.3 | 92.7 | 94.4 | 95.6 | 96.7 |
| 2021 | MolR160 | USPTO-480k | Graph | 88.2 | |||||
| 2021 | NERF161 | USPTO-480k | Graph | 90.7 | 92.3 | 93.3 | 93.7 | ||
| 2023 | NERF* 162 |
USPTO-480k | Graph | 91.5 | 93.6 | 94.4 | 95.1 | 95.6 | |
| 2023 | ReactionSink163 | USPTO-480k | Graph | 91.3 | 93.3 | 94.0 | 94.5 | 94.9 | |
| 2025 | FlowER49 b |
FlowER dataset | Mixed | Graph | 88.4 | 96.3 | 97.8 | 98.4 | 98.5 |
| 2025 | Reactron56 | Reactron dataset | Mixed | Graph | 96.4 | 97.6 | 97.8 | — | — |
| Sequence generation models (Section 3.1.2) | |||||||||
| 2018 | Seq2Seq23 | USPTO-480k | Separated | Sequence | 83.2 | 87.7 | 89.2 | ||
| 2019 | Transformer153 | USPTO-480k | Separated | Sequence | 90.4 | 93.7 | 94.6 | 95.3 | |
| 2020 | Augmented Transformer29 | USPTO-480k | Separated | Sequence | 92.0 | 95.4 | 97.0 | ||
| 2022 | Chemformer164 | USPTO-480k | Separated | Sequence | 92.5 | 94.9 | 95.1 | ||
| 2022 | Chemformer-Large164 | USPTO-480k | Separated | Sequence | 92.8 | 94.9 | 95.0 | ||
| 2019 | Transformer153 | USPTO-480k | Mixed | Sequence | 88.6 | 92.4 | 93.5 | 94.2 | |
| 2020 | Augmented Transformer29 | USPTO-480k | Mixed | Sequence | 90.6 | 94.4 | 96.1 | ||
| 2022 | Graph2SMILES155 | USPTO-480k | Mixed | Sequence | 90.3 | 94.0 | 94.8 | 95.3 | |
| 2022 | Chemformer164 | USPTO-480k | Mixed | Sequence | 90.9 | 93.8 | 94.1 | ||
| 2022 | Chemformer-Large164 | USPTO-480k | Mixed | Sequence | 91.3 | 93.7 | 94.0 | ||
| 2022 | T5Chem165 | USPTO-480k | Mixed | Sequence | 88.9 | 92.9 | 95.2 | ||
| 2022 | R-SMILES166 | USPTO-480k | Mixed | Sequence | 91.0 | 95.0 | 96.8 | 97.0 | |
| 2023 | SeqAGraph167 | USPTO-480k | Mixed | Graph | 88.9 | 94.6 | 95.5 | 96.5 | |
| 2023 | BiG2S168 | USPTO-480k | Mixed | Graph | 89.0 | 94.6 | 95.6 | 96.6 | |
| 2019 | Transformer153 | USPTO-STEREO | Separated | Sequence | 78.1 | 84.0 | 85.8 | 87.1 | |
| 2022 | Motif-Reaction169 | USPTO-STEREO | Separated | Sequence | 82.7 | 88.8 | 89.6 | ||
| 2019 | Transformer153 | USPTO-STEREO | Mixed | Sequence | 76.2 | 82.4 | 84.3 | 85.8 | |
| 2022 | Graph2SMILES155 | USPTO-STEREO | Mixed | Graph | 78.1 | 84.6 | 85.8 | 86.8 | |
| 2022 | Motif-Reaction169 | USPTO-STEREO | Mixed | Sequence | 79.9 | 86.4 | 88.2 | ||
| 2024 | Transformer48 | Subset of Pistachio | Mixed | Sequence | 90.2 | 93.8 | 94.8 | 95.5 | |
| 2024 | Graph2SMILES48 | Subset of Pistachio | Mixed | Graph | 98.3 | 98.6 | 98.7 | 98.7 | |
| 2024 | Transformer48 b |
Joung et al. 202448 | Mixed | Sequence | 83.5 | 88.3 | 89.3 | 90.1 | 90.7 |
| 2024 | Graph2SMILES48 b |
Joung et al. 202448 | Mixed | Graph | 88.7 | 91.0 | 91.3 | 91.5 | 91.6 |
| Two-step models (Section 4) | |||||||||
| 2022 | LocalTransform170 | USPTO-480k | Separated | Graph | 92.3 | 95.6 | 96.5 | 97.2 | |
| 2017 | WLDN21 | USPTO-480k | Mixed | Graph | 79.6 | 87.7 | 89.2 | ||
| 2019 | WLDN171 | USPTO-480k | Mixed | Graph | 85.6 | 90.5 | 92.8 | 93.4 | |
| 2021 | MEGAN159 | USPTO-480k | Mixed | Graph | 86.3 | 90.3 | 92.4 | 94.0 | 95.4 |
| 2022 | LocalTransform170 | USPTO-480k | Mixed | Graph | 90.8 | 94.8 | 95.7 | 96.3 | |
| 2019 | GTPN154 | USPTO-15k | Separated | Graph | 83.4 | 85.7 | 86.7 | ||
| 2019 | GTPN154 | USPTO-480k | Separated | Graph | 83.2 | 86.0 | 86.5 | ||
| 2018 | Electro22 | USPTO-LEF | Separated | Graph | 87.0 | 92.6 | 94.5 | 95.9 | |
| 2017 | Neural-Symbolic172 | Reaxys | Fingerprint | 78.0 | 98.0 | ||||
| 2021 | WLDN173 | 2225 Baeyer-Villiger oxidation | Graph | 90.4 | 93.4 | 93.9 | |||
| 2024 | WLDN48 | Subset of Pistachio | Mixed | Graph | 95.0 | 97.1 | 97.5 | 97.6 | |
| 2024 | WLDN48 b |
Joung et al. 202448 | Mixed | Graph | 79.4 | 86.5 | 87.4 | 88.0 | 88.3 |
Despite nearly a decade of progress, top-1 accuracies on commonly used USPTO-derived benchmarks have largely plateaued around 90%. Rather than indicating a strict saturation of model performance, this trend likely reflects the fact that reaction outcome prediction is inherently under-specified in these datasets. In particular, the absence of detailed reaction conditions and procedures means that the mapping from reactants to products is not uniquely defined, introducing legitimate ambiguity in what constitutes the “major” product. In many cases, reported products may not correspond to a dominant (>50%) outcome, but rather reflect experimentally convenient or selectively reported transformations. As a result, evaluation metrics based on exact product matching may penalize chemically reasonable predictions that differ from the recorded product, effectively imposing a ceiling on achievable top-1 accuracy. Additional factors, including reporting bias and data quality issues in patent-derived datasets, may further contribute to this performance limit.
While many models produce and rank multiple candidate outcomes, the notion of ‘scoring’ varies significantly across frameworks. These range from learned likelihoods or autoregressive beam search scores154,159 to empirical frequencies from repeated sampling.49 These scores are generally only relative measures of plausibility and are not calibrated as absolute probabilities. Formally, after all, these models are trained to predict the major product of a reaction as it is recorded, not a quantitative branching ratio, yield, etc. Therefore, the term “likelihood” in this context refers not to a calibrated probability but to a more general notion of confidence or favorability assigned to predicted outcomes. Still, likelihood may and often does empirically correlate with accuracy.
In this section, we focus on models that produce product structures and associated likelihoods in a single integrated process. This includes models that predict a sequence of graph edits in an autoregressive manner so long as they do not separately evaluate or rank candidate outcomes after generation. In contrast, models that involve an explicit evaluation of predicted candidates (e.g., via scoring, ranking, or binary feasibility assessment) are discussed later under two-step frameworks under Section 4.
While GTPN learns edit sequences via reinforcement learning, an alternative strategy is to directly supervise autoregressive edit prediction. Autoregressive graph-editing is exemplified by the Molecule Edit Graph Attention Network (MEGAN) in Fig. 5a, which formulates reaction prediction as a sequence of graph edits applied to the input reactant graph.159 Each edit corresponds to a discrete transformation, such as adding or removing atoms, modifying bonds, or adjusting atomic attributes like formal charge, and is selected from a learned vocabulary of chemically plausible actions. These edits are predicted autoregressively, one at a time, using masking mechanisms to ensure chemical validity at each step. Autoregressive graph-editing has also been extended beyond product prediction to model reaction mechanisms. For example, Reactron56 operates at the level of electron flow by autoregressively predicting source-sink interactions that define arrow-pushing steps, thereby constructing reaction mechanisms as sequences of elementary transformations rather than directly generating the final product.
![]() | ||
| Fig. 5 Two representation models for graph-based models for reaction product prediction. (a) Prediction of sequence of graph edits (MEGAN). Reproduced with permission from ref. 159. Copyright 2021 American Chemical Society. (b) Non-autoregressive prediction (NERF). Reproduced with permission from ref. 161. | ||
A representative example of this one-shot prediction strategy is NERF (non-autoregressive electron redistribution framework), which frames reaction prediction as a global electron redistribution task over the molecular graph.161 Instead of generating atom-level edits or bond changes sequentially, NERF samples latent vectors from a learned distribution and decodes them into predictions of how electrons are redistributed across the molecule. Specifically, the model estimates the movement probabilities of electrons between all atom pairs simultaneously, using attention-based mechanisms to infer bond formation and breaking events.
Building on the non-autoregressive formulation of NERF, ReactionSink addresses a key limitation in electron redistribution modeling: the failure to satisfy fundamental physical constraints simultaneously.163 While NERF enforces a form of row-wise electron conservation through attention-based bond updates, it does not guarantee symmetry or full electron-counting consistency in its predicted bond changes. In particular, NERF applies row-wise normalization to attention weights, ensuring that each atom distributes a fixed amount of electron flow, but not that receiving atoms obey the same constraint. Moreover, its post-hoc symmetrization step can disrupt the initial conservation property, leading to inconsistencies between predicted donor and acceptor electron counts. ReactionSink introduces a refined decoder architecture that imposes both the electron-counting rule and the symmetry rule, ensuring doubly conservative predictions. This is achieved through the use of Sinkhorn normalization,174 which transforms attention maps into doubly stochastic matrices. Compared to NERF, ReactionSink the improvement in top-1 accuracy on USPTO-480k is modest (from 90.7% to 91.3%). Note that this benchmarking dataset exhibits a plateau in top-1 accuracy (Table 3), making it increasingly difficult to distinguish model performance at the high-accuracy regime, as discussed further in Section 3.1.
A parallel line of work seeks to improve product diversity and uncertainty modeling within non-autoregressive frameworks. Guo et al.162 augment NERF by eliminating the use of a latent prior and instead introducing two sources of controlled stochasticity: boosting, which trains an ensemble of specialized models to capture diverse reaction outcomes, and dropout, which introduces finer variations during inference. While their method does not explicitly enforce conservation laws like ReactionSink, it achieves improved product diversity on multi-selectivity reactions.
While these models operate within the non-autoregressive paradigm, they adopt an end-to-end formulation that trains on final product structures without capturing reaction intermediates. However, in many areas of chemistry, understanding the underlying reaction mechanism—rather than just the final product—is of interest. To capture this mechanistic detail, some machine learning approaches operate at the level of elementary steps.48,49,56 Flow matching for Electron Redistribution (FlowER) implements this non-autoregressive paradigm by treating elementary step prediction as a flow-matching problem over the bond-electron matrix.49 FlowER predicts mechanistic steps by learning a time-dependent vector field over interpolated bond-electron matrices, enabling the model to generate electron movements that update the bond–electron matrix in a physically consistent manner.
Graph-based models provide a powerful framework for chemical reaction prediction by explicitly representing molecular structures as graphs, where atoms are nodes and bonds are edges. Their primary strength lies in their ability to capture the local chemical environment and structural changes using message-passing algorithms, enabling interpretable predictions that align closely with chemical intuition, such as bond formation, bond breaking,154,159,175 and electron redistribution.49,161 Models like MEGAN159 and ReactionSink163 further enhance the reliability and accuracy of predictions by incorporating structural constraints and enforcing valence rules, ensuring that the predicted reactions are not only structurally valid but also consistent with fundamental chemical principles.
However, the sequential nature of graph-editing models (such as MEGAN159 and GTPN154) can be computationally expensive, especially for complex reactions that involve numerous steps. The process of iteratively predicting and updating graph structures poses significant scalability challenges, particularly when dealing with reactions requiring multiple transformations. While these patterns may resemble a reaction mechanism in some cases, they do not accurately represent true reaction mechanisms. To predict actual mechanisms, including electron flow (or arrow-pushing diagrams), the model would need to operate at the level of elementary steps, as demonstrated by models like FlowER,49 and DeepMech,175 as well as autoregressive approaches like Reactron,56 which explicitly generate the sequence of mass- and charge-conserving mechanistic transformations rather than only the final product structure.
The use of sequence-to-sequence models for chemical reaction prediction was first demonstrated, to our knowledge, by Nam and Kim,152 who used the then state-of-the-art technique of a long short-term memory (LSTM)-based model (Fig. 6) operating on SMILES strings, trained on reaction data from the USPTO dataset prepared by Lowe.14,15 Building on this idea and bringing significantly greater scale and practicality, Schwaller et al.23 and demonstrated improved performance on larger-scale datasets, such as USPTO-480k, establishing sequence-to-sequence translation as a viable paradigm for practical reaction outcome prediction, favorably competing with the graph-based methods published shortly before (Table 3).
![]() | ||
| Fig. 6 Model architecture for seq2seq. Reproduced with permission from ref. 152. | ||
Transformer-based models further advanced this approach by replacing recurrence with attention-based architectures,180 allowing for improved modeling of long-range dependencies and better scalability to large datasets. When applied to chemical reaction prediction, these models improved the accuracy of product prediction over their predecessor LSTMs,29,153,164 and are now widely used for SMILES-based modeling across reaction prediction tasks just as transformers serve as the backbone for the vast majority of autoregressive prediction tasks in contemporary deep learning.
The notion of “likelihood” in these sequence-generation models conveniently arises from natural language processing settings. To provide a ranked list of multiple candidate products, sequence generation models employ various beam search strategies. The predicted product is typically the SMILES string with the highest joint probability—computed as the product of per-token probabilities. While this provides a natural ranking, it does not guarantee chemical feasibility. In practice, models may assign high likelihoods to products that are syntactically well-formed but chemically implausible, especially when trained on data with limited constraints.48,181,182 This disconnect between statistical likelihood and chemical likelihood remains a key limitation in SMILES-based architectures.
It is worth emphasizing that this class of models is conceptually distinct from graph-based approaches, which treat molecules as topological graphs and rely on atom- and bond-level features. Sequence-generation models operate entirely in token space, where SMILES strings are parsed into sequences of discrete units—tokens—such as atoms, bonds, or common substructures. These models must learn the SMILES syntax, chemical validity, reactivity, and conservation laws implicitly through exposure to large datasets. However, they do so quite effectively in practice.
In this section, we review representative sequence-generation models for reaction prediction, focusing on their ability to not only generate plausible product structures but also rank multiple outcomes by predicted likelihood. We discuss architectural innovations, data augmentation strategies, domain adaptation through transfer learning, and attempts to improve interpretability and physical fidelity. We also examine how recent models incorporate molecular structure features, including graph-based inputs and motif-aware representations, while retaining the efficiency and generality of sequence-based generation. A summary of their performance across datasets is provided in Table 3.
A single molecular structure can be represented by many valid SMILES strings, even if cheminformatics tools like RDKit contain canonicalization strategies to generate strings reproducibly. Handling this SMILES ambiguity has been a challenge in sequence-based prediction. Unlike graph-based approaches, which inherently preserve molecular connectivity regardless of atom ordering through permutation equivariance, SMILES-based models must recover structural information from a linear sequence of tokens. This makes them sensitive to the exact SMILES representation encountered during training and inference, with the risk of overfitting to specific token patterns rather than learning the underlying chemical structure.
One way to mitigate the sensitivity to specific SMILES representations is to expose the model to multiple valid encodings of the same molecule during training. By learning from varied token sequences corresponding to the same underlying structure, the model can better generalize to unseen SMILES formats at inference time. The Augmented Transformer29 incorporated randomized SMILES augmentations throughout training and inference, applying them either to the input SMILES alone or to both input and output SMILES. This strategy exposed the model to diverse syntactic variations of the same underlying molecules, modestly improving its top-1 accuracy with full reaction augmentation.
In contrast to random augmentation, the R-SMILES approach166 promotes consistent token ordering in the most chemically relevant regions of the SMILES string, specifically the reactive centers. By aligning reaction sites while allowing variability in non-reactive portions of the molecule, R-SMILES encourages substrings to be copied over from the reactants to products. Such randomized augmentation and structure-aware alignment strategies demonstrate how targeted exposure to multiple SMILES representations can improve the robustness of SMILES-based models. These techniques are now common components in recent sequence-based models.
Building upon this insight, Chemformer164 introduced a pretrained Transformer architecture tailored for molecular tasks. Rather than training models from scratch, Chemformer adopted a two-stage process: unsupervised pretraining on large corpora of molecular SMILES (e.g., from PubChem183) followed by task-specific fine-tuning on reaction data. This approach aimed to encode generalizable chemical priors such as SMILES grammar, connectivity patterns, and functional group identities into the model's weights before learning the specific task of reaction prediction. In forward prediction on the USPTO-480k dataset, this strategy leads to incremental differences in top-1 accuracy from 91.1% (random initialization) to 91.8% (with pretraining),164 but did reduce the amount of task-specific data and training time needed to reach competitive performance.
While earlier models focused on a single prediction task, T5Chem165 pursued a unified architecture that could support multiple tasks as text-to-text problems using task-specific prompts (e.g., “Product:”, “Reactants:”, “Reagents:”) for forward prediction, retrosynthesis, reaction classification, reagent suggestion, and yield estimation as shown in Fig. 7). This design enabled the model to share representations across tasks while learning task-specific behavior. In T5Chem,165 the model achieved a top-1 accuracy of 88.9% for forward reaction prediction on USPTO-480k.
![]() | ||
| Fig. 7 Multitasking of T5Chem. Reproduced with permission from ref. 165. Copyright 2022 American Chemical Society. | ||
Despite performing well on broad benchmark tasks and pushing accuracy incrementally higher, ranking alternatives remains a persistent challenge for sequence-based reaction prediction models. Model-assigned scores derived from the product of token-level probabilities do not necessarily correspond to chemical feasibility or mechanistic plausibility.29,153,164 Several studies have observed that incorrect predictions often receive higher scores than chemically valid alternatives, reflecting a disconnect between statistical confidence and chemical correctness.29,164 This gap indicates the need for better-calibrated scoring mechanisms to enable more reliable selection among plausible reaction products.
![]() | ||
| Fig. 8 Top-k accuracy of reaction prediction models under different data split strategies, replotted from data reported in Bradshaw et al.184 “On reactions” indicates a random split over reactions, “on documents” splits by source document, and “on authors” holds out all reactions associated with specific authors. | ||
One approach to mitigate these domain shift and data scarcity issues is transfer learning, where a model is first pretrained on a large corpus of general reactions and subsequently fine-tuned on a small set of domain-specific examples. This strategy can encode broad chemical knowledge during pretraining while adapting to the unique reactivity patterns of specialized domains during fine-tuning. Below, we summarize case studies illustrating how this paradigm enables accurate modeling in data-scarce reactions.
In carbohydrate chemistry, fine-tuning a Transformer pretrained on 370
000 general reactions with 25
000 domain-specific examples increased top-1 accuracy to 72.3% from no meaningful baseline.185 For Heck reactions, the same pretraining raised accuracy from 66.3% to 94.9% using only 9959 examples.186 In Baeyer–Villiger oxidations, accuracies improved from 64.7% to 83.8% (88.1% with SMILES augmentation) for one dataset187 and from 58.4% to 81.8% (86.7% with augmentation) for another.173 In an industrial setting, applying the Molecular Transformer to 147
392 reactions from Pfizer ELNs showed that training solely on internal data achieved 97.0% top-1 accuracy, compared to 82.9% top-3 when transferring from USPTO-Full.188 These examples collectively show that pretraining on broad chemistry followed by domain-specific fine-tuning can recover or even exceed high accuracy in specialized chemical contexts with limited data. This ability to adapt pretrained sequence models to specialized chemical spaces-whether defined by functional group behavior, regioselectivity, or industrial context-suggests that their learned reactivity patterns are transferable and can be rapidly specialized in certain settings. A summary of these performance improvements is provided in Table 4.
| Reaction class | Domain-specific data size | Pretraining dataset | Top-1 accuracy (scratch) | Top-1 accuracy (after TL) |
|---|---|---|---|---|
| a The model trained from scratch did not yield meaningful predictions for carbohydrate reactions.b Accuracy after applying SMILES augmentation during fine-tuning.c Reported as top-3 accuracy when trained on USPTO and tested on Pfizer data. | ||||
| Carbohydrate185 | 25 000 |
USPTO-480k | —a | 72.3% |
| Heck186 | 9959 | USPTO-480k | 66.3% | 94.9% |
| Baeyer–Villiger (Zhang)187 | 2254 | USPTO-480k | 64.7% | 83.8% (88.1%b) |
| Baeyer–Villiger (Wu)173 | 2254 | USPTO-480k | 58.4% | 81.8% (86.7%b) |
| Pfizer ELN188 | 147 392 |
USPTO-Full | 82.9%c | 97.0% |
Graph2SMILES155 in Fig. 9 is a representative example of such a hybrid architecture. The model employs a graph encoder composed of a directed message passing neural network (D-MPNN) enriched with an attention mechanism and graph-aware positional embeddings. This encoder captures both local chemical environments and global molecular topology. A standard Transformer decoder generates product SMILES in an autoregressive manner. This architecture enables permutation-invariant input encoding without requiring SMILES augmentation. On the USPTO-480k dataset for forward reaction prediction, Graph2SMILES achieved a top-1 accuracy of 90.3% using canonical SMILES, closely matching the 90.6% accuracy of the Augmented Transformer29 that relied on extensive SMILES augmentation. This shows that graph-based representations can offer comparable performance with fewer augmentation-related dependencies.
![]() | ||
| Fig. 9 Graph2SMILES model architecture. Reproduced with permission from ref. 155. Copyright 2022 American Chemical Society. | ||
The value of this architectural shift has also been demonstrated in the context of mechanistic prediction. joungreproducing trained both standard Transformer and Graph2SMILES architectures on a large-scale dataset of over five million elementary reactions derived from expert-curated mechanistic templates. The performance gap between these two models on this dataset confirmed prior observations from ref. 155—namely, that graph-based encoders can improve performance even when outputs remain in SMILES format. Beyond performance, their results suggest a broader insight: it is not necessary to redesign the model architecture when moving from overall reaction prediction to mechanistic step prediction. Instead, this shift can be handled at the level of data preparation—by changing what the model is trained to predict, not how the model is built.
While Graph2SMILES focuses on replacing the input representation entirely with a graph, SeqAGraph167 offers a complementary perspective: how to retain compatibility with SMILES-based data augmentation while still incorporating graph-based information. To achieve this, SeqAGraph includes an extra label for each graph input that specifies which atom was used as the starting point when generating the corresponding SMILES. This allows the model to distinguish between different randomized SMILES for the same molecule, while still treating the graph structure in a permutation-invariant way. As a result, standard sequence-based augmentation techniques can be applied without disrupting the benefits of graph-based encoding. Empirically, the model achieved a top-1 accuracy of 88.9% on USPTO-480k forward prediction, outperforming both Transformer and MPNN baselines and showing that simple additions to the input format can make graph-based models compatible with SMILES augmentation. Beyond SeqAGraph, BiG2S168 extends the graph-to-sequence framework to both forward prediction and retrosynthesis within a shared encoder–decoder model, reporting a top-1 accuracy of 89.0% on USPTO-480k.
Graph-based models are capable of encoding stereochemistry through explicit labels for features such as cis/trans isomerism, E/Z configurations, and R/S chirality. In practice, however, many graph-based reaction prediction models omit these annotations. As a result, stereochemical information is often not incorporated into the learned representation.
Sequence-based models built on SMILES (or related line notations) can handle the forms of stereochemistry that these notations explicitly encode, including R/S chirality (denoted using the “@” symbols) and E/Z configurations (represented by “\” and “/”). However, SMILES does not cover all stereochemical types, such as axial chirality (e.g., BINOL derivatives), helical chirality, or certain forms of topological chirality. Consequently, sequence-based models can learn only the stereochemical distinctions that the SMILES representation itself defines and cannot capture other forms of stereochemistry outside this encoding. Although many of these limitations would, in principle, be resolved by using explicit three-dimensional coordinates, most data-driven architectures are designed around 2D graph or sequence inputs. Recent work has begun to explore machine learning architectures that explicitly incorporate three-dimensional information, including conformer-aware representations and equivariant neural networks that operate directly on molecular geometries.189 Such approaches have shown promise in related problems where geometric and stereochemical effects are important, such as enantioselectivity prediction and bond dissociation energy estimation,190 where conformer-level information can improve predictive accuracy. However, the value of incorporating three-dimensional structural information into large-scale reaction outcome prediction models remains uncertain, at least in terms of existing benchmarks, and is partially offset by the computational overhead imposed by conformer generation. As with analogous attempts to incorporate computed features into molecular representations,191 any benefit of explicitly encoding 3D shape is likely to vanish as datasets grow larger.
A second limitation arises from the structure of the reaction datasets themselves. As discussed in Section 2.1, most large-scale experimental or literature-derived datasets provide only the reported major product for each reaction. Minor products, byproducts, and alternative regio- or stereochemical outcomes are rarely documented. As a result, models are not trained to reproduce the full distribution of experimentally formed species but rather to match the single outcome recorded in the dataset. This incomplete ground truth also means that lower-ranked predictions do not necessarily correspond to chemically plausible alternatives (i.e. minor products), since the underlying data do not contain information about the relative likelihoods of unreported products.
A further limitation is the widespread absence of reaction context. Key variables such as solvent, catalyst, stoichiometry, temperature, reaction time, concentration, or the order of reagent addition are missing or inconsistently recorded in most of the datasets summarized in Section 2.1. Because reaction outcomes can depend sensitively on these conditions, models trained only on reactant and product structures must infer reactivity patterns without access to the experimental factors that influence them. This lack of contextual information contributes to poor generalization under domain shifts, such as when evaluated on reactions drawn from different documents, authors, or chemical subdomains.
A further limitation of data-driven models is their tendency to hallucinate chemically implausible products.48,181 Because data-driven models are trained to maximize statistical likelihood rather than to enforce chemical constraints, they may produce structures that are syntactically valid but violate basic principles of chemistry. Such hallucinations can be changes in the number of heavy atoms or hydrogens, violations of electron or valence conservation, incorrect stereochemical assignments, or unintended modifications to substituents peripheral to the true reaction center as shown in Fig. 10. Although several recent models incorporate explicit conservation constraints to prevent violations in atom or electron balance, these approaches do not eliminate hallucinations entirely.49,56,175 The predicted structures may satisfy formal conservation laws yet still be chemically unreasonable, or they may be reasonable molecules that nonetheless could not arise from the given reactants under any feasible reaction mechanism.
![]() | ||
| Fig. 10 Example failure mode of sequence generation models, where lower-ranked predictions are not chemically reasonable. Reproduced with permission from ref. 48. | ||
An early example of this approach is the gradient extremal walking approach (GEWA) from Jørgensen et al.194 In GEWA, the path connecting equilibrium structures is defined as the set of points where the gradient is minimized195 thus following this path should yield the exact minimum energy path (MEP). Yielding the exact minimum energy path can be quite useful but is not necessary if the goal is solely to identify likely reaction products. An example of this is the anharmonic downward distortion following (ADDF) method from Ohno and Maeda.196 Anharmonic downward distortions (ADDs) on a potential energy surface are deviations from the harmonic potential which are characteristic of reaction paths. ADDF searches for these ADDs by performing energy minimizations at various points along the harmonic potential surface to find downward bends in the PES. Selected ADDs are then followed and result in an approximate reactive path.
ADDF has demonstrated the ability to explore many different reacting systems197 but was limited to unimolecular reactions before Maeda et al. extended the approach to bimolecular reactions with the artificial force induced reaction method (AFIR).198–200 AFIR creates an artificial force between two sets of atoms and optimizes on a modified potential energy surface (which is the addition of the actual potential energy surface and the artificial force). If the force between the atoms is strong enough to overcome the barrier of the reaction, the optimization will result in an approximate reaction path. The force serves as a proxy for barrier height for the elementary step since high barrier steps will require a larger stimulus to induce the reactive event. By controlling the magnitude of the force, AFIR searches can be biased towards avoiding high barrier mechanistic steps that are kinetically less plausible. However, this means that multiple iterations of the same sets of atoms with different force constants may need to be run to extract all relevant pathways.
A different approach that conceptually achieves a similar goal without the need for iterative sampling is scanning along the reaction coordinate. Scanning involves forcing a simulation to follow a particular reaction coordinate through the whole corresponding elementary step. One method that implements this is Chemoton with its Newton trajectory scans.201,202 In a Newton trajectory scan, atoms are forced together (as in AFIR) but the force is dynamically updated throughout the simulation to ensure that the atoms are bonded by the end of the simulation. Chemoton uses heuristic rules to automatically select the reactive sites (atoms/pairs of atoms) to push together. This represents part of a completely automated strategy to explore chemical reaction networks. A very similar approach developed by Zimmerman is the Single-Ended Growing String Method (SE-GSM).62 This work evolved the Double-Ended Growing String Method (which will be discussed in a later section) from requiring a reactant and product conformer to only requiring a reactant conformer and a reaction coordinate. SE-GSM builds and optimizes a string along this reaction coordinate to generate the reaction path. Similar to Chemoton, ZStruct-2 is a mechanism exploration program that automatically provides the reaction coordinates to SE-GSM to drive the exploration of the reaction network.203 The imposed activation (IACTA) approach from Lavigne et al.204 expands upon the above methods by allowing reaction coordinates to be defined by bending and torsion angles as opposed to just interatomic distances. This approach is not self-driving, however, and thus requires user input of the reaction coordinate.
Perturbation of the vibrational modes of a molecule presents another method for exploring reactive events in a directed manner. This method, implemented in AutoMeKin,205 is known as Transition State Search using Chemical Dynamics Simulations (TSSCDS).206 To perform the search, a “sizeable” amount of energy is placed on the vibrational mode of interest during a molecular dynamics simulation. This stimulus biases reactive events which are then analyzed post-simulation for reaction paths. Users can input vibrational modes of interest into these simulations, but AutoMeKin also includes an option to automatically explore reactions with ChemKnow.205
Despite being undirected, key to the implementation of this strategy is a means to bias the simulation in a manner that increases the rate at which reactions are observed. Local elevation metadynamics207,208 adds an energetic penalty to molecular conformations that have already been visited during the simulation. This strategy is similar to the one used by bias potential driven dynamics,209,210 where a Gaussian is added to the potential energy surface at the current geometry. In both local elevation and bias potential driven dynamics, molecules are therefore driven out of the local equilibrium wells that they would typically remain in if the simulation were unbiased. As a result, higher-energy configurations are sampled out of necessity until configurations cross over reaction barriers into new local minima.
Rather than simply penalizing already-seen configurations, simulations can be run under other types of forcing conditions. One such condition is the inclusion of an electric field211 as has been used to explore prebiotic Urey Miller chemistry.211 Another condition is high temperature and high pressure as used in the original implementation of the ab initio nanoreactor.212,213 Wang et al. performed molecular dynamics simulations at 2000 K with a simulated piston that compresses to encourage collisions between molecules. The nanoreactor has been used to explore Miller–Urey chemistry213 and has also been applied to glycine synthesis,213 nitromethane decomposition,214 and phenyl radical oxidation chemistry.215 Grimme combined bias potential driven dynamics and a wall potential to explore reactivity of Miller–Urey chemistry (as a common test case), ethyne oligomerization, and the oxidation of hydrocarbons, among others.216
The major advantage of such reactive simulations is that they do not require human input in deciding how to bias the search process. However, to successfully yield interesting reactive events, the reactants must exist in many replicates in the system and the simulation must be run for long enough to observe rare events, even with biasing. These simulations can be very computationally expensive when compared to directed approaches.
Perfectly exhaustive enumeration of all transformations is generally not tractable. In many elementary steps, only a small number of bonds are broken and formed which means enumerating graph edits with only a small number of bond rearrangements can allow graph-based search to be near exhaustive and tractable. This has led these methods to constrain enumeration to allow only up to 2 broken and 2 formed bonds (“b2f2”; Fig. 12). Many reaction types adhere to this constraint, yet some require consideration of more bond breaking/forming events like Diels-Alder, electrocyclizations, and Claisen rearrangements.225 One area of weakness for graph-based enumeration is stereoisomerism. Since graph-based enumeration ignores stereochemical information, methods that employ this enumeration are not capable of explicitly exploring stereoselectivity without an additional subsequent enumeration of stereoisomers.
This class of methods is often referred to as “symbolic” because the rules operate on discrete structural features—such as the presence or absence of certain substructures or bonding patterns—rather than relying on statistical associations or numerical optimization. Chemical transformations are defined in terms of human-readable logic (e.g., “If a molecule contains a carbonyl adjacent to a leaving group, apply an elimination reaction”), and the system applies these rules conditionally based on the molecular structure at hand. Symbolic systems are inherently interpretable, and their behavior can be traced to explicitly defined rules.
Among the most influential examples, the LHASA system226–230 pioneered the formalization of retrosynthetic planning as a set of modular transformations encoded in custom-designed languages. Using PATRAN to specify reactive patterns and CHMTRN to capture logical conditions, LHASA allowed chemists to express context such as steric hindrance, electronic effects, or unstable motifs in an executable form. At the level of individual transforms, these conditions primarily acted as binary applicability filters, determining whether a transformation could be invoked. SECS231–234 shared this symbolic foundation but emphasized planning strategies and user interaction, combining transformation rules with heuristic search and a chemist-in-the-loop graphical interface. Both systems, though differing in implementation, demonstrated that codified chemical logic could meaningfully constrain the space of plausible transformations while preserving interpretability. Their impact extended into forward enumeration: the SAVI project235,236 directly repurposed the LHASA infrastructure to generate synthetically accessible molecules at scale by applying predefined transformations to commercially available building blocks to enumerate billion-member libraries of synthetically-tractable products. More broadly, as in many earlier rule-based forward enumeration efforts predating SAVI, such systems are not necessarily followed by a secondary scoring step to re-rank the products they generate.
Building on earlier symbolic expert systems such as LHASA, AIPHOS (Artificial Intelligence for Planning and Handling Organic Synthesis) was an early example of applying structural pattern recognition to reaction prediction, using heuristic rules.237 The system identified reactive centers by analyzing molecular substructures—such as functional groups, ring systems, and conjugated π-systems—and applied generalized patterns to propose plausible bond changes associated with substitution, addition, or elimination reactions. Unlike symbolic expert systems, AIPHOS did not require predefined, atom-mapped rules for each transformation; instead, it employed a pattern-based knowledge base to match local structural environments to known reaction types. Later developments further abstracted the concept of reactivity by encoding reaction sites as bit sequences that summarized the bonding and stereochemical features of atoms up to several bonds away.238
Rule-based enumeration underpins the Reaction Mechanism Generator (RMG).239–241 Originally developed for combustion chemistry, RMG automates the construction of detailed kinetic models by systematically applying reaction families—predefined sets of transformation rules curated by chemists—to a given pool of reactants. Chemical species are represented as graphs, as described in Section 3.1.1, and elementary steps are generated by matching reactive substructures to family templates and applying the associated bond rearrangements. The initial set of reaction families, established by Green and co-workers, encoded mechanistic knowledge and kinetic data from the literature to capture key pyrolysis and oxidation chemistries. This foundation of roughly thirty reaction families has since been continuously expanded. These families are not derived automatically from datasets but are hand-crafted symbolic rules, with the scope of possible outcomes determined entirely by explicit chemical knowledge. By repeatedly applying these rules to expand the reaction network, RMG produces a comprehensive yet chemically constrained ensemble of plausible products. RMG has been broadly applied across diverse chemistries, demonstrating how symbolic expert systems can scale forward enumeration to thousands of reactions while maintaining transparency through explicitly encoded logic.241 RMG was preceded by NetGen, a framework for automated network generation introduced by Broadbelt and co-workers.242 In NetGen, molecules are similarly represented as graphs or matrices, and chemical transformations are encoded as reaction operators that define how substructures can change during a reaction. These operators, corresponding to reaction families such as oxidation, cleavage, or condensation, are predefined by chemists.
Building on this foundation, the Biochemical Network Integrated Computational Explorer (BNICE)243–249 extends these formalisms to metabolic and biodegradation chemistry. Instead of general chemical operators, BNICE encodes generalized enzyme functions derived from the Enzyme Commission (EC) classification system, with each rule describing bond rearrangements catalyzed by a class of enzymes. Given an initial substrate, BNICE identifies functional groups and applies all relevant enzyme rules to generate new products, repeating this process iteratively to construct metabolic reaction networks. These rules are manually curated from biochemical databases such as KEGG250 and University of Minnesota Biocatalysis/Biodegradation Database (UM-BBD).251 In practice, BNICE has been used to reconstruct known metabolic routes, propose novel biodegradation pathways for xenobiotics, and analyze the combinatorial space of natural product biosynthesis.
Together, these symbolic expert systems demonstrate how hand-crafted logical rules provided early solutions for enumerating chemically reasonable reactions. Although largely superseded by data-driven methods, their modularity, transparency, and chemically grounded constraints established enduring design principles that continue to shape modern approaches to reaction prediction.
CAMEO (Computer-Assisted Mechanistic Evaluation of Organic Reactions) was among the first systems to generate reaction candidates through mechanistic reasoning rather than symbolic rule application.252–256 It encodes mechanistic modules that mimic physical organic reasoning rather than fixed structural templates. After perceiving functional groups and structural features, the program assigns relative acidities (values) and nucleophilic or electrophilic sites, then applies predefined mechanistic steps such as proton transfer, substitution, elimination, or addition. Competing pathways, such as proton transfer versus organometallic addition or halogen–metal exchange, are resolved using heuristics based on relative base strength and leaving group ability, so that only chemically reasonable transformations are instantiated. The resulting products are then screened to exclude unstable or implausible structures, yielding a set of mechanistically-grounded reaction outcomes.
EROS (Elaboration of Reactions for Organic Synthesis) applies generalized reaction schemes known as RGmn, where m bonds are broken and n bonds are formed in a concerted rearrangement.257–259 These schemes represent broad mechanistic motifs such as substitutions, eliminations, or pericyclic transformations, and are instantiated only when matching substructures are present. To control combinatorial growth, EROS employs physicochemical filters, including atomic charges, bond dissociation energies, or polarizability, estimated automatically by the PETRA (Parameter Estimation for the Treatment of Reactivity Applications) package. By combining formal reaction generators with descriptor-based constraints, EROS produces chemically plausible reaction outcomes that extend beyond predefined named transformations.
ROBIA (Reaction Outcomes by Informatics Analysis) similarly models transformations through stepwise mechanistic simulation rather than by substructure substitutions.261,262 It identifies functional groups and reactive sites in the input molecules, then applies curated transformation scripts corresponding to mechanistic classes such as enolate formation, aldol condensation, or Diels–Alder cycloaddition. Each transformation generates intermediates and all possible stereoisomers, with unstable structures filtered post hoc.
SOPHIA (System for Organic Reaction Prediction by Heuristic Approach) extended the principles of AIPHOS by introducing a more systematic and generalizable framework for pattern-based reaction enumeration.263,264 Rather than relying on predefined rules, SOPHIA constructed a reaction knowledge base by abstracting common transformation patterns from curated reaction examples. Each reaction center was encoded based on structural and bonding features—including atom types, bond orders, and neighboring environments up to the α, β, and γ positions—allowing the system to represent reactivity in a flexible, transferable format.
During enumeration, SOPHIA scanned the input molecule for substructures matching any of the abstracted patterns and applied the corresponding bond rearrangements to generate product candidates. These patterns were not exact subgraph matches but generalized structural motifs that captured common reactivity features. Unlike template-based methods (Section 4.1.3), SOPHIA did not require atom-mapped reactions or preservation of stereochemistry. Instead, it used structural plausibility and heuristic criteria to filter unreasonable candidates, enabling enumeration even in cases where no directly analogous transformation had been observed.
More recent approaches have shifted toward extracting transformation rules directly from reaction data.265,266 These methods identify recurring structural changes from large collections of known reactions, allowing templates to be constructed algorithmically rather than encoded by hand. The underlying philosophy remains the same, which focuses on chemical transformations guided by rules or templates. Data-driven template-based enumeration provides a systematic way to explore chemically plausible outcomes while balancing reaction scope, structural specificity, and computational feasibility.170,267
Although this discussion focuses on forward prediction, template extraction is equally central to retrosynthetic analysis.268 Methods ranging from systems with manually encoded rules to approaches that automatically extract retrosynthetic templates from reaction databases,269 as well as data-driven approaches that identify minimal reaction cores,265 SMARTS-based templates with careful treatment of stereochemistry,266 and generalized transformation patterns270,271 demonstrate that retrosynthetic template construction and application are mature techniques that share the same template-based strategies in the forward direction.
The balance between generality and specificity in template definition can be illustrated by comparing different levels of representation Fig. 13. A highly specific template, such as those generated by RDChiral,266 preserves surroundings of the reaction center, stereochemistry, and electronic states ensuring that the encoded transformation closely reflects the original chemical context. In contrast, local templates270 focus only on the immediate connectivity changes at the reaction center, while generalized templates170,267 abstract the transformation to broader symbolic patterns that can be applied across diverse chemical classes. As shown in Fig. 13, increasing specificity improves chemical fidelity but narrows the scope of applicability, whereas generalized templates expand coverage at the cost of reduced structural precision.
![]() | ||
| Fig. 13 Representative example illustrating how reaction templates can be extracted from a chemical reaction and expressed at increasingly abstract levels of representation. Reproduced with permission from ref. 267. Copyright 2024 American Chemical Society. | ||
Once constructed, these templates can be systematically applied to reactants, where only those containing the required reaction center are activated. This procedure enumerates a diverse set of possible products, regardless of whether they are chemically plausible in practice.272 Alternatively, template libraries can be restricted to a representative set of well-characterized reaction types, allowing them to generate expected products from given reactants in a more controlled manner.
Building on this idea, Joung et al.48 showed that forward template enumeration can be applied in a constrained setting to impute mechanistic information from overall reactions. In this framework, a fixed sequence of mechanistically plausible transformation templates was applied to the reported reactants within a given reaction class, and only the path that reproduced the experimentally observed product was retained. This procedure enabled the construction of a mechanistic dataset in which each overall reaction was expanded into a plausible sequence of elementary steps. A subsequent study introduced the FlowER dataset,49 which employed stricter constraints by explicitly enforcing mass, proton, and electron conservation during template-based mechanistic reconstruction. As noted in Section 2.1, overall reactions are widely documented, but mechanistic annotations are rare. These studies illustrate how forward template enumeration, when guided by class-level priors or conservation rules, can be used to derive mechanistic interpretations from existing reaction data.
Another recent example of forward template enumeration used for mechanistic reconstruction is the MechFinder framework introduced by Chen et al.51 In this system, reaction templates are extracted from atom-mapped overall reactions and applied in the forward direction to generate intermediate structures along a proposed mechanistic path. A machine learning model is used to select the appropriate mechanistic template for each reaction, based on the local transformation pattern. Each mechanistic template specifies a predefined sequence of elementary steps, which is then deterministically applied to the reactants to yield a multi-step transformation matching the reported product.
This idea can be further extended to the discovery of multicomponent reactions. Such reactions allow the construction of complex scaffolds from simple starting materials in a single step, and many of them have historically been uncovered largely by serendipity. By leveraging a mechanistic template library, it becomes possible to construct reaction networks in which three or more reactants participate simultaneously. When combined with the evaluation strategies discussed in the following Section 4.2, these networks can be systematically assessed to identify novel multicomponent reaction pathways.58,274,275
Several studies have applied template-based enumeration to generate candidate outcomes for site-selectivity problems, where the transformation is fixed and the goal is to identify all chemically plausible application sites within a molecule. Tomberg et al.276 used a template for electrophilic aromatic substitution and applied it exhaustively to all eligible positions on aromatic systems to enumerate possible substitution products. Guan et al.277 followed a similar approach for nucleophilic aromatic substitution, systematically generating candidates by applying the SNAr template across all reactive aryl halide sites. Hoque et al.278 used template-based enumeration to identify all hydrogen atoms that could plausibly participate in radical abstraction reactions. These examples show how, when the scope of the question is narrow, enumeration can be reduced to the application of a single transformation rule across multiple candidate sites. Similar strategies have been adopted in a range of studies dealing with selectivity.
Broadly, current ML-based enumeration methods fall into two categories, as illustrated in Fig. 14. The first approach focuses on identifying pairs of reactive atoms—often formulated as an electron-donating (source) and electron-accepting (sink) sites in polar reaction mechanisms—based on their chemical environment and topological context.22,56,279–284 While this formulation aligns naturally with polar reaction mechanisms, related extensions have adapted similar enumeration principles to radical or concerted reactions by modifying how reactive centers and bond changes are represented.285,286 These reactive pairs are then enumerated to construct plausible mechanistic steps or reaction products. The second approach predicts, for each atom pair, the likelihood of a change in bond order after the reaction.21,48,171 These bond-centric predictions are assembled into full product candidates under chemical constraints such as valence and connectivity. Despite differences in formulation, both strategies aim to sample a manageable set of candidate outcomes that are chemically plausible and consistent with learned reactivity patterns.
The Baldi group has developed a sequence of models that frame reaction prediction as a two-stage process centered on identifying reactive source and sink sites. The earliest of these, ReactionPredictor,281,282 treats reactions as elementary mechanistic steps involving electron flow from a source to a sink atom or orbital. In the first stage, atom-level reactivity is predicted using classifiers trained on graph-topological and physicochemical descriptors, such as atom type, partial charge, and local bonding environment. The second stage performs exhaustive enumeration of all source-sink atom pairs to generate candidate mechanistic steps. The extended version of ReactionPredictor282 supports not only polar but also radical and pericyclic mechanisms, and introduces a representation based on idealized molecular orbitals (MOs) such as lone pairs and π-bonds (donors), or empty orbitals and σ*-bonds (acceptors).
Later work by the group elevated orbitals from a conceptual enumeration unit to the primary learning objective, introducing orbital-level reactivity prediction to more explicitly reflect the electronic structure of molecules. In this model,285 each atom is associated with a set of possible molecular orbitals (MOs)—such as lone pairs, π-bonds, or empty orbitals—and each MO is encoded using a tree-structured fingerprint that captures its local chemical environment, replacing hand-crafted atomic descriptors with a learned orbital-centered representation. A fully connected neural network is trained to assign reactivity scores to these orbitals using a balanced dataset of reactive and nonreactive examples. High-scoring source and sink orbitals are then paired to enumerate plausible elementary steps. While preserving the two-stage framework of ReactionPredictor, this approach replaces hand-crafted, heuristic-based atomic descriptors with learned representations that more directly reflect chemically meaningful orbital interactions.
A related line of work explored ML-based forward enumeration by leveraging learned reactivity models to expand reaction networks. Tavakoli et al.287 applied a data-driven reactivity predictor to multi-step network expansion, where possible reaction pathways were enumerated from given reactants based on pairwise donor–acceptor combinations. In this framework, the model scores the likelihood of electron donation and acceptance at the atom level, and these scores are used to guide the systematic construction of reaction pathways.
The underlying reactivity model was introduced earlier by Tavakoli et al.,284 who framed nucleophilicity and electrophilicity prediction as supervised learning tasks using methyl cation and anion affinities (MCA*, MAA*) as training targets. The model takes as input a combination of graph-topological and physicochemical atom-level descriptors and is implemented using graph neural network architectures, including Graph Convolutional Networks (GCNs) and Graph Attention Networks (GATs). These learned reactivity scores provide the quantitative basis for the subsequent enumeration and expansion of reaction networks.
More recently, the RMechRP model286 introduced a contrastive learning framework for mechanism prediction in radical reactions. Rather than independently classifying individual atoms or orbitals, RMechRP directly learns to score pairs of orbitals by contrasting the true reactive pair against all other candidate pairs within the same reaction context. This formulation collapses reactive site identification and ranking into a single learning objective, avoiding an explicit two-stage enumeration-and-filtering pipeline. By scoring orbital pairs in their full molecular context, the model is able to prioritize chemically relevant interactions while maintaining consistency with radical reaction mechanisms.
Several models have adopted the idea of identifying reactive donor and acceptor sites as a basis for enumerating plausible mechanistic steps. These approaches differ in how they represent reactivity, the source of supervision, and the structure of the enumeration procedure. QC-RP283 uses quantum chemical descriptors—such as condensed Fukui indices, atomic charges, and HOMO/LUMO energies—to identify likely reactive atoms, which are then paired to generate candidate steps. In contrast, ELECTRO22 learns to generate entire sequences of source–sink transitions that trace a linear electron flow (LEF) from reactants to products. This path-based formulation mimics arrow-pushing diagrams but is trained on electron flow sequences heuristically extracted from atom-mapped LEF-type reactions, rather than on individual atom-level descriptors. Reactron56 also constructs multi-step pathways iteratively, with broad coverage of reaction types and close adherence to well-curated ground truth mechanisms. One reactive step is predicted at a time, the molecular graph is updated, and enumeration continues until the product is reached. While these models vary in training data and representation—ranging from pairwise descriptors to path-based and graph-based updates—they share the common goal of learning how electron redistribution events drive chemical transformations.
, significantly reducing computational cost while retaining broad coverage of plausible reaction outcomes. To ensure chemical validity, valence constraints and graph-theoretic rules are applied during enumeration. This formulation eliminates the need to assign explicit mechanistic roles to atoms, enabling scalable, template-free enumeration. However, because bond edits are predicted independently, product structures must be assembled post hoc, and key transformations may be missed if critical edits are misranked.Joung et al.48 adopt the same WLDN architecture but apply it to model individual elementary steps, expanding the output space to include changes in hydrogen count and formal charge. This additional complexity reflects the mechanistic nature of the data, which includes stepwise transformations and chemically valid intermediates. As a result, enumeration must consider combinations of bond, proton, and charge edits. The WLDN-based edit prediction thus becomes one component of a larger pipeline aimed at reproducing full mechanistic pathways rather than predicting only the final product.
| Year | Method | Description |
|---|---|---|
| 1983 | MaxFlux303–305 | Finds pathway with maximum flux for diffusive dynamics assuming isotropic friction |
| 1987 | Elastic band306–309 | Images along pathway are optimized with artificial springs between them |
| 1994 | Synchronous transit-guided quasi-Newton310 | Linear/quadratic synchronous transit followed by an optimization with quasi-Newton or eigenvector following |
| 1994 | Nudged elastic band311–313 | Elastic band with nudging mechanism to avoid corner cutting |
| 1996 | Variational reaction coordinate method314–317 | Optimizes the line integral of the potential energy gradient norm |
| 2000 | Climbing image nudged elastic band318 | Nudged elastic band where highest energy image is pushed to the top of the reaction path |
| 2002 | String method319 | Minimize perpendicular energy gradient along a string between reactant and product |
| 2004 | Growing string method62,320,321 | String method where the reactant and product propagate along the string toward each other |
| 2004 | Hamilton-Jacobi322 | Hamilton-Jacobi equations are solved with the fast marching method to determine the MEP |
| 2008 | Spline Saddle323,324 | Cubic splines representing the reaction path are optimized to the saddle point |
| 2014 | Image dependent pair potential interpolation325 | Nudged elastic band with an atomic distance-based pair potential objective function, used to provide initial path guesses |
| 2018 | ReaDuct326 | Spline saddle where the whole MEP is optimized |
| 2019 | Geodesic interpolation327 | Optimize a geodesic in internal coordinates on the Morse potential, used to provide initial path guesses |
| 2021 | Energy weighted nudged elastic band328 | Nudged elastic band where the springs are weighted by energy |
| 2024 | TS-Tools329 | Constrained optimizations push the reactant geometry to the product geometry |
One of the original algorithms to perform this task is the elastic band method.306–309 The elastic band method places artificial springs between the geometries along the path and optimizes the entire path. The springs ensure that the geometries fall into their respective well allowing the geometries to span the whole reaction path. While this method can yield a valid reaction path, the highest energy geometry may not be near the transition state as the path may take a shortcut through a curved pathway due to the tension from the springs. This behavior resulted in the development of two of the most commonly used double-ended reaction path optimization methods: nudged elastic band (NEB)311–313 and the string method.319 NEB implements “nudging” to decouple the role of the geometry relaxations and springs in the elastic band thus ensuring that optimization to the global minimum will yield the MEP. A further improvement to NEB, climbing image nudged elastic band (CI-NEB), pushes a geometry on the reaction path to the highest energy point. This helps bring the highest energy geometry towards the transition state.
The string method represents a departure from the elastic band formulation. The string method defines and optimizes towards a string such that the energy gradient perpendicular to the path from a geometry relaxation is 0 along the reaction path (which is the definition of the MEP).319 Intermittent reparameterization ensures geometries are equally spaced along the path. All of the previously discussed methods require the user to input a path between the reactant and product. If this input path (which often is a linear interpolation between the reactant and product geometries) is far away from the MEP, these algorithms may struggle to converge. The double ended growing string method (DE-GSM) addresses this by building the path between the reactant and product during the optimization.62,320,321 The convergence to the MEP from both NEB and GSM have resulted in their widespread use to explore reaction paths.
After performing any of these reaction path optimizations, a transition state-like conformer can be extracted at the highest energy image along the path. Whether or not the applied method converges exactly to an MEP, it is often prudent to further refine this geometry to achieve a tighter convergence criteria. This can be done with a local saddle point optimization. Local saddle point optimizations follow the eigenvector associated with the largest absolute negative eigenvalue of the molecular hessian to the saddle point on the PES. This approach is shared amongst several popular methods like the Bernie algorithm,330 rational function optimization (RFO)331,332 and its successor restricted step partitioned rational function optimization (RSPRFO),333 and the dimer method.334,335 Due to these methods' reliance on the molecule's Hessian, convergence is quite slow which makes having an accurate guess for the transition state vital for convergence. This further motivates machine learning methods to generate transition state guesses discussed in the following section.
Once a saddle point is identified, it is important to identify whether that saddle point corresponds to the transition state of the mechanistic step of interest. Although this can be done qualitatively by identifying whether the negative frequency associated with the saddle point corresponds to the bond formation of interest, a more rigorous approach is to perform an intrinsic reaction coordination calculation.336,337 This calculation establishes the reactant and product associated with the transition state by following the paths of steepest descent from the saddle point.
Early approaches formulated this task as a regression problem where the reactants and products are the inputs and the transition state geometry is the output. The reactants and products can be represented as 2D graphs as in Graph-2-Structure338 but were more often represented as 3D geometries. Several architectures like kernel ridge regression (KRR) in Graph-2-Structure,338 graph neural networks in TS-Gen,339 tensor flow networks in TSNet,340 and transformers in an unnamed method from Choi341 were used as models for this task. An important attribute for each of these implementations is either equivariance or invariance to translation and rotation of the inputs (when the input is 3D). This is because translation and rotation of the input reactants/products geometries should either not change the output transition state coordinates (invariant) or change them in a reliable way (equivariant). This can be achieved with an equivariant architecture like the tensor flow networks in TSNet340 or by predicting an invariant representation of the geometry like the distance matrix like in TSGen339 and in Choi.341
More recent approaches for this task have used generative strategies to improve the accuracy of generated transitions states. A generative adversarial network (GAN) based approach was implemented in TS-GAN,342 however significant performance enhancements were not seen until the introduction of diffusion. OA-ReactDiff343 and TSDiff344 train SE(3) equivariant graph neural networks to denoise from Gaussian noise to transition state structures. Both of these methods provide additional context to steer the diffusion towards the correct transition state. OA-ReactDiff uses a technique called “inpainting” where the reactant and product denoising trajectories are included in the model input whereas TSDiff includes a 2D graph representation of the reaction in its input. These methods provided state of the art results compared to previous methods, however diffusion models are more computationally expensive to inference than other generative approaches. To reduce computational expense, flow matching models for transition state generation were developed. Flow matching models learn the linear interpolation between a set of input data to a set of output data, the transition state structures for this task. The input data represents the reaction of interest; in React-OT345,346 it is the reactant and product geometries, in GoFlow347 it is the 2D condensed graph of reaction,348 and in MolGEN302 it is the 2D graphs of the reactant and product. These models represent the current state of the art for this task. For a more in-depth overview of transition state generation with machine learning, we point the readers to Beaglehole et al.349
Similar to the previous section, many of these approaches use simulated datasets (Section 2.1.4) as training data. However, since these models directly predict experimental observables (activation energy/rate constant), they can also leverage experimental datasets. These datasets tend to be smaller than computational datasets but directly represent the ground truth value of interest. Most of the approaches discussed here predict the activation energy, but the activation energy and rate constant can be readily interconverted with the Eyring equation.
There has been much work to develop reaction input representations to learn reaction properties from data. A commonly employed 2D representation, the condensed graph of reaction (CGR), superimposes the reactant and product 2D graphs to serve as input into a graph neural network.348,350 A similar model, EquiReact, utilizes a graph neural network but instead takes 3D geometries of the reactant and product as input.351,352 Benchmarking has shown that these input representations outperform baselines like reaction fingerprints353 and that 3D methods outperform 2D methods when reaction mapping is not available.354 A more in-depth review of the representations and architectures used for activation energy prediction can be found in De Landsheere et al.355
One strategy that has been explored for improving the performance of these models is providing them with extra information beyond the structure of the reactant and product. This information tends to be derived from the calculations that generate the activation energy training data like geometries and quantum mechanical descriptors of reactants, products, and transition states and has been shown to improve performance of these models, particularly in generalization to unseen compounds and in data-sparse regimes. One way to incorporate this information is directly provide it as input to the model. This has been employed with semi-empirical transition states,356 semi-empirical descriptors,357 DFT reaction energies,358 and DFT orbital energies.359 Although this can yield significant performance improvements, making predictions with these models is more computationally expensive since the additional information must be calculated for each prediction. To avoid this extra computational expense, models can be trained to jointly predict the extra information (descriptors, geometries, etc.) along with the activation energy in a multi-task learning or two-stage formulation. This approach is employed with QM descriptors in Stuyver and Coley360 and with transition state geometries in Karwounopoulos et al.361 and improves the generalization of the model in both cases.
There have been efforts to predict activation energies and rate constants from experimental datasets. A model with a CGR architecture has been successfully employed for the prediction of both E2362 and SN2363 rate constants. However the size of these datasets (>1000 data points) is uncommon for most reaction classes so different approaches have been leveraged to make predictions in sparser data regimes. For example, as discussed in the previous paragraph, the strategy of providing additional information beyond the structure of the reactant and product in the model input has been used to improve predictions for nucleophilic aromatic substitution reactions. DFT features of the transition state364 and DFT barrier heights365 have served as inputs to models to improve accuracy. These models beat CGR baselines but degrade in performance when there is disagreement between DFT and experimental results. Another approach to training models on small experimental datasets is to train a model to predict parameters in a few-parameter empirical relationship. This can help prevent overfitting on minimal data as there are fewer parameters to learn and performance is grounded by the classical technique. Examples of this approach include learning the residuals of the Hammett model for SN2 rate constants366 and learning multivariate linear models based on QM and steric descriptors to predict inverse-electron demand Diels–Alder cycloaddition activation energies.367
Activation energy is often correlated with reaction enthalpy through linear free-energy relationships such as the Evans–Polanyi equation.373 Although useful empirically, these relationships are not generally transferable and require prior knowledge of the reaction landscape. Reaction enthalpy itself can be obtained as the difference between the standard enthalpies of formation of products and reactants. One well-established approach for estimating the enthalpies of formation of arbitrary organic molecules are group additivity methods. Group additivity/group contribution methods provide a way to estimate enthalpy based on the functional groups present in a given molecule using experimental data. The first widely used group additivity model was the Benson group increment theory (BGIT).374–376 This model accounts for atomic, bond, and group contributions to the heat of formation where groups account for the local environment surrounding an atom. The model's parameters are derived from experimentally calculated heats of formation. The Gronert model377 extended BGIT by accounting for 1,2- and 1,3-interactions, improving accuracy for alkanes and trends in bond dissociation energies. More recently, the topology-automated component increment theory (TCIT)378 has replaced empirical parameters with quantum chemically derived ones, enhancing transferability across diverse molecular classes.
Beyond these classical approaches, machine learning models and hybrid quantum–statistical frameworks have emerged for predicting both enthalpies and Gibbs free energies directly from molecular structure. There are many parallels between these modeling efforts and the activation energy prediction models discussed in Section 4.2.2.4. For example, simulated datasets (Section 2.1.4) are commonly used to train these models; however, since transition state energies aren't necessary, thermodynamic models can leverage more data like ground state quantum chemical calculations379 or heats of formation.380 Thermodynamic ML models also share model architectures348,381–383 and augmentation strategies384–386 with kinetic ML models. A common formulation for thermodynamic models is Δ-ML which means using an ML model to correct a less accurate calculation towards a more accurate calculation or experimental value.384,385,387 This is accomplished by training a model to predict the difference between the less accurate method and a target value. This mirrors the input augmentation strategy discussed previously and allows researchers to produce higher accuracy thermodynamic predictions at a lower computational expense.385,387–391 Together, these classical and data-driven strategies link thermodynamic evaluation to kinetic feasibility, offering complementary metrics for assessing reaction likelihood when explicit transition-state information is unavailable.
Broadly, methods for general likelihood evaluation can be grouped into three major categories: (1) binary classification of reaction feasibility, which rapidly evaluates the validity of reactant–product pairs in an absolute sense, one-at-a-time; (2) template-ranking approaches, which select and prioritize reaction templates to intrinsically guide outcome prediction; and (3) enumerated candidate ranking, which explicitly considers multiple potential outcomes and ranks their relative likelihoods. Each reflects a different modeling choice in how to assign or learn scores over candidate reaction outcomes, and they differ in terms of computational cost, interpretability, and the form of output they provide.
A major challenge in training feasibility classifiers lies in obtaining negative examples—reactions that are chemically unreasonable or fail to occur—since failed or low-yield reactions are rarely reported in literature. As a result, most reaction datasets are heavily biased toward successful transformations. To construct balanced training data, studies such as the in-scope filter392 and the fast filter272 generated artificial negatives by applying extracted reaction templates to random sets of reactants (as discussed in Section 4.1.3. Each template, originally derived from a known successful reaction (A + B → C) was reapplied to produce hypothetical products (D, E, …) not observed in the dataset. The known reaction yielding C was treated as a positive example, while all other hypothetical outcomes were labeled as negatives, enabling the classifier to distinguish feasible from infeasible transformations. To further mitigate data imbalance, Wollenhaupt et al.393 later demonstrated that supplementing a large dataset with a small number of curated positive reactions, 14
500 rare cases from CAS collection, can substantially improve classifier performance on underrepresented reaction types without compromising general accuracy. An alternative perspective was introduced by Strieth-Kalthoff et al.,394 who showed that feasibility classifiers can be effectively trained using only positive and unlabeled reaction data, without explicitly constructing negative examples. By framing reaction feasibility as a positive-unlabeled learning problem, their work demonstrated that reliable classifiers can be obtained despite strong reporting biases in the literature, challenging the assumption that large numbers of explicit negative reactions are required for training.
A representative implementation of this approach is the reaction type classifier introduced by Wei et al.,395 in which a neural network is trained to predict the most likely reaction class from a set of 17 categories including a null class for cases where no reaction occurs. The input consists of concatenated fingerprints of reactants and reagents, and the model outputs a probability distribution over reaction types, each associated with a predefined SMARTS rule. Building on this idea, Segler and Waller172 proposed a neural-symbolic framework that generalizes reaction-type classification to thousands of automatically extracted transformation rules. In this model, each reaction rule (i.e. template) is treated as a distinct class, and a neural network is trained to predict the probability distribution over these rules given the molecular fingerprints of the input reactants. Though these models are trained solely to recapitulate the template that matches what is recorded experimentally, these models implicitly learn which transformations are compatible with a given molecular context, prioritizing feasible rules while deprioritizing chemically inconsistent ones.
While template-ranking models effectively prioritize plausible transformations, predicting the most probable template does not uniquely determine the product structure. Many reactions, especially those involving multiple reactive sites or competing reaction centers, can yield several possible atom mappings that satisfy the same transformation rule, requiring subsequent modeling of selectivity to obtain unique product representations.
At the most microscopic level, reaction likelihood can be evaluated in terms of a single electron-transfer, predicting the likelihood of interactions between electron-rich and electron-poor centers, often referred to as source-sink pairs. This framing naturally mirrors the conventions of arrow-pushing diagrams, where reaction mechanisms are represented as a sequence of localized electron movements. As a result, models that predict source–sink pairs step-by-step can be interpreted as approximating reaction mechanisms, with each pairwise decision corresponding to a plausible elementary event in a multistep pathway.396
Several models take the source–sink interaction as the fundamental unit of prediction, but differ in how they generate and evaluate candidate pairs. One strategy is to construct reaction pathways iteratively, predicting one pair at a time while deciding at each step whether the reaction should continue or terminate. This framework can be seen in ELECTRO,22 which generates sequences of electron movements resembling arrow-pushing diagrams, although it does not explicitly score the plausibility of individual steps. Reactron,56 in contrast, builds on this approach by associating each predicted electron move with a learned local compatibility score that guides sequential generation. Importantly, this score is used during the construction of the reaction pathway rather than to rank a set of pre-enumerated candidate outcomes. Rather than evaluating complete reactant-product pairs, Reactron updates the molecular structure after each predicted source-sink move, producing a sequence of intermediate-like structures that reflect the evolving electron flow. Although the model progresses through structures that include chemically meaningful intermediates—as these are present in the training mechanisms—it does not explicitly segment them as discrete steps, but instead models the reaction as a continuous chain of movements. Unlike graph-edit models such as MEGAN,159 which generate and rank final reaction products using beam search over action sequences, Reactron does not produce scores for a set of multiple finished products, but instead predicts the evolution of a single reaction trajectory at a time.
A one-shot scoring strategy evaluates all candidate source–sink pairs in parallel, enabling direct comparison of alternative electron-transfer events based on local chemical context. Such models differ in how reactivity is represented, ranging from explicit quantum or physicochemical descriptors283 to learned graph-based representations.285–287
Beyond electron-level modeling, candidate-ranking can also be defined at the level of complete reactant–product pairs. Here, reactivity is encoded as the structural difference between reactants and products. In the model of Coley et al.,28 each candidate is represented by an edit vector that records bond formations, bond cleavages, and changes in hydrogen count; these vectors are processed by a neural network and converted to probabilities via softmax normalization. The Weisfeiler–Lehman Difference Network (WLDN)21,171 generalizes this concept using a graph neural network that featurizes each atom by the difference between its reactant and product embeddings, improving expressiveness and accuracy by capturing both local edits and global context.
Difference-based and embedding-based strategies share the same underlying goal: assigning likelihoods to enumerated products by learning how structural changes correlate with known reactivity. Another approach is embedding-based methods learn a latent representation of the joint reactant–product pair that implicitly captures their transformation. These models use message-passing neural networks or attention-based encoders to process the full reaction context and assign a likelihood score based on the resulting features. For instance, LocalTransform170 and its extension in Chen et al.267 use GNNs to embed both the reactants and candidate products together and learn a compatibility function over the joint representation. Rather than focusing on structural deltas, these models learn to assess whether a given reactant–product combination is chemically coherent, modeling reactivity as a function of their overall configuration.
Within this category, two problems have received particular attention: selectivity and yield prediction. Both quantify the extent of reaction progress but differ in focus: selectivity describes the relative preference among multiple plausible pathways, while yield measures the overall efficiency of product formation. The following subsections review machine learning approaches to these two domains.
Selectivity encompasses multiple levels of preference, including chemoselectivity, regioselectivity, and stereoselectivity, each presenting distinct modeling challenges. While chemoselectivity overlaps with major product prediction, regio- and stereoselectivity require finer discrimination among closely related alternatives. Because these differences often arise from subtle steric or electronic effects, many models incorporate physically interpretable descriptors, such as Sterimol parameters,400 to complement learned molecular representations. The following subsections survey representative machine learning approaches to regioselectivity and stereoselectivity, illustrating how empirical observables and chemical priors are integrated into quantitative models of reaction outcomes.
Regioselectivity prediction can be approached in two distinct ways, depending on how models define and evaluate reactivity at different sites within a molecule. Ranking-based approaches treat the task as a relative comparison among candidate positions. Given a predefined set of possible sites, these models assign a score, rank, or probability to each, selecting the most likely site of reaction based on the highest value. This formulation can be found in models trained on product annotations or experimental site distributions, as in cytochrome P450 metabolism,403 nucleophilic aromatic substitution,277 and aromatic C–H functionalization.404,405 In many cases, these models output soft probability distributions over sites. For example, a model may assign 70% likelihood to one position and 20% to another.404 This type of label can often be derived from existing reaction databases without the need for quantum mechanical calculations, making it suitable for large-scale or high-throughput settings.406–409 It has been demonstrated that intentional dataset design with active learning can improve the performance of these models in data sparse regimes.410
In contrast, site-wise regression models are trained to predict an absolute reactivity-related value for each candidate site. These values are typically physical or thermodynamic quantities, such as bond dissociation energy (BDE), free energy barrier (ΔG‡), or hydricity, which is the free energy required for heterolytic cleavage of a C–H bond to form a hydride ion (H−).277,411,412 Such models require a numerical label for each site, often obtained from quantum chemical calculations. Examples include predicting BDEs for sp3 or sp2 C–H bonds,403,411,413–418 computing activation energies for radical substitutions,411 or estimating hydricity values to assess hydrogen atom transfer reactivity.419 These labels are more costly to generate than categorical site annotations, but they enable detailed mechanistic interpretation and generalization beyond training examples.
The modeling strategies used in ranking- and regression-based approaches differ both in architecture and in how transparently they reflect underlying chemical principles. Ranking models often rely on neural architectures such as message-passing networks or attention mechanisms that learn from molecular graphs or SMILES strings without explicitly defined input features.404,420 While these models can achieve strong predictive performance, the source of their predictions is often difficult to interpret, as the learned representations do not correspond directly to chemically meaningful quantities.397 In contrast, many site-wise regression models use descriptors grounded in physical organic chemistry, such as partial atomic charges,421 Sterimol parameters,405,421 Fukui indices,422 or quantum-derived values like BDE413–416,418 and orbital energies.405,421 These descriptors have been employed in models that predict site selectivity in contexts such as C–H oxidation421,422 and electrophilic aromatic substitution.276,405,423 Because these input features are chemically interpretable and often mechanistically motivated, regression models are more amenable to rational analysis and post hoc interpretation than deep learning models trained end-to-end on structural data.411
Some recent studies combine these two perspectives into hybrid frameworks that aim to balance computational efficiency with mechanistic fidelity. These models use a ranking component to rapidly screen or prioritize candidate sites and apply regression or quantum calculations only to cases with uncertain or conflicting predictions. For example, Guan et al.277 predicted site selectivity for nucleophilic aromatic substitution using a graph neural network, and triggered DFT calculations when the model's confidence was low. Similarly, Seumer et al.412 employed a two-step strategy in which machine learning models first narrowed down reactive sites for Pd-catalyzed C–H activation, and quantum chemical methods were then used to evaluate the relative energies of plausible intermediates. A related philosophy also appears in descriptor-based studies that integrate empirical models with quantum refinement.411,422,424 Such hybrid strategies offer a practical solution for workflows that demand both speed and reliability.
:
10 er, while a difference of around 6 kcal mol−1 leads to near-complete selectivity (99
:
1).426 Because this metric is continuous, centered near zero for unselective reactions, and experimentally quantifiable with high precision, it provides a suitable target for model training in well-characterized reaction classes.426 As in regioselectivity prediction, many models rely on quantum chemical descriptors and steric parameters.397,427 However, the limited availability of large-scale datasets containing experimentally measured er or ee values has restricted the development of models based on learned molecular representations.428Despite the increasing use of machine learning in modeling stereoselectivity, the availability of large-scale labeled data remains a fundamental limitation.426,428,429 Experimentally determined er or ee values are labor-intensive to obtain and are rarely included in open-access reaction databases.426 This scarcity is particularly problematic in asymmetric catalysis, where the majority of reported reactions are biased toward high selectivity and favorable outcomes, leading to a skewed and incomplete representation of chemical space.426 As a result, most datasets are restricted to narrow chemical domains, such as specific ligand scaffolds, transition metal complexes, or single catalytic motifs, limiting model generalizability across different reaction families.430 Because of this locality and data sparsity, most current models adopt a descriptor-based framework tailored to specific reaction classes.431 These models rely on physical organic descriptors or transition-state-derived features to achieve interpretability and predictive performance in constrained chemical spaces.431,432 Attempts to build models using learned molecular representations remain limited due to data scarcity and concerns about extrapolation performance beyond the training domain.428
Most enantioselectivity prediction models adopt a regression framework and can be categorized based on how chemical information is represented and interpreted. A widely used strategy involves multivariate linear regression models built on steric and electronic descriptors, such as Charton values, Sterimol parameters, and classical linear free-energy descriptors.427,433,434 These features are selected to reflect underlying physical organic principles and have been applied in early studies of asymmetric catalysis using multivariate regression approaches.427,434 Because these models rely on human-selected, mechanistically interpretable inputs, they are well-suited for reaction classes with established transition state models and allow for direct rationalization of selectivity trends. In addition to these descriptor-based approaches, several studies have incorporated explicit three-dimensional steric environments derived from quantum chemical calculations into machine learning models for stereoselectivity prediction.435
Building on these descriptor-based approaches, several studies have explicitly integrated quantum chemical calculations with statistical or machine learning models to predict enantioselectivity. Early work by Huang et al.436 used computed transition states to extract geometric and energetic features, which were then used in regression models. More recent studies have used a hybrid workflow that integrates computed steric maps or electronic descriptors with machine learning architectures to predict ΔΔG‡ across multiple ligand–substrate combinations.430,431,437–440 While some recent works have explored fully learned molecular representations, their applicability has remained limited due to the small size and chemical locality of existing datasets.428
Taken together, most high-performing models in this domain still rely on physical descriptors that can be mapped back to mechanistic reasoning. This allows the model to reveal how structural features, such as steric bulk or electronic effects positioned near the reactive site and in carbohydrate-based systems where structural conformation drives selectivity,432,441 influence the formation of one enantiomer over the other. In contrast, deep models that learn molecular features end-to-end may offer greater representational flexibility but have seen more limited application in enantioselectivity prediction, largely due to the small size of available datasets.397,428,442
Most models for enantioselectivity prediction have been developed and evaluated within narrow chemical domains, typically focusing on a specific ligand scaffold, transition metal center, or reaction class. As a result, their predictive performance tends to degrade when applied to systems outside the original training distribution. This limited scope has motivated efforts to explore the transferability of structure–selectivity relationships across different catalytic contexts. Representative examples include the work of Reid and Sigman,443 who demonstrated that a common descriptor space for bisphosphine ligands could be reused across multiple reaction types and metal centers. Related studies have explored broader condition spaces within a fixed reaction class,25 or incorporated detailed mechanistic and quantum-chemical descriptors to model selectivity trends,444,445 but generally remain constrained to singular catalytic systems. Their analysis showed that while model performance remained reaction-dependent, certain steric and electronic descriptors exhibited consistent correlations with selectivity across reaction classes.
From a modeling perspective, yield prediction is typically framed as a regression problem, where descriptors of reactants, reagents, and conditions are concatenated and used as model input. Types of representations vary, with common choices being circular fingerprints or DFT-derived descriptors; graph-based encodings have been explored more recently as well. Efforts vary with respect to the types of datasets, models, and representations evaluated, and in terms of success, are a mixed bag of results, reflecting both the promise and the challenges of reliably predicting yield.
A common paradigm is to train a model on high-throughput experimental (HTE) data. Traditionally, these datasets represent the exhaustive exploration of a few substrate and reagent components for a single reaction type, and are attractive due to the use of automation for data generation, which allows for increased reaction thorughput, and reduces confounding factors by holding most procedural variables constant. Examples include two publicly-available Suzuki and Buchwald–Hartwig datasets, which are often used to benchmark yield prediction models.25,446 Due to the combinatorial nature of this data, models are generally able to interpolate well. Even input features encoding no chemical information (i.e. one-hot encodings) often perform on par with chemically-rich features (i.e. structural/electronic information) on interpolation tasks.447 However, the limited chemical diversity may hinder them from extrapolating to out-of-distribution components.37,448–450 Nevertheless, as automation capabilities expand, an emergent trend has been the release of broader HTE datasets, covering multiple reaction types and a larger range of substrates,43,451,452 or covering more industrially-relevant data,408,453 potentially offering opportunities for more comprehensive model development.
Another common approach is to build a “global model” by training on large databases of published reactions spanning multiple reaction types and several diverse substrates (i.e. USPTO, CAS, Pistachio, Reaxys). While having such a model is undoubtedly useful in concept, it often sees little success in practice,454 as models struggle to learn across multiple mechanisms and are hindered by implicit experimental differences, reactivity cliffs, and publication bias towards high-yielding reactions.455 To at least minimize the effect of mechanistic changes, one may also consider building a model on literature data extracted for a single reaction type; however, it should be noted that model performance has generally been sub-optimal even in this setting as confounders remain.456
Yield is of course affected by numerous experimental factors such as concentrations, reaction scale, and the precise order and timing of addition, most of which goes unreported in commonly-used datasets.457 Even when reaction procedures are shared with a special interest in reproducibility, outside of the context of data-driven modeling, it has been observed that a substantial fraction of reactions cannot be satisfactorily reproduced;458,459 it is unsurprising then that models cannot predict yields under these circumstances. Modeling is further complicated by the skewed yield distributions in reported data, particularly in the literature where high-yielding examples dominate and negative data are scarce, limiting extrapolation to unfavorable substrates or conditions460, whereas HTE datasets provide more balanced yield distributions that support robust prediction across the full yield range.25,446 Reported yields may include both reactivity and purification, which is particularly true for datasets that combine information from multiple sources (i.e., literature databases). Prediction of isolated yields is ill-posed as details of workup are not included in structured datasets.461 Other performance metrics such as enantio- and regioselectivity are less subject to the aforementioned experimental differences. Thus, they have generally seen more success in modeling within specific reaction classes than yield, as discussed in the previous section.
It should be noted that these difficulties in modeling yield, which have often led to subpar model performance, do not necessarily mean that these models cannot be useful when applied to practical workflows. In discovery chemistry settings, for example, a binary classification yield prediction model (0 vs. >0% yield) is still very useful and can reduce the impact of dataset noise from variation in experimental conditions. Indeed, such classification models have been shown to be successful, both in silico462 and in deployment for pharmaceutical discovery efforts.463 Additionally, training on curated datasets that are more similar data to the desired task (i.e. medicinal relevance) has shown greater success in real-world extrapolation,408,453 particularly when combined with active learning.464
RMG has been applied to numerous chemically complex systems. In modeling the steam cracking of n-hexane,465 RMG generated a reaction network comprising 1178 elementary steps with calculated rate constants. The calculated rate constants were then used to perform kinetic modeling. The yields from the model matched well with experimental data from a pilot plant which allowed investigators to probe the reaction pathways that were kinetically relevant. Similar analyses have been performed for syngas production from bio-oil gasification,466 combustion and pyrolysis of iso-butanol,467 among many other chemical processes. RMG represents a sensible method choice in these applications since the elementary steps and their corresponding kinetics of these industrial processes are well understood. This means the template-based strategy can yield a robust chemical reaction network and the kinetic parameter estimation and kinetic simulations can yield meaningful concentration profiles.
A representative application is the exploration of the glucose pyrolysis reaction network.468 YARP was used to explore over 31
000 elementary steps and was able to identify around 7000 kinetically relevant elementary steps (as defined by being below a barrier threshold of 45 kcal mol−1). In this exploration, YARP was able to identify well-known reaction pathways in glucose pyrolysis as well as identifying new, low-barrier pathways to major experimental products. YARP has also been used for other complex reaction network explorations of prebiotic ribonucleic acid formation from hydrogen cyanide469 and Li-ion electrolyte degradation.470 YARP constitutes a well-justified choice of method in these types of explorations due to the mechanistic uncertainty and emphasis on product retrieval associated with analyzing systems without well-established mechanisms. The exhaustive graph-based enumeration allows YARP to explore a broader set of potential mechanistic paths and the importance of retrieving products and pathways to these products (rather than accurate kinetic profiles like the previous example) justifies YARP's utilization of semi-empirical methods to accelerate kinetic evaluations.
Despite these advances, several challenges remain. The data-driven paradigm is often limited by the quality and representativeness of training data, which frequently omit reaction conditions, side products, or stereochemical details. As a result, learned models may generate statistically plausible yet chemically inconsistent outcomes, occasionally violating mass or electron conservation. Commenting on data availability being a limitation of data-driven methods is trite, but there is a deeper issue here. It is not merely that the field would benefit from more data, it is that the data that is currently collected into structured databases for the purposes of devising benchmarks are fundamentally incomplete. Reaction product prediction—in virtually every instantiation—is ill-posed. High-throughput experimentation within a single laboratory can mitigate confounding variables, but any dataset assembled from multiple literature sources is missing critical information about reaction conditions and procedures that we know to influence reaction outcomes. Some of that procedural richness exists in unstructured text in the original articles that describe a reaction, even if not in reaction databases we use, providing some hope that this information is already accessible to large language models. However, even when considering hypothetical models that learn “all” of the nuance of experimental procedures from published articles, one must acknowledge that challenges in reproducibility458 raise important questions regarding what we can reasonably expect our models to learn.
On the other hand, physics-based methods are generally free of confounding variables and offer mechanistic rigor, but remain computationally demanding and scale poorly to complex systems. Even when substantial computational resources are devoted to them, the predictive accuracy of physics-based methods degrades for many systems of practical interest due to incomplete treatments of electron correlation, solvation, or anharmonicity.475–477 With physics-based methods, the challenge is not that experimental datasets may omit important procedural details, but that these methods may not have a natural way to account for them; consider, for example, how a DFT model of reaction feasibility could account for the rate of addition. Moreover, these methods scale poorly with system size, making routine application to complex reactive networks infeasible. Recent advances in machine-learned interatomic potentials like smaller, more specialized models478 and more efficient architectures479 will help to address the problem of scalability. Addressing the problem of accuracy necessitates different types of contributions.
Looking ahead, progress in reaction outcome prediction will rely on continued advances within both data-driven and physics-based paradigms. Machine learning approaches are expected to benefit from more chemically faithful representations and curated datasets that capture negative outcomes (so-called “failed reactions”), richer reaction conditions, side products or impurities, and more. Such datasets are essential to more accurately reflect real experimental outcomes and to prevent models from overfitting to idealized or incomplete representations of chemical reactivity. Equally important is the transition from merely predicting the identities of products to kinetics-informed models. Accurate predictions of quantitative reaction rates and product distributions would reflect a more complete understanding of reactivity and provide more utility than existing models. Predictive models that able to consider the role of reaction conditions and experimental protocols are a prerequisite for hypothetical future in silico condition optimization workflows.
For physics-based methods, improving the speed and accuracy of energetic oracles will be crucial for expanding the domain of applicability of these methods. Recent advances in machine-learned interatomic potentials like smaller, more specialized models478 and more efficient architectures479 will help to address the problem of scalability. The creation of the OMol25 dataset147 highlights a broader trend of large-scale data generation to train foundation machine-learned interatomic potentials capable of providing the speed necessary to simulate increasingly complex systems. We expect that as the coverage of these datasets continues to improve, the resulting models will be capable of predictive simulations that provide actionable insights to experimental chemists, even if inheriting the limitations of its associated training data.480,481
Beyond purely technical advances, improving the usability and accessibility of computational tools will be essential to bridge the gap between methodological advances and routine adoption in experimental settings. Despite strong predictive performance in certain settings, reaction prediction models remain underutilized in practice, likely due to a combination of limited awareness, barriers to effective use, and a lack of confidence in model predictions. One often requires extensive experience to have (and often extensive patience to develop) an intuition for when a model prediction can be trusted, or when a seemingly-valid proposal actually reflects an idiosyncratic failure mode. Certainly, deep expertise is beneficial when understanding how to select an appropriate physics-based framework (e.g., functional and basis set in DFT) for a given chemical system. In many cases, computational methods also remain difficult to install, configure, or integrate into existing workflows, particularly for researchers without formal training in programming. Improving the convenience of deploying models coming out of academic labs may require user-friendly interfaces and well-designed APIs or MCPs (e.g., for tool-calling by large language models), as well as robust software engineering practices including maintenance, documentation, and validation. All of these usability concerns are rather easily addressed with modern coding agents, which may not yet have fully permeated into mainstream computational chemistry practices.
Another question that is fair to ask, critically, is for what purpose are these models actually useful? Among the many reaction outcome prediction tasks, there are several settings where existing predictive models have a clear role; e.g., anticipating reaction success or failure in a parallel chemistry campaign to maximize the number of products for subsequent testing;463 gaining confidence that a late-stage C–H functionalization will be selective before embarking on a long total synthesis or screening a precious intermediate;410 constructing and parameterizing detailed kinetic models to embed in computational fluid dynamics simulations and estimate the effect of additives on ignition delay;482 suggesting reasonable initial reaction conditions (e.g., solvent, reagents, temperature) to accelerate experimental optimization workflows;483 and screening synthetic routes for feasibility using a model trained on DFT calculations.484 Yet there are many more hypothetical workflows that remain out of reach. Fully in silico reaction condition screening; a perfect oracle for reaction feasibility to guide retrosynthetic analysis; development of a “digital twin” reaction flask emulator for the chemistry lab with which to train laboratory agents; de novo catalyst design with reliable prediction of rates and turnovers; flowsheet modeling for pharmaceutical manufacturing that predicts heat and mass transfer-dependent product distributions; predicting reaction rates and time-dependent product distributions under realistic conditions, enabling quantitative control over selectivity and yield; autonomous exploration and discovery of complex reaction networks to identify previously unknown transformations. To address these more ambitious use cases, the next generation of reaction prediction models should aim to be able to generalize across disparate chemical domains, reason over reaction conditions, rank competing pathways, and quantitatively describe the dynamic evolution of reaction mixtures. Ultimately, these methods should enable the discovery of new, rather than the recapitulation of known, chemical reactivity.
| This journal is © The Royal Society of Chemistry 2026 |