Open Access Article
Jonghwi Choe†a, Shuan Chen†ab and Yousung Jung*abc
aDepartment of Chemical and Biological Engineering, and Institute of Chemical Processes, Seoul National University, 1 Gwanak-ro, Seoul, South Korea
bInstitute of Engineering Research, Seoul National University, 1 Gwanak-ro, Seoul, South Korea
cInterdisciplinary Program in Artificial Intelligence, Seoul National University, 1 Gwanak-ro, Seoul, South Korea. E-mail: yousung.jung@snu.ac.kr
First published on 13th May 2026
Template-free retrosynthesis models offer the potential to extrapolate beyond established chemical reaction spaces, addressing inherent limitations of template-based approaches. However, it remains unclear whether these models can reliably predict accurate, novel, and chemically feasible pathways outside their training distribution. In this study, we rigorously assess the extrapolation ability of state-of-the-art template-free models using carefully constructed out-of-distribution (OOD) benchmarks derived from USPTO datasets. While these models can generate novel synthetic routes, their exact-match accuracy on OOD reactions is remarkably low (typically <1%). Moreover, round-trip accuracy (≈5–30%) depends on the performance of the surrogate forward model and may fail to credit some chemically reasonable predictions. Complementary manual inspection helps address this limitation, revealing that the surrogate forward model produces both false negatives, where chemically feasible reactions are incorrectly predicted as infeasible, and false positives, where infeasible reactions are accepted. These results underscore a critical challenge: current models may exhibit limited creative extrapolation yet lack mechanisms to ensure chemical feasibility. Addressing this gap is essential for developing retrosynthesis models that are not only innovative, but also reliable for real-world synthesis planning.
Unlike template-based methods that rely on predefined reaction rules, template-free approaches treat the retrosynthesis as either a graph editing9 task or a SMILES generation10–12 task. These models are, in principle, capable of proposing chemical transformations that go beyond the predefined synthesis rules. However, it remains unclear whether they can reliably predict chemically feasible synthetic routes that lie outside their training reaction data—referred to as out-of-distribution (OOD) reactions—as opposed to in-distribution (ID) reactions13,14 that are similar to those seen during training.
To address this, RetroOOD15 formalizes label- and covariate-shift scenarios, and reassesses state-of-the-art models primarily via top-k accuracy. Tanović et al.16 analysed template-frequency skew and proposed “narrow” versus “broad” training-set partitions to disentangle template diversity from examples-per-template effects, evaluating performance mainly via top-k and round-trip accuracy. Beyond these task-specific studies, broader benchmarking efforts have emphasized that reported retrosynthesis performance can be highly sensitive to evaluation design itself. For example, Syntheseus provides a standardized benchmarking framework for both single-step and multi-step synthesis planning, and shows that the ranking of state-of-the-art methods can change under more carefully controlled evaluation settings.17 Likewise, Hastedt et al. introduced an automated benchmarking and interpretability pipeline and showed that chemical validity, feasibility, and interpretability can differ substantially across retrosynthesis frameworks, with purely data-driven approaches often producing unfeasible or invalid predictions.18 However, these studies still do not provide an in-depth analysis of the chemical quality of out-of-distribution (OOD) reactions generated by template-free models, which is central to assessing whether such models genuinely extrapolate beyond the reaction patterns represented in the training data.
In this work, we investigate whether template-free retrosynthesis models can genuinely extrapolate to novel reaction spaces, or whether their apparent performance primarily reflects memorization of transformation patterns seen during training. Beyond exact-match accuracy, we evaluate template-free retrosynthesis models by examining the novelty, chemical validity, and synthetic feasibility of OOD reactions. In addition to standard round-trip accuracy, we conduct manual inspection to assess chemical plausibility and reveal surrogate biases intrinsic to round-trip metrics. Comparison between language-based and graph-based template-free models further reveals that chemistry-aware inductive biases can substantially improve feasibility without sacrificing novelty. (We note that an earlier version of this work was presented at the NeurIPS 2023 ELLIS Workshop on Molecule Discovery and archived on arXiv (arXiv:2403.03960); the present manuscript expands upon that preliminary study.)
We sorted all extracted LRTs by their frequency of occurrence and selected the most frequent templates that collectively account for 80% of all reactions. Reactions associated with these templates were assigned to the training set and considered as ID reactions, representing well-established synthetic knowledge. The remaining 20% of reactions, corresponding to less frequent and often more diverse templates, were reserved for validation and testing and treated as OOD reactions. Templates appearing exclusively in these subsets were designated as test templates.
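The frequency-based split described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `split_by_template_frequency`, the representation of each template as a plain string identifier, and the choice to include the template that crosses the coverage threshold are all assumptions. How the OOD pool is further divided into validation and test sets is not specified here and is left out.

```python
from collections import Counter


def split_by_template_frequency(reaction_templates, coverage=0.8):
    """Split reactions into ID and OOD pools by template popularity.

    reaction_templates: list with one template identifier per reaction
    (hypothetical string ids standing in for extracted templates).
    Returns (id_indices, ood_indices, id_templates).
    """
    counts = Counter(reaction_templates)
    total = len(reaction_templates)
    id_templates = set()
    covered = 0
    # Walk templates from most to least frequent until the coverage
    # target is reached (assumption: the crossing template is included).
    for template, n in counts.most_common():
        if covered >= coverage * total:
            break
        id_templates.add(template)
        covered += n
    id_indices = [i for i, t in enumerate(reaction_templates) if t in id_templates]
    ood_indices = [i for i, t in enumerate(reaction_templates) if t not in id_templates]
    return id_indices, ood_indices, id_templates
```

Because the long tail of rare templates ends up entirely in the OOD pool, a split of this kind naturally yields few templates on the training side and many on the held-out side.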
Following this protocol, the USPTO-50k20 split contains 39 982/5016/5018 reactions for training, validation, and test sets, associated with 45/40/835 unique templates, respectively. Similarly, the USPTO-480k20 split consists of 383 784/46 519/48 730 reactions using 232/667/19 321 templates.21 The pronounced imbalance between reaction counts and template diversity reflects the intrinsically long-tailed distribution of chemical reaction space and establishes a stringent benchmark for OOD extrapolation. The statistics and distributions of the curated data can be found in Table S1 and Fig. S1.
Exact-match accuracy measures how accurately a model predicts reactants by comparing the predicted reactant set against the ground-truth reactants after SMILES canonicalization using RDKit.22 For each target product $P_i$, the retrosynthesis model produces a ranked list of $K$ candidate reactant sets $\{R_i^{(k)}\}_{k=1}^{K}$. We compute top-$K$ exact-match accuracy by checking whether any of the top-$K$ candidates matches the ground-truth reactants.
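A minimal sketch of the top-K exact-match computation is given below. It is illustrative only: the helper names are hypothetical, reactant sets are assumed to be dot-separated SMILES strings, and the `canonicalize` argument is a placeholder for an RDKit-based canonicalizer (for example one built on `Chem.MolFromSmiles` and `Chem.MolToSmiles`). The default identity function is used only so that the block runs without RDKit.

```python
def reactant_key(smiles, canonicalize):
    """Order-independent key for a dot-separated reactant SMILES string."""
    return tuple(sorted(canonicalize(part) for part in smiles.split(".")))


def top_k_exact_match(predictions, ground_truth, k, canonicalize=lambda s: s):
    """Fraction of targets whose true reactants appear in the top-k candidates.

    predictions: list (one entry per target) of ranked candidate strings.
    ground_truth: list of reference reactant strings, same order as predictions.
    canonicalize: assumed stand-in for an RDKit canonicalization function.
    """
    hits = 0
    for candidates, truth in zip(predictions, ground_truth):
        truth_key = reactant_key(truth, canonicalize)
        # A target counts as a hit if any of its first k candidates matches.
        if any(reactant_key(c, canonicalize) == truth_key for c in candidates[:k]):
            hits += 1
    return hits / len(ground_truth)
```

Sorting the canonicalized fragments makes the comparison insensitive to the order in which reactants are written, which is the usual reason for comparing reactant sets rather than raw strings.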
Reaction validity evaluates whether the predicted reactants can be successfully converted into molecules using RDKit22 and whether the resulting reaction satisfies atom balance with respect to the target product $P_i$. We report the fraction of valid predictions among the top-$K$ candidates.
Reaction novelty evaluates whether a predicted reaction corresponds to a transformation unseen in the training set. We analysed the novelty of each prediction generated by each model by extracting their reaction templates. Predicted reactions whose extracted templates were distinct from the training templates were defined as novel reactions and subjected to further feasibility analysis (Fig. 1C).
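The novelty check reduces to a set-membership test once templates have been extracted. The sketch below assumes that template extraction has already been done elsewhere and that each template is available as a string, with `None` marking predictions for which no template could be extracted. The function name and the choice of denominator (predictions with an extractable template) are assumptions, not details taken from the study.

```python
def novel_predictions(predicted_templates, training_templates):
    """Return the novel templates and their fraction among usable predictions.

    predicted_templates: list of template strings (None if extraction failed).
    training_templates: iterable of template strings seen during training.
    """
    train = set(training_templates)
    usable = [t for t in predicted_templates if t is not None]
    # A prediction is novel if its template does not occur in the training set.
    novel = [t for t in usable if t not in train]
    fraction = len(novel) / len(usable) if usable else 0.0
    return novel, fraction
```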
Round-trip accuracy evaluates the feasibility of predicted retrosynthetic outputs using a reaction outcome prediction model (surrogate forward model) that predicts product candidates from the predicted reactants.23 For each target $P_i$, each candidate reactant set $R_i^{(k)}$ is passed to the surrogate model $f_{\mathrm{surr}}$, which returns the top-$n$ product predictions $\{\hat{P}_{i,k}^{(j)}\}_{j=1}^{n}$. A prediction is considered cycle-consistent if $P_i$ appears among these outputs, and round-trip accuracy is computed as the fraction of targets for which at least one of the top-$K$ candidates is cycle-consistent. After comparing the performance of different forward synthesis prediction models, including Transformer,24 Chemformer,10 MEGAN,9 and LocalTransform25 on the USPTO-480k dataset, we selected LocalTransform25 as the surrogate model in this work (Tables S2 and S3) because it outperformed the alternatives across all splitting regimes considered. In addition, to avoid architectural bias, we deliberately chose surrogate models with different architectures from the inspected template-free models, ensuring that the evaluation does not favour models with similar inductive biases.
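The round-trip definition above can be written compactly as a sketch. The `surrogate` argument is a hypothetical callable standing in for the forward model: it takes a reactant string and `n` and returns up to `n` product strings. All strings are assumed to be canonicalized already, so plain equality is used for the comparison.

```python
def round_trip_accuracy(targets, candidates_per_target, surrogate, k, n):
    """Fraction of targets with at least one cycle-consistent top-k candidate.

    targets: list of product strings P_i.
    candidates_per_target: list of ranked reactant-set strings per target.
    surrogate: assumed callable (reactants, n) -> list of predicted products.
    """
    hits = 0
    for product, candidates in zip(targets, candidates_per_target):
        # Cycle-consistent: the target reappears among the surrogate's top-n outputs.
        if any(product in surrogate(reactants, n) for reactants in candidates[:k]):
            hits += 1
    return hits / len(targets)
```

Because the verdict is delegated entirely to `surrogate`, any systematic error of the forward model propagates directly into this metric.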
Fig. 2 Distribution of extracted reaction template types for the evaluated template-free models on (A) USPTO-50k and (B) USPTO-480k datasets across top-1 to top-10 predictions.
Our results reveal a consistent novelty–validity trade-off whose origin depends strongly on model inductive biases. Overall, higher novelty tended to coincide with a larger fraction of invalid outputs, reflecting a trade-off between exploration and chemically valid generation. However, this relationship is not uniform across architectures: GraphRetro produces a level of novelty comparable to the Transformer while maintaining substantially higher validity, suggesting that chemistry-aware inductive biases (e.g., functional-group-aware edits) can mitigate validity loss even when novelty remains high. For end-to-end models such as Transformer and MEGAN, limited validity on smaller datasets suggests that additional data is required to reliably learn chemically plausible reaction representations. In contrast, Chemformer and GraphRetro maintain high validity even at smaller scales, indicating that their behaviour is not governed by data insufficiency.
Notably, for these chemistry-aware models, increasing dataset size primarily reduces prediction novelty rather than improving validity. We interpret this effect not as overfitting or correction of undertrained behaviour, but as an implicit, data-driven regularization toward canonical reaction patterns that are repeatedly reinforced during training. As larger datasets expose models to a broader yet more unevenly distributed set of reaction templates, predictions become increasingly concentrated around statistically dominant transformation modes.
At the same time, larger reaction corpora such as USPTO-480k also exhibit a more fragmented template landscape, in which unseen or weakly represented transformations are structurally farther from the dominant training distribution. As a result, extrapolation to unseen templates becomes more challenging despite increased data volume. Together, these observations suggest that the observed trade-off between novelty and validity arises from the interaction between dataset structure and model inductive biases, rather than from insufficient training on smaller datasets.
As shown in Fig. 3, cumulative average round-trip (RT) accuracy shows a strong correlation with the template popularity. For all models, this accuracy is highest for reactions associated with frequently used templates, but it decreases markedly as the evaluation expands toward less common templates. While language-based models (Transformer and Chemformer) show a continuous decline in accuracy when considering the predictions corresponding to rarer templates, graph-based models (MEGAN and GraphRetro) show a more stable profile.
This trend is consistent with Table S6, where language-based models show a steep drop in average RT accuracy as the template rank range moves from popular templates (around 30% for top 100 templates) toward the rarest ones (less than 5% for templates ranked over 10 000), while nearly half (45.7%) of Transformer's total predictions fall into the rarest template range. In contrast, graph-based models exhibit highly concentrated template usage and a comparatively stable average RT profile, with >20% RT accuracy for popular (top 100) templates to ∼10% RT accuracy for the rarest template range. Similar analysis for the USPTO-50k dataset can be found in Fig. S2 and Table S6.
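One way to compute a cumulative average RT accuracy over template ranks is sketched below. The exact procedure behind Fig. 3 is not given in this section, so this sketch makes two assumptions: template popularity is measured as the number of predictions mapped to each template, and the cumulative average at rank r pools all predictions from the r most popular templates. The function name is hypothetical.

```python
from collections import defaultdict


def cumulative_rt_by_template_rank(records):
    """Cumulative average RT accuracy as rarer templates are included.

    records: iterable of (template_id, is_cycle_consistent) pairs.
    Returns a list of (rank, cumulative_accuracy), rank 1 = most popular.
    """
    stats = defaultdict(lambda: [0, 0])  # template -> [n_predictions, n_hits]
    for template, ok in records:
        stats[template][0] += 1
        stats[template][1] += int(ok)
    # Rank templates by how many predictions use them (assumed popularity).
    ranked = sorted(stats.values(), key=lambda s: s[0], reverse=True)
    curve, total, hits = [], 0, 0
    for rank, (n, h) in enumerate(ranked, start=1):
        total += n
        hits += h
        curve.append((rank, hits / total))
    return curve
```

A curve that falls as the rank increases corresponds to the behaviour reported for the language-based models, whereas a flat curve corresponds to the more stable profile of the graph-based models.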
Although Table 2 and Fig. 3 report computationally evaluated metrics, round-trip accuracy must be interpreted with caution: it serves as a useful heuristic for evaluating the self-consistency of a model rather than as a definitive metric for true chemical feasibility. For instance, GraphRetro shows relatively high RT accuracy but extremely low exact-match accuracy, indicating that while the surrogate model (used for round-trip evaluation) recognizes the generated reactions, these do not necessarily correspond to the literature-recorded reactions. To better assess chemical novelty and feasibility, we manually examined the reactions associated with the five most frequently proposed novel templates (Fig. 4). Using SciFinder (https://scifinder-n.cas.org/), we systematically evaluated each reaction;26–36 if a similar transformation was found in the database, we considered the reaction chemically feasible; otherwise, we considered it infeasible.
Although the evaluated models can generate novel reaction templates with high round-trip (RT) accuracy, our manual inspection found that most of these novel templates reflect only limited chemical novelty. In many cases, the templates are not exact matches to predefined ones, but they still follow similar patterns in the training set. For instance, the Transformer's top-3 template24 (95.08% RT accuracy) corresponds to the formation of an imine through the reaction of an aldehyde with a primary amine. This is a fundamental transformation in organic chemistry but absent in the training set. Similarly, Chemformer's top-4 template27 (92.48% RT accuracy) represents a TMS protection reaction where a thiol nucleophilically attacks the silicon atom of trimethylsilyl chloride (TMS-Cl), displacing a chloride ion. This reactivity directly parallels reactions in the training dataset that use tert-butyldimethylsilyl chloride (TBS-Cl) as their reagents, where an oxygen atom attacks a silicon center. Both cases share the common motif of heteroatom (sulfur or oxygen) attack on silicon, demonstrating that the model has transferred the concept of silyl protection from oxygen to sulfur. Overall, these examples show that the models' novel templates with high RT accuracy are often conservative, mechanism-preserving extrapolations rather than genuinely new chemical reactivity.
For reaction feasibility, manual inspection reveals that high RT accuracy does not guarantee chemical validity. A representative example is Transformer's top-2 template, which achieved a high RT accuracy of 90.59%. This template proposes the synthesis of a sulfoxide product via the oxidation of a sulfide precursor. While sulfide oxidation as a general reaction type is feasible, the model predicted a specific, unconventional chlorinated pyridine–peroxyacid as the oxidant. This reagent appears to be a hallucinated analogue of mCPBA (meta-chloroperoxybenzoic acid) lacking chemical precedent or stability, yet the surrogate model accepted it based on its structural similarity to known oxidants. Similarly, Chemformer's top-3 template (91.55% RT accuracy) suggests synthesizing a sulfide-substituted lactam from an α,β-unsaturated lactam and a thiol via a Michael addition. However, the required starting material, which includes a strained 5-membered unsaturated lactam, is thermodynamically unstable and likely inaccessible as a stable reagent. These cases demonstrate that surrogate models can misclassify infeasible reactions as correct because they resemble common reaction patterns (e.g., oxidation, conjugate addition) in the training set.
In contrast, manual inspection also indicates that round-trip evaluation can penalize chemically reasonable OOD predictions when they are not covered by the surrogate model's learned chemical space. For example, Chemformer's top-5 template28 (0% RT accuracy) represents acylation reactions for aryl ketone synthesis. These reactions typically proceed via metal–halogen exchange to generate an organometallic nucleophile, which attacks the electrophilic carbonyl carbon of the acid derivative to yield a ketone. Similarly, MEGAN's top-1 template29 (0% RT accuracy) corresponds to the reaction of an amine with a chloro-activated electrophile, converting the amino group into an amidine derivative. In this transformation, the amine acts as a nucleophile and attacks the electrophilic carbon bearing the chloride leaving group, followed by chloride displacement and proton transfer to furnish the C=N bond, with HCl (or its salt) as a byproduct.
Overall, manual inspection suggests that round-trip evaluation does not clearly separate different types of errors. In particular, surrogate models can penalize correct OOD predictions (i.e. false negatives) while endorsing incorrect ones (i.e. false positives). At the same time, the novel yet valid predictions were mostly simple extensions of known reactions, such as replacing the atoms of known functional groups, indicating that truly new kinds of chemistry extrapolated by template-free models are rare in practice. These findings show that round-trip accuracy, while a practical tool for high-throughput screening of model outputs, has inherent limitations as a feasibility metric and should not be relied on alone to assess prediction quality, particularly in OOD spaces where surrogate bias is most pronounced.
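The two error types discussed above can be tallied by cross-tabulating the surrogate's verdict against the manual assessment. The sketch below is purely illustrative: the function name and the boolean encoding are assumptions, and it presumes that each inspected reaction carries both a round-trip outcome and a manual feasibility label.

```python
from collections import Counter


def surrogate_confusion(pairs):
    """Count agreement and disagreement between round-trip and manual labels.

    pairs: iterable of (rt_consistent, manually_feasible) booleans.
    """
    labels = {
        (True, True): "true_positive",
        (True, False): "false_positive",   # surrogate accepts an infeasible reaction
        (False, True): "false_negative",   # surrogate rejects a feasible reaction
        (False, False): "true_negative",
    }
    return Counter(labels[(bool(rt), bool(ok))] for rt, ok in pairs)
```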
Despite the surrogate model's shortcomings, our inspection reveals that Chemformer and GraphRetro tend to produce more chemically feasible predictions than the other template-free models. This observation suggests that inductive biases, such as SMILES pretraining or functional-group-based molecule generation, may help keep the models' extrapolative behaviour within the bounds of chemical realism.
Our findings establish important boundaries for interpreting the prediction outputs of template-free retrosynthesis models. While these approaches can produce novel reactions, such novelty alone is insufficient evidence of genuine exploration of creative and feasible reaction space. We therefore emphasize that claims of discovery or extrapolation by future template-free models should be supported by chemically grounded evaluation that goes beyond accuracy-based metrics.
More broadly, this work highlights the need for alignment between computational evaluation protocols and chemical reasoning and provides a foundation for more responsible interpretation of AI-generated synthesis proposals.
Recent single-step retrosynthesis studies have introduced newer generative architectures, including Markov-bridge, diffusion-based, GFlowNet-based, and flow-matching approaches, as well as ensemble frameworks that combine complementary inductive biases.36–40 Although these models were not evaluated in the present study, extending feasibility-aware OOD evaluation to such architectures will be an important direction for future work. In particular, it will be valuable to test whether improved top-k or round-trip performance in these newer frameworks translates into genuinely chemically plausible, feasible, and extrapolative predictions under explicit OOD settings.
Supplementary information: supplementary methods, experimental details, extra figures/tables. Methods and materials, ablation study, and the additional details of the experimental results. See DOI: https://doi.org/10.1039/d6dd00072j.
Footnote
† These authors contributed equally to this work.
This journal is © The Royal Society of Chemistry 2026