Open Access Article
Sara Tanovic,a Ewa Wieczorekab and Fernanda Duarte*a
aChemistry Research Laboratory, 12 Mansfield Road, Oxford, OX1 3TA, UK. E-mail: fernanda.duartegonzalez@chem.ox.ac.uk
bAlzheimer's Research UK Oxford Drug Discovery Institute, Centre for Artificial Intelligence in Precision Medicine, Centre for Medicines Discovery, Nuffield Department of Medicine, University of Oxford, Oxford, OX3 7FZ, UK
First published on 29th December 2025
Single-step retrosynthesis models are integral to the development of computer-aided synthesis planning (CASP) tools, leveraging past reaction data to generate new synthetic pathways. However, it remains unclear how the diversity of reactions within a training set impacts model performance. Here, we assess how dataset size and diversity, as defined using automatically extracted reaction templates, affect accuracy and reaction feasibility of three state-of-the-art architectures – template-based LocalRetro and template-free MEGAN and RootAligned. We show that increasing the diversity of the training set (from 1k to 10k templates) significantly increases top-5 round-trip accuracy while reducing top-10 accuracy, impacting prediction feasibility and recall, respectively. In contrast, increasing dataset size without increasing template diversity yields minimal performance gains for LocalRetro and MEGAN, showing that these architectures are robust even with smaller datasets. Moreover, reaction templates that are less common in the training dataset have significantly lower top-k accuracy than more common ones, regardless of the model architecture. Finally, we use an external data source to validate the drastic difference between top-k accuracies on seen and unseen templates, showing that there is limited capability for generalisation to novel disconnections. Our findings suggest that reaction templates can be used to describe the underlying diversity of reaction datasets and the scope of trained models, and that the task of single-step retrosynthesis suffers from a class imbalance problem.
These methods, as is the case with machine learning algorithms generally,20,21 have previously been found to be sensitive to imbalanced data, often reinforcing biases rather than identifying important trends.22–24 This is most clearly evidenced by template-based models, where retrosynthesis is formulated as a multi-class classification task25 and model performance is therefore heavily affected by the underlying distribution of reaction templates in the training data. Within retrosynthesis, this bias manifests as preferential prediction of the specific reaction classes, regioselectivities, or stereoselectivities that are better represented in the training set.22–24 The widely used open-source USPTO reaction dataset,26 derived from US patent data, and its subsets have been extensively used for training and model comparison;27–29 however, their underlying biases have often been overlooked during model evaluation.23 Torren-Peraire et al. train and test multiple models on a variety of datasets, but the lack of a common test set means that results and biases cannot be directly compared.30 Thakkar et al. investigate the impact of template library size on the performance of template-based models, but do not consider template-free models or discuss the impacts of bias.31 Thus, it is unclear how training data impacts model predictions, and what future reaction databases should look like in terms of size and diversity.24,32
Despite many works evaluating and comparing retrosynthesis models, there is little consensus on the best way to realistically evaluate extrapolation to real-world scenarios.30,33 Models are often trained and evaluated on a particular random split of USPTO50k,27 which is itself a cleaned random subset of the USPTO database;26 however, this relatively small dataset cannot demonstrate how model performance would scale when trained and tested on much larger and more diverse in-house reaction libraries.30 Recently, Bradshaw et al. have shown that random splits of patent databases yield overly optimistic results, owing to the similarity of reactions within the same patent or published by the same author.34 Instead, they use patent- and author-based splits to simulate out-of-distribution (OOD) data and measure generalisation to reactions from unseen patents and authors, respectively. Other studies instead define generalisation as the ability to predict novel transformations defined by reaction templates.35–39 However, these studies focus on how well different model architectures can generalise to new templates, not on how the underlying training data impacts generalisation.
Here, we investigate the effect that dataset size and diversity have on single-step model performance by training and testing on different subsets of a reaction database. We generate USPTO-retro, a retrosynthesis-specific dataset derived from USPTO,26 analyse its diversity through local reaction templates,11 and use it to train and test three established single-step architectures: LocalRetro11 (template-based), MEGAN17 (graph-based template-free), and RootAligned14 (SMILES-based template-free). We show that top-k accuracy is correlated with the popularity of reaction templates in the training set for all models, regardless of architecture, suggesting that this metric can serve as a measure of reaction diversity. Finally, we evaluate performance on external test sets extracted from the Pistachio database40 to demonstrate a protocol for measuring generalisation to seen and unseen reaction templates (Fig. 1A).
This pipeline was applied to the USPTO reaction database26 to generate USPTO-retro, which includes 1,103,781 atom-mapped reaction SMILES. Reaction templates were extracted using the LocalTemplate11 algorithm, a modified version of RDChiral,42 generating a total of 10,028 local reaction templates. This template extraction method was chosen to allow for direct comparison to the LocalRetro model. Two external test sets were created from Pistachio: Pistachio ID, containing 10k reactions with in-distribution templates seen in USPTO-retro, and Pistachio OOD, containing 10k reactions with unseen out-of-distribution templates.
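The partitioning of an external dataset into seen-template (ID) and unseen-template (OOD) subsets can be sketched as below. This is an illustrative simplification, not the exact pipeline used in this work: each external reaction is assumed to carry a pre-extracted template label.

```python
def split_id_ood(external, train_templates):
    """Partition external (template, reaction) pairs into in-distribution
    (template seen during training) and out-of-distribution subsets."""
    seen = set(train_templates)
    in_dist = [rxn for tpl, rxn in external if tpl in seen]
    out_dist = [rxn for tpl, rxn in external if tpl not in seen]
    return in_dist, out_dist
```

In practice, each subset would then be down-sampled to a fixed size (here, 10k reactions) before evaluation.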
The dataset was divided into training, validation, and test sets using a 90:5:5 split, consistent with established practice in retrosynthesis studies.26–29 This is referred to as the full split. To prevent data leakage, all reactions sharing the same product were assigned to the same subset.
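A product-grouped split of this kind can be sketched as follows; the dictionary-based reaction records and the 90:5:5 ratios mirror the description above, but the implementation details are illustrative assumptions rather than the exact code used here.

```python
import random
from collections import defaultdict

def product_grouped_split(reactions, ratios=(0.9, 0.05, 0.05), seed=0):
    """Assign whole product groups to train/val/test so that no product
    appears in more than one subset (prevents data leakage)."""
    groups = defaultdict(list)
    for rxn in reactions:
        groups[rxn["product"]].append(rxn)
    products = sorted(groups)
    random.Random(seed).shuffle(products)
    n = len(products)
    cut1 = int(ratios[0] * n)          # end of the training block
    cut2 = cut1 + int(ratios[1] * n)   # end of the validation block
    subsets = {"train": [], "val": [], "test": []}
    for i, p in enumerate(products):
        key = "train" if i < cut1 else "val" if i < cut2 else "test"
        subsets[key].extend(groups[p])
    return subsets
```

Shuffling product identities rather than individual reactions guarantees that all reactions sharing a product land in the same subset.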
To investigate the effects of dataset size and diversity, the training set was further split into 10%, 25%, and 50% subsets using two splitting strategies (Fig. 1B):
• Narrow split: this strategy selects a subset of reaction templates and includes all associated reactions in the training, validation, and test sets, sequentially increasing template diversity with training dataset size. The validation and test sets are similarly filtered to contain only templates seen during training. This split aims to measure how many reaction templates models can learn to predict, and the effect of increasing template diversity on model performance.
• Broad split: in contrast, this strategy randomly samples a fraction of reactions from all templates in the full training set while ensuring at least one example of each template is present. The validation and test sets are not altered. This split is designed to measure how much data per template is needed to learn these chemical transformations.
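The two strategies can be sketched as below, assuming each reaction record carries a template label; the interpretation of the fraction (a fraction of templates for the narrow split, a per-template fraction of reactions for the broad split) follows the definitions above, while the data structures are illustrative.

```python
import random
from collections import defaultdict

def narrow_split(reactions, fraction, seed=0):
    """Narrow split: randomly keep a FRACTION of templates, then keep
    ALL reactions of the surviving templates (diversity grows with size)."""
    by_tpl = defaultdict(list)
    for rxn in reactions:
        by_tpl[rxn["template"]].append(rxn)
    templates = sorted(by_tpl)
    keep = set(random.Random(seed).sample(templates, max(1, int(fraction * len(templates)))))
    return [r for r in reactions if r["template"] in keep]

def broad_split(reactions, fraction, seed=0):
    """Broad split: sample a FRACTION of reactions from every template,
    keeping at least one example per template (diversity fixed)."""
    by_tpl = defaultdict(list)
    for rxn in reactions:
        by_tpl[rxn["template"]].append(rxn)
    rng = random.Random(seed)
    sampled = []
    for tpl in sorted(by_tpl):
        pool = by_tpl[tpl]
        sampled.extend(rng.sample(pool, max(1, int(fraction * len(pool)))))
    return sampled
```

The two functions thus decouple the number of distinct templates from the amount of data per template, which is the contrast the experiments below exploit.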
Top-k accuracy measures the proportion of test reactions for which the ground truth reactants appear among the model's top-k predictions. In this case, the ground truth is the reported reactants from the test set. The top-10 accuracy metric is analysed in all experiments to mimic the desired breadth of a search tree in a multi-step algorithm.33
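This metric reduces to a few lines once predictions are ranked; in practice, both predictions and ground truth would first be canonicalised SMILES so that string comparison is meaningful (an assumption of this sketch).

```python
def top_k_accuracy(predictions, ground_truth, k):
    """Fraction of test reactions whose recorded reactants appear among
    the model's top-k ranked predictions.
    predictions: list of ranked prediction lists, one per test reaction."""
    hits = sum(truth in preds[:k] for preds, truth in zip(predictions, ground_truth))
    return hits / len(ground_truth)
```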
Top-k round-trip accuracy evaluates the proportion of top-k predicted reactants that satisfy back-translation.13 This is done by using a forward reaction model (here RootAligned trained on the full USPTO-retro training set) to predict the top-1 product from each set of predicted reactants; if the predicted product matches the original target, the prediction is considered successful. We report top-1 and top-5 round-trip accuracy metrics to estimate the chemical feasibility of the top predictions.13 Because round-trip accuracy depends on a forward prediction model, which is itself imperfect, it should be interpreted as an approximation rather than an absolute measure of chemical validity.
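A minimal sketch of this back-translation check is shown below; the forward model is abstracted as a callable (in this work, a trained RootAligned model), and the per-prediction averaging follows the definition above.

```python
def round_trip_accuracy(predictions, products, forward_model, k):
    """Proportion of top-k predicted reactant sets that regenerate the
    original product when passed through a forward reaction model."""
    checked = hits = 0
    for preds, product in zip(predictions, products):
        for reactants in preds[:k]:
            checked += 1
            hits += forward_model(reactants) == product
    return hits / checked
```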
Template frequencies in the training set are highly imbalanced, ranging from a single occurrence to more than 78k, with 50% of templates occurring fewer than 12 times. This bias underscores the inherent nature of open-source reaction databases, where certain reactions dominate. For example, the top 10 templates account for just 0.1% of all templates yet together describe 30% of the training data.
The most common reaction template, an example of which is shown in Fig. 2B, corresponds to a C–N bond-forming SN2 reaction, which accounts for >78k (8%) of all reactions in the training set. This template is similar to the next two most popular templates, which differ only in their leaving groups. Conversely, rarer templates include those with uncommon leaving groups or highly specific reaction centres. While these reactions are less common in the dataset, they are not necessarily less effective or harder to apply experimentally. Therefore, understanding the implications of this template imbalance on model performance is key for formulating better training and data curation strategies.
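Long-tail statistics of this kind (median template frequency and the share of data covered by the most common templates) are straightforward to compute; the sketch below assumes a flat list with one template label per training reaction.

```python
import statistics
from collections import Counter

def template_imbalance(template_per_reaction, top_n=10):
    """Summarise a long-tail template distribution: median template
    frequency and the fraction of reactions covered by the TOP_N most
    common templates."""
    freq = Counter(template_per_reaction)
    median_freq = statistics.median(freq.values())
    top_share = sum(c for _, c in freq.most_common(top_n)) / len(template_per_reaction)
    return median_freq, top_share
```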
This decrease in top-k accuracy does not imply lower reaction feasibility; rather, it indicates the model's increased vocabulary of reactivity as a broader set of plausible reactions is suggested. Round-trip accuracy is used here to estimate the feasibility of the predicted reactions.13 The top-1 round-trip accuracy remains roughly consistent across all splits and models, with over 89% of top predictions likely to be feasible reactions. In contrast, the top-5 round-trip accuracy increases by 14–21% across all models as template diversity increases, suggesting that lower-ranked predictions become more feasible when the model is exposed to more reaction types.
This behaviour differs from previous studies wherein top-k accuracy improves with additional randomly split training data.25,44 In our case, increasing both the volume and diversity of training data leads to a decrease in top-k accuracy. This highlights the importance of explicitly reporting and accounting for reaction template diversity when comparing model performance across datasets with varying levels of diversity.
In contrast, the RootAligned model exhibits a substantial decrease in performance across the broad split. Its top-10 accuracy degrades by 15.7% between the 10% and 50% training sets, but recovers to 85.0% with the full training set. The consistent performance of LocalRetro and MEGAN indicates that the variations observed for RootAligned arise from the underlying transformer architecture rather than the size or nature of these training sets. This template-free approach attempts to implicitly learn chemistry directly from SMILES strings, whereas template-based and semi-template methods provide a more structured way of learning reactions through predefined templates and graph edits. Consequently, the RootAligned model may require more examples of the same reactions to fully learn the underlying chemistry. Smaller training datasets may also encourage overfitting, leading to memorisation and pattern matching that cannot generalise to the test set. Further investigation is needed to determine if this behaviour occurs with other template-free models.
(frequency of 10,001+) is at most 88.6% for LocalRetro, 83.5% for MEGAN, and 55.4% for RootAligned. A similar, though weaker, correlation is observed when considering Tanimoto similarities between the training and test sets (Fig. S4). These trends persist even in models that do not explicitly use reaction templates, such as MEGAN and RootAligned, implying that template frequency reflects the underlying class distribution of reaction data.
In both the narrow and broad splits, increasing the training set size amplifies the spread of top-k accuracies across template frequencies. For the most frequent templates (frequency of 10,001+), LocalRetro and MEGAN consistently achieve top-10 accuracy above 95%, regardless of training set size. In contrast, rare templates (frequency of 1–10) show a marked drop in accuracy as training set size increases: top-10 accuracy decreases between the narrow 10% and full 90% training sets by 53.9% for LocalRetro, 33.8% for MEGAN, and 22.1% for RootAligned. This behaviour is most pronounced for LocalRetro, which explicitly considers reaction templates and thus learns to prioritise more frequent classes during training. RootAligned, which implicitly encodes chemistry through SMILES strings, is less sensitive to these class imbalances. These results suggest that increasing both the number and imbalance of reaction templates contributes to performance disparities. To mitigate this, further work is needed to incorporate class-balancing strategies during model training.
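Stratifying accuracy by training-set template frequency can be sketched as follows. The bucket edges are assumptions for illustration (only the 1–10 and 10,001+ buckets are named in the text), and the per-reaction (template, correct) records are a simplified input format.

```python
from collections import Counter, defaultdict

# Hypothetical frequency buckets; only the end buckets match those in the text.
BINS = [(1, 10), (11, 100), (101, 1000), (1001, 10000), (10001, float("inf"))]

def accuracy_by_frequency(test_records, train_templates):
    """Average per-reaction hit/miss results within bins of training-set
    template frequency.
    test_records: iterable of (template, correct_bool) pairs."""
    freq = Counter(train_templates)
    binned = defaultdict(list)
    for tpl, correct in test_records:
        f = freq.get(tpl, 0)  # unseen templates (f = 0) fall in no bin
        for lo, hi in BINS:
            if lo <= f <= hi:
                binned[(lo, hi)].append(correct)
                break
    return {b: sum(v) / len(v) for b, v in binned.items()}
```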
While the top-k accuracy measures how often a reaction template is correctly predicted, it does not describe how often that type of template is recalled. Thus, it is also important to understand if the models are oversampling from popular reaction classes as a way of mimicking the training set distribution. This behaviour is most easily studied in the LocalRetro model, as its algorithm readily outputs a ranked list of predicted templates. In all splits, the model oversamples the most popular template classes for its highest ranked prediction (Fig. 5A). Rarer templates are undersampled compared to the true test distribution, which contributes to their low top-10 accuracy. These rarer templates are instead sampled more often at lower ranks as the model is less confident in their prediction (Fig. 5B).
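Over- and under-sampling of template classes can be quantified by comparing predicted and true template counts; the sketch below is a simplified version of such an analysis, with flat lists of template labels as an assumed input format.

```python
from collections import Counter

def sampling_ratio(predicted_templates, true_templates):
    """Ratio of predicted to true template frequencies; values above 1
    indicate a template is oversampled relative to the test distribution."""
    pred = Counter(predicted_templates)
    true = Counter(true_templates)
    return {t: pred.get(t, 0) / c for t, c in true.items()}
```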
On the Pistachio ID test set (Fig. 6B), all models exhibit a moderate decline in top-10 accuracies when compared to their performance on the USPTO-retro test set: 7–9% for LocalRetro, 6% for MEGAN, and 5–12% for RootAligned. This indicates that models successfully generalise to novel products using templates learnt during training, with similar performance trends to previous results. The slightly reduced performance on this test set is likely due to the lower structural similarity between Pistachio ID products and those in the USPTO-retro training sets (Fig. S5).
In contrast, performance on the Pistachio OOD test set (Fig. 6C) reveals severe limitations in generalisability to novel disconnections, in agreement with previous findings.35–38 LocalRetro exhibits near-zero top-10 accuracy, which is expected given its reliance on predefined templates. The non-zero accuracy suggests template ambiguity, where different templates from the training and OOD test sets occasionally yield the same sets of reactants. This occurs due to overlapping SMARTS patterns or errors in atom mapping. MEGAN and RootAligned models show modest generalisability, which increases with increased training diversity and peaks at top-10 accuracies of 1% and 2% respectively with the full training sets. Their low but non-zero accuracy implies that models prioritise recognising and applying patterns seen in the training data over utilising underlying chemical principles to generate novel, feasible disconnections.
These results highlight the differences in capabilities between ID and OOD generalisation, emphasising the need for distinct evaluations of these two scenarios. Previous studies reporting the traditional learning pattern of increasing top-k accuracy with increasing training data volume25,44 may, in fact, be attributing to additional data what is better explained by additional template coverage of the test set. This explanation may also apply to studies showing low generalisability to external datasets33 or author-/patent-based splits,34 whose test sets possibly contain both seen and unseen templates. Furthermore, the extremely low generalisability of template-free models to novel templates suggests that these models are not yet sufficiently developed to warrant their use for predicting new chemistries.
Our results have highlighted the critical role of training set diversity in model performance. Increasing the diversity of the training set significantly increases top-5 round-trip accuracy, an indicator of prediction feasibility, while reducing top-10 accuracy, reflecting the ability of the model to recover the ground truth. This trade-off suggests that more diverse datasets enable the prediction of a broader range of plausible reactions, even if they differ from the ground truth. Interestingly, increasing dataset size without increasing template diversity yields minimal performance gains for LocalRetro and MEGAN models, suggesting that template diversity has a greater impact on model performance than volume.
We also examined the impact of template frequency on model performance. All three models, regardless of whether they explicitly use templates, show a strong correlation between a template's frequency in the training set and the model's ability to predict it correctly. This indicates that all models implicitly rely on the distribution of reaction templates learnt during training, with rare templates consistently underperforming compared to more frequent ones.
Finally, to assess real-world applicability, we evaluated model performance on two external test sets derived from the Pistachio database: one containing novel products with known templates (Pistachio in-distribution (ID)) and another with novel products and unseen templates (Pistachio out-of-distribution (OOD)). While all models generalised reasonably well to new molecules involving known templates, their ability to predict novel disconnections was limited: LocalRetro failed almost entirely on OOD reactions due to its reliance on predefined templates, while MEGAN and RootAligned achieved only 1–2% top-10 accuracy. These results highlight the need for evaluation protocols that clearly distinguish between in- and out-of-distribution generalisation.
These results also offer a new perspective on recent advances in transfer learning for retrosynthesis prediction, wherein fine-tuning effectively modifies the training template distribution. For instance, our reported mixed fine-tuning approach to bias predictions towards heterocyclic ring disconnections can be viewed as addressing the underlying class imbalance issues present in the initial training set.45 Our results suggest that similar systematic approaches to class imbalance during training could improve representation across reaction classes. Similar challenges have been addressed in other domains, such as computer vision, through pre-training, data augmentation, and re-weighting strategies,46 and could be applied to retrosynthesis through the selective augmentation of rare templates or lower weighting of popular templates during the training process.
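One such re-weighting strategy, down-weighting popular templates via inverse-frequency loss weights, can be sketched as below. This is one possible scheme under stated assumptions, not a method evaluated in this work; the exponent `beta` is a hypothetical knob controlling how aggressively rare templates are up-weighted.

```python
from collections import Counter

def inverse_frequency_weights(train_templates, beta=1.0):
    """Per-template loss weights proportional to 1/frequency**beta,
    normalised so the mean weight is 1; rare templates are up-weighted."""
    freq = Counter(train_templates)
    raw = {t: 1.0 / (c ** beta) for t, c in freq.items()}
    mean = sum(raw.values()) / len(raw)
    return {t: w / mean for t, w in raw.items()}
```

Such weights could then scale the per-example loss during training of a template classifier, analogous to class re-weighting in computer vision.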
The performance trends across the narrow and broad splits raise questions about what data should be used to train retrosynthesis models. Ideally, models would learn underlying physical principles to propose feasible reactions; however, our evaluation shows that they are more likely to mimic the template distribution of the training set. Further cheminformatic analysis is needed to characterise the biases of common reaction datasets and identify areas for improvement. Moreover, models do not necessarily exhibit worse accuracy when trained on less data; data curation efforts should therefore prioritise quality and diversity over quantity. As chemists, we cannot blindly train models on all available data without considering the types of chemistry those data represent, and whether that chemistry suits our synthetic goals and targets.
Training: the open-source packages used to train the machine learning models (using the configuration files provided at https://github.com/duartegroup/template-splits/tree/main/configs) can be found at:
• LocalRetro: https://github.com/kaist-amsg/LocalRetro (since removed by the authors)
• MEGAN: https://github.com/molecule-one/megan
• RootAligned: https://github.com/otori-bird/retrosynthesis
Testing: the open-source syntheseus package used to analyse the trained models can be found at https://github.com/microsoft/syntheseus/tree/main.
Supplementary information (SI): further discussion of the data preprocessing pipeline, training dataset template distributions, and Tanimoto similarity analysis between training and test sets. See DOI: https://doi.org/10.1039/d5dd00358j.
This journal is © The Royal Society of Chemistry 2026