An exploration of dataset bias in single-step retrosynthesis prediction

Sara Tanovic; Ewa Wieczorek; Fernanda Duarte

doi:10.1039/D5DD00358J

Sara Tanovic,^a Ewa Wieczorek^ab and Fernanda Duarte*^a

Author affiliations

* Corresponding authors

^a Chemistry Research Laboratory, 12 Mansfield Road, Oxford, OX1 3TA, UK
E-mail: fernanda.duartegonzalez@chem.ox.ac.uk

^b Alzheimer's Research UK Oxford Drug Discovery Institute, Centre for Artificial Intelligence in Precision Medicine, Centre for Medicines Discovery, Nuffield Department of Medicine, University of Oxford, Oxford, OX3 7FZ, UK

Abstract

Single-step retrosynthesis models are integral to the development of computer-aided synthesis planning (CASP) tools, leveraging past reaction data to generate new synthetic pathways. However, it remains unclear how the diversity of reactions within a training set impacts model performance. Here, we assess how dataset size and diversity, as defined using automatically extracted reaction templates, affect accuracy and reaction feasibility of three state-of-the-art architectures – template-based LocalRetro and template-free MEGAN and RootAligned. We show that increasing the diversity of the training set (from 1k to 10k templates) significantly increases top-5 round-trip accuracy while reducing top-10 accuracy, impacting prediction feasibility and recall, respectively. In contrast, increasing dataset size without increasing template diversity yields minimal performance gains for LocalRetro and MEGAN, showing that these architectures are robust even with smaller datasets. Moreover, reaction templates that are less common in the training dataset have significantly lower top-k accuracy than more common ones, regardless of the model architecture. Finally, we use an external data source to validate the drastic difference between top-k accuracies on seen and unseen templates, showing that there is limited capability for generalisation to novel disconnections. Our findings suggest that reaction templates can be used to describe the underlying diversity of reaction datasets and the scope of trained models, and that the task of single-step retrosynthesis suffers from a class imbalance problem.
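The per-template comparison described above can be illustrated with a minimal sketch. It assumes that each test reaction carries a template label, a ground-truth reactant string and a ranked list of predicted reactant strings, all already canonicalised so that string equality is meaningful. The function names, record fields and toy data are hypothetical and are not taken from the article or from any of the evaluated models; round-trip accuracy, which additionally requires a forward reaction model, is not covered here.

```python
from collections import defaultdict


def top_k_hit(ranked_predictions, ground_truth, k):
    """Return True if the ground truth appears among the first k predictions."""
    return ground_truth in ranked_predictions[:k]


def top_k_accuracy(records, k):
    """Fraction of records whose ground truth is recovered within the top k."""
    if not records:
        return 0.0
    hits = sum(top_k_hit(r["predictions"], r["truth"], k) for r in records)
    return hits / len(records)


def top_k_accuracy_by_template(records, k):
    """Top-k accuracy computed separately for each template label."""
    groups = defaultdict(list)
    for r in records:
        groups[r["template"]].append(r)
    return {template: top_k_accuracy(rs, k) for template, rs in groups.items()}


# Hypothetical toy records; strings stand in for canonical reactant SMILES.
records = [
    {"template": "T_common", "truth": "A.B", "predictions": ["A.B", "C.D"]},
    {"template": "T_common", "truth": "E.F", "predictions": ["G.H", "E.F"]},
    {"template": "T_rare", "truth": "I.J", "predictions": ["K.L", "M.N"]},
]

overall_top1 = top_k_accuracy(records, k=1)
per_template_top2 = top_k_accuracy_by_template(records, k=2)
```

Grouping by template in this way exposes the class imbalance that a single aggregate top-k figure hides: a model can score well overall while failing on templates that are rare in its training data.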

Article information

DOI: https://doi.org/10.1039/D5DD00358J
Article type: Paper
Submitted: 13 Aug 2025
Accepted: 22 Dec 2025
First published: 29 Dec 2025
This article is Open Access

Digital Discovery, 2026, 5, 793-802

S. Tanovic, E. Wieczorek and F. Duarte, Digital Discovery, 2026, 5, 793 DOI: 10.1039/D5DD00358J

This article is licensed under a Creative Commons Attribution 3.0 Unported Licence. You can use material from this article in other publications without requesting further permissions from the RSC, provided that the correct acknowledgement is given.
