Issue 2, 2026

Data augmentation in a triple transformer loop retrosynthesis model

Abstract

Reactions in the US Patent and Trademark Office (USPTO) dataset are biased towards a few over-represented reaction types, which potentially limits their usefulness for computer-assisted synthesis planning (CASP). To obtain an equilibrated dataset, we applied retrosynthesis templates to USPTO molecules as products (P) to generate starting materials (SM). We then used transformer T2 from our recently reported triple transformer loop (TTL) retrosynthesis model to predict reagents (R) for the SM → P reaction. Finally, we validated each reaction by requiring that TTL transformer T3 predict P from SM + R with high confidence (>95%). We generated up to 5000 reactions per template, resulting in 27.5 million validated fictive reactions covering the chemical space of the original USPTO dataset. To exemplify the use of this dataset, we demonstrate that a single-step retrosynthesis transformer model trained on a template-equilibrated subset of 1 097 374 fictive reactions outperforms the corresponding model trained on USPTO reactions only.
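The generation and validation loop described in the abstract can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the article's code: the function `generate_validated_reactions`, the callables standing in for template application, T2 and T3, and the string representation of molecules. The sketch also assumes that validation means T3 must reproduce P exactly with a confidence above the threshold, and that the per-template cap is enforced with a simple counter.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# (template_id, starting_materials, reagents, product), all as plain strings
Reaction = Tuple[str, str, str, str]


def generate_validated_reactions(
    products: Iterable[str],
    templates: Dict[str, Callable[[str], Optional[str]]],
    predict_reagents: Callable[[str, str], str],
    predict_product: Callable[[str, str], Tuple[str, float]],
    max_per_template: int = 5000,
    min_confidence: float = 0.95,
) -> List[Reaction]:
    """Build fictive reactions SM + R -> P and keep only validated ones.

    templates:        hypothetical stand-ins for retrosynthesis templates;
                      each maps a product to starting materials, or None
                      if the template does not apply.
    predict_reagents: hypothetical stand-in for T2, called as (SM, P) -> R.
    predict_product:  hypothetical stand-in for T3, called as
                      (SM, R) -> (predicted P, confidence in [0, 1]).
    """
    counts: Dict[str, int] = defaultdict(int)
    accepted: List[Reaction] = []
    for product in products:
        for template_id, apply_template in templates.items():
            # Cap the number of accepted reactions per template.
            if counts[template_id] >= max_per_template:
                continue
            starting_materials = apply_template(product)
            if starting_materials is None:
                continue
            reagents = predict_reagents(starting_materials, product)
            predicted, confidence = predict_product(starting_materials, reagents)
            # Assumed validation rule: T3 recovers P with high confidence.
            if predicted == product and confidence > min_confidence:
                accepted.append((template_id, starting_materials, reagents, product))
                counts[template_id] += 1
    return accepted
```

Passing the models in as callables keeps the sketch independent of any particular transformer implementation; in practice the stand-ins would wrap the trained T2 and T3 models and a template-matching routine.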

Article information

Article type: Paper
Submitted: 16 Oct 2025
Accepted: 21 Jan 2026
First published: 21 Jan 2026
This article is Open Access
Creative Commons BY license

Digital Discovery, 2026, 5, 653-661

Y. Grandjean, D. Kreutter and J. Reymond, Digital Discovery, 2026, 5, 653 DOI: 10.1039/D5DD00465A

This article is licensed under a Creative Commons Attribution 3.0 Unported Licence. You can use material from this article in other publications without requesting further permissions from the RSC, provided that the correct acknowledgement is given.
