Data augmentation in a triple transformer loop retrosynthesis model

Abstract

Reactions in the US Patent and Trademark Office (USPTO) dataset are biased towards a few over-represented reaction types, which potentially limits their usefulness for computer-assisted synthesis planning (CASP). Herein we propose a data augmentation approach to generate a balanced dataset of fictive reactions. First, we apply retrosynthesis templates to template-matched USPTO molecules used as products (P) to obtain starting materials (SM). We then use transformer T2 from our recently reported triple transformer loop (TTL) retrosynthesis model to predict reagents (R). Finally, we validate each resulting fictive reaction by requiring a high-confidence, correct prediction from transformer T3*, which is trained to predict P from R and SM* with tagged reacting atoms. We generate up to 5,000 reactions per template, resulting in a template-equilibrated dataset of 27.5 million fictive reactions covering the chemical space of the original USPTO dataset. We demonstrate that a TTL trained on these fictive reactions outperforms a TTL trained on USPTO reactions only.
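The three-step generation procedure summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name, the three callables standing in for template application, T2 and T3*, the assumption that T2 receives both SM and P as input, and the confidence threshold of 0.95 are all hypothetical placeholders chosen for illustration. Atom tagging of SM* is omitted.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Reaction = Tuple[str, str, str]  # (starting materials, reagents, product)


def generate_fictive_reactions(
    products: Iterable[str],
    templates: Iterable[str],
    apply_template: Callable[[str, str], Optional[str]],  # hypothetical: template, P -> SM or None
    predict_reagents: Callable[[str, str], str],          # hypothetical stand-in for T2: SM, P -> R
    predict_product: Callable[[str, str], Tuple[str, float]],  # hypothetical stand-in for T3*
    min_confidence: float = 0.95,   # assumed threshold; the abstract only says "high confidence"
    max_per_template: int = 5000,   # cap per template stated in the abstract
) -> Dict[str, List[Reaction]]:
    """Collect validated fictive reactions, capped per template."""
    products = list(products)
    kept: Dict[str, List[Reaction]] = defaultdict(list)
    for template in templates:
        for product in products:
            if len(kept[template]) >= max_per_template:
                break  # this template has reached its quota
            # Step 1: apply the retrosynthesis template to the product.
            starting_materials = apply_template(template, product)
            if starting_materials is None:
                continue  # template does not match this product
            # Step 2: predict reagents for the candidate reaction.
            reagents = predict_reagents(starting_materials, product)
            # Step 3: keep the reaction only if the forward model recovers
            # the original product with sufficient confidence.
            predicted, confidence = predict_product(starting_materials, reagents)
            if predicted == product and confidence >= min_confidence:
                kept[template].append((starting_materials, reagents, product))
    return dict(kept)
```

Capping each template at the same maximum is what equalizes the contribution of common and rare templates; templates with few matching products simply contribute fewer reactions.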

Article information

Article type: Paper
Submitted: 16 Oct 2025
Accepted: 21 Jan 2026
First published: 21 Jan 2026
This article is Open Access
Creative Commons BY license

Digital Discovery, 2025, Accepted Manuscript

Y. Grandjean, D. Kreutter and J. Reymond, Digital Discovery, 2025, Accepted Manuscript , DOI: 10.1039/D5DD00465A

This article is licensed under a Creative Commons Attribution 3.0 Unported Licence. You can use material from this article in other publications without requesting further permissions from the RSC, provided that the correct acknowledgement is given.
