Data augmentation in a triple transformer loop retrosynthesis model
Abstract
Reactions in the US Patent and Trademark Office (USPTO) dataset are biased towards a few over-represented reaction types, which potentially limits their usefulness for computer-assisted synthesis planning (CASP). Herein we propose a data augmentation approach to generate a balanced dataset of fictive reactions. First, we apply retrosynthesis templates to template-matched USPTO molecules used as products (P) to obtain starting materials (SM). We then use transformer T2 from our recently reported triple transformer loop (TTL) retrosynthesis model to predict reagents (R). Finally, we validate the resulting fictive reaction by requiring a correct, high-confidence prediction by transformer T3*, which is trained to predict P from R and SM* with tagged reacting atoms. We generate up to 5,000 reactions per template, resulting in a template-equilibrated dataset of 27.5 million fictive reactions covering the chemical space of the original USPTO dataset. We demonstrate that a TTL trained on these fictive reactions outperforms a TTL trained on USPTO reactions only.
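The three-step generation and validation procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `augment_reactions`, the `FictiveReaction` record, the callable signatures standing in for the templates, T2 and T3*, and the confidence threshold of 0.95 are all hypothetical. The sketch also assumes that T2 takes both SM and P as input, and it omits the tagging of reacting atoms (SM*). Only the cap of 5,000 reactions per template is taken from the text.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, NamedTuple, Optional, Tuple


class FictiveReaction(NamedTuple):
    """Hypothetical record for one validated fictive reaction (SMILES strings)."""
    template_id: str
    starting_materials: str
    reagents: str
    product: str


def augment_reactions(
    products: Iterable[str],
    templates: Dict[str, Callable[[str], Optional[str]]],
    predict_reagents: Callable[[str, str], str],
    predict_product: Callable[[str, str], Tuple[str, float]],
    max_per_template: int = 5000,   # cap stated in the abstract
    min_confidence: float = 0.95,   # assumed threshold, not from the source
) -> List[FictiveReaction]:
    """Generate a template-balanced set of validated fictive reactions."""
    counts: Dict[str, int] = defaultdict(int)
    accepted: List[FictiveReaction] = []
    for product in products:
        for template_id, apply_template in templates.items():
            # Balance the dataset: stop collecting once a template is full.
            if counts[template_id] >= max_per_template:
                continue
            # Step 1: retrosynthesis template on P gives SM (None if no match).
            starting_materials = apply_template(product)
            if starting_materials is None:
                continue
            # Step 2: stand-in for T2, which predicts reagents R.
            reagents = predict_reagents(starting_materials, product)
            # Step 3: stand-in for T3*, which must recover P with high confidence.
            predicted, confidence = predict_product(starting_materials, reagents)
            if predicted == product and confidence >= min_confidence:
                accepted.append(
                    FictiveReaction(template_id, starting_materials, reagents, product)
                )
                counts[template_id] += 1
    return accepted
```

In practice the three callables would wrap a template engine and the trained transformers. Passing them in as arguments keeps the balancing and validation logic independent of any particular model.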