Dissecting errors in machine learning for retrosynthesis: a granular metric framework and a transformer-based model for more informative predictions

Arihanth Srikar Tadanki; H. Surya Prakash Rao; U. Deva Priyakumar

doi:10.1039/D4DD00263F

Arihanth Srikar Tadanki,^a H. Surya Prakash Rao^b and U. Deva Priyakumar*^a

Author affiliations

* Corresponding authors

^a Center for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad 500 032, India
E-mail: deva@iiit.ac.in

^b Teadus Pharma Pvt. Ltd, Hyderabad, India

Abstract

Chemical reaction prediction, encompassing forward synthesis and retrosynthesis, is a fundamental challenge in organic synthesis. A widely adopted computational approach frames synthesis prediction as a sequence-to-sequence translation task over the SMILES representation of molecules. The current evaluation of machine learning methods for retrosynthesis assumes perfect training data, overlooking imperfections in the reaction equations of popular datasets, such as missing reactants or products and absent physical and practical constraints such as temperature and cost, which arise primarily from a focus on the target molecule. This limitation leads to an incomplete representation of viable synthetic routes, especially when multiple sets of reactants can yield a given desired product. In response to these shortcomings, this study examines the prevailing evaluation methods and introduces comprehensive metrics designed to address imperfections in the dataset. Our metrics assess absolute accuracy by comparing predicted outputs with the ground truth, and they add a more nuanced evaluation: we provide scores for partial correctness and compute adjusted accuracy through graph matching, acknowledging the inherent complexities of retrosynthetic pathways. Additionally, we explore the impact of small molecular augmentations that preserve chemical properties and employ similarity matching to enhance the assessment of prediction quality. We introduce SynFormer, a sequence-to-sequence model tailored to the SMILES representation. It incorporates architectural enhancements to the original transformer to tackle the challenges of chemical reaction prediction. SynFormer achieves a Top-1 accuracy of 53.2% on the USPTO-50k dataset, matching the performance of widely accepted models like Chemformer, but with greater efficiency because it eliminates the need for pre-training.
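To make the distinction between absolute accuracy and partial correctness concrete, the following is a minimal sketch and not the authors' implementation. It assumes each prediction and ground truth is a dot-separated SMILES string whose components are already canonicalized (in practice a cheminformatics toolkit would do this), and the function names `exact_match`, `partial_score` and `top_k_accuracy` are hypothetical. The graph-matching and similarity-based metrics mentioned in the abstract are not covered here.

```python
def split_reactants(smiles):
    """Split a dot-separated SMILES string into a set of molecule strings.

    Assumption: each component is already in canonical form, so plain
    string equality stands in for molecular identity.
    """
    return frozenset(part for part in smiles.split(".") if part)


def exact_match(predicted, reference):
    """Absolute correctness: same reactant set, independent of order."""
    return split_reactants(predicted) == split_reactants(reference)


def partial_score(predicted, reference):
    """Fraction of ground-truth reactants recovered by the prediction."""
    ref = split_reactants(reference)
    if not ref:
        return 0.0
    return len(split_reactants(predicted) & ref) / len(ref)


def top_k_accuracy(ranked_predictions, references, k=1):
    """Share of targets whose first k ranked predictions contain an exact match."""
    hits = sum(
        any(exact_match(p, ref) for p in preds[:k])
        for preds, ref in zip(ranked_predictions, references)
    )
    return hits / len(references)


if __name__ == "__main__":
    # One of the two ground-truth reactants is recovered.
    print(partial_score("CCO.CC(=O)Cl", "CCO.CC(=O)O"))
```

A prediction that misses one of two reactants fails `exact_match` but still receives partial credit, which is the kind of information a single Top-1 figure hides.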

Article information

https://doi.org/10.1039/D4DD00263F

Article type: Paper
Submitted: 15 Aug 2024
Accepted: 07 Feb 2025
First published: 18 Feb 2025

This article is Open Access

Digital Discovery, 2025, 4, 831-845

A. S. Tadanki, H. Surya Prakash Rao and U. D. Priyakumar, Digital Discovery, 2025, 4, 831 DOI: 10.1039/D4DD00263F

This article is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported Licence. You can use material from this article in other publications, without requesting further permission from the RSC, provided that the correct acknowledgement is given and it is not used for commercial purposes.
