Data undersampling models for the efficient rule-based retrosynthetic planning

Min Sik Park; Dongseon Lee; Youngchun Kwon; Eunji Kim; Youn-Suk Choi

doi:10.1039/D1CP03630K

Min Sik Park,*^a Dongseon Lee,^a Youngchun Kwon,^ab Eunji Kim^ac and Youn-Suk Choi^a

Author affiliations

* Corresponding authors

^a Autonomous Material Development Lab, Samsung Advanced Institute of Technology, Samsung Electronics, 130 Samsung-ro, Suwon, Gyeonggi-do 16678, Republic of Korea
E-mail: ms91.park@samsung.com

^b Department of Computer Science and Engineering, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul, Republic of Korea

^c School of Business Administration, Chung-Ang University, Seoul, Dongjak-gu 06974, Republic of Korea

Abstract

Computer-aided retrosynthetic planning for organic molecules, which relies on a large synthetic database, is a significant part of the recent development of autonomous robotic chemists. As in other AI fields, however, the class imbalance problem in the dataset affects the prediction performance of retrosynthetic paths. Here, we demonstrate that applying undersampling models to the imbalanced reaction dataset can improve the prediction of retrosynthetic templates for target molecules. Undersampling based on similarity clustering of the molecular structures of products improves the top-1 and top-10 prediction accuracies by 13.8% and 8.8%, respectively; the corresponding improvements are 13.1% and 6.9% for random undersampling and 5.4% and 2.4% for dissimilarity-based undersampling. These results demonstrate the importance of a deep understanding of the statistical distribution, internal structure, and sampling of the training dataset. For practical applications, a target-oriented undersampling model is proposed, and it is confirmed by improvements of 9.3% and 4.2% in the top-1 and top-10 accuracies, respectively.
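
As an illustration of the general idea of undersampling an imbalanced template dataset, the following minimal sketch caps the number of training records kept per template class by random selection. It is not the authors' implementation: the function name `random_undersample`, the `(product, template)` record layout, the `max_per_class` cap, and the toy data are all assumptions made for this example, and the clustering-based variants described in the abstract are not reproduced here.

```python
import random
from collections import defaultdict


def random_undersample(records, max_per_class, seed=0):
    """Keep at most `max_per_class` randomly chosen records per template class.

    `records` is a list of (product, template) pairs; this layout is a
    hypothetical stand-in for a reaction dataset, not the paper's format.
    """
    rng = random.Random(seed)  # local RNG so the result is reproducible

    # Group the records by their template label.
    by_class = defaultdict(list)
    for product, template in records:
        by_class[template].append((product, template))

    # Down-sample only the classes that exceed the cap.
    sampled = []
    for items in by_class.values():
        if len(items) > max_per_class:
            items = rng.sample(items, max_per_class)
        sampled.extend(items)
    return sampled


# Toy imbalanced dataset: template "T1" dominates the other two classes.
records = (
    [(f"mol{i}", "T1") for i in range(90)]
    + [(f"mol{i}", "T2") for i in range(90, 98)]
    + [(f"mol{i}", "T3") for i in range(98, 100)]
)
balanced = random_undersample(records, max_per_class=10)
```

In this toy setting the dominant class is reduced to the cap while the minority classes are left untouched; a similarity- or dissimilarity-based variant would replace the random choice with a selection driven by clusters of product structures.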

Article information

DOI: https://doi.org/10.1039/D1CP03630K
Article type: Paper
Submitted: 07 Aug 2021
Accepted: 08 Nov 2021
First published: 08 Nov 2021

Phys. Chem. Chem. Phys., 2021, 23, 26510-26518

M. S. Park, D. Lee, Y. Kwon, E. Kim and Y. Choi, Phys. Chem. Chem. Phys., 2021, 23, 26510 DOI: 10.1039/D1CP03630K