Machine Learning Models for Catalytic Asymmetric Reactions of Simple Alkenes: From Enantioselectivity Predictions to Chemical Insights
Abstract
The growing number of applications of machine learning (ML) in chemical catalysis has engendered considerable confidence in predicting reaction outcomes. Despite the successful application of ML to high-throughput experimentation (HTE) datasets, extension to the small real-world datasets prevalent in organic synthesis has remained more difficult, primarily because of their imbalanced and sparse distribution. Herein, we present a new chemical reaction dataset, curated from the published literature, that exhibits class imbalance (CI) with a skewness of −1.37. The reactions in focus belong to an important class of transition-metal-catalysed asymmetric transformations of alkenes, such as cyclopropanation, aziridination, and arylation. Such reactions are indispensable for the construction of three-membered structural motifs, versatile building blocks found in complex bioactive molecules. Recognizing the CI in the reaction outcome, measured in terms of enantiomeric excess (%ee), we employ the AttentiveFP-CI model to predict %ee. This class-imbalance-aware, graph-based model with an attention mechanism exhibits commendable performance, as evidenced by a root mean square error (RMSE) of 9.80±1.40. In an evaluation across various molecular representations of these reactions (OHE, fingerprints, SMILES, graphs) and ML algorithms (DNN, T5Chem, Transformer, MPNN), AttentiveFP-CI emerged as the best model, distinguished by its minimal overfitting (train-test RMSE difference of 3.59, compared to up to 5.40 for other CI-aware models). When extended to other important reaction datasets such as the N,S-acetylation, asymmetric hydrogenation of alkenes, and the USPTO, AttentiveFP-CI again gave improved predictions. Furthermore, attention visualization identifies key atoms and substructures contributing to high enantioselectivity, offering valuable chemical insights for planning the synthesis of new molecular targets.
Harnessing insights derived from ML models could serve as an efficient and cost-effective approach to expediting developments in asymmetric catalysis.
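The two summary statistics quoted in the abstract, the skewness of the %ee distribution and the RMSE of the predictions, can be illustrated with a minimal sketch. The abstract does not state which skewness estimator was used, so the population (Fisher-Pearson) form below is an assumption, and the %ee values are hypothetical placeholders rather than data from the paper.

```python
import math


def skewness(values):
    """Population (Fisher-Pearson) skewness: g1 = m3 / m2**1.5.

    Assumption: the abstract does not say which estimator it uses.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical %ee values: most reactions are highly selective, with a
# short tail of poorly selective ones, which gives a negative skew.
ee_observed = [98, 97, 96, 95, 95, 94, 92, 90, 85, 60, 30]

# Hypothetical model predictions for the same reactions.
ee_predicted = [96, 95, 97, 93, 94, 90, 91, 88, 80, 70, 45]

print("skewness:", round(skewness(ee_observed), 2))
print("RMSE:", round(rmse(ee_observed, ee_predicted), 2))
```

A negative skewness, as reported for the curated dataset, means that most %ee values sit near the high end while a minority of low-selectivity reactions forms a long tail. The train-test RMSE difference cited as an overfitting measure would be the difference between `rmse` evaluated on the test split and on the training split.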