Open Access Article
Ajnabiul Hoque,†a Nupur Jain,†a Divya Chenna,a and Raghavan B. Sunoj*ab
aDepartment of Chemistry, Indian Institute of Technology Bombay, Powai, Mumbai 400076, India. E-mail: sunoj@chem.iitb.ac.in
bCentre for Machine Intelligence and Data Science, Indian Institute of Technology Bombay, Powai, Mumbai 400076, India
First published on 7th March 2026
The increasing number of applications of machine learning (ML) in chemical catalysis has engendered considerable confidence in predicting reaction outcomes. Despite the successful applications of ML to high-throughput experimentation (HTE) datasets, extension to the small real-world datasets prevalent in organic synthesis has remained more difficult, primarily due to their imbalanced and sparse distribution. Herein, we present a new chemical reaction dataset curated from published literature that bears class imbalance (CI) with a skewness of −1.37. The reactions in focus belong to an important class of transition metal-catalysed asymmetric transformations of alkenes such as cyclopropanation, aziridination, and arylation. Such reactions are indispensable for the construction of three-membered structural motifs, versatile building blocks found in complex bioactive molecules. In cognizance of the CI in the reaction outcome, measured in terms of enantiomeric excess (% ee), we employ the AttentiveFP-CI model to predict % ee. This class-imbalance-aware graph-based model with an attention mechanism exhibits commendable performance, as evidenced by a root mean square error (RMSE) of 9.80 ± 1.40. Upon evaluation across various molecular representations of these reactions (OHE, fingerprints, SMILES, and graphs) and ML algorithms (DNN, T5Chem, Transformer, and MPNN), AttentiveFP-CI emerged as the best model, distinguished by its minimal overfitting (train-test RMSE difference of 3.59, compared to up to 5.40 for other CI-aware models). When extended to other important reaction datasets such as N,S-acetylation, asymmetric hydrogenation of alkenes, and USPTO, improved predictions could be obtained by using AttentiveFP-CI. Furthermore, attention visualization identifies key atoms and substructures contributing to high enantioselectivity, offering valuable chemical insights for planning the synthesis of new molecular targets. Harnessing insights derived from ML models could serve as an efficient and cost-effective approach for expedited developments in asymmetric catalysis.
There have been a good number of previous efforts aimed at predicting reaction outcomes. The predictive capabilities of quantum chemically derived molecular descriptors have been exploited to build bespoke linear regression models for catalytic reactions.8,9 Molecular descriptors such as charge, NMR chemical shifts, vibrational frequencies and intensities, Sterimol parameters, buried volumes, etc., bearing electronic and steric features of the participating molecules, have served as useful inputs for modelling reactions.10,11 The use of such descriptors comes with its own challenges, such as higher computational cost, particularly for complex molecular systems, and the requirement of annotation/curation by domain experts.12,13 Given these challenges, interest in alternative approaches has become increasingly prominent in the current literature. Modern ML algorithms, capable of handling complex and diverse reaction data, can offer promising solutions on this front.14,15 In practice, when one attempts reaction optimization by changing various controllable parameters (as described above), sparsely distributed reaction data become available. It is therefore of interest to examine whether such data present opportunities for applying suitable ML algorithms for reaction modelling.16–18 The key advantage of an early ML intervention is that it helps make an informed choice of substrates/catalysts/solvent during the reaction development phase.
The unprecedented growth in computational capabilities has rendered the application of ML to chemical reactivity problems increasingly feasible.19,20 Deployment of complex language models such as BERT for yield prediction has become possible in recent years.21,22 Hybrid graph neural networks operating on molecular graphs have been used to derive features for selectivity predictions in chiral phosphoric acid catalysed thiol addition to N-acyl imines.23 These contemporary ML models for reaction outcome prediction have offered robust performance on high-throughput experimentation (HTE) datasets.24–26 Analyses have shown that the HTE datasets used in many recent studies exhibited reduced variability in data quality, high internal consistency, and high fidelity.27,28 Another aspect of HTE settings is that exhaustive permutations of reactants/reagents are affordable under uniform reaction conditions. However, in real-life reaction development, only a few combinations of the reactants and the associated conditions can be practically explored. For example, the dataset sourced from the AstraZeneca electronic laboratory notebooks (ELNs) potentially encompassed approximately 470 M possible combinations of reactants. However, in practice, only 1000 reactions were experimentally examined, engaging 340 aryl halides, 260 amines, 24 ligands, 15 bases, and 15 solvents for the Buchwald–Hartwig reaction.29 In this context, we consider it highly timely to develop accurate ML models for small-sized reaction datasets with different distribution characteristics.
Recent years witnessed several successful applications of ML in predicting yields or enantioselectivities of various catalytic reactions such as Buchwald–Hartwig cross-coupling,24 Lewis base-catalysed propargylation,30 β-C–H activation,31 asymmetric hydrogenation,32 relay Heck,33 Negishi cross-coupling,34 and palladaelectro-catalyzed C–H annulation reactions.35 Needless to say, most of these studies are early examples of implementing deep learning (DL) methods and were confined to only a few reaction types, leaving out a large family of important asymmetric catalytic reactions. One of the key reasons for such exclusions from ML studies can be traced to the lack of good datasets. One such important catalytic asymmetric reaction that has not received attention is shown in Scheme 1a, which employs simple alkenes as the core substrate. Alkenes are abundant precursors that can participate in a wide array of reactions to provide valuable products. For example, under suitably chosen Cu/Pd catalytic conditions, alkenes can react with (a) diazoester to form cyclopropane, (b) aryl boronic acid to yield important 1,1-diaryl compounds, and (c) aliphatic or aromatic N-tosyloxycarbamates to access key structural motifs such as aziridines. This class of reaction holds promise as it can help synthesize stereochemically well-defined cyclopropanes and aziridines, which are key constituents in medicinal and agrochemical compounds.36–39 A few representative examples bearing these substructures are shown in Scheme 1b to convey the significance of these small ring containing molecules.40–43
Apart from the synthetic utility of this class of reactions, the use of one of the most widely found ligands, such as chiral bis(oxazoline), is to be taken cognizance of, as the applications of such chiral motifs go well beyond these reactions.44–46 The conformationally rigid framework of bis(oxazoline) metal chelates, bearing chiral centres close to the donor nitrogen atoms, can provide the desired chiral environment nearer to the catalytic site. The modular architecture of these ligands can help create desirable variations in both steric and electronic attributes, thus allowing fine-tuning of their catalytic activity for specific applications.47,48 It would therefore be of importance to identify the key regions in the chiral catalyst that impact the stereochemical outcome of such reactions, potentially using ML tools (vide infra).
Motivated by recent advancements in machine learning approaches for reaction outcome prediction,49–53 including contributions from our laboratory,31–33 we became interested in the catalytic enantioselective reactions of alkenes shown in Scheme 1a.54,55 The availability of reliable predictive ML models can help identify optimal reactant triads comprising the alkene, chiral ligand, and substrate that are likely to offer higher % ee. Such ML models might help reduce the typical timelines involved in reaction discovery. Given these motivations, we set the following major objectives in this work: (a) evaluation of the effectiveness of DL methods for enantioselectivity predictions in transition metal-catalysed asymmetric reactions of alkenes, (b) identification of an optimal featurisation strategy from among One-Hot Encoding (OHE), molecular fingerprints, SMILES, and graph representations, (c) addressing the issues associated with data imbalance, wherein more samples lie in the high % ee region, by implementing a cost-sensitive training loss, (d) identification of better combinations of reactants (alkene, chiral ligand, and substrate) that are likely to offer superior reaction outcomes, and (e) examination of the learning ability of DL models by using the attention mechanism to identify the critical regions in chiral ligands and substrates that can influence the reaction outcomes. Utilization of a trained DL model can streamline and help expedite the reaction discovery pipeline by identifying and eliminating low-selectivity reactions in the initial screening, thereby saving time and effort.
Fig. 1 (a) Details of various substituents present in the individual reacting partner; (b) yield distribution in the ART dataset.
The chemical space spanned by the ART dataset is very sparse compared to the combinatorial possibilities arising from the number of reactions between the compatible partners. For instance, the cyclopropanation subset has 130 examples (67 catalysts, 10 substrates, and 23 alkenes), the aziridination class comprises only 91 reactions (14 catalysts, 13 substrates, and 44 alkenes), and the arylation class contains as many as 155 reactions (19 catalysts, 50 substrates, and 55 alkenes). Given the possible combinations between the reactants, the theoretically possible reactions for cyclopropanation, aziridination, and arylation number 15,410, 8008, and 52,250, respectively, totaling 75,668. However, the ART dataset contains only 376 experimentally known reactions from among these combinations, indicating a highly sparse distribution. Furthermore, the distribution of the % ee values is also skewed towards the high ee region (Fig. 1b).67,68 The diversity of chemical structures, skewed distribution of reaction outcomes, and sparsity in the dataset together make ML model building rather challenging.
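The sparsity figures above follow directly from the component counts; a quick sketch of the arithmetic, using the catalyst/substrate/alkene counts reported for each subset:

```python
# Component counts reported for the ART dataset:
# (catalysts, substrates, alkenes) per reaction class.
subsets = {
    "cyclopropanation": (67, 10, 23),
    "aziridination": (14, 13, 44),
    "arylation": (19, 50, 55),
}

# Theoretical reaction count per class is the product of the components.
combos = {name: c * s * a for name, (c, s, a) in subsets.items()}
total = sum(combos.values())
observed = 376  # experimentally reported reactions in the ART dataset

print(combos)
print(total)
print(f"coverage: {observed / total:.4%}")
```

The observed 376 reactions cover well under 1% of the theoretically possible combinations, which quantifies the sparsity discussed above.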
The ART dataset is divided in a 70:10:20 ratio for training purposes. We have conducted hyperparameter tuning on the validation set, based on the criterion of achieving the lowest mean validation loss, and then employed the optimal hyperparameters for prediction on the test set. To mitigate potential bias due to sample distribution while creating the train-validation-test splits (70:10:20), 30 independent runs with randomised splits were considered. The model performance is reported as the average root mean squared error (RMSE) on the test sets over these runs.
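The split-and-average protocol described above can be sketched as follows; `train_and_predict` is a placeholder for any of the models discussed, and the function names and toy usage are illustrative rather than taken from the paper:

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean squared error over paired observations."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def evaluate(dataset, train_and_predict, n_runs=30, split=(0.7, 0.1, 0.2), seed=0):
    """Average test RMSE over independent randomised 70:10:20 splits.

    dataset is a list of (features, label) pairs; train_and_predict takes
    (train, val, test) and returns predictions for the test samples."""
    rng = random.Random(seed)
    n = len(dataset)
    n_train, n_val = int(split[0] * n), int(split[1] * n)
    scores = []
    for _ in range(n_runs):
        idx = list(range(n))
        rng.shuffle(idx)
        train = [dataset[i] for i in idx[:n_train]]
        val = [dataset[i] for i in idx[n_train:n_train + n_val]]
        test = [dataset[i] for i in idx[n_train + n_val:]]
        preds = train_and_predict(train, val, test)
        scores.append(rmse([y for _, y in test], preds))
    mean = sum(scores) / len(scores)
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
    return mean, std
```

Reporting the mean and standard deviation over the 30 runs, as done here, guards against a single fortunate split flattering the model.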
Since the model focuses on individual atoms, each atom combines features of its neighbouring atoms and the bonds connecting them to form its initial state vector h_i^0. These initial vectors are further embedded through r stacked attentive layers, allowing atoms to aggregate relevant “messages” from their neighbourhood. This step is expected to capture the nuances of atomic local environments by effectively propagating node information over various distances. For molecule-level embedding, the entire molecule is treated as a super virtual node (V), connected to every atom, and embedded using the attention mechanism shown in Fig. 2. This process, over t stacked layers, produces a state vector h_v^t for the whole molecule. In this mechanism, the first step is to concatenate the state vector of the virtual node (h_v) with that of each connected node (h_i), followed by a linear transformation (W) and a nonlinear activation (LeakyReLU) to produce a_vi. This a_vi is then normalised using a softmax function over the neighbour nodes, resulting in α_vi, which captures the importance (weight) of each neighbour node to the virtual node. These attention weights are then passed through the message and readout functions to obtain the final state vector h_v^t, which encodes structural information about the molecular graph. Finally, h_v^t is fed through a fully connected layer (FCL) for the regression task.
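A minimal numpy sketch of the virtual-node attention step described above; the weight shapes, the plain blended state update (AttentiveFP itself uses a GRU), and all tensor values here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)


def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def virtual_node_readout(h_v, h_atoms, w_att, w_msg):
    """One attention step of the super-virtual-node readout.

    h_v     : (d,)   current state of the virtual node
    h_atoms : (n, d) atom state vectors (every atom connects to the virtual node)
    w_att   : (2d,)  scoring vector applied to [h_v ; h_i] (illustrative shape)
    w_msg   : (d, d) message transformation
    """
    # a_vi = LeakyReLU(W . [h_v ; h_i]) for each neighbour atom i
    scores = np.array([leaky_relu(w_att @ np.concatenate([h_v, h_i]))
                       for h_i in h_atoms])
    alpha = softmax(scores)  # alpha_vi: normalised attention weights
    message = (alpha[:, None] * (h_atoms @ w_msg.T)).sum(axis=0)
    # Simple blended update stands in for the GRU update of AttentiveFP.
    h_v_next = 0.5 * h_v + 0.5 * np.tanh(message)
    return h_v_next, alpha


d, n = 4, 6
h_v1, alpha = virtual_node_readout(
    rng.normal(size=d), rng.normal(size=(n, d)),
    rng.normal(size=2 * d), rng.normal(size=(d, d)))
print(alpha.sum())  # sums to 1 up to floating-point error
```

Stacking t such steps yields the molecule-level state h_v^t that is passed to the fully connected regression layer.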
We have optimised the model using the Adam optimization algorithm, in conjunction with Bayesian optimization (BO) for model-specific parameters such as the number of graph layers for atom embedding, the number of time steps for molecule embedding (denoted respectively using r and t in Fig. 2), the graph feature size, the dropout rate, and optimizer parameters such as the learning rate. Tuning hyperparameters can be challenging, as they govern the estimation of the model parameters but are not themselves learned during training. Hence, BO, as implemented in the Optuna Python package,71 is used to optimize the model-specific parameters. The optimal sets of validation hyperparameters for all 30 runs are provided in Section 2.5 in the SI.
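The paper uses Bayesian optimization via Optuna; as a dependency-free stand-in, the same objective interface can be sketched with random search over the hyperparameters named above (the search ranges and the synthetic validation loss are assumptions for illustration):

```python
import random


def validation_loss(params):
    """Stand-in for training AttentiveFP and returning the mean validation
    loss; the quadratic form below is purely synthetic."""
    return ((params["lr"] - 3e-3) * 1e3) ** 2 + 0.1 * params["num_layers"] + params["dropout"]


def random_search(n_trials=50, seed=0):
    """Minimise validation_loss over a hyperparameter space; Optuna's BO
    (TPE) explores the same space more sample-efficiently."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        params = {
            "lr": 10 ** rng.uniform(-4, -2),   # log-uniform learning rate
            "num_layers": rng.randint(1, 5),   # r / t style depth choices
            "dropout": rng.uniform(0.0, 0.5),
        }
        loss = validation_loss(params)
        if best is None or loss < best[0]:
            best = (loss, params)
    return best


loss, params = random_search()
print(loss, params)
```

Selecting hyperparameters on the validation loss and only then scoring the untouched test set, as in the protocol above, keeps the reported test RMSE honest.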
In addition to examining the effects of featurisation on the ART dataset, we aim to compare the performance of the AttentiveFP model with that of some of the state-of-the-art (SOTA) DL models commonly used in enantioselectivity predictions. We have conducted comprehensive evaluations of various DNN models employing both OHE and molecular FPs.75 Additionally, we explored transformer-based language models such as Transformer,76 ULMFiT,77 and T5Chem,78 which utilize reaction SMILES, and graph neural networks like MPNN79 and AttentiveFP, which leverage graphs for reactant featurisation. Each combination between a DL model and featurization is then evaluated on the basis of the corresponding training and test RMSEs (see Section 2 in the SI for more details).
The model performances compiled in Table 1 highlight the effect of different featurization techniques on the ART dataset. With OHE, the test RMSE of the DNN is found to be 14.43 ± 3.05 (the details of the DNN architecture are given in Section 2.2 in the SI). These OHE-based models serve as a statistical probe, offering an internal performance baseline for models built using chemically meaningful descriptors. In contrast to OHE, models that utilize FPs offered improved performance.71,80 It is worth noting that although the fingerprint-based DNN model exhibited a lower test RMSE (9.55 ± 1.31) than AttentiveFP, its larger gap with the training RMSE (5.54 ± 1.50) suggests overfitting; hence it should be treated with caution in out-of-bag situations. The use of SMILES representations in conjunction with advanced DL architectures, including T5Chem, ULMFiT, and Transformer, yielded slightly higher test RMSEs of 10.83 ± 1.73, 11.30 ± 1.30, and 12.26 ± 2.02, respectively.81 The graph-based MPNN model showed a high test RMSE of 11.00 ± 2.22.82 Thus, despite not being the top performer in terms of the lowest test RMSE, the balanced performance of AttentiveFP suggests that it is a robust model with lower susceptibility to overfitting than the other models considered here.83 In addition to this good performance, the graph attention mechanism inherent to AttentiveFP, which allows for chemically meaningful interpretability (vide infra), made it our primary framework for the ee prediction task.84
| Featurization | Model | Training RMSE | Test RMSE |
|---|---|---|---|
| OHE | DNN | 3.67 ± 1.94 | 14.43 ± 3.05 |
| Fingerprint | DNN | 5.54 ± 1.50 | 9.55 ± 1.31 |
| SMILES | T5Chem | 6.74 ± 0.39 | 10.83 ± 1.73 |
| SMILES | ULMFiT | 10.94 ± 0.51 | 11.30 ± 1.30 |
| SMILES | Transformer | 5.28 ± 1.18 | 12.26 ± 2.02 |
| Graph | AttentiveFP | 7.41 ± 1.77 | 10.56 ± 1.86 |
| Graph | MPNN | 8.01 ± 1.18 | 11.00 ± 2.22 |

a The datasets are randomly divided into 70:10:20 training, validation, and test sets.
Herein, we propose a model customized for class imbalance, namely AttentiveFP-CI. Unlike the conventional Mean Squared Error (MSE) loss, our model incorporates a class imbalance loss, assigning different weights to training samples based on their actual ee values (Fig. 3). The idea is to reduce the influence of the majority class samples while prioritising the more challenging minority class instances during training.85 We have examined the effect of using different class boundaries, from 30 to 60, including boundaries placed at statistically important points of the dataset such as the mean (µ) value of 76 and µ − σ of 54. In most cases, the AttentiveFP-CI models performed better than the AttentiveFP model, as evident from the corresponding test RMSEs (see Tables S61–S69 in the SI). The model with a class boundary of 30 achieved the best test RMSE of 9.80 ± 1.40 compared to the other class boundaries considered.86 Moreover, a t-test resulted in a p-value < 0.05, indicating that the gain in performance is statistically significant compared to the model without the CI loss (the test RMSE for AttentiveFP is 10.56 ± 1.86). Incorporation of the CI-aware loss into other DL models also improved the respective test RMSEs, except in the case of ULMFiT.87 A comparison between different deep learning architectures reveals that AttentiveFP-CI outperforms Transformer-CI and ULMFiT, with p-values < 0.05 endorsing the statistical significance.88 However, most of these models tend to exhibit overfitting, evident from the train-test RMSE differences: MPNN-CI (4.59), T5Chem-CI (5.40), and DNN-CI (3.93), as opposed to AttentiveFP-CI (3.59).89 Additionally, the number of model parameters in AttentiveFP-CI is on the order of 1.93 M, far fewer than in T5Chem-CI (14.71 M), assuring us of better computational scalability.
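The cost-sensitive loss can be sketched as a weighted MSE in which samples below the class boundary (the low-ee minority) receive a larger weight; the default inverse-frequency weighting below is a common choice, and the exact scheme used in the paper may differ:

```python
def ci_weighted_mse(y_true, y_pred, boundary=30.0, minority_weight=None):
    """Cost-sensitive MSE that up-weights minority-class (low-ee) samples.

    If minority_weight is None, it defaults to the majority/minority count
    ratio, so that rare low-ee samples contribute comparably to the loss."""
    minority = [y < boundary for y in y_true]
    if minority_weight is None:
        n_min = max(sum(minority), 1)
        n_maj = max(len(y_true) - sum(minority), 1)
        minority_weight = n_maj / n_min
    weights = [minority_weight if m else 1.0 for m in minority]
    total = sum(w * (t - p) ** 2 for w, t, p in zip(weights, y_true, y_pred))
    return total / len(y_true)
```

With equal per-sample errors, the single low-ee sample below contributes three times the penalty of each high-ee sample, which is the intended effect of the CI-aware training.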
Given the lower RMSE, reduced overfitting, and computational efficiency, AttentiveFP-CI stands out as the optimal choice from among all the SOTA models for % ee predictions on the ART dataset.90
Efforts were also expended to assess whether the performance issues could be traced to the sparse and imbalanced distribution in the ART dataset. We have compared the model performances on more balanced and denser datasets such as Buchwald–Hartwig Amination (BHA), which is a catalytic transformation of high practical utility. The high throughput experimental (HTE) dataset of the BHA reaction, denoted as BHA-HTE, is a commonly used dataset for baseline comparisons for yield prediction tasks.24 BHA-HTE comprises 3955 labeled reactions and their corresponding experimentally measured yields. The AttentiveFP model offered a good test performance, with an RMSE of 6.49 ± 0.33 and a coefficient of determination (R2) of 0.94 ± 0.01 (see Table S72 in the SI), surpassing the previously reported R2 of 0.92 obtained using physical-organic descriptors.29
In the present context, we have randomly sampled the full BHA-HTE dataset to create a few sparser subsets, denoted as BHA-LTE (low throughput), each containing about 500 reactions. The idea is to induce skewness so as to produce an imbalance in labels such that the distribution (µ and σ) in the BHA-LTE subsets resembles that of the ART dataset. These subsets are then employed for evaluating the baseline performance of the various deep learning models considered in this study.91 In general, the BHA-LTE subsets have a µ of 75 and a σ of 14 (see Fig. S1 in the SI), similar to those of our ART dataset (µ = 76 and σ = 22). When the AttentiveFP model was trained on these subsets bearing an induced sparse distribution, the test RMSE worsened from 6.49 ± 0.33 with the original BHA-HTE dataset (see Table S72 in the SI) to 9.14 ± 0.80 (or higher, depending on the BHA-LTE subset used). The lower performance of the same AttentiveFP model can be attributed to the induced sparse distribution and CI in the BHA-LTE subsets. Interestingly, inclusion of the CI loss with a class boundary of 50 improved the test RMSE to 8.70 ± 0.52. This test performance is analogous in quality to the predictions of the same model on the ART dataset, which bears comparable distribution characteristics (µ of 76 and σ of 22). Additional details on the model performance with class boundaries spanning 30 to 70, and with a µ of 75, are provided in Section 3 in the SI. The improvement in RMSE (9.14 ± 0.80 for AttentiveFP versus 8.70 ± 0.52 for AttentiveFP-CI) can be attributed to the use of a customized loss function to mitigate the CI issue.
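One way to induce such a skewed, BHA-LTE-style subset is to resample the full dataset with a Gaussian kernel centred on the target mean; this recipe is an assumption for illustration, as the paper does not spell out its resampling procedure:

```python
import math
import random


def sample_skewed_subset(yields, size=500, mu=75.0, sigma=14.0, seed=0):
    """Draw a subset biased toward a target mean/std by weighting each
    reaction with a Gaussian kernel on its yield label.

    Returns sorted indices into the original list; repeated draws are
    deduplicated, giving a weighted sample without replacement."""
    rng = random.Random(seed)
    weights = [math.exp(-((y - mu) ** 2) / (2 * sigma ** 2)) for y in yields]
    idx = list(range(len(yields)))
    chosen = set()
    while len(chosen) < min(size, len(yields)):
        chosen.add(rng.choices(idx, weights=weights, k=1)[0])
    return sorted(chosen)
```

Because the kernel sits near the top of the 0-100 yield range, the resulting subset is left-skewed, mimicking the high-µ, truncated distribution of the ART labels.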
A similar performance trend is conspicuous in our ART dataset as well. For instance, the test RMSE of the AttentiveFP-CI model (with a class boundary of 30) is found to be 9.80, compared to 10.56 obtained with the AttentiveFP model without the CI loss. On the basis of the model performance, with and without the inclusion of the CI loss, on the ART and BHA-LTE datasets, we conclude that the data sparsity is primarily responsible for the higher RMSEs. These insights would be valuable in developing suitable deep learning models with customized loss functions for chemical reaction datasets bearing skewed distributions.
An alternative for imbalanced and sparsely distributed chemical datasets is to consider a two-step model,92 wherein the samples are first classified on the basis of a predefined class label. Subsequently, separate regressors are developed for the major and minor classes. This approach, termed classification-followed-by-regression (CFR), is likely of relevance to the ART dataset.93 In the first step, reactions are classified as ‘major’ or ‘minor’ using a statistically meaningful class boundary set at (µ − σ) = 54 % ee.94 We found that a hyperparameter-optimized, custom-built DNN classifier could achieve a very good accuracy of 0.98 ± 0.003.95 The reactions thus classified are employed in training two separate AttentiveFP regression models, one for the major class and the other for the minor class. The AttentiveFP regressors achieved test RMSEs of 10.74 ± 1.98 for the major class and 8.73 ± 0.90 for the minor class, outperforming our direct regression in the case of minor class reactions (i.e., reactions with a true label of less than 54 % ee).96 To ensure a balanced assessment of the overall model performance, we have also considered the use of a weighted RMSE, which accounts for class imbalance by combining error contributions in proportion to the sample sizes.97 For the AttentiveFP model, the weighted RMSE is 8.97 ± 1.28, which is poorer than our AttentiveFP-CI with a class boundary of 30 (RMSE = 9.80 ± 1.40; R2 = 0.80 ± 0.05) (see Table S64 in the SI). Similarly, an ULMFiT regression model showed test RMSEs of 10.15 ± 2.26 and 8.48 ± 1.26, respectively, for the major and minor classes, with a corresponding weighted RMSE of 8.75 ± 1.05 (R2 = 0.40 ± 0.29).98 Since no significant improvement is found with the CFR-major and CFR-minor classes, our original AttentiveFP-CI, with its interpretable characteristics, can be considered the more appropriate model for the ART dataset.
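The two-step CFR pipeline can be sketched with pluggable components; the lambda models in the test usage are trivial stand-ins for the DNN classifier and the two AttentiveFP regressors used in the paper:

```python
class CFR:
    """Classification-followed-by-regression: route each sample to the
    regressor trained for its predicted class."""

    def __init__(self, classify, regress_major, regress_minor):
        self.classify = classify              # features -> "major" | "minor"
        self.regress_major = regress_major    # features -> predicted % ee
        self.regress_minor = regress_minor

    def predict(self, features):
        if self.classify(features) == "major":
            return self.regress_major(features)
        return self.regress_minor(features)
```

With a near-perfect classifier, the overall error is essentially the sample-size-weighted combination of the two regressors' errors, which is what the weighted RMSE above summarises.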
AttentiveFP-CI showed good performance in predicting % ee, achieving a test RMSE of 9.80 ± 1.40. Importantly, the small difference between the training and validation RMSEs suggests lower overfitting, which bodes well for model generalizability when predicting on unseen samples. Across the 30 independent runs, the model predicts % ee thousands of times for the 76 reactions present in the test set, since every reaction gets predicted multiple times whenever it appears in the test set. A comparison of the predicted % ee with the experimentally known ground truth values revealed good correlation, as shown in Fig. 4. In fact, ∼87% of the predictions remain within 15 units of the actual values (Fig. 4a). In the optimal run with an RMSE of 8.2 % ee, as many as 70 out of 76 test samples are predicted well within an error limit of 15 units with respect to the corresponding true values (Fig. 4b). In a typical run (RMSE = 10.1), only 12 out of the 76 samples incurred prediction errors in excess of 15 units (Fig. 4c). The parity plot also conveys a good correlation between the % ee predicted by the AttentiveFP-CI model and the corresponding experimental values, with an R2 of 0.84 (Fig. 4d).99 These reassuring findings highlight the efficacy of AttentiveFP-CI in learning from the sparse ART dataset for catalytic asymmetric reactions of alkenes.
To evaluate the learning ability of an ML model and to examine its robustness, control experiments are required. For this purpose, the dependence of the model performance on the quality of the input data is assessed using techniques such as Y-scrambling. We created a straw model of AttentiveFP-CI, which intentionally breaks the potential connections between the input features and the output variable. Here, each sample is incorrectly mapped to an output value belonging to some other sample within the dataset. The considerably worse test RMSE of 25.2 ± 2.1 obtained with the Y-scrambled run shows that, in the genuine training campaigns, the model learns from the true features it is provided with. The inferior performance of the straw model thus highlights the effectiveness of AttentiveFP-CI in learning chemically meaningful aspects of the catalytic reactions investigated in this work (vide infra).
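Y-scrambling amounts to permuting the labels before training, severing the feature-output mapping while keeping the label distribution intact; a minimal sketch:

```python
import random


def y_scramble(labels, seed=0):
    """Return the labels in a shuffled order, breaking the feature-label
    mapping while preserving the label distribution exactly."""
    scrambled = list(labels)
    random.Random(seed).shuffle(scrambled)
    return scrambled
```

A model retrained on `y_scramble(y)` should perform far worse than one trained on the true labels (here, 25.2 versus 9.80 RMSE), confirming that the original model exploits genuine structure rather than memorising the label distribution.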
From a perusal of the attention map shown in the inset of Fig. 5a and the bar plot in Fig. 5b for the reaction involving catalyst-1 (a pyridine-oxazoline catalyst), it can be seen that the Cu centre and the pyridine N with its two nearest C atoms make positive contributions, while all other atoms or substructures make relatively small negative or negligible contributions to the % ee. In the case of catalyst-2, the side arm (SA) on the bridge carbon of the bis(oxazoline) ligand contributes positively to high % ee, consistent with the trends observed for this family of ligands.101 It is further evident that the other positive contributors to the reaction outcome are (i) the SA on the chiral ligand, (ii) the transition metal-bound triflate ligand, (iii) the styrenyl double bond, and (iv) one of the carbon atoms of the cyclohexyl diazo compound. These positive attention values are suggestive of their synergistic role in the enantioselectivity of the cyclopropanation reaction. A key implication of this analysis is that installation of suitable substituents on the SA group could be key to achieving enhanced enantioselectivity. This prediction by the model is chemically meaningful and intuitively appealing, as it aligns with the fact that most variants of the reported bis(oxazoline) ligands rely on modifications of the SA.101
After visualizing attention for two representative examples, we analysed the global effects of the critical regions/atoms that likely contribute significantly to the quality of the % ee prediction. To accomplish this, it is essential to identify a common region present in each reaction partner across all the samples. Fig. 5b highlights such shared regions in all the reaction components, along with their atom numbering. The steps involved in estimating the effect of each atom are as follows: first, the attention values for each atom in the shared region are extracted using the corresponding SMARTS pattern. Second, the variance of these attention values is plotted, since variance captures the most significant and informative variations in the data and is therefore useful for assessing feature importance.102 Interestingly, the bar diagram shown in Fig. 5b indicates a higher importance of the chiral ligand (atom numbers are given with L in parentheses to denote the chiral ligand) as compared to the reactants such as the alkene and other substrates. It is gratifying to see that our attention-based model deciphered chemically intuitive patterns, identifying the chiral ligand as the most relevant contributor to asymmetric induction. The variances in the attention values exhibited by the atoms of the chiral ligands are found to be much higher than those of the alkene and other substrates. This observation is in line with one's chemical intuition that chiral ligands play a pivotal role in transferring the chiral information to the developing product.103 Notably, the chiral carbon centre in the ligand, denoted as [C*8(L)], exhibits the highest variance in attention, corroborating the domain knowledge that the substituents at this centre largely influence enantioselectivity.
Additionally, the carbon atom near the bridge or SA [C1(L)] shows the second-highest variance, suggesting that modifications to this atom, where branching from the bridge carbon begins, could help in fine-tuning the reaction outcome. The identification of these atoms as important ones indicates that AttentiveFP-CI accurately captures the relationship between the molecular factors and the desired outcome. Exploiting this protocol by fine-tuning these key features, particularly during reaction development, or while expanding the scope of this reaction family, could prove advantageous.
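The per-atom importance described above reduces to a column-wise variance over the attention values extracted for the shared-region atoms; a small sketch (the toy attention matrix is illustrative):

```python
import statistics


def attention_importance(attn_by_sample):
    """Per-atom variance of attention values across reactions.

    attn_by_sample: one row per reaction, each holding the attention values
    extracted (via the shared SMARTS pattern) for the shared-region atoms in
    a fixed order. A high-variance column marks an atom whose attention is
    sensitive to its changing substituents or local environment."""
    columns = zip(*attn_by_sample)
    return [statistics.pvariance(col) for col in columns]
```

An atom whose attention barely varies across the dataset (variance near zero) receives consistently similar weights, whereas a high-variance atom, such as the chiral centre [C*8(L)], is the kind of handle worth modifying during reaction development.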
It is worth noting that although these ligands were previously reported, their use in copper-catalyzed arylation reactions involving silylketenes and diaryliodonium salts remains unexplored (Fig. 6a). Thus, the application of these ligands to such alkene–substrate pairs is novel, even when the reaction conditions are retained.63–65 A heatmap representation of the predicted % ee for different chiral ligands, shown in Fig. 6b, conveys the significance of fast (virtual) screening of chiral ligands using our regression model. It can be seen (top left) that a ligand with only one of the oxazoline rings being chiral is predicted to exhibit a lower % ee of 65. Interestingly, the attention analysis identifies the groups on the side arm at the bridge carbon (SA) and on the chiral carbon of the bis(oxazoline) ligand as the dominant contributors to % ee. In light of these attention patterns, we considered two representative variants of the bis(oxazoline) family of ligands, derived from (S,S)-Ph-Box, for further illustration, as shown in Fig. 6b. One of these ligands is obtained by replacing the Ph group at the chiral carbon with 4-tBu-Ph, and the other by replacing the 1,1-dimethyl on the side arm with a 1,1-diisopropyl group. Both of these ligands are predicted to show high % ee. More importantly, the higher attention values noted for the 1,1-diisopropyl and Ph regions (green colour contours) indicate their positive contribution to enantioselectivity. These results are indicative of how an attention-based approach could be utilized in catalyst design for asymmetric reactions.
For this dataset, we retained the training, validation, and test protocol used before (keeping 10% of the samples each for validation and testing), along with a µ-based class boundary of 0.98. Our model achieved a test R2 of 0.90 ± 0.02 and an RMSE of 0.21 ± 0.02.111 These results are comparable to those of the SEMG-MIGNN (R2 = 0.915; RMSE = 0.197)23 and ChemAH (R2 = 0.918; RMSE = 0.209).112 In the regression setting, AttentiveFP-CI achieves a good predictive performance, with a test RMSE of 8.06 ± 1.00 (R2 of 0.92 ± 0.10) on the 1027 N,S-acetylation reactions, which is a statistically significant improvement over the corresponding AttentiveFP model devoid of the CI loss.113a In the case of asymmetric hydrogenation, although the performance of AttentiveFP-CI, as indicated by an RMSE of 10.48 ± 1.10 (R2 of 0.60 ± 0.17), is good, the gain compared to the AttentiveFP model is not statistically significant.113b Notably, for the N,S-acetylation reactions, AttentiveFP-CI significantly outperformed an often-used baseline, the ULMFiT model (p = 0.0036), which gave a test RMSE of 8.88 ± 1.03.114 Although the baseline ULMFiT provides better predictive accuracy in the hydrogenation case (test RMSE of 8.56 ± 1.46), it lacks the interpretability afforded by AttentiveFP-CI.
In addition to the enantioselectivity predictions on three important chemical reaction datasets, the utility of the AttentiveFP-CI framework is also evaluated for yield prediction on the USPTO (grams) reaction dataset.21 This dataset comprises 1.9 × 10⁵ reactions, each annotated with the corresponding yield. The yield distribution exhibits a skewness of −0.86, indicating the presence of CI in the USPTO dataset. Furthermore, the yield values in this dataset are scaled to the interval 0 to 1. In light of this skewed distribution, we used the AttentiveFP-CI model with a statistically relevant class boundary of µ + σ (0.94) and noted a test RMSE of 0.20 ± 0.01 and a marginally better R² of 0.08 ± 0.00, as compared to AttentiveFP without the CI consideration (test RMSE = 0.21 ± 0.01; R² = 0.04 ± 0.01).115 However, the t-test returned a p-value > 0.05, indicating that the numerical improvement is not statistically significant for the USPTO dataset. Notably, the performance of AttentiveFP-CI is comparable to the previously reported RMSE of 0.195 obtained using a more complex transformer-based model on the same dataset.116 Overall, these results indicate that AttentiveFP-CI could be useful in addressing CI issues in chemical datasets, even if its performance does not always surpass state-of-the-art benchmarks.
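The skewness and µ + σ class boundary quoted above can be computed directly from the yield values. The sketch below uses population moments as a generic illustration; it is not the paper's preprocessing code.

```python
import statistics

def skewness(values):
    """Moment-based (Fisher) skewness: E[(x - mu)^3] / sigma^3."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    return sum((v - mu) ** 3 for v in values) / (len(values) * sigma ** 3)

def mu_plus_sigma_boundary(values):
    """Class boundary at mu + sigma, as used for the scaled USPTO yields."""
    return statistics.mean(values) + statistics.pstdev(values)
```

A negative skewness (e.g., −0.86 for the scaled USPTO yields) flags a left-heavy distribution in which high-yield reactions dominate, which is what motivates placing the class boundary at µ + σ.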
Similar to the global attention analysis employed earlier in this manuscript, we have visualized the attention weights using the computed variance in the attention values of atoms within the shared region, as shown in Fig. 7, for both the asymmetric N,S-acetylation and asymmetric hydrogenation reactions.100,117–119 The variance in the attention values of the shared-region atoms of the catalysts and substrates across different samples (i.e., reactions in the dataset) can be considered a measure of the sensitivity of the reaction outcome to the environment of such atoms. Therefore, this analysis might help decipher how changes in the local substituents are likely to influence the enantioselectivity. A relatively large variance implies that the attention on such atoms changes strongly, which might stem from changes in their substituents or local environment, whereas a modest variance indicates that the atom concerned consistently receives similar attention weights. Interestingly, in the context of the reactions in our ART dataset, we notice that the reactive positions on the alkene and the vital regions around the chiral centre exhibit high variances in their attention. These variances, both in the ART reactions and in the case of asymmetric hydrogenation, are in line with our chemical intuition, as the alkene is the key substrate undergoing the reaction. Since this analysis collectively reveals mechanistically valuable insights consistent with the domain knowledge on the origin of enantioselectivity catalyzed by axially chiral ligands,101,120,121 we consider the AttentiveFP-CI model to be meaningfully interpretable.
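The variance analysis described above reduces to collecting each shared-region atom's attention weight across reactions and ranking atoms by that variance. The per-reaction dictionaries below are an assumed data layout for illustration, not the model's actual output format.

```python
from statistics import pvariance

def rank_atoms_by_attention_variance(attn_per_reaction):
    """Rank shared-region atoms by the variance of their attention weights.

    `attn_per_reaction` is a list of dicts, one per reaction, mapping an
    atom label (e.g., 'C1(A)', 'O/P2(L)') to its attention weight; this
    layout is an illustrative assumption.
    """
    atoms = set().union(*attn_per_reaction)
    variances = {
        atom: pvariance([d[atom] for d in attn_per_reaction if atom in d])
        for atom in atoms
    }
    # High-variance atoms are candidate hotspots for selectivity tuning.
    return sorted(variances.items(), key=lambda kv: kv[1], reverse=True)
```

Atoms whose attention barely changes across reactions fall to the bottom of the ranking, consistent with the interpretation that the reaction outcome is insensitive to their local environment.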
In the asymmetric hydrogenation reaction, the atoms belonging to the imine or alkene (e.g., C/N1(S)) exhibited consistently higher variance in attention, conveying that they might play a key role in enantioselectivity. While the atoms in the chiral ligand showed relatively lower variance than those of the substrate, the P/O centre in BINOL- and BINAP-derived ligands (e.g., O/P2(L) or O/P22(L)) exhibits the maximal attention variance. This is an interesting aspect, in line with the chemical intuition that these positions in the chiral BINOL/BINAP frameworks are expected to influence the effectiveness of enantioinduction. Thus, suitable substitution at the ortho positions of the binaphthyl ring can potentially impact the enantioselectivity. For the N,S-acetylation reaction, high attention variance is found across all three components: the thiol (C3(S)), the imine (C1(A)), and the ligand (C5(L)), reflecting the impact of local substituent changes on enantioselectivity. Notably, within the ligand, the attention values of atoms C5(L) and C6(L), which bridge the biphenyl units, showed substantial variance, linked to the dihedral angle fluctuations (e.g., C4–C5–C6–C7) that can modulate the chiral environment. Additionally, the variance in attention scores suggests that the meta positions on the biphenyl rings (C19(L) and C10(L)) are important sites where a change in substituents is likely to influence the reaction outcome. Thus, the attention variance analysis suggests that introducing suitable substituents at the high-variance sites (e.g., the P/O centres and dihedral-sensitive atoms) might shift the enantioselectivity. These findings offer a firm basis for the interpretability of AttentiveFP-CI by identifying hotspots for enantioselectivity tuning. The results could become useful for rational catalyst design and for making an informed choice of substrates during reaction scope investigations in asymmetric catalysis.
Visualization of the atomic attention weights could identify the pivotal regions in the reaction partners, such as the chiral centre, as a high-attention spot in the chiral catalyst. Similarly, critical atoms/substructures in the reactant(s) are identified as important contributors to high enantioselectivity. Thus, AttentiveFP-CI not only serves as a good predictive model but also offers chemically meaningful insights for reaction optimization. This method can therefore pave the way for informed ligand design and reaction development, as exemplified by the identification of the (S,S)-Ph-Box ligand variant featuring 1,1-diisopropyl groups on the side arm as a potentially effective catalyst relevant to the synthesis of (S)-naproxen. When extended to an important enantioselective reaction, such as the axially chiral phosphoric acid (CPA) catalyzed N,S-acetylation, AttentiveFP-CI offered a very good RMSE of 8.06 ± 1.00. The interpretability of our model sheds light on the factors governing enantioselectivity by identifying the reactive olefinic sites of the imines and alkenes in asymmetric hydrogenation reactions, and the binaphthyl axis of the axially chiral ligands in asymmetric acetylation reactions, as the key contributors. Overall, the AttentiveFP-CI model serves not only as a robust predictive framework but also as a chemically interpretable tool that complements intuition. Interpretable models can therefore be exploited in the data-driven discovery of chiral ligands and substrates in asymmetric catalysis.
Supplementary information (SI): details of the machine learning setups with their hyperparameter tuning, chemical reaction datasets, various control experiments, and other relevant information are provided. See DOI: https://doi.org/10.1039/d5dd00483g.
: 10 : 20 training, validation, and test sets. The hyperparameter tuning for the ULMFiT model is performed on the validation set using the Optuna framework (ref. 71), and the resulting optimal hyperparameters are applied to the model for predictions on the test set. The model performance is reported in terms of RMSE and R², obtained as the average over 30 different runs, each using a randomly created train-test split.
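The protocol of averaging test RMSE over repeated random splits can be sketched as follows; `fit_predict` is a hypothetical stand-in for training the ULMFiT (or any other) model and returning predictions on the held-out set.

```python
import math
import random
import statistics

def rmse(y_true, y_pred):
    """Root mean square error between paired true and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def rmse_over_random_splits(data, fit_predict, n_runs=30, test_frac=0.2, seed=0):
    """Mean and std of the test RMSE over `n_runs` random splits.

    `data` is a list of (x, y) pairs; `fit_predict(train, test)` is a
    hypothetical callable that trains on `train` and returns predictions
    for `test`, in order.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_runs):
        shuffled = data[:]
        rng.shuffle(shuffled)
        n_test = max(1, int(test_frac * len(shuffled)))
        test, train = shuffled[:n_test], shuffled[n_test:]
        preds = fit_predict(train, test)
        scores.append(rmse([y for _, y in test], preds))
    return statistics.mean(scores), statistics.pstdev(scores)
```

Reporting the spread alongside the mean, as done in the manuscript (e.g., RMSE of 8.06 ± 1.00), guards against a single lucky or unlucky split dominating the comparison between models.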
: 10 : 20 ratio and trained on five such independent random splits.
† AH and NJ contributed equally.
This journal is © The Royal Society of Chemistry 2026.