Open Access Article
Chloe Wilson†*a, María Calvo a, Stamatia Zavitsanou a, James D. Somper a, Ewa Wieczorek a, Tom Watts a, Jason Crain b and Fernanda Duarte*a
aPhysical and Theoretical Chemistry Laboratory, University of Oxford, 12 Mansfield Road, Oxford OX1 3TA, UK. E-mail: fernanda.duartegonzalez@chem.ox.ac.uk; chloe12345wilson@hotmail.com
bIBM Research, The Hartree Centre STFC Laboratory, Sci-Tech Daresbury, Warrington WA4 4AD, UK
First published on 26th December 2025
The accurate prediction of reaction rates is an integral step in elucidating reaction mechanisms and designing synthetic pathways. Traditionally, kinetic parameters have been derived from activation energies obtained from quantum mechanical (QM) methods and, more recently, machine learning (ML) approaches. Among ML methods, Bidirectional Encoder Representations from Transformers (BERT), a type of transformer-based model, is the state-of-the-art method for both reaction classification and yield prediction. Despite its success, it has yet to be applied to kinetic prediction. In this work, we developed a BERT model to predict experimental log k values of bimolecular nucleophilic substitution (SN2) reactions and compared its performance to the top-performing Random Forest (RF) literature model in terms of accuracy, training time, and interpretability. Both BERT and RF models exhibit near-experimental accuracy (RMSE ≈ 1.1 log k) on similarity-split test data. Interpretation of the predictions from both models reveals that they successfully identify key reaction centres and reproduce known electronic and steric trends. This analysis also highlights the distinct limitations of each: RF outperformed BERT in identifying aromatic allylic effects, while BERT showed stronger extrapolation capabilities.
The rate constant k is related to the activation free energy ΔG‡ through the Eyring equation:

$$ k \;=\; \frac{k_\mathrm{B} T}{h}\,\exp\!\left(-\frac{\Delta G^{\ddagger}}{RT}\right) \qquad (1) $$
While quantum mechanical (QM) methods, such as Density Functional Theory (DFT), are commonly used to estimate ΔG‡, they often fail to provide the required chemical accuracy of 1 kcal mol−1, which roughly corresponds to a change in k of one order of magnitude.1–3 This failure has been associated with the use of low-level electronic structure methods,4,5 inaccurate description of entropic contributions,4–6 and poor description of solvent effects by implicit solvent models.4,5,7 Reactive Force Fields, such as ReaxFF8 and the empirical valence bond (EVB)9 method, can, in principle, address the challenge of describing reactivity in explicit solvent; however, their parameterisation remains time-consuming.
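To make the correspondence between chemical accuracy and rate error explicit, the relation below follows directly from eqn (1); the temperature of 298 K is assumed here purely for illustration.

```latex
% Taking log10 of eqn (1): log k = log(k_B T / h) - \Delta G^{\ddagger} / (2.303 RT).
% An error in the barrier therefore maps onto an error in log k as
\Delta \log k \;=\; -\,\frac{\Delta(\Delta G^{\ddagger})}{2.303\,RT}
\;\approx\; -\,\frac{\Delta(\Delta G^{\ddagger})}{1.36~\mathrm{kcal~mol^{-1}}}
\qquad (T = 298~\mathrm{K}),
% so a 1 kcal mol^{-1} error in the barrier changes k by a factor of about
% 10^{0.73}, i.e. roughly one order of magnitude.
```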
In recent years, machine learning (ML) has emerged as a promising alternative for efficiently computing reaction kinetics. This includes the use of machine-learned interatomic potentials (MLIPs) that reduce the cost of modelling solvent explicitly,10 as well as ML models that predict QM-computed activation barriers or experimental log k values.1–3 Given the limited availability of experimental kinetic data, DFT has often been used for training these models despite its inherent limitations. Prominent QM-based ML models developed for activation energy predictions include the work of Green et al.,11 who developed a graph-based deep learning model (directed message passing neural network: D-MPNN) to predict gas-phase activation energies for various reaction types. Grayson et al. employed transfer learning (TL) to adapt a pre-trained NN initially trained on Diels–Alder reactions to predict barriers for other pericyclic reactions, thus reducing the need for extensive datasets.12 Recently, Li et al. systematically explored the use of TL, delta learning (aligning low-level QM data with CCSD(T)-F12a targets), and feature engineering (incorporating computed molecular properties) to improve activation energy predictions using the D-MPNN model, finding delta learning to be the most effective approach.13
Models trained on experimental log k values have been pioneered by Madzhidov et al.14 However, due to the scarcity of experimental data, they have been limited to a handful of reaction types, including SN2,14–17 E2,14,18 and cycloadditions.14,19–22 To predict the reaction rates for these types, the authors developed Random Forest (RF) models that use in silico Design and Data Analysis (ISIDA) fragments,23 along with information about reaction conditions, including the solvent dielectric constant and temperature. The models achieved an RMSE ≤ 1.0 log k on validation data, with the SN2 model further evaluated on an external test set.14 For cycloaddition reactions, they demonstrated that conjugated quantitative structure–property relationships (conjugated QSPR), which embed the Arrhenius equation into the ML architecture (in this case, a Ridge Regressor and a Neural Network), accurately predicted experimental values of log k, pre-exponential factor log A, and activation energy (Ea). On the validation data, R2 values of 0.75, 0.57, and 0.90 for log k, log A, and Ea, respectively, were achieved (RMSE not provided).22
In addition to reaching high accuracy, interpretability in ML models has become increasingly important.24 Interpretability can help identify sources of prediction error,14,25 identify influential features,12,26,27 and verify whether predictions are chemically meaningful.11,22,28–30 For example, in kinetics predictions, Green et al.11 demonstrated how learned reaction representations from their D-MPNN model clustered in terms of reaction type and reactivity. Similarly, von Lilienfeld et al.28 interpreted their Reactant-To-Barrier (R2B) model by plotting the difference between the predicted E2 and SN2 barriers based on LG, nucleophile, and R groups, demonstrating that its predictions aligned with heuristic reactivity rules. Furthermore, Persson et al.30 developed an equivariant graph neural network (GNN) that uses frontier molecular orbital coefficients of reactants and products as node features to predict QM activation barriers of SN2 reactions, as well as molecular orbital coefficients of the transition state, allowing for chemically intuitive interpretations. Madzhidov et al.14 also analysed the importance of solvent descriptors in predicting reaction rates and showed that their conjugated QSPR model successfully replicated the Arrhenius relationship between log k and temperature.22 Here, we interpret Madzhidov et al.'s RF model in the context of known reactivity rules and compare its performance to a Bidirectional Encoder Representations from Transformers (BERT) model.
Transformer-based models, particularly BERT, have gained popularity in chemistry as an alternative to shallow ML models, treating chemistry as a language task. These models have been applied to a range of (bio)chemical applications, including molecular discovery,31,32 reaction classification,33 and yield prediction.34 We refer the reader to relevant reviews illustrating the use and extension of transformer models for chemical applications.35–37 In kinetic prediction, learned reaction representations from a pretrained BERT model have been used as a descriptor for predicting activation free energies of SNAr reactions using Gaussian Process Regression (GPR), achieving an RMSE of 1.4 ± 0.2 kcal mol−1 (1.0 log k) on validation data.26,38 However, to our knowledge, no transformer-based models have been trained directly for kinetic prediction.
Here, we train a BERT model to predict rates for SN2 reactions and compare its performance against the RF model originally reported by Madzhidov et al.14 To evaluate the ability of the models to learn the underlying reactivity rules, we conducted a feature importance analysis using Kuz'min prediction contributions39 for RF and Integrated Gradients (IGs)40,41 for BERT (Fig. 1). Our results show that both models achieve near-experimental accuracy on similarity-split test data (RMSE ≈ 1.1 log k) and identify key reaction centres, as well as known electronic and steric effects. However, limitations were also identified: RF struggled with log k extrapolation, while the BERT model had difficulty recognising aromatic effects.
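As a concrete illustration of the approach described above, the sketch below fine-tunes a transformer encoder with a single-output regression head on reaction SMILES paired with log k targets. It is a minimal sketch assuming the Hugging Face transformers API; the checkpoint path and the condition-token format appended to the SMILES are hypothetical placeholders, not the exact setup used in this work.

```python
# Minimal sketch: fine-tuning a BERT-style encoder for log k regression.
# The checkpoint path and input format are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "path/to/pretrained-reaction-bert"  # hypothetical pretrained reaction model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=1, problem_type="regression"  # single scalar output: log k
)

# Toy example: reaction SMILES with an appended reciprocal-temperature token (format assumed).
reactions = ["[Cl-].CI>>CCl.[I-] [RecipTemp] 0.00335"]
log_k = torch.tensor([[-2.1]])  # illustrative experimental target

inputs = tokenizer(reactions, padding=True, truncation=True, return_tensors="pt")
outputs = model(**inputs, labels=log_k)  # transformers applies an MSE loss for regression
outputs.loss.backward()                  # a full training loop would follow from here
print("predicted log k (untrained):", outputs.logits.item())
```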
Fig. 1 Overview of the reaction under study and feature representation. (a) Pictorial representation of an SN2 reaction, highlighting the nucleophile (red), leaving group (LG, orange), electrophilic C (purple), and substituting R groups (grey). (b) In the RF model, features are represented using ISIDA fragments, reciprocal temperature, and solvent properties. In the BERT model, features are encoded from SMILES strings. Although not shown, ionic strength and mole fractions of each solvent component were appended to the SMILES, as shown here for reciprocal temperature. (c) The influence of a given feature on the predicted log k is computed using Kuz'min prediction contributions56 for RF and IGs54,55 for BERT; Q, K, V, and df denote the queries, keys, values, and feature dimension used to calculate self-attention scores in the BERT model, E denotes BERT token embeddings, and H[CLS] denotes the hidden representation of the [CLS] token prepended to the SMILES input.
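To make the encoding in panel (b) concrete, the sketch below assembles a condition-augmented input string from a reaction SMILES, the reciprocal temperature, the ionic strength, and the solvent mole fractions. The token names and ordering are assumptions for illustration only, not the verbatim format used in this work.

```python
# Minimal sketch: building a condition-augmented reaction SMILES for the BERT model.
# Token names and ordering are illustrative assumptions.
def build_input(reaction_smiles: str, temperature_K: float,
                ionic_strength: float, solvent_fractions: dict) -> str:
    parts = [reaction_smiles, f"[RecipTemp] {1.0 / temperature_K:.5f}"]
    parts.append(f"[IonicStrength] {ionic_strength:.3f}")
    for solvent_smiles, x in solvent_fractions.items():
        parts.append(f"{solvent_smiles} {x:.2f}")  # each solvent as SMILES + mole fraction
    return " ".join(parts)

example = build_input("[N-]=[N+]=[N-].CCBr>>CCN=[N+]=[N-].[Br-]",  # azide + ethyl bromide
                      temperature_K=298.15, ionic_strength=0.0,
                      solvent_fractions={"CO": 1.0})                # methanol as sole solvent
print(example)
```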
We built on the dataset of 4830 SN2 reactions compiled by Madzhidov et al.,17 each with an experimental log k value. After removing unbalanced reactions and duplicates, we reduced the dataset to 4666 entries. We then added 196 new SN2 reactions with experimental log k values, bringing the total to 4862 reactions (Fig. 2). These additional reactions included phosphine nucleophiles (36 reactions), azide leaving groups (4 reactions), and electrolyte solutions (16 reactions), thus increasing chemical diversity. Reinforcing this idea, 83% of the new reactions had a Tanimoto similarity (ST) < 0.4 to the initial 4666 reactions (Fig. S2b). The log k range also expanded from −7.7–1.6 (ΔG‡ = 16.1–29.5 kcal mol−1) to −12.3–1.6 (ΔG‡ = 16.1–36.1 kcal mol−1). Throughout this work, log kexp refers to the experimental log k and log kpred to the values predicted by the RF and BERT models.
Fig. 2 TMAP of the total training set of 4862 SN2 reactions. 4666 of these were compiled by ref. 26 (shown in grey), and 196 were added in the current work to increase the chemical diversity of the training data (shown in black).
Despite diversifying the training data, the model's Root Mean Square Error (RMSE) on the test data from ref. 14 (referred to here as Test 1 ≡ 73 reactions) remained at 1.0 log k. However, for out-of-domain reactions (Test 2 ≡ 56 reactions, including phosphine nucleophiles (4 reactions), azide leaving groups (5 reactions), and electrolyte solutions (12 reactions); see Methods), the test RMSE improved from 2.0 ± 0.0 log k (the baseline RMSE obtained by predicting the mean log kexp of the training data) to 1.4 ± 0.2 log k (Fig. 3a). The greatest contribution to this improved RMSE came from the electrolyte-containing reactions, with a complete breakdown provided in Fig. S3. Consequently, this revised RF model was employed in this study. To ensure generalisability, reactions with ST > 0.4 to the diversified data set were excluded from all test sets (Fig. S2a).
Fig. 3 Evaluation of prediction accuracy and interpretability in RF and BERT models. (a) Learning curve showing the change in RMSE of the RF model from ref. 14 upon increasing the chemical diversity of the training data (evaluated using 56 out-of-domain reactions). (b) RMSE comparison between the RF and BERT models (evaluated using 129 external test reactions) and 30 DFT calculations carried out at the CPCM(solvent)CCSD(T)/def2-TZVP//PBE0-D3BJ/def2-SVP level of theory. (c) Percentage of accurate predictions where the nucleophilic (Nu), leaving group (LG) and electrophilic carbon (C) atoms were high impact features in the RF and BERT models. (d) Percentage of accurate predictions where temperature and solvent were high impact features in the RF and BERT models. A detailed breakdown of solvent property impact in RF is provided in Fig. S9a.
We next compared the RF and BERT models on an external test set of 129 reactions spanning a log k range of −8.2–1.2 (see Methods). Importantly, all test data had ST < 0.4 to the training data, so prediction accuracy reflects model performance on novel reactions (Fig. S2a). Both models showed comparable accuracy (RMSE/log k: 1.2 ± 0.1 for RF and 1.1 ± 0.1 for BERT) on the combined test data (129 reactions, Fig. S1b, with learning curves in Fig. S4a). However, the RF model significantly outperformed BERT in training speed, taking 256 seconds compared with BERT's 52.9 hours on CPUs, although BERT's training time can be reduced substantially on GPUs, which are better suited for deep learning tasks (see Methods for details).
We also compared both models to a dummy model that always predicted the mean log kexp of the training data, which resulted in an RMSE of 2.0 ± 0.0 log k. Additionally, we benchmarked both models against log k values calculated using DFT at the CPCM(solvent)CCSD(T)/def2-TZVP//PBE0-D3BJ/def2-SVP level of theory. The DFT predictions yielded an RMSE of 2.5 kcal mol−1 ≡ 1.9 log k (Fig. 3b) and required 6.8 hours (using 4 CPU cores, each allocated up to 4 GB of memory) for geometry optimisation and frequency calculations of 30 reactant complexes and transition states, contrasting with the prediction time of less than 1 second for both RF and BERT models.
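For readers who want to reproduce the baseline comparison, the sketch below computes the RMSE of a mean-only dummy predictor and of a trained model on held-out log k values. The arrays are illustrative placeholders, not data from this work; only the metric definition is standard.

```python
# Minimal sketch: RMSE of a mean-only baseline vs. a model, on held-out log k values.
# Arrays are illustrative placeholders, not data from this work.
import numpy as np

y_train = np.array([-3.4, -1.2, -5.6, 0.3])   # training log k values (illustrative)
y_test  = np.array([-2.0, -6.1, 1.0])         # experimental test values (illustrative)
y_pred  = np.array([-2.4, -5.0, 0.1])         # model predictions (illustrative)

def rmse(y_true, y_hat):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_hat)) ** 2)))

baseline_pred = np.full_like(y_test, y_train.mean())  # dummy model: always the training mean
print(f"baseline RMSE = {rmse(y_test, baseline_pred):.2f} log k")
print(f"model    RMSE = {rmse(y_test, y_pred):.2f} log k")
```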
In our analysis of test reactions, we categorised predictions into accurate (upper quartile ≡ 32 reactions) and inaccurate (lower quartile ≡ 32 reactions, Fig. 4a and S5). Of the accurate predictions made by RF, 44% were also accurately predicted by BERT. Conversely, 56% of RF's inaccurate predictions overlap with those from BERT.
This analysis highlights that while both models achieved similar overall accuracy, they differed in the specific reactions they accurately or inaccurately predicted, suggesting they have learned different underlying relationships. RF offers a more practical solution for rapid deployment and retraining, while BERT may be better suited to large datasets, where its richer representations and interpretability tools can be fully leveraged. Further improvements in predictive performance are likely to depend more on data quality than on the choice of model architecture.
Feature importances were evaluated relative to a dummy model that predicts the mean log kexp of the training data for each test reaction, for which all feature importances are by definition zero. Features considered high impact were defined as those falling within the upper quartile of importance in the test data (Fig. 4b).
Both the RF and BERT models agreed on the importance of reaction centres and conditions. For example, in accurate predictions, the LG atom emerged as a high impact feature in over 90% of cases, while the Nu atom was significant in 75% of accurate predictions for both RF and BERT (Fig. 3c); however, BERT occasionally underestimated the importance of the Nu atom for inaccurate predictions (SI 5.3). In contrast, the electrophilic C atom was consistently identified as high impact in all of RF's accurate predictions, but in only 38% of those made by BERT. This discrepancy arises from differences in how features are represented: RF considers the electrophilic carbon as part of a larger molecular fragment that includes its surrounding environment, allowing it to directly capture steric effects, whereas BERT represents the electrophilic carbon as a single token, which may overlook these environmental influences.
Temperature also emerged as a high-impact feature in 97% of accurate predictions for both models, demonstrating their ability to recognise key physical features (Fig. 3d). In the RF model, the solvent is represented by 13 distinct properties,17 with each property being high impact in 90–100% of accurate predictions (full breakdown provided in Fig. S9a), in line with the original analysis using RF in ref. 14. Conversely, in BERT, where the solvent is represented by SMILES strings, the solvent was a high impact feature in 72% of accurate predictions; however, its importance was not observed for inaccurate predictions (SI 5.3). These results show that both models identify chemically meaningful features as relevant for the prediction task, with RF doing so most consistently.
To further assess whether RF and BERT effectively learned key structural and physical effects, we evaluated whether high impact features, including LG, steric, allylic, temperature, and solvent effects, increased (positive sign) or decreased (negative sign) log kpred (Fig. 5).
Both models show a positive correlation between halide size and reactivity (rates: Cl < Br < I). In the RF model, where LG atoms were represented by C–I, C–Br, C–Cl and C–F fragments, C–I increased log kpred across all examples. The presence of Br showed mixed effects on log kpred, while Cl had no significant effect in two reactions and decreased log kpred in one reaction. In the BERT model, where LG was represented by I, Br, Cl and F tokens, I and Br increased log kpred in all cases where they were high impact (12 and 5 reactions, respectively), while Cl decreased log kpred in the two examples involving this LG. These results show that both models recognise the importance of LG size in determining reactivity, with iodide demonstrating the most pronounced positive effect across both models.
Our analysis shows that both models recognised that steric hindrance decreases SN2 reactivity. In the RF model, substituted centres consistently decreased log kpred in the four reactions where they were high impact, while these features increased log kpred in 10 reactions with unsubstituted centres. For the two reactions where C–C and C–C–C fragments decreased log kpred, this was attributed to a spurious correlation (see discussion in SI 5.2.1). Similarly, a spurious correlation was observed in three reactions with unsubstituted centres where the C–C–C–C fragment decreased log kpred (see discussion in SI 5.2.1). In the BERT model, we found that substituted centres decreased log kpred in all reactions where they were high impact (6 reactions), while unsubstituted centres increased it.
In the RF model, alkene, alkyne, and aromatic groups bound to the electrophilic centre were represented by C–C=C, C–C≡C, and C–C:C fragments, respectively (where ‘:’ is an aromatic bond), while in the BERT model, these groups were described using the tokens ‘=’, ‘#’, and ‘c’, with ‘c’ representing an aromatic carbon bonded to the centre. Overall, both models recognised that allylic groups increase SN2 reactivity. However, BERT was limited in identifying this effect for aromatic groups. In the RF model, alkyne bonds increased log kpred in both instances where they appeared in the accurate predictions subset. Furthermore, aromatic groups also increased log kpred in 4 out of 6 reactions; the two reactions where aromatic groups decreased log kpred were attributed to their presence in the nucleophile (see Fig. S8). In the BERT model, alkenes increased log kpred in one reaction and were low impact in the other, while alkynes increased log kpred in the 4 reactions considered. Aromatic groups, however, had a negligible effect in the BERT model. Feature engineering by adding physical descriptors may improve the learning of these effects.26
As expected from the Eyring relationship, both models predicted a decrease in log k with increasing T−1 (Pearson's correlation coefficient rp ≤ −0.97 for both models). The correlation was seen for both the predicted log k values and the feature importances of T−1 (Fig. S6; predictions shown for one representative example in Fig. 5d). Hence, both models successfully captured the mathematical relationship between log k and temperature.
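A minimal version of the kind of check described above is sketched below: predicted log k values for one reaction at several temperatures are correlated against T−1 with SciPy. The temperature grid and predicted values are placeholders, not outputs of the models in this work.

```python
# Minimal sketch: checking the linear relationship between predicted log k and 1/T.
# The predictions below are illustrative placeholders.
import numpy as np
from scipy.stats import pearsonr

temperatures = np.array([273.15, 298.15, 323.15, 348.15])   # K
inv_T = 1.0 / temperatures
pred_log_k = np.array([-4.8, -4.1, -3.5, -3.0])             # hypothetical model outputs

r_p, p_value = pearsonr(inv_T, pred_log_k)
print(f"Pearson r between 1/T and predicted log k: {r_p:.2f} (p = {p_value:.1e})")
# A value close to -1 indicates the model reproduces the expected Eyring-type trend.
```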
As discussed in SI 5.2.3, solvent effects were not analysed for the RF model (Fig. S10). In the BERT model, solvent effects were evaluated by analysing the contribution of polar (ε > 15) protic and aprotic solvent SMILES in accurate predictions with anionic and neutral nucleophiles. The distribution of solvents was as follows: 3 polar protic and 6 polar aprotic for anionic nucleophiles, and 15 polar protic and 8 polar aprotic for neutral nucleophiles (Fig. 5e). No accurate predictions with non-polar solvents (ε < 15) were obtained. The BERT model consistently predicted that, with anionic nucleophiles, polar protic solvents decrease log k while polar aprotic solvents increase it (two reactions for protic and six for aprotic solvents). For neutral nucleophiles, polar solvents generally increased log kpred where solvent was high impact (five reactions for protic, four for aprotic). An exception was 2-amino-1-methylbenzimidazole reacting with ethyl iodide in methanol (5 reactions), which displayed a spurious correlation.
In summary, both BERT and RF models recognised LG, temperature, steric, and allylic effects to varying extents. Analysis of inaccurate predictions showed similar trends to accurate ones, reinforcing the reliability of these assessments. This consistency indicates that inaccurate predictions were not due to the inability of the models to capture key effects.
To assess the ability of the models to extrapolate to log k values outside the range of the training data, we analysed the relationship between the distance in log kexp from the training median (−3.4 log k) and the prediction error for each reaction in the test data (Fig. 6). Here, the x-axes were divided into positive and negative distances to capture extrapolation to log k values greater than and less than the training median, respectively. For the BERT model, a low correlation between distance from the training median and prediction error was observed (Spearman's correlation coefficient rs = 0.29). This result suggests that BERT extrapolates well to log k values far from the training median. In contrast, the RF model exhibited a modest correlation (rs = 0.40) between distance from the training median and prediction error for log k greater than the training median, and a strong positive correlation (rs = 0.65) for log k less than the training median. This implies that RF is limited in its ability to extrapolate to log k values below the training median.
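The sketch below reproduces the type of analysis just described: the absolute distance of each test reaction's log kexp from the training median is correlated with its absolute prediction error using Spearman's rank correlation. All arrays are illustrative placeholders, not data from this work.

```python
# Minimal sketch: correlating distance from the training median with prediction error.
# Arrays are illustrative placeholders, not data from this work.
import numpy as np
from scipy.stats import spearmanr

train_log_k = np.array([-5.1, -3.4, -2.0, -3.8, -1.1])            # training log k values
test_log_k_exp = np.array([-7.5, -6.0, -5.0, -1.0, 0.5, 1.0])      # experimental test values
test_log_k_pred = np.array([-5.9, -5.2, -4.6, -1.3, 0.1, 1.2])     # model predictions

median = np.median(train_log_k)
distance = np.abs(test_log_k_exp - median)     # |Δlog k_exp| from the training median
error = np.abs(test_log_k_pred - test_log_k_exp)

# Split into reactions above and below the training median, as in Fig. 6.
for label, mask in [("above median", test_log_k_exp > median),
                    ("below median", test_log_k_exp < median)]:
    if mask.sum() >= 3:
        r_s, _ = spearmanr(distance[mask], error[mask])
        print(f"{label}: Spearman r_s = {r_s:.2f}")
```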
Fig. 6 Prediction error vs. distance in log kexp from the training median (Δlog kexp) for each reaction in the test data, for the RF and BERT models.
In conclusion, both models can extrapolate to log k values above the training median, but BERT proved more reliable in extrapolating to log k values below it. This behaviour is to be expected given that BERT has a linear prediction layer, while RF bases its predictions on averages over the training data.
In this work, we developed a BERT model to predict experimental log k values of SN2 reactions and compared its performance to the RF literature model14 in terms of accuracy, training time, and ability to capture known reactivity rules.
In addition, we diversified the dataset of SN2 reactions curated by Madzhidov et al.17 used to train their RF,14 by introducing 196 new reactions curated from literature. We show that increasing the chemical diversity of the training data broadens the applicability of the model to new areas of chemical space, in particular for reactions in electrolyte solutions.
When comparing both the RF and BERT models trained on this diversified data, we observed that while both models achieved similar prediction accuracy (RMSE ≈ 1.1 log k), the RF model showed a clear advantage in training speed. Additionally, both models identified key reaction centres as important for accurate predictions, along with known factors that influence the reaction rate, such as the nature of the LG, sterics, allylic groups, temperature, and solvent (BERT only). However, each model exhibited specific limitations: RF had difficulties with extrapolating log k values, and BERT failed to recognise aromatic effects. Despite these limitations, each model compensates for the other's weaknesses, confirming that both RF and BERT are effective models for rate prediction and capable of capturing fundamental chemical principles in their predictions.
Future work should focus on expanding the applicability of these models to a wider range of chemical reactions via fine-tuning. We recognise that a key challenge will be the availability of experimental kinetic data beyond E2, SN2, cycloaddition, and SNAr reactions. Promising initiatives such as the Open Reaction Database44 and data-mining strategies offer potential solutions to improve the generalisation of available models, and these could be used alongside QM-generated data through multi-fidelity approaches. Ultimately, data diversity, rather than quantity alone, will be essential to enhance model generalisation.
The final training data comprised 4862 reactions with a log k range of −12.3–1.6. The test data contained 129 reactions with 41 unique nucleophiles, 43 unique substrates, and 10 unique solvent systems, and had a temperature range of 252.15–461.00 K and a log k range of −8.2–1.2. All test data had a Tanimoto similarity (ST) < 0.4 to the training data, so prediction accuracy reflects model performance on novel reactions (Fig. S2a). Here, reaction A was said to have an ST > X to reaction B if the nucleophile and substrate of A respectively had an ST > X to the nucleophile and substrate of B; otherwise, reaction A was said to have an ST < X to reaction B. Note that 0.4 is a standard ST threshold41 and imposed an effective similarity constraint without a significant reduction in test set size.
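The sketch below illustrates the reaction-level similarity rule described above with RDKit: a test reaction counts as similar to a training reaction only if both its nucleophile and its substrate exceed the Tanimoto threshold, and reactions similar to any training entry would be excluded. The fingerprint settings (Morgan, radius 2, 2048 bits) are assumptions for illustration; the original work may use different fingerprints.

```python
# Minimal sketch of the reaction-level similarity rule described above.
# Fingerprint settings (Morgan, radius 2, 2048 bits) are illustrative assumptions.
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs

def fp(smiles):
    return AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles), 2, nBits=2048)

def reaction_similar(test_rxn, train_rxn, threshold=0.4):
    """A test reaction counts as similar only if BOTH its nucleophile and its
    substrate exceed the Tanimoto threshold against the training reaction."""
    nu_sim = DataStructs.TanimotoSimilarity(fp(test_rxn["nucleophile"]),
                                            fp(train_rxn["nucleophile"]))
    sub_sim = DataStructs.TanimotoSimilarity(fp(test_rxn["substrate"]),
                                             fp(train_rxn["substrate"]))
    return nu_sim > threshold and sub_sim > threshold

# Illustrative reactions: azide + ethyl bromide (test) vs. chloride + methyl iodide (train).
test = {"nucleophile": "[N-]=[N+]=[N-]", "substrate": "CCBr"}
train = {"nucleophile": "[Cl-]", "substrate": "CI"}
print("exclude from test set:", reaction_similar(test, train))
```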
The training data utilised for this study builds upon the SN2 data compiled by Madzhidov et al.,17 which they used to train their rate prediction RF model.14 The original dataset consists of 4830 SN2 reactions and their experimental log k values, which were cleaned in the current work to remove unbalanced reactions (109 reactions), duplicates (46 reactions), and known CV outliers17 (9 reactions), resulting in 4666 reactions from ref. 14. The SMILES used to generate the ISIDA fragments were also canonicalised in the current work to improve interpretability (i.e., so each molecular fragment is only represented by 1 ISIDA fragment). The chemical diversity in the training data was increased by including 196 SN2 reactions manually curated in the current work.45–56 Specifically, 83% of the reactions curated in this work have an ST < 0.4 to the reactions from ref. 14, and therefore add structural diversity (Fig. S2b). Furthermore, the reactions curated in this work introduce an additional nucleophile type into the data: phosphines (36 reactions), as well as an additional LG: azide (4 reactions), and an additional solvent type: electrolyte solutions (16 reactions). The log k range was also increased from −7.7–1.6 to −12.3–1.6.
The test data comprises 73 test reactions compiled by Madzhidov et al.17 to evaluate their rate prediction RF model14 (those with ST < 0.4 to the training data, Test 1), and 56 reactions manually curated in the current work (Test 2).45,47–55 The latter represent an area of chemical space outside the applicability domain of ref. 14's training data. Firstly, Test 2 has a lower chemical similarity to ref. 14's training data than Test 1. This was quantified by the percentage of reactions with ST < 0.2 to ref. 14's training data: 46% and 12% for Test 2 and Test 1, respectively (Fig. S2c and d). ST < 0.2 is used here because all test data already have an ST < 0.4 to the training data. Secondly, Test 2 contains species not included in ref. 14's training data: 12 reactions in electrolyte, 5 reactions with azide LGs, and 4 reactions with phosphine nucleophiles.
Gibbs free energies at the CCSD(T) level were approximated by combining CCSD(T) electronic energies with PBE0 thermal corrections:

$$ G_{\mathrm{CCSD(T)}} \;\approx\; E_{\mathrm{CCSD(T)}} + G_{\mathrm{PBE0}} - E_{\mathrm{PBE0}} \qquad (2) $$
To obtain the DFT RMSE in log k, the DFT ΔG‡ values (in kcal mol−1) were converted to log k using the Eyring equation (eqn (1)). Further DFT details are provided in SI 3.
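For illustration, the sketch below applies eqn (2) and then the Eyring equation (eqn (1)) to convert component barriers into a log k value. The numerical inputs are placeholders, not values from this work; energies are assumed to be in kcal mol−1 and the temperature is taken as 298.15 K.

```python
# Minimal sketch: composite free energy (eqn (2)) and Eyring conversion to log k (eqn (1)).
# All numerical inputs are illustrative placeholders; energies in kcal/mol.
import math

K_B = 1.380649e-23    # Boltzmann constant, J/K
H   = 6.62607015e-34  # Planck constant, J*s
R   = 1.987204e-3     # gas constant, kcal/(mol*K)

def composite_dG(dE_ccsdt, dG_pbe0, dE_pbe0):
    """Eqn (2): attach PBE0 thermal corrections to the CCSD(T) electronic barrier."""
    return dE_ccsdt + (dG_pbe0 - dE_pbe0)

def log_k_from_dG(dG_kcal, T=298.15):
    """Eqn (1) in log10 form: log k = log10(k_B*T/h) - dG / (2.303*R*T)."""
    return math.log10(K_B * T / H) - dG_kcal / (2.303 * R * T)

# Barrier components (TS minus reactants), illustrative values in kcal/mol.
dG = composite_dG(dE_ccsdt=20.5, dG_pbe0=22.3, dE_pbe0=20.8)   # ~22.0 kcal/mol
print(f"dG = {dG:.1f} kcal/mol -> log k = {log_k_from_dG(dG):.2f}")
```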
Training times were 256 s for RF and 52.9 h for BERT on 8 vCPUs of an Intel® x86-64 CPU (64 GiB RAM). The BERT training time was 13 311 s (3.7 h) on a GPU (1 NVIDIA V100 PCIe 16 GB), a hardware option that is incompatible with the Scikit-learn implementation of RF. Even on a GPU, BERT is significantly slower than RF.
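As a point of reference for the timing comparison, the sketch below trains a scikit-learn RandomForestRegressor on a placeholder feature matrix and measures the wall-clock time. The feature matrix stands in for the ISIDA-fragment counts, solvent descriptors, and reciprocal temperature used in the actual model; the hyperparameters here are not necessarily those of ref. 14.

```python
# Minimal sketch: timing a scikit-learn Random Forest on a placeholder feature matrix.
# Features stand in for ISIDA fragment counts + solvent descriptors + 1/T.
import time
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(4862, 300)).astype(float)   # placeholder fragment counts
X[:, -1] = 1.0 / rng.uniform(250.0, 460.0, size=4862)    # reciprocal temperature column
y = rng.normal(-3.4, 2.0, size=4862)                     # placeholder log k targets

start = time.perf_counter()
rf = RandomForestRegressor(n_estimators=500, n_jobs=-1, random_state=0).fit(X, y)
print(f"training time: {time.perf_counter() - start:.1f} s")
print("example prediction:", rf.predict(X[:1])[0])
```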
BERT and RF prediction times were calculated for five random samples of 30 test reactions (one sample per CV fold) and averaged. DFT times correspond to geometry optimisation and frequency calculations of 30 reactant complexes and transition states. The ML prediction times were averaged over samples because different reaction sets had to be used for the ML and DFT timings: some of the lower-molecular-weight reactions required for the DFT calculations did not meet the ST < 0.4 requirement imposed on the ML test data.
Feature importances (eqn (3)) were computed using Kuz'min prediction contributions39 for RF and Integrated Gradients (IGs)40,41 for BERT. The magnitude of a feature's importance reflects the extent of its influence on log kpred, while the sign corresponds to whether the feature increased (positive) or decreased (negative) log kpred.
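For reference, the standard Integrated Gradients attribution of refs 40 and 41 takes the form below; whether eqn (3) corresponds exactly to this expression, rather than to the Kuz'min contribution formula, is an assumption made here for illustration.

```latex
% Integrated Gradients for input x, baseline x', model F, and feature i:
\mathrm{IG}_i(x) \;=\; \left(x_i - x_i'\right)
\int_{0}^{1} \frac{\partial F\!\left(x' + \alpha\,(x - x')\right)}{\partial x_i}\, d\alpha ,
% in practice approximated by a Riemann sum over m interpolation steps:
\mathrm{IG}_i(x) \;\approx\; \left(x_i - x_i'\right)
\frac{1}{m}\sum_{s=1}^{m}
\frac{\partial F\!\left(x' + \tfrac{s}{m}\,(x - x')\right)}{\partial x_i}.
```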
It is noted that each reaction centre atom also has a mapped atom in the products. The importance of these mappings is discussed in SI 5.4.
To interpret the models, the importance and sign (positive sign ≡ increase log kpred, negative sign ≡ decrease log kpred) of key reaction centre atoms (and bonds) were evaluated. Here, features with an importance of zero within the associated error were categorised as low impact. The importance of each feature is relative to that of a dummy model that predicts the mean log kexp of the training data for each test reaction. By definition, all features of the dummy model have an importance of zero.
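The upper-quartile definition of high impact features used throughout the analysis can be written in a few lines; the sketch below is a generic illustration with placeholder importance values, not the project's analysis code.

```python
# Minimal sketch: flagging "high impact" features as those in the upper quartile
# of absolute importance across the test data. Values are placeholders.
import numpy as np

importances = np.array([0.02, -0.45, 0.10, 0.71, -0.05, 0.33])  # signed feature importances
abs_imp = np.abs(importances)

threshold = np.quantile(abs_imp, 0.75)          # upper-quartile cut-off
high_impact = abs_imp >= threshold
signs = np.sign(importances)                    # +1 increases, -1 decreases log k_pred

for value, hi, s in zip(importances, high_impact, signs):
    label = "high impact" if hi else "low impact"
    direction = "increases" if s > 0 else "decreases"
    print(f"{value:+.2f}: {label}, {direction} log k_pred")
```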
To analyse LG effects, reactions with I, Br, Cl, and F LGs were considered, represented by C–I, C–Br, C–Cl and C–F fragments in RF and by I, Br, Cl, and F tokens in BERT. Steric effects were analysed using reactions with alkyl-substituted centres, modelled by C–C, C–C–C, and C–C–C–C fragments in the RF model and by tokens of the electrophilic and substituting C atoms in BERT. Here, the importance of these features in reactions with unsubstituted centres was used as a control. Meanwhile, allylic effects were assessed using reactions with alkene, alkyne, and aromatic groups bound to the electrophilic centre, which were represented by C–C=C, C–C≡C, and C–C:C fragments in RF (where ‘:’ is an aromatic bond), and by ‘=’, ‘#’ and ‘c’ tokens in BERT (where ‘c’ is an aromatic carbon bonded to the centre). For centres with multiple substituents, the feature importances were summed over the corresponding high impact C, ‘=’, ‘#’, or ‘c’ tokens in the BERT model (this is not relevant to RF, where features are represented by counts of molecular fragments).
When evaluating temperature effects, feature importances correspond to the importance of the reciprocal temperature feature in RF, and to the sum over importances of SMILES tokens representing temperature (the “[RecipTemp]” token or any token of the numerical value) in BERT. Regarding solvent effects in the BERT model (solvent effects were not evaluated for RF, see discussion in SI 5.2.3), the standard threshold of ε = 15 was used to define polar (ε > 15) and non-polar (ε < 15) solvents,63 while protic and aprotic solvents were defined as those with (protic) or without (aprotic) a proton bonded to a heteroatom. The importance of the solvent was taken as the sum over importances of high impact solvent tokens. Here, the solvent was said to be low impact if none of its tokens were high impact. Note that feature importances were summed over all temperature tokens, but only over high impact solvent tokens. This is because solvent effects were analysed by categorising the feature importances into positive (increase log kpred), negative (decrease log kpred), or low impact (negligible effect on log kpred), while temperature effects were evaluated by observing the correlation between feature importance and temperature.
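The polar/non-polar and protic/aprotic definitions above can be expressed compactly; the sketch below is an illustrative implementation using RDKit, with a small hand-written dielectric-constant lookup standing in for whatever solvent property table the original work used.

```python
# Minimal sketch: classifying solvents as polar/non-polar (dielectric constant threshold
# of 15) and protic/aprotic (any H on N, O, or S). The dielectric table is illustrative.
from rdkit import Chem

DIELECTRIC = {"CO": 32.7, "CC#N": 37.5, "ClCCl": 8.9}   # methanol, acetonitrile, DCM
PROTIC_SMARTS = Chem.MolFromSmarts("[#7,#8,#16;!H0]")   # heteroatom bearing at least one H

def classify(solvent_smiles: str) -> str:
    polarity = "polar" if DIELECTRIC[solvent_smiles] > 15 else "non-polar"
    mol = Chem.AddHs(Chem.MolFromSmiles(solvent_smiles))
    proticity = "protic" if mol.HasSubstructMatch(PROTIC_SMARTS) else "aprotic"
    return f"{polarity} {proticity}"

for s in DIELECTRIC:
    print(s, "->", classify(s))
# Expected: CO -> polar protic, CC#N -> polar aprotic, ClCCl -> non-polar aprotic
```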
In RF, fragments containing I, Br, Cl or F LGs, or alkene, alkyne, or aromatic groups that are not mentioned above were excluded from the analysis to avoid confounding effects from other atoms and bonds. Additionally, reactions where C–C, C–C–C, or C–C–C–C fragments contain the product atom mapping of a nucleophilic C− atom were omitted from the analysis of steric effects in RF, to avoid confounding nucleophilic effects, as were reactions where the solvent acted as a nucleophile in the analysis of solvent effects in the BERT model. Similarly, reactions containing substituent groups other than alkyl were excluded from the analysis of steric effects in both models, as were reactions containing substituent groups other than alkene/alkyne/aromatic and alkyl in the analysis of allylic effects.
Supplementary information (SI): detailed settings for model training, data sets used, and DFT calculations. It also provides further analyses on the interpretation of the model and inaccurate predictions. See DOI: https://doi.org/10.1039/d5dd00192g.
Footnote
† Current address: Xyme Ltd, Inventa, Botley Road, Oxford, England, OX2 0HA. E-mail: cwilson@xyme.ai
This journal is © The Royal Society of Chemistry 2026