Open Access Article
Samuel G.
Espley
a,
Samuel S.
Allsop
a,
David
Buttar
b,
Simone
Tomasi
c and
Matthew N.
Grayson
*a
aDepartment of Chemistry, University of Bath, Claverton Down, Bath, BA2 7AY, UK. E-mail: M.N.Grayson@bath.ac.uk
bData Science and Modelling, Pharmaceutical Sciences, R&D, AstraZeneca, Macclesfield, UK
cChemical Development, Pharmaceutical Technology & Development, Operations, AstraZeneca, Macclesfield, UK
First published on 21st October 2024
Machine learning (ML) models have provided a highly efficient pathway to quantum mechanical accurate reaction barrier predictions. Previous approaches have, however, stopped at prediction of these barriers instead of developing predictive capabilities in reactivity analysis tasks such as distortion/interaction–activation strain analysis. Such methods can provide insight into reactivity trends and ultimately guide rational reaction design. In this work we present the novel application of ML to the rapid and accurate prediction of distortion and interaction DFT energies across four datasets (three existing and one new dataset). We also show how our models can accurately predict on unseen, high impact literature examples where DFT-level distortion/interaction analysis has previously been used to explain reactivity trends for cycloadditions. This work thus provides support for ML to be utilised further in reactivity analysis across different reaction classes at a fraction of the cost of traditional methods such as DFT.
However, these ML studies seldom go further than reaction barrier prediction. Thus, an area which remains relatively untouched by ML is reactivity analysis. To the best of our knowledge, only one paper has utilised ML in this context for energy decomposition analysis, however this work focused on minimum energy and not transition state (TS) structure analysis.16 Common computational methods used to gain insight into the origins of chemical reactivity include Natural Bond Orbital (NBO),17,18 Non-Covalent Interaction (NCI),18–21 and Distortion/Interaction–Activation Strain (DIAS) analysis (Fig. 1).22–31 In the DIAS model, the reaction barrier can be decomposed into reactant distortion energies (strain) and interaction energy. The former arises from the conformational changes that occur during the reaction and the latter is the interaction between these distorted structures. The reaction barrier can therefore be considered as a balance between distortion and interaction energies. Breaking the barrier down into these energy components can provide insight into the factors that govern reactivity and is a particularly popular method of analysis in the study of cycloadditions.22,23,27,31–34 As an example, in a study of the reaction of cycloalkenones with various cyclic dienes,31 it was found that the interaction energies across the reaction series were almost identical. Therefore, the differences in reaction barrier arise from changes in dienophile and diene distortion energies. In a related study of the cycloaddition reaction of cyclopropenes with butadiene, the distortion energies of the endo and exo reactions were nearly identical, with the differences in reaction barrier (endo reactions favoured by 2–3 kcal mol−1) arising from the difference in interaction energies.35 This type of analysis therefore provides detailed information on reactivity trends which can aid in the rational design of new reactions.
Similar analysis has been applied to many other reactions such as: SN2/E2,24,36 iridium-catalysed C–H borylation,29 palladium-catalysed cross-coupling,37 asymmetric propargylation38 and allyboration,39 and nickel-catalysed amide (a functional group that is ubiquitous in drug discovery) C–N bond activation.40 However, computing accurate distortion and interaction energies requires the use of expensive quantum mechanical calculations (typically DFT) which prevents the routine use of this reactivity analysis on large datasets. ML may instead offer a rapid approach to high accuracy distortion and interaction energy predictions.
In this work, we report the first example of predicting distortion and interaction energies using ML. Models are built to predict DFT energies from rapid semi-empirical quantum mechanical (SQM) calculations for four distinct datasets (of which one is a new dimethyl malonate Michael addition dataset). The models yield high accuracy distortion and interaction energies for all four datasets with further insight available from the computed SQM TSs. Finally, we show that performance remains high when the models are applied to two unseen reaction sets from the literature.
49 and ωB97X-D/def2-TZVP.50,51 Single point energy (SPE) calculations at the same levels of theory were performed on these structures to account for solvent effects using the integral equation formalism of the polarisable continuum model (IEFPCM) with water.52 The combination of DFT and an implicit solvent model was chosen due to its good agreement with experiment as reported within the literature for this type of chemistry.53–55 All calculations were performed with Gaussian16 (Revision A.03
56 and Revision C.01
57). Energies and quasi-harmonic free-energies (298.15 K, 1 mol l−1 concentration) were calculated using GoodVibes.58 Further details on this dataset are provided in the ESI, Section 1.4.†
For the distortion/interaction analysis, the distorted reactants at the TS must be separated to perform the necessary calculations. Whilst there is python code available for this function,59 it requires a detailed input file about the given chemical system and does not provide a mapping of atom numbers from a TS to its respective reactant and distorted structures. We developed an approach that utilises the frequency vibrations of the TS as detailed in Fig. 2 (the code is available on GitHub (https://github.com/the-grayson-group/distortion-interaction_ML)). The main benefit of our code is that it can provide us with a mapping of atom numbers between the reactant, distorted, and TS structures regardless of the initial atom numbering. This is crucial for ML approaches that use knowledge of specific atoms. Dataset ds3 contained arbitrary atom numbering and our approach for determining the common atoms of this dataset is summarised in the ESI, Section 1.3.† After extracting the distorted structures from each TS, SPE calculations were performed at the same level of theory that the TS and reactants were calculated at and in solvent with energies then extracted with GoodVibes. The distortion and interaction energies were calculated as outlined in ESI Section 2.†
From the SQM optimised reactants, distorted reactants, and TSs, atomic and physical organic chemical features were extracted using Morfeus60 and cclib61 python packages. The features extracted and the entire procedure are outlined in the ESI, Section 4.1.† Features were standardised before three scikit-learn62 regression algorithms and two TensorFlow63 neural networks (NNs) were trained. The targets were also standardised for the NNs.
The scikit learn algorithms were ridge regression, kernel ridge regression (KRR) with a radial basis function (RBF) kernel, and support vector regression (SVR) with the RBF kernel. These models were chosen due to their success in predicting reaction barriers in previous work.8,9,64 The link between reaction barriers and distortion/interaction energies suggests that similar models may also be able to predict these new targets.
The models were hyperparameter tuned and performance was monitored on a validation set. The final model performance was evaluated on a held-out test set. For a full description of the ML procedure see ESI, Section 4.†
For ds1,8 ds2,9 and ds3,41 their pre-ML SQM-DFT metrics are reported in the literature, Table 1, and in the ESI, Section 4.3.†
| Datasets | Pre-ML AM1-DFT MAE (kcal mol−1) | Average SVR test MAE (kcal mol−1) | Average test set range (DFT) (kcal mol−1) | Test MAE as % of test set range | |
|---|---|---|---|---|---|
| ds1 | MA | 4.45 | 1.35 | 2.60 to 28.26 | 5.3 |
| Nitromethane (nucleophile) | 2.57 | 0.36 | 0.73 to 5.36 | 7.8 | |
| Interaction | 7.60 | 1.59 | −23.11 to −0.09 | 6.9 | |
| ds2 | Diene | 2.80 | 0.33 | 12.12 to 26.54 | 2.3 |
| Dienophile | 2.04 | 0.34 | 7.01 to 21.92 | 2.3 | |
| Interaction | 9.87 | 0.50 | −17.40 to −3.43 | 3.6 | |
| ds3 | Dipole | 3.59 | 2.55 | 0.98 to 41.25 | 6.3 |
| Dipolarophile | 3.81 | 2.37 | 0.13 to 35.33 | 6.7 | |
| Interaction | 20.01 | 2.46 | −43.62 to −7.92 | 6.9 | |
| ds4 | MA | 4.07 | 1.27 | 7.30 to 39.64 | 3.9 |
| Dimethyl malonate (nucleophile) | 3.70 | 0.54 | 3.43 to 13.98 | 5.1 | |
| Interaction | 10.83 | 1.23 | −26.78 to −5.05 | 5.7 | |
For ds4, the pre-ML AM1-DFT MAE for reaction barrier prediction is 10.46 kcal mol−1. The spread of distortion and interaction energies are shown in Fig. 3. Fig. 3c and e highlight that AM1 significantly overestimates the dimethyl malonate distortion and interaction (pre-ML MAEs of 3.70 and 10.83 kcal mol−1 respectively) energies. The interaction energy is calculated from the reaction barrier and distortion energies thus, its pre-ML AM1-DFT MAE could be impacted by the compounding of errors from the lower level of theory across multiple calculations. It could also be rationalised in part by the inability of the parent method (Hartree Fock) to account for electron correlation sufficiently and thus important interactions are not well captured.49,67 The interaction energies are also overestimated by AM1 across the other datasets. For full information on pre-ML metrics see ESI, Section 4.3.†
In this work we propose that if AM1 can provide rapid and approximate values relative to DFT, ML can be used to bridge this gap and give accurate distortion and interaction energy predictions at a fraction of the cost of DFT.
For ds4, models were trained to predict distortion energies for the dimethyl malonate nucleophile and the MA. The best predictions were obtained once again using SVR with a test set MAE of 0.50 (RS = 23) and 1.05 kcal mol−1 (RS = 14) for nucleophile and MA respectively (Fig. 4). When averaged over five random states, test set predictions were 0.54 ± 0.06 kcal mol−1 and 1.27 ± 0.17 kcal mol−1 for nucleophile and MA, respectively. Overall, these results are a significant improvement on the pre-ML metrics. For full metrics and figures see ESI, Section 4.3 and Table S7.†
Similar performance was seen when models were built to predict the distortion energies of the nucleophile and MA of ds1 (see Table 1). The low test set MAEs seen for the ds4 but especially the ds1 nucleophiles are likely due to the minimal conformational flexibility of these species which results in fewer distorted conformations across the entire dataset thus making the task of correcting their SQM energies an easier one.
The pre-ML MAEs for the diene and dienophile distortion energy for ds2 were 2.80 and 2.04 kcal mol−1 respectively (Table 1). Across all tested models, averaged predictions of the diene distortion energies for ds2 were well below 1 kcal mol−1; the best performing model was again SVR with an averaged MAE of 0.33 ± 0.08 kcal mol−1 (the results of the best models are visualised in Fig. 5). Ridge regression had the highest averaged error across all predictions (0.61 kcal mol−1) however, its performance is still significantly below the 1 kcal mol−1 accuracy threshold that has typically been used for energy predictions.
SVR is again the best performing model when predicting dienophile distortion energies with KRR also performing strongly (MAEs of 0.34 ± 0.05 and 0.38 ± 0.05 kcal mol−1 respectively). Greater fluctuation in model performance is seen when predicting the dienophile distortion energies, however, SVR tends to perform consistently well across the diene and dienophile distortion energy prediction tasks.
Ds3 provides an interesting challenge due to the diversity and biological relevance of this bio-orthogonal dataset.41 The best reported ML predictions of the reaction barriers are between 2 and 3 kcal mol−1.68–70 These results highlight the complexity of this dataset and the difficulty in predicting barriers approaching the desired accuracy of 1 kcal mol−1. A prediction limit for this dataset may have been reached.
The pre-ML AM1-DFT MAE for the dipole distortion energies was 3.59 kcal mol−1 (Table 1). When predicting the distortion energy of the dipole, the best performing model again used SVR with an averaged test set MAE of 2.55 ± 0.13 kcal mol−1. SVR again gave the best performing model for the dipolarophile distortion energy predictions with an averaged test MAE of 2.37 ± 0.12 kcal mol−1. Across the datasets tested, prediction errors for the distortion energy models are either similar or lower than that those of the reaction barrier models. For full results on both cycloaddition datasets see ESI, Section 4.3, Tables S5 and S6.†
For ds1, the average test MAE for interaction energies was 1.59 ± 0.18 kcal mol−1 (SVR) which is approaching the 1 kcal mol−1 accuracy threshold and a significant improvement upon the pre-ML AM1-DFT MAE of 7.6 kcal mol−1 (Table 1). Similar performance was seen for ds4 with an averaged test set MAE of 1.23 ± 0.17 kcal mol−1 (SVR), improving on a pre-ML value of 10.83 kcal mol−1.
Similar success is seen with the cycloaddition datasets. Prior to ML, the interaction MAEs for ds2 and ds3 were 9.87 and 20.01 kcal mol−1 respectively (Table 1). The best model built on ds2 yielded an MAE of 0.41 kcal mol−1 (RS = 14) which is well below the 1 kcal mol−1 accuracy threshold (Fig. 6). The averaged test MAE for ds2 was 0.50 ± 0.06 kcal mol−1 (Table 1). When predicting ds3 interaction energies, the best performing model was again SVR. The best model achieved a test MAE of 2.33 kcal mol−1 (RS = 1) which, when considered against the pre-ML MAE of 20.01 kcal mol−1, is a striking improvement and in line with literature predictions of this dataset's reaction barriers. When averaged over five random states, the test MAE was 2.46 ± 0.12 kcal mol−1. Another approach to evaluating model performance is to consider the test MAE as a percentage of the range of test values. Table 1 shows the averaged test set performances for SVR models for each distortion and interaction energy over five random states. Evaluating ds3 performance solely on MAE would lead to the conclusion that the models are above the 1 kcal mol−1 accuracy threshold and thus not as accurate as other models, however, when considering the MAE as a percentage of the test set range, the performance is comparable to that of the Michael addition models (ds1 and ds4).
Learning curves were also generated for models used in this work to inspect if overfitting had occurred. Across all models, there was good agreement between test and train metrics (see ESI, Section 4.5†). In addition to this, feature importances were generated for SVR models on ds2 to investigate the explainability of our models. This was done by randomly shuffling a feature at inference (full procedure can be found in the ESI, Section 4.6†). For the prediction of DFT interaction energies, the three most important features were the AM1 ΔG‡, ΔE‡, and ΔEint (ESI, Fig. S48†). All these features, when randomly shuffled, resulted in a MAE above 1 kcal mol−1 highlighting that they describe the target task well. Similar easily understood feature importances were also seen for the diene and dienophile distortion models. For all ds2 target feature importances see ESI, Section 4.6.†
As mentioned in the methodology section, a selection of different models were tested in this work. Across most tasks and datasets, SVR yielded the best performance, however the 2-layer NN achieved comparable accuracy (ESI, Section 4.3†). It should be noted that even with the appropriate methods to circumvent overfitting, there was a degree of this with almost every NN hence the discussion in this work primarily focussed on the use of SVR models.
![]() | ||
| Fig. 7 External test sets for ds2 ML models constructed from ref. 33 and 37. The dienes in both datasets are part of ds2 thus diene distortion was not predicted. Any dienophiles that were in ds2 were removed from these external test sets. | ||
Predictions were made using geometries that were taken from these two studies after optimising them with AM1 and DFT (see the methodology section outlined above for the full computational procedure). Model performance on ds2 was strongest using SVR thus we utilised this model for our external test set predictions (Table 2).
| Dataset | Pre-ML AM1-DFT MAE (kcal mol−1) | Average SVR test MAE (kcal mol−1) | |
|---|---|---|---|
| Cycloalkenones | Dienophile | 1.93 | 1.58 |
| Interaction | 12.71 | 1.52 | |
| Cyclopropene | Dienophile | 4.27 | 1.62 |
| Interaction | 6.73 | 1.42 | |
The dienes used in the external test sets were part of ds2 therefore, diene distortion was not predicted to avoid a trained model predicting on a previously seen datapoint. Furthermore, any dienophiles that were present in ds2 were removed from the external test sets.
As before, all models were tested over five random states with results averaged. For context, three of the pre-ML AM1-DFT MAEs are large, especially for the interaction energies (Table 2). The averaged test MAEs from the ds2 ML models for each target, across both datasets, were significantly lower than the pre-ML AM1-DFT MAEs which shows that the models can make predictions close to the 1 kcal mol−1 accuracy threshold on unseen datapoints. Noticeably, prediction of the interaction energies using ML resulted in a significant reduction in test MAE for both the cycloalkenone and cyclopropene datasets; the MAE on the cycloalkenone data dropped from a pre-ML value of 12.71 to 1.52 kcal mol−1. As we have previously shown, SQM TSs can be a very good approximation to DFT TSs for cycloadditions.9 Therefore, our hybrid SQM-ML approach gives energies and mechanistic insight close to the accuracy of DFT at a fraction of the computational cost.
Footnote |
| † Electronic supplementary information (ESI) available: Additional details on generation and analysis of SQM and DFT datasets, ML procedures, and model performance evaluation. See DOI: https://doi.org/10.1039/d4dd00224e |
| This journal is © The Royal Society of Chemistry 2024 |