Machine learning and semi-empirical calculations: a synergistic approach to rapid, accurate, and mechanism-based reaction barrier prediction

Modern QM modelling methods, such as DFT, have provided detailed mechanistic insights into countless reactions. However, their computational cost inhibits their ability to rapidly screen large numbers of substrates and catalysts in reaction discovery. For a C–C bond forming nitro-Michael addition, we introduce a synergistic semi-empirical quantum mechanical (SQM) and machine learning (ML) approach that allows the prediction of DFT-quality reaction barriers in minutes, even on a standard laptop using widely available modelling software. Mean absolute errors (MAEs) are obtained that are below the accepted chemical accuracy threshold of 1 kcal mol−1 and substantially better than SQM methods without ML correction (5.71 kcal mol−1). Predictive power is shown to hold when the ML models are applied to an unseen set of compounds from the toxicology literature. Mechanistic insight is also achieved via the generation of full SQM transition state (TS) structures which are found to be very good approximations for the DFT-level geometries, revealing important steric interactions in some TSs. This combination of speed, accuracy, and mechanistic insight is unprecedented; current ML barrier models compromise on at least one of these important criteria.


Dataset Generation
Initial geometries were built for 1000 unique nitro-Michael addition reactions by varying the organic fragments at four positions of a generic Michael acceptor (MA) core using the Custom R-Group Enumeration feature of Schrödinger's Maestro (release 2020-1) 1 (Fig. S1). Thus, a total of 2001 structures were obtained: the nucleophile (MW: 60), 1000 MAs (MW: 56-380.8), and 1000 TSs for the reaction of the nucleophile with each MA (MW: 116-440.8). R-groups were selected to be representative of common fragments across synthesis, 2 toxicology, 3 and covalent drug design. 4

Fig. S1. C-C bond forming nitro-Michael addition used to generate the machine learning dataset.
A further 37 nitro-Michael addition reactions were generated using a set of MAs (aldehydes, ketones, and esters) from the toxicology literature (Fig. S2), leading to an additional 37 MAs (MW: 80-156.1) and 37 TSs for reaction of the nucleophile with each MA (MW: 142-216.1). 3 Where two or more C=C double bonds were present, reactions were always calculated at the β-carbon. In the original literature publication, 58 MAs were selected because of the potential impact of their reactivity on toxicity. From these 58 structures, 12 structures (coloured blue) that were already present in the enumerated dataset of 1000 reactions were removed to avoid bias when performing external validation. An additional 9 structures (coloured red) bearing triple bonds as part of the α,β-unsaturated carbonyl functional group were removed, as these represent a different form of reactivity that the ML models are not trained to predict. A previous study by Schwöbel found that separating α,β-unsaturated carbonyls with double and triple bonds is a valid approach when building models to predict the reactivity of MAs with glutathione. 5 Among the remaining 37 MAs, two reactions, E5 and E7, contain alcohol groups within their R-groups that allow intra- and intermolecular hydrogen bonding to take place in their respective MA and TS geometries (Fig. S3). As no such reactions are present in the enumerated dataset of 1000 reactions, the generated ML models cannot reasonably be expected to learn to account for hydrogen bonding. Indeed, for all models and feature subsets, reactions E5 and E7 were found to exhibit disproportionately worse predictions compared to the MAE of the other 35 structures; for example, absolute errors of 4.26 and 4.74 kcal mol−1 were obtained for E5 and E7, respectively, with the GPR model with the AM1 All feature subset, compared to an MAE of 0.92 kcal mol−1 over the other 35 reactions.
Thus, the 37 literature reactions were divided into two distinct sets to assess the predictive performance of the generated ML models with and without hydrogen bonding and alcohol groups: literature set 1 (35 structures; E5 and E7 removed) and literature set 2 (all 37 structures).
Thus, a total of 1037 nitro-Michael addition reactions were generated. Except where specified, any reference to the "literature set" refers to literature set 1. The average train, test, and literature (all sets) MAEs of the SVR, KRR, and GPR models for each feature subset are provided in Table S1. For the two literature sets, predictions are always best when E5 and E7 are omitted (literature set 1).

Fig. S2. 58 Michael acceptors used to build reactions in the literature dataset; all blue structures, which were already present in the enumerated dataset, and red structures, which contain triple bonds as part of the α,β-unsaturated carbonyl functional group, were removed, leaving 37 structures in total.

For each reaction, temperature- (298.15 K) and concentration-corrected (1 mol/l) quasi-harmonic (Grimme approximation) 6 free energies were calculated using the GoodVibes python package, 7 with vibrational scaling factors of 0.975 8 for ωB97X-D/def2-TZVP and 1 for all other levels of theory. These were used to calculate free energy reaction barriers (Table S2) via equation (1). Fig. S4 illustrates the generally continuous spread of reaction barriers, which are approximately normally distributed for each level of theory according to D'Agostino-Pearson tests. 9,10

ΔG‡ = G(TS) − (G(MA) + G(nucleophile))    (1)

All structures were conformationally searched using the conformational search tool within Schrödinger's MacroModel (version 12.7) 11,12 with the OPLS3e force field. 13 A recent comparison of several force fields found OPLS3e to perform best when predicting the ordering of conformers of organic molecules based on their DFT energy. 14 A mixed Monte Carlo Multiple Minimum (MCMM) 15 and low-mode sampling approach 16,17 was used to explore the possible conformations of each structure. The lowest energy conformation of each structure, based on its OPLS3e energy, was subsequently optimised with several baseline (molecular mechanical (MM) and semi-empirical quantum mechanical (SQM)) methods and one targetline (density functional theory (DFT)) method using Gaussian16 (Revision A.03). 18 For the baseline methods, the Universal Force Field (UFF), 19 Austin Model 1 (AM1), 20 and Parameterisation Method 6 (PM6) 21 were used; these methods are widely available in QM packages such as Gaussian.
Additionally, AM1 and PM6 represent two of the more modern and better parameterised general-purpose SQM methods available. 21,22 The newer PM7 method, 23 when tested on a large subset of our structures, regularly failed to reach convergence. For the DFT calculations, the long-range corrected ωB97X-D functional 24 was used with the polarised triple-ζ valence quality def2-TZVP basis set. 25 Functionals of this type have been found to perform very well for the prediction of barrier heights, 26,27 and similar methods have previously been used with success in the large-scale generation of chemical reaction datasets. 28 For all methods, single point energy (SPE) calculations 29 were performed with the same method as the optimisation but with the addition of the integral equation formalism of the polarisable continuum model (IEFPCM) 30 with toluene as solvent.
Toluene is a widely used solvent in hydrogen-bonding catalysis and was thus selected for all calculations incorporating solvent. 31,32 To verify that the models are also predictive of geometries in solvent, the 37 reactions from the literature were also reoptimised with AM1/IEFPCM(toluene) 30 and ωB97X-D/def2-TZVP/IEFPCM(toluene). An in-house python package was used to automatically manage the calculation workflow. All calculations were performed on a High Performance Computing (HPC) architecture using 12-24 cores and one node; the approximate average times of each set of calculations (on a 16-core node) are summarised in Table S3. Gaussian16 output files for all computed structures are openly available in the Dataset for "Machine learning and semi-empirical calculations: A synergistic approach to rapid, accurate, and mechanism-based reaction barrier prediction" in the University of Bath Research Data Archive at https://doi.org/10.15125/BATH-01092. Computed structures were illustrated in the manuscript with CYLView. 33

Table S1. Average MAEs of all SVR, KRR, and GPR models.

Table S3. Approximate average calculation times on a 16-core node (hours:minutes:seconds).
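Once the quasi-harmonic free energies are in hand, the barrier evaluation in equation (1) is a simple arithmetic step. A minimal sketch (the helper name and the free energy values, in hartree, are hypothetical placeholders):

```python
# Sketch of equation (1): subtract the free energies of the separated
# reactants (MA + nucleophile) from that of the TS. Values are hypothetical.
HARTREE_TO_KCAL = 627.5095  # hartree -> kcal/mol conversion factor

def reaction_barrier(g_ts, g_ma, g_nu):
    """Free energy barrier in kcal/mol from free energies in hartree."""
    return (g_ts - (g_ma + g_nu)) * HARTREE_TO_KCAL

barrier = reaction_barrier(g_ts=-592.11850, g_ma=-347.65012, g_nu=-244.50021)
```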

Feature Subset
The enumerated dataset of 1000 reactions was randomly split into an 80% train set (800 reactions) and 20% test set (200 reactions) using the Scikit-learn (sklearn) python package. 34 The relationships between the SQM and DFT reaction barriers for each level of theory are summarised in Fig. S5-6. In each case, the mean absolute error (MAE) between the SQM and DFT barriers is substantially above the accepted threshold for chemical accuracy of 1 kcal mol−1. 35
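The split itself is a one-line sklearn call; a sketch (the DataFrame contents are placeholders, and random_state=0 follows the "original train-test splitting" referred to later):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder table: one row per enumerated reaction, DFT barrier as target
barriers_df = pd.DataFrame({
    "reaction_id": range(1000),
    "dft_barrier": [15.0 + 0.01 * i for i in range(1000)],  # dummy values
})

# 80% train / 20% test split with a fixed random state for reproducibility
train_df, test_df = train_test_split(barriers_df, test_size=0.2, random_state=0)
```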

Feature Extraction
A variety of physical organic features were extracted for the 1037 MAs and TSs at all levels of theory using a series of python packages (Table S4, Fig. S7). To avoid data leakage, all subsequent feature processing was performed only on the train set (800 reactions) and the same transformations were then applied to the test and literature sets. Prior to fitting, features with no variance were removed using sklearn's VarianceThreshold feature selector, and features were standardised (the mean of each feature was subtracted from it before dividing by its standard deviation) using sklearn's StandardScaler to ensure they had zero mean and unit variance. Linear correlations between features were measured by Pearson's correlation coefficient (r) using SciPy. 10

Table S4. All extracted molecular and atomic features. All features were extracted for both the MA and TS and included in the respective feature subsets, except for the bond forming distance, which was only available for the TS, and the reaction barrier, which was only included in the relevant combined MA and TS feature subset. Atomic features denoted with n (n = C1, C2, C3, O4)
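The leakage-free preprocessing described above (both transforms fitted on the train set only, then applied unchanged elsewhere) can be sketched as follows; the feature matrices here are random placeholders:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler

# Random placeholder feature matrices for the train and test reactions
rng = np.random.default_rng(0)
X_train = rng.normal(size=(800, 5))
X_train[:, 2] = 1.0  # a zero-variance feature, removed by the selector
X_test = rng.normal(size=(200, 5))

# Fit both transforms on the train set only...
selector = VarianceThreshold().fit(X_train)
scaler = StandardScaler().fit(selector.transform(X_train))

# ...then apply the identical transforms to the test (and literature) sets
X_train_proc = scaler.transform(selector.transform(X_train))
X_test_proc = scaler.transform(selector.transform(X_test))
```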

Machine Learning
Each feature subset was trained on a variety of sklearn regression algorithms using the 80% train set to predict the DFT free energy reaction barrier. The algorithms used are summarised in Table S6; except for ridge regression, all algorithms were trained using their default sklearn parameters. For ridge regression, a large regularisation strength (α = 50) was used to account for the nature of the data; linear models, such as ridge regression, are parametric tests that are particularly prone to overfitting when large numbers of features are provided that exhibit some degree of multicollinearity. 52 Both the radial basis function (RBF) and polynomial kernels were employed for KRR and SVR, whilst GPR models were trained with the Matern kernel based on its previous success predicting reaction barriers. 53 Preliminary studies were also performed using elastic net regression, least absolute shrinkage and selection operator (LASSO) regression, KRR with linear and sigmoid kernels, SVR with linear and sigmoid kernels, GPR with rational quadratic, RBF, and dot product kernels, and a multi-layer perceptron (MLP) neural network regressor; however, these all either produced poor initial metrics or suffered from very large computational costs. To reduce the number of features in each model and prevent overfitting, 54 we employed feature selection algorithms within the train set prior to training each feature subset on each regressor. Sklearn's recursive feature elimination with 5-fold cross validation (RFECV) was used for regressors that are able to generate feature coefficients or feature importances, and mlxtend's sequential forward selection (SequentialFeatureSelector) with 5-fold cross validation (SFSCV) otherwise. 55 These algorithms calculate the 5-fold cross validation (CV) MAEs for each feature subset trained on the relevant regressor, allowing selection of the best combination of features for each subset.
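A minimal illustration of the RFECV step with a ridge estimator (α = 50, as above); the data are synthetic, with only the first three of ten features informative:

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

# Synthetic data: only the first three features determine the target
rng = np.random.default_rng(0)
X_train = rng.normal(size=(800, 10))
y_train = X_train[:, 0] + 2 * X_train[:, 1] - X_train[:, 2]

# Recursively drop the weakest feature (by ridge coefficient magnitude),
# scoring each subset size by 5-fold CV MAE
selector = RFECV(
    Ridge(alpha=50),
    cv=KFold(n_splits=5),
    scoring="neg_mean_absolute_error",
).fit(X_train, y_train)

selected = selector.support_  # boolean mask over the ten features
```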
To avoid the use of these expensive algorithms with Gaussian process regression (GPR), each GPR model was instead fit initially with all features and then refit excluding any features with MAE-derived permutation feature importances below a specified threshold (a multiplier (0.5) of the mean importance) from the original model. Except for ridge regression (α = 50), feature selection was performed for all algorithms using their default sklearn parameters.
For each combination of regressor and optimised feature subset, hyperparameter tuning was performed within the train set using sklearn's GridSearchCV to search the hyperparameter space for the best CV MAE scores. Each regressor was subsequently refit using its optimised feature subset and optimised hyperparameters; the resulting models allow direct prediction of the DFT free energy reaction barrier. The grids of sklearn hyperparameters searched for each regressor are given below:
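As an illustration of this tuning step (the grid below is a hypothetical example for SVR, not one of the actual per-regressor grids, and the data are synthetic):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVR

# Synthetic stand-in for an optimised feature subset and its DFT barriers
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 4))
y_train = X_train @ np.array([1.0, -2.0, 0.5, 0.0])

# Hypothetical hyperparameter grid for SVR
param_grid = {"C": [1, 10, 100], "epsilon": [0.01, 0.1]}

# Exhaustive search scored by 5-fold CV MAE; the best estimator is then
# refit on the full train set automatically
search = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid,
    cv=KFold(n_splits=5),
    scoring="neg_mean_absolute_error",
).fit(X_train, y_train)

best_model = search.best_estimator_
```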

Model Analysis
To check for potential overfitting of the models, sklearn was used to perform 5-fold CV within the train set to generate MAE and coefficient of determination (R2) scores, as per their standard sklearn definitions, for each fitted model. To assess individual model performance, external validation was performed using the 20% test set to calculate MAE (with standard errors) and R2 scores between the observed and predicted DFT barriers for each fitted model. Standard errors were calculated by dividing the standard deviation of the individual absolute errors by the square root of the number of samples (200 for the test set, and 35 or 37 for the literature sets). To further assess the generalisability of the models, external validation was performed using the two literature sets to calculate MAE scores (with standard errors) between the observed and predicted DFT barriers for each fitted model; unlike R2, these are comparable between different sample sizes. R2 scores were not calculated for the literature sets due to their smaller size (n = 35/37) and incomparability with the respective train (5-fold CV) and test (n = 200) scores. 64 MAE-derived permutation feature importances (with standard errors) were calculated with sklearn over 10 repeats for each fitted model for the train, test, and literature sets to identify which features were most important to the models. Learning curves were generated for each fitted model by training with increasingly smaller subsets of the train data (80 reactions were removed at each stage) and making predictions on the constant test set. The resulting train and test MAEs for each train set size were plotted to validate the size of the training set, identify potential overfitting, and indicate whether equivalent metrics could be produced using less data.
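The standard-error convention described above can be sketched as follows (the barrier values are placeholders; numpy's default population standard deviation is assumed here):

```python
import numpy as np

def mae_with_standard_error(y_true, y_pred):
    """MAE and its standard error (std of absolute errors / sqrt(n))."""
    abs_errors = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    return abs_errors.mean(), abs_errors.std() / np.sqrt(abs_errors.size)

# Hypothetical observed vs predicted DFT barriers (kcal/mol)
mae, se = mae_with_standard_error([20.0, 18.5, 22.0, 19.0],
                                  [20.5, 18.0, 21.0, 19.5])
```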

All Machine Learning Metrics, Features, Hyperparameters
Metrics, features, and sklearn hyperparameters for all computed models are provided below. All train results are 5-fold CV. All MAEs (and standard errors) are in kcal mol−1. Except where specified, all models are derived from the original train-test splitting (random state = 0). For readability and comparative purposes, train, test, and literature (both sets) metrics for all computed models are also provided in Tables S7-10.

Table S7. 5-fold CV train MAE / 5-fold CV train R2 (number of features) for each regressor and feature subset.

Learning Curves
The learning curves for each SVR and KRR model with the AM1 All feature subset are provided in Fig. S8-11. The train and test scores for each model tend to be very close to one another, indicating that no significant overfitting takes place in the models at any point.
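A sketch of how such a learning curve can be generated (synthetic data, KRR as an example regressor; the train set shrinks in steps of 80 reactions against a constant test set, mirroring the procedure described earlier):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_absolute_error

# Synthetic stand-ins for the real feature matrices and DFT barriers
rng = np.random.default_rng(0)
coefs = np.array([1.0, -2.0, 0.5, 3.0])
X_train = rng.normal(size=(800, 4))
y_train = X_train @ coefs
X_test = rng.normal(size=(200, 4))
y_test = X_test @ coefs

# Remove 80 reactions at each stage; record (n, train MAE, test MAE)
curve = []
for n in range(800, 0, -80):
    model = KernelRidge(kernel="rbf").fit(X_train[:n], y_train[:n])
    curve.append((n,
                  mean_absolute_error(y_train[:n], model.predict(X_train[:n])),
                  mean_absolute_error(y_test, model.predict(X_test))))
```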

RMSD Analyses
Root-mean-square deviations of atomic positions (RMSDs) were calculated to assess the similarity between the geometries of the MAs and TSs at each level of theory and the corresponding DFT geometries. The results of these analyses are summarised in Fig. S18-28. All RMSDs were calculated via a quaternion-based characteristic polynomial method 65 with the spyrmsd python package. 66 Structures being compared (all atoms, including hydrogens) were superimposed prior to calculation of each RMSD.
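spyrmsd obtains the superposed RMSD via a quaternion-based characteristic polynomial; the equivalent result can be sketched in plain numpy with the Kabsch algorithm (an illustrative alternative, not the actual workflow code):

```python
import numpy as np

def superposed_rmsd(coords_a, coords_b):
    """All-atom RMSD after optimal superposition (Kabsch algorithm).

    coords_a / coords_b: (n_atoms, 3) arrays with matching atom order."""
    a = coords_a - coords_a.mean(axis=0)    # remove translation
    b = coords_b - coords_b.mean(axis=0)
    u, s, vt = np.linalg.svd(a.T @ b)       # SVD of the covariance matrix
    v = vt.T
    d = np.sign(np.linalg.det(v @ u.T))     # guard against reflections
    rot = v @ np.diag([1.0, 1.0, d]) @ u.T  # optimal rotation
    diff = a @ rot.T - b
    return np.sqrt((diff ** 2).sum() / len(a))
```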