Kjell
Jorner
a,
Tore
Brinck
b,
Per-Ola
Norrby
c and
David
Buttar
*a
aEarly Chemical Development, Pharmaceutical Sciences, R&D, AstraZeneca, Macclesfield, UK. E-mail: david.buttar@astrazeneca.com
bApplied Physical Chemistry, Department of Chemistry, CBH, KTH Royal Institute of Technology, Stockholm, Sweden
cData Science & Modelling, Pharmaceutical Sciences, R&D, AstraZeneca, Gothenburg, Sweden
First published on 5th November 2020
Accurate prediction of chemical reactions in solution is challenging for current state-of-the-art approaches based on transition state modelling with density functional theory. Models based on machine learning have emerged as a promising alternative to address these problems, but these models currently lack the precision to give crucial information on the magnitude of barrier heights, influence of solvents and catalysts and extent of regio- and chemoselectivity. Here, we construct hybrid models which combine the traditional transition state modelling and machine learning to accurately predict reaction barriers. We train a Gaussian Process Regression model to reproduce high-quality experimental kinetic data for the nucleophilic aromatic substitution reaction and use it to predict barriers with a mean absolute error of 0.77 kcal mol−1 for an external test set. The model was further validated on regio- and chemoselectivity prediction on patent reaction data and achieved a competitive top-1 accuracy of 86%, despite not being trained explicitly for this task. Importantly, the model gives error bars for its predictions that can be used for risk assessment by the end user. Hybrid models emerge as the preferred alternative for accurate reaction prediction in the very common low-data situation where only 100–150 rate constants are available for a reaction class. With recent advances in deep learning for quickly predicting barriers and transition state geometries from density functional theory, we envision that hybrid models will soon become a standard alternative to complement current machine learning approaches based on ground-state physical organic descriptors or structural information such as molecular graphs or fingerprints.
In the pharmaceutical industry, prediction tools have great potential to accelerate synthesis of prospective drugs (Fig. 1a).9 Quick prediction is essential in the discovery phase, especially within the context of automation and rapid synthesis of a multitude of candidates for initial activity screening.3,10,11 In these circumstances, a simple yes/no as provided by classification models is usually sufficient. More accurate prediction is necessary in the later drug development process, where the synthesis route and formulation of one or a few promising drug candidates is optimized. Here, regression models that give the reaction activation energy can be used to predict both absolute reactivity and selectivity (Fig. 1b). Prediction of absolute reactivity can be used to assess feasibility under process-relevant conditions, while prediction of selectivity is key to reducing purification steps. Predictive tools therefore hold great promise for accelerating route and process development, ultimately delivering medicines to patients both faster and at lower costs.
The current workhorse for computational studies of organic reactions is density functional theory (DFT, Fig. 2a). Since rising to prominence in the early 90s, DFT has enjoyed extraordinary success in rationalizing reactivity and selectivity across the reaction spectrum by modelling the full reaction mechanism.12 The success of DFT can be traced in part due to a fortuitous cancellation of errors, which makes it particularly suited for properties such as enantioselectivity, which depends on the relative energies of two structurally very similar transition states (TSs). However, this cancellation of errors does not generally extend to the prediction of the absolute magnitude of reactions barriers (activation free energies, ΔG‡). In particular, DFT struggles with one very important class of reactions: ionic reactions in solution. Plata and Singleton even suggested that computed mechanisms of this type can be so flawed that they are “not even wrong”.13 Similarly, Maseras and co-workers only achieved agreement with experiment for the simple condensation of an imine and an aldehyde in water by introducing an ad hoc correction factor, even when using more accurate methods than DFT.14 These results point to the fact that the largest error in the DFT simulations is often due to the poor performance of the solvation model.
Machine learning represents a potential solution to the problems of DFT. Based on reaction data in different solvents, machine learning models could in principle learn to compensate for both the deficiencies in the DFT energies and the solvation model. Accurate QSRR machine learning models (Fig. 2b) for reaction rates or barriers have been constructed for, e.g., cycloaddition,15,16 SN2 substitution,17 and E2 elimination.18 While these models are highly encouraging, they treat reactions that occur in a single mechanistic step and they are based on an amount of kinetic data (>500 samples) that is only available for very few reaction classes. Another promising line of research uses machine learning to predict DFT barrier heights and then use these barrier heights to predict experimental outcomes.19 A recent study from Hong and co-workers used the ratio of predicted DFT barriers to predict regioselectivity in radical C–H functionalization reactions.20 While these models can show good performance, the predicted barriers still suffer from the shortcomings of the underlying DFT method and solvation model. We therefore believe that for models to be broadly applicable in guiding experiments, they should be trained to reproduce experimental rather than computed barrier heights.
Based on the recent success of machine learning for modelling reaction barriers, we wondered if we could combine the traditional mechanistic modelling using DFT with machine learning in a hybrid method (Fig. 2d). Machine learning would here be used to correct for the deficiencies in the mechanistic modelling. Hybrid models could potentially reach useful chemical accuracy (error below 1 kcal mol−1)21,22 with fewer training data than QSRR models, be able to treat more complicated multi-step reactions, and naturally incorporate the effect of catalysts directly in the DFT calculations. Mechanistic models are also chemically understandable and the results can be presented to the chemist with both a view of the computed mechanism and a value for the associated barrier. As a prototype application for a hybrid model, we study the nucleophilic aromatic substitution (SNAr) reaction (Fig. 1c), one of the most important reactions in chemistry in general and the pharmaceutical chemistry in particular. The SNAr reaction comprises 9% of all reaction carried out in pharma,23 and features heavily in commercial routes to block-buster drugs.24,25 It has recently seen renewed academic interest concerning whether it occurs through a stepwise or a concerted mechanism.26 We show that hybrid models for the SNAr reaction reach chemical accuracy with ca. 100–200 reactions in the training set, while traditional QSRR models based on quantum-chemical features seem to need at least 200 data points. Models based on purely structural information such as reaction fingerprints need data in the range of 350–400 samples. If these results hold also for other reaction classes, we envision a hierarchy of predictive models depending on how much data is available. Here, transfer learning might ultimately represent the best of both worlds. Models pre-trained on a very large number of DFT-calculated barriers27 can be retrained on a much smaller amount of high-quality experimental data to achieve chemical accuracy for a wide range of reaction classes.
Fig. 4 Dataset visualization with unsupervised (PCA) and supervised (PLS + UMAP) dimensionality reduction for the Xfull feature set. |
For the hybrid model, we needed features for both the ground state molecules and the rate-determining transition state. We opted for physical organic chemistry features which would be chemically understandable and transferable to other reactions.29 We selected features associated with nucleophilicity, electrophilicity, sterics, dispersion and bonding as well as features describing the solvent. As “hard” descriptors of nucleophilicity and electrophilicity, we used the surface average of the electrostatic potential (s) of the nucleophilic or electrophilic atom.30,31 (“Descriptor” and “feature” are here used as synonyms.) As “soft” descriptors we used the atomic surface minimum of the average local ionization energy (Is,min),32 as well as the local electron attachment energy (Es,min), which has been shown to correlate well with SNAr reactivity.33,34 From conceptual DFT, we used the global electrophilicity descriptor ω and nucleophilicity descriptor N, as well as the corresponding local nucleophilicity and electrophilicity descriptors and .35 These electronic features were complemented with atomic charges (q) from the DDEC6 scheme36 and the electrostatic potential at the nuclei (VN) of the reactive atoms.37 In terms of sterics and dispersion, we use the ratio of solvent accessible surface area (SASAr)38 of the reactive atoms and the universal quantitative dispersion descriptor Pint.39 For bonding, we used the DDEC6 bond orders (BO) of the carbon-nucleophile and carbon-leaving group bonds.40 Solvents were described using the first five principal components (PC1–PC5) in the solvent database by Diorazio and co-workers.41 The most important TS feature was the DFT-calculated activation free energy (ΔG‡DFT), i.e., a “crude estimation” of the experimental target.42 We also added VN, DDEC6 charges and bond orders at the TS geometry. We decided to not include the reaction temperature as one of the features, as reactions with higher barriers tend to be run at higher temperatures as they are otherwise too slow, and therefore correlate unduly with the target (see ESI, Section 5.6†). Atomic features are denoted with C (central), N (nucleophilic) or L (leaving) in parenthesis, e.g., q(C) for the atomic charge of the central atom. Features at the TS geometry are indicated with a subscript “TS”, e.g., q(C)TS.
We chose to investigate three main feature sets: (1) Xfull, containing all the features (34 in total) for maximum predictive accuracy, (2) XnoTS without any information from the TS, to assess whether hybrid models are indeed more accurate, and (3) Xsmall, which represents a minimal set of 12 features that can be interpreted more easily, with ΔG‡DFT as the only TS feature (see ESI† for complete list). We also made two versions of XnoTS, excluding either surface electronic descriptors (Xtrad, missing s, Is,min, Es,min and Pint) or traditional features (Xsurf, missing ω, N, ω, N, VN, q, and BO). As a comparison to the physical organic features, we investigated four feature sets that only make use of the 2D structural information of the molecule: the Condensed Graph of Reaction (CGR) with the In Silico Design and Data Analysis (ISIDA) descriptors43 (XISIDA,atom and XISIDA,seq), the Morgan reaction fingerprints44 as implemented in the RDKit (XMorgan),45 as well as the deep learning reaction fingerprints from Reymond and co-workers (XBERT).46 These structural features can be calculated almost instantaneously and are useful for fast prediction. We added solvent information to the structural features by concatenating PC1–PC5.
The results of the model validation (Fig. 6a and Table S5†) show a clear progression from simpler models such as linear regression (LR) with a MAE of 1.20 kcal mol−1, to intermediate methods such as random forests (RF) at 0.98 kcal mol−1, to more advanced Support Vector Regression (SVR) and Gaussian Process Regression (GPR) models at 0.80 kcal mol−1. Most importantly, the best methods are well below chemical accuracy (1 kcal mol−1). In comparison, the raw DFT barriers ΔG‡DFT show a high MAE of 2.93 kcal mol−1 and have the same predictive value as just guessing the mean of the training dataset (Fig. 6b). Compensating for systematic errors in the DFT energies by linear correction helps, but still has an unacceptable MAE of 1.74 kcal mol−1. Interestingly, simpler linear methods such as the Automatic Relevance Determination (ARD) can achieve the same performance as the non-linear RF when polynomial and interaction features of second order (PF2) are used to capture non-linear effects. The overall best method considering MAE is GPR with the Matern 3/2 kernel (GPRM3/2). This result is very gratifying as GPRs are resistant to overfitting, do hyperparameter tuning internally and produce error bars that can be used for risk assessment. We therefore selected GPRM3/2 as our final method and also used it to make comparisons between different feature sets. In the BBC-CV evaluation, it had an R2 of 0.87, an MAE of 0.80 kcal mol−1, and an RMSE of 1.41 kcal mol−1. Importantly, the BBC-CV method indicates a very low bias of only 0.02 kcal mol−1 for MAE in choosing GPRM3/2 as the best method, so overfitting in the model selection can be expected to be small. Indeed, GPRM3/2 shows excellent performance on the external test set, with a R2 of 0.93, a MAE of 0.77 kcal mol−1 and an RMSE of 1.01 kcal mol−1 (Fig. 7). The prediction intervals measuring the uncertainty of the individual prediction have a coverage of 99% for the test set, showing that the model can also accurately assess how reliable its predictions are.
Fig. 6 Model performance. (a) Selection of hybrid models of increasing complexity. (b) Performance of DFT versus hybrid model. (c) Comparison of physical organic feature sets. (d) Comparison of structural feature sets. Error bar corresponds to one standard error. Abbreviations explained in the text. The orange bar corresponds to the same model in all subplots, GPRM3/2 with the Xfull feature set. A comparison of all the models on the same scale is given in Fig. S6.† |
Fig. 7 Performance of final GPRM3/2 model on the external test set. Coverage for test set: 99%. The grey band corresponds to ±1 kcal mol−1 with respect to the identity line. |
How good can the model get given even more data? Any machine learning model is limited by the intrinsic noise of the underlying training data, given in our case by the experimental error of the kinetic measurement. In the dataset, there are four reactions reported with the same solvent and temperature but in different labs or on different occasions. Differences between the activation energies are 0.1, 0.1, 0.5 and 1.6 kcal mol−1. The larger difference is probably an outlier, and we estimate the experimental error is on the order 0.1–0.5 kcal mol−1. In comparison, the interlaboratory error for a data set of SN2 reactions was estimated to ca. 0.7 kcal mol−1.17 It is thus reasonable to believe that the current model with a MAE of 0.77 kcal mol−1 on the external test set is getting close to the performance that can be achieved given the quality of the underlying data. Gathering more data is therefore not expected to significantly improve the accuracy of the average prediction, but may widen the applicability domain by covering a broader range of structures (vide infra) and reduce the number of outlier predictions.
In summary, models based on reaction fingerprints are an attractive alternative when a sizeable dataset of at least 350 reactions are available as they are easy to develop and make very fast predictions.
Although feature importances can be easily calculated, they are not always easily interpretable. In particular, correlation between features poses severe problems. This problem is present for our feature set, as shown by the Spearman rank correlation matrix and the variance inflation factors (VIFs)54 of Xfull (Fig. S8†). Therefore, special care has to be taken when calculating the feature importances, and the final interpretation of them will not be straightforward. To get around this technical problem of multicollinearity, we clustered the features based on their Spearman rank correlations, keeping only one of the features in each cluster (Fig. 10a). The stability of the feature ranking was further analysed by using ten bootstrap samples of the data.
Fig. 10 (a) Clustering of correlated features based on Spearman rank correlation. (b) Bootstrapped feature ranking for clustered Xfull and XnoTS. For identity of feature clusters, see the ESI.† |
First, we looked at how important ΔG‡DFT is in our hybrid model. It turns out that it is consistently ranked as the most important feature across all bootstrap samples (Fig. 10b). The second-most important feature is the soft electrophilicity feature Es,min(C). To see more clearly which are the most important ground-state features, we analysed a model built on XnoTS (Fig. 10b). The most important feature is again the soft electrophilicity descriptor Es,min(C). Other important features are the soft nucleophilicity descriptor Is,min(N) and the feature cluster of hard electrophilicity represented by s(C). The global nucleophilicity descriptor N and the electrophile–nucleophile bond strength through BO(C–N) are also ranked consistently high.
In summary, the most important feature for the hybrid models is, as expected, the DFT-computed activation free energy ΔG‡DFT. Models trained without the TS features give insight into the features of substrates and nucleophiles that govern reactivity. Here, the most important features are the electrophilicity of the central carbon atom of the substrate, followed by those related to nucleophilicity. Steric and solvent features are less important. There are a number of plausible reasons why solvent features have lower importance. Firstly, the effect of the solvent is already incorporated in the ΔG‡DFT through the use of both implicit and explicit solvation (for anions of second and third row elements of the periodic table). Secondly, the implicit solvent influences the values of the quantum-mechanically derived features. For the Es,min descriptor this effect has been shown to be substantial.34,55 Thirdly, the solvent correction can likely be learnt implicitly also from the other features for problematic nucleophiles such as those with anionic oxygen. This aspect is also connected to the fourth factor, that solvent variation is low for the anionic nucleophiles, where for example reactions with oxygen nucleophiles are only carried out in water or methanol (Fig. S25†). There is therefore limited data for the model to learn from the solvent features for some nucleophile classes. Collection of more balanced data will be key to improved models. Steric features may become more important for other types of substrates as the current data set doesn't include very sterically crowded substrates or nucleophiles.
First, the model should only be applied to SNAr reactions. Second, the model can only be used with confidence for those reactive atom types that are part of the training set (Fig. 3b). One example of reactions falling outside the applicability domain according to these rules are those with anionic carbon nucleophiles. A provisional analysis of the outliers (residual of >2 kcal mol−1) for the training set identifies many reactions involving methoxide nucleophile (Fig. 11). This tendency can be partly understood based on the poor performance of implicit solvation models for such small, anionic nucleophiles which is probably not corrected fully by our model. Additionally, some of these reactions with methoxide and unactivated substrates are very slow and have been run at high temperature and the rate constants have been determined through extrapolation techniques, leading to a larger error in the corresponding rates. Outliers from the test set involve azide and secondary amine nucleophiles. However, secondary amines are also the most common type of nucleophile in the dataset and are therefore expected to contribute to some outliers. For a complete list of all outliers, see the ESI.†
Fig. 11 Example of outliers for training and test set. Ranges in parenthesis correspond to signed errors in kcal mol−1. |
We also investigated whether the prediction uncertainty given by the GPRM3/2 model could be trusted. For predicting molecular properties, similarity to the training set is usually used to assess whether a prediction should be trusted (in the applicability domain) or not (outside the domain). The similarity is measured by distance metrics based on molecular fingerprints.57 For reactions, difference fingerprints have been shown to differentiate between reaction classes, but their suitability for defining an applicability domain within one reaction class is not clear.58 In the absence of a clear similarity metric, we used the Distance to Model in X-space (DModX) of a PLS model with two components (dimensions) to compare to GPRM3/2. PLS is a type of linear method that uses dimensionality reduction and that is used widely in chemometrics. DModX is based on distance in the latent space used by the PLS model and is an established metric for defining the applicability domain.59 We compared the performance of DModX to the standard deviation (std) of the GPRM3/2 predictions using integral accuracy averaging curves (Fig. 12). The integral accuracy averaging curve is a standard tool for evaluating uncertainty metrics, where the predicted values are ordered from most to least reliable according to the uncertainty metric.60 The MAE is then plotted as a function of the portion of included data. A good uncertainty metric should show a curve with an upward slope from left to right, as including more points with larger uncertainty should lead to a larger error. As can be seen from the plot, both DModX and GPRM3/2 std are decent measures of uncertainty. As the GPRM3/2 std performs best, the prediction intervals (as shown in Fig. 7) can be used directly as a measure of how reliable the prediction is.
As there are 62 reactions which occur in the dataset with more than one reaction condition (different temperature or solvent), we investigated the performance of leaving these reactions out altogether in the modelling. With this leave-one-reaction-out validation approach, we observed a MAE of 1.00 kcal mol−1 for GPRM3/2 (compared to 0.80 kcal mol−1 from normal cross-validation). In comparison, a model trained on XBERT decreased from 1.03 to 1.11 kcal mol−1. We also tested leave-one-electrophile-out, giving a MAE of 1.20 kcal mol−1 and leave-one-nucleophile-out, giving a of MAE 0.68 kcal mol−1. These results indicate that the model is able to predict outside its immediate chemical space with good accuracy, and not only interpolate.
Taken together, the applicability domain of our model is defined strictly in terms of the type of reaction (SNAr) and the types of reactive atoms in the training set. The outlier analysis identified that extra care should be taken when interpreting the results of reactions at high temperature and with certain nucleophile classes. The width of the prediction interval from the GPRM3/2 model is a useful measure of the uncertainty of the prediction.
Overall, it is remarkable that the hybrid model GPRM3/2 performs so well (86%) for selectivity as it was not explicitly trained on this task. In comparison, for electrophilic aromatic substitution (SEAr), single-task deep learning models trained for selectivity prediction of bromination, chlorination, nitration and sulfonylation achieved top-1 accuracies of 50–87%.62 For bromination, the RegioSQM model based on semiempirical energies of the regioisomeric intermediate σ-complexes achieved 80%, compared to 85% for the neural network just mentioned. In light of these models, the 86% top-1 accuracy obtained with GPRM3/2 for SNAr looks very competitive. Likely, hybrid models explicitly trained for selectivity prediction can perform even better. Another approach would be to use transfer learning to repurpose deep learning models for barrier prediction to selectivity prediction.
By studying models built on reduced sets of the physical organic features, as well as reaction fingerprints, we identified separate data regimes for modelling the SNAr reaction (Fig. 13). In the range 0–50 samples, it is questionable whether accurate and generalizable machine learning models can be constructed.63 Instead, we suggest that traditional mechanistic modelling with DFT should be used, with appropriate consideration of its weaknesses. With 50–150 samples, hybrid models are likely the most accurate choice, and should be used if time and resources for their development is available. In the range 150–300 samples, traditional QSRR models based on physical organic features reach similar accuracy as hybrid models, while models based on purely structural information become competitive with over 300 samples. It will be very interesting to see if these numbers generalize to other reaction classes. For choosing features for QSRR models, it is notable that electronic reactivity features from DFT were consistently ranked high in the feature importances, and we saw that the inclusion of surface features makes the models significantly better.
Fig. 13 Models for quantitative rate prediction for different data regimes based on the SNAr dataset. |
Our workflow can handle the mechanistic spectrum of concerted and stepwise SNAr reactions, and we are currently working on extending it to handle the influence of general or specific acid and base catalysts as well as treating related reaction classes. This general model is a significant improvement in generality compared to previous work, which modelled selectivity of SNAr reactions in terms of the relative stability of the σ addition complexes and therefore was only applicable to reactions with step-wise mechanisms.64–66 Although promising, we believe that widespread use of hybrid models is currently held back by difficulties in computing transition states in an effective and reliable way. We envision that this problem will be solved in the near future by deep learning approaches that can predict both TS geometries67 and DFT-computed barriers27 based on large, publicly available datasets.68,69 In the end, machine learning for reaction prediction needs to reproduce experiment, and transfer learning will probably be key to utilizing small high-quality kinetic datasets together with large amounts of computationally generated data. Regardless of their construction, accurate reaction prediction models will be key components of accelerated route design, reaction optimization and process design enabling the delivery of medicines to patients faster and with reduced costs.
Footnote |
† Electronic supplementary information (ESI) available: Detailed computational methods, full description of machine learning workflow and results. Description of the experimental and computed datasets. See DOI: 10.1039/d0sc04896h |
This journal is © The Royal Society of Chemistry 2021 |