Open Access Article
Nicolai Ree,a Andreas H. Göller*b and Jan H. Jensen*a
aDepartment of Chemistry, University of Copenhagen, Universitetsparken 5, 2100 Copenhagen Ø, Denmark. E-mail: jhjensen@chem.ku.dk
bBayer AG, Pharmaceuticals, R&D, Computational Molecular Design, 42 096 Wuppertal, Germany. E-mail: andreas.goeller@bayer.com
First published on 28th January 2022
We present RegioML, an atom-based machine learning model for predicting the regioselectivities of electrophilic aromatic substitution reactions. The model relies on CM5 atomic charges computed using semiempirical tight binding (GFN1-xTB) combined with a light gradient boosting machine (LightGBM). The model is trained and tested on 21 201 bromination reactions with 101k reaction centers, which are split into training, test, and out-of-sample datasets with 58k, 15k, and 27k reaction centers, respectively. The accuracy is 93% for the test set and 90% for the out-of-sample set, while the precision (the percentage of positive predictions that are correct) is 88% and 80%, respectively. The test-set performance is very similar to that of the graph-based WLN method developed by Struble et al. (React. Chem. Eng., 2020, 5, 896–902), though the comparison is complicated by the possibility that some of the test and out-of-sample molecules are used to train WLN. RegioML outperforms our physics-based RegioSQM20 method (Nicolai Ree, Andreas H. Göller, Jan H. Jensen, J. Cheminf., 2021, 13, 10), where the precision is only 75%. Even for the out-of-sample dataset, RegioML slightly outperforms RegioSQM20. The good performance of RegioML and WLN is in large part due to the large datasets available for this type of reaction. However, for reactions where there is little experimental data, physics-based approaches like RegioSQM20 can be used to generate synthetic data for model training. We demonstrate this by showing that the performance of RegioSQM20 can be reproduced by an ML model trained on RegioSQM20-generated data.
368 bromination reactions. A thorough dataset curation is then performed to obtain a set of unique SMILES (simplified molecular input line entry system) strings of the reactants and their corresponding sites of bromination, which reduces the total number of reactions to 21 896. For example, a reaction is discarded if there is not an exact one-to-one mapping between the heavy atoms of the reactant and the product, excluding the reacting bromine(s), or if a reacting bromine forms a bond with anything other than a cyclic sp2-hybridized carbon atom (accounting for 5314 reactions). Furthermore, reactions with unique reaction IDs in Reaxys but identical reactants are merged (accounting for 3158 reactions).
RegioSQM20 calculations are performed for all 21 896 reactions to get proton affinities for all of the unique reaction sites. An extension of this method is also applied in which the RegioSQM20 calculations are followed by single point density functional theory (DFT) calculations in methanol (MeOH, dielectric constant = 33.6) using the PBEh-3c composite electronic structure method15 and the conductor-like polarizable continuum model16,17 (C-PCM) as implemented in the quantum chemistry program ORCA version 4.2.18

A few of the calculations resulted in extreme proton affinities corresponding to outliers in the dataset that complicated the development of regression models. However, the calculated proton affinities for both the original and extended RegioSQM20 calculations follow a Gaussian distribution (see the ESI†), which enables the use of Chauvenet's criterion to remove these outliers. In Chauvenet's criterion, the probability of the point farthest from the mean is calculated under the assumption of a Gaussian distribution. If this probability is below some predefined value, the point is removed, and the procedure is repeated until no more points are removed. In our dataset, molecules are removed if at least one atom in the molecule has a proton affinity corresponding to a probability below 1%.
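The iterative outlier removal described above can be sketched as follows. This is an illustrative implementation, not the authors' code: the function name is hypothetical, and the 1% two-tailed probability cutoff matches the threshold stated in the text.

```python
import math

def chauvenet_filter(values, p_cut=0.01):
    """Iteratively remove outliers assuming a Gaussian distribution.

    At each step, the point farthest from the mean is removed if its
    two-tailed Gaussian probability falls below p_cut (1% in the paper).
    """
    vals = list(values)
    while len(vals) > 2:
        mu = sum(vals) / len(vals)
        sd = (sum((v - mu) ** 2 for v in vals) / len(vals)) ** 0.5
        if sd == 0.0:
            break
        # point farthest from the mean
        far = max(vals, key=lambda v: abs(v - mu))
        z = abs(far - mu) / sd
        prob = math.erfc(z / math.sqrt(2))  # two-tailed probability
        if prob < p_cut:
            vals.remove(far)
        else:
            break
    return vals
```

In the paper this test is applied per atom, and the whole molecule is discarded if any of its proton affinities fails the criterion.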
From the screening of the seven atomic descriptors, we find that a charge shell descriptor with 5 shells and values sorted according to the Cahn–Ingold–Prelog (CIP) rules is particularly good for predicting the regioselectivity of bromination reactions (see Table S3 and Fig. S4 in the ESI†). An illustration of this 485-dimensional descriptor can be seen in Fig. 1.
Fig. 1 An illustration of the charge shell descriptor with values sorted according to the Cahn–Ingold–Prelog (CIP) rules. The green values correspond to the calculated CM5 charges using GFN1-xTB.
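The construction of such a shell descriptor can be sketched in a simplified form. This is not the 485-dimensional descriptor from the paper: here the CIP ranking is replaced by a generic per-atom `priority` map (real CIP ranks would come from a cheminformatics toolkit), and the number of atoms kept per shell (`width`) is an illustrative assumption.

```python
def shell_descriptor(center, adjacency, charges, priority, n_shells=5, width=4):
    """Concatenate atomic charges shell by shell around `center`.

    Atoms within each shell are sorted by `priority` (a stand-in for the
    CIP ranking used in the paper) and each shell is zero-padded to `width`.
    """
    desc = [charges[center]]
    visited = {center}
    frontier = [center]
    for _ in range(n_shells):
        # next shell: unvisited neighbors of the current frontier, CIP-sorted
        nxt = sorted(
            {n for a in frontier for n in adjacency[a] if n not in visited},
            key=lambda a: priority[a],
        )
        visited.update(nxt)
        shell = [charges[a] for a in nxt][:width]
        shell += [0.0] * (width - len(shell))  # pad short shells
        desc.extend(shell)
        frontier = nxt
    return desc
```

The fixed length (1 + n_shells × width values) is what makes the descriptor usable as input to a gradient boosting model.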
This results in 21 201 reactions corresponding to 100 588 unique reaction sites. Thus, applying the above procedure results in a training/test set and an out-of-sample set of 15 246 and 5955 molecules, which correspond to 73 123 and 27 465 unique reaction sites, respectively.
Uniform stratified and random splits are then used to obtain an 80:20 ratio between the training and test sets, resulting in 12 196 and 3050 molecules corresponding to 58 384 and 14 739 unique reaction sites, respectively. For the uniform stratified split, each of the individual clusters is randomly split and the pieces are then combined, which ensures that both the training and test sets have similar representations of the underlying data distribution. The random split, on the other hand, is completely random with respect to all of the molecules obeying the cluster size cutoff.
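The uniform stratified split can be sketched as follows, assuming the clusters are available as lists of molecule identifiers (the function name and 80:20 default are illustrative; the paper does not publish this exact routine).

```python
import random

def stratified_split(clusters, frac=0.8, seed=42):
    """Split each cluster 80:20 and pool the pieces, so that the training
    and test sets share the cluster (data) distribution."""
    rng = random.Random(seed)
    train, test = [], []
    for members in clusters:
        members = members[:]          # do not mutate the caller's lists
        rng.shuffle(members)
        k = int(round(frac * len(members)))
        train.extend(members[:k])
        test.extend(members[k:])
    return train, test
```

A purely random split would instead shuffle the pooled molecule list once and cut it at the 80% mark, ignoring cluster membership.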
As the strategy of this work is to learn and predict on atoms instead of molecules, all of the atomic descriptors for atoms in molecules belonging to the training, test, and out-of-sample sets are collected into separate input sets, and the corresponding proton affinities or classifications into the corresponding output sets.
An analysis of the training, test, and out-of-sample datasets can be found in the ESI.†
Furthermore, we examine the imbalance in the dataset using a “Null model”, where all sites are predicted to be non-reactive. As a baseline model, we employ a 1-nearest neighbor (1-NN) classifier, corresponding to a perfect memorization of the training set,37 using the brute-force search algorithm and the Jaccard (also known as Tanimoto) metric as implemented in scikit-learn.36
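The 1-NN baseline can be illustrated with a minimal pure-Python version (the paper uses scikit-learn's brute-force `KNeighborsClassifier`; this stripped-down sketch on binary vectors is only meant to show what "memorizing the training set" means).

```python
def jaccard_distance(a, b):
    """Jaccard (Tanimoto) distance between two binary vectors."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return 0.0 if union == 0 else 1.0 - inter / union

def one_nn_predict(X_train, y_train, x):
    """1-NN prediction: return the label of the closest training point,
    i.e. pure memorization of the training set."""
    best = min(range(len(X_train)),
               key=lambda i: jaccard_distance(X_train[i], x))
    return y_train[best]
```

Because every training point is its own nearest neighbor, such a model scores perfectly on the training data; its test-set performance therefore measures how far memorization alone can carry a model on this dataset.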
There are 58 384, 14 739, and 27 465 unique reaction sites in the training, test, and out-of-sample sets, respectively. The experimental data often contain just one or a few reported reactive sites among all reaction sites in the reactant, i.e. there are significantly more negatives (N) than positives (P) in the dataset. Consequently, the accuracy (the proportion of correct predictions, ACC) can be a misleading metric. For example, a “Null model”, where all sites are predicted to be non-reactive, achieves a respectable accuracy of 76% (Table 1) for both the test and out-of-sample sets, but this just reflects the fact that 76% of the sites in both datasets are unreactive. The Matthews correlation coefficient38 (MCC) is a more robust metric to assess the model performance, since it also considers false positives (FP) and false negatives (FN) in addition to true positives (TP) and true negatives (TN):

MCC = (TP × TN − FP × FN)/√((TP + FP)(TP + FN)(TN + FP)(TN + FN)) (1)

The ratios TP/(TP + FP), TP/(TP + FN), TN/(TN + FP), and TN/(TN + FN) are also known as precision, recall, specificity, and negative predictive value, respectively.
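The metrics reported in Table 1 follow directly from the four confusion-matrix counts; a small helper computing all six (using the standard definitions given above) looks like this:

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Return (ACC, MCC, PPV, TPR, TNR, NPV) from confusion-matrix counts.

    MCC is defined as 0 when any marginal sum is zero, as for the Null model.
    """
    acc = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = 0.0 if denom == 0 else (tp * tn - fp * fn) / denom
    ppv = tp / (tp + fp) if tp + fp else 0.0  # precision
    tpr = tp / (tp + fn) if tp + fn else 0.0  # recall
    tnr = tn / (tn + fp) if tn + fp else 0.0  # specificity
    npv = tn / (tn + fn) if tn + fn else 0.0  # negative predictive value
    return acc, mcc, ppv, tpr, tnr, npv
```

For a Null model on a 76%-unreactive dataset (TP = FP = 0), this reproduces the Table 1 row: ACC = 0.76, MCC = 0, TNR = 1.00, NPV = 0.76.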
Table 1 Performance on the 14 739 unique reaction sites in the test set and the 27 465 unique reaction sites in the out-of-sample set. The reported metrics are accuracy (ACC), Matthews correlation coefficient (MCC), precision (PPV or positive predictive value), recall (TPR or true positive rate), specificity (TNR or true negative rate), and negative predictive value (NPV). All of the ML models are trained on the 58 384 unique reaction sites in the training set with two exceptions: the “RegioML (all reactions)” model is trained on all of the available data, including the training, test, and out-of-sample sets, and the “LightGBM (532 reactions)” model is trained on data from the RegioSQM20 paper,6 in which 37 and 74 reactions are part of the test and out-of-sample sets, respectively. These reactions are therefore excluded from the reported statistics.
| Method | Test set |  |  |  |  |  | Out-of-sample set |  |  |  |  |  |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  | ACC | MCC | PPV | TPR | TNR | NPV | ACC | MCC | PPV | TPR | TNR | NPV |
| Null model | 0.76 | 0.00 | 0.00 | 0.00 | 1.00 | 0.76 | 0.76 | 0.00 | 0.00 | 0.00 | 1.00 | 0.76 |
| 1-NN | 0.86 | 0.62 | 0.71 | 0.72 | 0.91 | 0.91 | 0.81 | 0.49 | 0.59 | 0.64 | 0.86 | 0.88 |
| RegioML | 0.93 | 0.81 | 0.88 | 0.83 | 0.96 | 0.95 | 0.90 | 0.72 | 0.80 | 0.76 | 0.94 | 0.93 |
| WLN (not retrained) | 0.93 | 0.80 | 0.92 | 0.78 | 0.98 | 0.93 | 0.92 | 0.78 | 0.88 | 0.78 | 0.96 | 0.93 |
| RegioML (all reactions) | 0.99 | 0.98 | 0.99 | 0.99 | 1.00 | 1.00 | 0.99 | 0.98 | 0.99 | 0.98 | 1.00 | 0.99 |
| RegioSQM20 | 0.89 | 0.70 | 0.75 | 0.81 | 0.91 | 0.94 | 0.88 | 0.69 | 0.73 | 0.80 | 0.91 | 0.94 |
| LightGBM (532 reactions) | 0.84 | 0.52 | 0.84 | 0.43 | 0.97 | 0.84 | 0.84 | 0.51 | 0.78 | 0.46 | 0.96 | 0.85 |
| LightGBM RegioSQM20 | 0.88 | 0.69 | 0.75 | 0.78 | 0.92 | 0.93 | 0.86 | 0.62 | 0.69 | 0.74 | 0.89 | 0.92 |
| RegioSQM20 PBEh-3c | 0.90 | 0.73 | 0.81 | 0.78 | 0.94 | 0.93 | 0.90 | 0.72 | 0.79 | 0.79 | 0.93 | 0.93 |
| LightGBM RegioSQM20 PBEh-3c | 0.90 | 0.72 | 0.81 | 0.76 | 0.94 | 0.93 | 0.87 | 0.65 | 0.74 | 0.73 | 0.92 | 0.91 |
| LightGBM RegioSQM20 regression | 0.87 | 0.65 | 0.74 | 0.73 | 0.92 | 0.92 | 0.86 | 0.61 | 0.70 | 0.71 | 0.90 | 0.91 |
The MCC values for both the test and out-of-sample sets are zero, which clearly shows that the Null model lacks any real predictive power.
As a baseline model, we trained a 1-nearest neighbor (1-NN) classifier corresponding to a perfect memorization of the training set.37 The data show that an impressive-looking 86% accuracy can be achieved for the test set by simply memorizing the training set. In contrast, the MCC value is only 0.62 for the test set and considerably lower (0.49) for the out-of-sample set. These values primarily reflect a low precision, where only 71% and 59% of the positive predictions are actually correct.
Our best machine learning model (LightGBM) achieves considerably better precisions of 88% and 80% for the test and out-of-sample sets, respectively. Note that while there is only a 3% drop in accuracy on going from the test set to the out-of-sample set, there is an 8% drop in the precision (and a concomitant drop in the MCC). Hereafter, we refer to this method (i.e. LightGBM trained on experimental data) as RegioML.
The test set MCC of RegioML is virtually identical to that of the Weisfeiler–Lehman neural network (WLN) architecture specifically trained to predict the regioselectivity of EAS reactions by Struble et al.8 While the precision is 4% higher for WLN, the recall (the fraction of positives that are predicted correctly) is 5% lower, leading to a nearly identical overall performance. WLN performs better on the out-of-sample set, with an MCC value that is nearly identical to that of the test set. However, many of the molecules in these two sets are likely included in the set used to train the WLN method, which is likely to inflate its MCC values. For example, we are able to achieve an MCC value of 0.98 on both the test and out-of-sample sets by training the LightGBM model on the entire collection of data using 10-fold cross-validation (the MCC value is for the best performing model).
For the test set, the recall of RegioSQM20 is similar to that of RegioML (81% vs. 83%), but the precision is significantly worse (75% vs. 88%). For the out-of-sample set, the recall is somewhat better for RegioSQM20 (80% vs. 76%), but the precision is still worse (73% vs. 80%), leading to a slightly smaller MCC value of 0.69 compared to 0.72 for RegioML. In contrast to RegioML, the overall performance of RegioSQM20 is very similar for the test and out-of-sample sets, as one would expect from a more physics-based method. However, RegioSQM20 offers no advantage over RegioML for the out-of-sample dataset, while being computationally much more demanding (see below).
The main advantage of the RegioSQM20 approach is that it may offer an accuracy similar to that of RegioML based on a much smaller training set. Indeed, a LightGBM model trained on the same 532 reactions used to develop RegioSQM20 results in MCC values of around 0.5 for both the test and out-of-sample sets. While the precision is quite good for this model, the recall is less than 50% due to a large proportion of false negatives. Thus, in cases where little experimental data is available, physics-based methods such as RegioSQM20 are likely to outperform ML-based methods, even if the latter rely on quantum descriptors such as atomic charges.
The computational expense of the physics-based methods can be mitigated by using them to generate synthetic data for the machine learning model. Indeed, a LightGBM classifier trained on RegioSQM20 predictions for the large training set of 58k reaction centers offers the same performance as RegioSQM20 itself for the test set. Of course, the performance is worse for the out-of-sample set, just like for RegioML, but the training dataset can now easily be expanded to ensure a better coverage of chemical space. Furthermore, since RegioSQM20 is no longer used to offer real-time predictions to a user, more accurate and computationally expensive methods can be explored. For example, the precision of RegioSQM20 can be increased by 6% by using PBEh-3c single point calculations to compute the proton affinities, an increase that is reflected in the corresponding ML model. The overall performance of RegioSQM20 PBEh-3c is then identical to that of RegioML for the out-of-sample dataset, with an MCC value of 0.72.
We also explore whether it is better to predict proton affinities by regression and use them to identify reactive centers, rather than using the classification approach. Although the LightGBM RegioSQM20 regression model achieves a mean absolute error (MAE) of 2.00 kcal mol−1 on the test set, it is not accurate enough to distinguish between reactive and non-reactive sites as well as the LightGBM RegioSQM20 classification model, as evidenced by the lower recall values of 71–73%. However, the regression model can be used to predict low, medium, or high reactivity, as we showed in the RegioSQM20 paper.6 In fact, by combining the classification and regression models, one gets both regioselectivity predictions and a qualitative prediction of the reactivity of a molecule at almost no additional cost, as the atomic descriptors only have to be calculated once. Examples of the output of RegioML can be seen in Fig. 2.
The results show that the median CPU time requirement of the RegioSQM20 method is 48 s per molecule on four Intel® Xeon® CPU E5-2643 v3 @ 3.40 GHz cores. The RegioML model is almost 100 times faster on just a single Intel® Xeon® CPU X5550 @ 2.67 GHz core, with a median CPU time of less than half a second per molecule. The WLN architecture achieves a mean CPU time of just 0.03 s per molecule on the same single core. The main reason RegioML is slower than WLN is the GFN1-xTB single point calculation needed to compute the atomic charges.
RegioML is trained and tested on 21 201 bromination EAS reactions corresponding to 101k reaction centers, which are split into training, test, and out-of-sample datasets with 58k, 15k, and 27k reaction centers, respectively. The accuracy is 93% and 90% for the test and out-of-sample sets, respectively, but this is not a good measure of performance due to the preponderance of non-reactive sites. For example, the precision (the percentage of positive predictions that are correct) is 88% for the test set, but only 80% for the out-of-sample set. The final RegioML model released to users is trained on the entire dataset, and given its size we expect similar performance for in-sample and out-of-sample molecules. For example, for a molecule in this large training set we expect a precision of 99%, while for molecules that are similar to the training data or truly out-of-sample we expect precisions of 88% and 80%, respectively. The test-set performance is very similar to that of the graph-based WLN method developed by Struble et al.,8 though the comparison is complicated by the possibility that some of the test and out-of-sample molecules are used to train WLN. RegioML outperforms our physics-based RegioSQM20 method,6 where the precision is only 75%. Even for the out-of-sample dataset, RegioML slightly outperforms RegioSQM20.
The good performance of RegioML and WLN is in large part due to the large datasets available for this type of reaction. For example, if we retrain the RegioML model on the same 532 reactions we used to develop RegioSQM20, the performance is much worse due to a large increase in the false negative rate, leading to a recall (the percentage of positives that are predicted correctly) below 50%, compared to the 80% for RegioSQM20. Thus, one use of physics-based approaches such as RegioSQM20 is to generate synthetic data for the ML model for reactions where there is little experimental data. We demonstrate this by showing that the performance of RegioSQM20 can be reproduced by an ML model trained on RegioSQM20-generated data.
Additional code for dataset curation and machine learning training is available at: https://sid.erda.dk/sharelink/HypB1igzDl.
Footnote
† Electronic supplementary information (ESI) available. See DOI: 10.1039/d1dd00032b
This journal is © The Royal Society of Chemistry 2022