 Open Access Article
 Open Access Article
      
        
          
            Jingru 
            Lu
          
        
      , 
      
        
          
            Irina 
            Paci
          
        
       * and 
      
        
          
            David C. 
            Leitch
* and 
      
        
          
            David C. 
            Leitch
          
        
       *
*
      
Department of Chemistry, University of Victoria, 3800 Finnerty Rd. Victoria BC, CANADA V8P 5C2. E-mail: ipaci@uvic.ca; dcleitch@uvic.ca
    
First published on 17th October 2022
We report a multivariate linear regression model able to make accurate predictions for the relative rate and regioselectivity of nucleophilic aromatic substitution (SNAr) reactions based on the electrophile structure. This model uses a diverse training/test set from experimentally-determined relative SNAr rates between benzyl alcohol and 74 unique electrophiles, including heterocycles with multiple substitution patterns. There is a robust linear relationship between the experimental SNAr free energies of activation and three molecular descriptors that can be obtained computationally: the electron affinity (EA) of the electrophile; the average molecular electrostatic potential (ESP) at the carbon undergoing substitution; and the sum of average ESP values for the ortho and para atoms relative to the reactive center. Despite using only simple descriptors calculated from ground state wavefunctions, this model demonstrates excellent correlation with previously measured SNAr reaction rates, and is able to accurately predict site selectivity for multihalogenated substrates: 91% prediction accuracy across 82 individual examples. The excellent agreement between predicted and experimental outcomes makes this easy-to-implement reactivity model a potentially powerful tool for synthetic planning.
One class of organic reactions for which accurate predictive models would be invaluable is nucleophilic aromatic substitution (SNAr). SNAr is one of the most important and well-studied transformations in organic synthesis.26–29 It is extensively used in total synthesis of natural products,30–37 medicinal chemistry and agrochemistry,38–43 and manufacturing of active pharmaceutical and agrochemical ingredients.44–48 For example, SNAr reactions are particularly powerful for the synthesis and functionalization of N-heterocycles, which are among the most ubiquitous structural components in active pharmaceutical ingredients.49–51
Because of its importance in synthesis, designing efficient and highly selective SNAr reactions involving complex molecules is crucial. Substantial research over the past 100 years has been devoted to understanding the operative reaction mechanisms, whether stepwise or concerted,26,52–54 and in collecting experimental reactivity and selectivity data for myriad substrate combinations. For example, Hammett55 and/or Mayr parameters4 are often used as mechanistic probes and to correlate/predict SNAr reactivity (Fig. 1A).56–62
Theoretical and computational methods have been used to develop predictive models for specific subsets of SNAr chemistry (Fig. 1B). Early work focused on stability of the σ-complex intermediates using Iπ-repulsion theory,63,64 or frontier molecular orbital considerations65 to explain and predict regioselectivity.66 Baker and Muir67,68 as well as Brinck, Svensson, and co-workers69–71 have published several works on predicting regioselectivity for SNAr reactions using DFT-calculated transition state energies and/or stability of the σ-complex intermediates (SS).71
Quantum chemical transition state calculations are undeniably a powerful tool to explore reaction mechanisms and provide theoretical evidence to support experimental findings; however, the computational cost of performing transition state analyses remains high, and the complexity and nuance of these calculations make them beyond the expertise of many synthetic research groups. More desirable from an end-user perspective are models built from easily obtained molecular descriptors. In addition to established electronic and steric descriptors,55,72,73 in 2016 Brinck and co-workers introduced the local electron attachment energy (analogous to the local electron affinity) as a molecular descriptor for electrophilicity,74 and have applied it toward reactivity/selectivity predictions for SNAr reactions.75 While this descriptor correlates well with sets of experimental rates, and is able to provide qualitative selectivity predictions in multihalogenated systems, there is a need for new and more varied data and descriptor sets as foundations to build broadly applicable models for synthetic planning.
Recently, Jorner, Brinck, Norrby, and Buttar reported the use of a hybrid DFT/machine learning (ML) approach to predicting experimental activation energies (Fig. 1C).21 This important study collates more than 440 SNAr reaction rates from the existing literature, and uses 34 ground state and transition state descriptors as the training/test set. Notably, DFT-calculated transition state energies are a crucial descriptor in the best-performing model. This hybrid approach is demonstrably powerful, able to generate a broadly applicable and accurate model; however, the existing experimental rate data contains key gaps, such as an overemphasis on nitroarenes, and relatively few heterocyclic electrophiles. The hybrid DFT/ML approach also requires transition state calculations for maximum accuracy, especially if relatively few data points are available.
In this work, we consider the following three aspects of a predictive model to have equal importance: (1) the prediction accuracy the model provides, especially for new (external) predictions; (2) the breadth of applicability the model affords across chemical space; and (3) the ease and simplicity of applying the model to new systems. In the previously described examples, reaction rate/selectivity data used to train and validate the QSRR/QSSR models are taken from literature values, skewing the chemical space coverage toward well-studied systems. To complement the existing SNAr rate data from the literature, we measured relative reaction rates for 74 individual electrophiles – including many nitrogen heterocycles relevant to pharmaceutical synthesis – using a competition experiment approach, which is commonly used to generate univariate Hammett plots.76–81 Having control over the composition of our training set gives us the flexibility to have a varied and balanced distribution of structural features, which is necessary to ensure both accuracy and applicability in making new predictions. To make the model easy to implement, and to reduce the computational cost required, we combined simple and easy-to-obtain ground state molecular descriptors with our own experimentally determined SNAr rates. From this combination of factors, we have constructed a QSRR model for SNAr reactions with excellent performance in predicting reactivity trends and site selectivity for many different electrophiles, including for multiple external test sets with significantly different molecular structures (Fig. 1D).
Finally, we calibrated these relative rate constants using the touchstone reactions, giving absolute rate constants and the corresponding ΔG‡SNAr values for the entire array of SNAr reactions (Table S3†). We used the absolute ΔG‡SNAr value for the 2-chloropyridine touchstone reaction (88.8 kJ mol−1) as the calibration point, with the other two touchstone reactions (2-chloro-6-methylpyridine, and 2-chloro-5-methoxypyridine) used to confirm the validity of the competition determined ΔG‡SNAr values. We obtain a percent difference between the competition values and touchstone values of <2% (Fig. S3†). In addition, we determined independent ΔG‡SNAr values for 17 substrates using multiple competition experiments, giving an estimate of the error for the relative ΔG‡SNAr values; the difference between the average ΔG‡SNAr value and the individual measurements is between 0.2 – 1.7 kJ mol−1 (Table S5†).
Using this competition approach, we were able to rapidly build a reliable and self-consistent data set from a library of 74 (hetero)aryl halides. This includes 6-membered aromatic electrophiles with many different substitution patterns – electron donating/withdrawing groups in all possible positions, multiple substituents, and several heterocycle classes – and thus a variety of electronic effects. The reactivity of these substrates crosses a broad range, with the reaction rates spanning 6 orders of magnitude; a quantitative reactivity scale for several representative electrophiles is shown in Fig. 2D. As an initial check on the validity of our data set, we assessed the general reactivity trends against the known features of SNAr reactivity. As expected, electron-deficient arenes react much faster than electron-rich ones; furthermore, the reactivity of the halides leaving groups follows the established trend, with rates decreasing as Ar–F >> Ar–Cl ∼ Ar–Br.82 We also constructed Hammett plots for four sets of 2-X-pyridine substrates (X = Cl, Br), giving linear correlations with rho values of ∼4–5 (Fig. S4–S7†). Finally, we have prepared and isolated 5 representative SNAr products (compounds S1-S5), and confirmed their structures using NMR spectroscopy and high-resolution mass spectrometry (Fig. S8–S17†).
By building a multivariate linear correlation between these three ground state descriptors and our experimentally obtained ΔG‡SNAr values, we have established a unified structure-reactivity model able to accurately predict SNAr rates for electrophiles with various structural features and leaving groups under our reaction conditions. There is an excellent linear correlation between the predicted and actual ΔG‡SNAr values (R2 = 0.92) and a mean absolute error (MAE) of only 1.8 kJ mol−1 (0.43 kcal mol−1) (Fig. 3B). Performing a min/max normalization of the descriptors reveals their percentage contribution to the model, with ESP1 being most important (50%), followed by ESP2 (35%), and finally only a modest contribution from the EA (15%). We note that including steric-based descriptors was not necessary to obtain good correlations for our data set; adding substituent A-values as an additional factor in our multivariate regression led to no change in the model, and a very small coefficient for the A-value term (Table S7†). Further work to explore steric effects in a wider range of SNAr reactions is ongoing.
We have assessed the robustness of the model using cross-validation with five different random 60/40 training/test set data splits (Fig. 3C and S20–24†) and one structured split (Fig. S25†). All of these regression analyses give essentially identical results, with excellent correlation statistics as indicated by the range of Q2 values89 from 0.86 to 0.93, and MAE values from 1.6 to 2.3 kJ mol−1 for the test sets. We also evaluated the 95% prediction intervals for the 29 members of the test set in Fig. 3C, giving a range of ±5.1 kJ mol−1 to ±5.5 kJ mol−1 (Fig. S20†). Finally, we also assessed the model performance by analysing the distribution of residuals across the data set, and identifying any possible outliers. As shown in Fig. 3D, the residuals are randomly distributed, almost exclusively in the range −5 to +5 kJ mol−1 (i.e. within an order of magnitude of the experimental rate). A box plot reveals only one significant outlier (|residual| > 5 kJ mol−1): 2-(N-methylcarboxamide)-4-chloropyridine.
The selection of these specific molecular descriptors was guided by the mechanistic features of nucleophilic aromatic substitution, as well as our previous work on a multivariate model for oxidative addition with (hetero)aryl halides.83 We also carried out an iterative refinement of the included descriptors based on our experimental observations and model performance (Table S5†). The following discussion provides more detail on creation and refinement of the model and its mechanistic basis.
A classic approach to describing nucleophile/electrophile reactivity involves frontier molecular orbital (FMO) theory.90,91 At a basic level, a lower LUMO energy for the electrophile leads to smaller HOMO–LUMO gap between nucleophile and electrophile. This results in a lower energy transition state, and therefore a faster reaction. On the other hand, this simple connection between electrophilicity and LUMO energy is not necessarily valid for every system: in one recent example, Zipse, Ofial, and Mayr have demonstrated poor correlation between LUMO energy and electrophilicity for a series of Michael acceptors.92 This is attributed to substituent effects that increase π-conjugation (lowering LUMO energy), but decrease electrophilicity. Nevertheless, we considered including LUMO energies as a potential molecular descriptor for SNAr reactivity.
As a substitute for LUMO energies, we initially used calculated electron affinity (EA) values for each electrophile, since EA is a physical observable that can be experimentally measured. Conceptually, EA and LUMO energy are related according to the Koopmans's theorem approximation (that the LUMO energy is the negative of the EA),93,94 enabling an intuitive analogy to be made to FMO treatments. To confirm this analogy for the substrate set under study, we compared our calculated EA values to LUMO energies obtained via DFT (B3LYP/def2-TVZPD, Fig. S26†), revealing a strong linear correlation (R2 = 0.94). We also investigated an operationally simpler approach to calculating LUMO energies using Entos Envision,95 an open online interactive platform for molecular simulation and visualization that performs rapid semi-empirical calculations using GFN1-xTB.96 Comparing these semi-empirical LUMO energies to our EA calculations also reveals a strong linear correlation (R2 = 0.88, Fig. S28†). In addition, using either set of LUMO energies in lieu of EA values gives nearly identical linear regression models to that in Fig. 3B (Figs. S27 and S29†). While we retained the EA values for our subsequent validation and external predictions, LUMO energies from DFT or semi-empirical calculations could certainly be a rapid and easy to calculate alternative for synthesis-focused research groups.
To account for substituent effects beyond those on FMO energies, we used average molecular ESP at individual aromatic ring atoms as a local descriptor.85–88 The extent of electron deficiency at the reactive carbon is a key factor in determining SNAr rates, and the corresponding ESP is a quantitative descriptor of this molecular feature. Previously, we observed excellent correlation between ESP-based descriptors and rates of Ar–X oxidative addition to Pd(0),83 which shares mechanistic aspects with SNAr reactivity.97 All ESP calculations were performed using the freely available Multiwfn application (version 3.7).98,99
We initially constructed a bivariate linear model using just two descriptors: EA and ESP1 (at the carbon undergoing substitution) (Fig. 4A). This model gives good predictions for halogenated pyridines and quinolines; however, it significantly underestimates the reactivity of halogenated pyrimidines, and overestimates the reactivity of several non-heterocyclic haloarenes. The nature of these outliers led us to consider the electronic structure of the Meisenheimer intermediate and SNAr transition state more generally. During substitution, the excess negative charge in the intermediate/TS‡ is distributed via resonance to the ortho and para positions relative to the reactive site; the degree to which these atoms can stabilize this negative charge should therefore affect the reaction rate. Thus, we included the ESP2 descriptor to account for these additional electronic effects, giving the superior model shown previously in Fig. 3B (vide supra).
To highlight the importance of ESP2 in making accurate predictions for multiple electrophile classes, we examined the two largest outliers from the bivariate model on either side of the distribution. We paired these two outliers with halopyridines that have very similar ESP1 values, but significantly different observed ΔG‡SNAr (Fig. 4B). In the first case, the faster than predicted outlier 4-chloro-6-morpholinopyridine has very similar EA and nearly identical ESP1 values to 4-chloro-2-methylpyridine; however, these two electrophiles have a ΔΔG‡SNAr = 10.9 kJ mol−1 (∼100 fold rate difference at 298 K). These substrates have strikingly different ESP2 characteristics, with the pyrimidine exhibiting a substantially larger negative value due to the additional nitrogen in the ring. The same situation is observed for the slower than predicted outlier 1-bromo-3,5-bis(trifluoromethyl)benzene and 2-chloro-5-(trifluoromethyl)-pyridine (ΔΔG‡SNAr = 11.3 kJ mol−1): both substrates have nearly identical EA and ESP1 descriptor values, but a more than 120 kJ mol−1 difference in ESP2.
For the 13 multihalogenated substrates in our library, we determined the experimental site selectivity and compared the resulting ΔΔG‡SNAr to that predicted by our descriptor-based model. We also calculated ΔΔG‡SNAr values for 5 of the substrates from DFT analysis of the corresponding transition states (Fig. 5). In every case, using the three-descriptor model from Fig. 3B to independently predict ΔG‡SNAr for each site correctly identifies the most reactive position, with reasonable quantitative accuracy that is comparable to that obtained via transition state analysis; however, the model-predicted ΔΔG‡SNAr between sites does appear to be systematically low (i.e. selectivity is consistently underestimated).
To identify possible reasons for this systematic underestimation, we considered that our global EA descriptor may not be optimal in these cases, and chose the first three substrates from Fig. 5 for further investigation. To assess the FMOs involved in these specific regioselective SNAr reactions, we examined the symmetries of the LUMO and LUMO + 1 orbitals of the substrates, and calculated the structures and energies of the SNAr transition states (Fig. 6 and S30–S39†). In each case, we could not locate a Meisenheimer-type intermediate along the reaction coordinate, but did locate transition states consistent with concerted SNAr reactions.29,53,54,102 As shown for 2,4-dichloropyridine in Fig. 6, the relevant electrophile FMO for attack at C4 is the LUMO, whereas for attack at C2 it is the LUMO+1; this is evident from the LUMO/LUMO+1 symmetries of the substrate, and the HOMO symmetries of two transition states. Subtracting the calculated LUMO/LUMO+1 gap from the EA as a correction when applying the model from Fig. 3B for C4 versus C2 predictions of the first three substrates does give increased accuracy, with errors of 0.3–1.5 kJ mol−1 for ΔΔG‡SNAr.
|  | ||
| Fig. 7 Model validation through assessing correlations between experimental ΔG‡ values and predicted ΔG‡SNAr for three external data sets. (A) SNAr between chlorobenzene derivatives and methoxide; experimental data from ref. 56 and 103 (B) SNAr between (hetero)aryl chlorides/fluorides and piperidine; experimental data from ref. 105 (C) SNAr between substituted 1-bromo-2-nitrobenzenes and piperidine; experimental data from ref. 104. | ||
Notably, we are able to account for solvation effects on electrophile reactivity during descriptor generation. In data set C (Fig. 7C), there are several substrates containing acidic or basic functional groups where the initial correlation between experimental and predicted reactivity is poor (Fig. 7C, substrates 4C, 11–13C, red points). Given that these functional groups will hydrogen-bond with the piperidine solvent, significantly altering the electronics of the substrate, we included one explicit solvent molecule and recalculated the ESP descriptors for these four electrophiles.75 Using these revised ESP values, we obtain excellent linear correlation across the entire substrate set.
In addition to the success in applying the ESP/EA model beyond the training set, and in identifying solvation effects on reactivity, we can also identify potential experimental outliers. For example, the data set in Fig. 7C contains one significant outlier (6C). In this case, 6C has two potentially reactive positions (Ar–Br and Ar–F). We have experimentally confirmed that reacting 6C with piperidine leads to a mixture of the two SNAr products, in a 1.5![[thin space (1/6-em)]](https://www.rsc.org/images/entities/char_2009.gif) :
:![[thin space (1/6-em)]](https://www.rsc.org/images/entities/char_2009.gif) 1 ratio, slightly favouring Ar–Br substitution (Fig. S40†).
1 ratio, slightly favouring Ar–Br substitution (Fig. S40†).
|  | ||
| Fig. 8 Site selectivity predictions and rate correlation for SNAr between fluorinated arenes and ammonia. Experimental data from ref. 99 | ||
|  | ||
| Fig. 9 Site selectivity predictions and rate correlation for SNAr between fluorinated arenes and methoxide. Experimental data from ref. 100 | ||
|  | ||
| Fig. 10 Site selectivity predictions and rate correlation for SNAr between fluorinated heterocycles and ammonia. Experimental data from ref. 99, 101, and 102. | ||
The first data set involves 7 multiply fluorinated arenes undergoing substitution with ammonia, where 5 substrates have potential for regioisomer formation (Fig. 8).106 In each case, the predicted major site based on the ESP/EA model matches the experimental site. Furthermore, the predicted ΔG‡SNAr values correlate well with the experimental ln (k) values for these 5 substrates (R2 = 0.95). Notably, ln (k) for substrates 8b and 8d do not correlate; this exact situation was noted by Stenlid and Brinck, who also observed these two substrates as significant outliers when correlating ln (k) with the local electron attachment energy.75 While these authors attributed this discrepancy between prediction and experiment to steric effects, there may be a different underlying reason considering the small size of both the nucleophile (ammonia) and the cyano group in 8d.
The second data set also involves multiply fluorinated arenes, this time undergoing SNAr with the methoxide anion as the nucleophile in methanol solvent (Fig. 9).107 Across these 10 substrates, 5 have the potential to form regioisomers. In each of these cases, the ESP/EA model correctly predicts the major site of reaction. For substrate 9d, the predicted second most reactive site is incorrect (C2) based on experimental observation (C3); however, for 9e the predicted reactivity order from first to third site is correct. While we again observe an underestimation of selectivity based on predicted ΔG‡SNAr values, we do observe excellent linear correlation with experimental ln (k) across the entire substrate set. This is notable in the context of Stenlid and Brinck's prior work with local electron attachment energy, where the experimental ln (k) for 9g-j does not correlate with that descriptor. Here, the ESP/EA model correctly predicts that these four substrates should have similar SNAr rates (within a factor of 10 of each other).
The third data set contains 18 multiply fluorinated nitrogen heterocycles undergoing SNAr with ammonia, with 15 examples where regioisomers can be formed (Fig. 10).106,108,109 In every case ESP/EA model correctly predicts the major site of reaction, and in all but one case (10l) it also predicts the second site of reaction. The quantitative selectivity predictions are also much closer to the experimental values within this data set. We again observe excellent linear correlation between experimental ln (k) and predicted ΔG‡SNAr. Note that substrate 10r, which has a rate “too fast … to measure”,109 is estimated to have an ∼105-fold larger rate constant than 10d; this estimated data point is not included in the linear correlation.
Finally, to challenge the qualitative accuracy of the model, we applied it toward a series of more complex SNAr examples with a wider variety of nucleophiles (Fig. 11). Sets A–D were previously collated and categorized by Brinck, Svensson, and co-workers and categorized depending on the nature of the nucleophile/electrophile pairing.70,109–127 Using only the structure of the electrophile, our ESP/EA model is able to correctly predict the major site of reaction in 26 of the 32 cases. Within sets A and C – (hetero)aryl halides reacting with anionic nucleophiles – the two incorrect predictions are for relatively non-polar fluorinated arenes. For sets B and D, which employ neutral nucleophiles, the incorrect examples all involve secondary amine nucleophiles. In these cases, steric effects appear to play a significant role in overriding the electronic nature of the electrophile; for example, pentachloropyridine reacts preferentially at C4 (as predicted) with alkoxide or ammonia nucleophiles, but switches to C2 selectivity with diethylamine. We also applied predictions to 6 mixed halide electrophiles reacting with a variety of nucleophiles in set E, drawn from examples in medicinal/agrochemical discovery.128–133 The model is able to correctly identify the major site of reactivity for each example, except for a case where the predicted site is at an Ar–F, and the observed reactivity is at a 2-Cl-pyridine site.
|  | ||
| Fig. 11 Qualitative site selectivity predictions for combinations of (hetero)aryl halides with anionic (A and C) and neutral (B and D) nucleophiles, and for mixed halide aromatics (E). | ||
The first four examples concern site selective SNAr to generate a variety of targets from structurally complex substrates. In each of these cases, the ESP/EA model is able to predict the correct reactive site. Thus, applying these predictions during synthetic design would help pharmaceutical process chemists to proceed with confidence that selective substitution is feasible. In fact, the chemists at Pfizer used an internal prediction tool (based on Fukui indices) to help guide their synthetic planning toward the EGFR T790 M inhibitor (2nd example in Fig. 12).135
A particularly powerful aspect of in silico reactivity predictions is the ability to evaluate multiple options in substrate design before committing experimental resource. We have examined three examples where the substitution pattern of the SNAr electrophile affects the site selectivity or reactivity. In the first case, synthesis of the target SRI/5-HT2A antagonist requires a site selective SNAr to install an aryl ether ortho to a carbonyl functionality.138 This was initially performed using an aldehyde moiety; however, the relatively poor site selectivity meant column chromatography was required to purify the intermediate. Further process developments identified an N-methylamide as a more selective alternative that retained key functionality for progressing to the target API. This improved selectivity is predicted by the ESP/EA model. A second case involves choice of either an Ar–F or Ar–Cl electrophile for SNAr with an alkoxide nucleophile.139 Experimental evaluation of each revealed that both substrates are viable, with the Ar–Cl version requiring slightly higher reaction temperature than the Ar–F analogue. The ESP/EA model predicts that the F for Cl switch would result in a relatively modest reactivity decrease, indicating both should be suitable substrates.
The final example concerns an intramolecular SNAr to generate an indazole en route to merestinib.140 The final API contains a methoxy group para to the indazole nitrogen; however, attempts to perform the intramolecular SNAr with this strong electron donating group para to the substitution site were not successful. Instead, the researchers installed a nitro group to enable the SNAr to proceed, but which would require multiple functional group interconversions. The substantial difference in reactivity between –OMe and –NO2 derivatives is conceptually obvious (and borne out by the ESP/EA model); however, the orders-of-magnitude difference in predicted rate between the two means that the more desirable –OMe substrate could be ruled out earlier on in synthetic development. Furthermore, additional hypothetical substrates that retain the required oxygen (such as a sulfonate) could be evaluated using the prediction model (the –OMs derivative has a predicted ΔG‡SNAr halfway between the –NO2 and –OMe derivatives).
Importantly, even though the model was trained using only one set of reaction conditions with a single nucleophile, it is suitable for making correlations and predictions about SNAr reactivity for a wide variety of nucleophiles, solvents, and temperatures. These include a >90% success rate in predicting the major reaction site for multihalogenated arenes (>80 cases), and examples where substrate design for active pharmaceutical ingredient synthesis can be informed by predicted reactivity. Thus, this simple and easy-to-apply model can generate rapid and accurate predictions for complex molecule targets. There are still specific limitations to be addressed, including the inability of the model to properly predict selectivity outcomes for non-halogenated leaving groups (e.g. –NO2 or –OMe) and for bulky nucleophiles (as shown in Fig. 11). Further work to build additional targeted models for these effects in SNAr chemistry, as well as for additional commonly-used organic reaction classes is currently underway in our laboratories.
| Footnote | 
| † Electronic supplementary information (ESI) available: detailed experimental and computational procedures, statistical modeling information, supplementary figures, tables of molecular descriptors, and coordinate files for calculated structures. See DOI: https://doi.org/10.1039/d2sc04041g | 
| This journal is © The Royal Society of Chemistry 2022 |