Automated Quantum Chemistry for Estimating Nucleophilicity and Electrophilicity with Applications to Retrosynthesis and Covalent Inhibitors

Reactivity scales such as nucleophilicity and electrophilicity are valuable tools for determining chemical reactivity and selectivity. However, prior attempts to predict or calculate nucleophilicity and electrophilicity are either not capable of generalizing well to unseen molecular structures or require substantial computing resources. We present a fully automated quantum chemistry (QM)-based workﬂow that automatically identi-ﬁes nucleophilic and electrophilic sites and computes methyl cation aﬃnities and methyl anion aﬃnities to quantify nucleophilicity and electrophilicity, respectively. The calculations are based on r 2 SCAN-3c SMD(DMSO) single-point calculations on GFN1-xTB ALPB(DMSO) geometries that, in turn, derive from a GFNFF-xTB ALPB(DMSO) conformational search. The workﬂow is validated against both experimental and higher-level QM-derived data resulting in very strong correlations while having a median wall time of less than two minutes per molecule. Additionally, we demonstrate the workﬂow on two diﬀerent applications: ﬁrst, as a general tool for ﬁltering retrosynthetic routes based on chemical selectivity predictions, and second, as a tool for determining the relative reactivity of covalent inhibitors. The code is freely available on GitHub under the MIT open source license and as a web application at www.esnuel.org.


Introduction
A fundamental principle covered in most chemistry textbooks is that electrophiles (electronseeking) react with nucleophiles (nucleus-seeking), and one textbook even states that: In essence, organic chemistry is all about the interaction between electron-rich atoms or molecules and electron-deficient atoms or molecules. 14][5] Over the years, nucleophilicity and electrophilicity have been quantified to create reactivity scales that explain why some atoms or molecules are more reactive than others.Most noteworthy is the work by Mayr and co-workers, who experimentally studied the reactivity of organic molecules that led to the Mayr-Patz equation; where the bimolecular rate constant (k 20 • C ) of a nucleophile-electrophile reaction is related to the following experimentally determined reactivity parameters; nucleophilicity (N ), electrophilicity (E), and a nucleophile-specific sensitivity factor (s N ).
Essential to Equation 1 is that k 20 • C can span from non-observable reactions (k 20 • C < 10 −5 M −1 s −1 ) to diffusion-controlled reactions (k 20 • C > 10 9 M −1 s −1 ), 7,8 and its applicability has been confirmed for various molecules resulting in Mayr's database that currently holds experimental reactivity parameters for 352 electrophiles and 1,261 nucleophiles. 7However, measuring reaction rates to extract reactivity parameters is generally time-consuming and often difficult at the extremes of the reactivity scales.Thus, several in silico methods have been proposed to circumvent this problem. 8,9This includes estimating the rate constant using the Eyring equation 10,11 and computing the reactivity parameters from frontier molecular orbital (FMO) energies 12,13 or chemical affinities [14][15][16] .[19][20][21][22][23] In this work, we will focus on the studies by Van Vranken and Baldi showing that calculated methyl cation affinities (MCAs) and methyl anion affinities (MAAs) of structurally different molecules correlate with Mayr's N •s N and E, respectively, when considering solvent effects. 15,16Based on these findings, they created two QM-derived datasets with reactivity parameters for 1,232 nucleophiles and 1,113 electrophiles (we have excluded 76 duplicates) covering ∼ 50 orders of magnitude in each case. 24The datasets were used to train different ML models by treating the affinities as either atomic, functional group, or molecular properties.The best-performing architecture was an atom-based graph attention network (GAT) achieving 10-fold cross-validation R 2 coefficients of 0.92 ± 0.02 and 0.94 ± 0.02 for MCAs and MAAs, respectively.However, it is worth noting that the two datasets contain molecules that are not in Mayr's database, and the ML models have not been validated against experimental reactivity parameters.Furthermore, the atom sites are identified by hand, which makes it difficult to apply their method to arbitrary molecules in an automated fashion.This work will address these limitations by introducing a fast and fully automated quantum chemistry-based workflow that detects relevant sites and calculates associated MCAs and MAAs.In addition, we will apply the workflow to different tasks to highlight the applicability of MCAs and MAAs as quantitative measures of chemical reactivity.

Automated Quantum Chemistry-Based Workflow
We have previously presented fully automated quantum chemistry-based workflows for predicting the regioselectivity of EAS reactions 25 and palladium-catalyzed Heck reactions 26 .Following this work, we introduce a workflow that, given a SMILES string as input, identifies possible nucleophilic and electrophilic sites and individually attaches methyl cations or anions to calculate associated MCA or MAA.The input SMILES string is modified using a set of SMIRKS and RDKit 27 , which can easily be adjusted to include/exclude certain functional groups.The nucleophilic sites include double/triple-bonded atoms, singly charged anions, atoms with lone pairs, and specific functional groups such as aldehydes, amides, amines, carbanions, carboxylic acids, cyanoalkyl/nitrile anions, enolates, esters, ethers, imines, isonitriles, ketones, nitranions, nitriles, and nitronates.The electrophilic sites include double/triple-bonded atoms, singly charged cations, and specific functional groups such as acyl halides, aldehydes, amides, anhydrides, carbocations, esters, imines, iminium ions, ketones, Michael acceptors, and oxonium ions.The calculated MCAs and MAAs are defined in Equations 2 and 3 as the negative energy difference (−∆E) of a nucleophile (Nuc) reacting with a methyl cation (CH + 3 ) and an electrophile (Elec) reacting with a methyl anion (CH - 3 ), respectively.
For each compound, we embed min(1 + 3 • n rot , 20) conformers using RDKit, where n rot is the number of rotatable bonds.The conformers are prescreened with force-field optimizations in dimethyl sulfoxide (DMSO, dielectric constant = 46.68)using GFNFF-xTB 28 and the analytical linearized Poisson-Boltzmann (ALPB) solvation model 29 as implemented in the open source semiempirical software package xtb. 30Only unique conformers within a 10 kJ/mol cut-off are carried forward by selecting the centroids of a Butina clustering using a pairwise heavy-atom position root mean square deviation (RMSD) with a threshold of 0.5 Å.The remaining conformers are then reoptimized with GFN1-xTB ALPB(DMSO) before refining the energy of the lowest energy conformer with single-point DFT calculations in DMSO using the r 2 SCAN-3c composite electronic structure method 31 and the SMD solvation model 32 as implemented in the quantum chemistry program ORCA version 5.0.1. 33

Datasets
The automated quantum chemistry-based workflow is tested against both experimental and higher-level QM-derived datasets.The former refers to subsets of Mayr's database, 7 which were used in the initial work of Van Vranken and Baldi,

Results and Discussion
Comparison to Experimental and Higher-Level QM-Derived Data In Figure 1, we present a comparison of the automated quantum chemistry-based workflow against the experimental and higher-level QM-derived data.The reference data are provided for a single atom in each molecule, so the calculated MCAs and MAAs are only evaluated for these labeled atom sites.However, such labels are not provided for the data in Figure 1c.In this case, we use the highest calculated MAAs, although some associated sites have been compared to the molecular coordinates provided by Mood et al. 15 and confirmed as the actual reaction centers.
The top panels in Figure 1 show that the workflow can reproduce a strong correlation between experimental reactivity parameters and calculated MCAs and MAAs as previously observed by Van Vranken and Baldi. 15,16Specifically, the R 2 coefficients of 0.84 and 0.94 are somewhat similar to 0.89 and 0.96 for MCAs and MAAs, respectively, with the latter based on Gibbs free energies at the PBE0-D3(BJ)/DEF2-TZVP COSMO(∞) level of theory.
The ability to replicate higher-level results is further supported by the very strong correlation in the bottom panels of Figure 1 with R 2 coefficients of 0.98 and 0.99 for MCAs and MAAs, respectively.These results actually outperform the ML models of Tavakoli et al. 24 , which achieved 10-fold cross-validation R 2 coefficients of 0.92 ± 0.02 and 0.94 ± 0.02 for MCAs and MAAs, respectively.Of course, the better performance comes with a higher computational cost, although the median wall time for this data is less than two minutes using eight CPU cores (Intel(R) Xeon(R) CPU X5550 @ 2.67GHz).In fact, the timings can be further improved due to the handling of structures and conformers being embarrassingly parallel.Alternatively, omitting the single-point r 2 SCAN-3c SMD(DMSO) calculations will greatly reduce the computational cost without impacting the above results significantly as seen in Figure S2 in the supporting information.It should be noted that the molecules in the bottom panels of Figure 1 have on average ∼ 10 heavy atoms and ∼ 6 identified electrophilic and nucleophilic sites.

Post-Filtering of Retrosynthetic Routes
To highlight the applicability of the workflow, we will first demonstrate how MCAs and MAAs can provide insights into the selectivity of chemical reactions.In Figure 2, we show two examples of synthesizing the antiretroviral drug Raltegravir that is used to treat HIV infections.The first example is an experimentally reported procedure from the work of Caputo et al. 34 and the second example is a predicted retrosynthetic route by Manifold, which is a CASP tool from PostEra 35 .The most nucleophilic and electrophilic sites for each structure are highlighted in green and blue, respectively, and additional values can be found in the supporting information.
The results in Figure 2a show that by locating the highest MCA and MAA among the two reactants (the values marked in bold), it is possible to predict the selectivity of the reaction.The most electrophilic site is the carbonyl carbon of 1a with a MAA of 395 kJ/mol, which is 89 kJ/mol higher than the second-highest MAA.The MAA of the carbonyl carbon is marked with a star ("*") as the chlorine atom acts as a leaving group during the geometry optimization when attaching the methyl anion to the carbonyl carbon.The most nucleophilic site in Figure 2a is the primary amine of 2a with a MCA of 449 kJ/mol, which is 56 kJ/mol higher than the second-highest MCA.Hence, the experimentally observed nucleophilic acyl substitution reaction can be correctly predicted by locating the highest MCA and MAA despite the two reactants 1a and 2a having a total of 27 nucleophilic and 20 electrophilic sites identified by the workflow.Now, we will turn to Figure 2b to analyze and validate the reaction steps proposed by Manifold and demonstrate how the workflow can be applied as a post-filtering tool.A CASP tool like Manifold usually outputs many different retrosynthetic routes, but some of the steps in the retrosynthetic routes may not be feasible due to selectivity issues.We propose using MCAs and MAAs as a tool to predict chemical selectivity and flag steps that are potentially incorrect.The retrosynthetic routes can then be ranked based on the number of warning flags similar to how retrosynthetic routes are commonly ranked based on the number of synthetic steps and the total price of building blocks.The first reaction step in Figure 2b involves an ester amidation between 1b and 2b.However, based on the highest MCA and MAA, a more favorable reaction would be an ester amidation between two 2b compounds with MCA and MAA of 478 and 242 kJ/mol for the primary amine and the carbonyl carbon, respectively.The latter is marked with a star ("*") as the proton from the phenol group moves to the carbonyl oxygen during the geometry optimization when attaching the methyl anion to the carbonyl carbon.The MAA of the carbonyl carbon in 1b is 186 kJ/mol, which is 56 kJ/mol lower than the highest MAA.In fact, another site in 1b has a MAA that is 35 kJ/mol higher than the MAA of the carbonyl carbon.This reaction step should therefore be flagged as the chance of this reaction step being a success is low.Instead, one could imagine a similar retrosynthetic route starting with a nucleophilic acyl substitution similar to the one shown in Figure 2a by replacing 1b with 1a.
The second reaction step involves a Chan-Lam coupling reaction between the product of the first reaction step (3b) and methylboronic acid (4b).This reaction employs a copper catalyst, which makes the reaction mechanism more complex, and validating the reaction solely based on the highest MCA and MAA is probably not sufficient.However, looking at the MCAs and MAAs we see that none of the proposed reaction centers have the highest MCA and MAA.In terms of the MAAs, the workflow did not identify any electrophilic sites in 4b, and the highest MAA is therefore found in 3b.This MAA is marked with a star ("*") as the attached proton moves to the neighboring nitrogen atom during the geometry optimization.The highest MCA is 428 kJ/mol, which is only 17 kJ/mol higher than the MCA of the proposed reaction center.Removing the proton from the proposed reaction center would make it the most nucleophilic site.However, based on predicted pKa values by MarvinSketch from ChemAxon the phenol group is more acidic, and a deprotonation of this phenol group would lower the MCA of the proposed reaction center as seen in the supporting information.
The last reaction step involves an ester amidation reaction between the product of the second reaction step (5b) and 4-fluorobenzylamine (6b).The most nucleophilic site is the primary amine of 6b with a MCA of 489 kJ/mol, and this site is also the proposed reaction center of 6b.The most electrophilic site is the highlighted carbon atom next to the fluorine atom of 6b with a MAA of 310 kJ/mol.The MAA is marked with a star ("*") as the fluorine atom acts as a leaving group during the geometry optimization when attaching the methyl anion to the highlighted carbon atom.The second most electrophilic site is the carbonyl carbon in 5b, which is the other proposed reaction center.This site is marked with a star ("*") as the proton from the phenol group moves to the carbonyl oxygen during the geometry optimization when attaching the methyl anion to the carbonyl carbon.In summary, the use of MCAs and MAAs to flag retrosynthetic steps for further inspection seems promising but will require further work.

Covalent Inhibitor Reactivity Predictions
A second application of the automated quantum chemistry-based workflow is the ability to predict the reactivity of covalent inhibitors.In the work of Hermann et al. 13 , calculated activation energies of different warheads reacting with CH 3 S -were used to estimate the reactivity towards cysteine.As a faster alternative, they also showed that a warhead-associated electrophilicity index could be used to predict the reactivity of some warhead classes.The electrophilicity index (ω) is defined as where χ and η are the electronegativity and chemical hardness, respectively.Following Koopmanns' theorem to define the ionization potential (IP = −ε HOMO ) and the electron affinity (EA = −ε LUMO ), the electronegativity and chemical hardness can be approximated as where ε HOMO and ε LUMO are the energies of the highest occupied and lowest unoccupied molecular orbitals.Unfortunately, employing the canonical HOMO and LUMO energies to calculate the electrophilicity index does not result in high predictability with respect to the reactivity of covalent inhibitors.Instead, Hermann et al. 13 showed that by analyzing the highest occupied and lowest unoccupied orbitals and selecting those that are associated with the warhead (i.e., the left-hand side of the structures in Figure 3), it is possible to calculate a warhead-associated electrophilicity index that strongly correlates with the reactivity of covalent inhibitors.
In Figure 3, we compare the calculated activation energies to the warhead-associated electrophilicity index and calculated MAAs for various acrylamides (top), propargylamides (middle), and 2-chloroacetamides (bottom).The calculated MAAs are obtained for the leftmost carbon atom in each of the depicted structures with the chlorides (Cl -) being removed for the 2-chloroacetamides.
The results of the acrylamides show strong correlations for both the warhead-associated electrophilicity index and MAAs with R 2 coefficients of 0.87 and 0.80, respectively.However, computing the warhead-associated electrophilicity index requires an analysis of the FMOs to select suitable orbitals, whereas the MAAs are straightforward to calculate.In fact, the R 2 coefficient for the acrylamides without adjusting the electrophilicity index to the warhead-associated HOMO and LUMO energies (i.e., simply relying on the canonical MO-based HOMO and LUMO energies) is only 0.60, which is significantly worse than calculating MAAs.
The propargylamides only include nine different structures shown in Figure S11.When considering all of them, the R 2 coefficients are 0.64 and 0.67 for the warhead-associated electrophilicity index and calculated MAAs, respectively.Excluding the red entries in Figure 3b, viewed as outliers in the work of Hermann et al. 13 , results in a significantly better R 2 coefficient of 0.89 for the warhead-associated electrophilicity index.Yet, the MAAs for these structures align well with the other entries.On the other hand, the blue entry is relatively far from the black regression line.This entry corresponds to a structure with bulky groups on both sides of the triple bond, and the transition vector could be pointing toward the neighboring SP-hybridized carbon atom.Unfortunately, the transition state structures are not available.Instead, we can calculate the MAA for this neighboring carbon atom resulting in a strong correlation with an R 2 coefficient of 0.85.This approach is only possible for the atom-specific MAAs as the warhead-associated electrophilicity index uses FMOs primarily localized on both SP-hybridized carbon atoms.
The results of the 2-chloroacetamides show no correlation for both the warhead-associated electrophilicity index and MAAs.This behavior is extensively studied in the work of Hermann et al. 13 , and their arguments reflect the change from the Michael-type nucleophilic additions to an S N 2 reaction.Specifically, they show that the LUMO energy correlates with the bond strength of both the C-Cl and C-SMe bonds.Thus, the electrophilicity index fails to capture the energetics of the reaction due to the simultaneous bond formation and rupture.However, the calculated MAAs behave similarly despite not depending on FMO energies, which raises questions about their reasonings.Furthermore, when using CH 3 S - instead of CH - 3 , we again find no correlation with an R 2 coefficient of 0.01 as seen in the supporting information.This result is surprising as it contradicts the Bell-Evans-Polanyi principle (i.e., the change in activation energy for similar reactions being proportional to the change in reaction enthalpy), indicating that further analysis of the transition state structures should be carried out.

Conclusions and Outlook
We present a fully automated quantum chemistry (QM)-based workflow that automatically identifies nucleophilic and electrophilic sites and computes methyl cation affinities (MCAs) and methyl anion affinities (MAAs) to quantify nucleophilicity and electrophilicity, respectively.The workflow shows strong correlations against experimental data from Mayr's database with R 2 coefficients of 0.84 and 0.94 for the comparison of MCAs and MAAs to experimental N • s N and E values, respectively.Furthermore, the workflow achieves similar performance as higher-level PBE0-D3(BJ)/DEF2-TZVP COSMO(∞) calculations with R 2 coefficients of 0.98 and 0.99 for MCAs and MAAs, respectively, while having a median wall time of less than two minutes per molecule.
Additionally, we highlight two different applications of the workflow.The first application is within computer-aided synthesis planning (CASP), where the workflow can be used to predict chemical selectivity and detect potential problems in a retrosynthetic route.This is demonstrated using experimental data from the work of Caputo et al. 34 and a retrosynthetic route by Manifold from PostEra 35 for the synthesis of Raltegravir.The workflow correctly predicts the reported reaction from the work of Caputo et al. 34 by locating the highest MCA and MAA despite the two reactants having a total of 27 nucleophilic and 20 electrophilic sites.However, some of the steps in the retrosynthetic route by Manifold are found problematic suggesting that another route should be preferred.
In the second application, we show that the workflow can be used to predict the reactivity of covalent inhibitors.We report a strong correlation between MAAs and calculated activation energies for various acrylamides and propargylamides similar to the work by Hermann et al. 13 using a warhead-associated electrophilicity index.The results of the 2-chloroacetamides showed no correlation for both the warhead-associated electrophilicity index and MAAs, which could be due to errors in the calculated activation energies.The advantage of the MAAs over the warhead-associated electrophilicity index is that the MAAs are atom-specific and completely straightforward to calculate.Whereas, the warhead-associated electrophilicity index requires a selection of the highest occupied and lowest unoccupied orbitals that are associated with the warhead to match the performance of the MAAs.Future work will use the QM-based workflow to calculate MCAs and MAAs for a large set of diverse molecules and train an atom-based ML model for each property similar to the one presented in Ree et al. 37 .The ML models will then be used to predict the chemical selectivity for a large set of experimentally reported reactions to provide a statical basis for using the MCAs and MAAs to predict chemical selectivity.

Figure 3 :
Figure 3: Covalent inhibitor reactivity prediction for various acrylamides (top), propargylamides (middle), and 2-chloroacetamides (bottom).The calculated MAAs are obtained at the r 2 SCAN-3c SMD(DMSO)//GFN1-xTB ALPB(DMSO) level of theory and otherwise ωB97XD/cc-pVDZ CPCM(H 2 O).The black regression lines consider all points, whereas the blue regression lines are defined in the text.The red dots are considered outliers in the work of Hermann et al. 13

( 11 )Figure S1 :Figure S2 :Electrophilicity
FigureS1: Flowchart describing the automated QM-based workflow for identifying possible nucleophilic and electrophilic sites and computing methyl cation affinities (MCAs) and methyl anion affinities (MAAs) to quantify nucleophilicity and electrophilicity, respectively.An example of the visual output can be seen in FigureS3.

Figure S5 :
Figure S5: Calculated MAAs and MCAs for the first reaction step in the retrosynthetic route by Manifold (see Figure S4).The MAAs and MCAs are obtained at the r 2 SCAN-3c SMD(DMSO)//GFN1-xTB ALPB(DMSO) level of theory.

Figure S6 :
Figure S6: Calculated MAAs and MCAs for the second reaction step in the retrosynthetic route by Manifold (see Figure S4).The MAAs and MCAs are obtained at the r 2 SCAN-3c SMD(DMSO)//GFN1-xTB ALPB(DMSO) level of theory.

Figure S8 :Electrophilicity
FigureS8: Predicted pKa values for the first reactant in the second reaction step of the retrosynthetic route by Manifold (see FigureS4).The predicted pKa values are obtained using MarvinSketch.

Figure S9 :
Figure S9: Calculated MAAs and MCAs for the second reaction step in the retrosynthetic route by Manifold (see Figure S4).The atom site 1.19 is deprotonated compared to Figure S6.The MAAs and MCAs are obtained at the r 2 SCAN-3c SMD(DMSO)//GFN1-xTB ALPB(DMSO) level of theory.

Figure S10 :
Figure S10: Calculated MAAs and MCAs for the second reaction step in the retrosynthetic route by Manifold (see Figure S4).The atom site 1.23 is deprotonated compared to Figure S6.The MAAs and MCAs are obtained at the r 2 SCAN-3c SMD(DMSO)//GFN1-xTB ALPB(DMSO) level of theory.
35,16and contain 90 nucleophiles and 74 electrophiles.The latter involves the two QM-derived datasets by Tavakoli et al.24containing 1,232 nucleophiles and 1,113 electrophiles with calculated MCAs and MAAs at the PBE0-D3(BJ)/DEF2-TZVP COSMO(∞) level of theory. The w failed for two nucleophiles and one electrophile due to atomic connectivity changes during the geometry optimizations, and these are therefore excluded in the following data analysis.The workflow is also applied to two different tasks.The first task employs experimental data from the work of Caputo et al.34and a retrosynthetic route by Manifold from PostEra35for the synthesis of Raltegravir.The data are used to highlight the applicability of the workflow within computer-aided synthesis planning (CASP) by predicting the selectivity of chemical reactions.The second task involves reactivity data for different covalent inhibitors including 249 acrylamides, 9 propargylamides, and 238 2-chloroacetamides.13Thedataset contains calculated activation energies for the reaction of the warheads with methanethiolate (CH 3 S -) and FMO-derived electrophilicities at the ωB97XD/cc-pVDZ CPCM(H 2 O) level of theory.