Machine learning meets mechanistic modelling for accurate prediction of experimental activation energies

Accurate prediction of chemical reactions in solution is challenging for current state-of-the-art approaches based on transition state modelling with density functional theory. Models based on machine learning have emerged as a promising alternative to address these problems, but these models currently lack the precision to give crucial information on the magnitude of barrier heights, influence of solvents and catalysts and extent of regio- and chemoselectivity. Here, we construct hybrid models which combine the traditional transition state modelling and machine learning to accurately predict reaction barriers. We train a Gaussian Process Regression model to reproduce high-quality experimental kinetic data for the nucleophilic aromatic substitution reaction and use it to predict barriers with a mean absolute error of 0.77 kcal mol−1 for an external test set. The model was further validated on regio- and chemoselectivity prediction on patent reaction data and achieved a competitive top-1 accuracy of 86%, despite not being trained explicitly for this task. Importantly, the model gives error bars for its predictions that can be used for risk assessment by the end user. Hybrid models emerge as the preferred alternative for accurate reaction prediction in the very common low-data situation where only 100–150 rate constants are available for a reaction class. With recent advances in deep learning for quickly predicting barriers and transition state geometries from density functional theory, we envision that hybrid models will soon become a standard alternative to complement current machine learning approaches based on ground-state physical organic descriptors or structural information such as molecular graphs or fingerprints.


Predict-S N Ar workflow
The predict-SNAr workflow takes a reaction SMILES 1 as input, identifies the nucleophile, substrate, product and leaving group and then sets up and performs all calculations of the mechanism and the descriptors ( Figure S1). Temperature and solvent are also used throughout the calculations. The results are stored in a database for easy retrieval. predict-SNAr has been optimized for robustness and includes extensive error checks for the quantum chemical calculations. Below, a detailed description of the different steps will be given.

Input preparation
The literature data from the file "kinetic_data_v4.xlsx" was first pruned to remove entries for which either activation free energy, solvent or temperature was missing. (After the modelling was completed, some additional entries have been added in database, see section 5.) Solvent mixtures were processed to determine the "influential solvent" as the SMD solvent model (vide infra) can only handle single solvents. It is clear that solvent properties are not a simple linear interpolation between the properties of the constituent solvents. Determining which single solvent to substitute for a solvent mixture is somewhat arbitrary, but we used two principles to guide our reasoning: (1) preferential solvation and (2) activity. 2 Preferential solvation means that ions will be preferentially solvated by the solvent to which they have the strongest interactions. More polar solvents should therefore have a larger influence in solvating ionic reactants than expected based on their molar fraction in comparison with less polar solvents. Activity coefficients will be higher for minority solvents, meaning that they will exert a higher "effective" mol fraction than the raw numbers indicate. By combining these two principles, we came up with the following rule of thumb for binary solvent mixtures: if the polar solvent has a mole fraction of at least 0.2, it will be used as the single solvent in the workflow, otherwise the less polar solvent will be used. The input preparation process is documented in the Jupyter notebook "prepare_kinetic.ipynb" in the folder "prepare_kinetic_data". The solvent is parsed with SolventPicker object, where the influential solvent is selected. The work to generate the solvent data is given in the Jupyter notebook "Solvents.ipynb" in the folder "solvents".

Input parsing and structure generation
The input to the predict-SNAr workflow is the reaction SMILES, the reaction temperature in Kelvin and the solvent name. The solvent is converted to SMILES via the Open Parser for Systematic IUPAC nomenclature (OPSIN) tool (2.4.0). 3,4 It is then checked against a list of solvents available in the Gaussian16 program. If an exact match cannot be found, the most similar solvent based on distance in the standardized three-dimensional space of dielectric constant (ε) and hydrogen bonding properties (Abraham's AH and BH) is used. 5 If the hydrogen bonding parameters are not available for the input solvent, the choice is based on just the dielectric constant. The same process is used for the xtb program, which has a much more limited selection of solvents. Note that xtb energies are only used in intermediate steps of the workflow and do not appear in the final output of the model. Therefore it does not matter significantly what solvent is chosen for xtb step, as long as the right structures are found and later optimized with Gaussian16.
The reaction SMILES is parsed first by the ReactionSmilesParser object which identifies substrate, nucleophile, product and leaving group based on minimum common substructure matches (with some additional rules pertaining specifically to the SNAr reaction). If the leaving group is not specified, it is created. Intramolecular reactions are identified. Molecules which do not take part in the bond breaking and bond making events are sorted as agents and placed "above the arrow" in the reaction SMILES (in between the two ">>"). An AgentDetector object parses the agents to find acids and bases. The reactive atoms are also identified.
The Smiles2XYZ object uses the output from the ReactionSmilesProcessor to construct 3D structures of all reactants and products using the RDKit. 6 The conformer generation recipe of Deane and co-workers is used, 7 with the ETKDG algorithm by Landrum and Riniker. 8 Structures are optimized and ranked with the MMFF 9,10 or UFF 11 force fields, depending on atom availability, keeping only the lowest-energy conformer. Smiles2XYZ also constructs a reaction complex with the nucleophile situated at a distance of 6 Å from the substrate to serve as a starting point for further calculations with xtb. Smiles2XYZ further identifies the atoms of the reactive ring and if the nucleophile and transition state would be candidates for implicit/explicit solvation modelling (Section 2.3.5). The criterion for explicit solvation is that the nucleophilic atom should be a negatively charged and of the second or third row of the periodic table. The rationale is that these small anions should be more localized and harder for the implicit solvation models to treat. Smiles2XYZ also detects the following cases which are of interest in the further modelling:  12 Geometry optimizations employed the ωB97X-D functional 13 with the 6-31+G(d) 14,15 basis set as this functional has shown good performance for SNAr reactions. 16 All stationary points were confirmed with frequency calculations. Thermal contributions to the free energies were corrected for low-frequency modes with Grimme's quasi-harmonic scheme 17 using GoodVibes 3.0.1 18,19 at the reaction temperature. Final energies were obtained by combining single-point energies with ωB97X-D/6-311+G(d,p) 20 with the thermal free energies at the ωB97X-D/6-31+G(d) level. Solvent effects were accounted for by the SMD solvation model. 21 GFN2-xTB 22 calculations were performed with the xtb 6.2 software. 23 Solvent effects with xtb were treated with the GBSA solvation model. To better model anions, xtb calculations employed an electronic temperature (2000-7000 K) to simulate the effect of diffuse basis functions (Section 2.3.6). Solvation of anionic nucleophiles are generally treated poorly by continuum models, especially in polar solvents. To correct our solvation energies, we used an automatized approximate version of the cluster-continuum model of Pliego and Riveros (Section 2.3.5). 24,25 . For all structures, we performed conformational sampling with CREST (2.8) 26 and xtb. The resulting conformers were ranked according to their electronic energy with ωB97X-D/6-31+G(d) and only the lowest energy conformer was selected for further optimization.

Transition state calculations
We first checked whether we could locate a stable σ complex to assess whether the reaction was concerted or stepwise. If the σ complex was stable, we scanned each reactive bond separately using xtb, starting from the σ complex, to find the TSs for addition and elimination. If the σ complex was not stable, we performed a concerted scan of the reactive bonds using generalized internal coordinates (GICs) to located the concerted TS, scanning from the reactive complex to the product complex. Single point DFT calculations with ωB97X-D/6-31+G(d) along the scan coordinate identified the approximate location of the TS, which was conformationally sampled with frozen reaction core and subsequently fully optimized with DFT. TS conformations were sampled with atom position constraints on the aromatic core and bond constraints for the reactive bonds. The reactions involving aliphatic alkoxides as leaving groups had unrealistically large activation energies owing to proton-transfer being involved in the rate-determining step. Our workflow treats this proton transfer as going from the nitrogen to the oxygen in a concerted manner, while in reality it would occur by two separate proton transfers involving the solvent. 27 Owing to the inadequacy of our model, we use the barrier from the addition step as rate-determining in the later machine learning, in accordance with the literature. 28 Proper treatment of these types of mechanisms will be the topic of a future study.
Reactant and product complexes were optimized with GFN2-xTB at a C-Nu distance corresponding to the sum of the vdW radii of the two reactive atoms. The distance to the ortho carbons were also frozen at distances determined from the C-Nu and C-ortho C distances together with the Pythagorean theorem. For intermediate optimizations, the xtb optimization and CREST conformational search used frozen C-LG and C-Nu bond lengths which were taken from the GFN2-xTB geometries of the substrate and the product, respectively. For finding transition states, the GFN2-xTB energy profile was analysed with respect to peaks higher than 0.01 kcal/mol, which were identified as candidate transition states. A bond order criterion discarded peaks for which the bond order of any of the reactive bonds changed by less than 0.05 Å. If no peak could be identified, the workflow moved on to the next stage, except in the case of the second step of a stepwise mechanism, where the first geometry on the bond scan was taken as a TS guess. The rationale was that the barrier could be very small and the peak had probably been missed by the bond scan procedure. The peaks were sorted with respect to energy and an attempt to locate the TS from the geometry of the first peak was attempted. If a TS could not be found, the program proceeded to the next peak in the list. In the special case where concerted bond breaking and proton transfer could occur in the second step of a stepwise mechanism, a GIC scan was conducted to find additional guess transition states of this type if no other TS could be found. Transition states were validated by projecting the normalized Cartesian coordinate displacements from Gaussian16 of the TS mode onto the reactive bonds. Any TS candidate with a projected displacement below 0.13 units were rejected. TS structures were optimized by DFT after the initial CREST conformational search as described above. First, the geometry was optimized with the reactive bonds frozen, and then a full TS optimization was conducted.
For molecules including iodine, we used an effective core potential (ECP). The solution was to combine the def2-SVPD and def2-TZVPD basis sets 29,30 and their associated ECPs together with the 6-31+G(d) and 6-311+G(d,p) basis sets for the rest of the atoms, for double and triple zeta basis set calculations, respectively. Generation of the basis set dictionaries is documented in the Jupyter notebook "create_pickle_files.ipynb" in the directory "basis sets".

Use of additional constraints for xtb optimizations
We used some additional constraints for xtb calculations to avoid complications were xtb would favour geometries that would be very high in energy at the DFT level or lead to discontinuities in bond scans. These constraints were lifted for the final DFT optimization and do not impact the final geometries that are used to calculate the reaction barriers and the descriptors.
We found that the combination of ortho nitro groups on the substrate together with amine nucleophiles could cause problems with xtb optimizations using an electronic temperature. During optimizations and conformational sampling of intermediates and transition states, spurious proton transfer could happen from the amine to the nitro group. Therefore, we froze the N-H bonds for this type of calculations to prevent the spurious transfer.
Azide nucleophiles tended to bend at the GFN2-xTB level in intermediates and transition states, which could cause problems with discontinuities in the DFT single point calculations along the GFN2-xTB bond scans. We therefore imposed two constraints for these cases ( Figure S2  For bond scans, we also implemented an additional set of constraints. For scans, for the first step of a step-wise mechanism, we implemented constraints for negatively charged oxygen nucleophiles which were connected to saturated atoms. Due to lack of diffuse functions, the O-C bond is artificially short with GFN2-xTB due to negative hyperconjugation effects to relieve the instability of the negative charge on the oxygen atom. We therefore froze the C-O distance during the scan to the value calculated with DFT for the isolated nucleophile.
For all the GIC bond scans, we also froze the difference of the distances between the nucleophilic atom and the ortho carbons of the substrate to ensure a symmetric approach of the nucleophile. For the bond scans from the intermediate with xtb, we instead scanned the Nu-ortho C distances together with the C-Nu distance.

Calculation monitor
The calculation monitor handles common errors and issues that occur during electronic structure calculations and geometry optimizations.

Displacement of negative frequencies for geometry optimizations
If a minimum structure is optimized and the frequency calculation shows one or more negative frequency, the structure was displaced along the corresponding normal mode and the optimization restarted. For the workflow, one such preoptimization was done.

Dissociation check
The calculation monitor checked for dissociation of bonds in the optimization of the intermediate. Dissociation was judged to have occurred if the bond orders of either the C-Nu or C-Lg bond decrease below 0.3. The calculation was then aborted and the workflow proceeded to calculate a concerted mechanism.
Adaptive keyword changes in response to SCF errors and coordinate system errors Common convergence errors are captured and appropriate Gaussian keywords are inserted to try to remedy the situation. One example is the error "Convergence failure --run terminated" for which the keyword "scf=xqc" is used. Coordinate system errors are also addressed, e.g., "RedCar failed" which is addressed with a series of Gaussian IOps: "1/59=10", "1/59=14", "1/59=4", "1/59=40" and "1/59=44". Also more obscure and non-reproducible errors such as "Internal input file was deleted!" are handled by simply restarting the calculation.

Explicit solvation
Explicit solvation is handled via the cluster-continuum model of Rivero and Pliegos. 25 We targeted the energy that is missed by the implicit solvation model, which can be corrected by using explicit solvent molecules. We denote the free energy in solvent according the full cluster-continuum model with n explicit solvent molecules as ΔGsolv,n(solute), while that using the implicit model is ΔGsolv(solute). The correction to the implicit model is then given by where ΔGsolv(cluster) is the free energy of the cluster, ΔGsolv(solute) the free energy of the solute and ΔGsolv(solvent) is the free energy of the solvent (with the appropriate standard state), using the implicit solvation model. The number of solvent molecules, n, is determined by a variational principle, where the n which gives the largest correction is the most appropriate. This variational principle stems from two competing factors. The first factor is the stabilizing effect due to specific interactions between the explicit solvent molecules and the solute, which are not captured fully by the implicit solvation model. The second factor is the destabilizing effect of bringing solvent molecules from the bulk solvent to form the cluster, which corresponds to an entropic loss.
We have implemented an approximate version of the cluster-continuum model that uses a combination of GFN2-xTB and DFT to select the optimal number of solvent molecules ( Figure S3). The final correction term is then calculated with full DFT. Figure S3. Sketch of automated cluster generation process.
The first step of the procedure is to generate a cluster geometry. We start by placing solvent molecules around the solute at a distance of 4 Å at the geometries given in Table S1. The cluster is the optimized with a small force constant (0.0005) pulling the solvent molecule towards the closest atom in the molecule, to form a crude guess for the cluster geometry. The pulling step is followed by two rounds of CREST conformational searches in the non-covalent interactions (NCI) mode. The conformers generated in the second run are ranked with DFT, and cluster free energy is constructed from the DFT single-point energy together with the free-energy contributions from GFN2-xTB. The energy of the solvent and solute are calculated in the corresponding manner, and the solvent correction term is calculated from Equation 1. Solvent molecules are added iteratively until the solvent correction terms decreases in magnitude according to the variational principle. A cut off of 2.0 kcal/mol is used to decide whether to go ahead with the final DFT cluster optimization. Special treatment is done for hydroxide in water, where GFN2-xTB optimization gives a delocalized structure where the proton is shared equally between two O atoms (  The xtb program uses a standard electronic temperature of 300 K. We found that a higher electronic temperature could serve as a substitute for diffuse functions for the GFN2-xTB method, which lacks them by default. The rationale is that electrons of the approaching anionic nucleophile could partly delocalize into the π* orbitals of the aromatic ring. However, if the delocalization becomes too strong, complete charge transfer is the result. It is therefore necessary to regulate the electronic temperature carefully to keep in the window where the lack of diffuse functions is remedied, while the problems of charge transfer don't become too apparent. We found that a good starting guess could be based on the energy gap between the HOMO energy of the nucleophile and the LUMO energy of the substrate. The following rules were used to set the initial electronic temperature, which could later be changed for some steps.

HOMO-LUMO gap (eV) Electronic temperature (K)
After the initial setting based on the separated reactants, the following rules were used, which reflect that the HOMO-LUMO gap of the reaction complex is different than for the separated reactants. Bond scans and conformational scans are examples were the workflow will recheck the HOMO-LUMO gap and adjust the electronic temperature. For bond scans with neutral nucleophiles, the following ranges are used.

Feature generation
The solvent-accessible surface area and the Pint dispersion descriptor were calculated with an in-house development version of the morfeus software. 31 The SASA calculation used the algorithm by Shrake and Rupley 32 with the vdW atomic radii taken from the CRC Handbook. 33 Pint was calculated on a surface constructed from the atomic radii of Rahm and co-workers 34 and with the D3 dispersion coefficients. 35 Electronic structure features were calculated based on B3LYP/6-31+G(d) calculations with the SMD solvation model. The local electron attachment energy 36 and the average local ionization energy 37 as well as the surface electrostatic potential were calculated with the HS95 program (version 190510). 38 The Is,min and Es,min values were taken as the lowest values on the atomic surface, regardless if there was a stationary point associated with the atom in HS95. The five solvent PCA components compiled by Diorazio et al. were used as solvent features. 5 DDEC6 charges and bond orders were calculated with Chargemol (3.5) 39,40 The electrostatic potential at the nuclei (VN) was calculated with Gaussian 16 using the keyword prop=potential. The global nucleophilicity parameter N was calculated as the negative of the ionization potential of the substrate in solution. 41 The global electrophilicity parameter ω was calculated as where η is the chemical hardness given by Here, IP and EA are the vertical ionization potential and electron affinity, respectively, and μ is the chemical potential, 42 given by The local electrophilicity descriptor was calculated as where f is the quadratic Fukui function, and f 2 is the dual descriptor. 43 The local nucleophilicity descriptor was calculated as where f − is the Fukui function for electrophilic attack. The quadratic Fukui function was calculated as where q − is the charge of the atom in the anion, and q + is the charge of the atom in the cation. The dual descriptor was calculated as where f + is the Fukui function for nucleophilic attack. The Fukui functions for nucleophilic and electrophilic attack were in turn calculated as and where q is the charge of the neutral molecule. We used atomic Hirshfeld charges 44 that were calculated with Gaussian 16 using the population=hirshfeld keyword.
Morgan fingerprints were calculated with the RDKit (2020.03.1.0) 6 as count vectors and length 1024 bits. Reaction difference fingerprints were obtained by subtracting the summed fingerprints of the reactants from the summed fingerprints of the products. Using bits instead of counts, or using a length of 512 or 2048 instead 1024 did not change the results significantly (Table S5).

Running the workflow
The workflow is designed as a Python package predict_snar, version 0.1.0, that can be installed with "pip install .". The Python package requirements are specified in the "setup.py" file and listed in Table  S2. The workflow is designed for running on a Linux cluster. Required software is listed in Table S3. The workflow has been tested with the versions of software noted in Section 2.3. A configuration file called ".ps_config" must be placed in the user home folder with desired default options and directories to software. Java must also be available in the environment to run the packaged OPSIN tool.
[ A config file must be created in the directory before running the workflow. This is done by running the script ps_create_config. For instructions on its use, run ps_create_config --help.
The workflow is invoked with the command python -m predict_snar -p <n_cpus> -m <mem in GB> '<reaction SMILES>' where the argument "-p" specifies the number of CPUs to use and "-m" the amount of available memory in GB. An example job script is included in the supporting information with the article. Two example outputs from the workflow are also given.

Machine learning
The machine learning is reproduced in the Jupyter notebooks listed in Table S4.

Model selection
For the model building, we used the Python machine learning package scikit-learn (0.22.1). 53 Data was handled by the pandas (1.0.3) data analysis and manipulation package. 54 Plots were generated with matplotlib (3.1.0) 55 and seaborn (0.9.0). 56 We first processed the dataset in accordance with the description in Section 5.1 and put the reactions through the workflow(Section 2). We then split the dataset randomly into a training set (80%) and a test set (20%) and proceeded with model selection on the training set. As the first step in the model selection, we choose between different variations of the DFT activation energies, seeing that the best performer was with full cluster-continuum solvent treatment of both nucleophile and TS, using the Grimme quasi-harmonic correction for the free energies ( Figure S5). We investigated correlation between the features using the Pearson 57 correlation matrix and the variance inflation factors. 58 We confirmed the initial feasibility of modelling using linear regression and inspecting the normal probability plot (Q-Q plot) of the residuals, with reassuring result. We used the bias-corrected bootstrap cross-validation (BBC-CV) method for model selection, with the goal to avoid doing nested cross-validation. BBC-CV is a more economical alternative to nested crossvalidation that allows unbiased comparison between different families of algorithms. 59 It can also estimate the selection bias for choosing the best estimator based on the cross-validation scores. Empirical studies have found this estimate to give slightly worse scores as compared to the final evaluation on an external test set. 60 For methods where hyperparameter tuning was done via grid search, all resulting models were grouped within a family (e.g., random forest, SVR etc.), and BBC-CV was used to estimate the bias due to overfitting on the cross-validation procedure within this family. 61 This bias was then subtracted from the score of the best model in the family before the scores of each family were compared. Hyperparameter tuning was done using grid search, with the GridSearchCV object in scikitlearn, picking the model with the best performance within each family of models. All models were evaluated on the same 10 cross-validation folds, which were generated at the beginning of the BBC-CV procedure. To create interaction features and polynomial features for linear models, we used the PolynomialFeatures object in scikit-learn with order to second order. Standardization was applied to numeric features, and excluded categorical or count features such as fingerprints.
The full results showed that the GPRM3/2 was the strongest contender on the MAE score and we therefore chose it as our final model to be evaluated on the external test set (Table S5). A selection of methods , used in Figure 6 of the manuscript, are given on the same scale in Figure S6. All models from Figure 6 in the manuscript on the same scale..
Gaussian Process Regression (GPR). 65 We scaled both the features and the target using the TransformedTargetRegressor in scikit-learn. We tried the radial basis function (RBF), rational quadratic (RQ) and Matern 3/2 and 5/2 (M 3/2, M 5/2) kernels. The final composite kernel was created by multiplying with a constant kernel and adding a white kernel to model the noise. An introduction to Gaussian Process Regression in the context of chemistry has been given by Segall and co-workers. 66 Bayesian ridge regression (BRR). 67 This Bayesian version of ridge regression tunes the regularization parameter automatically from the data.
Automatic Relevance Determination (ARD). 68 ARD produces a more sparse solution than BRR as is suitable when a large number of features are used.
Partial Least Squares (PLS). 69 A linear method that can handle multicollinearity and a large number of features compared to samples. We used a grid search between 1 and 15 to select the number of components.
Dummy regression models. We used the mean and the median of the training set for future prediction as reasonable baseline models. These models were implemented with DummyRegressor in scikit-learn.

Dataset visualization
The full feature set Xfull was decomposed with Principal components analysis (PCA) 71 and the two components with highest explained variance were used for the visualization. For the PLS + UMAP plots, we first used PLS for supervised dimensionality reduction. The number of components were chosen based on the R 2 score, using the one-standard error rule to select the model with fewest dimension with a score within one standard error of the best-scoring model overall. This approach led to a model with 5 components. The Xfull feature space was transformed into this five-dimensional latent space of the PLS model, and further reduced to two dimension with the Uniform Manifold Approximation and Projection (UMAP) method, 72 using the UMAP python package, 73 with standard settings, except the keyword min_dist=0.3, and n_neighbors=10. UMAP is a non-linear dimensionality reduction technique that is frequently used for visualization of high-dimensional data. A reaction map ( Figure S7) was generate with the tmap package (1.0.4) 74,75 and visualized with faerun (0.3.10). 76,77 The Jupyter notebook "reaction_map.ipynb" in the folder "reaction_map" gives the details of how this map was generated. An interactive version of the reaction map is given in the file "index.html".

Feature importances
Before assessing the feature importances, we checked the variance inflation factors ( Figure S8- Figure  S10) and Pearson correlation matrices ( Figure S11- Figure S13) to assess the extent of collinearity among the features in Xfull and XnoTS and Xsmall. Variance influence factors are a measure of collinearity for each feature, with a value of 1 indicating no collinearity and values above 5-10 considered problematic. 58 It is clear that Xfull and XnoTS have large problems with collinearity, especially in comparison with Xsmall. We therefore decided to cluster the features with SciPy (1.2.1) 78 based on the Spearman rank correlation, 79 using the Ward linkage and the distance criterion (following a recipe from the scikit-learn documentation). The threshold for clustering was selected manually to reduce the number of highly correlated features while retaining a clear chemical interpretation of each cluster ( Figure S14- Figure  S16). We computed the permutation importances version of feature importances with the GPRM3/2 model together with the permutation_importances function from scikit-learn. We used 10 repeats, and the MAE score and the stability of the feature ranking was assessed by running 10 bootstrap samples with BootstrapOutOfBag in mlxtend. 80 Figure S8. VIFs for Xfull. A value above 10 is indicative of high collinearity with other features. Note log scale.

Learning curves
Learning curves were calculated with the learning_curve function in scikit-learn with splits of 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90% and 100% of the data using 10-fold cross-validation. To assess how much data is needed to accurately predict the DFT activation energy, we used XnoTS as the feature set.

Analysis of reactions with large errors
Reactions with absolute residuals larger than 2.0 kcal/mol for the training and test sets with the GPRM3/2 model are given in Table S6 and Table S7, respectively.

Dependence on number of heavy atoms
We checked the dependence of the model error on the number of heavy atoms to see if more complex molecules would display larger error, but could see no such trend ( Figure S17).

Y-randomization
We further tested for overfitting by y-randomization, in which the link between the feature vectors in X and the experimental activation free energies in the response vector y is broken by randomizing the y vector (Table S8). 81 The R 2 for GPRM1.5 is lowered from 0.86 to −0.05, indicating complete lack of learning when y is randomized. This is strong evidence that the models are not just learning noise but are finding a real signal in the feature data. We performed the y-randomization using the permutation_test_score function in scikit-learn with 10 permutations and 10-fold cross-validation for each permutation, as recommended in the literature. 82 Table S8. Results of y-randomization with GPRM1.5 on Xfull for the training set.

Regio-and chemoselectivity validation
We used a reaction dataset from the patent literature, collected by Landrum and co-workers. 83 The dataset ("dataSetB.csv") comprised 50,000 reactions that were first classified with the NameRxn (3.1.9) 84 tool and then filtered according to reaction classes that include SNAr reactions according the SNAr "concept" provided by NextMove (Table S9). Reactions with transition metals were excluded as they would likely go via the Buchwald-Hartwig reaction. The dataset was further filtered manually to remove reactions which apparently did not go via the SNAr mechanism (e.g., SN2 reactions). Reactions including boronic acids or phosphine ligands (presumed to occur via Buchwald-Hartwig) were also removed. Furthermore, some reactions also included the product among the starting material and were removed. Chan-Lam arylamine coupling 1. 3.6 Bromo N-arylation 1. 3.7 Chloro N-arylation 1. 3.8 Fluoro N-arylation 1. 3.9 Iodo N-arylation 1. 3.10 Triflyloxy N-arylation 1.3.11 Chichibabin amination 1. 3.12 Mesyl N-arylation 1. 3.13 Mesyloxy N-arylation 1.3.14 Tosyloxy N-arylation 1. 7 The pruned set of reactions comprised 4353 SNAr reactions, of which 1209 had alternative reactive sites with halogens on the reactive ring, leading to questions relating to regio-and chemoselectivity ( Figure  S18). These reactions were sorted according to the molecular weight of the product, and all possible reactions relating to regio-and chemoselectivity of the reactive ring were generated for the 100 reactions with lowest product molecular weight. Sorting on molecular weight was done to allow a quickly calculated validation test. Figure S18. Two examples of reactions with regio-and chemoselectivity questions.
The protonation state of the nucleophile was not included in the patent data, and needed to be set. We used the following set of rules: For a negatively charged nucleophile, the leaving group was also considered to be negatively charged.
As reaction solvent and temperature was not available in the dataset, we used a set of standard reaction conditions: acetonitrile for neutral reactions and methanol for ionic reactions and a reaction temperature of 298 K. This is expected to introduce some noise in the predictions. Solvents are sometimes included in the reaction SMILES in the dataset, but it is not always clear what is the reaction solvent and what was used for, e.g., work-up.
For analysis of the results, we constructed the reaction feature matrix Xfull for the validation reactions and used the previously trained GPRM3/2 model for prediction. The isomer corresponding to the reaction with the lowest activation energy was considered as the predicted major product. For comparison, we also used the DFT activation energies.
The validation work is documented in the notebooks listed in Table S10. The validation database is given in the folder "validation_2020_07_04". Table S10. Jupyter notebooks used for validation work under the folder "validation_dataset".
5 Experimental dataset 5.1 Description of the modelling data An extensive literature search was performed to identify intermolecular nucleophilic aromatic substitution reactions with associated second-order kinetic data with a focus on reactions where the initial nucleophilic attack is rate-limiting. The reaction dataset was extracted from 37 references spanning the years 1950-2019. Kinetic data is not curated by the standard reaction databases, and thus manual extraction and curation was required. In total 518 data points were extracted (the full dataset), and 446 where used in the model building (the modelling dataset). The reason for this smaller number is that 48 of the data points were found after the modelling was finished, 15 had missing data which prevented them from being used, 6 failed in the transition state calculations, two were carried out in a solvent (tetramethylsilane) for which we did not have solvent features, and one was removed due to a mismatch in the reaction SMILES during the modelling process. The whole data set (518 points) covers 88 different nucleophiles, 121 different substrates and 41 different solvents. The experimental temperature range is 273-468 K and log k, the base-10 logarithm of the second-order rate constant, spans −15.9 to −3.59. Rate constants were converted to activation energies at the temperature in question using the Eyring equation where R is the gas constant, T is the reaction temperature, k is the second-order rate constant, h is the Planck constant, kb is the Boltzmann constant and using a standard state of 1 M. Of the reactions reported, 284 have one set of reaction conditions and 70 reactions have more than one set of reaction conditions reported. The partial dataset (446 reactions) used for modelling is described in the main article. A record of this database is given in the files "SNAR_reaction_dataset_SI.csv" and "SNAR_reaction_dataset_SI.xlsx". Each dataset record includes the following information: canonicalized SMILES of the reaction, reactants and products, second-order rate constants, activation free energy, temperature, solvent, literature source information and flag on its use in the model building process. A full listing of the column names and a corresponding description is given in Table S11. "not modelled": Reactions that were extracted and added to the data set after model building was completed. "failed": Reactions that failed the predict-SNAr workflow due to TS finding problems (not finding or finding wrong TS) leading to no barrier, very low barrier, or very high barrier. "removed": Reactions that were removed for other reasons. "missing data": Reactions that lack activation energies, temperature or solvent.

Distribution of activation free energies
The distribution of activation free energies in the modelling dataset is given in Figure S19. Figure S19. Distribution of activation free energies in the modelling dataset.

Most common substrates and nucleophiles
The most common substrates and nucleophiles in the modelling dataset are given in Figure S20 and Figure S21.

Replicated reactions
The reactions replicated in the full dataset are given in Table S12, of which the one carried out in TMS is not included in the modelling dataset as solvent features are lacking for the TMS solvent.

Conditions
Conditions are here defined in terms of both solvent and temperature. Figure S22 and Table S13 give the number of reactions with more than one condition in the modelled dataset (446 reactions).

Reaction temperatures
The distribution of reactions at different reaction temperatures for the modelling dataset are given in Figure S23 and Table S14. Correlation analysis confirmed that exclusion of reaction temperature was warranted (R: 0.64, ρ: 0.54 τ: 0.43) as it could possibly artificially inflate the performance of the model ( Figure S24). Reactions with higher barriers are often run at higher temperatures to be measured in reasonable time.   Figure S24. Experimental activation energy as a function of reaction temperature.

Solvent distribution
The distribution of reaction solvents depending on nucleophile charge and atom type are given in Figure  S25. The solvents are given as their SMILES strings. While reactions with neutral nitrogen nucleophiles are carried out in a variety of different solvents, anionic oxygen nucleophiles are only used in water or methanol. Figure S25. Distribution of reaction solvent depending on nucleophile charge and atom type.

Workflow database
The database holding the calculations from the predict-SNAr workflow is a Python shelve database which operates as a dictionary with keys and values. Each key is a unique identifier, holding a dictionary of the calculation results which are described in Table S15. The database is housed in the folder "machine_learning/2019-12-15" under the name "db". It can be loaded with the following Python code: import shelve d = shelve.open("db") The workflow dataset includes a number of duplicate reactions that need to be removed before modelling (see the notebooks referenced in Section 2). The keys used in the sub-dictionaries are described in Table S16. Table S16. Keys for sub-dictionaries in the workflow dataset.

Key
Description "substrate" Substrate "nucleophile" Nucleophile "product" Product "leaving_group" Leaving group or leaving atom "ts" Transition state "agent" Agent. Not implemented. "reaction" Related to the reaction, e.g., reaction SMILES "reaction_orig" Related to the reaction as originally input, e.g., reaction SMILES "solvent" Solvent The keys for the descriptor sub-dictionary are given in Table S17.