Data enhanced Hammett-equation: reaction barriers in chemical space

It is intriguing how the Hammett equation enables control of chemical reactivity throughout chemical space by separating the effect of substituents from chemical process variables, such as reaction mechanism, solvent, or temperature. We generalize Hammett's original approach to predict potential energies of activation in non aromatic molecular scaffolds with multiple substituents. We use global regression to optimize Hammett parameters ρ and σ in two experimental datasets (rate constants for benzylbromides reacting with thiols and ammonium salt decomposition), as well as in a synthetic dataset consisting of computational activation energies of ∼2400 SN2 reactions, with various nucleophiles and leaving groups (–H, –F, –Cl, –Br) and functional groups (–H, –NO2, –CN, –NH3, –CH3). Individual substituents contribute additively to molecular σ with a unique regression term, which quantifies the inductive effect. The position dependence of substituents can be modeled by a distance decaying factor for SN2. Use of the Hammett equation as a base-line model for Δ-machine learning models of the activation energy in chemical space results in substantially improved learning curves reaching low prediction errors for small training sets.


The original Hammett procedure
The Hammett equation was originally intended only for reactions occurring on simple aromatic molecules that have only one substituent on the ring. However, the equation itself contains no assumptions on the structure of the molecule. Due to its linear nature, this equation can be applied to any data set and property P where: (i) the ordering of the substituents with respect to P is mostly stable across all reactions, (ii) the set of values for the property P correlates linearly for any two reactions. The first condition is necessary to have one unique set of substituent constant for every reaction, the second allows to calculate P using only a single multiplicative factor ρ.

Hammett revisited
The equilibrium constant can be expressed as a function of the free energy difference between product and reactant. The transition state theory extends this formulation to the kinetic constant by assuming a quasi-chemical equilibrium between transition state and reactant, thus using the free energy difference between these two. Both constants can be expressed as: Thanks to eq 1, we can replace the log K in the Hammett equation with a free energy difference ∆G or a potential energy difference, since it meets the conditions imposed by the Hammett equation presented above. The logarithm of the kinetic constant can be replace by the activation energy E a , giving: where r is one of the N R reactions, s one the N S set of substituents and E 0 is the activation energy for the unsubstituted molecule. In the following, we describe our approach formally. Moreover, the exact implementation used in this work is freely available [1].
We first evaluate the set of reaction constants {ρ}. If we compare the activation energies of any two different reactions r i and r j which share common set of substituents, we obtain the following system.
Dividing the first equation by the second one gives: The ratio between the two reaction constants ρ(r i ) and ρ(r j ) can be obtained via a linear regression of energies, as it is given by the linear slope m of such regression. We made use of a robust regressor [2] to minimize the impact of strong outliers on the final values. This gives a set of N 2 R − N R equation of the following form: These equations can be combined in the linear system: Where M is coefficient matrix with the values for m obtained via equation 5, ρ is the column vector of the N R reaction constants and 0 is a vector of zeroes. Solving equation 6 for ρ will give {ρ}. For numerical reasons, it might be necessary to initially fix one arbitrary reaction constant to 1 to avoid trivial solutions. This is the only source of bias in the procedure, its effects are discussed below.
Once the {ρ} is defined, the substituent constants {σ} are obtained by averaging the ratio between activation energy and reaction constant across all reactions: We treated each E 0 (r) as a model parameter and set it to the median of all the activation energies available for the reaction r. This is done in order to reduce the dependence of the model on only N R calculations.
This procedure gives a set of substituent constants {σ} that is much less sensible to reference reaction and the presence of outliers. These {σ} can then be used to improve the reaction constants {ρ} such that they give the best evaluation of activation energies via equation 2. The new values are obtained with a linear regression between {σ} and {E a }. For a specific reaction r j , the reaction constant ρ(r j ) is given by: Where the sum runs over all the possible substituents N s . This procedure can be used to improve the values for {E 0 } at the same time.

Decomposition of σ
The substituent constants obtained from eq 7 are molecular properties, which describe the effect of the entire set s of substituents. By denoting each substituted position on the molecule by the index p and each substituent group (e.g. NO 2 ) by the index g, we highlight the dependency of each σ as σ(s) = σ({g p }), where by Electronic Supplementary Material (ESI) for Chemical Science. This journal is © The Royal Society of Chemistry 2020 g p we indicate the group g to be in position p. If N P is the total number of positions p on the molecule, and N G the total number of substituent groups g, the maximum number of set s is N N P G . However, each molecular σ depends only on N P terms at most. The overall σ(s) can be expressed as a linear combination of these N P terms: Theσ(g p ) are the single-substituent sigmas. They are independent from one another and can be determined via categorical regression using a dummy encoding. In this, fingerprint-like representation, each molecule in the data set is described by a vector of N P N G values, representing all the possible combinations of position and group. All the elements are zeros, except for the ones corresponding to the group-position pairs present in the molecule. These vectors are then stacked into a matrix A which is then used to solve the linear system This type of decomposition reduces the number of parameters needed to describe the substituents from N N P G to N G N P and allows to predict values of σ(s) for set of substituents for which no data is available. However, theseσ(g p ) still depend on both the position and the group, meaning that the same group will have a different value depending on its position on the molecules. While this is chemically sound, it limits the transferability of the model. To separate the effect of the group g from the one of the position p, we replace the dependence on the latter by distance decaying function that scales the single-substituent effect. This is why the energy difference is modelled after the electronic density. Here an exponential decaying push/pull effect is given by electron withdrawing and electron donating group, respectively. This can be modelled by the following functions: where α(g) is a parameter which depends only on the group g, regardless of its position on the molecule, d p is the distance between the position p and the reacting centre on the molecule, and τ is a parameter of the model which regulates the distance decay of the inductive effect. α(g) is determined by a linear regression while the optimal τ can be found by a scan. This approach further cuts down the number of parameters required by the model to describe the substituents from N P N G to N G + 1. It requires geometrical information on the backbone of the molecule, which is easily obtainable.
Eq 9, 11 and 12 all neglect the interactions between different group-position pairs. These could be modelled by three body terms such as the Axilrod-Teller-Muto potential form [3]: In this case, V is is not a potential, but it keeps the same functional form and includes distances and angles between any two group-position and the reacting centre. This can be used to describe the residuals of the previous fit by including many-body effects. The added flexibility comes at the cost of (g 2 + g)/2 additional parameters, one for each possible substituent pair.

Dependence on the reference reaction
As discussed above, it necessary to initially set on of the reaction constants ρ to 1, in order to avoid trivial solutions. This is the only source of bias in our model and its effect is observed to be limited. For the experimental data sets described in section III.1, the effect of the reference's choice is shown in figure 1. For the computational S N 2 data, we show the influence of the reference choice in figure 1.
LG-Nu The two panels show the Mean Absolute Error (MAE) of the prediction of activation energies. For the top panel, the substituent constants are obtained from eq 7; we named this method σ-Hammett. In the bottom panel, the substituent constants are obtained from the sum of individual contributions with a power-law distance decay, as calculated from eq 12; we named this method α-Hammett.
Each gray line corresponds to a different choice for the reference reaction, out of the 12 listed on the x-axis, and shows how the MAE changes across the reaction space. In each panel, we highlighted in blue the one that gives the best overall prediction. The red circles show the total error, i.e. across all the 12 reactions, for each reference indicated indicated on the x-axis. These results are compared to the accuracy of the MP2 method, shown by the dashed line.
These plots show how the overall prediction, given by the red circles is only partially affected by the reference bias, especially for the α-Hammett model. Additionally, the gray lines are all very close to each other, meaning that even the description on smaller subset of the data remains mostly consistent regardless of the reference reaction chosen.
The α-Hammett model gives a worse prediction, by about 0.75 kcal/mol on average, but it almost completely negates the effect of the reference bias.

Machine Learning
The activation energies can also be obtained from Machine Learning. In this work we use Kernel-Ridge Regression, for which the property of interest y of a molecule X can be predicted as: where i runs over all the molecules in the training set, α i are regression coefficients and k(X, X i ) is a kernel function. In this work, we used a Laplacian kernel, where each element j, i is given by: where A j and B j are representation vectors and w is the kernel width. The regression coefficients α i can be calculated as: where λ > 0 is a hyperparameter used as a regularizer and K and I are the kernel matrix and identity matrix respectively. As representation X we used the one-hot encoding described in sec. I 3. In this case, the string describes not only the set of substituents, but also the reaction being considered, and for this reason it contains R extra characters, one for each reaction in the data set. This type of machine learning algorithm is used to either predict directly the activation energies or to learn the residuals of the Hammett regression using Delta Machine Learning [4]. The latter works on the assumption that learning the target property from a smoother surface is easier, and thus requires fewer training points to reach high accuracy.