Pierre L.
Bhoorasingh
and
Richard H.
West
*
Northeastern University, 360 Huntington Ave, Boston, MA 02115, USA. E-mail: r.west@neu.edu; Tel: +1 617 373 5163
First published on 28th September 2015
Detailed kinetic models to aid the understanding of complex chemical systems require many thousands of reaction rate coefficients, most of which are estimated, some quite approximately and with unknown uncertainties. This motivates the development of high-throughput methods to determine rate coefficients via transition state theory calculations, which requires the automatic prediction of transition state (TS) geometries. We demonstrate a novel approach to predict TS geometries using a group-additive method. Distances between reactive atoms at the TS are estimated using molecular group values, with the 3D geometry of the TS being constructed by distance geometry. The estimate is then optimized using electronic structure theory and validated using intrinsic reaction coordinate calculations, completing the fully automatic algorithm to locate TS geometries. The methods were tested using a diisopropyl ketone combustion model containing 1393 hydrogen abstraction reactions, of which transition states were found for 907 over two iterations of the algorithm. With sufficient training data, molecular group contributions were shown to successfully predict the reaction center distances of transition states with root-mean-squared errors of only 0.04 Å.
Parameter estimation methods are computationally efficient strategies to provide thermodynamic and kinetic values.4 Most parameter estimation methods are based on Benson's group additivity,5 in which the thermodynamics of a molecule are estimated by summing the contributions from the molecular groups present in the molecule, these group values having first been calculated from molecules with known thermodynamic parameters.6,7 Such group contribution methods have been shown to work well for thermochemistry of hydrocarbon species, and the concept has been extended to kinetic parameter estimation.8–11 Group contribution methods become less accurate when parameters are estimated using groups values that have not been well determined, due to insufficient training data. For example, group values have been difficult to extend to thermodynamics of fused rings leading to inaccuracies in their estimates.12
Such inaccuracies in group-based estimation methods have motivated high-throughput electronic structure calculations for thermodynamics and kinetics.13,14 Such a procedure was recently developed to calculate thermodynamic parameters within the framework of the automatic Reaction Mechanism Generator (RMG).12,15,16 In that procedure, 3-dimensional structures were created via distance geometry,17 with the structures optimized using force-fields and semi-empirical electronic structure calculations to provide molecular parameters, allowing thermodynamic parameters to be calculated. Thermodynamic error was greatly reduced for fused-ring species compared to estimates derived from Benson's group additivity.
In a similar manner, kinetic parameters currently estimated from poorly trained group values could be improved by applying electronic structure calculations and transition state theory, but this requires a high-throughput approach for finding transition state geometries. A transition state geometry estimate, which is typically provided manually, must be quite similar to the correct transition state geometry for the optimization to converge. Manual estimation of transition states is not compatible with the context of automated mechanism generation, which requires thousands or even millions of reaction rates. With continuing advances in computing power, it has become feasible to automate these searches.
One approach used the growing string double-ended method18 to search for possible transition states.19 While there is an increased computational cost associated with the transition state search, the use of the string method negates the need for a path analysis step to validate the transition state. This method has been extended to the construction of detailed mechanisms, where the user controls the mechanism generation with restrictions such as barrier height limits.20 Adoption of this method is limited to those with access to software with reliable double-ended methods. Zimmerman has further developed these methods to create a single-ended transition state search.21 This makes use of driving coordinates from reactants to find intermediates, from which the transition state can be found using the growing string method.
Zádor and Najm instead use a rule-based approach to direct atoms from a reactant configuration towards the product, using energy calculations at each step to determine the location of the transition state.22 This method is best suited to reaction systems with a small number of atoms, such as the exploration of a pressure dependent reaction network.
The AARON code automates transition state searches to screen potential organocatalysts.23 A catalyst structure is provided by the user and mapped onto a parent catalyst structure for which the transition state geometry is already known, then a series of partially constrained semiempirical and DFT optimizations allow the new transition state to be found.
Maeda and Morokuma used an artificial force to push reacting molecules together, to probe the potential energy surface around atoms, predicting reactions and finding their transition states.24,25 This artificial force induced reaction method requires many random starting orientations.
The methods highlighted above explore the potential energy surface for a given set of atoms, finding many reaction pathways for a few reactants. These are not well suited to automated mechanism generation where it is routine to have many reactions of the same type but with varying reactants. For such applications, we propose an alternative method to estimate transition state geometries using molecular group contributions. The group contributions are used to predict inter-atomic distances in the reaction center of the transition states.
Estimated 3D geometries are constructed from the predicted distances using distance geometry. Optimization and validation of the transition state estimates have also been automated. Hydrogen abstraction reactions from a diisopropyl ketone combustion model,26 previously developed using RMG, were used to test the method, with transition states found for over 65% of the 1393 reactions.
Distances separating reactive atoms undergo significant change during a reaction, but the rest of the molecule remains relatively unaffected. As a result, distances between the reactive atoms are unknown at the transition state, but existing methods can be used to determine the remaining distances.
For hydrogen abstraction reactions, three atoms lie in the reaction center: the abstracted hydrogen (H), the atom bonded to the abstracted hydrogen (X), and the radical abstracting the hydrogen (Y). The three distances separating each reactive atom pair are denoted as dXH, dHY, and dXY. Estimating these distances allows the entire transition state geometry to be created using distance geometry. Typically the geometry is specified manually, but we demonstrate here a group contribution method to estimate the required reaction center distances.
The structure of the molecular group tree was first taken from the kinetics database of the RMG software. This tree structure was developed to make efficient use of sparse data for estimating kinetic parameters relevant to hydrocarbon combustion. The development of this tree involved several researchers making independent modifications over a number of years to provide improved kinetic estimates for specific fuels. Sometimes modifications were made with the aim of minimizing disruption of the existing tree, rather than of optimizing the overall tree structure. The uncoordinated nature of the modifications has led to a tree structure that is hierarchical, but lacks obvious logic in its structure, and was certainly not optimized for transition state distances. For example, the O_H group has descendants that are peroxides except for the peroxyradical group (Orad_O_H), which is instead a sibling group.
A new tree structure was also developed for comparison to the RMG designed structure. The new design was built to understand the effect of the tree structure. The same starting head nodes were used for the new tree as they described all possible reacting molecules for the hydrogen abstraction family. Care was taken to ensure subsequent generations had a single characteristic defined across all sibling nodes, and that characteristics thought to be more important were defined earlier (higher in the tree). For example, the children of the head nodes specified the elements of the wildcard atoms (X and Y), but no bonding or radical electrons were specified because, while important, they are less critical than the wildcard atom types. This meant that child nodes to X_H_or_Xanyrad_H were H2, C_H, and O_H (the X is defined as H, C, or O), while the children of Y_anyrad were Hrad, Orad, and Crad. The following two generations defined the radicals and bonding. For the X branch of the tree the bonding was defined first, then the radicals; on the Y branch the radicals were defined first, then the bonding. This convention was continued until the bonding on the nearest neighbor atoms were defined (the R groups in R_X_H and R_Y_rad).
Both the original and the updated trees are available in the ESI.†
The linear least squares regression calculates group values by finding the best fit to the available training data. For each set of distances in the training set, the reactants are matched to groups in the group tree. All groups that match the X_H_or_Xanyrad_H reactant are paired with the groups that match the Y_anyrad reactant, and the sum of each pair and a base value is set equal to the training distances. This creates a system of equations where the variables are the group values and the known values are the training data. The regression is conducted using the linear algebra package in numpy, finding the group values that best fit the data.29 A detailed description of the least squares regression is available in the ESI.†
The reaction CH4 + C2H5 is used as an example. Table 1 shows the sections of the molecular group tree relevant to this reaction. The most specific group that matches each reactant is found by descending the tree. CH4 matches the C_methane group in the X_H_or_Xanyrad_H tree, while C2H5 matches the C_rad/H2/Cs\H3 group in the Y_anyrad tree. An explanation of the naming convention, and complete tree definitions, are provided in the ESI.† The distance estimates are calculated by summing the top node value and the group correction for each reactant, predicting respective values for dXH, dHY, and dXY as 1.388 Å, 1.331 Å, and 2.721 Å.
Group | dXH | dHY | dXY |
---|---|---|---|
Base | 1.336010 | 1.336330 | 2.667560 |
L1: X_H_or_Xanyrad_H | |||
L2: X_H | −0.002556 | 0.002864 | 0.000227 |
L3: H2 | −0.327434 | −0.045046 | −0.369886 |
… | |||
L3: Cs_H | 0.007461 | 0.023642 | 0.032296 |
L4: C_methane | 0.076680 | −0.051468 | 0.028801 |
L4: C_pri | 0.025511 | −0.002230 | 0.025031 |
L5: etc. | |||
L4: C_sec | −0.026003 | 0.069757 | 0.044341 |
L4: C_ter | −0.025676 | 0.062321 | 0.034956 |
L5: Etc. | |||
L2: Xrad_H | 0.094987 | −0.106435 | −0.008430 |
Etc. | |||
L1: Y_anyrad | |||
… | |||
L2: Y_rad | 0.002857 | −0.002500 | 0.000277 |
L3: H_rad | −0.044160 | −0.330263 | −0.371926 |
… | |||
L3: Cs_rad | 0.024200 | 0.007289 | 0.032625 |
L4: C_methyl | −0.050813 | 0.075919 | 0.028607 |
L4: C_pri_rad | −0.001792 | 0.025273 | 0.025176 |
L5: C_rad/H2/Cs | −0.032772 | 0.051719 | 0.021617 |
L6: C_rad/H2/Cs\H3 | −0.024753 | 0.045959 | 0.024509 |
L6: C_rad/H2/Cs\Cs2\O | −0.125966 | 0.025305 | −0.097425 |
Etc. |
Transition state estimates are created by randomly “embedding” the atoms in 3D space such that they satisfy the bounds matrix. Repeating this process allows multiple conformers to be created. The conformer geometries are then optimized using a UFF force field calculation constrained by the bounds matrix. The lowest energy conformer according to the UFF calculation is selected as the transition state estimate. While the accuracy of the force field energy calculation is low, it is sufficient for conformer selection.
The IRC result should connect the original reactants and products for a successful transition state. The result is typically inspected visually for comparison, but this is not possible for an automatic procedure. In our algorithm, the IRC geometries are extracted and converted into chemical graphs using a simplified version of the connect the dots method in Open Babel.31 The atoms are sorted along the z-coordinate, with the method starting with the lowest atom, continuing along the axis, and terminating with the highest. A bond is made between this first atom, A, and its nearest neighbor, B, if all the following are true:
(1) No bond currently exists between A and B
(2) the number of other bonds to A and B is less than their respective valencies
(3) the distance between A and B is less than the sum of their covalent radii +0.2 Å.
The process is repeated with atom A being compared each time to the next-nearest atom from the previous iteration, until either there are no more atoms to be compared or the number of bonds on A equals its valency. The method then proceeds on to the next atom along the z-axis.
With the bonding complete, the chemical graphs of the IRC molecules are compared to the starting reactants and products using a graph isomorphism algorithm.32 The transition state search is successful if the chemical graphs are isomorphic. The automated algorithm is outlined in Fig. 2.
First, a preliminary training set was created from 44 unique hydrogen abstraction transition states, and was used to train the molecular group tree. As few groups were trained, we found the distance estimates to be insufficient for reliably predicting transition states. As a result, the training set was expanded to contain data from a total of 230 transition states. This expansion of the training set was done with geometries found both manually and using the automated algorithm. The reactions from the DIPK model were then passed to the automated algorithm, with data from the successfully found transition states added to the training set. The groups were retrained, and the method was tested again on the same reactions from the DIPK model. This led to the expansion of the training set from data for 230 transition states to 827 and then 1071 transition states. Characteristics of the group contribution method were investigated using 4 training sets (Table 2).
Training set name | Transition states in training set | Geometries found | RMS error (Å) | |
---|---|---|---|---|
Original tree | Modified tree | |||
44TS | 44 | Not run | 0.181 | 0.124 |
230TS | 230 | 658 | 0.102 | 0.088 |
827TS | 827 | 734 | 0.040 | 0.042 |
1071TS | 1071 | Not run | 0.036 | 0.041 |
Further comparison tested the performance of the molecular group trees for small training sets. The largest training set (1071TS) was randomly sampled to create many smaller training sets containing data from 44 transition states. With each of the smaller training sets, group values were trained and distances were predicted then compared to known distances from validated transition states. This was done using both the original and new tree structures.
The observed improvement in the distance predictions as the groups were trained with more data was consistent with our hypothesis. With a larger training set, some untrained groups now have data and some trained groups have more data, improving their accuracy. If the group was newly trained, the algorithm would use more relevant and specific group values, improving the predicted distances. This was observed in the improvement in the distance predictions moving from 44TS to 230TS. With new training data, previously trained groups improve as more data are used to train the group values, as seen when comparing the groups trained using 230TS and 827TS. Little improvement in the RMS error for predictions made with 827TS and 1071TS shows that the 827TS groups were relatively well trained so the extra data from 244 transition state geometries had little effect on group value predictions.
The observations show certain data are more desirable when expanding a training set for molecular group values. For example, if the reactions of interest are hydrogen abstractions from the OH group of an alcohol, the training set should contain such reactions with different types of radicals abstracting the hydrogen. If the training set contains data from a large number of transition states for hydrogen abstractions from alkanes by an alkyl radical, little will be gained by adding a transition state for ethyl abstracting a hydrogen from methane. Both the reactions of interest and the available data should be considered when adding new data to a training set.
![]() | ||
Fig. 4 The RMS error for the distance estimates compared to the optimized transition state distances. |
The differences in error observed with the two trees shows the importance of the structure to the distance predictions. While the new tree structure improves the distance predictions from smaller training sets, other tree structures might be able to further improve the predictions.
1000 new training sets containing data from 44 transition states were created by randomly selecting data from the 1071TS training set. The new training sets were used to train both the original and new molecular group trees, and reaction center distances were predicted for comparison with the 907 known TS. In over 85% of the 1000 cases, the modified tree had a lower RMS error than the original tree. The probability distribution of the RMS errors (Fig. 5) show that the predictions should be more accurate if made using the modified tree instead of the original tree structure, given the small size of the training set.
The RMS error attained using the 44TS training set was 0.181 Å with the original tree and 0.124 Å with the modified tree (Table 2). Comparing with the probability distributions in Fig. 5, which peak around 0.09 Å, shows that the probability of randomly selecting from 1071TS the 44 transition states used in 44TS is very low, i.e. they are strongly correlated and non-random. This lack of variety in the 44TS set is what leads to the large RMS errors: some specific groups were well trained, but the overall tree was poorly trained. This shows that the value of each transition state in a training set decreases when a similar transition state already exists in that training set, i.e. it is important to have a variety of structures in the training data, distributed evenly across the tree.
Fig. 7 shows the probability of a failed transition state search increases with increasing root mean squared (RMS) error in the three reacting distances of the starting geometry. The lower bound probabilities are calculated from trials from the 230TS training set. It is a lower bound of P(failure) because only the 249 failures that later succeeded with the 827TS training set were included; for the 486 reactions that continued to fail, the true distances are not known and the RMS error could not be calculated. Because few of our starting geometries were worse than 0.2 Å we do not have many trials in this region and our estimate of the failure probability is quite uncertain, hence the wide Clopper–Pearson39 95% confidence interval of P(failure) (the vertical bars in Fig. 7). To estimate the upper bound of the failure probabilities, we distributed the 486 additional failures using a variety of assumptions, each giving a different estimate of the P(failure) curve; the upper bound in the figure encompasses all these curves.
![]() | ||
Fig. 7 Probability of a failed TS search as a function of RMS error in reactive distances of starting geometry. For each point the vertical bar show the Clopper–Pearson39 95% confidence interval of the lower bound and the horizontal bar shows the range of RMS errors used to calculate it. |
Although uncertain, the shapes of these bounds are informative, and they support the need for good starting geometries for a transition state search: embedded geometries with an RMS error greater than 0.15 Å have a high failure rate. Other reaction families, optimization algorithms, and software packages may behave differently.
The difference between the upper and lower bounds for the reaction center distance estimates is currently set to 0.1 Å, which can be as much as 10% of some distances. This range should be related to the uncertainty calculated when determining the group values by linear regression, allowing well known values to have tight restrictions.
Transition state searches facilitate kinetic calculations, but automating the entire kinetic calculations would also require the reactant and product geometries. These can be found using the existing automated thermodynamic parameter calculation algorithm.12 With the required geometries and calculations, the procedure could interface with thermodynamic and kinetic parameter calculators, such as CanTherm (Fig. 8).40
Footnote |
† Electronic supplementary information (ESI) available: Tables of group values, list of test reactions, and 3-D geometries of transition states. See DOI: 10.1039/c5cp04706d |
This journal is © the Owner Societies 2015 |