Uncertainty quanti ﬁ cation for quantum chemical models of complex reaction networks †

For the quantitative understanding of complex chemical reaction mechanisms, it is, in general, necessary to accurately determine the corresponding free energy surface and to solve the resulting continuous-time reaction rate equations for a continuous state space. For a general (complex) reaction network, it is computationally hard to ful ﬁ ll these two requirements. However, it is possible to approximately address these challenges in a physically consistent way. On the one hand, it may be su ﬃ cient to consider approximate free energies if a reliable uncertainty measure can be provided. On the other hand, a highly resolved time evolution may not be necessary to still determine quantitative ﬂ uxes in a reaction network if one is interested in speci ﬁ c time scales. In this paper, we present discrete-time kinetic simulations in discrete state space taking free energy uncertainties into account. The method builds upon thermo-chemical data obtained from electronic structure calculations in a condensed-phase model. Our kinetic approach supports the analysis of general reaction networks spanning multiple time scales, which is here demonstrated for the example of the formose reaction. An important application of our approach is the detection of regions in a reaction network which require further investigation, given the uncertainties introduced by both approximate electronic structure methods and kinetic models. Such cases can then be studied in greater detail with more sophisticated ﬁ rst-principles calculations and kinetic simulations.


Introduction
Highly complex reaction networks underlie chemical reactions that involve reactive species, harsh reaction conditions, or non-innocent solvents (or a combination of all).2][3][4][5] All these approaches make different assumptions on the processes studied such that, from a feasibility point of view, none is generally applicable.To illustrate this point, we consider two examples.On the one hand, the dynamics on a rugged energy landscape will demand advanced sampling methods from molecular dynamics or Monte Carlo simulations rather than a standard quantum chemical approach that considers only a few selected stationary points on that surface. 6,7On the other hand, for processes on a well-structured potential energy surface with non-shallow minima, explicit dynamics may suffer from sampling problems and is oen replaced by kinetic models that eventually allow one to access long time and length scales beyond the reach of explicit dynamical approaches. 8uantum chemical methods are well suited for describing energy changes due to changes in the electronic structure of reacting molecules if these electronic effects govern the overall energetics of the process.Usually, structures considered relevant as stable intermediates or transition states are optimized and their energies are compared to identify the relevant reaction paths.Clearly, this approach is limited, especially if carried out manually, to a rather small number of structures only.For predictive work on systems for which little or no experimental information is known, the exploration of potentially important structures becomes an immense task.Several approaches exist to overcome this issue.In reactive molecular dynamics simulations, [9][10][11][12][13][14][15][16][17] for example, the nuclear equations of motion are solved to explore and sample conguration space.9][20][21][22] By applying predened (possibly alchemical) transformation rules to create new chemical species, explorations in conguration space are greatly accelerated.Recently, we proposed a fully automated heuristicsguided exploration protocol 22 in which the heuristic rules rest on reactivity descriptors derived from quantum mechanics.
It is important to understand that to theoretically grasp the kinetics of complex reaction networks, we must be prepared to investigate an enormous number of possible intermediates (on different potential energy surfaces) not generated by simple conformational changes but by the sheer number of chemically different reactants.For truly complex chemical reaction networks, no universal protocol based on quantum chemical calculations has been established so far that would span the whole range of steps from molecular and electronic structure optimization to detailed kinetic modeling.However, signicant progress in all research areas that would contribute to the establishment of such a protocol has already been made.Given the algorithmic and hardware developments accomplished in the past two decades, it should be feasible to establish such a protocol in a single, integrated implementation.
Clearly, various choices and approximations need to be made and hence the protocol to be established will not be unique.Still, we demand the development of such a protocol be subjected to constraints that will make it universally applicable.Besides, we are faced with the fact that quantum chemical raw data are affected by method-inherent errors and need to be augmented by nuclear motion and temperature corrections before they can be subjected to kinetic modeling.Hence, if we must be prepared to make certain assumptions and approximations, we expect from our protocol that the violation of an approximation can be identied within the protocol and overcome by approaches beyond the realm of the protocol's standard methods.This way, we may be able to identify possible breaches that point to more sophisticated theoretical approaches to be applied.If the number of such breaches is small, then the general basis of the protocol, which will be quantum chemical methods in our case, will remain valid.And clearly, in view of the successes of quantum chemical reaction mechanism elucidation, we have good reason to believe that this is possible.Obviously, this will only be possible if we have uncertainty measures at hand that allow us to assess the accuracy of individual simulation results.For example, Vlachos and coworkers recently applied Bayesian statistics to predict rate constants for chemical processes on surfaces. 23he ingredients of a general protocol for the generation and analysis of chemical reaction networks are: (1) the automated exploration of possibly relevant intermediates and transition states, (2) the determination of free energies for reactions in condensed phase, (3) systematic error estimation based on, for instance, Bayesian statistics, and (4) the kinetic modeling of the reaction network emerging.
In this work, we apply components ( 2)-( 4) of the general protocol to a complex chemical reaction network in aqueous solution: the formose reaction.It is our goal to establish protocol-inherent validation measures that keep track of the validity of the assumptions made and that may point to advanced theoretical approaches to deliver more reliable data if needed.Moreover, our analysis is intended to be a general feasibility analysis of this protocol that will, as we shall show, point to interesting future developments.

The formose reaction
Formose reaction is the collective term for a plethora of possible autocatalytic oligomerization reactions of formaldehyde in aqueous solution. 24,25The reaction affords a highly complex (racemic) mixture of linear and branched monosaccharides (tetroses to octoses), polyols, and several degradation products.The identication of all products poses a major analytical challenge and the exact composition has not been elucidated yet, though over 50 products have already been characterized. 26,279][30] The rst step towards the formation of sugars is the dimerization of formaldehyde, which is extremely slow and may involve catalysis 31,32 or radiation-induced processes. 33,34The dimer can be regenerated autocatalytically 35,36 and the reaction can therefore be easily initiated.The rapid subsequent formation of sugars is likely to proceed through an alternating series of forward and reverse aldol reactions as well as tautomerizations. 35Kua et al. investigated these key steps in the formose reaction computationally and concluded that the experimentally proposed mechanism is also plausible from a theoretical point of view. 37Rappoport et al. explored the chemical reaction network of the formose reaction automatically based on heuristic guidance and reproduced major reaction pathways as well as experimentally observed products. 20Recently, hydride shis and associated quantum tunneling were found to play a major role in the formose reaction, [38][39][40] which were not considered in the computational studies.The product ratios are very sensitive to the reaction conditions (e.g., solvent, temperature, and pH) and the amount and type of reactants.2][43] So far, the complexity of the reaction network has prevented experimental and theoretical kinetic studies of the entire process. 25

Transition state theory and thermochemistry
A reaction network of all relevant intermediates and transition states of a chemical process sets the frame to study population trajectories through the network.In solution chemistry, typically trajectories of molar concentrations are studied, which depend on several conditions such as initial feed of reactants and temperature.While the correlation of these conditions with the product distribution can be determined quite straightforwardly by a suitable experimental setup, it remains a challenge to analyze why a certain product distribution was found.To resolve this issue, studying the kinetics of a chemical process is inevitable.Only then, intermediates relevant for the product distribution but not contained in it can be detected.This time-resolved picture would allow us to develop strategies to support the formation of a desired product or to suppress the formation of unintended side products.
As experimental kinetics can only examine a limited number of chemical species, thorough theoretical kinetic models corresponding to complex reaction networks spanning several time scales are desired.For the construction of a general-purpose (mass-action) kinetic model, rate constants are the essential elements to be determined.
Conventional transition state theory (TST) provides a simple approach to calculate rate constants for isothermal reactions.It is assumed in conventional TST that a reaction coordinate along a Born-Oppenheimer potential energy surface is orthogonally intersected by a hyperplane in such a way that once crossed by a trajectory starting from a reactant state, that trajectory ends in the corresponding product state. 44This crossing point is approximated by the rstorder saddle point of a reaction coordinate.Given a canonical ensemble of microstates, for which the number of molecules N, the temperature T, and the volume V are constant, the thermal rate constant k(T) of a reaction from a reactant to a product crossing the corresponding transition state depends on the Helmholtz free energy difference between reactant and transition state, DA ‡,* , through an exponential function, 45 kðTÞ where k B is the Boltzmann constant, R the ideal gas constant, and h the Planck constant.Throughout this paper, we refer to a standard state of N z 6.022 Â 10 23 and V ¼ 1 L (indicated by a superscript asterisk to the free energy).It is a known problem that conventional TST (a) cannot ensure recrossing-free trajectories through the approximated dividing hyperplane (overestimation of rate constants) and (b) cannot account for quantum effects such as tunneling (underestimation of rate constants).Both phenomena can be accounted for in conventional TST by introducing a fudge factor k to the right-hand side of eqn (1).Extended approaches such as variational TST 46 and quantum TST 47 provide ways to circumvent these problems, but require much more information on the potential energy surface than its low-order stationary points.However, it was shown that conventional TST works surprisingly well even for large molecules such as enzymes. 48,49he only quantity we need to determine for the construction of a kinetic model based on conventional TST is the Helmholtz free energy A for all intermediates and transition states contained in a given reaction network.A is determined by the, in our case canonical, partition function where all energy states of the system of N molecules enter Q.
The direct evaluation of Q is in general not feasible.However, a well-established procedure exists 50 to approximate gas-phase free energies.In the gas phase, Q may be factorized into a product of N molecular partition functions q of N isolated, indistinguishable, and non-interacting molecules, Q ¼ q N /N!.This factorization enables the calculation of A based on an isolated-molecule quantum chemical calculation.A results from the internal energy U and the temperatureweighted entropy S, ( For our set-up and standard state, U can be decomposed into the sum of the temperature-independent electronic energy, E elec , the zero-point vibrational energy, ZPE, and remaining thermal contributions to the internal energy U rest (T,Q), where N A is the Avogadro constant, T > 0, and U rest (T,Q) is calculated without the zero-temperature contributions.A can be related to the Gibbs free energy G, G ¼ A + pV, where p is the pressure.Usually, p and V are related through the ideal gas law, p ¼ Nk B T/V, in an oversimplied way.The molecular partition function q is usually approximated by a product of electronic (elec), q elec , translational (trans), q trans , rotational (rot), q rot , and vibrational (vib), q vib , contributionsdeliberately neglecting the coupling of the degrees of freedom (such as rovibrational coupling).
For the evaluation of q elec , we may assume that electronically excited states are high in energy and cannot be excited thermally at a given temperature. 50However, for intermediates with small HOMO-LUMO gaps, this may be taken as an indication for a necessary extension of this standard approach.Hence, only spin and orbital or point-group degeneracy needs to be taken into account.
To evaluate q trans , the particle in a box model is employed to determine the energy states associated with the translation of the center of mass of the molecule in V. 50 q trans depends on the mass of the molecule and V.If we neglect the existence of isotopomers, the mass is easily calculated for the most abundant isotopes.Recall that V is the volume for the chosen standard state.
To evaluate q rot , the molecule is treated as a rigid rotor. 50The molecular principal moments of inertia and the symmetry number of the molecule enter q rot .The error introduced by this assumption is severe, for instance, for microsolvated molecules, which feature not only internal rotational degrees of freedom, but also their coupling with the external rotations, i.e., the moments of inertia depend on internal rotational degrees of freedom. 51 vib is usually approximated by all energy eigenvalues of harmonic oscillators that describe the 3m À 6 normal coordinates of an m-atomic non-linear molecule. 50It is well known that the harmonic approximation will break down, if the potential energy surface deviates signicantly from the quadratic harmonic potential, 50,[52][53][54][55][56] for instance, for highly anharmonic modes, which are better described as internal rotations. 57The evaluation of the contribution of these anharmonic frequencies is computationally expensive, [58][59][60][61][62][63] and therefore, not feasible for the exploration of a large reaction network.If, however, reactants and products exhibit similar harmonic vibrational frequencies, error cancellation will occur (see, e.g., ref. 64).
The standard-state free reaction energy in solution, DA * sol , for a reaction R / P in solution can be obtained as the sum of the standard-state free reaction energy in the gas phase, DA * gas , and the difference of the free energies of solvation of the reagents, DDA * solv , when employing a thermodynamic cycle as illustrated in Fig. 1.6][67] The solute is immersed into a cavity in the continuum leading to parameter-dependent solvent-solute interactions.9][70] The parameters in the implicit solvation models are determined through parametrization to experimental Gibbs free energies of solvation. 66The Gibbs free energy of solvation is taken to be equal to the Helmholtz free energy of solvation, 66 which neglects volume changes.This effect was shown to be small for several test cases. 71Here, we assume that all effects associated with the transfer of the molecule from an ideal gas phase to the solution phase are absorbed in DA * solv . 57,72This assumption is reasonable for small, rigid molecules, whose structures do not signicantly change upon solvation.For such molecules, modern implicit solvation models (e.g., SM12) 70 exhibit a mean unsigned error of about 2 kJ mol À1 in comparison to experimental data. 70,72A more pronounced error can be expected for charged species. 70,72,73][76] The choice of implicit solvent models limits the choice of possible temperatures, because they are usually parametrized to Gibbs free energies of solvation at ambient temperature. 66The t to experimental data at one temperature implies that a subdivision of the DA * solv into DU * solv and DS * solv is not possible.Note, however, that Chamberlin et al. introduced a temperature-dependent implicit solvation model, which could expand the range of accessible temperatures. 77g. 1 Thermodynamic cycle for the calculation of the standard-state Helmholtz free reaction energy DA * sol for the reaction R / P in solution ('sol') from gas-phase ('gas') and solvation ('solv') data.
Furthermore, many popular continuum solvation models assume that thermal equilibrium between solute and solvent is reached instantaneously.This may be inadequate for reactive intermediates. 67,78n this work, a reaction or activation free energy in solution, DA * sol , is obtained from the difference of the electronic energies, DE elec , the difference of the zeropoint vibrational energies, DZPE, the difference of the thermal contributions to the free energy in the gas phase, DA * gas , and the difference of the free energies of solvation, DDA * solv , DA * sol ðT; Here, Q solv is a place holder for the continuum modeling of DDA * solv .We highlight the separation of contributions that do not depend on a partition function (DE elec , DZPE) from those that do depend on it (DA * gas , DDA * solv ).Recently, we have demonstrated how the error associated with DE elec can be assessed by applying Bayesian statistics 79 (see also Section 4).This error may be considered a lower bound for the error of DA * sol if the error of the other contributions is neglected.For the error estimation of A * gas , ZPE, and DA * solv the individual contributions must be investigated.
The model-inherent errors in Q, which is employed to calculate A * gas , are difficult to evaluate.We may assume that the error of q vib is dominant.As the harmonic approximation is a severe approximation for low-frequency modes, [52][53][54][55][56] the anharmonic q vib is required to assess the effect of this approximation.The application of scaling factors for harmonic frequencies 62,80 is, however, not sufficient to obtain a reference anharmonic q vib , because scaling factors neither correct the form of the quadratic harmonic potential nor the equidistance of the energy levels.4][55][56] The one-dimensional potentials are sampled along the normal coordinates.4][55][56] Hence, the deviation of the harmonic ZPE from the anharmonic ZPE is readily obtained.It is then possible to assess the error of the harmonic q vib by comparison with the anharmonic q vib obtained by explicit summation over all vibrational modes.This error can be considered a lower bound for the error of A * gas since mode-mode coupling effects and errors due to the approximations in q trans and q rot are not taken into account.
Accurate, theoretical reference data are difficult to obtain for DA * solv . 66,81Alternatively, the error of DA * solv can be estimated by comparison to available experimental data for benchmark sets (e.g., the Minnesota Solvation Database 82 ), assuming transferability.popular density functionals may, however, signicantly deviate from experimental data in a rather irregular manner. 83,84If, however, one could estimate the error of each computational result, one could assess whether conclusions drawn from the data are reliable.
In general, it is difficult to predict the error of density functional calculations. 836][87][88] By generating an ensemble of exchange-correlation functionals, a mean and a variance could be assigned to each result (see also ref. 89 and 90).
Based on the work of Jacobsen, Sethna, Nørskov, we developed a novel approach for the construction of reliable density functionals with Bayesian error estimation capabilities. 79Our ansatz was tailored to overcome the transferability problem by relying on system-focused reference data and to exploit the better accuracy of hybrid functionals.
Instead of considering only the best-t parameter (as commonly done with standard exchange-correlation functionals), we assign a conditional probability distribution to a linear parameter a in the exchange-correlation functional, where O is some observable (typically an energy contribution), D is some data set containing (computational or experimental) reference results, C denotes a cost function quadratic in a, and a 0 is the parameter value that minimizes C. In practice, this distribution needs to be sampled, where N is a Gaussian distribution with mean a 0 and variance s 2 ¼ 2C(a 0 )/(v 2 C(a)/ va 2 | a 0) (for a detailed derivation see ref. 79).With the set of N BEE parameters generated with eqn (6), ã ¼ {a 1 ,a 2 ,.,a N BEE }, an error estimate s for the observable O (e.g., an activation energy) can be calculated, By a system-focused reparametrization of LC-PBE0, the long-range corrected (LC) version of the density functional PBE0, [91][92][93] we were able to reliably estimate errors of calculated reaction energies. 79Hereinaer, we refer to such a functional as LC*-PBE0 in order to emphasize that the original parameters were modied (in this work according to data related to the formose reaction).In our previous study, 79 we concluded that four parameters in this exchange-correlation functional need to be modied to achieve accurate relative energies and reliable error estimates for a specic chemical system.
For an accurate reparametrization, the reference dataset D needs to be representative for the system to be studied.In this study, D contains structures of intermediates and transition states of the formose reaction.Specically, D ¼ {(A i ,B i )} consists of pairs of structures on the same potential energy surface, i.e., structures with the same number and type of atomic nuclei, the same number of electrons, and the same electronic spin state.Then, the electronic energy difference DE elec,i ¼ E elec (B i ) À E elec (A i ) between the two structures A i and B i of data set entry i denes the cost function C, In Tables 1 and 2, the accuracy of LC*-PBE0 (in comparison to standard functionals) with respect to the training set and test set is given.It can be seen that LC*-PBE0 is signicantly more accurate than most standard functionals considered here.The optimized parameters of LC*-PBE0 are close to those of PBE0 (see ESI †), which explains why the functionals are of similar accuracy.Due to its additional parameters, and therefore, higher exibility LC-PBE0 was chosen over PBE0 for the re-parametrization.Nonetheless, with a largest absolute deviation between 8-10 kJ mol À1 , it is clear that error estimation is still necessary.
In Fig. 2   For both data sets, we observe that the error is at least within AE4.2 kJ mol À1 (z1 kcal mol À1 ), unless the error estimate reported by the functional indicates otherwise (i.e., s > 4.2 kJ mol À1 ).While there are relative energies for which the errors are underestimated (D2, D4, and D25 in the training set and D30 and D38 in the test set), considering the diversity of this reference set and the error of some standard functionals (see also Tables 1 and 2), the accuracy of the error estimation is satisfactory.

Kinetic modeling
For the construction of an elementary kinetic model, free activation energies from rst-principles calculations are required as explained above.From the rate constants calculated by eqn (1), differential equations describing the time  propagation of population densities of all chemical species can be constructed.By integrating these differential equations, the underlying chemical process can be modeled.
Since differential equations describing chemical processes are generally coupled, analytical integration becomes rapidly impossible.Therefore, numerical integration is the standard method of choice.However, given that reaction networks may be arbitrarily large and entangled, numerical integration may become inefficient, 95 especially if the underlying process spans multiple time scales.For this purpose, a variety of approaches were designed focusing on the simplication of kinetic models. 96

Network structure and properties
We describe the structure of a reaction network by a graph of n vertices and 2l edges.As we assume every chemical transformation to be reversible (we refer to such a reversible elementary process as reaction pair), the graph is strictly bidirectional, which explains the enforced even number of edges.Either of both edges corresponding to a reaction pair is assigned an arbitrary but unique direction (forward or backward).This feature is exploited for the construction of the stoichiometry matrix S. It is of dimension n Â l and contains information on how many particles S ij of the i-th species are consumed or formed in the j-th elementary reaction, i.e., S ij is negative if the i-th species is consumed in the forward direction, and positive if the i-th species is formed in the forward direction.
We assign time-dependent population densities y i (t) with i ˛{1,., n} (here: molar concentrations) to the vertices, and rate constants k forward j and k backward j with j ˛{1,., l} to the edges.Assuming detailed balance, the rate of a reaction pair f j reads Note that we only consider chemical reactions with a molecularity smaller than or equal to 2. We consider this to be a good assumption for solution chemistry as long as solvent-solvent reactions are unlikely to occur (e.g., for the formose reaction in pure formaldehyde, a trimolecular reaction of formaldehyde to 1,3,5trioxane could be feasible 37 ).

Simplication of kinetic models
It is straightforward to deduce a kinetic model from the given network structure and properties.Given the n Â l stoichiometry matrix S and the l Â 1 reaction pair vector f ¼ {f j } according to eqn (9), the n Â 1 rate vector g can be constructed, which represents the rst derivative of the concentration vector y with respect to time, Our objective is to integrate this kinetic model such that concentration trajectories are obtained from the initial conditions (feed of reactants, temperature) up to thermodynamic equilibrium.
For the development of our kinetic simulation algorithm, we were inspired by two such simplication approaches, namely Markov State Models (MSMs) 97,98 and Computational Singular Perturbation (CSP). 99,100MSMs were developed for molecular dynamics simulations, where the phase space is decomposed into microstates such that a formerly continuous trajectory becomes a jump process, which is no longer Markovian (memoryless).Since local information is lost in a discrete phase space, the decomposition is chosen such that transitions within a microstate are much more likely to occur than transitions between microstates.This way, rapid convergence to local equilibrium can be assumed for these microstates, which recovers Markovianity.As a consequence, a kinetic model can be constructed from these discrete microstates by counting transitions between them.The microstates may in turn form macrostates (kinetic clusters) for which transitions are much more likely to occur than transitions between them.These clusters can be determined by studying the eigenvalues l i of the n Â n rate matrix K ¼ {K ij }. 101 Its elements K ij are a measure of the rate for a transition from the j-th to the i-th microstate.In the case of linear kinetic models (rst-order reactions only) as studied in MSMs, the rate matrix K is time-invariant and equals the Jacobian J ¼ {J ij }, the elements of which are dened as the rst partial derivative of the rate g i of the i-th species with respect to the concentration y j of the j-th species.The time scale s i corresponding to the process described by the i-th eigenvalue is inversely related to the modulus of that eigenvalue, i.e., the larger the modulus of an eigenvalue, the faster the corresponding process.If a predened gap 3 can be found in the eigenvalue spectrum, a time scale separation of processes is assumed to be possible such that the rate vector can be decomposed into fast and slow parts, With this decomposition at hand, it is possible to dissipate the fast processes applying a small time step s fast in the numerical integration until g fast z 0. Subsequently, the slow processes can be modeled from the updated initial conditions applying a much larger time step s slow .Clearly, the larger the demanded spectral gap, the smaller the error introduced by assuming decoupling of fast and slow processes.
Since the Jacobian is time-invariant in the case of linear kinetic models, the time-scale separation is also invariant in the course of the global reaction process and needs to be examined only once.However, in non-linear kinetic models (as studied here), the Jacobian is a function of time due to the inclusion of concentrations of reaction partners. 96This poses a challenge to the time-scale separation as now a steady examination of the time gap is necessary to ensure valid decoupling of fast and slow processes.
One of the most robust approaches in this respect is CSP. 102The basis of CSP is the assumption that the concentration trajectory of a chemical process is rapidly attracted onto a slow invariant manifold U, 99 which is an (n À m)-dimensional hypersurface in concentration space, where n denotes the number of species and m denotes the number of fast time scales.Consequently, s m and s m+1 are the time scales of the slowest of the m fast processes and of the fastest of the (n À m) slow processes, respectively.Two subspaces, the m-dimensional subspace of fast processes and the (n À m)-dimensional subspace of slow processes, are introduced, which are spanned by m n-dimensional (column) basis vectors a i (j ˛1,., m) and (n À m) n-dimensional (column) basis vectors a j (j ˛m + 1,/, n), respectively.Furthermore, a set of n-dimensional dual (row) basis vectors b p (p ˛1,/, n) is employed, which fulll the condition b p a q ¼ d pq , where d pq is the Kronecker delta.The decomposition ansatz for the rate vector reads CSP approximates the basis vectors b p and a q by an iterative renement procedure, where each renement introduces an accuracy increase. 100The renement procedure requires the time derivatives of the basis vectors. 99Therefore, computational savings due to the time-scale separation may be lost by iteratively determining the basis vectors aer each time step. 96However, the rst renement does not involve time derivatives and already guarantees numerical stability of the simplied model. 100

Kinetic simulation algorithm
To continuously determine slow and fast processes in a rolling fashion, we study the eigenvalues of the Jacobian.Given a predened time-gap criterion 3 (0 < 3 ( 1), we start from the second-smallest modulus of the eigenvalues, |l nÀ1 |, and compare it to the next higher modulus, |l Typically, the le and right eigenvectors of the Jacobian are chosen as an approximation for the basis vectors b p and a q , respectively, according to the CSP formalism.This approximation corresponds to the rst renement of the CSP basis vectors. 95It follows that the eigenvalues of the Jacobian can be obtained from our choice of CSP basis vectors, Here, we follow an alternative approach to determine which one of the l reaction pairs contributes to the fast processes.We consider the largest modulus of eigenvalues of each of the l sub-Jacobians corresponding to the isolated reaction pairs.If a dominant modulus is larger than |l m+1 |/3 (l m+1 is associated with the Jacobian of the entire reaction system), the reaction pair connected to it will contribute to the fast processes.The idea behind this approach is that an eigenvalue corresponding to the entire kinetic model is approximately the sum of the eigenvalues of sub-Jacobians with similar or smaller moduli. 95With this approach, we introduce the assumption that a reaction pair is either included in or excluded from the fast processes, which certainly is a simplication that requires careful investigation.
Next, we propagate the fast sub-network (i.e., the sub-network containing only those edges corresponding to fast reaction pairs) to local equilibrium.The stationary distribution can be determined through a non-linear optimization algorithm 103 or analytically for simpler networks.Due to the time-scale separation, it is assumed that this process occurs immediately, i.e., it is not resolved in the course of the kinetic simulation.
Then, the actual simulation starts.The partially equilibrated concentration vector y peq (t) is propagated according to the time scale s 1,slow , which corresponds to the fastest process of the Jacobian of the slow sub-network (i.e., the subnetwork containing only those edges corresponding to slow reaction pairs).The update of the concentration vector reads Aer that, the Jacobian of the entire network is decomposed again to determine the fast and slow processes for the next time step.
Our kinetic simulation algorithm can be summarized as follows: (1) Determine the number of fast time scales m by spectral decomposition of the Jacobian corresponding to the kinetic model under consideration.
(2) Identify a reaction pair as a fast one if the largest modulus of eigenvalues of its sub-Jacobian is larger than |l m+1 |/3.
(3) Propagate the fast sub-network to local equilibrium, y(t) / y peq (t).( 4) Determine the time step s 1,slow by decomposing the Jacobian of the slow sub-network.
(5) Update the partially equilibrated concentration vector according to eqn (17), y peq (t) / y(t + s 1,slow ).( 6) If global equilibrium is not yet reached, repeat steps 1 to 5; otherwise, stop the simulation.This article is licensed under a Creative Commons Attribution 3.0 Unported Licence.
The formose reaction is an example of a large and highly entangled reaction network.The key challenge of this network is the presence of coupled reactions spanning multiple time scales.In recent work, 22 we have shown how such a network can be explored in general.Since the exploration of the formose reaction is beyond the scope of this work, only a sub-network of the formose reaction is investigated here.The structure coordinates are adapted from ref. 37  (see ESI †).The heuristics-guided exploration of the whole formose network is currently being studied in our group. 104his sub-network, which already features many conceptual challenges of the entire formose reaction, is shown in Fig. 4. It represents a possible mechanism for the rst steps of the formose reaction as described by Kua et al. 37 and comprises six chemical species and ve reaction pairs (ten elementary reactions Ri).We obtained all free energies in single-point calculations as described in the ESI.† In water, formaldehyde (1) is in equilibrium with its hydrated form, methanediol (2). 1 dimerizes to glycolaldehyde (3), which is a reaction with a high free activation energy (cf.Table 3).2][33][34] From experimental studies it is, however, well known that the dimerization proceeds very slowly.3 can react with water to 1,1,2-ethanetriol (5).Another possible reaction of 3 is the enolization to 1,2-ethenediol (4).The addition of 1 to 4 yields glyceraldehyde (6).This bimolecular reaction introduces a signicant entanglement in the model network.The model network does not capture the autocatalytic nature of the formose reaction, in which 3 can be regenerated autocatalytically from intermediates generated from 6 in subsequent reactions.
Table 3 presents (standard-state Helmholtz) free activation energies (in solution), DA ‡,* , calculated according to eqn (4) and the resulting rate constants k (together with error estimates) for the reactions in the model network.
It can be seen that DA ‡,* is high (above 100 kJ mol À1 ) for most reactions, and consequently, the reaction rates are small.In addition, most reactions have This article is licensed under a Creative Commons Attribution 3.0 Unported Licence.estimated errors of above 10 kJ mol À1 , which reects the large uncertainty of the respective reaction rates.In Section 4, we showed that the LC*-PBE0 functional provides reliable error estimates above 4.2 kJ mol À1 .The estimated error for reaction R7 is below that, and therefore, most likely too small.For the simulation we selected an absolute temperature of 298.15 K, a 1 M solution of formaldehyde in water as initial feed, and a time-gap criterion of 3 ¼ 10 À3 .For technical details of the kinetic modeling employed here, see the ESI.† For every set of free activation energies, it was found that all reaction pairs but (R3, R4), the dimerization of formaldehyde (1) to glycolaldehyde (3), contribute to the fast processes.Therefore, only reaction pair (R3, R4) constitutes the slow subnetwork (Fig. 5).
The concentration trajectories from the kinetic simulation of the reaction network are shown in Fig. 6.The red curve corresponds to the trajectory obtained from the free activation energies listed in Table 3.The black curves correspond to the trajectories obtained from the free activation energies calculated from the ensemble of density functionals generated by our error estimation scheme according to eqn (6).
The simulated time scale of the global process exceeds the age of the universe in each case.This nding should not be interpreted in absolute terms, but it indicates that the uncatalyzed thermal formose reaction is very unlikely to occur if  one starts from formaldehyde (1) alone, provided that the free activation energies and their estimated errors are reliable.It should be noted that glycolaldehyde (3) is autocatalytically regenerated in the formose reaction, which is not considered in our model network.This way, the dimerization of 1 to 3 (reaction R3) can be circumvented, which leads to an acceleration of the overall process not depending on the extremely slow reaction R3.The concentration trajectories show clearly how sensitive rate constants are to variations in the free activation energies.For instance, the variation in time of the concentration trajectories of methanediol (2) spans almost 23 orders of magnitude (a factor of 8.7 Â 10 22 at an arbitrarily chosen concentration of y 2 ¼ 0.01 mol L À1 ).Since only reactions R3 and R4 contribute to the time resolution of the chemical process, uncertainties in the corresponding free activation energies need to be responsible for this signicant variation.In Table 4, properties of the fastest and slowest concentration trajectories (Fig. 6, species 2, le-most and right-most curves) are compared.For reaction R3, the free activation energy spans a range of about 60 kJ mol À1 , and for reaction R4, this range is about 100 kJ mol À1 , leading to a deviation in rate constants of about 10 and 17 orders of magnitude, respectively.Taking the concentrations of 1 and 3 (the constituents of reactions R3 and R4) at  3 are shown in red.The other trajectories (black) result from the free activation energies calculated from the ensemble of density functionals generated by the error estimation scheme.Note that the time scale of the equilibration process is extremely large, which originates from neglecting relevant intermediates and elementary reactions in our model network.For readability reasons, all plots start after the first global time step s 1,slow , which depends on the sampled free activation energies, and therefore, the onset of the trajectories is different.

Paper Faraday Discussions
This our arbitrarily chosen concentration of y 2 ¼ 0.01 mol L À1 into account, the rates of both reactions can be calculated.For both the fastest and slowest trajectories, reaction R3 is much faster than reaction R4.Therefore, we assume only reaction R3 to be relevant for the kinetic simulation.The reaction time can be roughly estimated by the inverse of the current reaction rate.In our case, the reaction time of the slowest trajectory is higher than that of the fastest trajectory by a factor of 1.4 Â 10 23 , which is quite close to the factor of 8.7 Â 10 22 determined from the concentration data of 2. Obviously, an error of this magnitude with respect to the free activation energy is far too large to quantify concentration trajectories in terms of absolute time.Moreover, it should be noted that the error introduced by choosing conventional TST to calculate rates is not considered here.
Even though the uncertainty in free activation energies strongly affects absolute time, it does not affect the qualitative ux of concentrations through the network in terms of non-crossing trajectories (Fig. 7).This nding can be explained by the distinct separation of the magnitude of the free activation energies.In Table 4, it can be seen that the free activation energy for reaction R3 in the slowest case is even lower than that for reaction R4 in the fastest case.Furthermore, the free activation energies and their uncertainties listed in Table 3 show that all reaction barriers are well separated from each other, which does not allow for an alternative reaction mechanism.Clearly, for small activation energy differences, such as found in enantioselective organocatalysis, large uncertainties would also lead to qualitatively different results.
Qualitative validity of the kinetic simulation is also underlined by the fact that in all cases, 1,1,2-ethanetriol ( 5) is the main product at chemical equilibrium.The population dominance of 5 over 3 was also found experimentally by Kua et al. 105 However, their calculated Gibbs free activation energies for the corresponding reaction pair (R7, R8) 37 (DG ‡;* 3/5 À DG ‡;* 5/3 ¼ 2:5 kJ mol À1 ) are very similar to each other.Their Gibbs free activation energies can be directly compared to our Helmholtz free activation energies, because volume changes are neglected.Our free activation energies for the reaction pair (R7, R8) differ signicantly from each other on average (DA ‡;* 3/5 À DA ‡;* 5/3 ¼ À18:9 kJ mol À1 ).A reason for the observed difference is the choice of computational methods for the calculations (e.g., different density functional and solvation model).It might seem surprising that 5 Table 4 Free activation energy DA ‡,* (in kJ mol À1 ), rate constant k (in s À1 and (L molÀ1 s À1 ) for unimolecular and bimolecular reactions, respectively), concentrations y 1 and y 3 (in mol L À1 ) at y 2 ¼ 0.01 mol L À1 and reaction rate r (in mol L À1 s À1 ) for reactions R3 and R4 of the fastest and slowest concentration trajectories (Fig. 6, species 2, left-most and right-most curves) is the main product in our simulation even though glyceraldehyde ( 6) is a thermodynamic sink.However, one should keep in mind that the concentration trajectory of 6 is temporarily signicantly populated.To understand this nding, we need to discriminate between fast and slow processes (Fig. 5).
Considering the fast sub-network (Fig. 5), we understand that there are two unconnected channels to form 6, i.e., (1, 2) and (3, 4, 5).This picture is equivalent to reaction A + B # C, where the initial concentration difference between A and B is conserved over the course of the reaction.It follows that D h (y 1 + y 2 ) À (y 3 + y 4 + y 5 ) is the conserved quantity in our case.If one of the two channels is unpopulated, 6 cannot be formed.This case holds in the beginning (dominant population of 1) and in the end (dominant population of 5) of the reaction process.The slow subnetwork (Fig. 5) now connects these two channels.Since channel (1, 2) is dominantly populated in the beginning of the reaction process, ux occurs towards channel (3, 4, 5) and, hence, towards 6.The concentration of 6 increases while the magnitude of D decreases.At a certain point in time, approximately when the concentration of channel (3, 4, 5) starts becoming dominant over that of channel (1, 2), the magnitude of D increases again so that the concentration of 6 decreases.Since D is asymptotically decreasing with time, we employed this quantity to dene the reaction progress in Fig. 7 as (D 0 À D)/(D 0 À D eq ), where D 0 is D at time t ¼ 0, and D eq is D at global equilibrium.Recall that here, we are studying a small segment of a complex reaction network, where 6 can isomerize to more stable intermediates or reacts with 1 to higher sugars.Therefore, the reux of 6 is most probably an artifact resulting from the particular choice of the network.Another feature of the conservation of D is that the kinetic model can be reduced to a single differential equation (see the ESI † for more details).This differential equation can be easily integrated by any conventional numeric solver.Here, we chose the standard fourth-order Runge-Kutta algorithm.We compared the result to that of our CSP-type method, where we employ an explicit Euler algorithm according to eqn (17), which is the simplest ansatz for numerical integration and known to be unstable due to the lack of an inherent time step selection.However, our CSP-type method provides the time step for the explicit Euler algorithm by continuously analyzing the Jacobian.We emphasize that both approaches to model the kinetics of the network (CSP/Euler vs. Runge-Kutta) yield identical results.

Conclusions
We established a robust protocol that combines electronic structure calculations and kinetic simulations for the accurate description of a complex kinetic network by studying stationary points across multiple potential energy surfaces.Employing a simplied model network, we highlighted and discussed the challenges of kinetic studies on complex chemical networks such as the formose reaction.We showed by employing a time-scale separation approach based on Computational Singular Perturbation 99,100 how the frequently occuring stiffness (rare-event problem) in kinetic simulations can be circumvented.As a consequence, we were able to propagate uncertainties in the free activation energies through the complete kinetic simulation up to global equilibrium.Since the rate constants depend on free activation energies DA ‡,* (in a canonical ensemble) through an exponential function, errors in DA ‡,* strongly affect the kinetic simulation.Therefore, error estimates for DA ‡,* are decisive for drawing meaningful conclusions from a kinetic analysis.While reliable error estimates for electronic energies can be obtained by Bayesian statistics as shown in Section 4, errors on other contributions of DA ‡,* have not been accounted for in a systematic way yet.We proposed a strategy to also obtain error estimates for these contributions, but defer their analysis to future work.To improve the accuracy of the kinetic simulation, elementary steps with large error estimates need to be subjected to highly accurate quantum chemical calculations such as those reported in ref. 106.
Then, the heuristics-guided exploration established for protonation reactions in a previous study 22 needs to be extended to enable the exploration of chemical systems such as the formose reaction, 104 which will facilitate a fully automated exploration and analysis of the formose reaction.
) where DE LC*-PBE0 elec;i and DE ref elec;i are the relative energies obtained with the LC*-PBE0 functional and the reference method, respectively.In this study, electronic energies from the DF-LCCSD(T0)-F12 method are chosen as reference.For further details, Cartesian coordinates, and the reference electronic energies in D, see the ESI.† To assess the transferability of the reparametrized functional, the dataset D was arbitrarily split into a training set and a test set with 25 and 17 entries, respectively.By minimizing C with respect to the training set with the L-BFGS-B algorithm, 94 a new set of parameter values for the LC*-PBE0 functional was obtained (see ESI †).
and 3, LC*-PBE0 (with AEs error bars, calculated from an ensemble of N BEE ¼ 50 functionals as described in the ESI †) is compared to contemporary density functionals with respect to the training set and test set, respectively.

Fig. 2
Fig. 2 Errors of LC*-PBE0 (with error bars indicating AEs) and several standard functionals with respect to the training set (D1-D25).The dashed lines indicate an error of AE4.2 kJ mol À1 .

Fig. 3
Fig. 3 Errors of LC*-PBE0 (with error bars indicating AEs) and several standard functionals with respect to the test set (D26-D42).The dashed lines indicate an error of AE4.2 kJ mol À1 .
nÀ2 |.If |l nÀ1 |/|l nÀ2 | $ 3, we continue by increasing each index by one.If |l i |/|l iÀ1 | $ 3 for all i ˛{2,., n À 1}, our time-gap criterion is not fullled and we cannot determine a spectral gap.Otherwise, the rst eigenvalue pair fullling the condition |l i |/|l iÀ1 | < 3 sets the number of fast time scales, m ¼ i À 1.

Fig. 4
Fig. 4 (a) Possible mechanism of the first steps in the formose reaction, (b) abstract graph representation of this reaction sub-network.

Fig. 5
Fig. 5 Fast (bottom left) and slow (top right) sub-networks of the reaction network shown in Fig. 4.

Fig. 6
Fig. 6 Concentration trajectories with respect to time for chemical species 1-6 according to the reaction network shown in Fig. 4. The trajectories resulting from the free activation energies listed in Table3are shown in red.The other trajectories (black) result from the free activation energies calculated from the ensemble of density functionals generated by the error estimation scheme.Note that the time scale of the equilibration process is extremely large, which originates from neglecting relevant intermediates and elementary reactions in our model network.For readability reasons, all plots start after the first global time step s 1,slow , which depends on the sampled free activation energies, and therefore, the onset of the trajectories is different.

Fig. 7
Fig.7Concentration trajectories with respect to reaction progress for chemical species 1-6 according to the reaction network shown in Fig.4.The trajectories resulting from the free activation energies listed in Table3are shown in red.The other trajectories (black) result from the free activation energies calculated from the ensemble of density functionals generated by the error estimation scheme.Contrary to Fig.6, here, the trajectories are laid on top of each other.

Table 1
Largest absolute deviation (LAD), mean absolute deviation (MAD), and mean signed deviation (MSD) of a selection of functionals, some with D3 dispersion corrections, for the training set (in kJ mol À1 )

Table 2
Largest absolute deviation (LAD), mean absolute deviation (MAD), and mean signed deviation (MSD) of a selection of functionals, some with D3 dispersion corrections, for the test set (in kJ mol À1 )

Table 3
Free activation energies DA ‡,* (in kJ mol À1 , with error estimates) and rate constants k (in s À1 and (L mol À1 s À1 ) for unimolecular and bimolecular reactions, respectively) for the reactions in the network Open Access Article.Published on 06 July 2016.Downloaded on 10/30/2023 3:30:01 AM.This article is licensed under a Creative Commons Attribution 3.0 Unported Licence.