Open Access Article
Manuel Palma Banos
a,
Joel D. Kressb,
Rigoberto Hernandez
a and
Galen T. Craven
*b
aDepartment of Chemistry, Johns Hopkins University, Baltimore, Maryland, USA
bTheoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico, USA. E-mail: gcraven@lanl.gov
First published on 20th April 2026
A data-driven computational method is introduced to extract chemical reaction mechanisms from time series chemical concentration data. It is realized through the use of dynamic symbolic regression in which a sparse analytical form for a dynamical system is discoverable from the underlying data. We specifically develop the stoichiometrically-informed symbolic regression (SISR) method to address a standing challenge in complex chemical reaction networks: given a time-series dataset of concentrations of several components, what is the mechanism and the associated rate constants? SISR finds the optimal mechanism, kinetic equations and rate constants by combining differential optimization with a genetic optimization approach that searches a symbolic space of possible reaction mechanisms. Use of SISR in several paradigmatic examples spanning linear and nonlinear reaction schemes results in excellent agreement between true and predicted mechanisms, including when the method is applied to noisy data. The advantages of a stoichiometrically-informed approach such as SISR to address reaction discovery is illustrated through comparison with the use of generic state-of-the-art data-driven approaches.
Because of these difficulties, automated reaction mechanism generation is an emerging data-driven research approach that can accelerate the extraction of accurate reaction mechanisms from data.5–8,18–20 Data-driven and machine learning (ML) methods have been broadly and successfully applied in many areas of the physical sciences.21–35 Data-driven approaches for reaction discovery have been used to decipher complex and large datasets of chemical concentration data by extracting chemical reaction pathways, rate constants, and reaction mechanisms.5–8,36 The current data-driven reaction mechanism discovery methods, however, generally suffer from the same shortcomings that are typical of most data-driven and ML approaches including lacking interpretability, a large number of parameters, the black-box nature of the approximating function, and poor performance when extrapolation outside of the training data is performed. One ML approach that is used to circumvent these limitations is Symbolic Regression (SR)—a method to search for simple analytical functions that best describe a dataset. SR has been applied in multiple contexts to extract sparse and interpretable functional forms from data.37–39 In the context of SR applied to dynamical systems,40 dynamical SR approaches can be used to extract a sparse analytical form for a dynamical system from time series data. Integrating the discovered system of dynamical equations will generate the time-evolution of the input variables, for example the time-dependence of the concentrations of chemical species.41,42
One of the most prominent SR methods that is used to discover the dynamical equations giving rise to a time series dataset is the Sparse Identification of Nonlinear Dynamical systems (SINDy) approach.40,43,44 SINDy has been applied broadly in the physical sciences and has seen successful applications in diverse areas such as biological networks,45–47 aerodynamics,48 and plasma physics,49 among others. One of the primary advantages of SINDy is that given a collection of time series data, it can quickly (relative to, for example, genetic SR approaches50,51) search a large space of possible analytical functional forms and corresponding parameter values to generate a simple dynamical system that when integrated matches the time evolution of the input data. Other approaches to determine symbolic expressions for physical mechanisms have been developed. For example, the sure-independence screening and sparsifying operator (SISSO) method52,53 can be applied to generate sparse analytical descriptors of a material's properties.
The SINDy framework has been applied to determine chemical reaction mechanisms using an approach termed Reactive SINDy (see ref. 41). This interesting application of SR to chemical reaction networks produced sparse reaction mechanisms in good agreement with the input data and illustrates the potential of dynamical SR in macroscale chemical reaction dynamics. The Reactive SINDy approach does have limitations that reduce its utility and robustness including: (1) the user must propose a collection of reaction ansatz, i.e., the user must guess what reactions are present in a process. This requirement can be cumbersome and time-consuming especially if the number of species being studied is large or little is known about the physical process being examined. (2) There are no constraints on the reaction rates in the derived mechanism which can lead to unphysical results such as negative concentrations when the system is integrated. (3) Fast-slow dynamics54 are not well-described. For example, if there are rate constants that differ by multiple orders of magnitude the SINDy approach will generally prune the slow process from the derived dynamical system. This poses a problem because elimination of reactions with small rate constants can eliminate reaction pathways that are vital to the overall mechanism. While these problems are not present in all chemical reaction networks/mechanisms, they do limit the applicability and utility of Reactive SINDy in some cases.41 Other approaches such as SINDy – CRN42 where CRN stands for Chemical Reaction Network, and the one defined in ref. 55 seek to alleviate some of these problems.
Other methodologies for extracting dynamical systems from data have been developed and applied to good effect.56–63 In the context of chemical reaction mechanism discovery, it would be advantageous to developing a method that (a) gives the explicit individual reactions involved in a process, (b) gives the stoichiometry of those reactions—a fundamental property in the analysis of chemistry and chemical reactions, (c) does not rely on neural network formulations of the chemical reaction network, as they can reduce interpretability and accurate extrapolation (i.e., accurate time-series forecasting), (d) can detect hidden variables such as unknown chemical intermediates in a reaction mechanism, (e) does not require a postulated set of potential reactions be included as reaction ansatz, (f) accurately returns the rate constants for each reaction, and (g) is robust to noise in data.
In this work, we develop and apply a stoichiometrically informed symbolic regression (SISR—pronounced “scissor”) tool to determine chemical reaction mechanisms and chemical kinetic equations from time-series concentration data. Our specific technical advance is to apply a physics-informed mathematical formalism that accounts for intrinsic stoichiometry in a chemical reaction to automate the discovery of accurate chemical reaction mechanisms from data. The developed method returns sparse and interpretable analytical forms for a reaction mechanism discovered from data. A genetic optimization approach is employed to search the symbolic space of possible reaction mechanisms to find the one that best matches a time-series dataset of chemical concentrations. That genetic approach is coupled with the stoichiometrically-informed method to fit the rate constants in a reaction mechanism through differential optimization. Applying the method results in excellent agreement between true and predicted mechanisms over data from multiple linear and nonlinear reaction schemes. The agreement is shown to persist over sparse and noisy datasets, such as those that would typically be obtained from experiments.
The remainder of this article is organized as follows: Section 2 contains details of methods and the formalism that are applied including the genetic search procedure over the symbolic reaction space and the numerical procedures used to fit the rate constants in those reactions. In Section 3, the results of the method for several model reactive schemes and chemical reaction networks are presented. Conclusions and future directions are discussed in Section 4.
The mechanisms are fit and constructed in the derivative space of the concentration data. Here, the numerical derivatives of the concentration data, obtained using finite difference methods, are represented by
Fitting to derivatives is a common approach used to discover symbolic dynamical systems from data because it allows the derivative functions, for example,
, to be constructed directly as opposed to constructing the primary functions, x(t), y(t), z(t), and then deriving the dynamical system from those functions.40,64 We apply this same approach.
A symbolic reaction mechanism M is a collection of chemical reactions
| M = [rxni, rxnj, rxnk,…], | (1) |
![]() | (2) |
The size of a mechanism is given by its cardinality |M| which is the number of reactions in the mechanism.
To illustrate the SISR method, consider the abstract chemical reaction example
![]() | (3) |
| rxn(r) = [s(r)1, s(r)2,…, s(r)N], | (4) |
| rxn(p) = [s(p)1, s(p)2,…, s(p)N], | (5) |
![]() | (6) |
![]() | (7) |
The reactant vector for this reaction is
| rxn(r) = [2, 3, 1], | (8) |
| rxn(p) = [1, 2, 0], | (9) |
| rxn = [2, 3, 1] ⊕ [1, 2, 0] = [2, 3, 1, 1, 2, 0]. | (10) |
Using this formalism, chemical reactions can be expressed in convenient mathematical form that maps to a symbolic representation for analysis and optimization.
Each total mechanism has a corresponding system of symbolic kinetic equations defined by:
![]() | (11) |
The kinetic equations are integrated to determine the time evolution of the concentration of each species.
![]() | (12) |
This is a mathematical statement that a reaction on the reaction list must not exceed the predefined reaction order. Typical values for O will be O = 1 if only first order (and optionally zeroth order) reactions are to be included or O = 2 if first and second order reactions (and optionally zeroth order) are to be included, although higher reaction orders can also be used. The last N elements in the reaction vector are the stoichiometric coefficients of the products, the sum of which in the SISR method is constrained by:
![]() | (13) |
The number of possible reactant vectors is given by N + O Choose O, i.e.,
, and the number of possible product vectors is (O × R)N. The final list of reactions is formed by combining every possible reactant vector with all possible product vectors and eliminating redundant or stoichiometrically prohibited reactions.
We utilize an islanding procedure in the SISR genetic algorithm, where multiple islands are created, each containing mechanisms with a fixed number of reactions, |M|. Islanding is a technique used in genetic algorithms where the population is divided into distinct subpopulations (here based on the number of reactions in a mechanism), each evolving separately on different “islands”. The mechanisms in each island maintain a constant number of reactions throughout the evolutionary process. The genetic algorithm is then applied separately to each island, allowing independent evolution of mechanisms with their respective constraints on the number of reactions. No information is transferred between islands during the search procedure. The islanding approach promotes diversity among potential solutions and mitigates the need to address mechanism complexity (or other factors) for optimal mechanism selection during the search on each island. This ultimately leads to more efficient optimization and also makes the search easier to parallelize.
| G1 = [M1, M2, M3,…], | (14) |
After constructing symbolic expressions for each reaction mechanisms in the generation, the next step is to fit the rate constants for each mechanism to the data. Our aim is to minimize the discrepancy between the true concentration derivatives
and the predicted derivatives
for each species i where the hat notation signifies that the derivatives arise from fits to the data. The time derivatives are computed using a second-order accurate central difference for interior points and first-order accurate forward/backward differences at the boundaries. This approach is able to handle both uniformly-spaced and nonuniformly-spaced time-series datasets.
The rate constant fitting is achieved using the mean squared error (MSE) defined through the loss function in the derivative space
![]() | (15) |
![]() | (16) |
The time derivatives for each species in a mechanism are fit to expressions that encode stoichiometric information in the form:
![]() | (17) |
| G(fit)1 = [M(fit)1, M(fit)2, M(fit)3,…], | (18) |
![]() | (19) |
The initial generation of mechanisms is then sorted and ranked based on fitness defined by the MSE in the derivative space, i.e.,
.
fittest mechanisms from the previous generation Gi are kept for the next generation Gi+1 using an elitism fraction
. Therefore, in each subsequent generation after the first, nnew = nmech − nbest new mechanisms must be created using information from the best mechanisms from the previous generation. This process is performed using crossover methods and then mutation methods.Crossover involves taking information (reactions, reactant vectors, and/or product vectors) from the best performing mechanisms and using that information to generate new mechanisms. The probability of a mechanism from the previous generation being involved in a crossover event comes from the ranked-based selection:
![]() | (20) |
The crossover mechanism is generated by choosing two mechanisms (the parents) randomly according to the probability pi. The reactions contained in the two selected parent mechanisms are then combined into a gene pool which is a collection of all the reactions involved in the chosen mechanisms. The gene pool is edited so reactions only appear once. Two offspring mechanisms are then created by randomly selecting reactions from this gene pool. The offspring are created by randomly and sequentially selecting reactions from the gene pool until all the chemical species that are involved in a process are included in children mechanisms and the number of reactions in the mechanism is equal to the cardinality of the specific island size being evolved. For example, consider the two parent mechanisms from an |M| = 4 island:
| M1 = [rxn1, rxn2, rxn3, rxn4], | (21) |
| M2 = [rxn5, rxn6, rxn7, rxn8]. | (22) |
The gene pool for these parents is
| GP = [rxn1, rxn2, rxn3, rxn4, rxn5, rxn6, rxn7, rxn8]. | (23) |
Using this gene pool to generate two children, results in mechanisms such as:
| Mchild1 = [rxn1, rxn2, rxn5, rxn6]. | (24) |
| Mchild2 = [rxn1, rxn5, rxn7, rxn8]. | (25) |
Crossover can be chosen to occur over specific reactions as described above, or in the reactant and/or product vectors separately. The latter will be advantageous when examining chemical processes with large number of chemical species.
Once the new generation of mechanisms is created using crossover, a random number of the newly-generated mechanisms are selected to be mutated. Mutation is not performed on the nbest elite mechanisms, only on the nnew new mechanisms generated using crossover. This is to retain the best solutions, otherwise the mutation could take an optimal solution and change it, removing important reaction information from the overall gene pool. Throughout this work we use a mutation rate
of 0.1, meaning 10% of the new mechanisms are mutated. In the mutation procedure, first, a random mechanism is selected from nnew mechanisms. Next, a random reaction is selected from the selected mechanism and is substituted for another reaction from the original reaction list, with all reactions on this list being equally probable. Finally, the new mechanism with the substituted reaction is checked to see if it satisfies the constraint that all chemical species in the dataset are included. If so, the new mechanism is substituted with original mechanism before mutation. If not, then the mutation procedure starts over by selecting a new mechanism to mutate, and the original mechanism stays on the list of mechanisms.
The rate constants in the new generation are then fit to constrict the new generation of fit mechanisms G(fit)i+1 and a corresponding set of values for the loss function
. Then, the sort → crossover → mutate algorithm starts again until a set number of generations are evolved.
![]() | (26) |
The complexity metric applied in this work arises from counting the nodes in an expression tree that represents the kinetic equations for a mechanism (see eqn (11)). Symbolic expression trees represent analytical functions in a hierarchical structure, where each node represents an operation or operand. The overall complexity metric is a sum over the number of nodes in the expression tree for each species in a mechanism:
![]() | (27) |
![]() | ||
| Fig. 2 Expression tree complexity analysis for the example mechanism shown in eqn (28). | ||
Several modifications to typical expression tree construction are implemented:
(1) We want larger stoichiometric coefficients to contribute more to the complexity in comparison to smaller coefficients. Therefore, all stoichiometric coefficients are written as a separate node or subtree. If the stoichiometric coefficient s = 1, it is written as a single node. If the stoichiometric coefficient s > 1, it is expressed as a subtree with s + 1 nodes where the additional node is due to the “+” operation.
(2) The “power” operation is not used to write nonlinear terms. Instead we write nonlinear terms such as [A]2 using the product operation. This is because terms like [A]2 should contribute the same complexity as terms like [A][B].
In general, chemical reaction mechanisms have specific mathematical forms that contain polynomials but do not include other types of functions such as trigonometric or exponential functions. This simplifies the complexity calculation because we do not have to decided how to weight these different functions in the complexity hierarchy.
To illustrate how the complexity metric is implemented, consider the example mechanism
![]() | (28) |
![]() | (29) |
Fig. 2 illustrates the expression trees for each of the three equations in this system. Three key points are:
(1) In the subtree for the term 2k2[B] in the
equation, note that the stoichiometric coefficient is broken into a sum. This adds a penalty on the stoichiometric values in the expression complexity.
(2) Nonlinear terms such as k1[A]2 contribute more complexity than linear terms such as k1[A]. This agrees with physical intuition that nonlinear expressions are more complex than linear expressions.
(3) All rate constants contribute the same to the complexity. Meaning no penalty is placed on the values of the rate constants. This is to avoid problems with coefficient thresholding that can arise in other methods.
The overall goal of the SISR procedure is to solve the multiobjective optimization problem
![]() | (30) |
![]() | (31) |
![]() | (32) |
100
000] in units of seconds was fed into the SISR algorithm. For this mechanism, the SISR method was evolved for 10 generations over 5 islands with number of reactions |M| = 2–6. A population size of 2000 was used for the |M| = 3–6 islands while a population size of 500 was used for |M| = 2 island because of the smaller number of possible mechanisms on that island. The maximum reaction order was O = 2. The mutation rate was
. The elitism was
(meaning the top 10% of the solutions were retained across generations) except for the |M| = 2 island where the elitism was
due to the smaller population size.
The symbolic reaction extracted by SISR is:
![]() | (33) |
1 = 6.310 × 10−5 s−1,
2 = 1.262 × 10−4 s−1, and
3 = 3.156 × 10−5 s−1 in excellent agreement (0.0317% error or better) with the true rate constants. The results of the fitting are shown in Fig. 3. The concentrations predicted by the SISR method closely match the time evolution of the true mechanism, as illustrated in Fig. 3(a). In fact, at the presented level of visual fidelity, the ground truth data and the SISR result are indistinguishable. An important observation is that even though SISR searches over a symbolic space of nonlinear functions, the true linear mechanism is selected as the best model. Fig. 3(b) illustrates a comparison between ground truth and SISR results for the scaled numerical derivatives,
![]() | (34) |
![]() | ||
| Fig. 3 Time evolution of the (a) concentrations and (b) scaled derivatives—eqn (34)—of each species in the sequential linear mechanism given in eqn (32). The solid lines are the results of the SISR method and the corresponding black markers are a subset of the data used by SISR to extract the reaction mechanism and fit the rate constants. The concentrations are shown in units of millimolar and time is shown in units of seconds. | ||
We have found that for the linear sequential mechanism, the SISR search converges to the true symbolic mechanism using as few as 20 data points in the fitting procedure. This supports the possibility that SISR will perform well on sparse kinetic data, for example, on the types of data that could be generated in some experimental setups. As expected, the more data points that are used in the fitting the closer the extracted rate constants become to the true rate constants. It is also interesting to note that the fitting method gives good agreement even when a relatively small number of data points are used.
Fig. 4 illustrates how the minimum derivative error for each generation i, i.e.,
, changes over 10 generations of the SISR genetic algorithm. In this case, the data that was fed into the SISR method was concentration data from 50 equally spaced time points over the time interval [0,
100
000]. Each color curve in Fig. 4 represents the results for a different island size in the SISR algorithm. Fig. 4 demonstrates that the genetic evolution of mechanisms results in monotonically decreasing error over each generation. This shows that the SISR algorithm is optimizing the symbolic mechanism and the genetic search tends toward an optimal mechanism on each island. For |M| = 2, only two reactions are included in each mechanism, and so the search space of possible mechanisms is small—approximately 104 mechanisms. Therefore, the optimal solution for that island is found after only a couple of generations because each generation includes a large portion of the total search space of possible mechanisms. However, the error generated on the |M| = 2 island is large compared to the other islands. The ground truth mechanism has |M| = 3 reactions, and there is a dramatic (approximately two orders of magnitude) drop in error when going from the |M| = 2 island to the |M| = 3 island. This illustrates an important result because to identify the correct mechanism a heuristic argument is to look for the optimal mechanism on the island that occurs immediately after a steep drop in error with respect to variation in mechanism size. Later we will give a more rigorous definition for mechanism selection, although we have found this heuristic observation to be an accurate identifier over the systems examined in this work.
![]() | ||
Fig. 4 Minimum derivative error, , as a function of generation using SISR to extract mechanisms from the ground truth data generated using the mechanism given in eqn (32). The y-axis is shown on a log scale. Each curve is the result of a different island in the SISR method, where each island contains mechanisms with the number of reactions |M| shown in the legend. | ||
In the derivative space, we see an interesting result which is that while the true mechanism has |M| = 3 reactions, the islands with |M| = 4, |M| = 5, and |M| = 6 reactions actually generate lower error than the true mechanism, although this is not the case in the concentration space as we will show next. The cause of this reduction in error when adding spurious and erroneous reactions to the true mechanism size is that as more reactions are added, the algorithm has more coefficients to adjust in the fitting procedure, thereby fitting the data more closely. This means that the mechanisms on those islands are overfitting the data and doing so on symbolic functions that do not best fit the data. This also illustrates the need for a complexity measure when making the final mechanism selection, because if only minimum error was used then the wrong mechanism would be chosen in this case.
Fig. 5 is a plot of the value for the loss function in the concentration space
as a function of the mechanism complexity. Each marker in the plot represents the mechanism with the lowest derivative error on each island. This Pareto front plot illustrates that the |M| = 3 mechanism is the point on the Pareto front where increasing complexity results in limited improvement in the accuracy of the mechanism. Therefore the |M| = 3 mechanism is the optimal solution. This can be observed because there is a steep drop in error when going from the |M| = 2 point to the |M| = 3 point, and while the complexity for |M| = 2 is smaller, the tradeoff in concentration error is too dramatic to make it the optimal solution. When comparing the |M| = 3 to |M| = 4 mechanisms, there is limited improvement in concentration error when adding a new reaction, however there is a large increase in complexity. Interestingly, the |M| = 5 and |M| = 6 mechanisms result in an increase in error with respect to the |M| = 3 mechanism. This illustrates that including more terms in the symbolic mechanisms, i.e., including more reactions, does not necessarily result in improved accuracy but will generally result in a higher complexity. The error metrics used to construct the Pareto front in Fig. 5 are shown in Table 1 along with the corresponding derivative error for each mechanism.
to the same dataset used in the deterministic solutions of the sequential linear mechanism given in eqn (32). A Savitzky–Golay (SG) filter was applied to the concentration data, and that filtered data was used as the SISR input. This is a general approach that has worked previously due to known problems with numerical derivative calculations in the presence of noise.64 The derivatives are computed after the smoothing is performed. The result of the SISR on the noisy data is shown in Fig. 6, with excellent agreement observed between the noisy data and the mechanism predicted by SISR. The SISR method is able to find correct mechanism (the same as in eqn (32)). The rate constants extracted from the data were
1 = 6.346 × 10−5 s−1,
2 = 1.279 × 10−4 s−1, and
3 = 3.091 × 10−5 s−1 in strong agreement with the true rate constants. Note the specific rate constant values extracted from noisy data will depend on the specific realization of the noise. An important point is that the SISR method is stochastic and not guaranteed to find the optimal solution (in this case meaning the correct mechanism) for each realization of the noise. Overall, this result illustrates the ability of SISR to extract the true reaction mechanism on data with high levels of noise.
![]() | ||
| Fig. 6 Time evolution of the concentrations of each species in the sequential linear mechanism given in eqn (32). The SISR fit for each species extracted from noisy data is shown by a solid curve. The markers are a subset of the noisy data used by SISR. | ||
Fig. 7(a) is the SISR result for the case in which data from species A, C, and D are used (the unknown intermediate in this case is species B), and SISR is applied to find a mechanism containing only those species. In this case, a poor fit and mechanism is obtained for the available data. The reason for this is that, without including the hidden intermediate, the model cannot reproduce the correct time evolution of the concentrations or reaction pathways that connect the observed species. Now, compare those results with the results in Fig. 7(b) where SISR is allowed to search for mechanisms that include one additional (previously hidden) species. In this case, the recovered mechanism correctly identifies the presence of the missing intermediate and yields a significantly improved fit to the data. Specifically, the error computed using eqn (15) drops by a factor of approximately 105 compared to the case shown in Fig. 7(a) that does not include an intermediate in the reaction mechanism. It is interesting to note that even though only data from three species is used, the SISR approach converges to the exact correct underlying mechanism. Therefore, including the intermediate species in the reaction mechanism, despite the absence of direct data for that species, yields a significantly improved mechanism. This demonstrates that the dataset implicitly contains evidence of a hidden intermediate, which is consistent with the known (see eqn (32)) underlying reaction mechanism.
![]() | ||
| Fig. 7 Time evolution of the concentrations of the species in the sequential linear mechanism given in eqn (32). Panel (a) shows the case in which data for species A, C, and D are fed into SISR but no hidden variable (a new chemical species) is allowed in the mechanism and panel (b) shows the same case but with a hidden variable being allowed. In both cases, the error function only includes species A, C, and D. The solid lines are the SISR fit and the dashed lines are the true data. | ||
We have also confirmed that a similar level of improvement is observed using hidden variables in the reaction mechanism when other species are removed from the data set. So, for example, when data for species A, B, and D are used with C being excluded and when data from species A, B, and C are used with D being excluded. In all cases we have studied for this mechanism, the application of SISR yields the exact true mechanism and also detects the presence of a hidden variable.
This proof-of-concept illustrates how SISR can be used to detect and then fit hidden chemical intermediates. Further work in this area will focus on constructing a multidimensional Pareto front that also includes the number of species involved in the process as an optimization dimension. In a many scenarios, for example in many biological systems, in combustion modeling, and in microkinetic modeling, there could be a large number of hidden variables. The presented results detail the case when there is only a single hidden variable, and therefore opportunities exist for future work to examine the ability of SISR to handle more complex cases with multiple hidden variables.
![]() | (35) |
![]() | (36) |
To generate this mechanism, a population of mechanisms was evolved for 20 generations over 6 islands with |M| = 2–7 reactions. The population size was 2000 for the islands with |M| = 3–7 reactions and 500 for the |M| = 2 island. The mutation rate was 0.1 and the elitism was 0.1 for all islands except the |M| = 2 island where the elitism was 0.4 due to the smaller population size. The maximum reaction order was O = 2. The ground truth data was generated using 1000 equally-space points over the time interval [0, 20]. The search space involves 57 reactions that can be combined into ∼3.1 × 108 possible mechanisms considering mechanisms with 2–7 reactions.
The model extracted from the data by the SISR method is
![]() | (37) |
1 = 0.1003,
2 = 0.1029,
3 = 1.001,
4 = 1.001,
5 = 0.996, in strong agreement (2.81% error or better) with the true rate constants.
The SISR results are shown in Fig. 8 with panel (a) containing a comparison between the true and predicted concentration data and panel (b) containing the same comparison for the scaled derivative data. Excellent agreement is observed between the SISR result and the ground-truth data in both cases. The symbolic reaction mechanism is fitted on the derivative data, and Fig. 8(b) illustrates a primary advantage in using SISR: due to the inclusion of stoichiometric constraints, there is a distinct lack of overfitting the discovered mechanism, i.e., in the dynamical system that is extracted from the data.
![]() | ||
| Fig. 8 Time evolution of the (a) concentrations and (b) scaled derivatives—eqn (35)—of each species in the Lotka–Volterra mechanism with social friction given in eqn (36). The solid lines are the results of the SISR method and the corresponding black markers are a subset of the data used by SISR to extract the reaction mechanism and fit the rate constants. The concentrations and time are shown in arbitrary units (a.u.). | ||
Fig. 9 illustrates how the minimum derivative error changes over 10 generations when applying SISR to the Lotka–Volterra mechanism. Again, as in the previous mechanism, the genetic evolution results in monotonically decreasing error over each generation for every island size. The error generated on the |M| = 2 island is the largest. As more reactions are added to the mechanism, the error decreases, which can be observed by comparing the results for each island size. The ground truth mechanism has |M| = 5 reactions, and the most dramatic (approximately two orders of magnitude) drop in error is observed when going from the |M| = 4 island to the |M| = 5 island. So for this reaction mechanism, the heuristic argument that the optimal mechanism occurs on the island that follows immediately after a steep drop in error with respect to variation in mechanism size would yield the selection of the true mechanism. While the true mechanism has |M| = 5 reactions, the island with |M| = 6 reactions generates a lower error than |M| = 5 island, although this difference is small (1.35 × 10−5 for |M| = 5 compared to 1.33 × 10−5 for |M| = 6). The reduction in error when adding spurious reactions to the mechanism is due to the algorithm having more coefficients to fit in the symbolic model.
![]() | ||
Fig. 9 Minimum derivative error, , as a function of generation using SISR to extract mechanisms from the ground truth data generated using the Lotka–Volterra mechanism given in eqn (35). Each curve is the result of a different island in the SISR method, where each island contains mechanisms with the number of reactions |M| shown in the legend. | ||
Fig. 10 is a plot of the value for the loss function in the concentration space
as a function of the mechanism complexity for the Lotka–Volterra mechanism. The Pareto front shown in the plot illustrates that the |M| = 5 mechanism is the point where increasing complexity by adding more reactions results in limited improvement in the accuracy of the mechanism. Therefore the |M| = 5 mechanism was chosen as the optimal solution. There is a steep drop in error when going from the |M| = 2 point to the |M| = 3 point and from the |M| = 3 point to the |M| = 4. Comparing the |M| = 4 point and the |M| = 5 point (the true mechanism size) we see approximately a three orders of magnitude decrease in error. Therefore, despite the smaller mechanisms having less complexity, the tradeoff in increased error is too dramatic to make the smaller mechanisms the optimal solution. Compare this with the results for the |M| = 5 and |M| = 6 mechanisms where there is limited improvement in concentration error when adding a new reaction, but there is a significant increase in complexity. The |M| = 7 mechanism shows an increase in error with respect to the |M| = 5 and |M| = 6 mechanisms. The error metrics used to construct the Pareto front in Fig. 10 are shown in Table 2 along with the corresponding derivative error for each mechanism.
![]() | (38) |
![]() | (39) |
Applying the Reactive SINDy method will result in similar problems to regular SINDy unless hyperparameters are tuned for the specific mechanism and a suitable basis set of candidate reactions are constructed. Although it should be noted that after performing these tuning and construction tasks, Reactive SINDy generates an overall mechanism (reactions and rate constants) in excellent agreement with the true Lotka–Volterra mechanism.41
was added to the deterministic solutions of the system in eqn (36). We then applied a SG filter to the concentration data. The result of using this data as input to SISR is shown in Fig. 11, with excellent agreement observed between the noisy data and the mechanism predicted by SISR. This illustrates the ability of SISR to perform on concentration data with noise for a oscillatory reaction, a common situation in biological systems.
![]() | ||
| Fig. 11 Time evolution of the concentrations of each species in the Lotka–Volterra mechanism given in eqn (36). The SISR fit for each species extracted from noisy data is shown by a solid curve. The markers are a subset of the noisy data used in the SISR process. | ||
![]() | (40) |
The kinetic equations for this system are
![]() | (41) |
![]() | ||
| Fig. 12 Time evolution of the concentrations of each species in the nonlinear mechanism given in eqn (41). The solid lines are the results of the SISR method and the corresponding black markers are a subset of the data used by SISR to extract the reaction mechanism and fit the rate constants. The concentrations are shown in units of millimolar and time is shown in units of seconds. | ||
The data fed into SISR was 2000 equally spaced time points over the interval [0, 20
000] in units of seconds. The density of data was the same in the fast region as in the slow region. For this mechanism, the search space involves 54 reactions that can be combined into ∼2.9 × 107 mechanisms. Respective population sizes of 2000 and 500 were evolved for 20 generations using islands with |M| = 3–5 and |M| = 2 reactions. The mutation rate was
and the elitism was
. The maximum reaction order was O = 2. The best fit model found by the SISR method is
![]() | (42) |
Again, in exact agreement with the true mechanism. This mechanism was selected from the Pareto front in the complexity vs.
space using the same multiobjective procedure described previously. The rate constants in the found reaction mechanism are
1 = 1.261 × 10−6 mM−1 s−1,
2 = 9.121 × 10−6 s−1, and
3 = 2.750 × 10−8 mM−1 s−1 which very closely match the true rate constants (4.39% error or better). The results of the SISR model are shown in Fig. 12 with excellent agreement observed between the concentrations predicted by the SISR method and the concentrations calculated using the true mechanism. For times above 20
000, the SISR model is forecasting, meaning no training data was used from that time interval. Excellent agreement is observed in the forecasted region.
000]. This resulted in a set of kinetic equations:
![]() | (43) |
equations involving a first-order reaction in species B, but the SINDy result does not. Fig. 13 illustrates a comparison between the true data and the dynamical system derived using SINDy. Notice that the slow rise of species C is not captured in the SINDy model. Compare this with the SISR result in Fig. 12 where that slow process is well captured. It should be noted that SINDy was not developed for chemical reaction mechanisms and is overall agnostic to the specific physics of a system. Although this can be a strength in certain situations, it hinders the discovery of accurate chemical reaction mechanisms. The SISR method, however, encodes physical information about chemical reactions and stoichiometry and therefore performs better with respect to accuracy on the selected examples in this work.
![]() | ||
| Fig. 13 Time evolution of the concentrations of each species in the nonlinear mechanism given in eqn (41). The solid lines are the results of the SINDy method and the dashed lines are the true data. The concentrations are shown in units of millimolar and time is shown in units of seconds. | ||
![]() | (44) |
![]() | (45) |
The data fed into SISR was 400 equally spaced time points over the interval [0,2] in units of seconds. For this mechanism, the search space involves 182 reactions. A population size of 4000 was evolved for 20 generations for islands with |M| = 3–5 reactions and a population size of 500 was used for the |M| = 2 island. The mutation rate was
and the elitism was
. The population size for the |M| = 3–5 islands was doubled from previous examples due to the larger number of possible reactions. The maximum reaction order was O = 2. The SISR result is
![]() | (46) |
1 = 0.993 mM−1 s−1,
2 = 0.084 s−1, and
3 = 1.000 s−1, compared to the true rate constants.
The results of SISR on the MM model is shown in Fig. 14, with excellent agreement observed between the SISR-predicted concentrations and the ground truth data. The dynamics of the MM model are complex and involve competing processes that occur over different time scales. There are fast processes such as the fast decay of species E and the corresponding fast rise of species ES, and also slow processes such as the rise of species E and P to the steady-state values. SISR captures these processes well.
![]() | ||
| Fig. 14 Time evolution of the concentrations of each species in the Michaelis–Menten kinetic mechanism given in eqn (45). The solid lines are the results of the SISR method and the corresponding black markers are a subset of the training data used by SISR to extract the reaction mechanism and fit the rate constants and the gray markers are testing data that was not used to fit the SISR model. Data from the light blue region is used to train SISR while data from the white region is forecasted. The concentrations are shown in units of millimolar and time is shown in units of seconds. | ||
The region shown in blue in Fig. 14 is the training region—data from this region was used to extract the mechanism and fit the rate constants. The white region in Fig. 14 shows the SISR result on data that was not used for mechanism discovery. Data in this region can be used to validate the SISR mechanism on unseen data and to assess the capability of SISR for time-series forecasting of chemical concentrations. Excellent agreement is observed between the ground truth data and the forecasted concentrations. This illustrates one of the principal advantages of using SISR for reaction discovery—because of the stoichiometrically-informed construction of the reaction mechanism, it is able to accurately forecast concentrations at future time points while avoiding the overfitting and extrapolation errors that are common in black-box machine learning.
![]() | (47) |
![]() | (48) |
We assume that the enzyme has a steady state concentration of 1 µM and that we have a priori knowledge of the enzyme interactions. Two sets of rates constants are examined. The values in the first set are k1 = 4.45 × 10−4 s−1, k−1 = 3.03 × 10−4 s−1, and k2 = 2.78 × 10−3 µM−2 s−1. In the second set, the value of k2 is increased by an order of magnitude to k2 = 2.78 × 10−2 µM−2 s−1. These values are based on work in ref. 74.
For the first set of rate constants, using SISR with a population size of 2000 and with 500 data points over the interval [0, 10
000], SISR discovered the correct reactions and reaction mechanism:
![]() | (49) |
The extracted rate constants were
1 = 4.45 × 10−4 s−1,
−1 = 3.23 × 10−4 s−1, and
2 = 2.73 × 10−3 µM−2 s−1 in strong agreement with the true values. A comparison between input data and the SISR fit is shown in Fig. 15(a).
![]() | ||
| Fig. 15 Time evolution of the concentrations of chemical species in the glucose oxidation reaction mechanism given in eqn (48). Panel (a) shows the case with k2 = 2.78 × 10−3 µM−2 s−1 and panel (b) shows the case with k2 = 2.78 × 10−4 µM−2 s−1. The solid lines are the results of the SISR method and the corresponding black markers are a subset of the data used by SISR to extract the reaction mechanism and fit the rate constants. The concentrations are shown in micromolar and time is shown in seconds. | ||
For the second set of rate constants, a population size of 4000 with 1000 data points over the time interval [0, 20
000] in units of seconds was used. Again, the exact reaction mechanism was found using SISR, illustrating the ability of SISR to account for fast/slow dynamics. The extracted rate constants were
1 = 4.40 × 10−4 s−1,
−1 = 2.56 × 10−4 s−1, and
2 = 2.63 × 10−2 µM−2 s−1 in good agreement with the true values. The result of the SISR fit is shown in Fig. 15(b).
With respect to the performance of SISR on fast/slow systems, three points are of note: (1) compared to coefficient thresholding-based methods in the literature (such as SINDy-based approaches), SISR appears better able to extract fast/slow dynamics, (2) samples are taken in the fast region in the training data which is why the SISR method can recover the dynamics in that region, however inference in the absence of training data may capture these dynamics, and (3) further investigation on the use of SISR for fast/slow and stiff chemical kinetic systems is an important line of inquiry.
Compared to existing approaches for symbolic dynamical system discovery such as SINDy and its variants,40–42 the SISR method alleviates key limitations by (a) incorporating stoichiometric constraints, (b) reducing the reliance on predefined reaction assumptions, and (c) improving robustness in handling fast-slow dynamics due to the reduction of hyperparameter tuning and thresholding procedures. While SINDy-based approaches can offer significant computational advantages in comparison to SISR, for situations in which physical insight, interpretability, accuracy, and the ability to forecast future system states are important, SISR offers strong advantages. All the SISR computational operations presented in this work can be performed on personal computers with standard commercial hardware. Generally and broadly, we found that to complete the SISR search over the defined spaces would take hours to a day in wall time. This computational speed will be sufficient for analysis of many datasets. When a large number of species are included in the data, the SISR process can be scaled up simply using parallel computing. For example, for each generation in the genetic algorithm, the evaluation of each candidate mechanism on each island is not correlated with the other candidate mechanisms, and therefore the search process can be parallelized.
A key limitation of the current approach is that the rate constants extracted from the data are purely numerical values and not symbolic functional forms that can be applied to account for changes in environmental and/or thermodynamic conditions such as changes in temperature or pressure. This can be a problem if the derived mechanism will be used to forecast the outcome of the chemical process at a thermodynamic state other than the state used to generate/measure the data. Additionally, due to this limitation, nonequilibrium cases in which the rate constants are varying in time as parametric coefficients12,48,75 or over thermodynamic conditions are currently outside the scope of the current approach. Future work will address nonequilibrium chemical reaction mechanisms.14,76,77 Improving the computational efficiency of the method using different optimization strategies is an important next step. Additionally, validation across a broad set of experimental datasets would further establish the utility of SISR on noisy, incomplete, and complex data, and work in this direction is currently underway.
| This journal is © The Royal Society of Chemistry 2026 |