Open Access Article
Shun
Hayashi
*
Division of Physical Sciences, Department of Science, National Museum of Nature and Science, Ibaraki 305-0005, Japan. E-mail: s-hayashi@kahaku.go.jp
First published on 28th July 2025
Automating the discovery of chemical reaction mechanisms can increase the efficiency of using experimental data to obtain chemical knowledge. In this study, a sparse identification approach was employed to determine reaction mechanisms, providing accurate and interpretable kinetic models while preventing overfitting. The main advantage of the proposed approach over conventional sparse identification algorithms is that it can be applied to cases with limited concentration profiles, which often occur in chemical reactions involving untraceable intermediates. To demonstrate its applicability to complex reaction mechanisms beyond the reach of classical kinetic analysis, the autocatalytic reduction of manganese oxide ions was selected as the target reaction. Although the concentrations of only two manganese species could be monitored via UV-vis absorption spectroscopy, the experimental data were sufficiently represented by 11 elementary steps involving 8 chemical species. This strategy enables the automated discovery of reaction mechanisms without relying on heuristic kinetic models, as the only assumption required is the composition of the intermediates.
Recent advances in both experimental and computational technologies have provided chemists with enhanced tools for analyzing reaction mechanisms. The development of analytical techniques has enabled in situ/operando identification of intermediates, providing crucial insights for manually constructing reaction mechanisms.9 The use of both automated experiments and reaction monitoring techniques allows for the discovery and optimization of reactions and the collection of kinetic data.10–13 The integration of reaction monitoring techniques with conventional kinetic analysis methods enables automated mechanistic studies.14 Computational techniques, such as kinetic Monte Carlo simulations15,16 and quantum chemical calculations,17,18 can also be used to identify plausible reaction paths. Furthermore, machine learning technologies have shown potential for predicting reactivity on the basis of reaction and mechanism databases19–21 and for automatically classifying reaction mechanisms.22 However, with respect to discovering reaction mechanisms on the basis of experimental data, the existing theoretical frameworks are insufficient, particularly considering advances in data collection techniques. Automating the exploration of reaction mechanisms requires selecting the minimal number of elementary steps from complexly entangled reaction networks and estimating the rate constants from experimental data.
Recently, equation discovery, a computational approach for extracting governing ordinary differential equations (ODEs) from experimental data, has emerged as a powerful tool for understanding nonlinear dynamical systems.23,24 Notably, an approach involving sparse regression, particularly sparse identification of nonlinear dynamics (SINDy), has attracted considerable attention because it allows for the identification of compact and interpretable models that capture the core dynamics of complex systems.25 Since chemical reaction mechanisms can be uniquely converted into ODEs or rate equations, according to the law of mass action, sparse regression offers a promising approach for determining the reaction mechanisms. When concentration profiles for all chemical species are available, this approach can be applied to identify chemical reaction mechanisms.26 However, only limited concentration profiles are typically available, as monitoring the concentrations of all chemical species, including dilute and reactive intermediates, is difficult.
Inspired by the sparse identification approach, this study presents a simple yet effective method for the data-driven discovery of chemical reaction mechanisms from limited concentration profiles. Since concentration profiles can be simulated by numerically solving initial value problems of rate equations,27–29 the rate constants are estimated by fitting the simulated profiles to experimental data. The important elementary steps are selected by eliminating those with negligibly small rate constants. In summary, at the cost of numerically solving rate equations, the discovery of chemical reaction mechanisms from limited concentration profiles can be viewed as a sparse regression problem for determining rate constants. To demonstrate the potential of this strategy, the mechanism of the autocatalytic reduction of manganese oxide ions was studied. The experimental data can be sufficiently represented by 11 elementary steps involving 8 chemical species. The proposed approach holds promise as a framework for model discovery in various fields of physics where comprehensive measurements are impractical.
The only requirement for our approach is the assumption of the compositions of the intermediates, which are used to generate the initial kinetic models. On the basis of the compositions of five known compounds (Mn7+, Mn3+, Mn2+, C2O42−, and CO2), we assumed the presence of five possible intermediates, namely, partially reduced/oxidized species of Mn7+ and C2O42− (Mn6+, Mn5+, Mn4+, C2O4−, and CO2−). The initial kinetic model was then generated by enumerating all possible elementary steps that satisfy mass conservation (Fig. 1b). The following two constraints were applied: (i) only steps involving 1 or 2 reactants were considered; (ii) certain steps were excluded on the basis of chemical knowledge. The first constraint is reasonable because reactions involving three or more reactants have lower collision probabilities than do reactions involving 1 or 2 reactants. According to the second constraint, the steps that involve CO2 as a reactant are omitted. While knowledge of molecular structures and chemical properties is not essential for developing a kinetic model, this information is useful for reducing the number of elementary steps, which is crucial in the model optimization process. Among the 64 steps prepared with this approach, the key elementary steps were identified via sparse regression.
The kinetic models describing chemical reaction mechanisms can be represented via three arrays. A set of elementary steps can be denoted by two integer matrices: one for the left-hand side (l) and one for the right-hand side (r) of the chemical equations. The corresponding rate constants are stored in a vector (k). The rate equations are uniquely generated from the three arrays according to the law of mass action (eqn (1)).
![]() | (1) |
![]() | (2) |
![]() | (3) |
The sparse regression approach discovers the intrinsic elementary steps from a comprehensive set of candidates. The steps with minimal contribution can be eliminated, as the rate constant becomes negligibly small in the L1-regularization algorithm. Among a variety of potentially analogous reaction paths, those involving traceable species are preferentially selected over those consisting solely of intermediates. This is because of the large rate constants associated with steps involving intermediates. Therefore, this approach eliminates redundant paths and generates models with reduced dependence on intermediates. The covariance matrix adaptation evolution strategy (CMA-ES),39 a state-of-the-art optimizer for continuous black-box functions widely used for hyperparameter tuning in deep neural networks,40 was used to solve the minimization problem. By gradually increasing λ, the elementary steps with rate constants below a certain threshold are progressively eliminated from the resulting models, thereby reducing the total number of elementary steps (Nstep). Starting from the “full” model (λ = 0), the λ value is continuously increased to find the “best” model and the “core” model (Nstep = 6–7) (Fig. 1D). The “best” model is defined as the model with the minimum Nstep that satisfies the condition MRSE < MRSEfull × 1.2.
As λ increases, the number of elementary steps decreases, whereas the residuals for the training data increase monotonically. The residuals for the test data initially showed little change but increased after λ reached 1.3 × 10−4 (Nstep = 13). This finding indicates that the full model (Nstep = 64) was overfitted, highlighting the importance of applying the sparse regression approach to identify generalized kinetic models. As λ is increased further, a core model with a minimal number of elementary steps is obtained, which is beneficial for interpreting the reaction mechanism. The advantage of this approach lies in its ability to provide both the generalized model and the minimal model from a set of all possible elementary steps.
The best 4 combinations shared the same three intermediates: Mn6+, Mn4+, and C2O4− (Fig. 2A and S4†). Interestingly, among the best 4 models, the model with fewer intermediates better fits the test data, whereas the model with more intermediates better fits the training data. These findings suggest that three intermediates are sufficient to explain the experimental data. The 3 worst models among the 12 models do not include Mn6+, implying that Mn6+ plays an important role in this reaction. Furthermore, for mechanisms involving 3 or 5 intermediates containing Mn6+, Mn4+, and C2O4−, simplified models were generated by increasing λ, and the residuals were plotted as a function of Nstep (Fig. 2B and S6–11†). These plots displayed similar trends regarding the residuals' dependence on Nstep: while the residuals remained relatively stable for Nstep > 11–13, they increased considerably with each increase in Nstep at smaller Nstep values. This suggests that the accuracy of the models depends more on the number of elementary steps than on the number of intermediates considered. While redundancy in the reaction mechanism does not improve the model's accuracy, omitting even a single critical reaction can lead to a significant decrease in accuracy. The core model with three intermediates (Mn6+, Mn4+, and C2O4−) can be interpreted as a combination of an autocatalytic reaction (Mn7+ + 2 Mn2+ → 3 Mn3+) and a sequential (Mn7+ → Mn3+) reduction of Mn7+. This leads to the formation of Mn3+, which is subsequently reduced to Mn2+ (Fig. 2C).
![]() | ||
| Fig. 2 Determination of key intermediates by comparison of the kinetic models. (a) The residuals (mean relative squared error, MRSE) are estimated for 3, 4, and 5 intermediate mechanisms for the test and training datasets with λ = 0. The points marked with asterisks in the legend represent the averages of 10 calculations (Fig. S4†). (b) Residuals plotted as a function of the number of elementary steps for 3 and 5 intermediate mechanisms containing three key intermediates, Mn6+, Mn4+, and C2O4−. (c) A set of elementary steps in the best model with 3-intermediate mechanisms. The six steps marked with asterisks represent the core model. | ||
To highlight the advantages of the proposed method over empirical approaches, the kinetic models generated in this study were compared with previously developed models in the literature and numerical models (Fig. 3 and S12–15†). The reported mechanism, consisting of 7 elementary steps involving three intermediates (Mn6+, Mn4+, and partially oxidized C2O42−), was heuristically proposed based on existing chemical knowledge and kinetic studies of various manganese oxide species.37 In this study, the initial literature model was developed by incorporating an essential step (Mn7+ → Mn6+) into the reported 7-step mechanism (Fig. S12†). The accuracy of the best model, which was developed based on the literature model, was better than that of the full model, indicating that our approach can be applied to heuristic models for estimating rate constants and reducing redundancy. The best model developed based on the 3-intermediate mechanism showed a better fit than the best model of the literature mechanism did, highlighting the importance of considering all possible elementary steps while avoiding overfitting. The numerical mechanism denotes the apparent kinetic model, which uses only the concentrations of known manganese oxide species (Mn7+, Mn3+, and Mn2+). The numerical model allows for the inclusion of up to five reactant elementary steps and the displacement of the sum of the oxidation states in each elementary step (Fig. S14†). The advantage of numerical models is that the absence of untraceable species enables model optimization via the conventional SINDy method,25,26 which does not require numerically solving ODEs. Nonetheless, the same algorithm was applied to our approach for comparison. The numerical model poorly fit the experimental data, suggesting the difficulty of constructing kinetic models from only the concentrations of traceable species. This suggests that our approach, which employs a composition-based assumption for undetected intermediates, is effective, as it accurately represents the underlying chemical reaction networks.
The proposed approach transformed the determination of reaction mechanisms into a minimization problem involving rate constants. This strategy is theoretically applicable to kinetic studies of all types of reactions. However, a key challenge arises from the dimension of the minimization problem (Nstep). Since the minimization problem was solved stochastically by CMA-ES, the reproducible convergence could not be guaranteed, especially in problems with large dimensions. According to ten independent calculations, the full model with 3 intermediates (Nstep = 20) converged to solutions with similar MRSE values, whereas that with 5 intermediates (Nstep = 64) showed some variability (Fig. S4a†). However, the sparse models with 3 intermediates (Nstep = 11) did not necessarily converge to mechanisms with the same set of elementary steps (Fig. S4b†). Starting from the full model (Nstep = 20), seven steps were consistently eliminated, five were retained with similar rate constant values (coefficient of variation of log10(k + 1) < 0.02), five were consistently retained but exhibited varying values, and one of the remaining three was included in each model, although the specific step varied across the ten calculations. These imply that the proposed approach is readily applicable to the discovery of simple mechanisms, although the resulting sparse model was not always consistent across calculations. For example, in a catalytic organic reaction involving two substrates (A and B), a coupled product (A–B), a catalyst (C), and three intermediates (A–C, B–C, and A–B–C), the total number of possible elementary steps is 29 (Fig. S5†). Therefore, the dimensionality of the optimization problem is not a critical constraint in this case. However, for the discovery of more complex reaction networks, such as those found in combustion chemistry and astrochemistry,41,42 reducing the dimension through chemical knowledge and other computational approaches is needed.
Footnote |
| † Electronic supplementary information (ESI) available: Detailed experimental procedures, computational methodologies, and optimized kinetic models. See DOI: https://doi.org/10.1039/d5dd00293a |
| This journal is © The Royal Society of Chemistry 2025 |