Open Access Article
M. Á. de Carvalho Servia
a,
K. K. (M. ) Hii
b,
K. Hellgardta,
D. Zhangc and
E. A. del Rio Chanona
*a
aDepartment of Chemical Engineering, Imperial College London, South Kensington, London, SW7 2AZ, UK. E-mail: m.de-carvalho-servia21@imperial.ac.uk; k.hellgardt@imperial.ac.uk; a.del-rio-chanona@imperial.ac.uk
bDepartment of Chemistry, Imperial College London, White City, London, W12 0BZ, UK. E-mail: mimi.hii@imperial.ac.uk
cDepartment of Chemical Engineering, The University of Manchester, Manchester, M13 9PL, UK. E-mail: dongda.zhang@manchester.ac.uk
First published on 11th August 2025
Microkinetic models are key for evaluating industrial processes' efficiency and chemicals' environmental impact. Manual construction of these models is difficult and time-consuming, prompting a shift to automated methods. This study introduces SiMBA (Simplest Mechanism Builder Algorithm), a novel approach for generating microkinetic models from kinetic data. SiMBA operates through four phases: mechanism generation, mechanism translation, parameter estimation, and model comparison. Our approach systematically proposes reaction mechanisms, using matrix representations and a parallelized backtracking algorithm to manage complexity. These mechanisms are then translated into microkinetic models represented by ordinary differential equations, and optimi/zed to fit available data. Models are compared using information criteria to balance accuracy and complexity, iterating until convergence to an optimal model is reached. Case studies on an aldol condensation reaction, and the dehydration of fructose demonstrate SiMBA's effectiveness in distilling complex kinetic behaviors into simple yet accurate models. While SiMBA predicts intermediates correctly for all case studies, it does not chemically identify intermediates, requiring expert input for complex systems. Despite this, SiMBA significantly enhances mechanistic exploration, offering a robust initial mechanism that accelerates the development and modeling of chemical processes. By automating microkinetic model generation from a data-first approach, SiMBA opens new avenues for future research in automated mechanism discovery.
Despite their importance, the manual construction of microkinetic models is a complex, time-consuming, and errorprone process.15,16 Traditional methods require experts to manually identify possible reaction steps and intermediates, a task that can involve analyzing hundreds of thousands of potential interactions. This meticulous process is not only slow but also susceptible to human error and often results in models that are either overly simplified or unnecessarily complex. The increasing complexity of modern chemical systems further exacerbates these challenges, highlighting the need for more efficient and reliable approaches. Consequently, there has been a significant shift towards the development of automated methods for constructing these models,17–27 exploiting advances in data-driven methodologies and computational resources to streamline and enhance the accuracy of the modeling process.
The general trend towards automation, or scientific machine learning, offers substantial benefits, including increased efficiency and reduced error rates, compared to traditional manual methods.28 Various algorithms for generating mechanisms have been developed, typically falling into two categories: combinatorial algorithms and algorithms based on reaction classes.16 In the former approach, the generation of the entire set of possible reactions is based solely on the congruence of the electronic configurations of reactants and products, utilizing graph theory and bond-electron matrix representations of molecules.18,29–31 These methods can produce highly detailed and comprehensive reaction networks. The latter approach involves algorithms that, after recognizing the compounds as belonging to a certain class, generate only those reactions known to be characteristic of that class. While this method produces more compact networks, it requires prior knowledge of existing reaction classes.32–35 Combinatorial algorithms often yield overly complex mechanisms that can hinder computational efficiency and interpretability whilst making experimental validation and parameter estimation challenging (and often impossible). Conversely, algorithms based on reaction classes are limited by the availability of pre-existing reaction knowledge which may be limited within the scope of mechanism discovery for novel reactions. For a more in-depth discussion of these methodologies, reviews by Ratkiewicz and Truong,34 Van de Vijver et al.36 provide valuable insights.
Recently, alternative automated approaches leveraging artificial intelligence have also emerged, such as the method introduced by Burés and Larrosa, which employs deep learning techniques to classify organic reaction mechanisms directly from kinetic data without requiring explicit derivation of rate laws or enumerating reaction structures beforehand.37 While conceptually distinct from our proposed method, their method similarly addresses the complexity and interpretability challenges inherent to mechanistic discovery, highlighting the growing role and complementary potential of machine learning-based strategies in reaction engineering.
In this work, we propose a new approach, the Simplest Mechanism Builder Algorithm (SiMBA), which is designed to circumvent the necessity for substantial prior knowledge required by reaction class approaches, and to avoid the proposal of overly complex mechanisms yielded by combinatorial approaches. This is done by tackling the problem of automated generation of mechanisms from a data-first perspective, ensuring that whatever mechanism is proposed, is both physically reasonable and only as complex as the data allows. SiMBA generates microkinetic models that progressively increase in complexity based on the provided data. The algorithm begins with the simplest possible mechanism, yielding the most straightforward microkinetic model. The complexity of the mechanism is then incrementally increased, thus increasing the number of parameters of the corresponding microkinetic model. This process continues as long as there is informational gain from the added parameters, which is evaluated using the Akaike Information Criterion (AIC). By balancing model simplicity and accuracy, SiMBA ensures the generation of robust and sensible microkinetic models, effectively bridging the gap between theoretical exploration and practical applicability. While alternative model discrimination measures could be employed, we chose the AIC based on prior work demonstrating its superior performance in selecting data-generating kinetic models from a set of candidates.38 This minimalist approach is structurally and fundamentally different than previous methods, in that the main objective is to discover the most accurate and parsimonious mechanism given the dataset available, with as little prior information as possible.
This research represents an advancement in the field of microkinetic modeling, offering a novel approach that overcomes many of the challenges associated with existing automated methods. By systematically generating, refining, and evaluating microkinetic models, SiMBA provides a robust framework for developing accurate and experimentally viable reaction mechanisms. The algorithm's ability to distill complex chemical processes into simple, yet precise models has the potential to accelerate the design and optimization of chemical processes across various industries. Ultimately, SiMBA can become a useful tool for chemists and engineers, facilitating the rapid discovery and refinement of microkinetic models, thereby advancing our understanding of chemical reactions in diverse contexts.
The rest of the paper is organized as follows: in Section 2 our proposed method is motivated and described in detail; in Section 3 we introduce the three case studies that are used to analyze the performance of SiMBA highlighting the data-generation procedure and the results of the study are presented and amply discussed along with the shortcomings of the proposed methodology; and in Section 4 the key findings are presented with a brief outlook on future research.
SiMBA is comprised of four key phases:
1. Mechanism generation phase: utilizes a parallelized backtracking algorithm to generate all physicallysensible mechanisms for a given set of complexity parameters: the number of elementary steps and intermediates. This phase ensures that only feasible reaction pathways are considered, significantly reducing the computational burden;
2. Mechanism translation phase: the proposed mechanisms, represented by a matrix, are converted into executable microkinetic models, specifically systems of ordinary differential equations (ODEs). This translation is crucial for transforming reaction networks into practical models that can be analyzed and simulated;
3. Parameter estimation phase: the kinetic parameters of the proposed microkinetic models are estimated by minimizing the error between the model predictions and the observed kinetic data. This is achieved using the Broyden-Fletcher-Goldfarb-Shanno (BFGS) optimization algorithm;
4. Model comparison phase: involves evaluating the generated models using the AIC to determine the best microkinetic model for a given iteration. This phase also decides whether further iterations and additional complexity provide enough informational gain to justify continuing the algorithm.
By systematically progressing through these phases, SiMBA ensures the development of robust, accurate, and computationally efficient microkinetic models.
Our methodology also offers a closed-loop approach for refining models if the SiMBA's output is unsatisfactory, whether due to conflicts with prior knowledge (e.g., belief that the microkinetic model should involve more/less chemical species) or due to poor model fitting (e.g., the model failing to accurately capture the non-linearities in the kinetic data). In such cases, the modeler can opt to conduct an optimal experiment specifically designed to enhance model discovery – using model-based design of experiments (MBDoE), more specifically the Hunter–Reiner criterion39 – and then integrate this new data with the initial dataset. With the additional experimental data, the methodology can be re-applied, allowing for iterative refinement and re-evaluation of the output. Practically, this discriminatory experiment could also serve to validate the models proposed in earlier iterations, rather than relying solely on the AIC. The process can be repeated as many times as necessary or until the experimental budget is exhausted. Fig. 1 visually represents the SiMBA workflow, highlighting the key phases of the methodology. The following subsections provide a detailed account of each of these phases.
The algorithm utilizes matrix representations to model molecular transformations, where each matrix corresponds to a potential reaction mechanism (i.e., each row accounts for an elementary step and each column accounts for a chemical species). This formalism allows the algorithm to handle complex molecular interactions in a structured manner, making it easier to apply checks and balances on the proposed mechanisms.
SiMBA is designed to ensure that only chemically sensible and stoichiometrically balanced reactions are proposed. It does so by adhering to specific rules, which will be elaborated on later in this section. These checks are crucial in maintaining the physical plausibility of the generated mechanisms. But before initiating the mechanism generation process, several key inputs are required:
• Number of elementary reactions: this input defines the smallest possible number of reactions, or elementary steps, that could lead to a feasible microkinetic model. These elementary steps are constrained by physical principles, such as the requirement that reactions typically involve at most two molecules (bimolecular interactions) and usually produce a maximum of two product molecules (i.e., four possible elementary reactions: (i) A → B, (ii) A + B → C, (iii) A → B + C, and (iv) A + B → C + D). This consideration significantly reduces the complexity of the potential mechanisms and aligns the generated reactions with known ones.
• Number of chemical species: this input specifies the minimum (i.e., lower limit) number of chemical species needed to form the smallest possible mechanism. The number of species is critical because it defines the scope of the mechanism generation, ensuring that all necessary reactants, products, and intermediates are considered. This parameter also helps in maintaining the balance between complexity and feasibility in the proposed mechanisms. We emphasize that these reactant/product limits are defaults chosen for the present case studies; the SiMBA code allows these bounds to be increased trivially. If multi-molecular complexes or fast pre-equilibria are suspected, the user may simply raise the maximum number of reactants or products per step, and SiMBA will enumerate and analyze the resulting higher-order elementary reactions without further modification.
• Stoichiometry: the stoichiometry input dictates the overall chemical reaction being analyzed. It specifies the roles of different species in the reaction, with negative numbers representing reactants, positive numbers representing products, and zeros indicating intermediates (species that do not appear in the overall reaction). This ensures that the generated mechanisms adhere to the correct chemical balance and respect the conservation of mass.
• Time budget: the time budget defines the amount of computational time allocated to the mechanism generator algorithm for exploring the search space and identifying physically feasible reaction mechanisms at each iteration. This constraint helps in managing computational resources effectively and ensures that the generation process is both thorough and time-efficient.
In principle, SiMBA could operate with only the stoichiometry of the reaction provided by the user. From the stoichiometric information alone, the smallest feasible mechanism – defined by the minimum number of elementary reactions and chemical species – could technically be derived automatically. However, the current version of SiMBA deliberately retains the options for users to input these parameters explicitly. This design choice stems from recognizing that users may possess valuable prior knowledge or estimates about their reaction systems, including a realistic minimal number of elementary reactions or chemical species. Starting from an informed position can greatly enhance efficiency, allowing SiMBA to focus computational resources effectively. Thus, while reducing inputs is feasible, maintaining these user-defined inputs provides critical flexibility and practical advantages in guiding the mechanism discovery process.
Once the inputs are defined, we can allow the backtracking algorithm to explore the mechanistic possibilities. The backtracking algorithm – a branch-and-prune method within the field of constrained optimization – is used to systematically explore the vast search space of possible reaction mechanisms by incrementally building potential solutions and backtracking when a solution is found to be infeasible. In the context of mechanism generation, this algorithm starts with an empty matrix representation (mechanism) and progressively starts filling the matrix with possible numbers, ensuring at each step that the proposed mechanism adheres to physical and chemical constraints, such as stoichiometry and the limits on the number of reactants and products in each step. When the algorithm encounters a dead end, where a proposed mechanism violates any of the predefined constraints, it backtracks to the previous step and tries an alternative pathway. This process allows the algorithm to efficiently prune the search space, focusing only on chemically valid and feasible mechanisms, thereby avoiding the exhaustive and brute-force approach through enumeration of all possibilities. This is of particular importance because of the combinatorial nature of the problem. For example, for a small 4 × 5 matrix, there are 95,367,431,640,625 possible combinations assuming that xi,j ∈ {−2,−1,0,1,2}, where a brute-force approach would be intractable. Thus, employing smart methods for an efficient exploration of the space is paramount, even when dealing with small problems. For a more detailed discussion on backtracking, the interested reader should refer to Chapter 2 of Erickson.40 Fig. 2 gives an illustrative example of how the backtracking algorithm works.
To further improve the efficiency of the exploration of the possible reaction pathways, we employ a parallelized version of the backtracking algorithm. This allows multiple ‘trees’ or potential mechanisms to be explored simultaneously, significantly accelerating the search process. The degree of parallelization is primarily constrained by the number of available processors, making this approach highly scalable with increased computational power. Parallelizing the backtracking algorithm offers significant benefits in terms of computational efficiency and scalability. By exploring multiple potential pathways concurrently, the algorithm can cover a much larger portion of the search space within the same amount of time, making it feasible to generate comprehensive sets of candidate mechanisms even for complex reactions.
The algorithm includes several rules to ensure that the generated mechanisms are chemically plausible, to name a few:
• Stoichiometric consistency: the proposed mechanisms must adhere to the stoichiometry defined by the input, ensuring that the overall reaction remains balanced.
• Elementary step constraints: each elementary step is restricted to having at most two reactants and two products, with at least one of each, reflecting the typical nature of an elementary step and ensuring that there is no redundant elementary steps (i.e., a row full of zeros).
• Intermediate formation: intermediates must be generated in the reaction network before they are consumed, maintaining a logical and sequential flow of the reaction mechanism.
These rules allow for SiMBA to filter out unfeasible or non-physical mechanisms, ensuring that the outputs are not only mathematically valid but also chemically meaningful. Thus, at the end of this phase, a comprehensive set of candidate reaction mechanisms is generated, each represented by a matrix and each having the same level of complexity (i.e., same number of elementary steps and chemical species). These matrices serve as the basis for further analysis in subsequent phases of the SiMBA methodology. This phase lays the groundwork for the rest of the SiMBA methodology, providing a robust and physically plausible set of reaction mechanisms that will be refined, validated and compared in the subsequent phases.
A noteworthy scenario arises when the overall chemical reaction is incompletely characterized due to unknown side products, such as when significant yield losses occur through mechanisms like coking or volatile byproduct formation. In such cases, SiMBA may encounter difficulty in proposing chemically meaningful reaction mechanisms if kinetic data from these unknown pathways are unavailable. However, two practical workarounds could partially mitigate this challenge. First, one could adjust the kinetic data for reactants by subtracting the fraction lost to side reactions, thereby defining an “effective reactant” profile. This manipulation would focus SiMBA's mechanism generation exclusively on the known reaction pathway of interest. Alternatively, one could introduce “pseudo-side product” variables to represent all mass lost through unidentified side reactions, thus preserving mass balance. SiMBA would then propose mechanisms involving both the target reaction and a generalized pathway to the pseudo-side products. Although helpful, this second method implicitly assumes an arbitrary number of side pathways, potentially oversimplifying the actual chemical processes involved. Consequently, both approaches have inherent limitations and should be applied with careful consideration of the reaction system under study.
It is important to note, however, that despite SiMBA's systematic approach and rigorous filtering criteria, the mechanisms returned are fundamentally justified by the kinetic data rather than guaranteed to represent the actual underlying chemical pathways. The inherent limitation here is that concentration–time datasets, especially when incomplete, inherently underconstrain the reaction network. Consequently, SiMBA-generated mechanisms should be interpreted as being consistent with the available kinetic data and chemically plausible within the defined constraints (which can always be augmented), but not necessarily as uniquely true representations of the fundamental reaction mechanisms. This caveat is particularly critical when applying SiMBA to real-world experimental datasets, which may be incomplete or subject to measurement uncertainties. Users should therefore view the generated mechanisms as robust hypotheses warranting further experimental verification and refinement.
In the matrix representation, each row corresponds to an elementary reaction, while each column represents a chemical species. The elements within the matrix indicate stoichiometric coefficients: negative for reactants, positive for products, and zero for species not involved in the elementary step.
In the first step, each reaction string is generated by identifying the reactants, products and the stoichiometric coefficients for every row of a given matrix. For example, the below matrix would be converted into the following reaction strings, where A and B are reactants, C is a product and D is an intermediate:
![]() | (1) |
Using mass-action kinetics, the reaction strings are then converted into ODEs. The rate of a reaction is proportional to the product of the concentrations of the reactants. For example, for the reaction string
the rate equation is expressed as r = k1CACB, where k1 is the rate constant. Our code automates the translation of these reaction strings into ODEs by systematically identifying unique species, constructing rate equations, and assembling the differential equations into a comprehensive kinetic model. For instance, the system of ODEs for the above matrix representation is:
![]() | (2) |
The translation process is automated using a Python script, primarily leveraging the regex library.41 Regex is employed to parse and extract key components from reaction strings, such as chemical species and reaction operators. Specifically, regex identifies species by matching patterns of letters and helps distinguish between different parts of the reaction strings, including reactants, products, and the reaction arrow (→). This parsing ensures accurate separation and interpretation of reactants and products, facilitating the automated construction of corresponding rate equations in the ODE system.
Given the potential to generate a vast number of models in any iteration of SiMBA, automating the translation from matrix notation to executable Python functions is crucial. Manual conversion would be impractical, if not impossible, due to the sheer volume of candidate mechanisms. Therefore, this automated approach not only improves efficiency but also ensures that the subsequent phases of the SiMBA algorithm can proceed smoothly. Much of the code for the translation of reaction strings to systems of ODEs has been adapted from the work of Jiscoot et al.42
To solve the parameter estimation problem, we use simulated concentration–time profiles as the dataset. These profiles provide time-series data of species' concentrations, which are critical for fitting the kinetic models.
The parameter estimation problem is defined in eqn (3), where ŷm(i) denote the prediction of a value coming from a proposed model m at a given time t(i)) (i.e., ŷm(i) = m(t(i)|θm)), and y(i) represents the target value at a given time t(i) (i.e., in silico data, in this study). Furthermore, SSE represents the sum of squared errors and nt is defined as the sampling times, which are set within the fixed time interval, t(i) ∈ Δt where Δt = [t0,tf].
![]() | (3) |
The limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm is employed for solving the parameter estimation problem.43 L-BFGS is well-suited for handling this problem due to its performance in tasks pertaining to parameter estimation and optimization.43,44
To ensure a thorough exploration of the parameter space, we may use expert-informed or random initial guesses for the parameters, with bounds set within the range [0,10] to maintain physically meaningful values in the chosen case studies (this can be changed on as-needed basis). The stopping criteria for the optimization are left to the default options in the Scipy package,45 and a multi-start approach is employed, where multiple runs are initiated with different starting points, and the best solution is retained.
We use the Akaike Information Criterion (AIC) because in previous work we compared various information criteria to determine if any offered superior performance. We found that AIC consistently outperformed other criteria in the context of kinetic discovery; further details of these studies can be found in de Carvalho Servia and del Rio Chanona.38 However, SiMBA's architecture is entirely agnostic to the particular metric used for selecting the best mechanism. Should a user prefer a classical train/test error, cross-validation score, Bayes factor, or any other statistic instead of AIC, they can simply swap in their desired metric via the “metrics.py” module without altering the core enumeration, translation, or parameter-estimation workflows. Users can thus tailor model selection to their data and domain requirements.
Given a model m with parameters θm of dimension dm, the AIC is defined as:
| AICm = 2NLL(θm|D) + 2dm, | (4) |
If iteration n + 1 in the SiMBA algorithm results in an improvement of the AIC value compared to iteration n, SiMBA will continue running, further refining the model output by considering more complicated mechanisms. This approach ensures that the algorithm is consistently moving towards a model that better balances complexity with goodness of fit. However, if the best model in iteration n + 1 displays a worsening in the AIC value compared to the best model in iteration n, indicating that the model has become less optimal, SiMBA will terminate the process. In this scenario, the algorithm concludes that additional iterations are implausible to produce a superior model, and it will return the best solution found during iteration n, which is considered the most accurate and parsimonious model according to the AIC evaluation.
![]() | (5) |
![]() | (6) |
![]() | (7) |
The selection of the case studies was made to demonstrate SiMBA's effectiveness across a range of scenarios. The case studies include a hypothetical reaction, an aldol condensation between benzaldehyde and acetophenone,48 and the dehydration of fructose to 5-hydroxymethylfurfural (HMF).49,50 These studies were chosen to showcase SiMBA's ability to distill complex kinetic behaviors into simple, accurate models.
The hypothetical reaction serves as an initial proof-of-concept, illustrating whether SiMBA can generate microkinetic models purely from fundamental principles without relying on prior knowledge. This first case study also demonstrates SiMBA's versatility in handling both first-order and second-order elementary steps. Next, the aldol condensation – a classic reaction with a well-understood mechanism – tests SiMBA's ability to reconstruct mechanistic pathways solely from dynamic data on main reactants and products. By successfully modeling this reaction, SiMBA shows that it can go beyond hypothetical examples to accurately capture established mechanistic pathways. Finally, the dehydration of fructose to HMF introduces a different challenge: rather than relying on a microkinetic simulation of stoichiometric reactants and products, the in silico dataset comes from a rate model that has been experimentally validated. In this scenario, SiMBA is challenged to discover a plausible kinetic mechanism, aligning with established literature and demonstrating its capacity to derive robust reaction models from realistic data sources.
Although our case studies focus on homogeneous reaction networks, SiMBA's matrix-and-ODE framework naturally accommodates both homogeneous catalytic cycles and heterogeneous surface chemistry. In a homogeneous catalytic cycle, one treats the catalyst resting state and any activated forms (e.g., Cat–A, Cat–B) as additional species in the matrix. Turnover steps – substrate binding (A → Cat–A), intramolecular transformation (Cat–A → Cat–B), and product release (Cat–B → Cat + P) – are handled just like any elementary reaction. Likewise, in heterogeneous catalysis, adsorption/desorption steps (A ⇌ Z) and surface–surface reactions (X + Y → Z) map directly onto bimolecular or unimolecular matrix rows: free species adsorb (A → X, B → Y), surface intermediates react (X + Y → Z), and products desorb (Z → P). Because every elementary step – whether substrate isomerization, catalyst turnover, or surface adsorption – is represented in the same stoichiometric matrix and translated automatically into ODEs, SiMBA requires no algorithmic changes to discover, fit, and rank mechanisms in either homogeneous or heterogeneous catalytic systems.
| 4A → B + C | (8) |
![]() | (9) |
Starting from the ODE system in eqn (9), an in silico dataset is generated wherein Δt = [0,10] h and nt = 30. This dataset is composed of five different experiments, each ran at different initial conditions (in molar units: (CA(t = 0), CB(t = 0), CC(t = 0), CD(t = 0), CE(t = 0)) ∈ {(10,0,2,0,0), (10,2,0,0,0), (10,2,2,0,0), (5,0,0,0,0), (10,0,0,0,0)}); these experiments were randomly picked from a 2k factorial design.51
For all experiments, the system is assumed to be both isochoric and isothermal, and Gaussian noise is added to the in silico measurements to simulate a chemical experiment. The added noise had zero mean and a standard deviation of 0.15 for A, B and C. To further approximate a realistic system, we assume that we cannot measure the intermediates D and E. The generated dataset for the one of the experiments are presented in Fig. 4(a). The dataset, providing 150 datapoints, has a realistic size for kinetic studies,52–54 especially considering recent advancements in high-throughput setups.
In this case, given the stoichiometry of the hypothetical reaction, shown in eqn (8), the simplest possible mechanism involves two elementary steps. This is due to the fact that termolecular, and higher order interactions, are relatively rare – for the purpose of SiMBA, we consider them as impossible – as the simultaneous collision of three or more molecules in the correct orientation is a very unlikely occurrence. Thus, based on that constraint, we would need at least two elementary steps to react four moles of A. As such, in the first step, two moles of species A react to produce one mole of B, while in the second step, two moles of A react to produce one mole of C. Notably, the order in which B and C are produced is interchangeable, without affecting the model's performance. These two configurations represent the only physically feasible mechanisms that could be formed in the first iteration of SiMBA, based on a 2 × 3 matrix (representing two elementary steps and three species).
Upon identifying all possible permutations of mechanisms represented by this 2 × 3 matrix (in this case, two permutations), SiMBA translated them into ordinary differential equation (ODE) models that could be optimized. Through parameter estimation, we optimize the kinetic parameters of the model, which enable us to calculate the AIC values for each model and selected the best-performing one. Table 1 shows the optimal mechanism discovered in the first iteration, including the corresponding microkinetic model, and the AIC value which amounted to 1110.34. Fig. 6 presents the model's fit against an arbitrary training experiment, visually illustrating its accuracy.
Following this initial step, SiMBA automatically proceeded to iteration 2, which leads to an increase of complexity by allowing an extra elementary step and an extra intermediate to be present in the modeling task. Consequently, in this iteration, the algorithm began with an empty 3 × 4 matrix representation of the reaction mechanism. Using the same process as in iteration 1, we identified and optimized the potential mechanisms for this iteration. To echo the point made in Section 2 regarding the importance of smart explorative methods, iteration 2 could generate 244,140,625 different matrix configurations; with backtracking, we only check the 31 configurations that make physical sense (0.00001% of all possibilities). The AIC value was again used to select the best model for iteration 2, which now amounted to an improvement to 106.28. A decrease in AIC from 1110.34 to 61.67 indicates a substantial improvement in model accuracy while maintaining parsimony.
SiMBA converged in four iterations, at which point the termination criterion was met, meaning that the informationally optimal mechanism was discovered in iteration 3. At iteration 4, no further reduction in AIC was observed, which met the termination criterion for model refinement, indicating that the added complexity did not yield better predictive accuracy. Table 1 summarizes the best mechanism identified in each iteration, along with the corresponding microkinetic models and AIC values. Fig. 6 illustrates the model fit for each selected mechanism against an arbitrary training experiment, further demonstrating the progressive refinement of the models across iterations (with exception of the last one).
In iteration 1, the 2 × 3 matrix yields 56 = 15
625 possible combinations, of which SiMBA identifies and evaluates all feasible candidates in just 3.40 s. In iteration 2, the search space jumps to 512 ≈ 2.44 × 108 matrices, yet SiMBA finds and evaluates all feasible mechanism arrangements in 14.91 s by pruning infeasible branches early. By iteration 3 (520 ≈ 9.54 × 1013, 1110.34 s) and iteration 4 (530 ≈ 9.31 × 1020, 5911.28 s), the exponential growth in possibilities becomes evident – even though the number of feasible matrices remains a tiny fraction and the execution of SiMBA stays tractable.
A comparison of the final selected model against the data-generating model indicates that SiMBA successfully uncovered the underlying mechanism driving the hypothetical reaction. This case study serves as a proof-of-concept, showcasing SiMBA's ability to generate accurate microkinetic models even in the absence of direct data on intermediates and with limited in silico data. The results demonstrate the robustness of SiMBA in handling systems that feature both firstand second-order elementary steps, confirming its potential for more complex chemical systems and broader industrial applications, as demonstrated in the next subsections.
Eqn (10) provides a simplified description of the mechanism of the reaction as well as the ODE system underpinning the dynamics of the reaction (which was directly derived from the proposed mechanism).48 It is worth noting that we assume that the elementary steps are all irreversible. The kinetic parameters (rate constants) were defined as: k1 = 0.759 h−1, k2 = 0.293 M−1 h−1 and k3 = 0.681 h−1. Although the sequence of elementary steps is taken from literature, we likewise randomized the three rate constants within [0.1, 1], imposing only the constraint that the C–C bond-forming step (i.e., second step) is the slowest, consistent with mechanistic studies showing carbon–carbon coupling as rate-determining.48
![]() | (10) |
Starting from the ODE system in eqn (10), an in silico dataset is generated wherein Δt = [0,10] h and nt = 30. This dataset is composed of five different experiments, each ran at different initial conditions (in molar units: (CA(t = 0), CB(t = 0), CC(t = 0), CD(t = 0), CE(t = 0), CF(t = 0)) ∈ {(5,10,0,0,0,0), (5,5,2,0,0,0), (5,10,0,2,0,0), (10,10,0,2,0,0), (10,10,2,2,0,0)}); these experiments were randomly picked from a 2k factorial design.51
For all experiments, the system is assumed to be both isochoric and isothermal, and Gaussian noise is added to the in silico measurements to simulate a chemical experiment. The added noise had zero mean and a standard deviation of 0.15 for A, B, C and D. To further approximate a realistic system, we assume that we cannot measure the intermediates E and F. The generated dataset for one of the experiments are presented in Fig. 4(b).
In the first iteration, based on the stoichiometric relationship between the reactants and products, SiMBA identified a single elementary step where one mole of acetophenone (A) reacted with one mole of benzaldehyde (B) to produce chalcone (C) and water (D). The simplicity of this mechanism, represented by a 1 × 4 matrix (one step involving four species), aligned with the overall stoichiometry of the aldol condensation reaction and was deemed the only feasible configuration for this initial phase.
Following this, SiMBA translated the 1 × 4 matrix into a set of ODEs that could be optimized computationally. Parameter estimation was performed, and the AIC value was calculated and used to gauge the model's fit. Table 2 does not present the mechanism identified during this initial iteration for spacing reasons, but Fig. 6 shows how well the model predictions aligned with experimental data from a selected training experiment.
Upon completion of iteration 1, SiMBA advanced to iteration 2, where an additional elementary step and species were incorporated. This expanded the search space to a 2 × 5 matrix, increasing the complexity of the possible mechanisms. As in the previous step, the new sets of ODEs were optimized, and AIC values were calculated. The results from iteration 2 showed a notable improvement, as the added complexity contributed to a better overall fit, without overfitting the system. Table 2 and Fig. 6 detail the refined mechanism and its improved accuracy.
The iterative process continued, and SiMBA reached its optimal solution at iteration 3, where no further reduction in the AIC value was observed in subsequent iterations. The complexity added in iteration 4 did not yield a better AIC value, signaling that the best mechanism had already been identified, since further complexity was improving the fit negligibly. The termination criterion was therefore met after iteration 4, confirming that iteration 3 provided the most accurate and parsimonious model. The results from the second, third and fourth iterations are summarized in Table 2, while Fig. 6 illustrates the fit for every iteration.
A similar pattern to that of the first case study emerges for the aldol case regarding the execution of SiMBA: iteration 1 (54 = 625 total possible matrices) executes in 3.06 s; iteration 2 (510 ≈ 9.77 × 106) in 4.41 s; iteration 3 (518 ≈ 3.81 × 1012) in 75.18 s; and iteration 4 (528 ≈ 3.73 × 1019) in 7989.69 s. These results confirm that backtracking keeps the search tractable even as the combinatorial possibilities skyrocket.
The comparison between the selected model and the original data-generating mechanism demonstrates SiMBA's ability to successfully uncover the fundamental dynamics of the aldol condensation reaction, even when working with limited data. This case study serves as further evidence of SiMBA's strength in identifying complex reaction mechanisms in realistic systems. The results not only validate SiMBA's accuracy but also highlight its potential for broader application in mechanistic discovery across diverse chemical reactions.
| A → 3B + C | (11) |
| r = kCACacid | (12a) |
![]() | (12b) |
and generate an in silico dataset wherein Δt = [0,90] min and nt = 30. This dataset is composed of five different experiments, each ran at different initial conditions (in molar units: (CA(t = 0), CB(t = 0), CC(t = 0)) ∈ {(4,0,0), (6,2,1), (4,2,0), (4,0,1), (6,2,0)}); these experiments were randomly picked from a 2k factorial design.51
For all experiments, the system is assumed to be both isochoric and isothermal, and Gaussian noise is added to the in silico measurements to simulate a chemical experiment. The added noise had zero mean and a standard deviation of 0.2 for A, B and C. In this example, resembling a real system, we do not have any measurement on possible intermediates. The generated dataset for the one of the experiments are presented in Fig. 4(c).
In the first iteration, SiMBA identified all permutations of the simplest possible reaction configuration consistent with the overall stoichiometry. For this case study, these permutations resulted in 10 candidate reaction matrices, each describing a minimal three-step mechanism involving five total species. For an in-depth discussion of these initial candidates, please refer to the SI. The best of these initial models, shown in the first row of Table 3, achieved an AIC value of −166.18, indicating a reasonable fit to the in silico data. As seen in the top right plot of Fig. 6, this initial mechanism satisfactorily captures the concentration profiles of A, B, and C, yet leaves open the possibility that additional chemical complexity could yield a still better match to the observed dynamics.
It is interesting to see that in iteration 1, SiMBA proposes a mechanism that is analogous to a widely accepted one in literature in which the 5-membered ring of fructose remains intact.49 The first dehydration step yields intermediate D, which can exist either as the enol or keto tautomers. The second elimination introduces a unit of unsaturation in the ring (E). Finally, HMF can be obtained from the last dehydration from the ring. Given that the tautomerism is a very fast process, this mechanism can be described kinetically by three consecutive elementary dehydration steps. The detailed mechanism can be found in Fig. 5(i).
Iteration 2 introduced a more elaborate mechanism by appending an additional elementary step and including an extra intermediate species (F). The resulting 4 × 6 matrix improved the fit of the model, offering a more nuanced description of how fructose converts into HMF through an additional intermediate stage. As reported in Table 3, this enhanced mechanism yielded an AIC of −169.40 – a slight improvement that underscores a better balance between complexity and predictive accuracy. The middle–right plot of Fig. 6 shows a slightly improved alignment between the simulated trajectories and the measured concentrations, particularly for water (B), which now follows the experimentally validated rate model's curvature slightly more precisely.
Remarkably, in this iteration, SiMBA also recovers an underlying sequence of elementary steps that is analogous to another widely accepted mechanism in the literature for the dehydration of fructose.49,55 This mechanism starts with the acyclic form of fructose (A), which is thought to be more abundant. In order for dehydration steps to occur, the formation of an enediol intermediate (D) is thought to be critical. Following two sequential dehydration steps, the dideoxyhexosulose intermediate (F) can cyclise very readily to form the 5-membered ring, prior to the elimination of the final water molecule to yield HMF (C). The detailed mechanism can be found in Fig. 5(ii).
Iteration 3 further expanded the proposed mechanism by introducing yet another species (G). Although this more complex mechanism continued to reproduce the trajectory of the reaction reasonably well, its increased complexity did not translate into further gains in predictive power; the AIC rose to −161.52, signifying that the added steps merely inflated model complexity without meaningful improvement in data fitting. This is corroborated by the bottom–right plot of Fig. 6, where the concentration profiles remain comparable to those from iteration 1, highlighting diminishing returns on model complexity. Consequently, the algorithm converged in iteration 3, as no additional refinement offered a better trade-off between model simplicity and accuracy. Among all iterations, the mechanism discovered in iteration 2 proved optimal.
The fructose example further illustrates the exponential scaling: iteration 1 (515 ≈ 3.05 × 1010 possible matrices) takes 12.80 s; iteration 2 (524 ≈ 5.96 × 1016) takes 1978.81 s; and iteration 3 (535 ≈ 2.91 × 1024) takes 10
217.54 s. Two key conclusions follow. First, the total enumeration size grows exponentially with matrix dimensions, but the backtracking time – and thus the feasible subset – is strongly problem-dependent (i.e., dependent on SiMBA's starting point and the stoichiometry of the reaction) but remains small. Second, as network complexity grows, most of the runtime shifts from backtracking to the parameter-estimation phase, which becomes the dominant cost in evaluating every feasible candidate.
In this example, unlike our other case studies – where we deliberately hid intermediates in a microkinetic simulation – this example underscores SiMBA's value in a realistic setting. The results establish that SiMBA can parse complex, experimentally grounded datasets and distill them into verifiable mechanistic pathways, further solidifying its robustness and practical relevance for catalytic reaction discovery.
It is also important to note that despite SiMBA's systematic approach and rigorous filtering criteria, the mechanisms returned are fundamentally justified by the kinetic data rather than guaranteed to represent the actual underlying chemical pathways. The inherent limitation here is that concentration–time datasets, especially when incomplete, inherently underconstrain the reaction network. Consequently, SiMBA-generated mechanisms should be interpreted as being consistent with the available kinetic data and chemically plausible within the defined constraints (which can always be augmented), but not necessarily as uniquely true representations of the fundamental reaction mechanisms. This caveat is particularly critical when applying SiMBA to real-world experimental datasets, which may be incomplete or subject to measurement uncertainties. Users should therefore view the generated mechanisms as robust hypotheses warranting further experimental verification and refinement.
To mitigate the other limitations, several strategies have been implemented within the current study. For instance, to address the computational demands associated with exploring large mechanism spaces, we have employed a backtracking technique, as detailed in Section 2.1. This method significantly reduces the search space by eliminating unfeasible pathways early in the process, and the exploration has been parallelized to further improve computational efficiency. To counter the potential sensitivity to initial parameter guesses during optimization, as discussed in Section 2.3, we have utilized a well-established optimization algorithm, specifically the BFGS algorithm, with a multi-start option. This approach increases the likelihood of finding the global optimum by starting the optimization from multiple initial guesses.
Looking ahead, future work will focus on overcoming the lack of inherent chemical knowledge as well as continuing to reduce the computational cost to further enhance SiMBA's capabilities. For the issue of chemical identification of intermediates, we plan to explore the use of quantum chemistry methodologies. While quantum chemistry workflows (e.g., as reviewed in Simm et al.56) provide a systematic route to enumerate and validate reaction pathways, their high computational cost often limits the breadth of mechanism exploration – particularly for complex networks with many intermediates. We therefore envision SiMBA serving as a low-cost “exploration” engine that rapidly identifies the simplest skeleton mechanisms consistent with kinetic data available. These skeletal networks can then be subjected to more expensive DFT or other quantum-chemical calculations (“exploitation”) to assign chemical identities, compute activation barriers, and confirm intermediate stabilities. In cases where quantum-chemical results diverge from SiMBA's proposal (e.g., predicting additional intermediates), targeted MBDoE can be employed to generate discriminating data and refine the mechanism (as explained in Section 2.5). This hybrid workflow would thus combine the speed and parsimony of data-driven exploration with the physical rigor of first-principles validation.
Additionally, we are considering to add canonicalization rules into SiMBA, which will decrease the number of duplicates that, at the moment, the methodology inevitably computes and explores (a more in-depth discussion can be found in the SI). We are also considering the integration of uncertainty quantification methods, which will increase the robustness of the models proposed by SiMBA. These enhancements aim to make SiMBA a powerful tool for chemists and engineers, capable of addressing the diverse challenges encountered in kinetic discovery and reaction mechanism elucidation.
SiMBA was developed to fill this gap by introducing a minimalistic, data-driven approach that incrementally builds model complexity based on available information. Unlike other methods, SiMBA begins with the simplest possible mechanism and systematically adds complexity only if the additional parameters provide informational gain. This balance between simplicity and accuracy is achieved through four key phases: mechanism generation, mechanism translation, parameter estimation, and model comparison. The algorithm starts by proposing feasible reaction mechanisms using a parallelized backtracking algorithm, translates these mechanisms into systems of ODEs, optimizes their kinetic parameters, and selects the best model using the AIC to ensure the right trade-off between model complexity and fit.
The effectiveness of SiMBA was demonstrated through three case studies: a hypothetical reaction, an aldol condensation, and the dehydration of fructose. In each case, SiMBA successfully distilled complex reaction behaviors into accurate models, even in situations where intermediates were not directly observable. These case studies highlight the algorithm's versatility and robustness in generating models.
While SiMBA has proven to be a powerful tool for microkinetic model discovery, it is not without limitations. The current version does not provide chemical identification of intermediates, necessitating expert input for it. Additionally, while the algorithm excels at balancing simplicity with accuracy, incorporating uncertainty quantification could further enhance the robustness of its predictions. Future work will focus on integrating more chemical knowledge and techniques for identifying intermediates, as well as expanding the algorithm's capabilities to address uncertainty in model predictions.
In conclusion, SiMBA offers a novel approach to overcoming many of the challenges associated with existing automated methods for microkinetic model discovery. By systematically generating, refining, and evaluating microkinetic models, SiMBA provides a new framework for mechanistic discovery. As SiMBA continues to evolve with future enhancements like uncertainty quantification and intermediate identification, we hope that it will become a useful tool for chemists and engineers, helping bridge the gap between theoretical exploration and industrial applications.
The accompanying supplementary information offers an in-depth discussion of the candidates proposed by SiMBA in the first iteration of the dehydration of fructose case study. The code used to produce all results and graphs shown in this work can be accessed at https://github.com/OptiMaL-PSE-Lab/auto_react_mech_construct. See DOI: https://doi.org/10.1039/d5sc01473e.
| This journal is © The Royal Society of Chemistry 2026 |