Open Access Article
Francisco Bolaños-García
*,
Jean-Marc Commenge and
Laurent Falk
Université de Lorraine, CNRS, LRGP, Nancy F-54000, France. E-mail: francisco.bolanos-garcia@univ-lorraine.fr
First published on 20th May 2026
The acceleration of chemical process development through flow chemistry depends on obtaining reliable kinetic models. Model-based design of experiments (MBDoE) has been successfully applied with flow chemistry platforms for estimating reaction rate parameters for low-complexity models. However, its use with dynamic experiments involving computationally expensive models remains challenging, particularly when experimental conditions must be suggested in real time. Surrogate models can approximate complex models, but often lack the ability to reproduce the derivatives of the original model, which are essential for MBDoE as a gradient-based approach. This work investigates the use of gradient-enhanced neural networks as surrogate models for parameter estimation within an MBDoE framework. By incorporating gradient information, the surrogate model is able to reproduce the local sensitivity structure of the original first-principles model, ensuring predictive accuracy for both output values and gradients needed for parameter estimation. A case study on competitive–consecutive reactions demonstrates that artificial neural networks (ANNs) that were trained with gradient information can be used for parameter estimation, while substantially reducing computational cost by a factor of approximately 200
000. This enables sequential MBDoE suitable for real-time applications.
Identifying an appropriate model structure and estimating the values of model parameters are essential steps for the development of process engineering models for different system. In reactor design, for example, kinetic models that describe the chemical reaction should be obtained at an early stage, as they are fundamental for further scale-up, defining control strategies, and optimization. Systematic procedures, known as model-based design of experiments (MBDoE), have therefore been proposed and applied for model discrimination5,6 and parameter estimation7–9 with the objective of performing more informative experiments. Examples of these strategies in flow chemistry include the estimation of reaction rate parameters under steady-state conditions for single reactions, such as the Diels–Alder reaction,10 and for multistep reactions, such as a nucleophilic aromatic substitution (SNAr).11 They have also been applied in batch-like flow platforms, where flow ramps are used to extract kinetic information, as demonstrated for the esterification of benzoic acid with ethanol.12 However, there have been no reports of MBDoE applied to platforms using autosamplers and injection loops to intensify the experimental workflow,13 even though this approach could reduce reagent consumption and lower both cost and time in chemical process development. In these systems, small volumes of reagents are sequentially injected into a reactor whose volume is significantly larger than the injected volume. As a result, the reactor operates under transient conditions.
To deploy sequential planning on automated platforms, some open-source tools have already been designed, for example EFCOSS,14 that provides a modular environment coupling numerical simulation codes with optimization packages. Pyomo.DOE15 formulates the MBDoE problem as a stochastic program and uses nonlinear sensitivity analysis. Another framework is GPdoemd,16 which trains Gaussian process (GP) surrogate models to support model discrimination when candidate models correspond to computationally expensive black-box simulators and gradient information with respect to model parameters is not readily available. In that framework, GP surrogates are constructed from sampled model evaluations and used to evaluate design criteria based on their predictive distributions.
In contrast, the focus of the present work is on parameter estimation for first-principles models, where parameter sensitivities can be computed, but the integration of MBDoE with self-guided flow chemistry setups requires fast numerical optimization. During the procedure, a parameter estimation is performed using the available measurements, while a second optimization problem determines the experimental conditions that maximize the information content of subsequent experiments.17 When model evaluations are computationally intensive, these optimization steps become a bottleneck, limiting MBDoE deployment in automated platforms by constraining the speed at which new experiments can be executed. A solution to this was presented by Friso and Galvanin,18 where an optimization-free procedure is performed in which a relative information value is used to compare possible experimental conditions. However, for the chosen design to be near the unknown true optimal design depends on the pre-defined set of candidate experiments.
In process systems engineering, surrogate models have emerged as a tool to tackle complex models while saving computational time and resources.19 Within experimental design frameworks, Artificial Neural Networks (ANNs) have been employed to identify suitable kinetic models from experimental data,20,21 showing the potential of using surrogate models for model building. However, these applications have typically not extended to parameter estimation, which requires solving an inverse problem to find model parameters from the measurements. Other methodologies have used physics-informed neural networks (PINNs). With them, a simultaneous training of the surrogate model parameters and the kinetic parameters is performed,22 which can be challenging if not enough data are available.
One strategy to improve the training of neural networks consists in including gradient information, as matching the Jacobian matrix provides additional information per training sample and can be interpreted as a form of data augmentation.23,24 The idea is to apply surrogate models that are trained not only with the model output, but also using its corresponding derivatives. Named Sobolev-trained neural networks,23 they benefit from current libraries, e.g. Pytorch,25 that compute these gradients efficiently through automatic differentiation (AD). Previously, gradient-enhanced ANNs needed the explicit derivation of the gradient,26,27 limiting their application. These types of models have already been applied to process engineering for gradient-based optimization,28 where they have been proven to be beneficial in scenarios where small data sets are available.
Gradients also play a crucial role in MBDoE procedures, so their calculation should be accurate. The sensitivities of the model with respect to its parameters, captured by the gradients, are fundamental for defining the design criteria. In addition, gradients are used within the optimization routine that seeks to optimize these criteria.17
Motivated by this, the present work explores the use of ANNs enhanced by derivative training for parameter estimation through sequential planning, with the aim of allowing the deployment of such a strategy in automated platforms that must handle computationally expensive model evaluations. To the authors' knowledge, this work represents the first application of gradient-enhanced surrogate models to sequential MBDoE for parameter estimation with experiments under transient conditions. The following section introduces the background of the MBDoE procedure, followed by a description of the neural network enhancement with gradient information illustrated through an example, and finally a case study where parameters are estimated from in silico flow chemistry experiments under unsteady conditions.
| ŷ = f(ξ,θ) | (1) |
From a set of preliminary experiments, hypotheses can be made to propose a set of plausible models that describe the observed behaviour. After selection of model structures, it is important to be sure that they are identifiable and distinguishable.29 If, from the data available, it is not possible to discriminate between the best two models, a sequential procedure to produce more observations to maximize discrepancy between predictions can be performed.30
Once we assume that the selected model structure is adequate, experiments that increase the information available for the estimation of parameters must be executed. This part of the MBDoE is the focus of this work. The definition of the parameter space Θ should be based on some prior knowledge of the system behaviour from the preliminary experiments. In order to obtain a parameter estimation
from experimental data, an estimator of the fit quality of the model with respect to observations should be defined. The maximum likelihood estimator (MLE) is mostly used due to its consistency, efficiency and asymptotic normality under modest assumptions.31 This estimator maximizes the likelihood function
, defined as the joint probability of observing the experimental data ye(e = 1, …, Ne) given the Np model parameters. To properly quantify it, the uncertainty in the measurements must be taken into account with the variance-covariance matrix of the experimental errors, Σy. Its diagonal elements represent the variance of the uncertainty associated with each of the Ny measurements within the eth experiment and the off-diagonal elements represent the covariance between pairs of them.32,33
![]() | (2) |
![]() | (3) |
∈ Θ.
To improve the estimation with further experiments, it is important to quantitatively assess the influence of the parameter values on the outputs of the proposed model. This is represented by the sensitivity matrix S[Ny × Np], which is a local measurement depending on current parameter values and experimental conditions. For example, it captures how variations in a rate constant affect predicted outlet concentrations at a given temperature and residence time.
![]() | (4) |
From this, a square matrix Np × Np known as the Fisher Information Matrix (FIM), F, can be computed with eqn (5). This step is crucial as the variance-covariance matrix of the estimated parameters Σ
can be approximated using the inverse of the FIM evaluated at
,ξ,33 useful to determine the confidence region of the parameters and the correlation between them. In this context, experiments that produce stronger sensitivity of concentrations to kinetic parameters result in a more informative FIM. As stated above, the MLE is asymptotically Gaussian and unbiased. This means, with θ* being the vector of true parameters, the distribution of
tends to
as Ne → ∞.31
In practice, however, only a limited number of experiments are available. As a result, this asymptotic approximation may not hold exactly, and its accuracy depends on several factors: the proximity of the estimated parameters to their true values, the degree of model nonlinearity with respect to the parameters, and the signal-to-noise ratio of the measurements. Its widespread use in MBDoE is motivated by its computational tractability and its effectiveness for comparing alternative experimental designs, rather than its ability to provide an exact quantification of uncertainty,33 remaining as a practical and widely used tool for ranking candidate experiments before performing them.
![]() | (5) |
Σ ≈ FNe( ,ξ)−1
| (6) |
To facilitate sequential experimental design, a scalar function of the FIM is typically introduced to quantify the information content associated with a given set of operating conditions. This scalar metric serves as the design criterion and is maximized at each step to inform the selection of subsequent experiments. This enables ranking candidate operating conditions (e.g., different temperatures or flow rates) according to their expected contribution to parameter estimation. The optimal design is, therefore, the one that maximizes the selected criterion ψ(·).
![]() | (7) |
.
Several design criteria exist to decide the best next experiment.17 In this work, the D-optimality is employed, as it is the most widely used, where the determinant of the FIM is the scalar property to be maximized. The aim is to find the experimental conditions that minimize the uncertainty of the parameter estimation, as increasing the determinant of the FIM reduces the volume of the confidence region of the estimated model parameters. In practice, this corresponds to selecting the next set of operating conditions that is expected to give the most informative measurements for refining the kinetic parameters.
ψD(F( ,ξ)) := det(F( ,ξ))
| (8) |
The whole workflow is presented in Fig. 1 illustrating the iterative interplay between model evaluation, parameter estimation, and experiment selection. Stopping criteria could be either a fixed number of experiments based on a pre-defined experimental budget, or a threshold value when assessing the adequacy and reliability of the estimated parameters by, for example, comparing χ2 and t-values with reference values from the corresponding distribution with Ne × Ny − Np degrees of freedom.17 The focus of our work is on parameter estimation and the subsequent design steps, which require efficient computation of sensitivities, that is, first-order derivatives of the model outputs with respect to the parameters (eqn (4)). Since analytical expressions for these derivatives are rarely available, they are often obtained numerically, a process that can become prohibitively expensive in automated platforms when using finite-difference schemes.
Overall, the workflow takes a candidate model, experimental data (e.g., concentration measurements), and admissible operating ranges (e.g., temperature and flow limits) as inputs and returns updated parameter estimates together with the next optimal experimental conditions to be tested, guiding subsequent experiments.
| zl = ϕl(WlTzl−1 + bl), ∀l = 1,…,L | (9) |
Different hyperparameters, such as the number of layers (L) and the number of neurons per layer (Nl ∀ l = 1, …, L), must be specified. The surrogate model is then trained on a designated training dataset of size Ns to determine the optimal weights and biases by solving an optimization problem that minimizes a chosen loss function. For regression tasks, commonly used loss functions include the mean absolute error (MAE) and the mean squared error (MSE). During training, backpropagation is performed where gradients of the loss with respect to all weights and biases are computed using the chain rule via automatic differentiation (AD). This AD framework can also provide the sensitivities of fNN(x) not only during training but also when evaluating the ANN, as illustrated in Fig. 2.
![]() | (10) |
If the loss function is restricted to the MSE (eqn (10)) between function outputs, the ANN can accurately predict the function values, but may perform poorly if required to yield the derivatives of the function with respect to the inputs, which are essential in gradient-based applications such as MBDoE for parameter estimation. This limitation is significant since the surrogate model replaces the mechanistic model that provides a detailed but computationally expensive representation of the experimental setup. As shown by Czarnecki,23 Sobolev-trained neural networks minimize the distance between the true function f(x) and the surrogate model fNN(x) in the Sobolev space by adding a loss term for each of the j elements in the set of available and relevant derivative orders
.
![]() | (11) |
To ensure that all terms in the loss function are of the same order of magnitude, Tsay28 proposed a simple method to scale the derivative terms and select a proper value for each weight λj, avoiding the challenge of relying on heuristic tuning. Since it is common practice to normalize the input and output vectors before training the neural network, the derivatives used for training must be scaled consistently with eqn (12). We denote a scaled variable with an overbar symbol (e.g.,
). If resulting derivatives are also normalized using the min–max scaler, then D
j
(
n) ∈ [0,1], and both terms in the loss function will have the same order of magnitude.
![]() | (12) |
![]() | (13) |
![[x with combining macron]](https://www.rsc.org/images/entities/b_i_char_0078_0304.gif)
is not needed and just the first derivative with respect to θn will be computed. This avoids the need to construct complete gradient information over all inputs, providing a distinct structural advantage over other surrogate models, i.e., gradient-enhanced Gaussian process (GEK) approaches, which generally require complete gradient information across all inputs to formulate stable joint covariance matrices.35
![]() | (14) |
![]() | (15) |
As discussed previously, the gradients of the neural network output with respect to its inputs can be obtained by AD, already available in libraries such as Pytorch (version 2.8.0). For the first-principles model, several strategies can be employed to obtain these derivatives. While analytical expressions offer exact results, they are often impractical for complex systems. Numerical finite differences provide a straightforward alternative but are computationally costly and prone to numerical error. A more systematic alternative is AD, which provides exact derivatives up to machine precision. Depending on the problem size, AD can be applied in forward or backward mode. In the context of models governed by ordinary differential equations (ODEs), the corresponding sensitivity equations (forward mode) or adjoint equations (backward mode) can be solved alongside the system dynamics to efficiently compute parameter sensitivities.36,37
![]() | (16) |
![]() | (17) |
![]() | (18) |
![]() | (19) |
is the molecular diffusion, L is the length of the reactor, τ is the space time, and k is the kinetic constant. It was assumed that
, a typical value for liquids. This first-principles model will serve as the reference to be replaced by a surrogate model.
A dataset {(τn,kn,XAn,dXAn/dkn)}100n=1 was created by a standard Latin Hypercube Sampling (LHS) over the domain of the experimental conditions, i.e., the space time (τ ∈ [10 s, 90 s]), and the domain of model parameters, i.e., the kinetic constant (k ∈ [0.001 s−1, 0.1 s−1]). This dataset was then randomly partitioned into three subsets: a training set (70%), a validation set (15%), and an independent test set (15%). The architecture of the surrogate model is a fully connected neural network with two inputs, τ and k, one hidden layer of 10 neurons, and a single output neuron for XA. The number of neurons in the hidden layer (10) was determined through preliminary testing, as increasing this number did not result in any further improvement in the surrogate model's predictive accuracy. For scaling the input vector, the boundaries of the design space were used as min–max values, and not the min–max values from the sampled dataset. Two surrogate models were trained by minimizing the loss function using the Adam optimizer: one using only the function values XA and another using both the function values and the derivatives dXA/dk. Training was performed for a maximum of 20
000 epochs with early stopping based on the validation loss, using a patience of 500 epochs. The trained network weights and biases corresponding to the lowest validation loss were retained, and final performance metrics were computed on the independent test set.
A comparative analysis was performed to determine the most effective activation function for the ANN architecture. The evaluation involved training 10 networks, each with a random weight initialization. The Root Mean Squared Error (RMSE) of the model output and of the gradient obtained through AD of the ANN was recorded. The median RMSE across the 10 trained networks was used as a performance metric. This comparison was carried out for both types of surrogate models, the standard trained and the enhanced with gradient information.
The median RMSE reported in Fig. 3 exhibits the performance for three different activation functions: ReLU, sigmoid and tanh. Under standard training, the sigmoid and tanh functions yielded similar performance, although the tanh activation showed greater variability. A common characteristic across all three activation functions was a greater error in predicting the gradient than in predicting the function value itself. When incorporating gradient information into the surrogate model training, the performance for the sigmoid and tanh functions improved. In contrast, the ReLU function consistently produced the highest RMSE and showed no significant improvement with gradient-informed training, an expected outcome given its non-differentiability at zero. Regarding computational cost, the inclusion of the gradient loss term did not lead to an increase in training time. In contrast, the gradient-enhanced training exhibited a slightly lower average training time (10 s) compared to standard training (12 s).
Because the sigmoid activation achieved the lowest RMSE for both the model output XA and its gradient with respect to k, it was selected as the preferred choice for further analysis. An ablation study on the gradient loss weight was conducted to assess its influence on model performance. In addition to the baseline formulation (eqn (15)), weights of 0.5 and 2.0 for the gradient loss were evaluated. Both alternatives resulted in higher average total MSE values, 1.78 × 10−4 and 1.02 × 10−4, respectively, compared to 3.91 × 10−5 obtained with the baseline weighting. These results indicate that the chosen weighting provides a near-optimal balance between fitting the function value and its gradient.
Fig. 4 compares the true function and gradient values with the predictions from the standard trained ANNs with the sigmoid activation function. The comparison is performed at τ = {10 s, 50 s, 90 s}, which correspond to the lower boundary, centre, and upper boundary of the operating range used to generate the training dataset. The prediction of the conversion XA is accurate for all three space time values and over the whole range of k values. This is not the case for the prediction of dXA/dk in the lower region of k values of the training set, especially at both boundary values of τ, where the predictions are less accurate. This is a common behaviour of ANNs and why they tend to perform badly when extrapolating. Fig. 5 shows the improvement on both the conversion and its first-order derivative prediction when adding the gradient information during training, particularly increasing accuracy near the lower values of k. This means that using gradient-enhanced training yielded a surrogate model that will behave more like the first-principles model, a relevant advantage in gradient-based applications.
Then, if the surrogate model is used by a MBDoE framework to determine the most informative experimental conditions for the estimation of k, the sensitivities are computed. As just one parameter is being estimated, the FIM becomes a scalar value
. This value was determined assuming a measurement error in the observed conversion,
, with σe = 0.05. Fig. 6 presents the predicted and true Fisher information for all possible values of the space time with k = 0.022 s−1, a value for which dXA/dk > 0 over the whole design space so the conversion is sensitive to variations in the kinetic constant. When comparing between the standard and the gradient-enhanced training, the latter is more accurate in the prediction of the Fisher information. This is relevant when performing an experimental planning to find the optimal value of τ that reduces the uncertainty on our estimation of k. Based on the first-principles model (full black line), an optimal space time for an experiment is close to 47 s, and the same decision would be taken if the gradient-enhanced ANN is used. In contrast, an ANN with a standard training would hesitate between the values of 47 s and 90 s, which could misguide a sequential experimental strategy toward less informative experiments.
Gaining knowledge on model parameters from this kind of system is not trivial, as neglecting the effect of dispersion will lead to an erroneous estimation of the kinetic constants. A detailed model is then needed to account for the real physical behaviour and precisely estimate the intrinsic values. The proposed case study consists of two competitive–consecutive reactions, the iodination of L-tyrosine. In this scheme, L-tyrosine (A) reacts with iodine in water (B) to form 3-iodotyrosine (R) and then R can further react with B to give 3,5-diiodotyrosine (S). Both reactions are second order,8 with reaction rates denoted by r and kinetic constants denoted by k.
| r1 = k1CACB | (20) |
| r2 = k2CRCB | (21) |
![]() | (22) |
With the following boundary conditions:
![]() | (23) |
![]() | (24) |
The model output f(ξ,θ) corresponds to a unique experimental measurement, the area under the curve of iodine concentration at the reactor outlet as a function of time (eqn (25)). Iodine concentration was selected as the measurable variable because it can be directly monitored in practice using an inline UV spectrometer.
![]() | (25) |
To solve the set of PDEs numerically, we used the method of lines. The axial coordinate z was discretized into 1000 uniformly spaced nodes, and both first- and second-order spatial derivatives were approximated using centred finite differences. The discretization converted the PDE into a system of ODEs, solved with a Runge–Kutta method from the SciPy library (version 1.16.2) in Python (version 3.11.7).40
For the simulations, a 5 mL tubular reactor with an internal diameter of 0.75 mm and a length of 11.32 m was considered. A molecular diffusion of 1 × 10−9 m2 s−1 was assumed, and the dispersion coefficient D was determined as described in the previous section. The total concentration, (CA,0 + CB,0), was held constant at 0.75 mol m−3, which defined the height of the rectangular injection profile. The ratio tinj/τ was fixed at 0.04, corresponding to a 200 µL injection volume. As an example, Fig. 7 shows the simulated behaviour of species B for a space time of 5 min, an injection duration of 12 seconds, an injection concentration of CB = 0.375 mol m−3, k1 = 0.05 m3 s−1 mol−1, and k2 = 0.02 m3 s−1 mol−1. The simulation starts with the discrete injection of the reagents (Fig. 7a). The initial concentration of B decreases along the reactor length due to the combined effects of reaction and dispersion (Fig. 7b). The outlet concentration of B is tracked as a function of time (Fig. 7c), mimicking the response that could be measured with an inline UV detector. The area under this curve is then used as the single output of our first-principles model.
The architecture of the neural network consisted of 4 inputs (τ, CA,0/0.75, k1, and k2), 1 hidden layer of 16 neurons, and one output layer corresponding to the area under the curve of B. The hidden layer size of 16 neurons was selected based on preliminary tests, as increasing the number of neurons beyond this value did not lead to any improvement in the surrogate model's accuracy. A training for 10
000 epochs with early stopping (a patience parameter of 500) was performed. Both sigmoid and tanh functions were tested by training 10 neural networks with a randomly initialized set of weights.
Fig. 8 shows the RMSE obtained with both activation functions. Under standard training, the tanh activation achieved lower RMSE values than the sigmoid function for both output and gradient predictions. The higher accuracy of tanh in this case may be attributed to its resemblance to the error function (erf), which is commonly used to describe cumulative distributions such as the area under the curve after a pulse injection. Incorporating gradient information into the training reduced RMSEs in all cases relative to standard training, except for the output predictions with tanh, leading to similar performance between the two activation functions. The advantage provided by the resemblance between tanh and erf during standard training appears to diminish once gradient terms are included in the loss function. In addition, consistent with observation in Section 4, the inclusion of gradient information did not introduce additional computational cost. The gradient-enhanced models converged with an average training time of 4.2 s compared to 5 s for the standard approach.
Since the tanh outperformed the sigmoid activation in approximating the true function values under standard training, it was selected for further comparison with the gradient-enhanced neural network. Alternative gradient loss weights of 0.5 and 2.0 were tested, yielding higher average total MSE values (2.51 × 10−3 and 3.01 × 10−3, respectively) than the baseline value of 2.06 × 10−3 obtained with a weight of 1.0.
From each of the two sets of 10 ANNs, one with standard training and one with gradient-enhanced training, the model with the lowest total RMSE (sum of output and gradient errors) was selected. These two surrogate models were then compared to evaluate their predictive ability and performance when applied to a MBDoE framework for parameter estimation. Both the best standard and gradient-enhanced networks accurately predicted the peak area of B, as shown in the parity plots in Fig. 9, indicating that the inclusion of gradient information does not compromise the accuracy of output predictions.
![]() | ||
| Fig. 9 Parity plots comparing the ANN predicted output fNN(ξ,θ) against the true values obtained from the full mechanistic model f(ξ,θ). (a) Standard training and (b) gradient-enhanced training. | ||
In contrast, differences arise when comparing gradient predictions. The parity plots from Fig. 10 and 11 demonstrate that the gradient-enhanced network provides an improved accuracy for the derivatives w.r.t. both kinetic parameters k1 and k2. In addition, the sign of the gradients is distinguished with colours, as all derivatives should be negative for physical consistency: indeed, increasing either kinetic constant while holding the other fixed increases the consumption of B and reduces the peak area at the reactor outlet, implying a negative derivative. In the standard training case (Fig. 10), several predicted gradients were positive, which is physically unrealistic. Incorporating gradient information (Fig. 11) reduced the occurrence of such predictions, and the few remaining positive values were close to zero, despite no hard constraints being imposed during training. Therefore, gradient-enhanced training not only improves the accuracy of gradient predictions but also mitigates the occurrence of gradients that lack physical meaning. As an additional baseline, a standard GP surrogate trained on the same dataset yielded a total MSE (1.89 × 10−2) and parity plots comparable to those of the standard ANN (Fig. S.1), indicating similar limitations in reproducing parameter sensitivities.
![]() | ||
| Fig. 10 Parity plots comparing the ANN predicted gradients with respect to k1 and k2 under standard training against the true gradients obtained from the mechanistic model. | ||
![]() | ||
| Fig. 11 Parity plots for the ANN predicted gradients with respect to k1 and k2 under gradient-enhanced training against the true gradients obtained from the mechanistic model. | ||
, was added to the computed area under the curve for species B, assuming a constant σe across all measurements. The impact of the noise magnitude is tested below.
The campaign began with five preliminary runs selected via LHS. Because the initial design may influence the outcome, different LHS designs comprising five runs were evaluated. Subsequently, the sixth and all following experimental conditions were chosen iteratively according to the D-optimality criterion. Optimization of ψD was performed using a differential evolution algorithm, as implemented in SciPy.40
As discussed in previous sections, a primary motivation for employing surrogate models within an MBDoE framework is the substantial reduction in the computational burden associated with the optimization routines used to design each experiment. Computing the sensitivity matrix S using the full first-principles model with a central difference scheme, for instance, requires two model evaluations for each parameter θp under every experimental condition ξe. To assess computational cost, we measured the time required to compute the [5 × 2] sensitivity matrix for the five preliminary experiments on a workstation with 16 cores (Intel Xeon w5-3433, 2.00 GHz) and 256 GB RAM. Using finite differences with the first-principles model, which required 20 full model evaluations, the computation time was 970 ± 143 s. In contrast, the gradient-enhanced ANN required only 5 surrogate model evaluations and completed the same task in 0.005 ± 0.003 s, corresponding to a reduction in computational cost by a factor of approximately 200
000. For this comparison, finite differences were preferred over an adjoint approach. Given that only two parameters are estimated, the benefits of employing an adjoint formulation are limited. Additionally, due to the stiffness of the governing equations, the evaluation of sensitivities via the adjoint method proved to be more computationally demanding (4571 ± 1278 s).
To quantify the time needed for a full MBDoE cycle, the number of function evaluations was assessed. On average, 640 model evaluations are required for parameter estimation, along with 339 sensitivity computations to determine the D-optimal experimental conditions. Consequently, identifying the optimal experimental design using the first-principles model requires approximately 3–6 days, compared to only 1–5 s when using the surrogate model. This acceleration is critical for an online MBDoE strategy, which relies on a fast iterative calculation of the sensitivity matrix to inform the next optimal experiment in real-time applications.
Effectiveness was assessed using two complementary metrics: (i) the distance between estimated and true parameter values and (ii) the probability that the true parameters lie within the estimated confidence region at the end of the workflow. Results are shown for a total budget of ten experiments, including five preliminary runs, in Fig. 12 for two noise levels (σe = 0.2 and 0.4 mol s m−3, corresponding to approximately 7.5% and 15% relative error relative to the mean output).
To account for variability due to the initial design, the procedure was repeated using ten different LHS initializations, and the median performance after each experiment was reported. In addition, the coverage probability was estimated from 100 independent runs performed under the same experimental budget. The resulting coverage probabilities, along with the median volume ratio and correlation error, are summarized in Tables 1 and 2.
| ANN | Volume ratio | Correlation error | Coverage probability |
|---|---|---|---|
| Standard (output-optimized) | 0.30 [0.28, 0.34] | 0.12 [0.10, 0.15] | 4% |
| Standard (total-optimized) | 0.26 [0.22, 0.46] | 0.09 [0.04, 0.13] | 7% |
| Gradient-enhanced | 0.44 [0.39, 0.49] | 0.15 [0.12, 0.21] | 60% |
| ANN | Volume ratio | Correlation error | Coverage probability |
|---|---|---|---|
| Standard (output-optimized) | 0.42 [0.33, 0.55] | 0.20 [0.18, 0.23] | 46% |
| Standard (total-optimized) | 0.25 [0.21, 0.37] | 0.07 [0.05, 0.16] | 36% |
| Gradient-enhanced | 0.62 [0.52, 0.76] | 0.22 [0.13, 0.29] | 68% |
For comparison, three surrogate models were evaluated: the standard ANN with the lowest RMSEtotal (RMSEtotal = RMSEoutput + RMSEgrad), the standard ANN with the lowest RMSEoutput, and the gradient-enhanced ANN with the lowest RMSEtotal. The results in Fig. 12 reveal that the standard-trained ANNs failed to converge to the true parameter values, even in the lower noise scenario. By contrast, the gradient-enhanced ANN converged to the correct parameters. This behaviour arises because the D-optimal designs often involve experimental conditions located at the boundaries of the design space (Fig. S.2–S.4). As demonstrated in the previous section, standard ANNs exhibit reduced accuracy near boundary regions, particularly under limited training data. Consequently, the MBDoE procedure systematically selects experiments in regions where the surrogate is least reliable, creating a feedback loop that reinforces model bias and prevents convergence to the true parameters.
As expected in sequential MBDoE, all surrogates guide the algorithm toward experiments that reduce parameter uncertainty, as evidenced by the steady decrease in the median determinant of the FIM (Fig. S.5). However, reducing uncertainty alone is insufficient; a reliable surrogate must also ensure that the resulting confidence region is centered on the true parameter values. This aspect is captured by the coverage probability.
Under lower noise conditions, the standard ANNs exhibit low coverage probabilities of 4% and 7%. As shown in Fig. 12a, this failure is primarily due to their inability to approach the true parameter values. Interestingly, these models still show reasonable alignment in terms of correlation error (Table 1), indicating that their FIM-based covariance matrices adequately describe the spread of their own parameter estimates. This can be visually assessed in the SI (Fig. S.6–S.8), which illustrates that while the shape and size of the covariance ellipses generally match the parameter distributions, they are centered far from the true values due to the flawed local sensitivity structure they provide to the experimental design algorithm. In contrast, the gradient-enhanced ANN minimizes the distance to the true parameters, yielding an improved coverage probability of 60%. This value remains below the nominal 95% and may be attributed to limitations of FIM-based uncertainty quantification at this experimental budget (Ne = 10), as the FIM provides a local approximation of the parameter covariance matrix that may not fully capture nonlinear effects. As discussed in Section 2, these factors can lead to an underestimation of uncertainty, even when the surrogate accurately represents local model sensitivities.
Comparing the higher noise scenarios reveals an important characteristic of the FIM-based uncertainty metrics. Under higher noise, the coverage probabilities for the standard models improve, rising to 46% and 36% (Table 2). This increase, however, does not reflect an improvement in parameter estimation accuracy; the distance trajectories (Fig. 12b) show that the estimates remain far from the true values. Rather, the higher experimental variance inflates the volume of the resulting confidence ellipses. The coverage probability increases primarily because these wider confidence bounds contain the true parameters more frequently, despite the central estimates remaining off-target. The gradient-enhanced ANN scales its coverage to 68% under these wider bounds.
As an example of sequential experimental planning for parameter estimation, Fig. 13 shows the evolution of the estimated parameters and the associated confidence ellipse. The results are displayed after the five preliminary experiments (LHS), after the sixth optimal experiment, and at subsequent steps, for both the ANN with the lowest RMSEtotal and the gradient-enhanced ANN. This trajectory is representative of the behaviour discussed above. In both cases, the sequential addition of experiments leads to a reduction of the confidence regions, indicating improved parameter precision. However, the gradient-enhanced ANN gives estimates closer to the true parameter values. This improved accuracy and reliability make the gradient-enhanced ANN a promising candidate to replace first-principles models, enabling its use in real-time, automated experimental platforms for flow chemistry.
![]() | ||
| Fig. 13 Confidence regions of the estimated kinetic parameters k1 and k2 during an MBDoE campaign. (a) Standard trained ANN with the lowest RMSEtotal. (b) Gradient-enhanced trained ANN. | ||
The case study highlighted the effectiveness of the gradient-enhanced neural network through in silico synthetic validation. The use of gradient information during training improved predictive accuracy of model derivatives, mitigated the occurrence of physically implausible gradients, and delivered reliable predictions near boundary values that are typically targeted by MBDoE as highly informative regions. The gradient-enhanced approach demonstrated higher robustness in parameter estimation, converging toward true parameter values in a higher percentage of cases than the standard ANN, supported by an increase in coverage probability. Standard ANN surrogates, however, exhibited poor coverage probability in scenarios with low noise and systematic bias in parameter estimates, indicating that accurate reproduction of sensitivities is key to ensuring reliable parameter estimation within MBDoE. In addition, assessments of optimization timing confirmed a substantial reduction in the computational cost required for iterative model evaluations. As a proof-of-concept, these results establish gradient-enhanced neural networks as promising candidates for substituting first-principles models, providing a framework for their future application in self-driven flow chemistry platforms where experimentation, parameter estimation, and optimization can be achieved in a fully automated closed-loop.
Beyond flow chemistry, this framework could be applicable to other domains requiring accelerated parameter estimation in real-time applications. However, the present study relies on synthetic validation and specific noise model assumptions, which may not fully capture experimental variability or potential mismatch between the model and the experimental platform encountered in practice. Validation with real experimental workflows on automated platforms is therefore needed to assess the robustness of the proposed approach. Additionally, limitations arise in systems with a large number of parameters, where surrogate training becomes increasingly challenging and ill-conditioning of the FIM may occur. Future work with larger models could focus on strategies before surrogate training, such as identifying a subset of the most estimable parameters. This reduction in dimensionality would improve numerical stability during FIM inversion and ease the training process by concentrating on the most influential parameters under the given experimental conditions.
| This journal is © The Royal Society of Chemistry 2026 |