Open Access Article
Tim Rensmeyer,* Denis Kramer and Oliver Niggemann
Helmut-Schmidt-University Hamburg, Hamburg, Germany. E-mail: rensmeyt@hsu.hamburg
First published on 7th April 2026
Due to the computational complexity of evaluating interatomic forces from first principles, the creation of machine learning interatomic force fields has become a highly active field of research. However, generating training datasets of sufficient size and sample diversity itself comes with a computational burden that can make this approach impractical for modeling rare events or systems with a large configuration space. Fine-tuning foundation models that have been pre-trained on large-scale material or molecular databases offers a promising opportunity to reduce the amount of training data necessary to reach a desired level of accuracy. However, even if this approach requires less training data overall, creating a suitable training dataset can still be a very challenging problem, especially for systems with rare events and for end-users who do not have an extensive background in machine learning. In on-the-fly learning, the creation of a training dataset can be largely automated by using model uncertainty during the simulation to decide whether the model is accurate enough or whether a structure should be recalculated with quantum-chemical methods and used to update the model. A key challenge in applying this form of active learning to the fine-tuning of foundation models is how to assess the uncertainty of those models during the fine-tuning process, given that most foundation models lack any form of uncertainty quantification. In this paper, we overcome this challenge by introducing a fine-tuning approach based on Bayesian neural network methods, together with a subsequent on-the-fly workflow that automatically fine-tunes the model while maintaining a pre-specified accuracy, and that can detect rare events, such as transition states, and sample them at an increased rate relative to their occurrence.
Density Functional Theory (DFT), in particular, has established itself as a valuable tool in computational chemistry that allows the computation of many material and molecular properties, such as electronic structure, binding energies and interatomic forces, with a high degree of accuracy and a computational complexity that is feasible on a typical high-performance cluster for many tasks.4 While DFT has enabled the investigation of the properties of individual materials at quantum mechanical accuracy, high-throughput screening of materials or molecules for desired properties remains very computationally demanding. Furthermore, Molecular Dynamics (MD) – the simulation of the time evolution of molecular systems and materials – remains challenging, as the forces on all atoms have to be calculated at each timestep. This severely limits the time horizon that can be reached in a practical amount of time using DFT.2
Consequently, the development of machine learning models that can predict interatomic forces has become an active field of research.5 Here, neural networks have emerged as a promising approach. These models have made great strides in recent years and can, in some cases, achieve much higher accuracy than previous methods with just a few hundred well-sampled training configurations of a specific system.6–11 Even more alluring is the prospect of not starting the training from scratch but instead using one of the models from the growing collection of publicly available foundation models.12–19 These models have been pre-trained on large databases of materials or molecules and can often model the overall dynamics of a system to a certain degree of accuracy, but frequently fail to capture subtle physical effects that might be of particular importance, as will be illustrated in Section 5. Moreover, the accuracy of a foundation model is inherently limited by the level of theory used for its pretraining dataset, which is often chosen on the basis of speed and stability rather than accuracy. Fine-tuning is, therefore, often necessary to model specific system details accurately and to bridge the gap between the level of theory used for pre-training and the level required for modeling specific properties accurately.20 Fine-tuning such pre-trained models can significantly reduce the amount of training data necessary to reach a desired accuracy compared to training from scratch,20–25 making it an attractive approach.
However, a key challenge that remains even for fine-tuning is how to select data points for the training dataset of a specific system of interest. While existing algorithms for fine-tuning foundation models perform well on benchmark datasets, those benchmarks are usually subsampled from very long DFT MD-trajectories that cover the entire space of atomic arrangements that can occur. Simulating such an exhaustive trajectory to generate a training set in an applied context, where a certain system of interest is supposed to be investigated, can quickly become almost as computationally demanding as the simulation the neural network was supposed to replace. This is especially problematic for systems with high-dimensional configuration spaces, for example, due to many rotational degrees of freedom of single σ-bonds or transition pathways between meta-stable configurations or phases of the system.
At first glance, computing the trajectory at a lower accuracy and computational complexity (e.g., with a minimal basis set) and then recalculating some subset of those configurations at the desired accuracy appears to be a simple solution to increase sample diversity and reduce computational demand. However, this is often not feasible. For example, using minimal basis sets in DFT can alter the predicted equilibrium bond distances by a few percent when compared to large basis sets. In practice, this can lead to almost disjoint radial distributions, as shown for the nitrogen molecule in Fig. 1, even though many other physical properties, such as formation energy and vibrational frequency, can be similar to those of the higher accuracy method.
Fig. 1 Radial distribution functions of a nitrogen molecule at 500 kelvin, computed with a minimal STO-3G basis set and an extensive triple-zeta basis set. A B3LYP functional26–29 was used with both basis sets. Because the minimal basis set overestimates the equilibrium bond distance with a prediction of 1.134 Å, unlabeled data generated from a simulation at this temperature with this basis set will contain almost no molecular geometries where the bond distance is below the equilibrium bond distance of around 1.085 Å predicted by the more accurate triple-zeta basis set. Labeling this data with the triple-zeta basis set will therefore lead to a biased training set that lacks information from the repulsive regime, creating a significant risk that the model degrades in accuracy for such states during a simulation. For reference, the experimental bond distance is 1.098 Å, which is significantly closer to the triple-zeta prediction.
Using the foundational neural network model itself to generate the training data for fine-tuning is also often impractical. For example, the model might predict the wrong phases at the target temperature, as will be illustrated in a later example. This is especially problematic when the correct phase diagram of the material is not known beforehand.
In general, a problem with using a faster approximate method to generate unlabeled training data is that, while trajectories generated by such models can often be accurate, it is difficult to assess whether this is the case in any particular application. This is especially challenging because a qualitatively inaccurate modeling of a system phenomenon might still appear plausible without extensive a posteriori analysis. If such a model simulates a process of interest, such as a state transition or phase behaviour, qualitatively inaccurately, then the training data generated from the simulation will lack explicit and accurate examples of this phenomenon. It is therefore difficult to assess whether the subsequently fine-tuned model will be able to accurately simulate the processes of interest, since a lack of relevant training structures could cause a significant degradation in its accuracy. This poses a significant obstacle to the trustworthiness of the fine-tuned model.
Even though researchers with sufficient experience in machine learning force fields often find ways to create training datasets for specific applications, the above discussion illustrates the significant challenges and effort involved in creating training datasets in general. This tedious and challenging burden should, therefore, ideally be automated, allowing computational materials scientists and chemists to focus on the underlying physics and lowering the technical barrier of entry to fine-tuning models to a reliable, trustworthy and accurate state for less experienced users of machine learning force fields.
Moreover, for fine-tuning foundational neural network potentials inside high-throughput materials discovery workflows, for example for calculating thermodynamic properties of candidate materials, it would be very desirable if the fine-tuning process were fully automated, without requiring human input for each material.
Uncertainty-based active learning methods have established themselves as a way to automate the selection of training data30–33 by sampling structures with high model uncertainty and potential energies that are low enough to be accessible at the target temperature. Uncertainty-quantification methods and uncertainty-based active learning are also attractive, because they provide an additional layer of trustworthiness and reliability by aiming to maintain high model confidence for all structures that have a high likelihood of occurring during simulations, even during rare events. Thus, the use of uncertainty-based active learning for fine-tuning foundation models appears very promising.
For example, on-the-fly learning is an elegant approach30,34 where the pre-trained model might be used to drive the dynamics (e.g., MD simulation, transition state optimization, etc.) until a configuration is encountered that exceeds a certain uncertainty threshold. This configuration is then recalculated with quantum-chemical simulation methods and used to update the model, which then resumes the task until the next configuration above the uncertainty threshold is encountered (Fig. 2). This approach has the additional benefit of being very easy to use for non-machine learning experts since essentially only the initial state of the system and the uncertainty threshold would have to be specified.
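The loop described above can be sketched in code. The following is a deliberately toy, self-contained illustration, not the actual implementation: a 1D particle in a quadratic potential stands in for the atomic system, a nearest-neighbour lookup for the (Bayesian) force field, and the distance to the closest training point for the model uncertainty. All names are hypothetical.

```python
# Toy sketch of the on-the-fly learning loop: run dynamics with the cheap
# model, and whenever its uncertainty exceeds a threshold, recompute the
# configuration with the reference method ("DFT") and grow the training set.

def true_force(x):
    """The 'DFT' reference: F = -dE/dx for E = x**2."""
    return -2.0 * x

def predict(dataset, x):
    """Nearest-neighbour force prediction plus a distance-based uncertainty."""
    x_near, f_near = min(dataset, key=lambda p: abs(p[0] - x))
    return f_near, abs(x_near - x)

def on_the_fly_md(x0, n_steps, threshold, dt=0.05):
    x, v = x0, 0.0
    dataset = [(x, true_force(x))]        # label the initial structure
    dft_calls = 1
    for _ in range(n_steps):
        force, uncertainty = predict(dataset, x)
        if uncertainty > threshold:       # model not trusted here:
            force = true_force(x)         # recalculate with 'DFT' ...
            dataset.append((x, force))    # ... grow the training set ...
            dft_calls += 1                # ... and use the reference force
        v += force * dt                   # semi-implicit Euler MD step
        x += v * dt
    return dataset, dft_calls

data, calls = on_the_fly_md(x0=1.0, n_steps=200, threshold=0.1)
```

As in the workflow described above, calls to the reference method cluster early on and become infrequent once the visited region of configuration space is covered by training points.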
Fig. 2 An illustration of the on-the-fly learning approach based on Bayesian neural networks introduced in this work.
Unfortunately, most foundational neural network potentials do not come with an uncertainty estimate for their predictions. Hence, the development of a framework to systematically update a foundational neural network potential on new data while also being able to assess its uncertainty would be very desirable.
Another challenge for fine-tuning foundational neural network potentials on the fly is that the fine-tuning algorithm must fit the training data without overfitting a neural network with millions of trainable parameters, while the training dataset is progressively built up from a single initial sample to possibly hundreds of samples and no validation dataset is available to detect overfitting.
Moreover, a general challenge in uncertainty quantification with neural networks is that, even if the predicted uncertainties are well correlated with the error, this correspondence is often not one-to-one. For example, while a higher standard deviation in the predictions of an ensemble of models might imply a larger error than a smaller one, a predicted standard deviation of σpred = 1 might correspond to an actual standard deviation of the error of σobserved = 1.5. If a validation set exists, the uncertainty estimates can be calibrated, for example by rescaling the predicted standard deviation σpred → ξ × σpred, where the rescaling factor ξ is estimated on the validation set by matching the mean predicted variance to the observed mean squared error. Unfortunately, in on-the-fly learning this is not an option, since no validation set exists. In general, only the predicted uncertainties and observed errors from the configurations that were recalculated in DFT due to a large uncertainty can be used as empirical data to estimate the miscalibration of the model.
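The validation-set recalibration described above amounts to a one-line estimator. A minimal sketch on synthetic data, reproducing the example from the text where the model predicts σpred = 1 everywhere while the actual error scale is σobserved = 1.5:

```python
import numpy as np

def recalibration_factor(sigma_pred, errors):
    """ξ such that mean((ξ·σ_pred)²) equals the observed mean squared error."""
    sigma_pred = np.asarray(sigma_pred, dtype=float)
    errors = np.asarray(errors, dtype=float)
    return float(np.sqrt(np.mean(errors**2) / np.mean(sigma_pred**2)))

# Synthetic validation set: constant predicted std of 1, true error scale 1.5.
rng = np.random.default_rng(0)
sigma = np.full(10_000, 1.0)
errs = rng.normal(0.0, 1.5, size=10_000)
xi = recalibration_factor(sigma, errs)   # recovers ξ ≈ 1.5
```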
A compounding challenge is that if the recalibration factor is at some point erroneously estimated to be too small, the resulting miscalibration might not be correctable, since it suppresses the new DFT calls that could be used to calibrate the model more accurately.
Bayesian neural networks have demonstrated a high quality of uncertainty quantification comparable to classical ensemble-based methods of uncertainty quantification for machine learning models of interatomic forces and potential energies.35,36 Further, they are inherently more robust to overfitting by weighing preexisting knowledge in the form of the Bayesian prior density and empirical evidence from training data via Bayes' theorem.37,38
Due to these inherent properties, the principal research question of this paper is whether it is possible to develop a framework for on-the-fly fine-tuning of foundation models based on the formalism of Bayesian neural networks that is capable of addressing the additional challenges mentioned above.
The main contributions of this work are the following:
• We develop a simple Bayesian framework for uncertainty-aware fine-tuning of foundation models by harnessing a transfer learning prior as well as Markov Chain Monte Carlo (MCMC) sampling of an ensemble of models, assessing the uncertainty via the disagreement in the predictions of the models.
• We introduce a method for on-the-fly calibration of the uncertainties during the fine-tuning process.
• We demonstrate that the resulting on-the-fly learning workflow is capable of automatically fine-tuning foundation models to a pre-specified accuracy and biasing the training dataset towards rare events.
Machine-learned interatomic force fields for molecules aim to map an atomic configuration {(r1, z1), …, (rn, zn)}, composed of the nuclear coordinates ri and nuclear charges zi, to the potential energy E and the forces Fi acting on each nucleus i ∈ {1, …, n}. The predicted forces can then, for example, be used in combination with Newton's equations of motion to model the time evolution of the molecules. For materials, the input will usually also contain the lattice vectors, and the stress tensor will often be predicted as well. Because generating large amounts of high-quality training data is typically infeasible due to the high computational demand of quantum-chemical simulation methods, modern machine learning models have several forms of physical constraints built into them in order to make them more data efficient,7–9,11,39,40 such as energy conservation, enforced by calculating the forces as the negative analytical gradient of the potential energy with respect to the atomic positions. Further, rotation invariance of the potential energy is explicitly built into modern neural network architectures.
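The energy-conservation constraint mentioned above (forces as the negative gradient of the predicted energy) can be illustrated on a simple analytic potential. A Lennard-Jones dimer stands in for a learned model here purely for illustration; the analytic forces are checked against a finite-difference gradient of the energy.

```python
import numpy as np

# Forces as F_i = -dE/dr_i: analytic gradient of a Lennard-Jones dimer,
# verified against central finite differences of the energy.

def energy(r, eps=1.0, sigma=1.0):
    """Lennard-Jones energy of a two-atom system; r has shape (2, 3)."""
    d = np.linalg.norm(r[0] - r[1])
    return 4 * eps * ((sigma / d) ** 12 - (sigma / d) ** 6)

def forces(r, eps=1.0, sigma=1.0):
    """Analytic F_i = -dE/dr_i for the dimer."""
    diff = r[0] - r[1]
    d = np.linalg.norm(diff)
    dE_dd = 4 * eps * (-12 * sigma**12 / d**13 + 6 * sigma**6 / d**7)
    f0 = -dE_dd * diff / d           # force on atom 0
    return np.array([f0, -f0])       # Newton's third law gives atom 1

def forces_fd(r, h=1e-6):
    """Central finite differences of -E as a consistency check."""
    out = np.zeros_like(r)
    for i in range(r.shape[0]):
        for k in range(r.shape[1]):
            rp, rm = r.copy(), r.copy()
            rp[i, k] += h
            rm[i, k] -= h
            out[i, k] = -(energy(rp) - energy(rm)) / (2 * h)
    return out

positions = np.array([[0.0, 0.0, 0.0], [1.2, 0.3, -0.1]])
```

In the neural network potentials used in this work, the same construction is realized with automatic differentiation instead of an analytic formula.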
Starting from a prior density p(θ) over the model parameters θ, the prior density gets refined into the posterior density p(θ|D) on a training dataset D using Bayes' theorem: p(θ|D) = p(D|θ)p(θ)/p(D), where p(D) = ∫p(D|θ)p(θ)dθ is a normalization constant. On a new input sample x, the probability distribution of the target variable y can then be calculated via p(y|x, D) = ∫p(y|x, θ)p(θ|D)dθ. A more sophisticated approach for constructing a transfer learning prior was introduced by Shwartz-Ziv et al.,37 where a rescaled local approximation of the posterior on the pretraining dataset was used as a prior. However, in the applications considered here, this approach is not practical, since such a prior would have to be constructed during the pretraining of the foundation models, while we focus on fine-tuning foundation models that have already been pre-trained.
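For neural networks, this predictive integral is intractable and is in practice approximated by averaging over posterior samples of the parameters, as done in this work with an MCMC-sampled ensemble. A minimal sketch with toy linear "models" standing in for the sampled networks:

```python
import numpy as np

def predictive(theta_samples, x):
    """Monte Carlo approximation of the predictive distribution:
    mean and spread of the predictions of the sampled models."""
    preds = np.array([theta @ x for theta in theta_samples])
    return float(preds.mean()), float(preds.std())

# Eight posterior samples (M = 8, as used in this work) of a toy linear
# model scattered around a 'true' parameter vector.
rng = np.random.default_rng(1)
posterior_samples = [np.array([2.0, -1.0]) + 0.1 * rng.normal(size=2)
                     for _ in range(8)]
x_new = np.array([1.0, 0.5])
mean, sigma = predictive(posterior_samples, x_new)
```

The ensemble spread `sigma` is the disagreement-based uncertainty estimate used throughout this work.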
Transfer learning of a pre-trained model for interatomic force fields has been investigated by Kolluru et al.,21 Zaverkin et al.,22 Smith et al.23 and Falk et al.24 However, the modeling of uncertainty was not considered in those works.
After we presented our initial investigation of the uncertainty-aware transfer learning approach in a non-archival peer-reviewed venue,49 one other work on uncertainty-aware fine-tuning has very recently been published.50 However, integration into an active learning workflow, such as on-the-fly learning, is not investigated in that work. Further, the authors investigate settings where large training datasets of thousands of samples are available and focus on fine-tuning for entire subclasses of materials. In contrast, we focus on fine-tuning for specific systems, where the size of the training dataset ranges from one to a few hundred samples.
Finally, there are the seminal papers by Li et al.51 and Vandermause et al.30 on on-the-fly learning for molecular dynamics. However, they use Gaussian processes as their surrogate model, which are less data efficient than modern neural network architectures and, furthermore, require training from scratch instead of fine-tuning a pre-trained model.
To calibrate the uncertainties on the fly, we use the following Bayesian procedure:
The uncalibrated Bayesian Neural Network (BNN) models the error eobs = Epred − Etrue as a zero-mean random variable whose scale is given by the predicted standard deviation σpred. To calibrate the model, we estimate a variance-scaling factor λ in order to match the raw uncertainties produced by the model to actual error estimates. To do this, we use a Bayesian estimator for λ, introducing a prior in the form of a Gamma distribution over λ. Conditioning this prior on the errors observed at previous DFT calls yields a closed-form posterior over λ and a corresponding predictive distribution for the error on new structures (see Appendix B.11 for the derivation of this result).
We use this result to decide whether a DFT calculation should be performed by specifying an error threshold K and calling a DFT calculation if the predicted probability of the error magnitude exceeding K is larger than five percent. If the probability is smaller than five percent, the model continues with the MD-simulation. Otherwise, a DFT reference calculation is performed and added to the training dataset. Afterwards, we use Markov chain sampling to generate an ensemble of eight neural networks from the updated posterior. To sample the posterior density, we use the AMSGrad version of the Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) algorithm46 introduced by us in a previous work.36 We use the old Monte Carlo samples as starting points for sampling the new ones and simulate short Markov chains from those initial seeds with priority sampling of the newly added structure to achieve faster convergence (see Appendix B.5 for more details on the sampling procedure). Finally, the DFT forces and stresses are used to perform this MD step. The resulting workflow is illustrated in Fig. 2.
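The decision rule can be sketched as follows. This is a simplified illustration, not the paper's exact procedure: a Gaussian with the posterior-mean calibrated variance replaces the predictive distribution derived in Appendix B.11, and the conjugate Gamma update assumes the illustrative error model e_i ~ N(0, σ_i²/λ).

```python
import math

# Simplified sketch: update a Gamma(a, b) posterior over the variance-scaling
# factor λ from the errors observed at previous DFT calls, then request a DFT
# calculation when P(|error| > K) exceeds five percent.

def posterior_ab(a, b, errors, sigmas):
    """Conjugate Gamma update assuming e_i ~ N(0, σ_i²/λ)."""
    n = len(errors)
    return a + 0.5 * n, b + 0.5 * sum((e / s) ** 2 for e, s in zip(errors, sigmas))

def p_error_exceeds(K, sigma_pred, a, b):
    """Gaussian approximation with the posterior-mean variance scale:
    E[1/λ] = b/(a-1) for a > 1, so σ_cal² = σ_pred² · b/(a-1)."""
    sigma_cal = sigma_pred * math.sqrt(b / (a - 1))
    z = K / (sigma_cal * math.sqrt(2))
    return 1.0 - math.erf(z)          # P(|e| > K) for a zero-mean Gaussian

a, b = 3.0, 10.0                       # prior used for the NequIP model
a, b = posterior_ab(a, b, errors=[0.4, 0.1, 0.3], sigmas=[0.2, 0.2, 0.2])
call_dft = p_error_exceeds(K=0.5, sigma_pred=0.3, a=a, b=b) > 0.05
```

A confident prediction (small σpred) keeps the simulation running, while a large calibrated uncertainty triggers a reference calculation.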
For our experiments, we use three different neural network potentials, NequIP,9 MACE12,13 and Equiformerv2.14 To apply the Bayesian neural network formalism to these models, we add a few layers to the models so that each model predicts a distribution instead of making point predictions. The details of these modifications are provided in Appendices B.1–B.5. Additional details about how predictions and uncertainty quantification are done with the BNN can be found in Appendix B.9.
We always label the initial atomic structure with DFT and use this first data point to generate the initial Monte Carlo samples. Because potential energies are only defined up to a constant, and this constant might differ between the data the model was pre-trained on and the simulation software used for fine-tuning, we also use this initial labeled data point to construct an offset value that is added to all energies calculated by DFT during the on-the-fly simulation. This value is chosen so that the offset DFT energy equals the potential energy predicted by the pre-trained model on the initial structure.
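The offset construction is a one-liner; the numbers below are purely illustrative, standing in for the arbitrary energy zeros of the pre-trained model and the DFT code:

```python
# Align the DFT energy scale with the pre-trained model's energy scale.
e_model_initial = -1523.7   # model prediction on the initial structure (illustrative)
e_dft_initial = -897.2      # raw DFT energy of the same structure (illustrative)

# Offset chosen so the shifted DFT energy matches the model prediction.
offset = e_model_initial - e_dft_initial

def align(e_dft):
    """Shift applied to every DFT energy computed during the on-the-fly run."""
    return e_dft + offset
```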
To illustrate the effect of training a Bayesian neural network model with the prior constructed from the weights of a pre-trained model, compared to training a model from scratch, we start with a transfer learning scenario for potential energies of a surface adsorbate system. We use a publicly available EquiformerV2 model14 with 31 million trainable parameters, which was pre-trained on the OC20 dataset.52 For the target dataset, we choose the NbSiAs surface – COH adsorbate dataset from the OC20-Dense dataset.53 This system is not included in the pre-training data of the foundation model, and both datasets are computed with identical DFT settings. Notably, this benchmark does not yet involve the on-the-fly learning workflow and is instead performed on a publicly available benchmark dataset. We partition the dataset into training, validation, and test data and then generate Monte Carlo samples from the posterior resulting from the training dataset (see Appendices B.5 and C.4 for details). We compare the model accuracy to that of a comparable BNN trained with an uninformative prior. Further, we illustrate the quality of the resulting uncertainty quantification by calculating the Area Under the Receiver Operating Characteristic Curve (AUC-ROC) scores for determining whether a test sample has a prediction error above or below 1 kcal mol−1, based on the standard deviation in the predictions of the model. Because recognizing samples with an error above a certain threshold becomes easier at higher accuracies, we plot the AUC-ROC scores over the Root Mean Squared Errors (RMSEs) on the test set for this metric.
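The AUC-ROC metric used here can be computed directly from its probabilistic definition: the probability that a randomly chosen sample with an error above the threshold receives a higher predicted uncertainty than a randomly chosen sample below it. A minimal sketch on synthetic data:

```python
import numpy as np

def auc_roc(scores, labels):
    """AUC-ROC via its pairwise-ranking definition (ties count 1/2)."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, bool)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Synthetic test set: errors drawn with the predicted stds as their true
# scales, then labeled by whether |error| exceeds a 1 kcal/mol threshold.
rng = np.random.default_rng(2)
sigma_pred = rng.uniform(0.1, 2.0, size=500)
errors = rng.normal(0.0, sigma_pred)
above_threshold = np.abs(errors) > 1.0
score = auc_roc(sigma_pred, above_threshold)   # well above the 0.5 chance level
```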
To assess the viability of this transfer learning approach to fine-tuning models, additional transfer learning ablation experiments were performed, focusing on transfer learning of molecular forces with a NequIP model in several settings, including a change of electronic structure theory from DFTB54 to DFT and from DFT to CCSD(T).55 These experiments and their results are listed in Appendix A.
As the first benchmark for the proposed on-the-fly fine-tuning workflow, we investigate the on-the-fly fine-tuning of a NequIP9 model during a molecular dynamics simulation of an ethanol molecule at 300 kelvin. The NequIP model has 13.45 million parameters and was pre-trained by us on the SPICE dataset.56 Ethanol was not contained in the pre-training data, and different levels of theory were used for fine-tuning (B3LYP26–29) and pre-training (ωB97M-D3(BJ)57–59). Despite its small size, ethanol is an interesting benchmark molecule because it has different meta-stable conformations – one anti and two symmetry-related gauche conformations (Fig. 4). Consequently, transition states between those conformations are expected to occur during an MD-simulation at this temperature, but they are relatively rare due to their higher potential energy compared with the metastable conformations. A good on-the-fly learning workflow has to identify such transition states via the uncertainty measure and add them to the training dataset at a higher rate than their rate of occurrence during the simulation, to ensure the accuracy of the model during transitions. For this benchmark, we set the target accuracy to 0.5 kcal mol−1 and perform a 10 ps simulation.
For the second experiment with the proposed on-the-fly fine-tuning algorithm, we run a simulation of a 2 × 2 × 2 LaMnO3 supercell with the mace-mp0 medium model with 4.69 million trainable parameters,12,13 pre-trained on the MPtrj dataset.15,60 This system is interesting because it undergoes a phase transition at around 750 kelvin (Fig. 5a). Below 750 K, LaMnO3 crystallizes in an orthorhombic phase with space group Pbnm due to the strong Jahn–Teller activity of manganese. The orthorhombic phase is an anti-ferromagnetic insulator in which an orbital ordering is established by the cooperative Jahn–Teller effect breaking the degeneracy of the electronic configuration of Mn3+ (t2g3eg1).61 For this reason, we perform a 150 ps molecular dynamics simulation with two temperature jumps. For the first 25 ps, we run the simulation at 300 kelvin. At the 25 ps mark, we introduce a temperature jump to 800 kelvin. We keep this temperature for 100 ps, after which we introduce a second temperature jump back to 300 kelvin and continue the simulation for another 25 ps. Some geometries of this system were contained in the training data of the foundation model, and we use the same level of theory for fine-tuning as was used for pre-training (PBE + U62,63). However, as the experiments will show, the non-fine-tuned model has insufficient accuracy to model this system physically accurately, making it a good benchmark for transfer learning experiments.
The final benchmark for the proposed on-the-fly fine-tuning algorithm is a proton diffusion simulation in a 160-atom CaZrS3 supercell at 1500 kelvin, again with the mace-mp0 medium model. This benchmark is challenging because proton mobility is intimately linked to the dynamics of the host lattice. Occasional close proximity of two sulfur atoms belonging to non-connected ZrS6 octahedra enables a direct jump between the two ZrS6 octahedra with virtually zero activation energy64 (Fig. 6a), resulting in unusually high proton mobility at low temperatures. The S–S distance is governed by a static lattice distortion due to size mismatches between Ca and Zr as well as by the vibrational properties of the sulfur sublattice. Hence, this benchmark goes beyond the mere reflection of transition state energies; it also requires a correct representation of the dynamics of the host lattice. Accurately modeling the proton transport in this system might therefore require a very high model accuracy. Here, we evaluate the model with two different error thresholds, 5 kcal mol−1 and 15 kcal mol−1. We fine-tune both models on-the-fly during an initial 30 ps MD simulation and then use the fine-tuned models and the non-fine-tuned one to investigate the diffusion of protons in CaZrS3 with five additional 30 ps simulations, which the initial training runs revealed to be a well-suited time horizon for estimating proton mobility. While some geometries of the host lattice were contained in the pre-training data of the foundation model, no such structures with interstitial protons were included. We use the PBE62 level of theory for the fine-tuning run. We run the MACE model and the DFT calculations without dispersion correction.
All experiments were done with 8 Monte Carlo samples, which we identified in previous benchmarks as a good tradeoff between computational complexity and quality of uncertainty quantification.36 Details of the sampling procedures can be found in Appendix B.5, and further details for all simulations in Appendix B.6. We set σTL of the Bayesian neural network prior to 0.2 for all experiments involving the NequIP model, 0.5 for the MACE model and 0.02 for the EquiformerV2 model. For this method to be practically useful, the strength of the prior should not require precise tuning to a value unique to each specific application. For this reason, we did not perform a deep hyperparameter optimization for each model and experiment. Instead, we started with a small value of σTL for each model and incrementally relaxed it until the model could fit the data from an initial task well, and we then used these values for all experiments with the corresponding model. For the MACE model, the calibration task was the simulation of the LaMnO3 system; for the NequIP model, it was the paracetamol ablation experiment in Appendix A; and for the EquiformerV2 model, the NbSiAs surface – COH adsorbate dataset itself was used.
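The role of σTL above can be made concrete with a minimal sketch. Assuming, for illustration (the exact form of the prior is defined in the paper's appendix), an isotropic Gaussian prior centred on the pre-trained weights with width σTL, the log-prior penalizes deviation from the foundation model:

```python
import numpy as np

def log_prior(theta, theta_pretrained, sigma_tl):
    """Log-density (up to an additive constant) of an isotropic Gaussian
    centred on the pre-trained weights; σ_TL sets how far fine-tuning
    may move the model away from the foundation model."""
    diff = theta - theta_pretrained
    return -0.5 * float(np.sum(diff**2)) / sigma_tl**2

theta_pre = np.array([0.3, -1.2, 0.8])   # illustrative 'pre-trained weights'
sigma_tl = 0.2                           # the value used for NequIP above

at_pretrained = log_prior(theta_pre, theta_pre, sigma_tl)   # prior mode
moved = log_prior(theta_pre + 0.1, theta_pre, sigma_tl)     # penalized
```

A smaller σTL keeps the fine-tuned weights closer to the foundation model, which is also what counteracts overfitting and catastrophic forgetting on small datasets.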
Based on the results of our transfer learning experiments in Appendix A, we chose a = 3 and b = 10 for the Gamma prior over the calibration parameter for the NequIP model. Because we did not have such results for the MACE model, we chose the more conservative parameters a = 1.5 and b = 10 for the corresponding experiments. The lengths of the training runs (or of the phases of training runs for LaMnO3) were determined adaptively by starting with a conservative time estimate for each system and extending it based on the frequency of requested DFT calls until DFT requests became infrequent. The most important factor influencing the difference in required training times is the size of the configuration space of each compound, with LaMnO3 requiring a significantly longer training time due to the back-and-forth transitions between the high- and low-temperature configurations, which, due to finite-size effects of the 2 × 2 × 2 supercell, still happen at 800 K. Accuracy thresholds were chosen primarily based on the size of the system and secondarily on the expected size of the configuration space for the specific simulations.
Lastly, we also evaluated the number of DFT calls over time (Fig. 5d). We find that the model stops doing DFT reference calls very early during the initial low-temperature phase of the simulation. After the temperature jump, it starts doing DFT calls again, with the frequency decreasing over time. Further, the model stays well within the specified accuracy region with only four instances where the error is larger than the threshold at the time of intervention. For reference, we also plot the error of the non-fine-tuned model on the structures for which DFT calls were made, illustrating the improvement in accuracy.
After the initial training run, we use the fine-tuned models to analyse the square displacement of protons in CaZrS3. We perform five simulations with each model by first setting the target temperature with a Langevin thermostat during an initial one-picosecond simulation. Afterwards, we perform a 30 ps simulation in the NVE ensemble. Because two protons were placed in the supercell, this results in ten proton trajectories for each model.
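The squared-displacement analysis can be sketched as follows; a synthetic 3D random walk stands in for the ten proton trajectories, and the array shapes are illustrative:

```python
import numpy as np

def msd(trajectories):
    """Mean square displacement from the initial frame, averaged over
    trajectories.  `trajectories` has shape (n_traj, n_steps, 3)."""
    disp = trajectories - trajectories[:, :1, :]
    return (disp**2).sum(axis=-1).mean(axis=0)

# Ten synthetic random-walk trajectories stand in for the proton paths.
rng = np.random.default_rng(3)
steps = rng.normal(0.0, 0.05, size=(10, 3000, 3))
traj = np.cumsum(steps, axis=1)
curve = msd(traj)    # grows with time for a diffusing particle
```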
The squared displacements for the different models are shown in Fig. 6c. We see a very large Mean Square Displacement (MSD) for the non-fine-tuned model, which is reduced significantly when fine-tuned with an error threshold of 15 kcal mol−1. Interestingly, fine-tuning to a lower error threshold of 5 kcal mol−1 leads to the MSD rising again slightly.
To explore the source of this phenomenon, we investigated the mechanism behind the proton jumps for the individual models by analyzing the average sulfur–sulfur distance during the 50 fs leading up to each proton jump. For this, we have additional DFT reference data from a 30 ps DFT simulation from a previous research project in our group, in which the average separation at the time of the jump was analyzed. The results are shown in Fig. 6b. For the low error threshold model, we find very good agreement with the DFT reference value. For the higher threshold model, the distance is slightly too large. This offers a possible explanation for the slightly lower proton mobility of this model, since the activation energy for jumps will be slightly larger at this separation. For the non-fine-tuned model, the average sulfur–sulfur distance at the time of the jump is much larger than the DFT reference. This demonstrates that the transition states underlying the proton jumps are not accurately modeled without fine-tuning, since the activation energy at such distances should be very high, resulting in low proton mobility instead of the erroneously high mobility predicted by the foundation model.
To further assess the accuracy of the models at test time, we sampled 100 random geometries from the first simulation for each of the already fine-tuned models. From the results shown in Fig. 6d, it can be seen that the models are within their target accuracy for those configurations. For reference, we again include an evaluation of the error of the non-fine-tuned model on the geometries sampled from the simulation of the high-accuracy model and find a much poorer accuracy.
Overall, it took 31 DFT calls to train the model to the higher threshold and 285 calls to train it to the lower error threshold.
We performed an additional analysis to assess whether, after the initial learning phase of 5000 simulation steps, there is a preference for DFT calls during transition states where the proton jumps between sulfur atoms. Interestingly, despite the large size of the system with 162 atoms, the overall uncertainty of the system did not completely drown out the uncertainty from the transition states in the low-threshold training run. In particular, we found a statistically highly significant increase in DFT calls (p = 0.0012) in 15 fs time windows around proton jumps, with a 56 percent increase in the average number of DFT calls during time windows around proton jumps compared to time windows of equal size that do not contain proton jumps.
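The paper does not state which statistical test was used for this enrichment analysis, so the following is only one plausible sketch: a one-sided permutation test that compares the number of DFT calls in windows around jump events against windows placed at random positions. All function names and the window convention (a half-width of 7 steps giving 15-step windows) are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def window_call_counts(call_steps, event_steps, half_width, n_steps):
    """Count DFT calls inside a window of +/- half_width steps around each event."""
    calls = np.zeros(n_steps, dtype=bool)
    calls[call_steps] = True
    return np.array([calls[max(0, e - half_width):e + half_width + 1].sum()
                     for e in event_steps])

def permutation_test(call_steps, event_steps, half_width, n_steps, n_perm=10_000):
    """One-sided permutation p-value: are DFT calls enriched around the events?"""
    observed = window_call_counts(call_steps, event_steps, half_width, n_steps).mean()
    null = np.empty(n_perm)
    for i in range(n_perm):
        # place the same number of windows at random positions away from the edges
        shifted = rng.integers(half_width, n_steps - half_width, size=len(event_steps))
        null[i] = window_call_counts(call_steps, shifted, half_width, n_steps).mean()
    # add-one correction keeps the p-value strictly positive
    return (np.sum(null >= observed) + 1) / (n_perm + 1)
```

With DFT-call steps clustered near the jump steps, the test returns a small p-value; with calls scattered uniformly, it does not.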
Overall, the high data efficiency is primarily achieved via the transfer learning approach and not the uncertainty-guided on-the-fly learning, whose primary function is automation, ease of use and providing an additional layer of trustworthiness by maintaining high model confidence during the simulation. In particular, on-the-fly learning requires high intervention frequencies during the start of the simulation to maintain accuracy and to align the foundational neural network prior with the target system's specific potential energy surface. This might be suboptimal in terms of data efficiency since the initial structures will be correlated. Strategies that initially allow for a higher error threshold during a training run could be employed to reduce the initial sampling frequency, while risking that some of the structures sampled this way might retroactively turn out not to be relevant at the simulation temperature. This might, in practice, improve data efficiency further. However, error thresholds that are too large have the potential to lead to instabilities in the simulation, since a certain level of catastrophic forgetting cannot be ruled out even when using the transfer learning prior to counteract this effect. Due to the additional complications in terms of automation and ease of use that other active learning strategies can have, we chose to integrate the uncertainty-aware transfer learning approach into an on-the-fly learning workflow. However, due to its generality, the transfer learning approach introduced here can, in principle, be integrated into any uncertainty-guided active learning method.
Due to the simplicity of the uncertainty-aware transfer learning approach, it should further be straightforward to apply this approach to other neural network models not included in this work, since it worked out of the box for all neural network models investigated here after fitting only two hyperparameters for each model architecture individually: the strength of the BNN prior and the step size of the Markov chain. Moreover, during our experiments, the same hyperparameters for each architecture gave good results across different simulations and datasets. This indicates that strong baseline values for those hyperparameters can be determined and given as default values to end-users. For large domain shifts, the prior strength σTL and the number of training steps at each update may need to be increased for the model to sufficiently adapt to the target simulation. However, post-hoc analysis indicates that the variation in the scale of σTL across models seems to be largely driven by the scale of the weights of the respective foundation models: the EquiformerV2, which has substantially smaller weights on average, also needs a much smaller σTL. Therefore, even for large domain shifts, good values for σTL should still be of the same order of magnitude as the values determined here. Lastly, if the simulation is split into fine-tuning and production simulations, for example, to avoid violating energy conservation due to model updates, the same simulation hyperparameters (e.g., temperature or pressure) can be used for both simulation types, which keeps the selection of those parameters for training runs simple.
While reliance on Bayesian neural network methods represents a likely source of methodological unfamiliarity for end-users compared with more classical optimization methods for neural networks, the effects that these hyperparameters have on the optimization process are conceptually intuitive, with the strength of the prior controlling how far the fine-tuned model might deviate from the pretrained model in its predictions and the step size playing a role analogous to step sizes in regular optimizers. Combined with the existence of strong baseline values, the Bayesian nature should therefore not represent a high barrier to entry for computational material scientists and chemists.
Because we use the uncertainty in the energy predictions to decide if a DFT call should be done or not, the proposed fine-tuning workflow is geared towards applications where the accuracy of predicted potential energies is of primary interest, even though the DFT-computed forces and stresses are included as training labels during model optimization. An alternative approach would have been to use the uncertainty in the force predictions. We used the energy predictions because accuracy in the energies controls the accuracy of the thermodynamic ensembles, and accurate forces do not necessarily guarantee accurate potential energies, since even small errors in the forces can add up to large errors in the potential energy over large changes in the geometry. However, an advantage of using the force uncertainties would be that forces are localized. This makes it possible to choose individual error thresholds for each atom, where some atoms or regions might be of particular interest, such as in the case of proton diffusion. For this reason, we plan to implement the option to include the force uncertainties as a decision criterion for DFT interventions in the future. For some physical properties that arise from higher-order derivatives of the potential energy surface, such as vibrational frequencies, which depend on the Hessian, it is challenging to ensure accuracy with any training method, even on the training data itself, since the objective function used in training neural network potentials usually only involves the potential energy and its first-order derivatives. The main reason for this is that the first-order derivatives can be computed without significantly increasing the total computational complexity of the electronic structure calculation by leveraging the Hellmann–Feynman Theorem,66,67 while higher-order derivatives would be much more computationally demanding to compute.
Therefore, it is currently unclear how accurately such properties are modeled within this workflow.
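The per-atom force-uncertainty criterion discussed above is only a planned option, not part of the current workflow; a minimal sketch of what such a decision rule could look like (function name, the per-atom norm, and the threshold layout are our own assumptions) is:

```python
import numpy as np

def needs_dft(force_sigma, atom_thresholds):
    """Decide whether a DFT intervention is needed from per-atom force uncertainties.

    force_sigma:     (n_atoms, 3) predicted std devs of the force components.
    atom_thresholds: (n_atoms,) individual error thresholds, allowing tighter
                     thresholds for regions of interest (e.g. a diffusing proton).
    Triggers a call if any atom's force-uncertainty norm exceeds its threshold."""
    per_atom = np.linalg.norm(force_sigma, axis=1)
    return bool(np.any(per_atom > atom_thresholds))
```

In a proton-diffusion setting, one would set a small threshold for the proton and larger thresholds for the bulk atoms, so DFT calls concentrate on the region of interest.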
Regarding the speed-up our fine-tuning approach can achieve for simulations, the exact results will depend on the chosen error threshold, hardware availability and the DFT software used. We give some illustrative examples and estimates for different hardware configurations for the CaZrS3 – proton system in Appendix D. It only took 2000 training steps at a batch size of 5 to update each Monte Carlo sample on a new DFT-labeled geometry. Additionally, updating the model is highly parallelizable because the Markov chain of each Monte Carlo sample can be run independently. Nonetheless, on limited hardware, it might be better to only sample 4 instead of 8 Monte Carlo samples. In previous work, we observed that this reduces log-likelihoods by a value comparable to the decrease observed when reducing the number of Monte Carlo samples from 16 to 8.36 However, it is not advisable to go below 4 Monte Carlo samples, as the quality of uncertainty quantification starts to degrade more significantly, with a decrease of almost twice the value of going from 8 to 4 samples when reducing further from 4 to 2. In cases of very constrained GPU resources, using more computationally efficient variational inference-based BNN methods instead of MCMC methods might therefore be a better option.
Overall, the quality of uncertainty quantification was sufficient to keep the model in the specified accuracy range, with the fraction of structures that, at DFT interventions, exceeded the error threshold consistently staying below the 5 percent target. Further, the uncertainty measure was able to bias the training set towards rare events for both the ethanol and the proton diffusion example. However, there were still instances where the error was large despite a small predicted standard deviation, which leaves room for improvement.
One possible cause for this might be that we sample the different sets of weights from a small region around the weights of the pre-trained model, because of the chosen prior over the weights. In this work, we use a Gaussian prior as a proof of concept to show that by harnessing the prior of a BNN for transfer learning, enhanced data efficiency and useful predictive uncertainty estimates can be achieved. However, there is likely room for improvement in the design of the prior by experimenting with different types of parametric distributions. Further, recently, an approach to sampling Bayesian neural network weights has been introduced, where the prior can be specified on the function space instead of the weight space.68 This is an attractive alternative because it would avoid biasing the weights of the model to stay close to the original ones and instead would only bias the weights to result in predictions that lie within a certain range of the original model. Attempting to adapt this formalism to construct a more sophisticated prior from the function space, based on the belief that the neural network might need to change its predictions by a certain amount to be accurate for the system of interest, might therefore be a promising approach to improve the quality of uncertainty quantification.
Other factors can additionally make uncertainty quantification challenging in an on-the-fly learning setting. There can be discontinuities in the electronic structure of molecules or materials, for example, when changes in the atomic geometry push previously occupied electronic states through the Fermi level. This can cause discontinuities in the curvature of the potential energy surface, which makes uncertainty quantification challenging because current machine learning models are unaware of the electronic structure at this detail. Because of this, the first occurrences of geometries with novel electronic structures in an on-the-fly learning scenario will likely always be predicted with some degree of overconfidence.
We currently make the assumption in our calibration of the uncertainty that the miscalibration of the uncalibrated model uncertainties remains approximately constant over the course of the simulation. This assumption can, of course, be violated, for example, during phase transitions, and hence it might be beneficial to adjust the calibration procedure for such cases. One approach could be to not calculate Mn from an average but instead from an exponential moving average that will weigh more recent examples more strongly compared to older ones. Another interesting future research direction is extending this fine-tuning approach from fine-tuning for specific systems to entire subclasses of systems. Currently, a model that is fine-tuned to a specific system with the proposed workflow here will likely diminish in accuracy for almost any other system. Further, the uncertainty is calibrated to a different system and will likely also not always be transferable to other systems. However, by automatically generating training data from system-specific fine-tuning runs for many individual systems from this subclass and then updating the foundation model based on the merged training data, it might be possible to fine-tune foundation models for entire subclasses of structures in an automated workflow.
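The exponential-moving-average variant of the calibration suggested above could be sketched as follows. The ratio statistic |error|/σ and the decay rate are illustrative choices on our part; the paper does not specify the exact form of Mn.

```python
class EMACalibrator:
    """Maintain a multiplicative calibration factor M for predicted std devs
    as an exponential moving average of observed |error| / sigma ratios, so
    that more recent DFT interventions are weighted more strongly."""

    def __init__(self, decay=0.9, init=1.0):
        self.decay = decay
        self.m = init

    def update(self, abs_error, sigma):
        # blend the previous factor with the newest miscalibration ratio
        self.m = self.decay * self.m + (1.0 - self.decay) * (abs_error / sigma)
        return self.m

    def calibrated_sigma(self, sigma):
        # rescale a raw model uncertainty with the current factor
        return self.m * sigma
```

If the model systematically underestimates its errors by a factor of two, M converges to about 2 and the calibrated uncertainties double; after a phase transition, the EMA would adapt to the new miscalibration regime at a rate set by the decay.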
The code for running the on-the-fly learning simulations, including tutorials, is available at:
https://github.com/TimRensmeyer/OTFFineTune and has been archived on Zenodo (DOI: https://doi.org/10.5281/zenodo.18772098)
The ablation experiments for the transfer learning approach in isolation were done on public benchmark datasets. At the time of writing (25.02.2026), the datasets are available at the links below:
The stachyose dataset69 is available at:
https://www.quantum-machine.org/gdml/repo/datasets/md22_stachyose.npz.
The RMD17 dataset70 is available at:
https://figshare.com/articles/dataset/Revised_MD17_dataset_rMD17_/12672038.
The coupled cluster-level ethanol dataset71 is available at:
https://www.quantum-machine.org/gdml/data/npz/ethanol_ccsd_t.zip.
The OC20-Dense dataset53 is available at:
https://github.com/Open-Catalyst-Project/AdsorbML.
Only the NbSiAs surface – COH adsorbate data from this dataset was used in the ablation experiment.
Source code and in-depth illustrations on how the neural networks were trained and evaluated in those ablation experiments are available in the code appendix.
Supplementary information (SI): source code to perform the ablation experiments from the paper. See DOI: https://doi.org/10.1039/d5dd00392j.
The second benchmark is a transfer learning scenario from DFTB level accuracy to DFT level accuracy. In particular, we generate a large dataset of different configurations of a stachyose molecule in DFTB for pre-training and then utilize the stachyose data from the MD22 dataset for the transfer learning task.
The third test scenario is a transfer learning task for reaching CC level accuracy on an ethanol molecule starting from a model pre-trained on the corresponding ethanol data from the MD17 dataset. The CC-level dataset used for this was introduced by Bogojeski et al.71
All experiments were done with 8 Monte Carlo samples generated from the same Markov chain, which has been identified as a good tradeoff between computational complexity and quality of uncertainty quantification in our previous work.74 Details of all the datasets can be found in Appendix C. We set σTL to 0.2 for all experiments.
On the stachyose dataset, again, a clear improvement in accuracy at equal amounts of training samples is visible when compared to the baseline model (Fig. 7). However, both models have higher RMSEs than their counterparts on the paracetamol dataset at equal amounts of training configurations. The MLLs of the transfer learning model appear to be slightly lower than for a model trained from scratch when controlled for accuracy. The same is true, though to a much smaller degree, for the AUC-ROC scores. Further analysis revealed that the validation set was too small for the large configuration space of stachyose to properly recalibrate the uncertainties, which led to an overestimation of the errors on the test set for the transfer learned models but not for the baseline models. This also explains the near absence of such reduced performance in the AUC-ROC scores, which are invariant under recalibration of uncertainties. Accounting for this miscalibration by recalibrating the uncertainties on the test set instead of the validation set confirmed it as the main source of the gap in MLLs. In particular, the gap between the MLLs of the transfer learning models that are closest in RMSE reduced from 0.191 to 0.086. Notably, we also attempted a transfer learning scenario on the stachyose dataset, where the model was only pre-trained on the small molecules of the MD17 dataset. However, this did not lead to any improvement in data efficiency. The most likely explanation for this is that the local atomic environments on the small molecule datasets, which the neural network uses to calculate potential energy contributions, are qualitatively very different from those of the stachyose molecule.
The biggest improvement in accuracy, when compared to the baseline model, was found on the ethanol dataset (Fig. 8), with an RMSE of less than 0.5 kcal mol−1 Å−1 with only 10 configurations. There appears to be no decrease in the quality of uncertainty quantification, both in terms of MLLs as well as AUC-ROCs on this benchmark, when compared to the baseline model. Comparing both the baseline model and the transfer learning model to the deep ensemble, almost identical performance can be seen on the outlier detection task, while the deep ensemble has slightly higher MLLs.
Additional analysis was performed by breaking down the MLLs into contributions from force components, whose prediction error falls into a certain interval, e.g. the contribution to the MLLs from samples where the prediction error is in the interval [0.1, 0.2) is given by the sum over the log-likelihoods of all force components in the test set where the prediction error is in the interval [0.1, 0.2) divided by the total number of force components in the test set. The results shown in Fig. 9 demonstrate that the total MLL score is dominated by samples whose prediction error is small. This offers an explanation for the slightly better MLLs without a similar performance gap on the outlier detection task on this benchmark: the deep ensembles achieve slightly better uncertainty quantification on configurations with a small prediction error but not on configurations with a large prediction error.
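The binning of MLL contributions described above can be sketched as follows; the function name and the explicit Gaussian log-likelihood form are ours, and the paper's evaluation code may differ in detail. Each bin's contribution is normalized by the total sample count, so the contributions sum to the overall MLL.

```python
import numpy as np

def binned_mll_contributions(errors, sigmas, bin_edges):
    """Split the mean log-likelihood (MLL) into contributions from samples
    whose absolute prediction error falls into given bins. Dividing each
    bin's summed log-likelihood by the TOTAL number of samples makes the
    contributions add up to the overall MLL."""
    errors = np.abs(np.asarray(errors, dtype=float))
    sigmas = np.asarray(sigmas, dtype=float)
    # Gaussian log-likelihood of each force-component error under its sigma
    loglik = -0.5 * np.log(2 * np.pi * sigmas**2) - errors**2 / (2 * sigmas**2)
    n = len(errors)
    contribs = []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        mask = (errors >= lo) & (errors < hi)
        contribs.append(loglik[mask].sum() / n)
    return np.array(contribs)
```

Because well-predicted samples vastly outnumber outliers, the small-error bins dominate the total, which is exactly the effect discussed above.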
Notably, for all force transfer learning scenarios, the error of the pre-trained model was quite large with mean absolute errors of 2.31 kcal mol−1 Å−1 on the paracetamol validation set, 4.34 kcal mol−1 Å−1 on the stachyose validation set and 5.12 kcal mol−1 Å−1 on the ethanol validation set. Further, all pre-trained models achieved a validation loss smaller than 0.15 kcal mol−1 Å−1 on their pre-training datasets, strongly indicating that DFTB, DFT and CC methods disagree quite substantially in their force predictions for a given configuration. However, as was already alluded to in the introduction, this can potentially be traced back to simple biases in the simulation methods, such as slightly different equilibrium bond lengths. These small biases in different simulation methods can lead to qualitatively very similar force fields that may disagree substantially on the forces of a given configuration. This would also explain why transfer learning is very efficient in these cases, as the model mostly has to correct for those biases, such as equilibrium bond lengths. Importantly, those force fields will lead to similar predictions of physical and chemical properties despite their apparently large disagreement, while a machine-learned force field with a similar magnitude of error to one of those methods cannot, in general, be expected to yield those properties as well and hence needs to be trained to a much higher accuracy.
One additional result that stands out is the relatively high RMSE of both the transfer learning and the baseline model on the stachyose dataset when compared to the other two test scenarios. However, two factors make this dataset particularly challenging. First of all, stachyose is a larger molecule than paracetamol and ethanol, which, in addition, contains many single sigma bonds that allow for rotational degrees of freedom along the bond axis. This results in a very large configuration space for stachyose molecules, even relative to their size. The second factor that makes this benchmark more challenging for the transfer learning model is that, unlike in the ethanol case, the higher accuracy dataset was not composed of configurations generated from an MD trajectory of the lower accuracy method but instead from a trajectory at DFT-level accuracy. As a result, the distribution of configurations in the DFT dataset will be different from the one from the DFTB dataset.
Lastly, one important observation we made is that the transfer learning approach converges much faster than when training from scratch. While state-of-the-art models can take days to train from scratch, training and validation losses converged within minutes on the transfer learning tasks. The only reason we let the sampling algorithm run for as long as described in Appendix B is to make sure that no pathological overfitting takes place.
NequIP and MACE then calculate the forces acting on the atoms as the negative gradients of the potential energy via automatic differentiation libraries, while the EquiformerV2 model calculates the forces directly from a set of equivariant atomic feature vectors. To apply the Bayesian neural network framework to these models, the architectures were modified slightly by adding layers that compute standard deviations σ1, …, σn for the forces from the invariant features {v1, …, vn}. Further, an energy standard deviation σÊ is introduced, and the distribution over the energy and forces is modeled as a product of independent Gaussians centered on the predicted energy and forces with these standard deviations. For the EquiformerV2 and MACE models, the predictions are additionally conditioned on the lattice vectors L1, L2, L3. For the MACE model, the stress tensor is also predicted; for this, the predicted distribution is extended by an additional Gaussian factor over the stress tensor components.
The standard deviations of the forces σi are predicted by a three-layer MLP with input dimension 64, latent dimensions 32 and 16 and output dimension 1.
SiLU activation functions are used for the latent layers and the output activation function is the exponential function.
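The σ-head described above can be sketched in a few lines. The numpy forward pass below mirrors the 64 → 32 → 16 → 1 layout, SiLU latent activations and exponential output activation; the He-style random initialization is our own choice for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

class SigmaHead:
    """Sketch of the uncertainty head: a three-layer MLP (64 -> 32 -> 16 -> 1)
    with SiLU latent activations and an exponential output, mapping each
    atom's invariant feature vector to a positive standard deviation."""

    def __init__(self, dims=(64, 32, 16, 1)):
        self.weights = [rng.normal(0.0, np.sqrt(2.0 / d_in), (d_in, d_out))
                        for d_in, d_out in zip(dims[:-1], dims[1:])]
        self.biases = [np.zeros(d) for d in dims[1:]]

    def __call__(self, features):  # features: (n_atoms, 64)
        h = features
        for w, b in zip(self.weights[:-1], self.biases[:-1]):
            h = silu(h @ w + b)
        # exponential output activation guarantees sigma_i > 0
        return np.exp(h @ self.weights[-1] + self.biases[-1]).squeeze(-1)
```

The exponential output is what makes the predicted σi strictly positive, so they can be used directly as Gaussian standard deviations in the likelihood.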
A base model with 5 interaction blocks was used.
Because only few labeled examples for the potential energies are available, a fixed energy standard deviation, set to 10 percent of the target accuracy, was used during the on-the-fly experiments.
For the experiments in Appendix A, no energies were included in the training; only the force labels were trained on.
The output of the mace-mp medium model contains feature vectors for each atom. 256 components of each feature vector are rotation invariant. From these features, the standard deviations of the forces σi are predicted by a three-layer MLP with input dimension 256, latent dimensions 64 and 16 and output dimension 1.
Again, a fixed energy standard deviation set to 10 percent of the target accuracy was used. For all experiments, a fixed stress standard deviation of 0.1/16 kcal mol−1 Å−3 was used.
with γ0 = 0.001 and cycle length K = 50 000 to sample the subsequent models from the same Markov chain at the end of each cycle.
The same procedure is also utilized for the baseline model on the stachyose test case. However, the initial convergence phase is shortened to 0.5·106 steps for the transfer learning model, as the other two test cases had revealed quicker convergence for the transfer learning models. For the paracetamol and ethanol cases, a batch size of 30 is used and for the stachyose case, it is set to 15. Analogous to the on-the-fly learning experiments, an energy offset was calculated for the training, validation and test data of the Equiformer model so that for the first sample in the training data, the energy equals the one predicted by the pre-trained model. For the surface-adsorbate transfer learning task, a batch size of 20 was used for the baseline model and for the transfer learning task, it is set to 15. Furthermore, the same sampling procedure is employed. For the baseline and transfer learning model, the step size γ is exponentially decreased from 10−4 and 10−5 to 10−7 and 10−8, respectively, during the initial convergence phase. This phase was 0.5·106 steps long for the baseline model and 105 steps long for the transfer learning model. Then, a cyclical sampling procedure is again employed to generate the other samples with γ0 = 0.0001 and cycle length K = 50 000. After the first 90 percent of the initial convergence phase, the mass term is kept constant to ensure close convergence to the posterior.
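The cyclical step-size schedule used to draw samples at the end of each cycle can be illustrated as below. The exact functional form is not specified in the text, so the cosine shape, common in cyclical SG-MCMC, is an assumption of this sketch.

```python
import math

def cyclical_step_size(t, gamma0=0.001, K=50_000):
    """Cosine cyclical step-size schedule, as often used in cyclical SG-MCMC:
    gamma restarts at gamma0 at the beginning of each cycle of length K and
    decays towards zero at the end of the cycle, where a model sample is
    drawn from the Markov chain."""
    frac = (t % K) / K  # position within the current cycle, in [0, 1)
    return 0.5 * gamma0 * (math.cos(math.pi * frac) + 1.0)
```

Near the end of a cycle, the vanishing step size lets the chain settle close to a posterior mode before the sample is taken; the restart then encourages exploration of a different mode in the next cycle.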
was used. For the MACE model, γ0 = 0.001 was used and for the NequIP model, it was set to γ0 = 0.00003. To improve the speed of convergence, priority sampling was used to increase the likelihood of sampling the newly added training sample at each iteration. The sampling probability was increased so that, on average, the newly added sample is contained once in each minibatch. Each sample in the minibatch estimator of the log-likelihood was weighted by the likelihood ratio of a uniform sampling procedure and the sampling probabilities used to ensure the minibatch estimator is unbiased.
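A minimal sketch of such a priority-sampling scheme is given below; the function name and the exact probability assignment are illustrative, but the two key properties match the description: the new sample appears on average once per minibatch, and the likelihood-ratio weights (uniform probability over actual sampling probability) keep the minibatch estimator unbiased.

```python
import numpy as np

rng = np.random.default_rng(0)

def priority_minibatch(n_data, batch_size, new_idx):
    """Draw a minibatch where the newly added training sample (new_idx) is
    expected to appear once per batch, and return importance weights that
    make a weighted log-likelihood minibatch estimator unbiased."""
    # give the new sample probability 1/batch_size per draw -> expected count 1
    p = np.full(n_data, (1.0 - 1.0 / batch_size) / (n_data - 1))
    p[new_idx] = 1.0 / batch_size
    idx = rng.choice(n_data, size=batch_size, p=p)
    # likelihood ratio of uniform sampling (1/n) to the actual probabilities
    weights = (1.0 / n_data) / p[idx]
    return idx, weights
```

Per draw, the expectation of the weighted term is Σj pj · (1/(n·pj)) · f(xj) = (1/n) Σj f(xj), i.e. the uniform-sampling average, which is the unbiasedness property referred to above.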
000 training steps, the energy MSE on the validation set was evaluated. The training was stopped after the energy validation loss had not improved for 3 epochs. The final model used is the one with the lowest validation loss during the training run.
i,j is the predicted expectation value for the i-th force component of that particular model and the corresponding standard deviation is indexed analogously. The variance is calculated over the Monte Carlo samples/ensemble models as the average of the predicted variances plus the variance of the per-model means (law of total variance). The mean of the predicted distribution was simply calculated as the average of the per-model means. The calculation of the final energy and stress distribution was done completely analogously.
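Under the equally weighted Gaussian-mixture view of the Monte Carlo samples, the combination of per-model means and standard deviations into a single predictive mean and uncertainty can be sketched as:

```python
import numpy as np

def combine_samples(means, sigmas):
    """Combine per-model predictions into one predictive distribution.

    means, sigmas: arrays of shape (n_models, ...) with each model's predicted
    mean and standard deviation. Treating the models as an equally weighted
    Gaussian mixture, the predictive mean is the average of the per-model
    means, and the predictive variance is the average predicted variance plus
    the variance of the means across models (law of total variance)."""
    mu = means.mean(axis=0)
    var = (sigmas**2).mean(axis=0) + means.var(axis=0)
    return mu, np.sqrt(var)
```

The second variance term is what lets disagreement between Monte Carlo samples show up as increased predictive uncertainty even when each individual model is confident.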
By integrating this density from e* = −K to e* = K, the result:
and
for x → ∞, it can be verified that for n → ∞ the predicted error distribution becomes
000 configurations generated from a molecular dynamics trajectory calculated at DFT level accuracy. The training and test datasets of ethanol at CCSD(T) level accuracy introduced by Bogojeski et al.71 were used for the transfer learning task. The last 10 configurations of the training set were used as validation data. The actual training data consisted of the first m configurations of the training dataset for varying values of m.
000 configurations from each MD17 dataset and all configurations from the MD22 datasets were used to form a pool of configurations from which 100 000 are randomly drawn as the pretraining dataset. For the actual training set, m configurations are randomly sampled from the MD17 paracetamol dataset for varying values of m. 10 additional configurations are randomly sampled as a validation set. The rest of the 106 490 configurations are used as a test set.
000 configurations. For both the geometry optimization as well as the MD simulation, a Hamiltonian with self-consistent charges83 and third-order corrections84 was used in correspondence with the 3ob-3-1 Slater–Koster files.85 For all atoms, s- and p-orbitals were used in the Hamiltonian. For the actual training set, m configurations are randomly sampled from the first 10 000 configurations of the MD22 stachyose dataset for varying values of m. 10 additional configurations are randomly sampled from configurations 10 100 to 10 900 as a validation set. Configurations 11 000 up to 27 000 are used as a test set.
8000 configurations were randomly sampled as possible training configurations. For a training dataset of size n, the first n configurations from that subset were used as a training set. Further, twenty randomly sampled configurations were used as a validation set to recalibrate the uncertainties. The rest of the configurations were used as the test set.
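One simple way to compute such a multiplicative recalibration factor from a validation set is shown below; the root-mean-square z-score estimator is one common choice, and the paper's exact estimator may differ. Note that AUC-ROC-style rankings are invariant under this global rescaling.

```python
import numpy as np

def recalibration_factor(val_errors, val_sigmas):
    """Single multiplicative factor M applied to all predicted std devs so
    that the calibrated uncertainties match the observed validation errors
    in a root-mean-square sense: with z = error / sigma, M = sqrt(mean(z^2)).
    M > 1 means the raw model was overconfident; M < 1, underconfident."""
    z2 = (np.asarray(val_errors, dtype=float) /
          np.asarray(val_sigmas, dtype=float)) ** 2
    return float(np.sqrt(z2.mean()))
```

With only twenty validation configurations, M is itself a noisy estimate, which is consistent with the miscalibration observed on the large-configuration-space stachyose benchmark.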
GPU nodes containing 8 L40S GPUs were used for the experiments and updating all Monte Carlo samples on a new training data point takes only a few minutes on that hardware. To test the impact of constrained GPU resources, the experiment for the largest system, the 162-atom CaZrS3-proton system, was rerun using only two A100 GPUs. On that hardware, it takes around 20 minutes to update the model. However, it should be noted that DFT calculations for systems of that size will also be quite time-intensive and on the two Intel Platinum 8360Y processors that were used, they took around 24–25 minutes.
DFT calculations involving two 36-core Intel Platinum 8360Y processors were conducted on a single compute node with two CPU sockets (four NUMA domains, 36 logical cores and 64 GB RAM per domain). For calculations with eight 36-core Intel Platinum 8360Y processors, VASP was run on four compute nodes. VASP was run using the full nodes, with MPI ranks and OpenMP threads distributed across CPUs. No explicit NUMA binding or memory placement was applied. Memory allocation followed the system's default NUMA policy. VASP was compiled with Intel compilers (icx, icpx, ifx) and Intel MPI (mpiicx, mpiicpx, mpiifx). Linear algebra routines used the Intel MKL library. The VASP binary used was vasp_std (CPU variant). No custom compilation options beyond the default module build were applied.
Neural network optimization was performed with torch 2.6 and mace-torch 0.3.0 using CUDA 12.4.1 without cuEquivariance acceleration.
| This journal is © The Royal Society of Chemistry 2026 |