Federico Grasselli*ab,
Sanggyu Chongc,
Venkat Kapildef,
Silvia Bonfantigh and
Kevin Rossi*ij
aDipartimento di Scienze Fisiche, Informatiche e Matematiche, Università degli Studi di Modena e Reggio Emilia, 41125 Modena, Italy. E-mail: federico.grasselli@unimore.it
bCNR NANO S3, 41125 Modena, Italy
cLaboratory of Computational Science and Modeling, Institute of Materials, École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland
dYusuf Hamied Department of Chemistry, University of Cambridge, Cambridge CB2 1EW, UK
eDepartment of Physics and Astronomy, University College London, London, UK
fThomas Young Centre, London Centre for Nanotechnology, University College London, London WC1E 6BT, UK
gCenter for Complexity and Biosystems, Department of Physics “Aldo Pontremoli”, University of Milan, Via Celoria 16, 20133 Milano, Italy
hNOMATEN Centre of Excellence, National Center for Nuclear Research, ul. A. Sołtana 7, 05-400 Swierk/Otwock, Poland
iDepartment of Materials Science and Engineering, Delft University of Technology, Delft, 2628 CD, The Netherlands. E-mail: k.r.rossi@tudelft.nl
jClimate Safety and Security Centre, Delft University of Technology, TU Delft the Hague Campus, The Hague, 2594 AC, The Netherlands
First published on 9th June 2025
The widespread adoption of machine learning surrogate models has significantly expanded the scale and complexity of the systems and processes that can be explored accurately and efficiently with atomistic modeling. However, the inherently data-driven nature of machine learning models introduces uncertainties that must be quantified, understood, and effectively managed to ensure reliable predictions and conclusions. Building upon these premises, in this perspective we first overview state-of-the-art uncertainty estimation methods, from Bayesian frameworks to ensembling techniques, and discuss their application in atomistic modeling. We then examine the interplay between model accuracy, uncertainty, training dataset composition, data acquisition strategies, model transferability, and robustness. In doing so, we synthesize insights from the existing literature and highlight areas of ongoing debate.
In many cases, chemists and materials scientists draw conclusions from incomplete or uncertain information, as is often the case when dealing with expensive, time-consuming, and oftentimes noisy measurements. Uncertainty quantification (UQ) provides a framework for systematically incorporating uncertainty into this scientific process, thereby enhancing the reliability, robustness, and applicability of experimental and theoretical results. In the context of materials science, chemistry, and condensed matter physics, researchers optimize materials and properties while accounting for uncertainties, variations, and errors in their measurements and theories (e.g., via replication and sensitivity analysis). This improves the reliability and validity of models of physical phenomena and of the design of novel materials and processes.
Nowadays, machine learning and artificial intelligence methods are emerging as key tools for accelerating the design, engineering, characterization, and understanding of materials, molecules, and reactions at interfaces. In the context of atomistic modeling, machine learning facilitates the development of predictive models and interatomic potentials that can simulate materials behavior with high accuracy and reduced computational cost compared to traditional methods. Incorporating UQ into these machine learning models is crucial for assessing the reliability of predictions and understanding the limitations of the models. By quantifying uncertainties, researchers can identify areas where the model's predictions are less certain, guide the selection of new data points for training (active learning), understand how to train robust models, and make informed decisions about the deployment of these models in practical applications.
In this perspective, we examine how the integration of machine learning and UQ enhances the predictive capabilities of atomistic simulations and ensures that inherent uncertainties are systematically accounted for, leading to more robust and reliable materials design and discovery. In this context, we acknowledge the recent contributions to the topic by Dai et al.2 and Kulichenko et al.3 By the same token, we remark that, while we aim for a complete discussion, we intentionally focus on a relatively restricted number of representative works in the literature. A search for the terms “Atomistic Modeling (or Modelling)” and “Uncertainty” shows that they appear together with increasing frequency, amounting to more than 5000 literature items (Fig. 1), highlighting the significance of the topic under scrutiny.
The focus of our work is then on analysing recent trends in uncertainty estimation methods—such as Bayesian frameworks and ensemble approaches—and their practical application to assess prediction reliability and guide efficient data acquisition. When possible, we synthesize and unify insights from the literature. Examples include the interpretation of uncertainty and extrapolation measures as Mahalanobis distances, and the discussion of geometrical and statistical approaches to define in- and out-of-distribution predictions. When consensus is lagging, we highlight key gaps and open research questions. These concern the benchmarking of uncertainty quantification methods and model transferability; uncertainty propagation for dynamical observables; and uncertainty quantification in data-efficient methods, including foundation and multi-fidelity approaches. In conclusion, we aim to provide a unified framework and highlight the open questions toward robust, efficient, and interpretable machine learning approaches for atomistic modeling.
(1) Accurate, by realistically modeling the true uncertainty associated with the machine learning (ML) prediction, and aiming to minimize bias and systematic errors;
(2) Precise enough to provide a sufficiently narrow range of possible values;
(3) Robust, against variations in the data or model assumptions, providing reliable results also when tested out-of-domain;
(4) Traceable and comprehensive, by capturing and identifying all the possible sources of uncertainty, which include the choice of hyperparameters and training set data points, or the stochastic optimization of non-deterministic models.
(5) Computationally efficient, requiring only a negligible overhead, possibly also in training, in obtaining the uncertainty values of interest from the ML model.
In what follows, we adopt operative definitions of uncertainty based on the variance (second moment) of the distribution of predictions (either theoretical or constructed via ensembles) to quantify the spread of uncertain outcomes. This definition indeed displays the properties listed above. The analysis of first and second moments only may not be fully descriptive for non-Gaussian (e.g., skew, heavy-tailed or multi-modal) distributions. Nonetheless, it provides an interpretable and computationally lightweight measure of variability. Furthermore, it aligns well with Gaussian or near-Gaussian models, such as those that are built from the Laplace approximation (see Sec. 2.1.3). Finally, it supports simple calibration strategies that compare the uncertainty estimate with the second moment of the empirical distribution of the residuals yi − ỹi between the reference value for input i and its ML prediction.
Finally, towards a clear and unified discussion, we spell out the notation we will adopt for our successive considerations:
x (or xi when labeling is needed) generic input/sample
X matrix collecting inputs in the training set as rows
fi array of features corresponding to xi
F matrix collecting training-set features as rows
w parameters of the model (a.k.a. weights)
ỹi ≡ ỹ(xi) machine-learning prediction for input xi
yi reference value corresponding to input xi
training dataset of input-label pairs (xi, yi), with i = 1, …, Ntrain
σ2i variance on prediction ỹi
α calibration constant (see Sec. 2.4)
term of the loss function corresponding to a single instance i of the training set
σ̃2(x) machine-learning prediction, corresponding to input x, of the uncertainty in mean-variance estimates and mean-variance ensembles, see Sec. 2.3.
σ⋆2 = α2f⋆⊤Gf⋆ | (1)
(1) It can be viewed as an inverse covariance matrix of the properly defined features of the input data points in the training set, i.e. G = [cov(F)]−1, where F ∈ ℝNtrain×Nf collects as rows the transpose of the feature vectors {fi}i=1,…,Ntrain of the training points;
(2) It is therefore strongly dependent on the distribution of the input points xi in the training set, and on how “close” the new point ⋆ is to such a distribution in this metric space;
(3) It is largely independent of the specific target values yi in the training set.†
We report below the specific, model-dependent expression of the features f and therefore of the metric tensor G.
ỹ(x,w) = x⊤w | (2) |
G = (X⊤X + ς2ID)−1 | (3) |
![]() | (4) |
![]() | (5) |
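As a concrete illustration of eqn (1)–(3), the following minimal sketch (synthetic data and arbitrary hyperparameters, not taken from any of the cited works) computes the Mahalanobis-type variance of a regularized linear model:

```python
# Illustrative sketch: predictive variance of a regularized linear model,
# sigma_*^2 = alpha^2 * x_*^T (X^T X + s^2 I)^-1 x_*, cf. eqn (1)-(3).
import numpy as np

rng = np.random.default_rng(0)
n_train, n_feat = 200, 5
X = rng.normal(size=(n_train, n_feat))                  # training inputs as rows
y = X @ rng.normal(size=n_feat) + 0.1 * rng.normal(size=n_train)

regularizer = 1e-3                                      # plays the role of ς^2
G = np.linalg.inv(X.T @ X + regularizer * np.eye(n_feat))   # metric tensor, eqn (3)
w = G @ X.T @ y                                         # ridge weights

x_star = rng.normal(size=n_feat)                        # new input
y_star = x_star @ w                                     # prediction, eqn (2)
alpha2 = 1.0                                            # calibration constant (Sec. 2.4)
var_star = alpha2 * x_star @ G @ x_star                 # Mahalanobis-type variance
```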
By assuming again that the weights are sampled from a zero-mean Gaussian prior, we have:
![]() | (6) |
![]() | (7) |
Oftentimes the reported GPR uncertainty formula is:
σ⋆2 = k(x⋆, x⋆) − k(x⋆, X)⊤[K(X, X) + ς2I]−1k(x⋆, X) | (8)
Mercer's theorem ensures that for any kernel there exists a possibly infinite (i.e. Nf → ∞) set of basis functions. Furthermore, the representation in terms of the functions ϕ is also very useful when sparse kernel approximations, such as the Nyström method detailed in Appendix A, are employed.
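For completeness, a minimal numerical sketch of the GPR predictive variance quoted above, using an RBF kernel and toy one-dimensional data (all choices are illustrative and not taken from the cited works), is:

```python
# Illustrative sketch: standard GPR predictive mean and variance for a new point.
import numpy as np

def rbf(A, B, length=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length**2)

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.normal(size=50)

noise = 1e-2                                   # regularizer / noise variance
K_inv = np.linalg.inv(rbf(X, X) + noise * np.eye(len(X)))

x_star = np.array([[0.7]])
k_star = rbf(X, x_star)[:, 0]                  # vector of kernels k(x_*, X)
mean_star = k_star @ K_inv @ y
var_star = rbf(x_star, x_star)[0, 0] - k_star @ K_inv @ k_star
```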
![]() | (9) |
![]() | (10) |
Ho is routinely approximated by its Gauss–Newton form, which employs only first-order derivatives of the predictions with respect to the weights, evaluated at the MAP and easily retrieved by backpropagation:
![]() | (11) |
Finally, the distribution of the output corresponding to the input ⋆ becomes:12
![]() | (12) |
from which the variance is obtained in the form of a Mahalanobis distance, eqn (1). In the context of atomistic modeling, this approach has been investigated by Zaverkin and Kästner, and leveraged for active-learning strategies (see also Sec. 3.5).13 The same results can be obtained through an alternative but equivalent mathematical construction that probes how robust an ML model is to a change in the prediction for an input ⋆, based on a constrained minimization of the loss14,15 (see Sec. 3.4 for more details).
Two remarks should be made: first, as it is reasonable to expect, the quality of this approximation depends on how much the posterior distribution of the NN model is close to a multivariate Gaussian; second, the large number of weights, and thus of components in ϕi, in current deep NN typical architectures makes the storage of Ho unfeasible, even in the Gauss–Newton approximation, due to memory requirements quadratic in the size of the ϕ arrays. The context of NN Gaussian processes,16 and in particular of the Neural Tangent Kernel formalism,17–19 provides the ideal theoretical framework to justify the first point and to find a viable strategy to overcome the second one. In Bigi et al.,15 the use of a last-layer (LL) approximation,12,20 was extensively justified, whereby only the derivatives of the predictions with respect to the LL weights wL, i.e. the LL latent features
fi ≡ ∂ỹ(xi)/∂wL | (13)
are considered in building Ho and in evaluating the variance of the prediction for a new input ⋆. For a mean square loss function, the latter becomes:‡
σ⋆2 = α2f⋆⊤(F⊤F + ς2I)−1f⋆ | (14)
(1) The presented derivation is based on the assumption that no additional nonlinear activation is applied to the product of the LL latent features and LL weights, i.e. that ỹ = f⊤wL. Things get more complicated whenever a nonlinear activation function φ is instead applied, ỹ = φ(f⊤wL), even though the correct distribution of the prediction can in principle still be sampled (e.g., by Monte Carlo integration).
(2) The LL latent features fi do not explicitly depend on the LL weights wL, which are the weights reported to change the most during training.17 As such, the LL latent features, as well as the covariance matrix F⊤F, are expected to remain rather constant during training.18 Furthermore, the different elements of the array of LL latent features are identically distributed at initialization, and centered around zero, for any given sample, because the weights (and in particular wL−1) are taken as independent, identically distributed and centered around zero. This implies that the additional enforcement of feature centering should not change the result of the uncertainty estimate.
(3) Numerical experiments indicate that, analogously to linear regression, while the calibration of α2 is crucial, the regularizer scarcely affects the uncertainty estimate, even when the regularizer strength ς2 is varied over several orders of magnitude, unless data are scarce or highly collinear. The role of the regularizer is further discussed in Appendix E of ref. 15.
Wenger et al. report on limitations of the NTK perspective. In particular, to leverage the advantages of the NTK formulation, one would need “architectures that are orders of magnitude wider than they are deep” (verbatim from Wenger et al.).21 This is in fact the case for several current architectures of ML models in atomistic learning.§ Note that the NTK theory is valid for any Ntrain in the infinite-width limit but, for finite-width networks, the width must grow sufficiently fast (polynomially) with the number of samples for the NTK approximation to remain valid.
Beyond last-layer approximations, it has been shown in the context of atomistic machine learning that full-gradient representations, combined with random projections to reduce memory cost, can offer improved performance.24 For a broader perspective on gradient features and their connection to the Neural Tangent Kernel, we refer the interested reader also to Holzmüller et al.25
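In practice, the LL strategy amounts to reusing the linear-model machinery on the last-layer latent features. The sketch below is purely illustrative: the helper `last_layer_features` is a hypothetical stand-in for extracting the activations that feed the last linear layer of a trained network, and all hyperparameters are arbitrary.

```python
# Illustrative sketch of a last-layer (LL) uncertainty estimate, cf. eqn (14).
import numpy as np

def last_layer_features(x):
    # Hypothetical placeholder: in a real model, return the activations
    # entering the (linear) last layer for input x.
    return np.tanh(x @ np.ones((x.shape[-1], 8)))

rng = np.random.default_rng(2)
X_train = rng.normal(size=(500, 4))
F = last_layer_features(X_train)               # (N_train, N_f) LL latent features
regularizer = 1e-4                             # ς^2
H_inv = np.linalg.inv(F.T @ F + regularizer * np.eye(F.shape[1]))

x_star = rng.normal(size=(1, 4))
f_star = last_layer_features(x_star)[0]
alpha2 = 1.0                                   # to be calibrated (Sec. 2.4)
var_star = alpha2 * f_star @ H_inv @ f_star    # LL Laplace variance
```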
![]() | (15) |
from which one can quantify the uncertainty as the second moment of the distribution. When the Laplace approximation is invoked, one can obtain simple expressions like eqn (12). Whenever this is not possible, an explicit sampling of the posterior
![]() | (16) |
ȳ(x) = (1/M)∑m ỹ(m)(x) | (17)
σ2(x) = [1/(M − 1)]∑m[ỹ(m)(x) − ȳ(x)]2 | (18)
The ensemble members ỹ(m)(x) are routinely obtained in different ways:
(1) By subsampling the entire dataset and then training one model for each of the subsampled datasets. The size of each subsampled dataset depends on the subsampling technique, but usually amounts to a sizable fraction of the full training set.
(2) For models that are not trained via a deterministic approach, stochasticity in the model architecture and training details (e.g., varying the random seed, Monte Carlo dropout31) can be exploited to obtain the ensemble.
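A minimal sketch of strategy (1), with an off-the-shelf regressor and synthetic data (all choices illustrative), is:

```python
# Illustrative sketch: committee built by subsampling the training set; the
# spread of the members' predictions serves as the (uncalibrated) uncertainty.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 6))
y = X @ rng.normal(size=6) + 0.1 * rng.normal(size=300)

M = 8                                                   # committee size
models = []
for m in range(M):
    idx = rng.choice(len(X), size=int(0.8 * len(X)), replace=False)
    models.append(Ridge(alpha=1e-3).fit(X[idx], y[idx]))

x_star = rng.normal(size=(1, 6))
preds = np.array([model.predict(x_star)[0] for model in models])
mean_star, var_star = preds.mean(), preds.var(ddof=1)   # ensemble mean and spread
```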
![]() | (19) |
Busk et al.33 interpret σ̃2(x), predicted by the MV model as an additional model output, as an aleatoric uncertainty that may stem from random noise, data inconsistencies, or the model's inability to fit precisely, and that, as such, cannot typically be reduced by collecting more data. Incidentally, it should be remarked that, in the case of a dataset that assumes noiseless observations, all the deviations of the model's predictions from the observations must be considered as model bias,34 since considering the observations to be the ground truth implies that a perfect model should really pass through those points.
Ensemble models and MV estimation models can be combined to give rise to mean-variance ensembles, introduced by Lakshminarayanan et al.35 under the name of deep ensembles; see Fig. 3(a). In this approach, a committee of MV estimation models is created, where each member of the committee outputs a prediction and variance pair, (ỹ(m)(x), σ̃2,(m)(x)), and then the uncertainty is estimated by summing the variance of the predictions, as in eqn (18), with the average of the variances of the committee:
σ2(x) = [1/(M − 1)]∑m[ỹ(m)(x) − ȳ(x)]2 + (1/M)∑mσ̃2,(m)(x) | (20)
In this latter case, akin to the LL approximation motivated in Sec. 2.1.3, one could construct “shallow ensembles” (Fig. 3(b)), where only the last layer of the neural network is varied in obtaining an ensemble of models, and the rest of the weights are shared across the members. Such an approach effectively mitigates the large computational cost associated with the training of multiple neural network models and their inference. Kellner and Ceriotti present a version of this approach, where a shallow ensemble of models is trained using an NLL-like loss to obtain the empirical mean and variance through eqn (17) and (18).37
![]() | (21) |
![]() | (22) |
![]() | (23) |
![]() | (24) |
![]() | (25) |
![]() | (26) |
It is crucial to remark that eqn (26) is a biased estimator in the number of models composing the ensemble, M. Whenever M is small, eqn (26) should be replaced by the bias-corrected formula
![]() | (27) |
![]() | (28) |
If the estimated variance σ2(xi) follows a functional form such as eqn (1), the free parameter α2 must be adjusted, for example following the procedure of eqn (26), to align the expected variance with the observed MSE. Notably, in log–log space, varying α2 results in a rigid shift of the entire plot. Thus, even for uncalibrated UQ estimates, a linear correlation between the expected and observed distributions of (squared) residuals should still be apparent. If there is a poor linear correlation, this suggests that the Gaussian-like UQ framework defined by eqn (1) may be inadequate, as can occur in NN models with only few neurons per layer, and/or that a local calibration α2(x) may be necessary.
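As an illustration of such a global calibration step, the sketch below rescales uncalibrated variances by matching them, on a held-out set, to the observed squared residuals; the specific estimator of α2 used here is one simple choice among several, and does not necessarily reproduce the exact expression of eqn (26):

```python
# Illustrative sketch: global calibration of alpha^2 on a held-out set, plus a
# simple binned check of the correlation between predicted variance and MSE.
import numpy as np

def calibrate_alpha2(y_ref, y_pred, var_uncal):
    residuals2 = (y_ref - y_pred) ** 2
    return np.mean(residuals2 / var_uncal)     # alpha^2 rescaling factor

def binned_reliability(var_cal, residuals2, n_bins=10):
    # Sort by predicted variance, bin, and compare bin-averaged predicted
    # variance with the bin-averaged squared residual (should be ~linear).
    order = np.argsort(var_cal)
    return [(var_cal[b].mean(), residuals2[b].mean())
            for b in np.array_split(order, n_bins)]
```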
Along these lines it is worth mentioning the insightful work by Pernot,43,44 which involves stratifying the evaluation of a given calibration score, such as eqn (26), based on the predicted uncertainty: the dataset is split into subsets where the predicted uncertainty falls within certain ranges (e.g., low, medium, or high uncertainty predictions); calibration metrics are then calculated independently for each subset. This allows for a more detailed analysis of how well the model is calibrated across different levels of predicted uncertainty. By performing different types of stratification/binning of the dataset, Pernot shows that it is also possible to distinguish between consistency (the conditional calibration with respect to prediction uncertainty) and adaptivity (the conditional calibration with respect to input features), and that good consistency and good adaptivity are rather distinct and complementary calibration targets.45
Nonetheless, methods based on the miscalibration area can be challenging to interpret. In fact, they provide an indirect assessment of UQ quality since, by comparing CDFs, they inherently compare higher moments of the distributions. For example, two distributions may share similar second moments (quantities typically interpreted as uncertainty) but diverge significantly in their CDFs due to deviations from Gaussianity, such as skewness or heavy tails. This conflation of uncertainty with other aspects of the distribution's shape highlights a key limitation of such approaches.
For this reason, we recommend prioritizing the calibration strategies discussed in Sec. 2.4.1 and 2.4.2, which more clearly align with the intended interpretation of uncertainty.
• Train a regression model to give predictions ỹi
• Use a separate calibration dataset
• Compute the nonconformity score (typically the absolute error, si = |yi − ỹi|)
• Determine the (1 − α)-quantile q1−α of the sorted scores. For instance, if α = 0.05, then 95% of the scores shall lie below the value q0.95
• For a new input x⋆, construct the prediction interval as [ỹ(x⋆) − q1−α, ỹ(x⋆) + q1−α].
The prediction interval contains the true y with at least 1 − α confidence. The key assumption in CP is that the error distribution of the new (test) set is representative of the error distribution of the training and calibration sets. This allows 1 − α, derived from the calibration data, to apply universally across all inputs x. As a result, q1−α acts as a global threshold for constructing prediction intervals, a measure of the model's “typical error” that encapsulates how large the prediction errors tend to be (up to the (1 − α)-quantile), independent of any specific input x. Nonetheless, if the new set of inputs is significantly out of distribution, meaning it differs substantially from the training and calibration data, the assumptions underlying CP may no longer hold, and the prediction intervals obtained from CP might lose their validity.
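The recipe above can be condensed into a few lines; the sketch below uses synthetic data, an off-the-shelf regressor, and absolute-error nonconformity scores, and is meant only as an illustration of split CP:

```python
# Illustrative sketch of split conformal prediction with absolute-error scores.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(5)
X = rng.normal(size=(400, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.2 * rng.normal(size=400)

X_tr, y_tr, X_cal, y_cal = X[:250], y[:250], X[250:], y[250:]
model = Ridge(alpha=1e-2).fit(X_tr, y_tr)

scores = np.abs(y_cal - model.predict(X_cal))           # nonconformity scores s_i
alpha = 0.05
level = np.ceil((1 - alpha) * (len(scores) + 1)) / len(scores)  # finite-sample level
q = np.quantile(scores, min(level, 1.0))

x_star = rng.normal(size=(1, 4))
y_star = model.predict(x_star)[0]
interval = (y_star - q, y_star + q)                     # 1 - alpha prediction interval
```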
In the context of atomistic modeling of materials, CPs have been recently used, e.g., in Hu et al.49 and Zaverkin et al.,50 for the UQ of ML interatomic potentials predictions.
Tran et al. compared the accuracy and uncertainty of multiple machine learning approaches for predicting the adsorption energy of small molecules on metals.51 The most effective approach combines a convolutional neural network (CNN) for feature extraction with a Gaussian process regressor (GPR) for making predictions. This hybrid model not only provided accurate adsorption energy estimates but also delivered reliable uncertainty quantification.
Tan et al.52 evaluated ensembling-based uncertainty quantification methods against single-model strategies, including mean-variance estimation, deep evidential regression, and Gaussian mixture models. Results, using datasets spanning in-domain interpolation (rMD17) to out-of-domain extrapolation (bulk silica glass), showed that no single method consistently outperforms the others. Ensembles excel at generalization and robustness, while MVE performs well in-domain, and GMM is better suited for out-of-domain tasks. The authors concluded that, overall, single-model approaches remain less reliable than ensembles for UQ in Neural Network interatomic potentials (NNIPs).
Kahle and Zipoli53 reported that ensembles of NN potentials may turn out to be overconfident, underestimating the uncertainty of the model. Further, they need to be calibrated for each system and architecture. This was verified across predictions for energy and forces in an atomic dimer, an aluminum surface slab, bulk liquid water, and a benzene molecule in vacuum. Bayesian NN potentials, obtained by sampling the posterior distribution of the model parameters using Monte Carlo techniques, were proposed as an alternative solution towards better uncertainty estimates.
Further, the integration of UQ methods with existing machine learning architectures is often streamlined for one specific approach only,54,55 and it is rarely the case that a single workflow allows for the adoption of a diverse set of ML UQ methods.
Kellner and Ceriotti41 have investigated the size extensivity of uncertainty estimates for bulk water systems using their shallow ensembling method for a deep NN model. In their analysis, they decomposed the uncertainties on differently sized bulk water systems into “bias” and “residual” terms. The bias term, computed by taking the absolute difference between the mean predicted and reference energies for a given system size, was found to scale roughly with N. The remaining residual term, which would largely capture the random distortions of the water molecules, was instead found to scale sub-linearly with N. Given the significant contributions from both terms, their experiments showcase the non-trivial trends in the size extensivity of ML model uncertainties for real material systems, which exposes the limitations of approaches where extensivity is ignored or a naive scaling law is assumed. Size extensivity of uncertainty estimates, presented in the context of gradient features, has also been discussed in Zaverkin et al., 2024.50
![]() | (29) |
![]() | (30) |
Explicit sampling can also be useful to account for the correlated nature of the errors made by ML models. Excessively conservative UQ estimates are made, e.g., when computing differences of observables, such as the relative energy E(x1) − E(x2) for nearby configurations x1 and x2, if independent errors are assumed across configurations. In reality, ML models often produce highly correlated predictions in such scenarios, meaning that while absolute uncertainties may be significant, the uncertainty on differences (which are often more physically relevant) can be much smaller. Sampling ΔE(m) ≡ E(m)(x1) − E(m)(x2) across the ensemble members, and then estimating the mean value and the uncertainty from this sample, is a viable option offered by explicit sampling, which naturally incorporates correlations between ML estimates that are functions of multiple configurations.
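A short illustration of this point, with made-up committee energies, is given below; the correlation-aware estimate can be much tighter than the one obtained by (incorrectly) assuming independent errors:

```python
# Illustrative sketch: uncertainty on an energy difference from a committee.
import numpy as np

E1 = np.array([-3.12, -3.10, -3.15, -3.11])    # E^(m)(x1) over the M members
E2 = np.array([-3.02, -3.00, -3.05, -3.01])    # E^(m)(x2) over the M members

dE = E1 - E2                                   # Delta E^(m), member by member
dE_mean, dE_std = dE.mean(), dE.std(ddof=1)    # correlation-aware estimate

naive_std = np.sqrt(E1.var(ddof=1) + E2.var(ddof=1))   # independent-error bound
```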
In ML-driven atomistic simulations, UQ is also needed to single out the uncertainty ascribable to ML models from the statistical one due to poor sampling (i.e. too short trajectories): in fact, it would be pointless to run very long MD simulations if the uncertainty due to ML could not be lowered below a given threshold.
A more subtle problem arises in the realm of machine learning interatomic potentials (MLIPs) whenever one aims at propagating uncertainty to thermostatic observables (e.g., the radial distribution function, the mean energy of a system, etc.), where the ML uncertainty on the energy of a given structure enters the Boltzmann weight of thermodynamic averages or, equivalently under the ergodic hypothesis, affects the sampling of the phase space via molecular dynamics simulations. In Imbalzano et al., the availability of model-dependent predictions was leveraged to propagate uncertainty to thermostatic observables while running a single trajectory with the mean MLIP of the ensemble, by applying simple reweighting strategies.11
For instance, consider the case of training a committee of M ML interatomic potentials to learn the potential energy surface V(r), where r indicates the set of positions of all the atoms of a system. The potential energy of the i-th model is labeled by V(i)(r), and the mean potential energy of the committee by V̄(r), as in eqn (17). Then, for a given observable a(r) of the atomic positions, its canonical average, computed by sampling the configurational space according to the Boltzmann factor exp[−βV(i)(r)], where β = (kBT)−1, can be equivalently expressed in terms of a canonical average using the Boltzmann factor exp[−βV̄(r)] associated to the mean potential of the committee as:
![]() | (31) |
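A minimal sketch of one such reweighting strategy is given below. It assumes weights proportional to exp[−β(V(m) − V̄)] evaluated on snapshots sampled with the mean potential; this is an illustration in the spirit of the approach above, not necessarily the exact expression of eqn (31), and in practice care is needed to keep the weights well conditioned (e.g., via cumulant expansions).

```python
# Illustrative sketch: reweighting a trajectory sampled with the mean committee
# potential V_bar to estimate <a> under each committee member, and its spread.
import numpy as np

rng = np.random.default_rng(6)
beta = 1.0 / 0.0259                            # 1/(k_B T) in eV^-1 at ~300 K
a = rng.normal(size=1000)                      # observable along the trajectory
V_bar = rng.normal(size=1000)                  # mean-potential energies (eV)
V_members = V_bar + 0.01 * rng.normal(size=(4, 1000))   # member energies (eV)

averages = []
for V_m in V_members:
    w = np.exp(-beta * (V_m - V_bar))          # per-snapshot reweighting factors
    w /= w.sum()
    averages.append(np.sum(w * a))             # <a> under member m
a_mean, a_err = np.mean(averages), np.std(averages, ddof=1)
```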
Looking ahead, fundamental questions remain, e.g., from a physical perspective: what are the key ingredients towards the definition of a rigorous theory for uncertainty propagation for time-dependent thermodynamic observables, such as correlation functions? Such a question is relevant for spectra and transport coefficients obtained from ML-driven molecular dynamics simulations,57–66 since its answer would make it possible to quantify the uncertainty on these dynamical observables in an efficient way, bypassing time-consuming, brute-force approaches that require running several trajectories.
Misspecification is a form of systematic model error, distinct from overfitting or random noise. In the context of atomistic simulations, modern ML architectures based on NN have in general enough capacity to fit any function to training data without overfitting; nonetheless, misspecification may still arise due to omitted physics, inadequate data coverage, or improper model assumptions. While it is universally true that “all models are wrong”,68 this limitation becomes especially critical in the context of misspecification, which can lead to unreliable or misleading UQ. For instance, if the model cannot represent long-range interactions that are present in the underlying physics—as in the binding curves of molecular dimers69—all members of an ensemble may still agree on an incorrect prediction, such as vanishing forces beyond the model's short-range cutoff. This leads to artificially low uncertainty estimates. Such issues cannot be resolved even with calibration strategies like eqn (26), since no global calibration can simultaneously account for both the regions where the model performs well (short distances) and those where it systematically fails (long distances). Another case where misspecification occurs concerns wrong functional forms as, e.g., in ML interatomic potentials trying to model a sharp repulsive wall with a smooth function.
In the context of atomistic modeling, robustness is important for novel materials discovery, where models are often used to predict the properties of new phases or materials that lie outside the existing dataset. The concept of robustness can also be straightforwardly extended to the predictions of “local” (e.g., atom-centered) or “component-wise” (e.g., range-separated) quantities of the chemical systems, which do not correspond to physically observable targets. This is especially relevant for ML models constructed to make global predictions on the system as the sum of local and component-wise predictions on distinguishable parts and their associated features. This is indeed a common practice in ML for atomistic modeling.
This construction provides a method to distinguish in-distribution samples, which lie within the convex hull, from out-of-distribution samples, which fall outside it. Extrapolation, in this context, refers to the model's attempt to make predictions for such out-of-distribution points by extending patterns learned from the training data, often resulting in increased uncertainty and reduced accuracy (Fig. 4, left panels).
The convex hull evaluation faces significant challenges in high-dimensional spaces. The computational cost of constructing and evaluating convex hulls increases with dimensionality, making this approach impractical for large-scale, high-dimensional machine learning tasks. Even more importantly, the number of points required to approximate the convex hull grows exponentially with the dimensionality of the feature space, a problem commonly referred to as the “curse of dimensionality.” Consequently, the convex hull becomes increasingly sparse in high dimensions, causing most points in the space to be classified as out-of-distribution.70 This observation also holds for the case of atomistic machine learning models based on local atomic environment representation.71
Importantly, while the intrinsic dimensionality of a high-dimensional representation may still be low, low-dimensional projections (e.g., D = 2 or D = 3) for visualization or analysis can introduce artifacts that misrepresent the true relationships in the data, such as incorrectly classifying in-distribution samples as out-of-distribution due to oversimplified boundaries. This is also relevant in machine learning for atomistic modeling, where the information of a high-dimensional representation can be reduced to a small, but not too small, number of principal components.71
In an adaptive k-nearest-neighbor (k-NN) density estimation procedure proposed by Zeni et al.,71 each test point x* is temporarily inserted into the training set so that its k* nearest neighbors among the training samples can be identified. This process makes it possible to compute the local density at x* via:
![]() | (32) |
Through this methodology, the resulting metric reflects the degree to which an unseen atomic environment lies in a well-sampled portion of the representation space. Importantly, it also correlates with the errors observed in machine-learned regression potentials. Furthermore, the same authors report a strong consistency between this density-based measure, a model-specific error estimator (namely, the predictive uncertainty from a committee of models trained on different subsamples of a larger training set), and a model-free estimator (namely, the Hausdorff distance between the prediction point and the training set). This result is also consistent with reports contrasting other model-specific and model-free approaches based on distances (e.g. latent space distances74).
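A simplified, fixed-k variant of such a nearest-neighbour density score (the adaptive choice of k* of eqn (32) is omitted here, and all data are synthetic) can be sketched as:

```python
# Illustrative sketch: a simple k-NN density proxy in feature space; low density
# flags likely out-of-distribution environments.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(9)
F_train = rng.normal(size=(1000, 8))           # training-set features
f_test = 2.0 * rng.normal(size=(5, 8))         # test features (some far away)

k = 10
nn = NearestNeighbors(n_neighbors=k).fit(F_train)
dist, _ = nn.kneighbors(f_test)                # distances to the k neighbours

d = F_train.shape[1]
density = k / (dist[:, -1] ** d + 1e-300)      # ~ k / (distance to k-th neighbour)^d
```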
In related work, Schultz et al.75 utilized kernel density estimation in feature space to evaluate whether new test data points fall within the same domain as the training samples. Their approach illustrates that chemical groups traditionally considered unrelated exhibit pronounced divergence according to this similarity metric. Moreover, they show that higher dissimilarity correlates with inferior predictive performance (manifested as larger residuals) and less reliable uncertainty estimates. They additionally propose automated methods to define thresholds for acceptable dissimilarity, enabling practitioners to distinguish between in-domain predictions and those lying outside the model's scope of applicability.
We similarly consider the work of Zhu et al.76 in the context of statistical methods to define extrapolation and interpolation and in- and out-of-distribution. Given a specific training set, a feature vector for each point is derived from the latent space representation of a NequIP77 model. Next, a Gaussian mixture model (GMM) is fitted on this distribution. A negative log-likelihood can then be obtained by evaluating the fitted GMM on the feature vector associated to any test point. Higher negative log-likelihoods were observed for points resulting in higher prediction uncertainties.
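A minimal sketch of this type of score, with random features standing in for the latent representation of an actual model, is:

```python
# Illustrative sketch: fit a GMM on training features and use the negative
# log-likelihood of test features as an out-of-distribution indicator.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(10)
F_train = rng.normal(size=(2000, 16))          # latent features of the training set
gmm = GaussianMixture(n_components=8, covariance_type="full").fit(F_train)

f_test = rng.normal(size=(3, 16))
nll = -gmm.score_samples(f_test)               # higher NLL -> more out-of-distribution
```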
To conclude, we note that, while statistical estimates of in- and out-of-distribution are of interest because of their efficiency and effectiveness, questions remain. The magnitude of these metrics depends on the chosen representation,71 and their precise correlation with the mean absolute error (MAE) is contingent upon both the system and the model employed.
• Phase Transferability refers to the ability of an ML interatomic potential trained on certain phases of a material to accurately predict properties of other ones (e.g. different polymorphs or phases), assuming both are sampled at similar temperatures.78–80
• Temperature transferability concerns the accuracy of the ML potential when, e.g., trained on structures sampled at high temperatures and tested on structures at lower temperatures, or vice versa.79,81–83
• Compositional transferability refers to the ability of an ML model to provide predictions for systems with unseen compositions with accuracy comparable to known stoichiometries. This can refer both to predictions for unseen stoichiometries, or for unseen elements (alchemical learning).84–86
The standard according to which a model is deemed transferable across different conditions remains rather flexible too. Criteria used to relate transferability and model accuracy so far have included:
• The error incurred by the model in the test domain is comparable with the one observed for the training domain.
• The error in the test domain is sizably larger than the one in the training domain, but remains acceptable for practical purposes (e.g., energy errors below 10 meV per atom, force errors below 100 meV Å−1).
• Simulations remain stable over long timescales, showing no significant energy drift or sampling of unphysical configurations.
The lack of a rigorous and standardized definition of transferability challenges our attempt to unify conclusions drawn from studies concerned with assessing ML model transferability in the context of atomistic modeling. Heuristic observations on transferability report that:
• Phase transferability: There is often a trade-off between accuracy and generality when applying ML potentials across different phases. This trade-off is generally acceptable for many practical applications.39,81
• Temperature transferability: Models trained on high-temperature data tend to generalize well to lower-temperature conditions.79,82
• Compositional transferability: Interpolation within the compositional space is generally feasible, but extrapolation to entirely new stoichiometries or elements (e.g., alchemical learning) poses significant challenges, unless tailored schemes are adopted.84,85
The relationship between a model's complexity and its transferability has also been a subject of discussion. In principle, more flexible models are more susceptible to overfitting, which can reduce transferability. Empirically, this tendency has been observed in ML methods based on feed-forward neural networks.78 Importantly, modern high-order graph convolutions and/or physics-inspired (e.g., symmetry conserving) architectures have not exhibited this limitation, suggesting that increased complexity does not necessarily compromise transferability, at least within the data- and parameter-sizes considered in those applications.87
To address this problem, Chong and coworkers have recently introduced the concept of “prediction rigidities” (PRs),14,15,88 which are metrics that quantify the robustness of ML model predictions. Derivation of the PRs begins by considering the response of ML models to perturbations in their predictions, via their loss, by adopting the Lagrangian formalism. A modified loss function can be defined as shown below:
![]() | (33) |
![]() | (34) |
![]() | (35) |
One can recognize the crucial connection between the Ho appearing in this expression and the Ho defined in eqn (9), as well as the original eqn (1) defined for the Mahalanobis distance. Note that Ho is often approximated as the sum of the outer products of structural features over the training set, where each structural feature is the sum or mean of the atomic features of the given structure. This is an indirect approach to consider the “groupings” of local environments that are present as structures in the training set.
A few important remarks should be made here:
• The absence of any calibration parameters in the PRs hints that these are purely dataset- and representation-dependent quantities, and hence distinct from a UQ metric;
• Dependence on the dataset and model training details is determined by Ho, which can adopt a Gauss–Newton approximation scheme and be computed in a similar manner as eqn (11), and remains constant for a given model;
• There is freedom in how ⋆ is defined: it is possible to compute the PRs for any data point as long as it is definable within the input parameter space; furthermore, one can also target specific local predictions or component-wise predictions of the model, resulting in the local PR (LPR) or component-wise PR (CPR), which quantitatively assess the robustness of intermediate model predictions that do not have corresponding physical observables.
The robustness metrics introduced thus far are solely dependent on the dataset distribution (i.e. structural diversity of material systems and their local environments) and remain detached from the distribution of the target metric. One must be mindful of the repercussions, namely that the complexity of the target-quantity landscape is ignored: if the target landscape is smooth, learning may require fewer data points to achieve the target accuracy; if it is rough, more data points would be needed to resolve the complex landscape and achieve desirable accuracy.
The quality of data and representation may be similarly relevant. For example, as discussed by Aldeghi et al.89 and van Tilborg et al.,90 “activity cliffs” (instances where structurally similar pairs of molecules exhibit large differences in the targets) negatively affect model performance. By learning a representation, e.g. through contrastive learning, that correctly separates such structures, the learning problem is simplified. In addition, a modified Shapley analysis was proposed for analysing and interpreting the impact of a datapoint in the training set on model prediction outcomes.91,92
(1) Generation of new targets is relatively expensive and/or time consuming. In such a case, one may like to know in advance for which new inputs x to compute the target y (i.e. assign a label);
(2) There is a vast pool of data and one aims to select a subset of data points. This is critical for, e.g., the construction of the representative set of sparse kernel models, although it is common in modern deep learning training strategies to use all the data at one's disposal.
A first example of workflows iteratively improving the accuracy of an atomistic model was the “learn on the fly” hybrid scheme proposed by Csányi et al.93 Here, fitted potentials (based upon an analytical formulation93 or machine learning94) are refined using a predictor-corrector algorithm and quantum calculations to ensure reliable simulations of complex molecular dynamics.
Since uncertainty carries the same units as the predicted quantity, it naturally allows for the definition of interpretable thresholds and tolerance criteria; it can therefore be adopted as the metric to identify configurations where model predictions are too uncertain, and for which additional information is necessary to steer the model towards more robust predictions.
Numerous active learning schemes (Fig. 6) for interatomic potentials based upon uncertainty thresholds have been proposed in the last years, encompassing a variety of materials and chemistries, from heterogeneous catalysis51,95 to energy materials,96 from reactions in solution97,98 to molecular crystals.99 More recently, biasing the sampling towards configurations corresponding to highly uncertain predictions was brought forward as a strategy to ensure the collection of a varied training set, and the training of a model presenting an uncertainty always below a specific threshold across a (large) region of interest in the configurational space.50,100–103
In the next subsections, we show that several “active learning” approaches to sequentially select new, optimal data points can be framed in the context of the maximum gain of information, as first discussed by MacKay in the context of the Bayesian interpretation of learning.9 We also show that apparently model-free approaches do effectively identify new points where uncertainties would be the largest.
![]() | (36) |
![]() | (37) |
![]() | (38) |
Let us now take a new point of features f⋆ and add it to the dataset. The new feature matrix Fnew is obtained by concatenating the row vector f⋆⊤ to F. The new Hessian becomes
Hnew = Ho + f⋆f⋆⊤ | (39)
![]() | (40) |
One can use the matrix determinant lemma104 to express:
det(Hnew) = det(Ho)(1 + f⋆⊤Ho−1f⋆) | (41)
The total information gain from adding ⋆ to the dataset, ΔI, is
![]() | (42) |
The MIG criterion also motivates, from an information theory perspective, the addition of structures characterized by environments with the lowest local prediction rigidity as an active learning criterion, as in Chong et al.88 We remark that:
• The MIG criterion is independent of the specific target, which need not be computed in advance to perform the active data selection.
• If we consider the initial dataset as fixed, then maximizing the information gain implies looking for ⋆ to satisfy:
⋆ = argmax⋆[f⋆⊤Ho−1f⋆] | (43)
• In this approach, the noise on the data is taken to be the same for all data points (in fact, a single calibration constant α2 was used in Sec. 2). Generalization to the case of sample-dependent noise is nonetheless straightforward.
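In practice, the criterion of eqn (43) can be applied greedily over a pool of candidates, updating the metric with the Sherman–Morrison formula after each selection. The sketch below is illustrative (random candidate features, arbitrary regularizer):

```python
# Illustrative sketch: greedy maximum-information-gain selection from a pool.
import numpy as np

rng = np.random.default_rng(11)
F_pool = rng.normal(size=(500, 10))            # candidate features
regularizer = 1e-3                             # plays the role of ς^2
G = np.linalg.inv(regularizer * np.eye(10))    # metric for an initially empty set

selected = []
for _ in range(20):
    gains = np.einsum("if,fg,ig->i", F_pool, G, F_pool)   # f^T G f for each candidate
    gains[selected] = -np.inf                  # never pick the same point twice
    i = int(np.argmax(gains))
    selected.append(i)
    f = F_pool[i]
    Gf = G @ f
    G -= np.outer(Gf, Gf) / (1.0 + f @ Gf)     # Sherman-Morrison rank-1 update of G
```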
While the selection of a single datum based on maximum expected information gain is a well-established approach in active learning, practical applications often require evaluating how much information a single new datum contributes with respect to a set of target points (e.g., a test set or region of interest). MacKay—see Section 4.1 of ref. 9—can be credited for generalizing data-point selection by considering information gain across a set of input points, representing a region of interest. However, a naïve use of the joint information gain of the interpolated values can lead to suboptimal results, as it may favor inducing correlations among outputs rather than minimizing their individual uncertainties. A more appropriate strategy that MacKay suggests is to maximize the mean marginal information gain (MMIG) across these points (which we label by u = 1, …, V), independently, i.e., in formulae–for noiseless observations:
![]() | (44) |
![]() | (45) |
![]() | (46) |
This is the same criterion found in the previous section from the theory of maximal information gain, i.e. the quest for the sample ⋆ with largest variance, eqn (1). Nevertheless, this time the adopted metric is the inverse covariance matrix of the features of the D-optimally subsampled dataset, i.e. the analogue of G in eqn (3) built from the subsampled feature matrix. Notice that
![]() | (47) |
![]() | (48) |
V(A) = Erepulsive(A) − KSKP(A) | (49) |
Nonetheless, SKP diverges to −∞ whenever the features associated to environments i and j in eqn (48) coincide. Schwalbe-Koda et al. solve this issue by defining a dataset entropy of a set of Nenv environments:
![]() | (50) |
![]() | (51) |
We highlight that even when the dataset is built with targets that are structural properties, rather than local atomic properties, no “grouping” of local environments into structures is taken into account in these data-entropy based schemes, as it is instead done in the construction of the metric tensor in Mahalanobis distance (see eqn (34)). It is further relevant to note that any kernel can be written (Mercer's theorem) as
![]() | (52) |
![]() | (53) |
![]() | (54) |
![]() | (55) |
Unfortunately, in contrast to the forms of active learning of the previous subsections, based on eqn (1), here the complexity of the dataset is effectively “averaged out” by considering only the mean feature vector of the dataset in the latent feature space. Notice that if the features in the latent feature space are centered (see Appendix C for a discussion on whether latent feature centering is legitimate or not), the mean feature vector vanishes, and one incurs the additional problem that the argument of min⋆ vanishes for any x⋆.
A different perspective on assessing the proximity of two distributions (e.g., training-point features and test-point features) was proposed by Zeni et al.81 Their study considered 2-body machine learning potentials and the Kullback–Leibler divergence between distributions of interatomic distances in the training and test sets to rationalize prediction errors in machine learning potentials. The Kullback–Leibler divergence is an asymmetric statistical measure that quantifies the information loss when a probability distribution q(ξ) associated to a dataset over some sample space Ξ is used to approximate another probability distribution p(ξ) associated to a second dataset on the same sample space:
DKL(p‖q) = ∫Ξ p(ξ)ln[p(ξ)/q(ξ)]dξ | (56)
A positive correlation between the KL divergence and the mean absolute error of 2-body kernels was observed, highlighting the importance of including training data that capture interatomic distances relevant to the test set. The KL divergence was thus proposed as a measure to assess how well structural features in the training dataset align with those in the test dataset, and to interpret model errors in a case study concerning machine learning potentials for Ni nanoclusters.81 Extending the assessment to 3-body machine learning potentials and the KL divergence of bond-angle distribution functions also resulted in the observation of a positive correlation between the two quantities. We notice that a hysteretic behavior exists in this metric, whereby the net cross-entropy change in first adding a point to the dataset and then removing it is nonzero (see Appendix B).
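For reference, a discrete estimate of eqn (56) from histograms of a structural feature (here, synthetic interatomic distances) can be obtained as follows:

```python
# Illustrative sketch: KL divergence between histograms of interatomic distances
# in a "test" (p) and a "training" (q) dataset.
import numpy as np

rng = np.random.default_rng(12)
r_train = rng.normal(2.5, 0.30, size=20000)    # training-set distances (angstrom)
r_test = rng.normal(2.7, 0.35, size=20000)     # test-set distances (angstrom)

bins = np.linspace(1.0, 4.0, 61)
p, _ = np.histogram(r_test, bins=bins, density=True)
q, _ = np.histogram(r_train, bins=bins, density=True)
dx = np.diff(bins)

mask = (p > 0) & (q > 0)                       # avoid empty bins
kl = np.sum(p[mask] * np.log(p[mask] / q[mask]) * dx[mask])
```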
Fig. 7 Illustrative example of Bayesian optimization for sampling the minimum of a one-dimensional energy landscape.
While similar to active learning in its iterative approach and reliance on uncertainty to guide decisions, Bayesian optimization differs in its goal. Active learning focuses on generally improving a model's predictions. Bayesian optimization aims to optimize an objective function directly. Instead of stochastically sampling or evaluating all possibilities, Bayesian optimization identifies the next point to sample by balancing two goals: exploring unknown regions (where the function behavior is uncertain) and exploiting promising areas (where the function is predicted to perform well). Once the function is evaluated at the chosen point, the new information is used to update the model, and the process repeats until an optimal solution is found.121
Building upon these premises, in Bayesian optimization, acquisition functions decide where to evaluate the objective function next, and uncertainty estimates are at the heart of this process: e.g., Expected Improvement122 uses uncertainty to identify areas where the potential for improvement is highest; Probability of Improvement121 factors in uncertainty to assess the likelihood of finding a better outcome; the Upper Confidence Bound123 takes a more explicit approach, blending the model's predictions with a weighted measure of uncertainty.
Bayesian optimization has emerged as a powerful tool in atomistic modeling, offering efficient strategies for navigating complex energy landscapes and exploring vast configuration spaces. By leveraging probabilistic models, it enables the optimization of potential energy surfaces for intricate atomistic systems, guiding the search toward minima or other critical points with minimal computational cost. Challenges in Bayesian optimization may arise if it is performed in a space of too high dimensionality, or if the property landscape is rough (that is to say, properties change rapidly with respect to a small change in the feature space, akin to the discussion presented in Sec. 3.4).
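The basic loop can be illustrated on a one-dimensional toy objective with a GP surrogate and a confidence-bound acquisition (all modeling choices are arbitrary and for illustration only; for minimization, the lower confidence bound plays the role of the UCB mentioned above):

```python
# Illustrative sketch: Bayesian optimization of a 1D toy objective.
import numpy as np

def objective(x):                              # "expensive" function to minimize
    return np.sin(3 * x) + 0.3 * x**2

def rbf(a, b, length=0.4):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length**2)

rng = np.random.default_rng(13)
X = rng.uniform(-2, 2, size=4)                 # initial evaluations
y = objective(X)
grid = np.linspace(-2, 2, 400)

for _ in range(15):
    K_inv = np.linalg.inv(rbf(X, X) + 1e-6 * np.eye(len(X)))
    k_s = rbf(grid, X)
    mu = k_s @ K_inv @ y                       # GP posterior mean on the grid
    var = 1.0 - np.einsum("ij,jk,ik->i", k_s, K_inv, k_s)
    acq = mu - 2.0 * np.sqrt(np.clip(var, 0.0, None))   # lower confidence bound
    x_next = grid[np.argmin(acq)]              # balance exploitation/exploration
    X, y = np.append(X, x_next), np.append(y, objective(x_next))
```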
Successful applications of Bayesian optimization (BO) in atomistic modeling have been showcased for molecular,124,125 crystalline126–129 and disordered systems,130 as well as complex interfaces.131,132 Beyond identifying stable configurations, Bayesian optimization facilitates the exploration of structures to achieve specific target properties or locate configurations along the Pareto front in multi-objective predictions. This capability makes it highly valuable for materials design, where balancing multiple competing properties is often required. Applications include (but are not limited to) the mechanical properties of metallic glasses,133 multi-principal-element134 or high-entropy alloys135,136 and their catalytic properties,137–139 or electrolyte properties for energy storage applications.140
The benchmarking of foundation models has so far taken place against established metrics, informing on errors in thermodynamic stability.144 Community benchmarks and validation practices, furthermore, too rarely account for uncertainties or their propagation. An attempt to introduce performance assessment against an observable (thermal conductivity) drawn from molecular dynamics has recently been made.145 Similarly, UQ has been introduced in foundation models,142 adopting the last-layer approximation described in eqn (14), with almost no additional computational load with respect to the inference of the raw prediction. We consider these key steps towards the definition of probing and informative benchmarks and validation practices.
Fine-tuning and transfer learning build upon the knowledge encoded in pre-trained models, such as universal and foundation models, adapting them to specific tasks or systems through targeted retraining with smaller datasets. This approach enhances data efficiency, and often robustness, as the base model serves as a strong starting point. Their achievements span diverse areas of machine learning, including atomistic modeling.146–150 Nevertheless, challenges arise when the pre-trained model's domain significantly differs from the target domain. Questions thus arise regarding the effect of different strategies on the reliability and robustness of the uncertainty estimators.
Multi-fidelity learning152 integrates information from datasets of varying accuracy and cost, effectively linking low-fidelity data to high-fidelity outputs. By synthesizing information from multiple sources, this approach enhances robustness while reducing the dependence on high-cost data. However, inconsistencies between fidelities, and the challenge of accurately propagating uncertainties in models that consider multiple fidelity levels, demand careful consideration.
Open questions remain on how uncertainty estimates are affected by the use of these data-efficient models. These include, but are not limited to, a reflection on whether the simultaneous learning of multiple levels of theory is advantageous in terms of both data efficiency and robustness, also in relation to its effect on prediction uncertainty.
• Models trained to predict or guide synthesis strategies may be significantly affected by noise and bias in experimental data, making robust uncertainty estimates essential for actionable predictions.153
• Machine learning tools developed to accelerate materials characterization must account for ambiguities in signal assignment and model interpretability.154
• Surrogate models used to establish structure–property relationships or optimize structures towards target properties are often based on regression over high-dimensional descriptors. These models thus benefit from UQ strategies already employed in atomistic modeling, such as ensemble methods or Bayesian approximations.155
• Multiscale modeling frameworks require principled approaches for propagating uncertainty across scales—from atomistic to continuum—where even well-calibrated models at a lower scale may induce unpredictable errors at a higher one.156
Taken together, our work underscores the role of rigorous UQ frameworks for guiding data-driven modeling and the value of thoughtful dataset construction in enhancing the transparency and robustness of ML-based atomistic modeling. As new techniques—especially those geared toward data-efficient learning—continue to mature, careful validation and thorough uncertainty assessments become even more critical to maintain trust in model predictions. We hope this perspective stimulates further development and integration of UQ protocols into atomistic modeling efforts.
Finally, we emphasize that the challenges and strategies for managing uncertainty in atomistic modeling echo a broader scientific discourse extending beyond this specific domain. Across materials science, physics, and chemistry, there is a renewed drive to establish clear standards for assessing information and uncertainty, from the reproducibility of synthetic protocols—whether in organic or materials synthesis—to the quality of data gleaned from real- and inverse-space characterization methods. The same principles underpin efforts to gauge the reliability of outputs from generative AI for materials discovery, large language models, automated image and spectrum analyses, and multi-modal approaches alike. Similar to Tycho Brahe's endeavor, these collective efforts will contribute to robust, transparent, and reproducible scientific findings.
ϕ(x) = Λs−1/2Us⊤k(x, Xs) | (57)
In this formula, k(x, Xs) is the vector of the kernels between the input point x and each of the points in the sparse set, that are collected in the matrix Xs ∈ ℝM×D, while Us ∈ ℝM×M is the matrix of the eigenvectors of the sparse set kernel matrix, Ks ≡ K(Xs, Xs), which has as entries the kernel between two points of the sparse set:
Ks = UsΛsU⊤s. | (58) |
The diagonal matrix Λs collects the eigenvalues, ordered to correspond to Us. The kernel matrix of the training set is then approximated as (Nyström formula):
K(X, X) ≈ K(X, Xs)Ks−1K(Xs, X) | (59)
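A compact numerical sketch of the construction above, with an RBF kernel and random data (purely illustrative), is:

```python
# Illustrative sketch: Nystrom approximation of a kernel matrix and the
# corresponding pseudofeatures, such that K_approx = Phi @ Phi.T.
import numpy as np

def rbf(A, B, length=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length**2)

rng = np.random.default_rng(14)
X = rng.normal(size=(300, 3))                  # full training set
Xs = X[rng.choice(300, size=30, replace=False)]   # sparse (inducing) set

K_ns = rbf(X, Xs)                              # (N, M) cross-kernel matrix
K_s = rbf(Xs, Xs) + 1e-10 * np.eye(30)         # (M, M) sparse-set kernel
K_approx = K_ns @ np.linalg.inv(K_s) @ K_ns.T  # Nystrom formula, cf. eqn (59)

evals, U_s = np.linalg.eigh(K_s)
Phi = K_ns @ U_s / np.sqrt(np.clip(evals, 1e-12, None))   # pseudofeatures, cf. eqn (57)
```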
Fig. 8 Effect of the application of eqn (1), where f⋆ and F are centered or not with respect to the center of the dataset's feature distribution, for a toy model with two features. The (unitless) Mahalanobis distance is obtained by expressing the uncertainty from eqn (1) in units of the calibration parameter α.
![]() | (60) |
Then, starting from the enlarged dataset, remove a data point to return to the original one.
The KL divergence in this case is:
![]() | (61) |
In general, the two KL divergences do not coincide. This results in a form of information hysteresis in the cycle of adding and then removing a data point, if we associate the KL divergence with the concept of information gain or loss. Notation was kept loose on purpose: information hysteresis would exist irrespective of whether p and q represent (posterior) probability distributions of the weights,9 as in Sec. 3.5.1, or the probability distributions of dataset features, as in Sec. 3.5.3.
Centering the kernel—either directly or by centering the pseudofeatures as in the Nyström approximation—adjusts the distribution of data representation in latent space, thereby affecting the variance of predictions, which may reflect both the global mean effect and deviations from this mean. Centering also isolates variability purely due to deviations, aligning the variance estimate more closely with the concept used in linear models, where centering is standard.
However, pseudofeatures centering has a direct impact on variance estimates, especially for bias-less models. For instance, consider a NN model where the output is forced to vanish for vanishing features no matter the values assumed by the last-layer weights: the uncertainty on the prediction, σ⋆(f⋆ = 0), is always zero by construction—see eqn (1). In general, the variance for predictions near (far from) the origin of the latent space will be small (large). Nonetheless, this is a characteristic of the model: one could in fact argue that re-centering may introduce spurious a posteriori effects that clash with how the model ultimately represents (or learns) data in latent space. By centering the input features one effectively removes the global mean effect, ensuring that the predictions reflect deviations based solely on the relative relationships between data points. Yet, it is not evident that performing centering on pseudofeatures (that are the way the kernel represents data, or are learned by the model in NN architectures) should be encouraged.
Footnotes
† The only dependence on the specific target quantity and values is through the value of the regularizer that is included to make the inversion of the covariance matrix numerically stable.
‡ To avoid notation overburden, we used the same symbol α2 in both eqn (12) and (14) for the calibration parameters, although there is no reason for them to be equal.
§ For instance, the default fitting NN architecture of DeePMD is made by 3 layers of 240 neurons each.22 Even in the earliest Behler–Parrinello architectures the number of nodes (40) was much larger than the number of layers (2).23
¶ Karabin and Perez114 report that “Extensions to global entropy-maximization over the whole training set (in contrast to the local configuration-by-configuration optimization presented here) are in development and will be reported in an upcoming publication.” Montes de Oca Zapiain et al.116 still adopts the local approach.