Open Access Article
Ilgar
Baghishov
*ac,
Jan
Janssen
ad,
Graeme
Henkelman
bc and
Danny
Perez
*a
aTheoretical Division T-1, Los Alamos National Laboratory, USA. E-mail: baghishov@utexas.edu; danny_perez@lanl.gov
bOden Institute for Computational Engineering & Sciences, University of Texas at Austin, USA
cDepartment of Chemistry, University of Texas at Austin, USA
dMax Planck Institute for Sustainable Materials, Germany
First published on 18th November 2025
Machine-learned interatomic potentials (MLIPs) are revolutionizing computational materials science and chemistry by offering an efficient alternative to ab initio molecular dynamics (MD) simulations. However, fitting high-quality MLIPs remains a challenging, time-consuming, and computationally intensive task where numerous trade-offs have to be considered, e.g., How much and what kind of atomic configurations should be included in the training set? Which level of ab initio convergence should be used to generate the training set? Which loss function should be used for fitting the MLIP? Which machine learning architecture should be used to train the MLIP? The answers to these questions significantly impact both the computational cost of MLIP training and the accuracy and computational cost of subsequent MLIP MD simulations. In this study, we use a configurationally diverse beryllium dataset and quadratic spectral neighbor analysis potential. We demonstrate that joint optimization of energy versus force weights, training set selection strategies, and convergence settings of the ab initio reference simulations, as well as model complexity can lead to a significant reduction in the overall computational cost associated with training and evaluating MLIPs. This opens the door to computationally efficient generation of high-quality MLIPs for a range of applications which demand different accuracy versus training and evaluation cost trade-offs.
The most recent developments in the field have prioritized improving the accuracy of MLIPs by incorporating complex atomistic descriptors and sophisticated machine learning models. While such models can now achieve remarkable accuracy, their training requires substantial amounts of high-fidelity ab initio training data and their evaluation can be thousands of times more expensive than traditional force fields, leading to significant computational costs at both training and evaluation times.6–14 In contrast, other efforts prioritize applications such as high-throughput materials discovery, simulations of large atomic systems, or long timescale simulations, where minimizing the model's training and evaluation costs is paramount, even at the expense of a decrease in accuracy.15–18 A prominent example of this philosophy is the development of ephemeral data derived potentials (EDDPs), which are lightweight potentials rapidly generated from moderately converged data specifically for demanding tasks such as high-throughput structure searching.19–21 Finally, the development of “foundation” or “universal” models—highly complex MLIPs, often graph neural networks trained across vast chemical spaces22–24—raises questions about the continued need for optimizing application-specific potentials. However, these “universal” models often require fine-tuning for specific material systems to achieve high accuracy.25–27 Crucially, fine-tuning preserves the high computational cost associated with the complex architectures of these “universal” models, which can be orders of magnitude greater than simpler alternative MLIPs like the linear Atomic Cluster Expansion (ACE).25 Thus, for applications demanding both robustness and speed, tailoring less complex, optimized MLIPs remains crucial. This requires a systematic optimization of the application-specific cost/accuracy trade-off, considering the quality of the training set, the complexity of the model, and the training procedures, which is the central theme of this paper.
This paper explores the critical trade-off between accuracy and computational cost inherent in fitting and evaluating MLIPs. Fig. 1 conceptually maps the key factors that we investigate to navigate this trade-off, starting with the choice of MLIP complexity dictated by an application's needs in terms of number of simulations, simulation size and timescale with respect to the available computational budget. The optimization involves balancing the MLIP's predictive error (e.g., energy and force RMSE) against the computational costs for constructing the training set and evaluating the MLIP. The computational cost for generating the density functional theory (DFT) training set is limited by the choice of convergence parameters such as plane wave energy cut-off (ENCUT) and k-point mesh sampling, as well as the total number of atomic configurations. In this paper, we refer to the ENCUT and k-point sampling parameters collectively as ‘DFT convergence’ or simply ‘convergence’. The choice of convergence parameters dictates the numerical accuracy of DFT data relative to the fully converged basis set limit. Increasing ENCUT monotonically improves the absolute energy. This might suggest that it is only a constant shift to the potential energy surface, which would allow an MLIP to remain accurate up to a constant. However, the magnitude of this energy correction is structure-dependent. This variation introduces significant errors in the relative energies, with a magnitude comparable to k-point sampling errors (see Fig. S1 in the SI). Consequently, one cannot use an arbitrary ENCUT and assume that the MLIP will be accurate up to a constant; both convergence parameters must be investigated. It is important to distinguish this numerical convergence from the intrinsic accuracy in comparison to experiment dictated by the choice of the exchange–correlation functional, which is not considered here. Instead, the MLIP is benchmarked against the fully converged DFT results with a given functional.
While requiring tight DFT convergence settings is common,28–30 it incurs substantial computational cost. We demonstrate that utilizing reduced-convergence DFT training sets can be sufficient provided the energy and force contributions are appropriately weighted during training. Furthermore, systematic sub-sampling techniques can identify the most informative configurations, drastically reducing the required training set size. By considering these aspects alongside the choice of MLIP complexity (which governs the computational cost of evaluation), we perform a joint Pareto analysis, conceptually illustrated by the optimal surface in Fig. 1. Our findings reveal that it is possible to achieve near-optimal MLIP accuracy with small, lower-convergence DFT training sets. This is especially true when using computationally efficient, reduced-complexity MLIPs. This underscores the substantial benefits of jointly optimizing model complexity, training set convergence, and training set to generate application-specific MLIPs with superior accuracy/cost characteristics.
The paper is organized as follows: Sec. 2 details the computational methodologies employed in this study. This includes the preparation of the DFT training set across six distinct convergence levels (Sec. 2.1), the description of the spectral neighbor analysis potential (SNAP) formalism and its quadratic extension (qSNAP), detailing variations in model complexity and the training procedure involving energy and force weighting (Sec. 2.2), and the leverage score technique utilized for efficient data sub-sampling (Sec. 2.3). Sec. 3 presents and discusses our findings. We first quantify the nature and magnitude of errors introduced by varying DFT convergence levels (Sec. 3.1), followed by analyzing how these DFT errors propagate into the trained MLIPs (Sec. 3.2). Next, we explore the effects of energy–force weighting (Sec. 3.3) and training set size (Sec. 3.4) on MLIP errors. Subsequently, we perform a multi-objective optimization, integrating DFT convergence, training set size (informed by leverage sampling), energy–force weighting and MLIP complexity to map out the Pareto front of accuracy versus computational cost (Sec. 3.5). Sec. 4 discusses the implications of these findings and quantifies the reduction in computational cost. Finally, Sec. 5 summarizes the key conclusions drawn from this work and discusses their implications for the fitting of application-specific MLIPs by inverting the parameter selection to tailor to specific application requirements for accuracy and computational cost.
000 configurations selected from the dataset introduced in ref. 32, which was uniformly rescaled from the equilibrium lattice constant of tungsten to the equilibrium lattice constant of beryllium. This approach leverages the fact that both systems are unary, allowing the reuse of diverse geometric configurations from the entropy maximization algorithm without repeating the generation step for beryllium. Each configuration contains on average 50 atoms. When training MLIPs, the dataset is split evenly into a training and a testing set of 10
000 configurations each. While the system is chemically simple, the entropy-maximization method generates extreme topological diversity, even compared to datasets enriched by active learning.32,33 The extremely broad coverage of the feature space leads to ultra-robust MLIPs, in contrast to models trained on conventional datasets that routinely dramatically fail at predicting the properties of entropy-maximized configurations. This diversity makes the expressivity tradeoffs of the MLIPs especially pronounced.
Reference energies and forces are calculated using pyiron workflow manager34 and the Vienna Ab initio Simulation Package (VASP)35,36 at six levels of DFT convergence, as shown in Table 1. The calculations employed the projector augmented-wave (PAW) method in conjunction with the Perdew–Burke–Ernzerhof (PBE) exchange–correlation functional. To model electronic smearing, we utilized the Methfessel–Paxton scheme (ISMEAR = 1) with a smearing parameter of SIGMA = 0.2 eV. The table also reports the average simulation time for single point evaluations using a single NVIDIA A100 GPU; the computational effort required to generate the reference data is seen to vary by a factor of around 100 between low (level 1) and high (level 6) convergence simulations.
| Convergence level | k-point spacing, Å−1 | Energy cut-off, eV | Average run time per configuration, sec |
|---|---|---|---|
| 1 | Gamma point only | 300 | 8.33 |
| 2 | 1.00 | 300 | 10.02 |
| 3 | 0.75 | 400 | 14.80 |
| 4 | 0.50 | 500 | 19.18 |
| 5 | 0.25 | 700 | 91.99 |
| 6 | 0.10 | 900 | 996.14 |
The per-atom energy distribution (Fig. 2a) spans from −3.75 to −2 eV per atom. Although this dataset is challenging to fit due to its diversity, it has been shown to produce highly transferable and robust potentials. The force component distribution (Fig. 2b) shows a standard deviation of 0.44 eV Å−1 and ranges from −10 to 10 eV Å−1, with most values concentrated between −2 and 2 eV Å−1.
To study the benefits of increased DFT convergence for MLIPs with varying levels of complexity, we control the MLIP complexity for the qSNAP MLIP via the 2Jmax parameter, which directly controls the angular order of the spherical harmonic expansion. Conceptually, 2Jmax determines the angular resolution of the descriptors; a larger value allows the model to capture finer details and more complex geometric arrangements in the local atomic environment. Table 2 reports how increasing 2Jmax affects the number of descriptors, which corresponds to the number of MLIP coefficients and the computational cost of evaluation for applications, like calculating MD trajectories with the MLIP. Notably, the evaluation cost varies by nearly an order of magnitude between 2Jmax = 4 and 10, highlighting the direct impact of model complexity on computational cost for the application of the MLIP. While other hyperparameters, such as the radial cutoff and element-specific weights, are important for defining the scope of the local environment, 2Jmax is the principal hyperparameter that governs the flexibility and descriptive power of the qSNAP functional form itself, and therefore serves as the primary control for model complexity.
| 2Jmax | Number of descriptors | Atoms-timestep per second |
|---|---|---|
| 4 | 105 | 4 × 105 |
| 6 | 465 | 1 × 105 |
| 8 | 1540 | 4 × 104 |
| 10 | 4186 | 2 × 104 |
In spite of what its name suggests, training a qSNAP MLIP can be cast as a linear regression task, which greatly facilitates the training of the large number of models which are presented below. In the following, for simplicity we consider only regression to reference energies and forces, although other quantities such as stresses can be added.
The training of the linear model is based on minimizing a weighted least squares loss function:
![]() | (1) |
In matrix form, the solution to this minimization problem is found by solving the weighted least squares equation:
| Wy=WXβ | (2) |
is the descriptor matrix with n representing the total number of energy and force components in the dataset, and p being the number of qSNAP descriptor components,
is the vector of reference values (including both energies and forces), and
is the vector of MLIP coefficients. The diagonal weight matrix
shown in eqn (1) provides a user-adjustable relative weighting of energy and force terms in the loss function.![]() | (3) |
This formulation ensures that minimizing the weighted least squares problem is equivalent to minimizing the loss function defined in eqn (1). It will be shown below that the choice of weights wE and wF plays a critical role in balancing the influence of energy and force errors and depends on both the complexity of the MLIP model and the DFT convergence of the training set (Sec. 3.3).
000 configurations could be sufficient to obtain a converged MLIP even when using a high-diversity dataset.31–33 This begs the question of how to choose a proper subset of configurations to evaluate with DFT. In the following, we use a leverage score based strategy. Leverage quantifies how much a configuration's features in descriptor space deviate from the average, allowing us to identify configurations with distinctive features. It can also be interpreted in terms of the sensitivity of the ith predicted value yi on the ith dependent value yi (where the ys can be either energies of a particular configurations or force components of a particular atom). High-leverage points therefore have the potential to significantly affect predictions carried out with the trained MLIP. This concept is closely related to the maximum volume approach or D-optimality criterion used in active learning, which selects data points that expand the coverage in descriptor space.39 Our sub-sampling procedure consists of randomly sampling configurations from the 10
000 candidates with probabilities proportional to their leverage scores until a subset of the desired size is obtained.
The leverage score of data item i is the corresponding diagonal element of the so-called hat matrix H=X(XTX)−1XT. A numerically stable and efficient procedure to evaluate the leverage involves the singular value decomposition (SVD) of X = UΣVT. From basic properties of the SVD, it can easily be shown that:
| H = X(XTX)−1XT = UΣVT(VΣUTUΣVT)−1VΣUT = UΣVT(VΣ2VT)−1VΣUT = UΣVTVTΣ−2VTVΣUT = UΣVTVΣ−2VTVΣUT = UUT | (4) |
Since each configuration corresponds to a block of rows in the descriptor matrix X (one for total energy and 3Nm for atomic forces), we explore two strategies to assign a single score to each configuration. The first, which we call ‘regular leverage sampling’, uses the leverage score calculated from the energy descriptor row only, a method analogous to CUR decomposition.40 The second, ‘block leverage sampling’, calculates a total score by summing the individual leverage scores of all rows (energy and all force components) associated with that configuration, an approach analogous to block CUR decomposition.41
![]() | ||
| Fig. 3 Pairwise relationships between 2nd and 6th convergence level DFT data. (a) Energies and (b) forces. | ||
Error metrics are summarized in Table 3, which presents the root mean squared differences (RMSD) between the 6th convergence level and other convergence levels for the energies and forces. The “Unshifted” column reports raw errors. It is sometimes argued that errors between different DFT convergence levels can be resolved by a simple constant energy shift, such as referencing the bulk ground state, a consideration motivated by the faster convergence of energy differences due to empirically-observed error cancellation.42 To test this hypothesis, the “Bulk shifted” column reports the RMSD after applying such a shift. The results show that this procedure is often detrimental, increasing the error for most convergence levels. This demonstrates that for a configurationally diverse dataset like ours, which includes many high-energy structures, a bulk reference point is insufficient to correct for convergence-related errors across the entire potential energy surface. The “Mean shifted” column, although a dataset-dependent measure, is included to distinguish a systematic energy offset from other sources of error. Finally, we observe that convergence level 5 is required to achieve energy errors lower than 1 meV per atom, which is often regarded as a target for accurate MLIPs.
| Convergence level | Energy, meV per atom | Forces meV Å−1 | ||
|---|---|---|---|---|
| Unshifted | Bulk shifted | Mean shifted | ||
| 1 | 497.05 | 437.96 | 221.72 | 417.76 |
| 2 | 64.31 | 192.14 | 47.46 | 163.07 |
| 3 | 15.34 | 20.21 | 15.28 | 77.64 |
| 4 | 5.59 | 59.34 | 5.39 | 51.17 |
| 5 | 0.54 | 2.42 | 0.50 | 10.17 |
We note that an analysis in terms of pointwise averages omits very important properties of energy and force errors that differentiates them from statistical noise. First, actual errors are correlated in that small changes in the positions of atoms are likely to incur similar errors, leading to smooth local distortions with respect to a fully converged potential energy surface. Second, errors can also be discontinuous, e.g., when the k-point mesh discretely changes as a simulation cell is smoothly distorted. As will be shown in the next section, the different statistical properties of the energy and force errors can be used to mitigate their impact and reduce error propagation into MLIP models.
000 configurations) evaluated at the six convergence levels and subsequently tested the models on the other half, using both the corresponding convergence level or the highest convergence level. In addition, the effect of the relative weight of energies and forces is explored. Unless otherwise noted, all energy RMSE values reported from this point forward are the raw ‘unshifted’ errors.
Table 4 and 5 report the root mean squared errors (RMSE) values for energy and force errors for MLIPs trained at different convergence levels, with both higher energy weight (wE:wF = 150
:
1) and higher force weight (wE:wF = 12
:
1). Note that the datasets contain around 150 times more force components than energies as each configuration contains on average 50 atoms (see Sec. 2.1). Each row in the table corresponds to the convergence level of the training set. The “Testing on self” column reports the RMSE values when tested on the same convergence level as the training set, while the “Testing on 6th” column reports errors when tested on 6th (highest) convergence level data. The ultimate goal is to obtain models with low errors when tested against the highest convergence level. Testing errors are evaluated using both an unshifted potential and a shifted potential, where the latter includes a constant energy offset equal to the mean prediction error. Table 5 presents similar data for force RMSE values, with the exception of the absence of a shift correction.
000 configurations at different DFT convergence levels and with the highest level of MLIP expressivity (2Jmax = 10)
| Conv. Level | Higher energy weight | Higher force weight | ||||||
|---|---|---|---|---|---|---|---|---|
| Training | Testing on self | Testing on 6th | Training | Training | Testing on self | Testing on 6th | Training | |
| Unshifted | Shifted | Unshifted | Shifted | |||||
| 1 | 111.74 | 117.91 | 473.42 | 164.02 | 187.84 | 188.55 | 458.17 | 104.85 |
| 2 | 30.90 | 33.18 | 53.76 | 32.17 | 39.26 | 40.06 | 50.21 | 25.85 |
| 3 | 11.01 | 11.67 | 11.08 | 10.99 | 15.12 | 15.48 | 9.24 | 9.14 |
| 4 | 6.20 | 6.58 | 5.80 | 5.62 | 8.84 | 9.01 | 8.16 | 8.04 |
| 5 | 4.79 | 5.06 | 5.05 | 5.05 | 7.70 | 7.84 | 7.85 | 7.85 |
| 6 | 4.77 | 5.05 | 5.05 | 5.05 | 7.70 | 7.84 | 7.84 | 7.84 |
000 configurations at different DFT and with the highest level of complexity (2Jmax = 10). No shift correction is applied to force errors
| Prec. Level | Higher energy weight | Higher force weight | ||||
|---|---|---|---|---|---|---|
| Training | Testing on self | Training | Testing on self | Training | Testing on self | |
| 1 | 931.09 | 944.15 | 900.29 | 375.83 | 379.2 | 242.75 |
| 2 | 235.72 | 238.89 | 186.67 | 189.33 | 190.69 | 116.66 |
| 3 | 143.30 | 145.03 | 124.53 | 130.66 | 132.06 | 108.57 |
| 4 | 120.46 | 121.86 | 111.50 | 116.4 | 117.67 | 106.89 |
| 5 | 109.41 | 110.87 | 110.51 | 105.71 | 106.98 | 106.6 |
| 6 | 109.04 | 110.52 | 110.52 | 105.32 | 106.6 | 106.6 |
Table 4 highlights that training errors are larger for MLIPs trained on lower convergence data compared to higher convergence, indicating that the potential energy surface becomes smoother and hence easier to fit as the convergence increases. This can be related to the discussion above (see Sec. 2.1) where the impact of discontinuities and inconsistencies due to incompatible k-point meshes and limited plane wave energy cut-offs is expected to decrease with increasing convergence, producing smoother and more internally consistent potential energy surfaces, as commonly reported in the literature.42,43 Interestingly, we observe that testing errors are generally lower when models trained on lower convergence data are tested on 6th convergence data than on their own level of convergence. This observation suggests that the inability of MLIPs to capture unphysical behavior such as energy discontinuities/inconsistencies due to discrete changes in k-point mesh sampling can, in fact, be an advantage, since it can be used to partially recover the behavior of smoother high-convergence data. Supporting this interpretation, SI Fig. S4–S7 show strong correlations between energy and force residuals from MLIPs trained on low-convergence data and actual DFT error between high- and low-convergence DFT energies and forces, indicating that artifacts in the low-convergence DFT energy surface are indeed partially corrected by the MLIP. This trend is especially evident when large force weights are used during training, as reflected in Table 4, where the lowest energy RMSE values for the 1st, 2nd, and 3rd convergence levels occur at higher force weight. However, this is no longer true for the 4th and 5th convergence levels, where DFT errors are small and the MLIP errors become limited by the model's intrinsic ability to capture the full complexity of the dataset, as indicated by the error saturation when training to higher DFT convergence data.
Perhaps counterintuitively, increasing the energy weight of convergence levels 2 and 3 leads to higher training and testing energy errors when measured at the 6th convergence level (solid line). In these cases, the Pareto front almost collapses to a single point which simultaneously provides the lowest energy and force errors. This outcome stems from the nature of errors in low-convergence DFT calculations, caused by insufficient sampling of the k-point mesh and sharp features in the potential energy surface based on the restricted number of plane waves, affecting the total energy of the supercell. The forces are less sensitive and exhibit faster convergence compared to energies with respect to the DFT convergence parameters like plane wave cut-off and k-point density.43 This is also illustrated by the force error in Fig. 3b which is more symmetric and Gaussian-like, suggesting force-related errors less effected by a reduced plane wave energy cut-off or a reduced k-point mesh sampling, resulting in a statistically more well-behaved training set for learning compared to the errors in energy. Consequently, focusing training excessively on low-convergence energies by increasing energy weight can lead the MLIP to partially learn the incorrect low-convergence potential energy surface, resulting in these systematic errors that distort the potential energy surface away from the high-convergence reference. In contrast, by focusing more on forces, the MLIP can most efficiently “average-out” errors, effectively learning a smoother representation of the low-convergence energy landscape which is closer to the fully converged high-convergence potential energy surface. This suggests that leveraging the larger, statistically more robust force dataset via increased weighting can mitigate the impact of low-convergence energy noise. Therefore, carefully adjusting the relative weighting of energies and forces is crucial when training MLIPs on lower-convergence data, as increasing force weights can, perhaps paradoxically, produce MLIPs yielding better energies in comparison to the high-convergence potential energy surface as well as better force convergence. A similar principle has been demonstrated in the context of multi-fidelity learning, which shows that forces from a lower-accuracy level of theory can be effectively combined with higher-accuracy energies to produce robust potentials.44 Note, the increase in test energy errors at high DFT convergence (levels 4 and 6, right panel of Fig. 4) can be attributed to the model using its fitting freedom to overfit to the small number of energy data points when they are weighted higher. This behavior is closely related to model misspecification; the model's flexibility, including adjusting the implicit atomic reference energy, is used to capture specific features of the training set energies at the expense of generalizability. This overfitting can be regularized by increasing the force weight.
In most applications that require simulating material properties that are not accessible with direct DFT simulations, the complexity of the MLIP is limited by the available computational resources e.g., the complexity of the MLIP is chosen based on the goal to achieve a fixed target in terms of the number MD simulation time steps required to investigate a given physical phenomenon of interest. The results presented in the previous section suggest that the impact of DFT errors varies based on the complexity of the MLIP, which provides the opportunity to reduce the DFT convergence and increase the computational efficiency during the dataset curation. Indeed, highly complex MLIPs are expected to be more prone to learning the DFT errors compared to simpler models, and so will intrinsically require higher convergence DFT training sets, while simpler models benefit from lower-convergence training sets, reducing the computational cost to construct the DFT training set. To explore this, the previous analysis is repeated for different model complexities. For qSNAP potentials, this is achieved by varying the angular order of the bispectrum expansion, which is commonly referred as the 2Jmax parameter7 (see Sec. 2.2). As shown in Table 2, increasing 2Jmax corresponds to a rapid increase in the number of coefficients in the MLIP, and hence to an increase in complexity.
Fig. 5 shows that simple MLIPs are indeed less sensitive to the errors in low-convergence DFT training sets. For example, for 2Jmax = 4, the potentials trained on convergence levels 3, 4, and 5 yield similar errors when tested on a convergence level 6 testing set. In contrast more complex MLIPs with 2Jmax = 10 show a more pronounced dependence on the training set convergence level. This suggests that lower complexity MLIPs can leverage low-convergence DFT training sets compared to higher complexity MLIPs, as they are less able to learn spurious features of the low-convergence DFT potential energy surface.
000 configurations where the error is dominated by the finite expressivity of the MLIP. This results in a reduction of computational cost by a factor of 10 compared to random sampling. Similarly, MLIPs can approach the limiting force errors by about 10 meV Å−1 using 3 to 4 times less configuration than required by random selection. Finally, block leverage sampling is observed to yield lower force errors while regular leverage sampling leads to lower energy errors, although the differences between the two approaches are modest. It is important to note that the computational cost of leverage sampling is minimal, as it can be obtained at a computational cost comparable to that of a single linear regression solution on the whole dataset of 20
000 configurations. Importantly, configurations can be prioritized using leverage sampling without the need to carry out the related DFT simulation first, as only the features of each atomic configuration are required to compute the leverage score, but the energies and forces computed by DFT are not required.
Several key conclusions emerge: neither the 1st nor 2nd convergence levels appear on the Pareto front, indicating that smaller higher-convergence subsets always outperform larger very-low-convergence datasets. We postulate that this reflects an inherent trade-off where larger number of configurations of very noisy data are required to “average out” intrinsic errors, compared to high-convergence data where the ultimate accuracy limit-the point at which errors are controlled only by the model's finite complexity can be expected to occur earlier. This phenomenon can be frequently observed on the Pareto front (especially in panels (a) and (b) of Fig. 7) where different DFT convergence levels can lead to similar accuracies and overall DFT computational costs, indicating that the lower-convergence training sets were hence larger than their high-convergence counterparts. Of course, whether the extra amount of low-convergence training set configuration can be obtained at a sufficiently low computational cost compared to a smaller number of configurations at higher DFT convergence is application-specific. Similarly, the 6th (highest) level convergence training set is mostly absent except at the very edge of the accessible error range, due to a marginal increase in convergence obtained in spite of the significantly higher computational cost.
Correlating these findings with Table 3 suggests that different levels of DFT convergence corresponding to the error much lower than the ultimate accuracy achievable by the MLIP due to finite complexity are unlikely to be optimal, as the amount of extra information gained by high-convergence DFT simulation has a limited impact on the accuracy of the MLIP. This explains the scarcity of DFT simulation of level 6 convergence on the Pareto front. Similarly, the value of low-convergence DFT simulation is limited when the DFT errors significantly exceed the accuracy achievable by the MLIP due to the need to average out these errors. This is also consistent with the absence of the 1st and 2nd convergence levels of DFT convergence on the Pareto front, as their intrinsic errors exceed the representation capabilities of all MLIPs considered here. These observations are consistent with a rule of thumb, where matching the ultimate accuracy of the MLIP and the convergence of the DFT simulation is desirable.
Further analysis of Fig. 7, 8 and S8 reveals that the optimal DFT level convergence depends strongly on both the MLIP complexity (2Jmax) and whether energy or force accuracy is prioritized. For complex MLIPs (2Jmax = 10), the limiting energy and force errors saturate around 4.5 meV per atom and 105 meV Å−1 respectively. In this case, it is possible to approach both of these limits simultaneously through a proper choice of the energy/force weights. Approaching limiting force errors (c.f., Fig. 8d) is possible even with lower convergence training sets, e.g., the level 4 convergence, while approaching the ultimate energy error level requires convergence level 5 training sets (c.f., Fig. 8d), which is consistent with the observation that the comparatively unbiased statistical properties of the forces make them more susceptible to being averaged out. Nonetheless, the effect is relatively modest in absolute terms, as even convergence level 4 training sets produce MLIPs whose errors are within 1.5 meV per atom and 5 meV Å−1 of the saturation limit. Perhaps most surprising is the observation that very good MLIPs can be obtained very efficiently. For example, models with energy and force errors within 3 meV per atom and 20 meV Å−1, respectively, of the saturation limit can be obtained with only approximately 2 h of total computational time for DFT simulation using convergence level 4. In fact, errors show a very fast decrease in the first hour of DFT simulations, followed by a significant slowing down where further improvements come at a high computational cost. This shows that it is possible to obtain accurate models at an extremely low computational cost. Interestingly, the results also show that the optimal convergence level in fact depends on the total computational resources available for creating the DFT training set, as a small amount of computational resources (10–20 minutes) favors convergence level 3 training sets, intermediate budgets (20–1000 minutes) tend to favor convergence level 4 training sets, and large budgets (>3000 minutes) tend to favor convergence level 5 or rarely even level 6 training sets. Intuitively, this suggests that capturing the broad features of the energy landscape is better achieved with a larger number of low-convergence configurations in the training set than with a small number of high-convergence configurations in the training set, but that refining the detailed features of the landscape gradually calls for higher-convergence configurations in the training sets.
With low-complexity MLIPs (2Jmax = 4 and 2Jmax = 6), limiting energy and force errors worsen significantly. Simultaneously approaching both limits using a simple energy–force weighting is no longer feasible due to a strong energy–force error trade-off. In contrast to high-complexity MLIPs, the decrease in errors is also significantly more gradual as more DFT simulations are added to the training set. For example, focusing on “intermediate” models that attempt to strike the balance between energy and force errors (e.g., energy errors between 30 and 40 meV per atom and force errors between 260 and 300 meV Å−1 for 2Jmax = 4), one observes that increasing the computational resources for DFT simulation from 102 to 103 minutes leads to a substantial decrease in energy and force errors (in absolute terms) by about 7 meV per atom and 20 meV Å−1, respectively, which is significantly larger than the corresponding decrease for 2Jmax = 10. This is perhaps counterintuitive, as simpler models should require less data to constrain, and could therefore be expected to converge faster with increasing training set size. However, this intuition does not extend to misspecified models where no combination of free parameters can perfectly reproduce the training data. In this case, adding more configurations to an already large training set (i.e., much larger than the number of adjustable parameters) can still significantly affect the MLIP’s accuracy, even leading to worsening test errors in some cases,18 a phenomenon that can also be observed here for energy errors at large force weights. Although the leverage sampling strategy should partially mitigate this trend by introducing the most influential points first, the slow convergence of energy and force errors with respect to the training set size is consistent with the behavior of highly misspecified models.
First, we find that very tightly converging DFT calculations can be wasteful when coupled with MLIPs of limited expressivity. Table 4 and 5, along with Fig. 7, show that MLIPs trained on convergence level 5 training sets provide the same level of accuracy as those trained on convergence level 6 at a 10-fold reduction in cost. This is consistent with the rule of thumb that the accuracy of the model and of the data benefits from being roughly matched.
Second, further savings are possible by considering how DFT errors translate into MLIP errors, which is itself dependent on the expressivity of the model. Our Pareto analysis (Fig. 7) shows that if applications allow for more expressive MLIPs (e.g., 2Jmax ≥ 6), training to medium convergence data (level 4 in our case) selected by leverage sampling can produce MLIPs nearly as accurate as those trained on convergence level 5 training sets. This approach can reduce the DFT computational cost by about 10 times compared to using the full convergence level 5 training set, and up to about 100 times compared to using the highest convergence level. Under optimal conditions, convergence with respect to the training set size can occur extremely quickly, using only a handful of GPU hours.
Third, if the application requires very computationally efficient MLIPs (e.g., 2Jmax = 4), these simpler MLIPs are less sensitive to DFT errors. Fig. 7d shows that in these cases, lower-convergence data can be used. This further reduces the computational cost of the DFT simulation while maintaining the same accuracy similar to higher-convergence training sets for the same simple MLIP. Using convergence level 3 saves about 50 times the computational cost for DFT simulation compared to convergence level 5, and over 500 times compared to convergence level 6 to characterize the whole training set. However, we also observed that very simple potentials can counterintuitively require a larger amount of data to achieve convergence with respect to the training set size due to misspecification effects.
Of course, when assembling datasets for applications that require accuracy at the expense of inference cost, or when curating databases of reference results, e.g., to train universal models, erring on the side of caution and employing tight DFT convergence settings remain safe strategies. However, even in this setting, quickly generating lower-convergence data to fine-tune universal models might prove to be the superior option.
Supplementary information is available. See DOI: https://doi.org/10.1039/d5dd00294j.
| This journal is © The Royal Society of Chemistry 2026 |