Open Access Article
Jack
Yang
*a,
Ziqi
Yin
a,
Lei
Ao
abc and
Sean
Li
a
aSchool of Material Science and Engineering, University of New South Wales, Sydney, New South Wales 2052, Australia. E-mail: jianliang.yang1@unsw.edu.au
bJiangxi Provincial Key Laboratory of Advanced Electronic Materials and Devices, Jiangxi Science and Technology Normal University, Nanchang 330018, China
cSchool of Physics, University of Electronic and Technology of China, Chengdu 610054, China
First published on 21st January 2026
Recent developments in materials informatics and artificial intelligence have led to the emergence of foundational energy models for material chemistry, as represented by the suite of MACE-based foundation models, bringing a significant breakthrough in universal potentials for inorganic solids. As to all method developments in computational materials science, performance benchmarking against existing high-level data with a focus on specific applications, is critically needed to understand the limitations in the models, thus facilitating the ongoing improvements in the model development process, and occasionally, leading to significant conceptual leaps in materials theory. Here, using our own published DFT (density functional theory) database of room-temperature dynamic stability and vibrational anharmonicity for ∼2000 cubic halide double perovskites, we benchmarked the performances of four different variants of the MACE foundation models for screening the dynamic stabilities of inorganic solids. Our analysis shows that, as anticipated, the model accuracy improves with more training data. The dynamic stabilities of weakly anharmonic materials (as predicted by DFT) are more accurately reproduced by the foundation model, than those of highly anharmonic and dynamically unstable ones. The predominant source of error in predicting dynamic stability arises from the amplification of errors in atomic forces when predicting the harmonic phonon properties through the computation of the Hessian matrix, less so is the contribution from possible differences in the range of the configurational spaces that are sampled by DFT and the foundation model in molecular dynamics. We hope that our present findings will stimulate future studies towards more physics-inspired approaches in assessing the accuracy of foundation models for atomistic modelling.
One of the key drawbacks of DFT is its
scaling behaviour to the system size measured as the number of atoms N, which makes it computationally very expensive to apply in large-scale modellings, such as for chemically disordered high-entropy materials,4 and systems in which many-body interactions significantly dictate their physical properties.5 Traditionally, this bottleneck was overcome by employing atom–atom force fields6 that are designed with fast-to-calculate analytical functions based on known physics (e.g. harmonic potential for bond stretching) with parameters fitted to a single or a specific set of material(s). However, this also imposes significant limitations in applying these tailor-made force fields to correctly model exotic materials behaviours such as anharmonic phonon vibrations and Coulomb interactions between polarised charge densities, thus restricting the model transferability to systems that have not been parameterised. This makes the development of universal force field a long-term challenge in materials modelling. This, nevertheless, is not a problem for DFT.
Over the past two decades, machine-learning interatomic potentials (MLIPs) have emerged and rapidly developed to bridge the gap between DFT- and force-field-based energy models. The early successes in this endeavour are largely rooted in using kernel regressions based on hand-crafted and physically inspired descriptors for local atomic environments.7,8 This knowledge has been fueled into the recent development of deep-learning potential models such as Schnet,9 NequIP,10 MACE11 and So3krates.12 Some of them12 have further incorporated deep-learning architectures, such as the attention mechanism13 from the large language models to capture long-ranged atom–atom interactions in materials, demonstrating the cross-paradigm nature in this field of research. In the meantime, the continuous expansions of large computational materials databases, such as the Materials Project14 and OMAT,15 have provided the community with rich resources of structural, energetic and property data of materials that are generated in a consistent level of theory. The scale and diversity of hundreds of millions of first-principles calculations provided by these databases unlock our capabilities to develop a transferable universal foundation energy model for (solid-state) materials across a significant portion of the existing chemical space.16
This significant milestone can be exemplified by the recent achievement behind the releases of a suite of foundation models17 based on the MACE11 (message-passing atomic cluster expansion) architecture, which is the focus of this study. More specifically, to learn the total atomic interaction energies in chemical systems, MACE combines the graph neural network18 that models chemical structures as graphs and utilises the message-passing mechanism19 to exchange chemical bonding information across multiple message-passing layers in the network, together with the atomic cluster expansion20 formalism to ensure that the equivariance of the local atomic environment is preserved as the messages are passed through the network.
In the first release,17 dubbed as mp-0, the foundation model was trained on the MPtrj21 dataset, which contains a large number of static calculations and structural optimisation trajectories for inorganic solids at the PBE+U level of theory. This includes approximately 1.5 M structures with 90% of them having less than 70 atoms per unit cell. With this level of coverage, the mp-0 model had been applied to demonstrate its applicability to 30 different categories of atomistic simulations, ranging from ice structures, metal organic frameworks, heterogeneous catalysts, amorphous structures, to complex liquid–solid interfaces.
However, even at this training scale, the intrinsic problem associated with any MLIPs cannot be overlooked in the foundation model, that is, at its best, the model accuracy can only be as good as the underlying theory that was applied to generate the training data. This issue has already been addressed,17 for example, the DFT setting for generating the MPtrj dataset is less tight compared to that required for accurate phonon calculations, as such, the error in reproducing the DFT phonon bandwidth with mp-0 is ∼1–2 THz, which is an order of magnitude larger than the results from a highly specialised model.22 Overcoming this kind of shortage is undoubtedly a key driving force for the ongoing improvement of the MACE-foundation models (Table 1). This is because many key physical properties of materials, such as dynamic stabilities,23,24 electron dynamics and superconductivities,5 all share strong links to the phononic behaviours of the materials. A notable improvement in predicting phonon properties is expected with the latest iteration of the model that was trained on the OMAT15 database.
| Model name | Elements covered | Training dataset | Level of theory | Notes |
|---|---|---|---|---|
| matpbs-pbe-omat-ft | 89 | MATPES-PBS25 | DFT (PBE) | No +U correction |
| mpa-0-medium | 89 | MPtrj21 + sAlex26 | DFT (PBE+U) | Improved accuracy particularly high pressure stabilities |
| mp-0b3-medium | 89 | MPtrj | DFT (PBE+U) | Improved phonon properties |
| omat-0-medium | 89 | OMAT15 | DFT (PBE+U) | Excellent phonon properties |
This is an interesting development, as the major improvement from MPtrj to OMAT was not necessarily on DFT settings that improved the phonon accuracy, but an expansion in the dataset size which contains, for example, rattled crystal structures sampled from Boltzmann distributions as well as molecular dynamic trajectories. This highlights the importance of including more training data that can closely trace the topologies of the underlying DFT potential energy surfaces (PES) for different materials as a key strategy for developing foundational models for materials chemistry.
A particularly relevant case is anharmonic27 solids, for which vibrating atoms tend to traverse a PES with complex topology that notably deviates from the idealised parabolic shape. A representative material system is the cubic perovskites, for which the high-symmetry cubic structure is a saddle point on a double-well-shaped PES, that can be expressed as a fourth-order polynomial.28 Solving the eigenvalue equation for the dynamical matrix of harmonic phonons for these systems typically leads to imaginary phonon frequencies29 at the high symmetry points in the reciprocal space, which correspond to the structurally-related antiferrodistortive30 or electronically-related ferroelectric31 instabilities. Distorting the high-symmetry cubic perovskite structure along the eigenvectors of these imaginary phonon eigenvectors corresponds to symmetry-breaking events that will drive the structure into an energetically more stable state. Physically, the depth of the double-well potential has a strong contribution towards the degree of vibrational anharmonicities. The latter is strongly related to the chemical constituents and bonding characteristics of the materials.
The above idea inspires our present investigation, in which we use our unique harmonic phonon and room-temperature ab initio molecular dynamics (AIMD) database of ∼2000 halide double perovskites (HDPs),32 covering a diverse range of material dynamic stabilities and vibrational anharmonicities,27 while maintaining structural homogeneity (all having Fm
m space group symmetry), to benchmark the performances of the MACE foundation models (Table 1) in tracing the topologies of PES across a range of different degrees of anharmonicity.
Our detailed analysis reveals the following. The previously established anharmonicity score27 is fundamentally equivalent to a measurement of force-fitting residue,33 which can be used to reveal (a) part of the chemical space where the foundational models performed well (poorly) in reproducing the DFT–PES, as well as (b) regions of the DFT–PES for an individual material that are well (poorly) reproduced by the foundation model. Overall, it shows that the highly anharmonic part of the chemical space and the DFT–PES for individual material are generally less well reproduced by the foundation model. Nevertheless, if both the harmonic and anharmonic contributions to the atomic forces are computed consistently with the same energy model, it should provide a reasonably good indication of the dynamic stabilities of a material that is quantitatively aligned with the DFT result, suggesting these foundation models17 are indeed sufficient for accelerating large-scale screening of finite-temperature materials stabilities, which is a critical component in the theory-driven materials discoveries.
(1) Harmonic force constant matrix Φαβij computed from the finite-displacement approach in real space.35 Physically, each matrix element of the force constant matrix corresponds to the force appearing on atom i along the α Cartesian direction when atom j is displaced along the β direction. The availability of the force constant matrix enables us to surrogate a (3N + 1)-dimensional (with N being the number of atoms in the simulation supercell) harmonic approximation to the PES in the vicinity of the local minimum that corresponds to the high-symmetry Fm
m structure of HDP. Diagonalising the Fourier transformation of Φ will provide us with the phonon eigenfrequencies {ω(q,n)}, where q is the phonon wavevector in the first Brillouin zone, and n is the band index for a given q. The phonon dispersion relationship can be obtained by connecting {ω(q,n)} with the same n across all symmetrically unique q-points in the first Brillouin zone, from which one can compute the corresponding phonon group velocities as vg(q,n) = ∂ω(q,n)/∂q.
(2) AIMD trajectory which contains a set of time-dependent atomic coordinates and forces {R(t), F(t)} that are sampled at 300 K for up to 1.6 ps at 1 fs time step under the NVT ensemble. AIMD simulations enable us to sample a wider (higher-energy) portion of the energy basin that is centred around the Fm
m local minimum. Since DFT does not take any assumption on the topology of the underlying PES (as opposed to traditional force fields), but solely determines the local PES gradient (encapsulated in F(t)) by solving the electronic structure at the given atomic configuration R(t), it is able to capture the nonparabolic aspect in the topology of the PES, especially at distant to the local minimum.
![]() | (1) |
Essentially, the anharmonicity is measured by comparing the standard deviation of the total (F) and anharmonic (FA) atomic forces sampled across the AIMD trajectory. Here, Fαi (Fα,Ai) is the total and anharmonic force on the i-th atom in the simulation cell along the α-Cartesian direction, and they are related to each other via Fα,Ai = Fαi − Fα,(2)i, in which the harmonic component of the atomic force can be computed from the force constant Φαβij as
, with uαi being the atomic displacement from its equilibrium position. Summation over all atoms and three Cartesian directions for a given AIMD frame gives the time-dependent σ(2)(t), which provides a measure of anharmonicity for the particular frame, whereas taking the average 〈σ(2)(t)〉t over the entire AIMD trajectory provides a single numerical measure of the anharmonicity of a given material at T. The later also determines the dynamical stability of the materials, as those with 〈σ(2)(t)〉t > 1 are considered as unstable at the simulated temperature T.
There are two important aspects of the anharmonicity score, which is rooted from its definition eqn (1). Firstly, as the harmonic force is directly proportional to the atomic displacements, σ(2)(t) can be treated as a single-valued proxy to gauge the range of the phase space being sampled in an MD simulation.36 Secondly, eqn (1) shows that the anharmonicity score is fundamentally a measure of standard deviation, which is also a measure of force fitting accuracy in all MLIP developments,33 hence, as shall be shown below, it can provide us with more physical and diagnostic insight into the chemical and structural phase spaces in which the foundation models performed well or extrapolate poorly in practical simulations.
![]() | (2) |
![]() | (3) |
As an example, Fig. 1(a) shows the relationship between RMSE(ω2) and RMSE(vg) for chloride HDPs computed with the omat-0-medium model. To showcase what these two error metrics reflect, we also show in Fig. S2 of the SI the corresponding comparisons of the phonon dispersion curves computed with the omat-0-medium model and DFT for the two extrema (Cs2KDyCl6 and Na2CuAuCl6) identified in Fig. 1(a). For Cs2KDyCl6 which has the lowest RMSE(ω2) value, it can be seen from Fig. S2 (left) that the phonon dispersion curves computed from the MACE model are well overlapped with the DFT ones, except some deviations near 0 THz around the X-high symmetry point. For the worst case of Na2CuAuCl6, Fig. S2 (right) shows the MACE underestimates the imaginary phonon frequencies which consequently led to a flatter dispersion curve that increases RMSE(vg). In this particular case, the material that is deemed unstable on the DFT–PES became less dynamically unstable on the MACE-PES.
![]() | ||
| Fig. 1 Accuracies of the MACE foundation models in predicting the harmonic phonon properties for HDPs. (a) Correlations between the RMSEs in predicting the phonon eigenfrequencies and group velocities with respect to the DFT results for chloride HDPs using the omat-0-medium foundation model. Each data point is colour-coded according to its anharmonicity score σ(2) computed from DFT.32 Two extrema with the best (lower left) and worst (upper right) prediction accuracies are highlighted, the corresponding phonon dispersion relationships for which are shown in Fig. S2 of the SI. (b) and (c) The box plots that present the ranges of RMSEs in predicting the phonon eigenfrequencies and group velocities using all four foundation models for each group of HDPs as categorised by the halide anion. The orange line indicates the medium RMSE. The box limits represent the 1st and 3rd quartiles. The whiskers show the range of RMSEs within 1.5× the interquartile range of the box limits, and the outliers are denoted by light cyan plus symbols. | ||
By further highlighting each point on Fig. 1(a) with the anharmonic score σ(2) computed with DFT for the corresponding HDP structure, a more intriguing trend is revealed, which shows that the accuracy of the phononic properties predicted by the MACE foundation model are strongly correlated with σ(2), such that structures with high dynamical stabilities (lower σ(2)) exhibit lower RMSEs, and vice versa. This is a systematic trend that occurs in all four halide systems investigated, regardless on which datasets the foundation models had been trained [Fig. S3–S6]. It is not a coincidence, as discussed in the Methodology section, that σ(2) is also a RMSE-type measurement, but with a fundamental geometric insight that captures the deviation of the shape of the PES from a hyperparabola.
In Fig. 1(b) and (c), we show the box plots that provide a more summative view over our benchmark results on the harmonic phonon properties for HDPs. It can be seen that, across the chemical space from fluorides to iodides, the accuracy in predicting the harmonic phonon properties increases as the atomic masses of the halide anions increase. This effect is particularly pronounced in reproducing the phonon eigenfrequencies [Fig. 1(c)]. As shown in our previous work,32 the vibrational anharmonicities of HDPs do exhibit a systematic decrease from light to heavier halides [Fig. S1(a)], hence the chemical trends behind the RMSE values shown in Fig. 1(b) and (c) are largely consistent with the trend shown in Fig. 1(a) for chlorides with respect to the variations in σ(2). The observed chemical trend is largely unchanged across all four parameterisations of the MACE foundation models, with the omat-0-medium being the best performing model, showing an order of magnitude improvement in RMSE(ω2) compared to the worst performing matpes-pbe-omat-ft model.
On a more fundamental level, the chemical trend observed in RMSEs can be further correlated with the phonon bandwidths (how wide ω spans, which can be equivalently reflected from the averaged phonon eigenfrequencies 〈ω〉) for HDPs with different halide anions. As shown in Fig. S1(b) extracted from our previous study,32 〈ω〉 increases systematically from iodides to fluorides. This indicates that, for lighter halides, the vibrating ions experience larger restoring forces that originate from a steeper topology of the PES. From the perspective of training atom–atom force fields,41 it is generally understood that steep or sharp rising parts of the PES are more challenging to be accurately trained, which would require more training data and/or more tailored functional forms to reduce the training complexity. In the domain of fully data-driven MLIPs, the quality of the potential energy model becomes more critically dependent on the breadth of the configuration space covered in the training set. Whilst the OMAT15 dataset already contains rattled atomic structures sampled according to the Boltzmann distribution up to 1000 K, the number of the rattled structures per compound was fixed. Our present findings suggest that, moving forward, a more adaptive scheme in constructing the training sets, particularly applying a weighting scheme to include more rattled structures following the high-frequency phonon modes for systems containing light elements, would be an interesting path to explore for increasing the accuracy of foundational energy models.
Nevertheless, directly comparing an AIMD trajectory with another one that is independently sampled with a different energy model, in this case, MLIP, it is often difficult to come up with good interpretations that can lead to direct and meaningful physical insights into the qualities of MLIPs. Fundamentally, this is because two different energy models correspond to two different PES, and even a small difference in the PES topologies can shift the equilibrium positions, barrier heights and transition states, such that the two MD trajectories may cover completely different configuration spaces.
To overcome this kind of complication, in this section, we shall first take the existing AIMD trajectory for each HDP36 (a total of 1682 valid ones), to recompute the atomic forces for each frame in every trajectory with the MACE foundation model, from which a new 〈σ(2)〉MACEt metric (σMACE for short-handed notation) can be attained, that is to be directly compared with 〈σ(2)〉DFTt (σDFT for short-handed notation). More specifically, for each AIMD trajectory and MACE foundation model that we benchmarked, we compute σMACE with two different approaches to obtain the harmonic component of the atomic forces (Fα,(2)i = −Φαβijuαi) via the force constant Φαβij: (a) ΦDFT-approach, where the atomic forces for each displaced configuration generated from the finite-displacement method35 are computed with DFT32 to construct the force constant Φ, and (b) ΦMACE-approach, with the atomic forces computed with the MACE foundation models.
Geometrically, the ΦDFT-approach can be considered as a way of providing a direct measure of the ability of the MACE energy models to exactly reproduce the topologies of the DFT–PES around the local minimum. When σMACE > σDFT, the MACE model overestimates the total forces, leading to a more anharmonic PES compared to the DFT baseline, which is the other way around when σMACE < σDFT. In other words, the discrepancy between σMACE and σDFT provided an absolute measure on the errors in predicting the total atomic forces. When σMACE = σDFT, the DFT energy landscape can be fully reconstructed by the MACE energy models over all {u}. For the ΦMACE-approach, σMACE takes no reference to the DFT-energy landscape, thus it reflects the degree of anharmonicity of the MACE-energy landscape itself. When σMACE = σDFT, it means that the relative contributions from the (an)harmonic force components to the total atomic forces are the same between the MACE and DFT energy models, and the MACE and DFT–PES differ from each other globally by some constant multiplicative factor.
Physically, comparing σMACE with σDFT provides a key indication to whether the degrees of finite-temperature dynamic stabilities of materials predicted by the MACE foundation models agree with those predicted by the DFT. In particular, the ΦMACE-approach provides a looser criterion in making this judgement as it only requires the relative contributions of the anharmonic forces to the total one being the same as predicted from MACE models and DFT, which may benefit from error cancellations in subtracting Φαβijuαi from total atomic force Fαi, when both terms are predicted from the MACE models. In contrast, the ΦDFT-approach is more strict, which would require the total atomic force predicted from the MACE-models to closely match those computed from DFT. This information is presented with the confusion matrices shown in Fig. 3. In each confusion matrix, the dynamic stabilities are characterised into three categories33 [Fig. 2]: (a) σ ∈ (0, 0.5), corresponding to highly stable structures with weak vibrational anharmonicity that is predominantly contributed by three-phonon scatterings (encapsulated by the third-order force constant Φαβγijk). (b) σ ∈ [0.5, 1], meaning the phase is still stable at the simulated temperature T but with stronger vibrational anharmonicity that is dominated by the force constants from fourth-order and above. And finally, (c) σ > 1, meaning the phase is unstable at T.
![]() | ||
| Fig. 2 Ranges of anharmonic scores computed from DFT and MACE models that are covered by different blocks of the confusion matrix. In particular, the diagonal blocks represent cases where the MACE model predicts the same dynamical stabilities as from DFT calculations, and they are categorised into cases corresponding to systems that are (a) weakly anharmonic (σ < 0.5), (b) strongly anharmonic (σ ∈ [0.5, 1]) and (c) dynamically unstable (σ > 1), following the results from our previous high-throughput screening studies on single perovskites.33 This convention is followed when presenting the results in Fig. 3 and 4. | ||
Results in Fig. 3 show that the confusion matrices are dominated by the diagonal elements, meaning that the consensuses in predicting the phase stabilities at 300 K between the MACE foundation models and DFT are generally acceptable across a wide range of stability regimes, supporting the argument that MACE foundation models is useful for fast pre-screening filter for unstable materials.17
In the ΦDFT-approach, one sees a clear trend of increasing the number of correctly predicting materials’ dynamic stabilities from the matpes-pbs-omat-ft to the omat-0-medium model, which is in line with the observations above. The likelihood of the MACE models to overly stabilise (destabilise) the DFT-predicted unstable (stable) materials, i.e. σMACE > 1 for σDFT < 1 or vice versa is generally low. Except for the matpes-pbs-omat-ft and mp-0b3-medium models, for which we see a significant number of weakly anharmonic HDPs being classified as strongly anharmonic by the MACE models.
The ΦMACE-approach reveals a different outcome. In this case, the number of HDPs that have their dynamical stabilities being correctly identified remain almost unchanged across different MACE models. As discussed above, this means that, across a large part of the chemical space, the relative anharmonic contributions to the overall topologies of the PES remain invariant from DFT to the different MACE models. However, we also observed from Fig. 3 that when the ΦMACE is used to extract the harmonic components of the atomic forces, the number of HDPs being placed in the lower off-diagonal parts of the confusion matrices increased significantly, meaning that when the MACE model is used solely to compute σ, it tends to over-stabilise the highly anharmonic and unstable HDPs (see Fig. S7(d) in the SI for an example). This observation can be better reflected when we plot the distributions of 〈σ(2)〉MACEt − 〈σ(2)〉DFTt in Fig. S9 of the SI, which show strong tailing in the negative part. As discussed in the harmonic phonon section, this reflects the poorer generalisability of the MACE foundational models in capturing the anharmonic features of the PES, particularly for materials with low dynamical stabilities.
Computationally, the MD samplings using the MACE foundation model, dubbed as MACE-MD, are carried out as follows. For each HDP, we randomly selected 2 frames from the previously sampled AIMD trajectory as the starting points to perform 2 independent MACE-MD samplings. This choice of starting points for MACE-MD imposes a weak constraint that the configurational space that is sampled by the MACE-MD should have some overlap with the configurational space sampled from AIMD, at least in the initial stage of the MACE-MD sampling. Each MACE-MD simulation was run for 2 fs at a 1 ps time step using the MD engine from the atomic simulation environment.44 The same as in our previous work,32 the Andersen thermostat45 with a collision probability of 0.5 was employed to maintain the simulation temperature at the target value of 300 K. The corresponding 〈σ(2)〉MACEt was averaged over all 4000 MACE-MD frames using the force constant ΦMACE computed with the same omat-0-medium model to extract the harmonic components of the atomic forces.
Fig. 4 presents the confusion matrices that compare the numbers of HDPs that share the same/different classifications of their dynamic stabilities as solely determined from either the DFT or the omat-0-medium foundation model. Similar to the results shown in Fig. 3, the confusion matrix [Fig. 4(a)] is still dominated by the diagonal elements, meaning that the performance in determining the room-temperature dynamic stabilities using the omat-0-medium foundation model alone is generally acceptable. Counting the numbers in the lower diagonal part of the confusion matrix, we found 38% chance of categorising HDPs with low stabilities to be more stable ones. In contrast, the upper diagonal part of the confusion matrix leads to only 6% chance of misplacing stable materials to be less stable, which predominantly comes from the fluorides [Fig. 4(b)]. This shows that the omat-0-medium model leads to more false positive cases than false negative one. This means that the chance of missing stable materials is lower, compared to including more unstable materials, when it comes to (pre-)screening dynamically stable materials using the omat-0-medium model, whereby more accurate models (such as DFT) can be used subsequently to further filter out the false positive results.
As mentioned above, comparing two energy models using results from MD simulations may introduce bias because the subtle differences in the PES topologies underpinned by the two energy models may cause MD simulations to sample two distinctly different configurational spaces that intrinsically exhibit different properties. To assess the extent of this bias that could have contributed to the results that are presented in Fig. 4, we have selected five extreme cases among the fluoride compounds (Table 2) and performed unsupervised machine learning to compare the similarities in the configurational spaces that are sampled by the AIMD and MACE-MD [see the Methodologies section for details]. Results from such analysis (Fig. 6) show that, first of all, the way we selected the initial structures for running the MACE-MD simulations did mitigate some of the bias by enforcing the configurational spaces sampled by the two different energy models to (at least partially) overlap with each other. The system that exhibits the largest overlap in the configuration spaces sampled by the two energy models is Rb3AlF6, of which the computed σMACE is literally identical to σDFT (Table 2). In this case, we can consider the DFT–PES around the local minimum for the cubic Rb3AlF6 has been well reproduced by the omat-0-medium model. In contrast, K2RbSbF6 represents the other extreme case where the configuration space sampled by the MACE-MD diverges quite significantly from the one sampled with AIMD. The other three compounds listed in Table 2 are somewhere between these two extreme cases, as revealed in Fig. 6.
| Compound | σ DFT | σ MACE | |σDFT − σMACE| |
|---|---|---|---|
| Rb3AlF6 | <0.5 | <0.5 | 0.000375 |
| Rb2NiAgF6 | <0.5 | [0.5, 1] | 0.400 |
| Cs2NaRuF6 | [0.5, 1] | <0.5 | 0.663 |
| K2InAgF6 | [0.5, 1] | [0.5, 1] | 0.00102 |
| K2RbSbF6 | [0.5, 1] | >0.5 | 0.440 |
To check that the above observations are not necessarily biased by the longer trajectories that are sampled by the numerically more efficient MLIP, we performed extra simulations which extended the original AIMD trajectories to 4 ps in length, and re-performed the same KPCA analysis. With the longer AIMD trajectory, Fig. S11 of the SI shows the divergence between AIMD and MACE-MD trajectories for K2RbSbF6 has reduced, but for Rb2NiAgF6, the divergence between the two trajectories increased. Taking into account the stochasticity of MD simulations when interpreting the KPCA maps that are shown in Fig. 6 and Fig. S12 of the SI, we can say that the topologies of the PES underpinned by DFT and the omat-0-medium model for most materials under the current investigation should be very similar, rendering sufficient similarities in the configuration spaces that are sampled from these two models. As such, dissimilarities in the sampled configuration spaces are not believed to be strongly contributing to the discrepancies in σDFT and σMACE.
By further colouring each point on the KPCA maps with the anharmonicity score for the corresponding MD snapshot (Fig. S10 and S12 of the SI), it becomes clear that when |σDFT − σMACE| is large, it can be generally attributed to a systematic error in which σMACE computed for the entire MACE-MD trajectory is collectively and significantly different from σDFT even in the regions of the configurational space where the overlap between those sampled from AIMD and MACE-MD is significant. This suggests that the error in computing σ must be inherited from the error in computing the Hessian matrix Φ, which is indeed supported by Fig. 5 showing that low (high) errors in predicting the phonon group velocities and frequencies generally translate to low (high) discrepancies between predicted values of σMACE and σDFT. The reason for this, is that, understandably, just as the higher accuracy that is required in calculating the atomic forces for determining the phononic properties with DFT, even small errors in predicting the atomic forces with the foundational models could translate into large notable differences in the phonon dispersion relationship, as the errors are amplified in the calculations of the derivatives of forces with respect to the atomic positions.
![]() | ||
| Fig. 5 Selected fluoride HDPs, the molecular dynamics trajectories of which sampled from AIMD and MACE-MD, will be compared with the SOAP-REMatch kernel. | ||
![]() | ||
| Fig. 6 KPCA maps for the five fluoride HDPs listed in Table 2 that compare the configuration spaces sampled by AIMD32 and MACE-MD with the omat-0-medium model. Each point on the map corresponds to a configuration in the MD trajectory. | ||
Based on the above primary findings, we further extended our benchmark by computing the anharmonicity scores for HDPs with both the harmonic force constants and MD samplings solely based on the omat-0-medium model. It is found that the dynamic stabilities determined using this kind of approach correlate well with the DFT results, suggesting that the omat-0-medium model is suitable for accelerating the screening of dynamic materials stabilities for materials discoveries. In more careful examinations of the HDP systems, on which the omat-0-medium model performed well (poorly) in reproducing the anharmonicity scores as determined by DFT, we think it is reasonable to believe that the topologies of the DFT PES are generally well reproduced by the MACE foundation model, whereby considerable overlaps in the configuration spaces sampled by the two approaches can be observed. The (large) discrepancies between the foundation-model- and DFT-predicted anharmonicity scores can be predominantly attributed to the amplification of the errors in predicting the atomic forces with the foundation model when calculating the Hessian matrices. This also shows the possible need of including Hessians in training materials’ foundation models, to promise their applications in scenarios where high numerical precision is a must in atomistic materials’ modelling.
We hope that the findings presented in this study have provided interesting and useful insights to facilitate the ongoing developments and fine-tunings of materials foundation models. For instance, proposing metrics such as the anharmonic score is particularly interesting, as it can not only be used for quantifying the model quality, but also be interpreted based on materials’ properties to provide more physical insights into understanding the model performances. We envisage that the continuous evolution of the foundation models will further advance the important statistical physics tools in configurational space sampling, particularly in tackling the challenge of meeting the ergodic condition, which bears implications in computing and understanding a wide range of physical and chemical properties of functional materials, such as thermal expansions, lattice thermal conductivities, catalytic activities under realistic (e.g. solvated) environments, and many others.
| This journal is © the Owner Societies 2026 |