Open Access Article
Achinthya Krishna Bheemaguli,ab Penghao Xiao*c and Gopalakrishnan Sai Gautam*b
aDepartment of Metallurgical and Materials Engineering, National Institute of Technology Karnataka, Surathkal 575025, India
bDepartment of Materials Engineering, Indian Institute of Science, Bengaluru 560012, India. E-mail: saigautamg@iisc.ac.in
cDepartment of Physics and Atmospheric Science, Dalhousie University, Halifax B3H 4R2, Nova Scotia, Canada. E-mail: penghao.xiao@dal.ca
First published on 30th March 2026
Fast and accurate prediction of ionic migration barriers (Em) is crucial for designing next-generation battery materials that combine high energy density with facile ion transport. Given the computational costs associated with estimating Em using conventional density functional theory (DFT) based nudged elastic band (NEB) calculations, we benchmark the accuracy in Em and geometry predictions of six foundational machine learned interatomic potentials (MLIPs), which can potentially accelerate predictions of ionic transport. Specifically, we assess the accuracy of MACE-MP-0, MACE-OMAT-medium, Orb-v3, SevenNet, CHGNet, and M3GNet models, coupled with the NEB framework, against DFT-NEB-calculated Em across a diverse set of battery-relevant chemistries and structures. Notably, MACE-MP-0 and Orb-v3 exhibit the lowest mean absolute errors in Em predictions across the entire dataset and over data points that are not outliers, respectively. Importantly, Orb-v3, MACE-OMAT-medium, and SevenNet classify ‘good’ versus ‘bad’ ionic conductors with an accuracy of >82%, based on a threshold Em of 500 meV, indicating their utility in high-throughput screening approaches. Additionally, intermediate images generated by MACE-MP-0 and SevenNet provide better initial guesses relative to conventional interpolation techniques in >71% of structures, offering a practical route to accelerate subsequent DFT-NEB relaxations. Finally, we observe that accurate Em predictions by MLIPs are not correlated with accurate (local) geometry predictions. Our work establishes the use-cases, accuracies, and limitations of foundational MLIPs in estimating Em and should serve as a base for accelerating the discovery of novel ionic conductors for batteries and beyond.
Accordingly, materials with a low Em, in both electrodes and (solid) electrolytes, exhibit higher ionic conductivity and enable faster charge/discharge rates.7 In particular, emerging multivalent battery chemistries, such as Mg- or Ca-based systems that promise higher volumetric energy densities,8 often suffer from poor rate performance.9–14 Some Na-ion cathodes, such as maricite-Na(Mn/Fe)PO4, phosphate alluaudite-NaxMnFe2(PO4)3, and sulfate sodium superionic conductors (NaSICONs), that offer lower costs compared to LIB cathodes also suffer from poor rate performance.15,16 Therefore, understanding and minimizing the Em in candidate materials is crucial for advancing the next generation of high-performance batteries.
Experimental techniques such as quasi-elastic neutron scattering,17 electrochemical impedance spectroscopy,18 nuclear magnetic resonance measurements,19 and galvanostatic intermittent titration techniques20 are commonly employed to study ion dynamics in solids.21 However, these methods often require access to large-scale facilities and can exhibit chemistry- or material-specific constraints/requirements, limiting their accessibility. As a result, computational approaches, particularly density functional theory (DFT22,23)-based nudged elastic band (NEB24) calculations, are commonly used to estimate Em with reasonable precision.25 While ab initio molecular dynamics (AIMD) simulations can also be used to estimate Em, such simulations are computationally expensive, since they require sampling over large length and time scales across different temperatures to provide reasonable Em.26,27 Moreover, AIMD simulations can be unreliable for systems exhibiting high Em (i.e., the ‘true negatives’ among materials that can conduct ions) due to insufficient sampling of ion dynamics, resulting in DFT-NEB being the usual technique deployed for Em predictions.
Calculating Em using DFT-NEB requires an initial guess for the minimum energy path (MEP), which is typically constructed by linearly interpolating the coordinates of the initial and final configurations of the moving ion. Each ‘image’ generated by linear interpolation is subsequently connected via an auxiliary spring force. Note that the initial interpolated guess is often far from the true MEP, increasing the computational expense of DFT-NEB calculations and making them prone to convergence difficulties.25 Alternative approaches, such as ‘ApproxNEB’,28 have been proposed to reduce computational intensity, albeit with limited efficacy.
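For illustration, the linear interpolation used to seed the NEB path can be sketched in a few lines. This is a minimal, stand-alone version: the function name is ours, and periodic boundary conditions and the minimum-image convention applied by production codes are deliberately ignored.

```python
def interpolate_images(initial, final, n_images):
    """Linearly interpolate atomic coordinates between two endpoint
    configurations to build an initial guess for the NEB path.

    initial, final: lists of (x, y, z) tuples, one per atom.
    n_images: number of intermediate images to generate.
    Returns the full band: [initial, image_1, ..., image_n, final].
    """
    assert len(initial) == len(final), "endpoints must have the same atom count"
    band = [initial]
    for i in range(1, n_images + 1):
        t = i / (n_images + 1)  # fractional position along the path
        image = [
            tuple(a + t * (b - a) for a, b in zip(p0, p1))
            for p0, p1 in zip(initial, final)
        ]
        band.append(image)
    band.append(final)
    return band
```

Each intermediate image is then connected to its neighbors via spring forces during the NEB optimization proper.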
Recently, foundational machine learned interatomic potentials (MLIPs29–31), also referred to as universal potentials, have emerged as a new paradigm in computational materials science. The foundational MLIPs, pre-trained on large and diverse datasets, can generalize to a wide range of downstream tasks31 and are transferable across different materials and property prediction tasks,32–35 unlike classical MLIPs or force-fields that are constrained to a specific chemistry or property. Thus, foundational MLIPs are attractive candidates for accelerating atomistic simulations, including NEB calculations, by potentially improving initial MEP guesses and reducing the need for extensive DFT-based refinement or optimization, which can enable high-throughput screening of materials based on their Em. Indeed, a recent work by Kang et al.36 proposed an alternative to traditional DFT-based NEB calculations for Em estimations by using an MLIP to generate the potential energy surface on a spatial grid and extracting the MEP without the need for pre-defined NEB images.
Several studies have benchmarked the performance of foundational MLIPs on diverse material properties,37–40 but not on predicting Em in solids. For instance, the ‘Matbench discovery’ platform41 provides a standardized framework for ranking universal potentials, but does not yet evaluate their integration with NEB workflows for Em predictions. Other MLIP benchmarking studies include the work by Zhao et al.42 that evaluated MLIPs on transition-state searches for chemical reactions involving molecules. Bihani et al.43 benchmarked the generalizability of equivariant MLIPs to higher-temperature simulations and unseen compositions, while Mannan et al. evaluated the performance of universal potentials against experimental measurements of elastic properties and structural accuracy among minerals.39 So far, there has been no benchmark of the performance of state-of-the-art universal potentials in predicting Em across a wide range of (battery) chemistries and materials, especially by integrating them with NEB workflows.
Here, we assess the performance of foundational MLIPs, namely, MACE-MP-0,44,45 MACE-OMAT-medium,45 SevenNet,46,47 Orb-v3,48,49 CHGNet,50 and M3GNet51 in predicting Em with NEB calculations. Using the dataset of DFT-calculated Em compiled and curated by Devi et al.52,53 that spans a wide range of materials and compositions, we benchmark the Em predictions of the foundational MLIPs against conventional DFT-NEB values at the generalized gradient approximation (GGA54) level of exchange–correlation accuracy for 574 migration paths. Additionally, we introduce a metric to assess the similarity of MLIP-NEB relaxed structures with the ground truth of DFT-NEB computed MEPs from our previous works.25,55–57 Finally, we examine the correlation between accuracies in Em and geometry predictions by the MLIPs considered.
Notably, we find that M3GNet and CHGNet (invariant models) tend to underestimate Em and exhibit a high degree of confidence in predicting low Em over a narrow range of possible Em values, while the other potentials (equivariant models) exhibit no clear bias and deliver consistent accuracy over a wide range of Em values. Importantly, we observe that Orb-v3, MACE-OMAT-medium, and SevenNet classify systems as ‘good’ (Em < 500 meV) or ‘bad’ ionic conductors with >82% accuracy. Performing an MLIP-NEB, using any of the potentials considered, results in improved interpolated paths representing the MEP in over 66% of cases, indicating their utility in high-throughput screening workflows. Significantly, we find no evident correlation between the accuracy of Em and geometry predictions, with MLIPs yielding higher accuracy in Em predictions for systems with low Em values, while demonstrating better geometry predictions in systems with large Em. We hope that our study establishes use-cases and quantifies the reliability of using foundational MLIPs in predicting Em over a diverse set of chemistries and crystal structures, which in turn should accelerate materials discovery for novel battery applications and beyond.
Fig. 1 Overview of the methodology, indicating the use of two subsets of the Em dataset that were created for examining geometry predictions, geometry–barrier correlations, and barrier predictions.
The larger dataset, referred to as ‘Dataset-2’, is a subset of a literature-derived collection of Em,52 which comprises 621 DFT-calculated Em and the initial and final configurations for each migration pathway. Among the 621 datapoints, we excluded systems exhibiting Em > 2.5 eV, since such high Em values would not correspond to any tangible rate performance under battery operating conditions. We also excluded systems that presented significant convergence difficulties during NEB calculations using any of the foundational MLIPs considered (∼10 datapoints), so that a fair and quantitative comparison can be made across the MLIPs. Thus, the final subset that forms our Dataset-2 consists of 574 systems. The systems comprising both datasets are compiled in our https://github.com/sai-mat-group/mlips-migration-barriers repository, while Dataset-2 is also available as a JSON file on Zenodo.
| Model | Training data | Model type and key features |
|---|---|---|
| MACE-MP-0 | MPtrj dataset | E(3)-equivariant GNN that captures many-body interactions |
| MACE-OMAT-medium | OMat24 | Same E(3)-equivariant MACE architecture as MACE-MP-0, trained on the OMat24 dataset |
| SevenNet-MF-ompa | MPtrj, OMat24, and sAlex | Equivariant GNN incorporating multifidelity learning with efficient parallelization |
| Orb-v3 | MPtrj, OMat24, and Alex | Roto-equivariance inducing regularized GNN with analytical energy gradients (conservative forces) and (effectively) infinite neighbors |
| CHGNet | MPtrj dataset | Invariant GNN including magnetic moment inputs, thus incorporating information on atomic charges |
| M3GNet | MPtrj dataset | Includes three-body interactions within its GNN (invariant) |
For NEB calculations of materials in Dataset-1 using all universal potentials, we generated seven intermediate images, mirroring the number used in the corresponding DFT-NEB calculations. The initial interpolated images were connected by springs with a spring constant of k = 5 eV Å−2, and we utilized the NEB implementation following the elastic band (EB65) method with full spring force, given our benchmarking with MACE-MP-0 (see Section S1). We did not include the climbing-image technique24 in any of our MLIP-NEBs, as we did not see significantly different results with or without climbing image in our previous work.25 We deemed an NEB converged when the band forces fell below 0.05 eV Å−1, while using the Broyden–Fletcher–Goldfarb–Shanno optimizer.66–69 In the case of Dataset-2, we employed only three intermediate images for all foundational MLIPs considered, to reduce computational costs. Note that employing seven intermediate images with MACE-MP-0 NEB calculations on a random subset of 100 structures did not significantly change the Em predictions (average deviations of ∼75 meV, which is similar to typical DFT-NEB Em errors), indicating that the data and trends reported in our work based on three intermediate images should be robust. Also, we used the set of optimized NEB parameters from our Dataset-1 calculations (i.e., k = 5 eV Å−2, IDPP interpolation, and the EB method) for all calculations involving Dataset-2.
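As a toy illustration of the EB method's full spring force described above, the sketch below applies F_spring,i = k(R_{i+1} − 2R_i + R_{i−1}) to each interior image of a band; in actual MLIP-NEB runs the true (physical) forces come from the potential and the total force is fed to an optimizer. The function name and the flattened-coordinate interface are ours, not the implementation used in this work.

```python
def band_forces(positions, true_forces, k=5.0):
    """Plain elastic-band forces: each interior image feels its true
    (physical) force plus the full, unprojected spring force from its
    neighbors, F_spring_i = k * (R_{i+1} - 2*R_i + R_{i-1}).

    positions, true_forces: lists (one entry per image) of flattened
    coordinate lists. Endpoint images are held fixed, so only forces
    on the interior images are returned.
    """
    forces = []
    for i in range(1, len(positions) - 1):
        spring = [
            k * (r_next - 2.0 * r + r_prev)
            for r_prev, r, r_next in zip(positions[i - 1], positions[i], positions[i + 1])
        ]
        total = [f + s for f, s in zip(true_forces[i], spring)]
        forces.append(total)
    return forces
```

With evenly spaced images and zero true forces, the spring contribution vanishes, which is why a converged, evenly distributed band is stationary under this force.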
dDFTxy, dMLIPxy, dLIxy

ΔdMLIP = |dDFTxy − dMLIPxy|

ΔdLI = |dDFTxy − dLIxy|

for all {x, y} ⊂ {i, j, k, l, m, n, c}, where x ≠ y.
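Since the seven tagged atoms {i, j, k, l, m, n, c} form C(7, 2) = 21 pairs, each Δd vector is 21-dimensional. A minimal sketch of this computation, with hypothetical atom labels and a dict-based interface of our own choosing:

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+

def pairwise_deviation(ref_coords, test_coords):
    """Absolute differences between all pairwise interatomic distances of a
    reference local environment (e.g. DFT-NEB relaxed) and a test structure
    (e.g. MLIP-NEB relaxed or linearly interpolated):
        delta_d_xy = |d_ref_xy - d_test_xy|  for every atom pair (x, y).

    ref_coords, test_coords: dicts mapping atom labels to (x, y, z) tuples.
    Returns a dict keyed by the sorted label pairs.
    """
    labels = sorted(ref_coords)
    return {
        (x, y): abs(dist(ref_coords[x], ref_coords[y])
                    - dist(test_coords[x], test_coords[y]))
        for x, y in combinations(labels, 2)
    }
```

Calling this once with the MLIP-NEB structure and once with the LI structure (both against the DFT reference) yields the ΔdMLIP and ΔdLI vectors.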
ΩDFTx, ΩMLIPx, ΩLIx

In the above notation, two Ω, say ΩDFT and ΩMLIP, having the same x indicates that the polyhedral faces correspond to the same set of neighboring atoms. The absolute differences with the DFT-NEB relaxed structures are then calculated and stored as six-dimensional vectors:

ΔΩMLIP = |ΩDFTx − ΩMLIPx|

ΔΩLI = |ΩDFTx − ΩLIx|

for all x ∈ {a, b, c, d, e, f}.
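One standard way to evaluate the solid angle subtended by a (triangulated) polyhedral face at the migrating ion is the Van Oosterom–Strackee formula; the sketch below is an illustrative pure-Python version, and the exact face triangulation used for the Ω values in this work is not reproduced here.

```python
from math import atan2, pi, sqrt

def solid_angle(apex, v1, v2, v3):
    """Solid angle subtended at `apex` by the triangle (v1, v2, v3),
    via the Van Oosterom-Strackee formula:
        tan(Omega/2) = r1.(r2 x r3) /
                       (|r1||r2||r3| + (r1.r2)|r3| + (r1.r3)|r2| + (r2.r3)|r1|)
    where r_i are the triangle vertices relative to the apex.
    """
    r = [tuple(a - b for a, b in zip(v, apex)) for v in (v1, v2, v3)]
    n = [sqrt(sum(c * c for c in ri)) for ri in r]

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def cross(a, b):
        return (a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0])

    numer = dot(r[0], cross(r[1], r[2]))
    denom = (n[0] * n[1] * n[2] + dot(r[0], r[1]) * n[2]
             + dot(r[0], r[2]) * n[1] + dot(r[1], r[2]) * n[0])
    return abs(2.0 * atan2(numer, denom))
```

As a sanity check, a triangle spanning the three Cartesian unit vectors subtends one octant of the sphere, i.e., 4π/8 = π/2 steradians, at the origin.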
We expect the local geometry of an MLIP-NEB relaxed structure to be a poorer approximation of the DFT-NEB relaxed structure than the corresponding LI structure, if at least one of the following conditions is met: (i) one of the 21 pairwise distances or 6 solid angles of the MLIP-NEB relaxed structure deviates significantly more from the DFT-NEB geometry than the corresponding LI structure, or (ii) the average difference in pairwise distances or solid angles of the MLIP-NEB relaxed structure with the DFT-NEB reference is significantly higher compared to LI. To quantify these two conditions, we calculate δ, which represents the maximum value among the differences in the mean and maximum errors of distances and angles between the MLIP-NEB and LI structures:

δ = max[mean(ΔdMLIP) − mean(ΔdLI), max(ΔdMLIP) − max(ΔdLI), mean(ΔΩMLIP) − mean(ΔΩLI), max(ΔΩMLIP) − max(ΔΩLI)]

where mean(Δd) and mean(ΔΩ) represent the mean of the absolute errors in distances and solid angles, respectively, while max(Δd) and max(ΔΩ) represent the corresponding maximum absolute errors.
Finally, the metric θ classifies the structure as:
![]() | (1) |
Thus, δ quantifies the difference between the deviations of the MLIP-NEB and LI structures from the DFT-NEB reference, based on key local geometric features. A smaller (ideally negative) δ signifies that the MLIP-NEB structure exhibits consistently lower errors, indicating it is a better approximation of the true DFT-NEB pathway. Conversely, a larger (more positive) δ suggests that LI performed as well as or better than the MLIP-NEB for at least one of the local geometric attributes. Therefore, we numerically represent a ‘good’, ‘comparable’, and ‘bad’ structure as 1, 0, and −1 with θ. Finally, for a given system containing i intermediate images, we define g as the average of θ over all i images.
In the case where all i image local geometries are better (worse) predicted by MLIP-NEB compared to LI, g takes the value of 1 (−1).
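The δ, θ, and g metrics defined above can be sketched compactly as follows. Note that the tolerance separating the ‘good’/‘comparable’/‘bad’ classes is a placeholder of ours, since the exact thresholds are fixed by eqn (1); the function names are likewise illustrative.

```python
def delta(d_mlip, d_li, ang_mlip, ang_li):
    """delta = max over {mean, max} x {distances, solid angles} of
    (MLIP-NEB error statistic - LI error statistic), both measured
    against the DFT-NEB reference. Negative means the MLIP-NEB image
    is uniformly the better approximation."""
    mean = lambda v: sum(v) / len(v)
    return max(mean(d_mlip) - mean(d_li), max(d_mlip) - max(d_li),
               mean(ang_mlip) - mean(ang_li), max(ang_mlip) - max(ang_li))

def theta(delta_value, tol=0.05):
    """Classify an image as 1 ('good'), 0 ('comparable'), or -1 ('bad').
    `tol` is an illustrative placeholder for the thresholds of eqn (1)."""
    if delta_value < -tol:
        return 1
    if delta_value > tol:
        return -1
    return 0

def g(thetas):
    """System-level score: average of theta over the i intermediate
    images; +1 (-1) when every image is better (worse) than LI."""
    return sum(thetas) / len(thetas)
```

For example, a system whose every image is classified ‘good’ gives g = 1, matching the limiting cases stated in the text.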
To obtain a more representative picture of MLIP performance, we exclude 17 systems that act as common outliers across all MLIPs, with each outlier exhibiting absolute errors exceeding 1 eV. Notably, excluding the common outliers reveals a performance hierarchy similar to that of the entire dataset: MACE-MP-0 emerges with the best MAE of 0.239 eV, followed closely by Orb-v3 with 0.245 eV. The remaining MLIPs, namely SevenNet, CHGNet, and M3GNet, show MAEs of 0.251, 0.275, and 0.290 eV, respectively. Specific details about the outliers of the respective MLIPs can be found in Tables S3–S7 of the SI, while the distribution of outliers across crystal classes is compiled in Fig. S7.
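The outlier-filtered MAEs above follow from a simple cut on the absolute errors; a minimal sketch (the function name and list-based interface are ours):

```python
def mae(pred, ref, outlier_cut=None):
    """Mean absolute error in Em (eV), optionally excluding outliers
    whose absolute error exceeds outlier_cut (1 eV in this work)."""
    errs = [abs(p - r) for p, r in zip(pred, ref)]
    if outlier_cut is not None:
        errs = [e for e in errs if e <= outlier_cut]
    return sum(errs) / len(errs)
```

Passing `outlier_cut=1.0` reproduces the per-model outlier exclusion used for the ‘best-case’ analysis below.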
Besides accuracy, we analyze the distribution of datapoints relative to the ideal parity line to determine whether the MLIPs exhibit systematic prediction biases (i.e., under- or over-estimation of Em). Interestingly, we observe MACE-MP-0, SevenNet, and Orb-v3 to demonstrate a relatively balanced prediction behavior, with fairly symmetric distributions of under- and over-estimated datapoints. Represented as (number of under-estimated datapoints, number of over-estimated datapoints) pairs, MACE-MP-0, SevenNet, and Orb-v3 exhibit distributions of (299, 275), (244, 330), and (242, 332), respectively. In contrast, CHGNet and M3GNet show a bias toward under-estimating barriers, with under-estimated datapoints accounting for 73.1% and 78.2% of all predictions, respectively, corresponding to distributions of (420, 154) and (449, 125).
To further understand individual MLIP capabilities, we examined each potential's performance after excluding the outliers specific to each potential (i.e., systems with absolute errors >1 eV as predicted by a given potential) to gain insight into the ‘best-case’ scenario of Em predictions. Notably, despite having 37 outliers, Orb-v3 achieves the lowest MAE of 0.198 eV on its remaining (non-outlier) predictions. With 35 outliers, MACE-MP-0 is a close second with an MAE of 0.202 eV, while SevenNet, with 37 outliers, displays an MAE of 0.203 eV. CHGNet and M3GNet show higher MAEs of 0.248 eV and 0.257 eV, with 31 and 36 outliers, respectively. Also, varying the training data seems to have a marginal impact on model performance, with MACE-OMAT-medium exhibiting an MAE of 0.35 eV on the entire dataset and of 0.20 eV excluding its specific outliers. Thus, we find that Orb-v3 achieves the highest accuracy on systems that it describes well, while MACE-MP-0 achieves a better balance of low errors and few outliers compared to the other MLIPs.
Trends in Fig. 3 indicate that all MLIPs struggle with high Em predictions, with only a small percentage of systems exhibiting acceptable accuracy in the highest barrier range (∼1.31–2.50 eV). Specifically, the percentages of predictions with acceptable accuracy in the highest Em range are 20.7% for Orb-v3, 18.3% for MACE-MP-0, 14.6% for SevenNet, 6.1% for M3GNet, and 3.7% for CHGNet. To examine whether the strategy used for binning influenced the trends we observe in Fig. 3, we performed an identical exercise while keeping the width of each bin a constant and compiled the results in Fig. S10 (bin width of 0.5 eV) and S11 (0.25 eV) of the SI. Importantly, we find no qualitative change in the performance of the models, with the accuracy in Em predictions declining with increasing Em values for all MLIPs considered.
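The equal-width binning check described above amounts to grouping systems by their DFT Em and computing the fraction of predictions within the acceptable-error tolerance in each bin; a minimal sketch with an illustrative interface (the 0.1 eV tolerance is our placeholder for the acceptable-accuracy criterion):

```python
from collections import defaultdict

def bin_success_rates(ref_em, abs_err, bin_width=0.5, err_tol=0.1):
    """Fraction of predictions with absolute error <= err_tol (eV),
    grouped into equal-width bins of the reference (DFT) Em.
    bin_width = 0.5 or 0.25 eV mirrors the SI binning exercise."""
    hits, counts = defaultdict(int), defaultdict(int)
    for em, err in zip(ref_em, abs_err):
        b = int(em // bin_width)  # bin index: [b*width, (b+1)*width)
        counts[b] += 1
        hits[b] += err <= err_tol
    return {b: hits[b] / counts[b] for b in sorted(counts)}
```

A declining success rate with increasing bin index reproduces, in miniature, the trend reported for all MLIPs.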
Importantly, we identify a “sweet spot” of Em values where all MLIPs perform reasonably well. For example, in the low-barrier range (∼0.0025–0.25 eV), more than 50% of predictions achieve acceptable accuracy across all MLIPs, with CHGNet showing the highest success rate (59.8%), followed by M3GNet and SevenNet (both 58.5%), while MACE-MP-0 and Orb-v3 achieve 53.7% and 57.3%, respectively. This lowest Em range also marks the best performance of CHGNet and M3GNet, whereas Orb-v3 and SevenNet achieve their best performance (i.e., highest fraction of predicted datapoints with acceptable accuracy) in the 0.25–0.36 eV range, with 62.2% and 61% acceptable predictions, respectively, and MACE-MP-0 performs best in the slightly higher 0.36–0.50 eV range with 57.8% accuracy.
While all MLIPs show declining accuracy with increasing Em, Orb-v3 exhibits the slowest degradation, maintaining better performance across a broader range of Em values compared to other potentials. Thus, we find that ‘simpler’ graph models such as CHGNet and M3GNet demonstrate superior performance for materials with intrinsically low Em values but lack consistency in their predictions over a wider range of Em. On the other hand, increasing complexity among the graph models, such as in Orb-v3 or MACE-MP-0 allows for a more robust performance across a wide range of Em values while sacrificing ‘peak’ performance for materials with low Em, making them better suited for Em predictions in novel materials. This variation in the performance of ‘simple’ and ‘complex’ MLIPs also reveals the general trade-off between building specialized and generalized models in the field of machine learning.
From Fig. 4, we observe that Orb-v3 achieves the highest combined number of TP and TN, correctly classifying 487 out of 574 systems (i.e., an accuracy of 84.84%). In comparison, M3GNet yields the lowest TP + TN count of 422 systems (73.52%). MACE-OMAT-medium, SevenNet, MACE-MP-0, and CHGNet correctly classify 477 (83.1%), 476 (82.93%), 456 (79.44%), and 424 (73.87%) systems, respectively. These results highlight Orb-v3 as the most reliable model for distinguishing good and poor ionic conductors, followed closely by MACE-OMAT-medium and SevenNet (accuracies >82%), making these models reliable for high-throughput classification tasks.
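The confusion-matrix counts underlying these accuracies follow from applying the 500 meV threshold to both the predicted and DFT barriers; a minimal sketch, treating ‘good conductor’ (Em < threshold) as the positive class (the function name and interface are illustrative):

```python
def classify_conductors(pred_em, dft_em, threshold=0.5):
    """Confusion-matrix counts for labeling systems as 'good'
    (Em < threshold, in eV) or 'bad' ionic conductors, with the DFT
    label as ground truth. Returns (TP, TN, FP, FN, accuracy)."""
    tp = tn = fp = fn = 0
    for p, r in zip(pred_em, dft_em):
        pred_good, ref_good = p < threshold, r < threshold
        if pred_good and ref_good:
            tp += 1
        elif not pred_good and not ref_good:
            tn += 1
        elif pred_good:
            fp += 1
        else:
            fn += 1
    accuracy = (tp + tn) / len(dft_em)
    return tp, tn, fp, fn, accuracy
```

The reported accuracies correspond to (TP + TN)/574 for each MLIP.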
Among all the MLIPs, SevenNet exhibits the highest fraction of good geometries (0.719), indicating that it frequently generates accurate local geometries. On the other hand, MACE-MP-0 exhibits the lowest fraction of bad geometries (0.190), indicating that it frequently avoids generating inaccurate structures. The differences between the fractions of good and bad geometry predictions for MACE-MP-0 and SevenNet are similar (0.527 and 0.526, respectively), indicating that both models perform equally well in generating good local geometries.
Other MLIPs show poorer geometry predictions, with Orb-v3, M3GNet, and CHGNet displaying good (bad) fractions of 0.683 (0.236), 0.674 (0.219), and 0.660 (0.257), respectively, with CHGNet showing the smallest difference between the good and bad geometry fractions (0.403). Thus, MACE-MP-0 and SevenNet show significantly better local geometry predictions upon relaxation with NEB compared to Orb-v3, M3GNet and CHGNet, while all MLIPs provide better initial guesses to the MEP than LI in at least 66% of structures (i.e., intermediate images). Also, we note that IDPP generated structures are statistically much farther from DFT than MLIPs, with LI being better than IDPP in 43% of the cases. Given our definition of θ and the specific systems present in Dataset-1, we find that IDPP does not make a significant difference in enhancing the initial guess for the MEP as compared to LI across all MLIPs.
Overall, Fig. 6 reveals the absence of any positive correlation between barrier and geometry prediction performance, and more strikingly, an inverse relationship. For example, all models perform poorly in predicting high Em (bin 5), which is consistent with our observations in Fig. 3. However, all models also achieve their best geometry predictions for bin 5. In other words, the best geometry predictions are coincident with the worst Em predictions. The geometry prediction success rates within bin-5 are 66.7%, 75.0%, 66.7%, 66.7%, and 58.3% for MACE-MP-0, SevenNet, Orb-v3, CHGNet, and M3GNet, respectively, while the corresponding Em prediction success rates (i.e., ΔE ≤ 0.1 eV) are 16.7%, 0%, 16.7%, 8.3%, and 0%, respectively.
To further assess the geometry–barrier correlation, we examine instances where MLIPs perform well in both metrics, terming a model to exhibit a ‘good performance in both metrics’ if both fractions in a given bin are ≥0.5. Only two potentials show this good performance, and only in a single bin (bin-1): MACE-MP-0, with a success rate of 58.3% in Em prediction and 50.0% in geometry prediction, and M3GNet, with a 50.0% success rate for both metrics. SevenNet, Orb-v3, and CHGNet do not achieve this good performance in any bin. Moreover, beyond these isolated bin-1 cases, we find no consistent pattern across bins and MLIPs where good Em predictions coincide with good geometry predictions. Instead, the data suggest that these two performance metrics are largely independent, and that a good Em prediction does not necessarily arise from a good local geometry prediction (and vice versa).
Analyzing Em predictions across the entire Dataset-2, we find that MACE-MP-0 exhibits the lowest MAE (Fig. 2), followed in order by Orb-v3, CHGNet, SevenNet, and M3GNet. On excluding outliers that are common to all models, we observe SevenNet to exhibit a slightly lower MAE than CHGNet, with the rest of the performance order being the same. Interestingly, when assessing each model independently after removing their respective outliers, Orb-v3 demonstrates the best MAE of 0.198 eV, marginally outperforming MACE-MP-0 (0.202 eV), with the other models exhibiting larger errors (0.203–0.257 eV). Thus, Orb-v3 provides the best prediction errors for Em, among the MLIPs considered, in systems with a robust description of the corresponding potential energy surface.
Based on the distribution of outliers across different crystal systems (Fig. S7), we observe that while certain outliers are model-specific, systems containing orthosilicates and phosphates consistently pose challenges for all MLIPs, which may be attributed to the inherently complex potential energy surface of these polyanionic frameworks. Specifically, the intricate ionic migration paths within these structures may be difficult for MLIPs to capture accurately, likely due to spurious ‘smoothing’ of the potential energy surface during model training.
While minor inconsistencies in Hubbard U73,74 values between the datapoints in Dataset-2 and the calculation scheme of the Materials Project do exist, we expect such inconsistencies to be an unlikely primary source of error among the identified outliers for each MLIP considered. Indeed, GGA-calculated Em values are the predominant contributor to Dataset-2, accounting for 88.05% of the datapoints, while calculations including a Hubbard U correction contribute only 7.27% of the datapoints. GGA-calculated Em values dominate the literature so far, since U corrections are frequently omitted in NEB calculations due to significant convergence difficulties and electronic metastability along the migration pathway, as noted by Liu et al.75 Furthermore, benchmarks by Devi et al.25 indicate that for GGA + U calculations, even a substantial change in the U parameter (≈1 eV) typically results in an Em variation of only 15 meV, a value well within the acceptable error margin for DFT-NEB calculations.
Notably, during endpoint relaxations for Orb-v3, 153 systems failed to converge within the threshold forces over 1000 optimization steps, despite attempting multiple optimization algorithms, namely limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS)76 and fast inertial relaxation engine (FIRE).77 While the unconverged structures did not satisfy our rigorous force-convergence threshold of 0.05 eV Å−1, the residual forces were only marginally above it (typically ∼0.08 eV Å−1). This suggests that the convergence issues may stem from inherent features of the learned Orb-v3 potential energy surface, such as noise or shallow local minima, rather than a failure of the optimization algorithm itself. However, to maintain consistency with the other models in the study, we did not modify the obtained results and included the Orb-v3 results as is.
We observe that M3GNet and CHGNet exhibit a systematic bias toward underestimating Em, whereas MACE-MP-0, SevenNet, and Orb-v3 do not display such a tendency. A more granular analysis (Fig. 3) reveals that all models struggle with accurately predicting high Em values. Among them, Orb-v3 shows a relatively slow decay in prediction accuracy as the Em value increases. Interestingly, the simpler (invariant) models CHGNet and M3GNet outperform their more complex (equivariant) counterparts within a very narrow range of low Em but exhibit a rapid decline in performance as the range expands.
The systematic biases observed in Em predictions likely arise from the interplay between the training data distribution and architectural inductive biases, such as the level of feature equivariance. Architectural differences, including the handling of many-body interactions and local environment cutoffs, further influence how a model learns the potential energy surface. In terms of the influence of training data on model performance, we observe MACE-OMAT-medium to exhibit quite similar quantitative performance (in terms of MAEs) and a marginally better classification accuracy compared to MACE-MP-0. Thus, we conclude that architectural choices influence the performance of a model more significantly than the choice and size of the training data itself, at least for Em predictions.
Using a threshold Em of 500 meV to categorize structures as ‘good’ or ‘bad’ conductors of ions (Fig. 4), we find that all MLIPs are able to identify good conductors with reasonable accuracy (>73%). Orb-v3 and SevenNet display the highest accuracies in classifying good (or bad) conductors, with ∼85% and ∼83% accuracy, respectively, making them highly suitable for high-throughput screening of candidate battery materials.
Our study on Dataset-1 indicates that MLIP-NEB relaxations tend to produce image geometries that are at least as close to the DFT-NEB structures as those obtained through simple LI or IDPP interpolation in the majority (∼66%, Fig. 5) of cases. Among the considered models, MACE-MP-0 and SevenNet stand out in geometry predictions, relaxing to geometries worse than the LI or IDPP ones in only 19% of migration paths, suggesting that employing MACE-MP-0 or SevenNet NEB-relaxed images as initial guesses for DFT-NEB calculations could significantly accelerate convergence and reduce computational costs.
Although our metric, θ (see eqn (1)), captures critical local geometric features, it can be refined further to more decisively quantify local structural similarity. As a proof of concept, we performed DFT-NEB calculations using initial path guesses derived from MACE-MP-0-based NEB for a subset of structures exhibiting high geometric similarity. With all other DFT parameters held constant, we observed a reduction in the number of ionic and electronic steps required to achieve convergence in 5 out of 6 cases compared to LI initialization, as documented in Table S2. This reduction in computational cost provides empirical evidence that MLIP-based path initialization has the potential to accelerate subsequent DFT-NEB calculations.
Finally, when simultaneously evaluating the likelihood of accurate barrier prediction and better geometry initialization (Fig. 6), we observe no evident correlation between the two among all MLIPs considered. Thus, we find that accurate barrier predictions do not necessarily imply better geometry predictions, and vice versa. One possible explanation for this counterintuitive trend is that, for systems with low Em, the potential energy surfaces are likely ‘flat’ with respect to variations in local geometries, meaning that even large errors in local bond distances or angles made by the MLIPs do not significantly change the predicted Em, thus leading to accurate Em even with inaccurate geometries. On the other hand, for systems with large Em, the potential energy surfaces should exhibit ‘deep’ minima associated with the ‘stable’ sites occupied by the migrating ion, signifying that even small errors in predicting local bond distances or angles can cause large errors in the predicted Em, thus resulting in inaccurate Em even with mostly accurate geometries.
Supplementary information (SI): optimal nudged elastic band parameters, parity plots, analysis of outliers and binning strategies, performance of the MACE-OMAT-medium model, nudged elastic band calculations post MLIP relaxation, and range of migration barrier values. See DOI: https://doi.org/10.1039/d5dd00534e.
| This journal is © The Royal Society of Chemistry 2026 |