Open Access Article
Michał Sanocki^ab and Julija Zavadlav*^ab

^a Multiscale Modeling of Fluid Materials, Department of Engineering Physics and Computation, TUM School of Engineering and Design, Technical University of Munich, Germany. E-mail: julija.zavadlav@tum.de
^b Atomistic Modelling Center, Munich Data Science Institute, Technical University of Munich, Germany
First published on 13th April 2026
The vastness of chemical space makes generalization a central challenge in the development of machine learning interatomic potentials (MLIPs). While MLIPs could enable large-scale atomistic simulations with near-quantum accuracy, their usefulness is often limited by poor transferability to out-of-distribution samples. Here, we systematically evaluate different MLIP architectures with long-range corrections across diverse chemical spaces and show that such schemes are essential, not only for improving in-distribution performance but, more importantly, for enabling significant gains in transferability to unseen regions of chemical space. To enable a more rigorous benchmarking, we introduce biased train–test splitting strategies, which explicitly test the model performance in significantly different regions of chemical space. Together, our findings highlight the importance of long-range modeling for achieving generalizable MLIPs and provide a framework for diagnosing systematic failures across chemical space. While this study focuses on metal–organic frameworks and related systems, the proposed methodology is not limited to this class of materials and may inform the design of more robust and transferable MLIPs in other systems.
This is especially problematic for the development of machine learning interatomic potentials (MLIPs), which have become increasingly popular in recent years due to their ability to deliver near-DFT accuracy at a fraction of the computational cost,8 enabling simulations of larger systems and longer timescales that would otherwise be unfeasible.9,10 The increasing popularity of MLIPs has also led to a surge in new architectures, with graph neural networks (GNN) emerging as a particularly promising approach.11 Further advances, such as equivariant GNNs (ensuring the preservation of physical symmetries) and message-passing, have significantly improved the viability of these models.12 This has made MLIPs a practical alternative to both classical force fields and quantum methods. However, their usefulness is ultimately constrained by the issue of generalization to out-of-distribution samples.13 This limitation arises from not only the vastness of chemical space but also the conformational diversity of individual molecules. At the same time, there has been a growing interest in developing universal or foundational MLIPs.14–17 This trend is motivated by the desire to cover broader regions of chemical space with a single model, making generalizability even more important.

Achieving a truly generalizable MLIP would require capturing different types of interactions without overestimating any single one. Typically, the total energy of a system is decomposed into E_SR (short-range) and E_LR (long-range) contributions:
E_total = E_SR + E_LR    (1)
Due to computational constraints, MLIPs can only access interactions within a finite effective cut-off radius. Thus, they are likely to overestimate short-range interactions to compensate for the missing or incorrectly modeled long-range contributions, especially in strictly local models such as Allegro,12 which only exchange information within a short cut-off11 (see Fig. 1). This would suggest that such models may overfit to the training data, thereby reducing their generalizability.
The issues with strictly local cutoffs can also be partially offset by message-passing neural networks (MPNNs), which gained significant popularity partially due to their longer effective receptive fields. However, they face challenges related to parallelization and scalability,8 particularly in the context of molecular dynamics simulations of large systems, due to the growth of their receptive fields,12 bottlenecks in per-layer communication,18 and layer-wise synchronization barriers.19 These limitations have motivated the development of strictly local models, which are well suited for incorporating long-range correction schemes due to their lower computational cost.11 Additionally, increasing the number of propagation layers beyond a certain point leads to diminishing returns and eventually feature collapse.20 We hypothesize that separate modeling of short- and long-range energy contributions may improve performance by encouraging more balanced representations of both interaction types. While such separation does not by itself guarantee improved generalization, it may mitigate overfitting arising from compensatory adjustments within the short-range component.
Recently, many different approaches to long-range modeling have been proposed with solutions ranging from charge equilibration schemes21–23 to charge-independent methods such as self-consistent neural networks,24 reciprocal space transformations,25 Gaussian multipliers,26 attention-based architectures,27 and methods based on Ewald summation.28 It is worth noting that, despite their proven viability, many foundational models do not incorporate long-range schemes,15,29 which may explain why they often struggle to predict experimental measurements.30
The challenges with chemical diversity and long-range effect modeling outlined above are particularly pronounced in the context of metal–organic frameworks (MOFs). Their modular architecture allows precise control over porosity, surface area, and chemical functionality, making them highly tunable for a wide range of applications, including gas storage, catalysis, separations, and sensing.31,32 Unlike traditional porous materials such as zeolites, MOFs offer exceptional structural diversity; tens of thousands of variants have already been synthesized,33 and computational design opens the door to virtually limitless hypothetical structures.34,35 This vast chemical and configurational space makes it essentially impossible to identify optimal materials for specific applications through empirical methods alone.36–38 Computational methods are therefore essential to accelerate MOF discovery and to probe their behavior under working conditions, using methods such as classical MD,39 DFT,40 and grand canonical Monte Carlo.41,42
However, classical modeling techniques are limited in their applicability to MOF modeling by trade-offs between accuracy and computational efficiency, which make accurate large-scale MOF simulations unfeasible.38,43 MOFs therefore present an ideal application for MLIPs: classical force fields often lack the accuracy required to capture the complex interactions and are very difficult to parameterize, while DFT is too computationally expensive to model structures with large unit cells at longer time scales.9,10,44 As a result, a wide range of studies have employed MLIPs to investigate the chemical and mechanical properties of MOFs.9,43,45–48 However, two issues limit the usefulness of MLIPs. First, they often struggle to generalize to out-of-distribution samples, which confines their applicability to a narrow region of chemical space.13 Second, long-range effects and electrostatics, which are particularly challenging for MLIPs to model,10 can significantly affect MOF behavior.49,50 Additionally, foundational models also struggle with modeling MOFs, often failing to outperform simple classical force fields,51 suggesting that large training sets might not be enough to achieve truly universal MLIPs. All of these factors make MOFs an ideal target for our study: they are highly relevant to experimental and computational chemists, have been previously investigated using MLIPs, and exhibit a vast and complex chemical space, providing an ideal test case for evaluating model generalizability.
In this work, we test the generalizability of three widely used baseline architectures: DimeNet++,52 MACE,53 and Allegro,12 on diverse chemical spaces defined by subsets of QMOF,34 ODAC25,54 and OMOL25,55 datasets. We perform a direct comparison of different long-range correction schemes; thus, we test two recently introduced frameworks in combination with different baseline models: the Charge Equilibration Layer for Long-range Interactions (CELLI), recently introduced by the authors,23 and Euclidean Fast Attention (EFA).27 We show that such corrections not only allow cheaper models to achieve state-of-the-art accuracy but are also essential for improving generalizability across chemical space, even in MPNNs. Unlike most prior works,56,57 which have mainly focused on conformational generalization, we center our analysis on chemical diversity. Furthermore, we demonstrate that partial charges cannot be inferred without training on reference partial charge labels in the case of challenging datasets such as the ones tested in this work. This is also the case for the Latent Ewald Summation (LES)28 framework, which recently claimed the opposite.28,58–60 In these challenging environments, models trained with CELLI based on reference charges consistently produce physically meaningful results, highlighting that leveraging accurate charge information remains critical for developing truly generalizable long-range MLIPs.
525 unique molecules from OMOL25 (up to 350 atoms), 20,869 MOFs from ODAC25 (up to 616 atoms), and the full QMOF dataset with 20,407 MOFs (up to 500 atoms).
To evaluate how well different MLIPs generalize to out-of-distribution samples, we consider four evaluation strategies. First, we use a previously introduced method in which the model is trained on a subset of structures with 100 or fewer atoms and then tested on a subset of larger molecules,49 since larger molecules are likely to differ significantly from those in the smaller subset. Second, we introduce two additional biased train–test split methods, cluster and maximal separation, to systematically investigate differences in model generalization to unseen regions of chemical space. Lastly, a regular random split was used as a comparison to the biased split methods.
To create our biased split methods, we employ SOAP (Smooth Overlap of Atomic Positions) descriptors,61 which combine radial and angular information into rotationally invariant atomic features. First, we calculate the SOAP descriptor of each atom and average them to create a global similarity measure.62 Unfortunately, this vector is uninterpretable; however, it allows us to define an architecture-independent descriptor space for molecules, enabling the investigation of MLIP performance on out-of-distribution samples. The SOAP average similarity kernel has also been previously utilized by Rosen et al. on the QMOF dataset to identify local trends in feature space and for band gap predictions, achieving good results34 and demonstrating SOAP's ability to produce a meaningful representation of the MOF's atomic environment.
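The averaging step can be sketched in a few lines of NumPy. This is a minimal illustration rather than the DScribe-based pipeline used in this work: it assumes per-atom descriptors have already been computed, averages them into a global vector, and compares structures with a cosine (average) kernel. The toy data and feature dimension are arbitrary.

```python
import numpy as np

def global_descriptor(per_atom_desc: np.ndarray) -> np.ndarray:
    """Average per-atom descriptor vectors (n_atoms, n_features) into one global vector."""
    return per_atom_desc.mean(axis=0)

def average_kernel(d_a: np.ndarray, d_b: np.ndarray) -> float:
    """Cosine similarity between two averaged (global) descriptors."""
    return float(d_a @ d_b / (np.linalg.norm(d_a) * np.linalg.norm(d_b)))

# Toy data standing in for real SOAP output: two structures with
# different atom counts but the same feature dimension.
rng = np.random.default_rng(0)
mol_a = rng.normal(size=(12, 8))  # 12 atoms, 8 descriptor features
mol_b = rng.normal(size=(30, 8))
sim = average_kernel(global_descriptor(mol_a), global_descriptor(mol_b))
```

Because the per-atom vectors are averaged, structures of different sizes map to vectors of the same length, which is what makes a single architecture-independent descriptor space possible.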
The cluster method groups molecules into structurally similar clusters using K-Means clustering on their SOAP descriptors. A subset of clusters is then randomly selected for the training set, while the remainder forms the test set. This enforces a structural distinction between splits.
The maximal separation method computes pairwise cosine similarities between descriptor vectors A and B:

sim(A, B) = (A · B) / (‖A‖ ‖B‖)    (2)
Starting with a random training data sample, the maximal separation method iteratively adds a candidate to the test dataset. The selected candidate Â has minimal similarity to the training data samples B:

Â = argmin_A max_{B ∈ train} sim(A, B)    (3)
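The greedy selection can be sketched as follows. Since the aggregation of per-candidate similarities to the training pool is not fully specified here, this sketch assumes the worst case (maximum similarity to any training sample); the `n_test` parameter and the precomputed full similarity matrix are illustrative choices.

```python
import numpy as np

def maximal_separation_split(D, n_test):
    """Greedily move the sample least similar to the training pool into the test set.

    D: (n_samples, n_features) global descriptors, one row per structure.
    Returns (train_indices, test_indices).
    """
    Dn = D / np.linalg.norm(D, axis=1, keepdims=True)
    S = Dn @ Dn.T  # pairwise cosine similarities, eq. (2)
    train = set(range(len(D)))
    test = []
    for _ in range(n_test):
        cand = sorted(train)
        # Worst-case similarity of each candidate to the rest of the training pool
        scores = [max(S[a, b] for b in train if b != a) for a in cand]
        pick = cand[int(np.argmin(scores))]
        train.remove(pick)
        test.append(pick)
    return sorted(train), test
```

The quadratic cost per step is what motivates the subset-based simplification used for OMOL25 below.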
The proposed biased split methods evaluate distinct aspects of generalizability: the small-large split assesses performance on cells of varying sizes; the maximal separation method measures performance on a subset maximally different from the training set to stress-test the MLIP; and the cluster method examines generalization to distinct structural families by partitioning chemical space into structurally coherent clusters. This addresses a critical limitation of MLIPs: available datasets likely under-sample the vast chemical space, creating biases that may lead to overoptimistic performance evaluation and, as a result, limit their applicability to only a narrow part of chemical space.
To ensure a sufficiently large test set for meaningful generalizability analysis, we allocated 50% of the OMOL25 and QMOF datasets, and 30% of the ODAC25 dataset, to testing. In all cases, the validation set was drawn from the training data, comprising 10% of its datapoints. To account for random effects introduced by the splitting process, we generated each split three times with different seeds for the ODAC25 and QMOF datasets, and only once for OMOL25 due to higher computational cost. Then, to visualize how different split methods divide given chemical space, we applied Uniform Manifold Approximation and Projection (UMAP),63 a nonlinear dimensionality reduction technique that preserves both local and global structure, and has been often used to visualize chemical space.64–66 Fig. 2 shows that the proposed biased splitting methods, and especially the maximal separation method, produce splits that are significantly different (at least in the SOAP descriptor space) and can serve as a challenging benchmark for evaluating the generalizability of MLIPs.
The SOAP descriptors were generated using the DScribe package62 and the parameters provided by Rosen et al.34 The maximal separation method is computationally expensive and scales poorly with dataset size, as it requires computing the similarity between all members of the training set and all remaining molecules. For this reason, in the OMOL25 dataset, a simplified version was used, where similarity was computed only between subsets of 1000 molecules in the training set and the remaining molecules. For the cluster method, the number of clusters was set to 20, and 50% (70% in ODAC25) were used for training.
These differences make them ideal candidates for this study, as they allow for a comparison of the advantages and disadvantages of physics-inspired versus purely AI-driven approaches. In principle, physics-based models are expected to generalize better due to their grounding in physical theory.67 However, imposing such constraints may also limit the expressiveness of the model.68 On the other hand, data-driven methods often offer greater flexibility but are prone to overfitting and may generalize poorly to out-of-distribution samples.13 In addition, when fitted to partial charges, CELLI can only model electrostatic interactions and charge transfer, whereas EFA can represent all long-range effects, including van der Waals interactions. To ensure comparability, the same hyperparameters were used for each architecture across different splits.
Those parameters are then used to redistribute charge based on the following energy minimization problem. The total Qeq energy is defined as
E_Qeq = Σ_i (χ_i q_i + ½ η_i q_i²) + ½ Σ_{i≠j} q_i q_j J(r_ij)    (4)

J(r_ij) = erf(r_ij / (√2 γ_ij)) / r_ij    (5)

γ_ij = √(σ_i² + σ_j²)    (6)

q* = argmin_q E_Qeq(q)    (7)
under the charge conservation constraint.
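Because the Qeq energy is quadratic in the charges, the constrained minimization in eqs. (4)–(7) reduces to a linear (KKT) system with one Lagrange multiplier enforcing charge conservation. The sketch below assumes the standard erf-screened Coulomb kernel with Gaussian charge widths; CELLI's learned parameterization may differ in detail.

```python
import numpy as np
from math import erf

def qeq_charges(chi, eta, pos, sigma, q_tot=0.0):
    """Solve the Qeq energy minimization subject to sum(q) = q_tot.

    chi: electronegativities, eta: hardnesses, pos: (n, 3) positions,
    sigma: Gaussian charge widths (assumed kernel parameterization).
    """
    n = len(chi)
    A = np.diag(np.asarray(eta, dtype=float))  # hardness on the diagonal
    for i in range(n):
        for j in range(i + 1, n):
            r = float(np.linalg.norm(pos[i] - pos[j]))
            gamma = np.sqrt(sigma[i] ** 2 + sigma[j] ** 2)
            A[i, j] = A[j, i] = erf(r / (np.sqrt(2.0) * gamma)) / r
    # Stationarity of E + lambda * (sum(q) - q_tot):
    #   A q + lambda * 1 = -chi,   1^T q = q_tot
    M = np.zeros((n + 1, n + 1))
    M[:n, :n] = A
    M[:n, n] = M[n, :n] = 1.0
    rhs = np.concatenate([-np.asarray(chi, dtype=float), [q_tot]])
    sol = np.linalg.solve(M, rhs)
    return sol[:n]  # partial charges; sol[n] is the Lagrange multiplier
```

A direct solve is exact for this quadratic problem; iterative schemes become preferable only for very large systems.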
Once charges are obtained, they are embedded into the GNN by generating charge-dependent feature vectors via a second MLP, which are combined with existing latent scalar features through a residual update. This allows downstream layers to incorporate non-local information while maintaining the local structure of the network. Although CELLI is model agnostic and can be utilized with multiple frameworks, it is best suited for strictly local models such as Allegro,12 as the benefits of a long-range correction scheme are larger than for message-passing architectures. However, here we also consider a MACE implementation. Regardless of the MLIP architecture used, the long-range electrostatic interactions could be partially accounted for by the long-range contribution and partially by the short-range contribution to the total energy. However, when CELLI is trained on partial charges, electrostatic interactions are assumed to be entirely accounted for by the long-range contribution. In other words, the electrostatic interactions are not double-counted.
Specifically, given a position vector r, a feature vector x, a frequency ω, and a unit vector u ∈ S², ERoPE modulates the feature as

ϕ_u(x, r) = x · e^{iω u·r}    (8)
A_ij ∝ ∫_{S²} ϕ_u(q_i, r_i) ϕ_u(k_j, r_j)* du    (9)
By integrating over all directions on the unit sphere, the resulting attention weights depend only on interatomic distances, ensuring rotational invariance. To support directional reasoning, EFA incorporates a tensor product with spherical harmonics, allowing the mechanism to operate on equivariant inputs. This enables the network to model both long-range interactions and geometric relationships without requiring explicit neighbor cutoffs, making it well-suited for large atomistic systems.
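The core phase trick can be demonstrated in a few lines: for a single direction u, the product of two ERoPE-modulated features depends only on the relative position of the two atoms, so translating the whole system leaves the attention-like product unchanged. This toy sketch uses scalar complex features and one fixed direction; the full EFA integrates over directions and operates on equivariant features.

```python
import numpy as np

def erope(x, r, u, omega):
    """Eq. (8): modulate a (complex) feature by a position-dependent phase."""
    return x * np.exp(1j * omega * (u @ r))

rng = np.random.default_rng(0)
x_i, x_j = rng.normal(size=2) + 1j * rng.normal(size=2)  # toy scalar features
r_i, r_j = rng.normal(size=(2, 3))                       # atomic positions
u = np.array([0.0, 0.0, 1.0])                            # one fixed direction
omega = 2.0

# Attention-like product of two modulated features...
a = erope(x_i, r_i, u, omega) * np.conj(erope(x_j, r_j, u, omega))
# ...is unchanged when both atoms are translated by the same vector,
# because the phases depend only on r_i - r_j.
shift = np.array([5.0, -3.0, 2.0])
a_shifted = erope(x_i, r_i + shift, u, omega) * np.conj(erope(x_j, r_j + shift, u, omega))
```

Averaging such products over many directions u removes the remaining orientation dependence, which is how the scheme arrives at rotationally invariant attention weights.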
Originally, the EFA scheme was proposed for a generic MPNN architecture consisting only of node features and a message-passing step, which strongly limits its applicability. Thus, there is a need to adapt this scheme to popular frameworks. Unlike in the network utilized by Frank et al.,27 there are several ways in which EFA can be integrated into MACE or Allegro. However, a detailed analysis of the optimal EFA implementation is outside the scope of this study, and in this section, we provide a basic description of our integration.
The EFA block takes as input the current node embeddings and atomic positions. To integrate it with MACE, the output is then fused with the node features before being passed into subsequent MACE interaction layers. By inserting EFA at configurable points in the message passing stack, the architecture captures long-range geometric dependencies prior to or between equivariant tensor-product updates. This design enables the network to blend fast, global spatial awareness with deep local equivariant reasoning, allowing for flexible integration of EFA alongside or in place of conventional MACE message passing without disrupting equivariance or scalability.
In the Allegro integration, EFA features are combined with the original species embeddings and linearly projected into a refined feature space just before the Allegro tensor product layer (the initial invariant scalar latent features in the first MLP only utilize the original species embeddings). This insertion point ensures that EFA-enhanced node-level information flows into the pairwise feature prediction pipeline of Allegro. EFA was initially implemented for MPNNs; however, message passing is not strictly required, and EFA can be combined with the strictly local Allegro architecture.
S(k) = Σ_i q_i e^{i k · r_i}    (10)

E_LR = (1 / 2Vε₀) Σ_{0<|k|<k_c} (e^{−σ²k²/2} / k²) |S(k)|²    (11)
For isolated systems, LES uses a screened direct Coulomb sum:
E_LR = (1 / 4πε₀) · ½ Σ_{i≠j} q_i q_j erf(r_ij / √2σ) / r_ij    (12)
LES also enables computation of Born effective charge tensors, which can be used for simulations of systems under an electric field59 (for details, see SI Section 9). After computing the long-range energy, LES includes electrostatic information in the model only through its total energy contribution. No additional solvers, constraints, or charge-conservation steps are utilized. In this work, we utilized the MACE53,71 integration of LES due to the high computational cost of CACE (for details on the reproducibility of previous results, see SI Section 8).
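The reciprocal-space sum of eqs. (10) and (11) can be sketched with a brute-force loop over k-vectors. This is schematic: the prefactors and unit conventions are illustrative, the k-vector range is hardcoded for the test cell, and the actual LES implementation predicts the latent charges q with a neural network rather than taking them as input.

```python
import numpy as np

def les_longrange_energy(q, pos, cell, sigma=1.0, k_cut=2.0):
    """Latent-Ewald-style reciprocal-space energy from (latent) charges q.

    q: (n,) charges, pos: (n, 3) positions, cell: (3, 3) lattice vectors as rows.
    Schematic prefactors; n_max must cover all |k| < k_cut for the given cell.
    """
    V = abs(np.linalg.det(cell))
    recip = 2.0 * np.pi * np.linalg.inv(cell).T  # reciprocal lattice vectors
    n_max = 3
    E = 0.0
    for nx in range(-n_max, n_max + 1):
        for ny in range(-n_max, n_max + 1):
            for nz in range(-n_max, n_max + 1):
                if nx == ny == nz == 0:
                    continue  # k = 0 is excluded from the sum
                k = recip @ np.array([nx, ny, nz], dtype=float)
                k2 = k @ k
                if k2 > k_cut ** 2:
                    continue
                S = np.sum(q * np.exp(1j * (pos @ k)))  # structure factor, eq. (10)
                E += np.exp(-sigma ** 2 * k2 / 2.0) / k2 * abs(S) ** 2
    return E / (2.0 * V)
```

Since the charges enter only through |S(k)|², the resulting energy is invariant under a global sign flip of all charges, which is why opposite-sign predictions alone do not change forces or energies.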
For Allegro, we combine a learned embedding of the total charge with the species embeddings and linearly project into a refined feature space just before the Allegro tensor product layer (similarly to the EFA integration, the initial invariant scalar latent features in the first MLP only utilize the original species embeddings). Although Allegro is strictly local, the global charge embedding provides a simple mechanism to modulate all node features consistently based on the system-level charge, enhancing expressivity without modifying the underlying equivariant structure.
387 samples), as EFA may outperform CELLI in larger datasets.
For MACE trained on the random split, the effect of adding long-range schemes is minimal and comparable to statistical noise; however, in more challenging test cases (the maximal separation split), the improvement becomes evident even for MACE, as the baseline fails to generalize to out-of-distribution test sets. Although CELLI and EFA models also struggle with biased splits, they perform significantly better than baseline models, suggesting that long-range corrections are crucial for achieving robust generalization. Our results also show that increasing the number of message passing steps can lead to overfitting and a decrease in performance compared to the baseline model. For example, in the cluster and maximal separation splits, an increase in the number of message passing steps more than doubles the RMSE, showing that the addition of separate long-range corrections is advantageous.
Interestingly, DimeNet++, which obtains reasonable results in the random, size, and maximal separation splits, fails to generalize under the cluster split. This shows that the proposed benchmarks probe different aspects of generalizability and can be used in future studies focusing on the development of long-range correction schemes. Our results also support our earlier hypothesis on the necessity of including a dedicated mechanism for modeling long-range interactions. The issues with generalizability are particularly significant for MACE, as foundational and universal models have been developed using this architecture.14,16 However, our results show that MACE without any long-range correction does not generalize well to out-of-distribution samples, suggesting it may not be suitable for this task.
In our next benchmark, we utilized the OMOL25 dataset to evaluate the performance of different schemes on charged molecules. First, we trained MACE and Allegro on the neutral molecules of our OMOL25 subset, with and without long-range schemes. While performance on the neutral subset is not strongly affected by CELLI, the baseline versions of both MACE and Allegro are degenerate with respect to total charge, meaning they cannot distinguish between molecules with different charge states and assign identical energies to species that differ only in charge. This limitation becomes critical once multiple charge states are present (see Fig. 4); therefore, we limited our subsequent analysis to the charge-dependent models.
To enable a fair comparison for CELLI, we added total charge embeddings to baseline Allegro and MACE (for details, see Methods). Interestingly, models with total charge embeddings perform slightly better compared to CELLI in the majority of cases. This demonstrates that CELLI can effectively act as a total charge embedding scheme, while also offering two key practical advantages over explicit embeddings: (1) it provides long-range capabilities, which are likely to be relevant in simulations of larger systems, and (2) although accounting for nonzero total charges is necessary for training, MD simulations are typically conducted under neutral conditions, meaning those embeddings would not be used outside of training. Additionally, total charge embeddings can also be combined with all long-range methods, including CELLI; hence, they should not be viewed as a replacement, but rather as a potential extension. The integration of these embeddings into long-range schemes is, however, beyond the scope of this study.
Fig. 6 Inferring charge distribution in MOFs. Probability density of charges inferred by different models on the ODAC25 dataset (A) and reference DDEC6 charges from the QMOF dataset (B). MACE-LES L and MACE-LES S differ in hyperparameters, with MACE-LES L being more robust. Probability densities were obtained using Gaussian kernel density estimation in SciPy.82 Parity plot of charges (C) and Born effective charges (D) for models trained on a random split of the QMOF dataset (LES was excluded from this plot for clarity and moved to the SI). As in ODAC25, LES was not able to recover the charge distribution. Despite MACE-CELLI being fitted to reference charges, a sufficiently low weight for the charge term in the loss function allowed it to deviate from the reference charges. For details and additional plots, see SI Section 9.
The issue of relying on partial charges to model electrostatic interactions is well known,68 as charge-based schemes require predefined charges, and their accuracy depends strongly on the chosen charge partitioning method, which often yields significantly different results.80 To address this, several approaches have been proposed to bypass the need for reference charges by inferring them directly from forces and energies, such as the LES framework.28,58,59 King et al. demonstrated that LES can successfully infer charges for small systems (e.g., polar dipeptides from the SPICE dataset59,81); however, to our knowledge, it has never been tested on MOFs, which exhibit significantly more complex electrostatic environments than small biomolecules.

The MACE-LES model trained on the random split achieved an energy MAE of 8.7 meV per atom and a force MAE of 24.6 meV Å−1, meaning that the force error is around 50% higher than that of the baseline MACE model, whereas the energy error is 50% lower (we also tested whether we could reproduce the results previously reported by King et al.59 to ensure that this is due to the performance of the MACE framework rather than technical issues on our side; for details, see SI Section 8). Further investigation shows that LES is also unable to infer correct charges without reference, and, similarly to CELLI, it predicts most partial charges to be almost zero. To further validate this result, we trained an additional MACE-LES model on the QMOF dataset, which includes DFT-derived partial charges and allows direct comparison to reference values. As shown in Fig. 6, the same behavior was observed: LES again failed to infer meaningful charges and collapsed to predicting near-zero values. Although LES frequently assigns charges opposite in sign to the reference values, this is not inherently problematic: Coulomb's law is symmetric under global charge inversion, meaning forces and energies would remain unchanged if all charges were flipped.
What is problematic is the lack of consistency, as LES sometimes predicts correct signs and sometimes their opposites. Based on this, we attribute the better energy performance of MACE-LES compared to baseline MACE to discrepancies in MACE implementations rather than to properties of LES. Namely, baseline MACE was trained in JAX76 and MACE-LES in PyTorch.78 However, given that the ODAC25 subsets used in this study are relatively small, we cannot rule out the possibility that CELLI or LES may recover correct charge distributions when trained on sufficiently large datasets. Nevertheless, these results indicate that when no partial charges are available, other approaches are necessary, as none of the schemes achieved a significant improvement, except for the Allegro–EFA model. We would also like to highlight that in this work, accurate partial charge prediction is not treated as a goal in itself, but rather as a diagnostic indicating whether a model is able to infer meaningful long-range electrostatics, with the near-zero charges predicted by LES and CELLI in the absence of reference data signaling a failure to capture such effects in MOFs.
Nevertheless, the reliance on SOAP descriptors may partly explain the limited decrease in performance observed for the ODAC25 subset, as this representation likely restricts the effectiveness of our splitting strategy. SOAP may fail to provide a sufficiently meaningful representation of certain atomic environments, resulting in a biased split that, in practice, is not very different from a random one. Additionally, SOAP descriptors are inherently short-ranged, and therefore may not fully distinguish structures that differ primarily in long-range electrostatics, potentially limiting how strictly these splits probe out-of-distribution behavior with respect to long-range physics. Nevertheless, if long-range interactions are not properly captured, models may compensate by overfitting short-range contributions, which can degrade short-range generalizability across chemically diverse environments. Future work could explore alternative representations, such as learned embeddings, electronic descriptors,83 or graph similarity measures,84 to construct more reliable biased splits. However, results from the QMOF dataset suggest that this limitation is not universally detrimental, and that SOAP-based splits can still expose meaningful performance differences when applied to sufficiently diverse datasets.
The performed benchmark tests clearly show that the incorporation of long-range schemes is necessary to achieve accurate and generalizable MLIPs. Especially in the QMOF benchmark, both Allegro and MACE benefited from the introduction of either EFA or CELLI, and only models based on CELLI exhibited good results in all three biased split benchmarks. We also show that the same cannot be achieved by increasing message passing layers. Furthermore, all variants share identical hyperparameters and nearly identical parameter counts, hence the observed improvements cannot be attributed to increased model capacity and instead arise from explicit long-range corrections. Although total charge embeddings can recover much of CELLI's performance on the OMOL25 dataset, they lack true long-range capabilities and generalize poorly in size-based extrapolation, indicating that embeddings alone cannot substitute for physically grounded treatments. In contrast, charge-based methods such as CELLI are inherently suited to modeling charged systems without relying on auxiliary embeddings, making them more robust and transferable, while embeddings should be viewed as complementary enhancements rather than replacements.
Interestingly, our results reveal some discrepancies with earlier studies that introduced several of the long-range methods evaluated here. For instance, Fuchs et al.23 reported that adding CELLI to MACE does not substantially improve performance over the baseline model, whereas in our out-of-distribution test cases, we observe clear gains. Similarly, our results indicate that LES is not able to infer charges from forces and energies alone, in contrast to conclusions drawn from smaller and less complex systems.59 These differences can likely be attributed to two main factors: our use of out-of-distribution splits and the fact that we apply these models to more challenging systems than those commonly considered in similar studies. In addition, LES relies on a fixed smearing parameter,85 which may further restrict transferability across chemically diverse environments.86 This also highlights a broader issue in the development of long-range methods: they are often benchmarked on datasets containing relatively small molecules or systems in which long-range interactions are weak or negligible. Our results indicate that such evaluations can be misleading. Long-range schemes should instead be tested on more challenging systems, such as MOFs, where the electrostatic environment is highly complex and long-range contributions play an important role.
We should also note that long-range interactions can extend beyond pairwise Coulomb terms due to polarization and higher-order multipole effects.26 Those effects are not explicitly modeled in the present CELLI and LES implementations. However, they can, in principle, capture polarization through charge prediction at each timestep. Future work could explore replacing QEq with polarizable charge equilibration schemes such as PQEq,85,87,88 or employing more expressive long-range kernels, including sum-of-Gaussians approaches (SOG-Net),26 to better capture these contributions. In addition, EFA may also capture higher-order and polarization effects. While detailed inference-time benchmarking is beyond the scope of this work, existing studies indicate that the considered long-range approaches do not significantly increase evaluation time,23,27 further suggesting that they can be routinely integrated into MLIPs.
Our results also provide a cautionary perspective on the limits of inferring charges solely from energy and force. While CELLI delivers substantial improvements when reliable reference charges are available, its performance collapses in their absence, as it fails to recover meaningful electrostatic information and degrades overall accuracy. LES, which was designed with charge inference in mind, exhibits a similar breakdown in MOFs, as it consistently converges to near-zero or inconsistently signed charges across both ODAC25 and QMOF, suggesting that charge inference becomes unreliable once the electrostatic environment becomes too complex. Collectively, these findings suggest that charge-inference schemes, while appealing, are not yet robust enough for systems with complex long-range physics such as MOFs. Although none of the tested models were able to correctly infer charges without reference data, it is possible that more robust architectures or optimized hyperparameter choices could achieve this; however, a systematic investigation of such configurations is beyond the scope of this study. A potential solution to this issue may involve pretraining on reference charges followed by training solely on forces and energies, resulting in a scheme at least partially decoupled from the chosen charge partitioning method. Other promising directions could include modifying the loss function, introducing additional constraints into the learning process, predicting the whole electron density, or using methods that skip charges altogether, like EFA. Ultimately, the most appropriate long-range strategy depends on whether access to accurate partial charges is beneficial. CELLI is preferred when reliable electrostatics and charge-transfer are important and difficult to infer directly from data, whereas charge-free approaches such as EFA can be used for systems where explicit electrostatic information is not needed.
Code availability: the software72 used to train MLIPs is publicly available at https://github.com/tummfm/chemtrain (DOI: https://doi.org/10.5281/zenodo.15438477, version 0.1.0). The CACE and MACE models71 with LES were trained using code available at https://github.com/ChengUCB/les/tree/main, https://github.com/BingqingCheng/cace, and https://github.com/ChengUCB/mace (accessed 09.2025). All code utilized in this work, including requirements, is available at https://github.com/msan9908/MLIP_generalizability (DOI: https://doi.org/10.5281/zenodo.19486082).
Supplementary information (SI) is available. See DOI: https://doi.org/10.1039/d5dd00570a.
This journal is © The Royal Society of Chemistry 2026