EGraFFBench: Evaluation of Equivariant Graph Neural Network Force Fields for Atomistic Simulations

Equivariant graph neural network force fields (EGraFFs) have shown great promise in modelling complex interactions in atomic systems by exploiting the inherent symmetries of the underlying graphs. Recent works have led to a surge of novel architectures that incorporate equivariance-based inductive biases alongside architectural innovations, such as graph transformers and message passing, to model atomic interactions. However, a thorough evaluation of EGraFFs deployed for the downstream task of real-world atomistic simulation is lacking. To this end, we perform a systematic benchmarking of six EGraFF algorithms (NequIP, Allegro, BOTNet, MACE, Equiformer, TorchMDNet) with the aim of understanding their capabilities and limitations for realistic atomistic simulations. In addition to a thorough evaluation and analysis on eight existing datasets from the benchmarking literature, we release two new benchmark datasets, propose four new metrics, and introduce three challenging tasks. The new datasets and tasks evaluate the performance of EGraFFs on out-of-distribution data, in terms of different crystal structures, temperatures, and new molecules. Interestingly, evaluation of the EGraFF models based on dynamic simulations reveals that a lower error on energy or force does not guarantee stable or reliable simulations or faithful replication of the atomic structures. Moreover, we find that no model clearly outperforms the others on all datasets and tasks. Importantly, we show that the performance of all the models on out-of-distribution datasets is unreliable, pointing to the need for a foundation model for force fields that can be used in real-world simulations. In summary, this work establishes a rigorous framework for evaluating machine learning force fields in the context of atomic simulations and points to open research challenges within this domain.


Introduction
Graph neural networks (GNNs) have emerged as powerful tools for learning representations of graph-structured data, enabling breakthroughs in various domains such as social networks, mechanics, drug discovery, and natural language processing [Perozzi et al., 2014, Wu et al., 2020, Zhang and Chen, 2018, Stokes et al., 2020, Zhou et al., 2020, Miret et al., 2023, Lee et al., 2023]. In the field of atomistic simulations, GNN force fields have shown significant promise in capturing complex interatomic interactions and accurately predicting the potential energy surfaces of atomic systems [Park et al., 2021, Sanchez-Gonzalez et al., 2020, Schütt et al., 2021, Qiao et al., 2021]. These force fields can, in turn, be used to study the dynamics of atomic systems, that is, how the atomic systems evolve with respect to time, enabling several downstream applications such as drug discovery, protein folding, identification of stable material structures, and battery materials with targeted diffusion properties.
Recent work has shown that GNN force fields can be further enhanced and made data-efficient by enforcing additional inductive biases, in the form of equivariance, that leverage the underlying symmetry of atomic structures. This family of GNNs, hereafter referred to as equivariant graph neural network force fields (EGRAFFs), has demonstrated the capability to model symmetries inherent in atomic systems, resulting in superior performance compared to other machine-learned force fields. This is achieved by explicitly accounting for symmetry operations, such as rotations and translations, and ensuring that the learned representations in EGRAFFs transform consistently under these operations.
Traditionally, EGRAFFs are trained on forces and energies from first-principles simulation data, such as density functional theory. Recent work has shown that low training or test error does not guarantee the performance of EGRAFFs on downstream tasks involving atomistic or molecular dynamics (MD) simulations [Fu et al., 2023]. Specifically, EGRAFFs can suffer from several major issues, such as (i) unstable trajectories (the simulation suddenly explodes or becomes unstable due to high local forces), (ii) poor structure (the structure of the atomic system, including coordination, bond angles, and bond lengths, is not captured properly), and (iii) poor generalization to out-of-distribution datasets, including simulations of the same system at different temperatures or pressures, simulations of different structures having the same chemical composition (for example, crystalline (ordered) and glassy (disordered) states of the same system), or simulations of different compositions having the same chemical components (for example, Li4P2S6 and Li7P3S11). Note that these are realistic tasks for which a force field well-trained on one system should generalize to other similar systems. As such, an extensive evaluation and comparison of EGRAFFs is needed, which requires standardized datasets, well-defined metrics, and comprehensive benchmarking that capture the diversity and complexity of atomic systems.
An initial effort to assess the performance of machine-learned force fields was carried out by Fu et al. [2023]. In that work, the authors focused on existing datasets and a handful of metrics, such as radial distribution functions and diffusion constants of atomic systems. However, the work did not cover the wide range of newly proposed EGRAFFs, many of which have shown superior performance on common tasks. Moreover, the metrics in Fu et al. [2023] were limited to stability, mean absolute error of forces, radial distribution function, and diffusivity. While useful, these metrics either do not capture the variations during the dynamic simulation (e.g., how the force or energy error evolves during simulation) or require long simulations (such as diffusion constants, which require many steps to reach the diffusive regime). Further, the work does not propose any novel tasks that can serve as a benchmark for the community developing new force fields.
With the increasing interest in EGRAFFs for atomic simulations, we aim to address this gap in benchmarking by performing a rigorous evaluation of the quality of simulations obtained using modern EGRAFF force fields. To this extent, we evaluate 6 EGRAFFs on 10 datasets, including two new challenging datasets that we contribute, and propose new metrics based on real-world simulations. By employing a diverse set of atomic systems and benchmarking metrics, we aim to objectively and rigorously assess the capabilities and limitations of EGRAFFs. The main contributions of this research paper are as follows:
• EGRAFFs: We present a benchmarking package to evaluate 6 EGRAFFs for atomistic simulations. As a byproduct of this benchmarking study, we release a well-curated codebase of the prominent equivariant GNN force fields in the literature, enabling easier and streamlined access to the relevant modeling pipelines.
• Challenging benchmark datasets: We present 10 datasets, including two new datasets, namely GeTe and LiPS20. The datasets cover a wide range of atomic systems, from small molecules to bulk systems. The datasets capture several scenarios, such as compounds with the same elements but different chemical compositions, the same composition with different crystal structures, and the same structure at different temperatures. This includes complex scenarios such as melting trajectories of crystals.
• Challenging downstream tasks: We propose several challenging downstream tasks that evaluate the ability of EGRAFFs to model the out-of-distribution datasets described earlier.
• Improved metrics: We propose additional metrics that evaluate the quality of the atomistic simulations regarding the structure and dynamics with respect to the ground truth.

Preliminaries
Every material consists of atoms that interact with each other through different types of bonding (e.g., covalent and ionic). These bonds are approximated by force fields that model the atomic interactions. Here, we briefly describe atomistic simulations and the equivariant GNNs used for modeling these systems.

Atomistic simulation
Consider a set of N atoms represented by a point cloud corresponding to their position vectors $(r_1, r_2, \ldots, r_N)$ and their types $\omega_i$. The potential energy of such a system can be written as the summation of one-body $U(r_i)$, two-body $U(r_i, r_j)$, three-body $U(r_i, r_j, r_k)$, up to $N$-body interaction terms as
$$U = \sum_i U(r_i) + \sum_{i<j} U(r_i, r_j) + \sum_{i<j<k} U(r_i, r_j, r_k) + \cdots$$
Since the exact computation of this potential energy is challenging, it is approximated using empirical force fields that learn the effective potential energy surface as a function of two-, three-, or four-body interactions. In atomistic simulations, these force fields are used to obtain the system's energy. The force on each particle is then obtained as the negative gradient of the energy with respect to its position, $F_i = -\partial U / \partial r_i$. The acceleration of each atom follows as $a_i = F_i / m_i$, where $m_i$ is the mass of atom $i$. Accordingly, the updated positions are computed by numerically integrating the equations of motion using a symplectic integrator. These steps are repeated to study the dynamics of atomic systems.
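The integration loop described above can be sketched with a toy two-body potential. The following is a minimal, self-contained example of the velocity Verlet scheme (a standard symplectic integrator); the harmonic diatomic potential is a stand-in assumption for illustration, not one of the force fields studied here:

```python
import numpy as np

def harmonic_energy_forces(r, k=1.0, r0=1.0):
    """Toy two-body potential U = 0.5*k*(|r1-r2| - r0)^2 for a diatomic.
    r: (2, 3) array of positions. Returns energy and (2, 3) forces F_i = -dU/dr_i."""
    d = r[0] - r[1]
    dist = np.linalg.norm(d)
    energy = 0.5 * k * (dist - r0) ** 2
    f0 = -k * (dist - r0) * d / dist  # force on atom 0; atom 1 gets the opposite
    return energy, np.array([f0, -f0])

def velocity_verlet(r, v, masses, dt, n_steps, energy_forces):
    """Symplectic velocity-Verlet integration of Newton's equations of motion."""
    _, f = energy_forces(r)
    for _ in range(n_steps):
        v = v + 0.5 * dt * f / masses[:, None]   # half kick
        r = r + dt * v                            # drift
        _, f = energy_forces(r)                   # recompute forces
        v = v + 0.5 * dt * f / masses[:, None]   # half kick
    return r, v

# A stretched diatomic oscillates about r0; total energy is nearly conserved.
r = np.array([[0.0, 0.0, 0.0], [1.2, 0.0, 0.0]])
v = np.zeros((2, 3))
m = np.ones(2)
r_new, v_new = velocity_verlet(r, v, m, dt=0.01, n_steps=1000,
                               energy_forces=harmonic_energy_forces)
e_final, _ = harmonic_energy_forces(r_new)
ke = 0.5 * np.sum(m[:, None] * v_new**2)
```

In an EGraFF-driven simulation, `harmonic_energy_forces` is replaced by the learned model's energy/force prediction; the symplectic integration loop itself is unchanged.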

Equivariant GNN force fields (EGRAFF)
Figure 1: Equivariant transformation G on a molecule under rotation R.
GNNs are widely used to model force fields due to their topological similarity with atomic systems. Specifically, nodes are considered atoms, edges represent interactions/bonds, and the energy or force is predicted as the output at the node or edge level. Equivariant GNNs employ a message-passing scheme that is equivariant to rotations, that is, G(Rx) = RG(x), where R is a rotation and G is an equivariant transformation (see Fig. 1). This enables a rich representation of atomic environments that is equivariant to rotation. Notably, while the energy of an atomic system is invariant to rotation (that is, a molecule before and after rotation has the same energy), the force is equivariant to rotation (that is, the forces arising from the interatomic interactions are rotated when the molecule is rotated).
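The identity G(Rx) = RG(x) can be verified numerically. The sketch below uses a toy rotation-invariant pairwise potential as a hypothetical stand-in for a learned EGRAFF and checks that rotating the molecule and then computing forces gives the same result as computing forces and then rotating them:

```python
import numpy as np

def pair_forces(positions, k=1.0, r0=1.0):
    """Forces from a rotation-invariant pairwise harmonic potential
    U = sum_{i<j} 0.5*k*(|r_i - r_j| - r0)^2. Forces derived from an
    invariant energy are automatically equivariant to rotations."""
    n = len(positions)
    forces = np.zeros_like(positions)
    for i in range(n):
        for j in range(i + 1, n):
            d = positions[i] - positions[j]
            dist = np.linalg.norm(d)
            f = -k * (dist - r0) * d / dist
            forces[i] += f
            forces[j] -= f
    return forces

def rotation_z(theta):
    """3x3 rotation matrix about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))     # toy "molecule" of 5 atoms
R = rotation_z(0.7)

lhs = pair_forces(x @ R.T)      # G(Rx): forces of the rotated molecule
rhs = pair_forces(x) @ R.T      # RG(x): rotated forces of the original
# Equivariance: lhs and rhs agree to numerical precision.
```

Any force field obtained as the gradient of a rotation-invariant energy satisfies this identity by construction; EGRAFFs additionally build equivariance into the intermediate message-passing features.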

Models Studied
All EGRAFFs employed in this work rely on equivariance in the graph structure. All models use a one-hot encoding of the atomic numbers Z_i as the node input and the position vectors r_i as node or edge inputs. Equivariance in these models is ensured by the use of spherical harmonics along with radial basis functions; the convolution or message-passing implementation differs from model to model. Further hyperparameter details for all models are tabulated in App. A.10. NEQUIP [Batzner et al., 2022], based on tensor field networks, employs a series of self-interaction, convolution, and concatenation operations with the neighboring atoms. The convolution filter, represented as a product of a radial basis function R and spherical harmonics Y_l^m, ensures equivariance; this was the first EGRAFF for atomistic simulations based on spherical harmonics. ALLEGRO Musaelian et al. [2022] merges the precision of recent equivariant GNNs with strict locality, without message passing. Its inherently local character enhances its scalability to potentially larger systems. In contrast to the other models, ALLEGRO predicts the energy as a function of the final edge embeddings rather than the node embeddings; all pairwise energies are summed to obtain the total energy of the system. ALLEGRO features remarkable adaptability to data outside the training distribution, consistently surpassing other force fields in this aspect, especially those employing body-ordered strategies. BOTNET Batatia et al. [2022a] is a refined body-ordered adaptation of NEQUIP. While maintaining the two-body interactions of NEQUIP in each layer, it increments the body order by one with every iteration of message passing. Unlike NEQUIP, BOTNET uses non-linearities in the update step. MACE Batatia et al.
[2022b] offers efficient equivariant messages with high body-order computation. Due to the augmented body order of the messages, merely two message-passing iterations suffice to attain notable accuracy. This contrasts with the usual five or six iterations observed in other GNNs, rendering MACE both scalable and amenable to parallelization. TORCHMDNET Thölke and Fabritiis [2022] introduces a transformer-based GNN architecture utilizing a modified multi-head attention mechanism. This modification expands the traditional dot-product attention to integrate edge data, which can enhance the learning of interatomic interactions. EQUIFORMER [Liao and Smidt, 2023] is a transformer-based GNN architecture introducing a new attention mechanism named 'equivariant graph attention', which equips the conventional attention used in transformers with equivariance. PaiNN [Schütt et al., 2021] is a polarizable atom interaction neural network consisting of an equivariant message-passing architecture that takes into account the varying polarizability of atoms in different chemical environments, allowing for a more realistic representation of molecular behavior. DimeNet++ [Gasteiger et al., 2020] is a directional message-passing neural network in which each rotationally equivariant message is associated with a direction in coordinate space.

Benchmarking Evaluation
In this section, we benchmark the above-mentioned architectures and distill the insights generated. The evaluation environment is detailed in App. A.8.

Datasets
Since the present work focuses on evaluating EGRAFFs for molecular dynamics (MD) simulations, we consider only datasets that include time dynamics, i.e., all the datasets represent the dynamics of atoms (see Fig. 2). We consider a total of 10 datasets (see Tab. 10 and App. A.1).
MD17 is a widely used dataset Batzner et al. [2022], Liao and Smidt [2023], Batatia et al. [2022a,b], Thölke and Fabritiis [2022], Fu et al. [2023] for benchmarking ML force fields. It was proposed by Chmiela et al. [2017] and constitutes a set of small organic molecules, including benzene, toluene, naphthalene, ethanol, uracil, and aspirin, with energies and forces generated by ab initio MD (AIMD) simulations. Here, we select four molecules, namely aspirin, ethanol, naphthalene, and salicylic acid, to cover a range of chemical structures and topologies. Further, zero-shot evaluation is performed on benzene. We train the models on 950 configurations and validate them on 50. 3BPA contains a large, flexible, drug-like organic molecule, 3-(benzyloxy)pyridin-2-amine (3BPA), sampled from MD trajectories at different temperatures Kovács et al. [2021]. It has three consecutive rotatable bonds leading to a complex dihedral potential energy surface with many local minima, making it challenging to approximate using classical or ML force fields. The models can be trained either on 300 K snapshots or on mixed-temperature snapshots sampled at 300 K, 600 K, and 1200 K. In the following experiments, we train models on 500 configurations sampled at 300 K and test on 1669 configurations sampled at 600 K. LiPS consists of lithium, phosphorous, and sulfur (Li6.75P3S11) and has been used in similar benchmarking analyses Fu et al.
[2023] as a representative system for MD simulations that study kinetic properties in materials. Note that LiPS is a crystalline (ordered) structure that can potentially be used in battery development. We have adopted this dataset from [Batzner et al., 2022] and benchmarked all models for their force and energy errors. The training and testing datasets have 19000 and 1000 configurations, respectively. Acetylacetone (AcAc) was generated by conducting MD simulations at both 300 K and 600 K using a Langevin thermostat [Batatia et al., 2022a]. The uniqueness of this dataset stems from the varying simulation temperatures and the range of sampled dihedral angles. While the training set restricts sampling to dihedral angles below 30°, our models are tested on angles extending up to 180°. The model must generalize well on the potential energy surface (PES) for accurate predictions at these higher angles, which presents an excellent opportunity for benchmarking GNNs. The dataset consists of 500 training configurations and 650 testing configurations. GeTe is a new dataset generated by Car-Parrinello MD (CPMD) simulations Hutter [2012] of Ge and Te atoms, which build on density functional theory (DFT) calculations of the interatomic forces prior to a classical integration of the equations of motion. It consists of 200 atoms, of which 40 are Ge and 160 are Te, corresponding to the composition GeTe4, whose structural properties have been investigated in detail and reproduce a number of experimental data in the liquid and amorphous phases from neutron/X-ray scattering Micoulaut et al. [2014a], Gunasekera et al. [2014] and Mössbauer spectroscopy Micoulaut et al. [2014b]. As GeTe belongs to the promising class of phase-change materials Zhang et al.
[2019], it is challenging to simulate using classical force fields, despite their increased accessibility in terms of time and system size. Thus, an accurate force field is essential to understand the structural changes in GeTe during the crystalline-to-disordered phase transitions. Here, our dataset consists of 1,500 structures for training, 300 for testing, and 300 for validation. LiPS20 is a new dataset generated from AIMD simulations of a series of systems containing Li, P, and S elements, including both crystalline and disordered structures of elementary substances and compounds, such as Li, P, S, Li2P2S6, β-Li3PS4, γ-Li3PS4, and xLi2S-(100−x)P2S5 (x = 67, 70, 75, and 80) glasses, using the CP2K package Kühne et al. [2020]. Details of the dataset generation, structures, and compositions are given in App. A.2.

Evaluation metrics
Ideally, once trained, the forward simulations performed by EGRAFFs should be close to the ground truth (first-principles simulations) both in terms of the atomic structure and the dynamics. To this extent, we propose four metrics. Note that these metrics are evaluated based on the forward simulation, starting from an arbitrary structure and running for n steps using the force fields, a task for which the models are not explicitly trained. All forward simulations were performed using the Atomic Simulation Environment (ASE) package [Larsen et al., 2017]. The simulations were conducted in the canonical (NVT) ensemble, with the temperature and timestep set in accordance with the sampling conditions specified in the respective datasets. See details in App. A.3.

Structure metrics
We propose two metrics to evaluate the proximity of structures predicted by the EGRAFFs to the ground truth.

Wright's Factor (WF), $R_\chi$: Grimley et al. [1990] represents the relative difference between the radial distribution function (RDF) of the ground-truth atomic structure ($g_{ref}(r)$) and the structure obtained from the atomistic simulations employing the EGRAFFs ($g(r)$) as
$$R_\chi = \left[ \frac{\sum_i \left( g(r_i) - g_{ref}(r_i) \right)^2}{\sum_i g_{ref}(r_i)^2} \right]^{1/2} \times 100\%.$$
The RDF essentially represents the local time-averaged density of atoms at a distance r from a central atom (see App. A.4). Hence, it concisely captures, in one-dimensional form, the structure simulated by a force field. A force field is considered acceptable if it yields a WF of less than 9% for bulk systems Bauchy [2014].

Jensen-Shannon divergence (JSD) of the radial distribution function: The Jensen-Shannon divergence Cover and Thomas [1991], Shannon [1948] is a useful tool for quantifying the difference or similarity between two probability distributions in a way that overcomes some of the limitations of the KL divergence Kullback and Leibler [1951]. Since the RDF is essentially a distribution of the atomic density, the JSD between the predicted RDF and the ground-truth RDF can be computed as
$$\mathrm{JSD}(g, g_{ref}) = \frac{1}{2} D_{KL}(g \,\|\, \hat{g}) + \frac{1}{2} D_{KL}(g_{ref} \,\|\, \hat{g}),$$
where $\hat{g}(r) = \frac{1}{2}(g(r) + g_{ref}(r))$ is the mean of the predicted and ground-truth RDFs (see App. A.4).

Table 1: Energy (E) and force (F) mean absolute error in meV and meV/Å, respectively, for the trained models on different datasets. Darker colors represent better-performing models. We use shades of green and blue for energy and force, respectively.
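Both structure metrics can be computed directly from tabulated RDFs. The following is a minimal sketch, assuming the two RDFs are sampled on the same r-grid; the exact normalization of Wright's factor may differ slightly from the implementation used in the benchmark:

```python
import numpy as np

def wrights_factor(g, g_ref):
    """Wright's factor R_chi (in %): relative L2 mismatch between a
    simulated RDF g and a reference RDF g_ref on the same r-grid."""
    return 100.0 * np.sqrt(np.sum((g - g_ref) ** 2) / np.sum(g_ref ** 2))

def jsd_rdf(g, g_ref, eps=1e-12):
    """Jensen-Shannon divergence between two RDFs, treated as normalized
    probability distributions over r. Symmetric and bounded in [0, ln 2]."""
    p = g / (g.sum() + eps)
    q = g_ref / (g_ref.sum() + eps)
    m = 0.5 * (p + q)

    def kl(a, b):
        return np.sum(a * np.log((a + eps) / (b + eps)))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Identical RDFs give WF = 0% and JSD = 0; a shifted peak gives small
# positive values for both.
r = np.linspace(0.5, 6.0, 200)
g_ref = np.exp(-((r - 2.0) ** 2))   # toy single-peak reference RDF
g_sim = np.exp(-((r - 2.1) ** 2))   # slightly shifted simulated RDF
wf = wrights_factor(g_sim, g_ref)
jsd = jsd_rdf(g_sim, g_ref)
```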

Dynamics metrics
We monitor the energy and force errors over the forward-simulation trajectory to evaluate how close the predicted dynamics are to the ground truth. Specifically, we use the following metrics, namely the energy violation error, EV(t), and the force violation error, FV(t), defined as
$$EV(t) = \frac{|\hat{E}(t) - E(t)|}{|\hat{E}(t)| + |E(t)|}, \qquad FV(t) = \frac{\|\hat{F}(t) - F(t)\|}{\|\hat{F}(t)\| + \|F(t)\|},$$
where $\hat{E}(t)$ and $E(t)$ are the predicted and ground-truth energies, respectively, $\hat{F}(t)$ and $F(t)$ are the predicted and ground-truth forces, and $\|\cdot\|$ is taken over all atomic force components. Note that this normalization ensures that the energy and force violation errors are bounded between 0 and 1, with 0 representing exact agreement with the ground truth and 1 representing no agreement. Further, we compute the geometric mean of EV(t) and FV(t) over the trajectory to represent the cumulative EV and FV.
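A minimal sketch of these trajectory metrics, assuming the normalized-difference form given above (the [0, 1] bounds follow from the triangle inequality):

```python
import numpy as np

def energy_violation(e_pred, e_ref, eps=1e-12):
    """EV(t) in [0, 1]: 0 for exact agreement, 1 for no agreement."""
    return np.abs(e_pred - e_ref) / (np.abs(e_pred) + np.abs(e_ref) + eps)

def force_violation(f_pred, f_ref, eps=1e-12):
    """FV(t) in [0, 1] from per-frame force arrays of shape (N, 3)."""
    num = np.linalg.norm(f_pred - f_ref)
    den = np.linalg.norm(f_pred) + np.linalg.norm(f_ref)
    return num / (den + eps)

def geometric_mean(x, eps=1e-12):
    """Cumulative metric over the trajectory (eps guards log(0))."""
    return np.exp(np.mean(np.log(np.asarray(x) + eps)))

# Example: per-frame violations along a short 3-frame trajectory.
e_pred = np.array([-10.0, -10.2, -10.5])
e_ref = np.array([-10.0, -10.1, -10.4])
ev = [energy_violation(p, q) for p, q in zip(e_pred, e_ref)]
ev_mean = geometric_mean(ev)
```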

Energy and Forces
To evaluate the performance of the trained models on different datasets, we first compute the mean absolute error in predicting the energy and force (see Table 1). First, we observe that no single model consistently outperforms the others on all datasets, highlighting the dataset-specific nature of the models. TORCHMDNET has a notably lower energy error than the other models on most datasets, while NEQUIP has the minimum force error on the majority of datasets along with low energy error. On bulk systems such as LiPS and LiPS20, MACE and BOTNET show the lowest energy errors. Interestingly, GeTe, the largest dataset in terms of the number of atoms, exhibits significant energy errors across all models, with EQUIFORMER having the lowest. EQUIFORMER also exhibits lower force errors on datasets such as Acetylacetone, 3BPA, and MD17, but suffers high force errors on GeTe, LiPS, and LiPS20. Overall, NEQUIP performs well in terms of both energy and force errors on several datasets. It is also interesting to note that models exhibiting low energy error often exhibit high force error, suggesting that the gradient of the energy is not captured well by these models. This can potentially lead to poor simulations, as the updated positions are computed directly from the forces.

Forward simulations
To evaluate the ability of the trained models to simulate realistic structures and dynamics, we perform MD simulations using the trained models and compare them with ground-truth simulations, both starting from the same initial configuration. For each model, five forward simulations of 1000 timesteps are performed on each dataset. Tables 2 and 3 show the JSD and WF, and the EV and FV, respectively, of the trained models on the datasets (see App. A.5, A.6, and A.7 for figures). In terms of both JSD and WF, we observe that NEQUIP performs better on most datasets. Interestingly, even on datasets where other models have lower MAE on energy and force, NEQUIP better captures the atomic structure. Altogether, we observe that NEQUIP, followed by TORCHMDNET, performs best in capturing the atomic structure on most datasets. We now evaluate the models' EV and FV during the forward simulation. Interestingly, we observe that NEQUIP and ALLEGRO exhibit the lowest FV on most datasets, while MACE and BOTNET perform better in terms of EV.

Generalizability to unseen structures
The first task focuses on evaluating the models on an unseen small-molecule structure. To this extent, we test the models, trained on four molecules of the MD17 dataset (aspirin, ethanol, naphthalene, and salicylic acid), on benzene, an unseen molecule from the MD17 dataset. Note that the benzene molecule has a cyclic ring structure. Aspirin and salicylic acid contain one ring, naphthalene is polycyclic with two rings, while ethanol has a chain structure with no rings. Table 5 shows the EV and FV, and Table 6 shows the corresponding JSD and WF. We observe that all the models suffer very high errors in force and energy. EQUIFORMER trained on ethanol and salicylic acid exhibits unstable simulations after the first few steps. Interestingly, models trained on non-cyclic ethanol perform better than those trained on aspirin and salicylic acid, although the latter structures are more similar to benzene. Similarly, the model trained on polycyclic naphthalene performs better than the other models. Altogether, we observe that despite having the same chemical elements, models trained on one small molecule do not generalize to an unseen molecule with a different structure.

Generalizability to higher temperatures
At higher temperatures, the sampling region in the energy landscape widens; hence, the configurations obtained at higher temperatures come from a broader distribution of structural configurations. In the 3BPA molecule, at 300 K only the stable dihedral-angle configurations are present, while at 600 K all configurations are sampled. Here, we evaluate the models trained at the lower temperature on simulations at the higher temperature. Table 7 shows the obtained mean energy and force violations over the forward-simulation trajectory, and Table 8 shows the corresponding JSD and WF. We observe that the models can reasonably capture the behavior, both structure and dynamics, at higher temperatures.

Out of distribution tasks on the LiPS20 dataset
Unseen crystalline structures: Crystal structures are stable low-energy structures with inherent symmetries and periodicity. Predicting their energy accurately is an extremely challenging task and a cornerstone of materials discovery.
Here, we train the models on liquid (disordered) structures and test them on out-of-distribution crystalline structures to evaluate their generalization capabilities. Table 9 shows that BOTNET performs appreciably well, with almost the same energy and force errors on crystal structures as its training error. Both transformer models perform poorly on the LiPS20 system, on both the training and testing datasets. TORCHMDNET has significantly high energy and force errors, whereas EQUIFORMER exhibits instability during the forward simulation.
Generalizability to unseen compositions: The LiPS20 dataset consists of 20 different compositions with varying system sizes and cell geometries (see App. A.2). In Tables 9a and 9b, we show the results on test structures that are not present in the training dataset. The test dataset contains system sizes of up to 260 atoms, while the models were trained on systems with fewer than 100 atoms. This tests the models' generalization as well as inductive capability. We observe that MACE and BOTNET have the lowest mean energy and force violations and low WF. NEQUIP and ALLEGRO have significantly higher test errors.

Concluding Insights
In this work, we present EGRAFFBench, a benchmarking suite for evaluating machine-learned force fields.The key insights drawn from the extensive evaluation are as follows.

1. Dataset matters:
No single model performed best on all the datasets and all the metrics. Thus, the selection of the model depends highly on the nature of the atomic system, for instance, whether it is a small molecule or a bulk system.

2. Structure is important: Low force or energy error during model development does not guarantee faithful reproduction of the atomic structure. Conversely, models with higher energy or force errors may provide reasonable structures. Accordingly, downstream evaluation of atomic structures using structural metrics is important in choosing the appropriate model.
3. Stability during dynamics: Low energy or force errors on the static configurations used during model development do not guarantee low errors during forward simulation. Thus, the energy and force violations during molecular dynamics should be evaluated separately to understand the stability of the simulation.
4. Out-of-distribution is still challenging: The discovery of novel materials relies on identifying hitherto unknown configurations with low energy. We observe that the models still do not perform reliably on out-of-distribution datasets, leaving an open challenge in materials modeling.
5. Fast to train and fast on inference: We observe that some models are fast to train, while others are fast on inference. For instance, TORCHMDNET is slow to train but fast on inference. While MACE is fast in both training and inference, it does not give the best results in terms of structure or dynamics. Thus, in cases where larger simulations are required, an appropriate model that balances training/inference time and accuracy may be chosen.
Limitations and future work: Our research clearly points to the need for a foundation model trained on large datasets. Further, improved training strategies need to be developed that (i) ensure the learning of the gradients of energies, i.e., the forces, (ii) take into account the dynamics during simulations, and (iii) reproduce the structure faithfully. This suggests moving away from the traditional approach of training only on energies and forces and instead focusing on the system's dynamics. Furthermore, strategies combining experimentally observed structures and simulated dynamics can be devised through experiment-simulation fusion to develop reliable force fields that are faithful to both experiments and simulations. Another interesting direction is the empirical evaluation of which particular architectural feature of a model yields superior performance for a given dataset or system (defined by the type of bonding, number of atoms, crystalline vs. disordered structure, etc.). Such a detailed analysis can guide the design of improved architectures while also providing rules of thumb for choosing an appropriate architecture for a given system.
strengthen the capability of the force field in reproducing the glass structures of different lithium thiophosphates. The training dataset consists of the following compositions, shuffled randomly: Li, Li2S, Li48P16S61, P4S3, Li7PS6. The crystal-structure set included β-Li3PS4, Li2PS3-hex, γ-Li3PS4, and Li2PS3-orth, and the remaining compositions were used as the test dataset.

A.3 Timestep and temperature details
Table 12 displays the temperature in Kelvin and the corresponding timestep in femtoseconds for various datasets utilized in the forward simulations.These values remain consistent with the original sampled datasets.

A.5 Mean Energy and force violation
Figure 4 shows the obtained geometric mean of energy and force violation errors for the trained models on all the datasets.We observe that the variation of energy error among the models is quite large for some datasets like MD17 and LiPS20, and very small for datasets like 3BPA and Acetylacetone.

A.6 Rollout Energy and force violation
The evolution of energy violation error, EV(t), and force violation error, FV(t), obtained as average over five forward simulations for different datasets are shown in Figure 5.

A.7 Comparative Analysis
Figure 6 shows the comparative radial plots for the different metrics on all the datasets. For better interpretability, we normalize each metric with respect to its largest value in the dataset. Figure 7 shows the comparison of different pairs of related metrics for all the datasets and models.

A.8 Hardware details
All the models are trained using A100 80GB PCI GPUs, and inference is performed using an AMD EPYC 7282 16-Core Processor @ 2.80 GHz with 1 TB of installed RAM. All models use a PyTorch environment, with the Atomic Simulation Environment (ASE) package for forward simulations. Specific version details are given in the code repository.
A.9 Root mean square displacement plots

A.10 Hyperparameter details
The details of the hyperparameters used for training each of the models are provided in the following tables: NEQUIP in Table 13, ALLEGRO in Table 14, BOTNET in Table 15, MACE in Table 16, EQUIFORMER in Table 17, TORCHMDNET in Table 18, PaiNN in Table 20, and DimeNeT++.
Figure 3 shows the reference and generated radial distribution functions (RDFs) for 3BPA, Acetylacetone, LiPS, and GeTe. The generated RDFs are obtained by averaging over five simulation trajectories of 1000 steps.

Figure 3: Pair distribution function (PDF) over the simulation trajectory. The reference PDF (red) and the generated PDF (blue) represent the ground-truth and predicted RDFs. The values are computed as the average of five forward simulations of 1000 timesteps on each dataset with different initial conditions.
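The RDFs compared in Figure 3 can be estimated by histogramming pair distances over the trajectory. The sketch below is a minimal non-periodic version; the normalization (here, raw pair counts per spherical-shell volume, without the ideal-gas density factor) and the periodic minimum-image handling needed for the benchmark systems are left out.

```python
import numpy as np

def rdf(frames, r_max, n_bins=100):
    """Radial distribution estimate averaged over trajectory frames.
    frames: array of shape (timesteps, atoms, 3), non-periodic sketch."""
    edges = np.linspace(0.0, r_max, n_bins + 1)
    hist = np.zeros(n_bins)
    for pos in frames:
        # All pairwise distances, keeping each unique pair once
        d = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
        d = d[np.triu_indices(len(pos), k=1)]
        h, _ = np.histogram(d, bins=edges)
        hist += h
    # Normalize counts by the volume of each spherical shell
    shell = (4.0 / 3.0) * np.pi * (edges[1:] ** 3 - edges[:-1] ** 3)
    r = 0.5 * (edges[1:] + edges[:-1])
    return r, hist / (shell * len(frames))
```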

Figure 4: Geometric mean of the energy (×10⁻⁵) and force violation error over the simulation trajectory. The error bars show 95% confidence intervals. The values are computed as the average of five forward simulations of 1000 timesteps on each dataset with different initial conditions.

Figure 5: Energy (×10⁻⁵) and force violation error over the simulation trajectory. The error bars show 95% confidence intervals. The values are computed as the average of five forward simulations of 1000 timesteps on each dataset with different initial conditions.

Figure 6: Comparative analysis of the different metrics for all models across datasets. The color of each line indicates model identity. The values are normalized by dividing by their respective maximum values and multiplying by 100.

Figure 7: Comparison of (a) energy violation and force violation, (b) JSD and WF, (c) training time and inference time, and (d) mean absolute energy error (MAE) and mean absolute force error (MAF), for all datasets. The values are normalized by the largest values to scale between 0 and 1.

Figure 8: Root mean square displacement plots for the models on all datasets. The values are computed as the average of five forward simulations of 1000 timesteps on each dataset with different initial conditions.

Figure 9: Root mean square displacement plots for all the models on all datasets. The values are computed as the average of five forward simulations of 1000 timesteps on each dataset with different initial conditions.

Table 2: JSD and WF for the six EGraFFs on all the datasets. The values are computed as the average of five forward simulations of 1000 timesteps on each dataset with different initial conditions.

Table 4 shows the training and inference times of the different models. MACE and TORCHMDNET have the lowest per-epoch training time. The total training time is higher for the transformer models, TORCHMDNET and EQUIFORMER, because of the larger number of epochs required for training. Although NEQUIP and ALLEGRO require more time per epoch, they converge in fewer epochs. The LiPS dataset, with the largest training set of around 20000 configurations, has the largest per-epoch training time. Since MD simulations are generally performed on CPUs, we report the inference time as a mean over five simulations of 1000 steps performed on a CPU. TORCHMDNET is significantly faster on all the datasets, while ALLEGRO and MACE show competitive performance. A visual analysis of the models on these metrics is given in App. A.7.
Interestingly, TORCHMDNET, despite having the lowest energy MAE on most datasets, does not exhibit a low EV, indicating that a low MAE during model development does not guarantee a low energy error during MD simulation.
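The JSD values reported here compare the reference and generated RDF histograms as distributions. A minimal base-2 implementation is sketched below; the normalization of the histograms to probability vectors is an assumption about the paper's setup (WF, the other structural metric, is not shown here).

```python
import numpy as np

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence (base 2) between two histograms,
    e.g. reference and generated RDFs, normalized to sum to one."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)  # mixture distribution
    def kl(a, b):
        # Kullback-Leibler divergence with eps guarding log(0)
        return np.sum(a * np.log2((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

In base 2 the JSD is bounded in [0, 1], so identical RDFs score 0 and fully disjoint ones score 1, which makes the values in Table 2 directly comparable across datasets.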

Table 3: Geometric mean of the energy (×10⁻⁵) and force violation error over the simulation trajectory. The values are computed as the average of five forward simulations of 1000 timesteps on each dataset with different initial conditions. Values in parentheses represent the standard deviation.

Table 4: Training time (T) per epoch and inference time (I), in minutes/epoch and minutes, respectively, for the trained models on all the datasets. The inference time is the mean over five forward simulations of 1000 steps on the CPU.

Table 5: EV (E) and FV (F) on the forward simulation of the benzene molecule by the models trained on aspirin, ethanol, naphthalene, and salicylic acid.

Table 6: JSD and WF over the simulation trajectory of the benzene molecule using models trained on aspirin, ethanol, naphthalene, and salicylic acid.

Table 7: Geometric mean of the energy (×10⁻⁵) and force violation at 300 K and 600 K using the model trained at 300 K for the acetylacetone and 3BPA datasets.

Table 8: JSD and WF at 300 K and 600 K using the model trained at 300 K for acetylacetone and 3BPA.

Table 9: LiPS20 test on training structures, unseen crystalline structures, and test structures: (a) energy and force violation and (b) JSD and WF.

Table 19: Hyperparameters used for training DimeNET++.
A.11 Literature comparison