Open Access Article

Chuin Wei Tan,†a Marc L. Descoteaux,†a Mit Kotak,b Gabriel de Miranda Nascimento,c Seán R. Kavanagh,d Laura Zichi,a Menghang Wang,a Aadit Saluja,a Yizhong R. Hu,a Tess Smidt,e Anders Johansson,f William C. Witt,a Boris Kozinsky*ag and Albert Musaelian*ah

a John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, MA, USA. E-mail: bkoz@g.harvard.edu; amusaelian@alumni.harvard.edu
b Center for Computational Science and Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
c Department of Materials Science and Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
d Center for the Environment, Harvard University, Cambridge, MA, USA
e Research Laboratory of Electronics, Massachusetts Institute of Technology, Cambridge, MA, USA
f Sandia National Laboratories, Albuquerque, NM, USA
g Robert Bosch LLC Research and Technology Center, Watertown, MA, USA
h Mirian Technologies Inc., Boston, MA, USA
First published on 26th March 2026
Machine learning interatomic potentials, particularly those based on deep equivariant neural networks, have demonstrated state-of-the-art accuracy and computational efficiency in atomistic modeling tasks like molecular dynamics and high-throughput screening. The size of datasets and demands of downstream workflows are growing rapidly, making robust and scalable software essential. This work presents a major overhaul of the NequIP framework focusing on multi-node parallelism, computational performance, and extensibility. The redesigned framework supports distributed training on large datasets and removes barriers preventing full utilization of the PyTorch 2.0 compiler at train time. We demonstrate this acceleration in a case study by training Allegro models on the SPICE 2 dataset of organic molecular systems. For inference, we introduce the first end-to-end infrastructure that uses the PyTorch Ahead-of-Time Inductor compiler for machine learning interatomic potentials. Additionally, we implement a custom kernel for the Allegro model's most expensive operation, the tensor product. Together, these advancements speed up molecular dynamics calculations on system sizes of practical relevance by up to factors of 5 to 18.
This work presents a major revamp of the NequIP software framework for MLIPs to dramatically improve computational performance and versatility across training and inference tasks of all sizes. Within this framework, NequIP29 pioneered the development of interatomic potentials based on equivariant neural networks, which incorporate rotational, mirror, and other physical symmetries directly in the structure of the model, while Allegro30 extended this approach with a scalable design for large-scale systems.31 More generally, equivariant model architectures have been shown to improve data efficiency, enhance generalization, and achieve state-of-the-art accuracy in MLIP-based simulations,29,30,32,33 and a number of further leading models now adapt the NequIP architecture and/or codebase.21,22,34 However, equivariant neural networks pose practical computational challenges due to their use of mathematical operations for which libraries like PyTorch do not provide optimized implementations. Native PyTorch implementations of these operations tend to suffer from kernel launch overhead and the materialization of large intermediate tensors, which can limit both speed and memory efficiency. Overcoming these limitations with compiler-based optimization and custom GPU kernels is a core motivation of this work.
We begin in Section 2 by describing the methods we used to accelerate training and inference. Then, we present a case study demonstrating the performance improvements using the Allegro deep equivariant network architecture30 in Section 3.
TorchInductor serves as the compiler backend and can be invoked through two complementary compilation modes: just-in-time (JIT) compilation via torch.compile, and ahead-of-time (AOT) compilation through AOT Inductor. Both modes share almost the same underlying compiler infrastructure and optimization pipeline, differing mainly in when and where compilation occurs: JIT compilation happens dynamically within Python, whereas AOT compilation produces standalone binaries for use in non-Python environments. As will be discussed, JIT compilation with torch.compile is used for training (Section 2.2), while AOT compilation is used for Python-free inference, where the Python-based PyTorch JIT machinery is unavailable (Section 2.3).
The main challenge in applying TorchInductor to NequIP and Allegro is the models' use of automatic differentiation to predict forces and virials as exact derivatives of the potential energy. Enforcing these derivative relationships ensures that forces remain conservative fields of the energy, which is essential for maintaining energy conservation in molecular dynamics simulations.42 Evidence suggests that using conservative forces improves the accuracy of physical property predictions,27 particularly for phonon-related properties,43 and enhances the stability of geometry relaxation tasks.44 Computing the derivative, however, is not a primitive PyTorch operation that the compiler machinery supports directly. To overcome this challenge, we adapted our code to trace the entire model, including the calculation of the derivative(s), into a single computational graph that is represented using PyTorch's FX (torch.fx) library.45 The FX graph resulting from this procedure represents both the forward pass, which computes the potential energy, and the backward-mode automatic differentiation pass, which computes forces and other derivatives. Because this graph contains only primitive PyTorch operations, it is compatible both with train-time compilation (Section 2.2) and ahead-of-time compilation for inference (Section 2.3).
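The derivative relationship described above can be illustrated with a minimal sketch. The pairwise energy below is a hypothetical stand-in for the real NequIP/Allegro model, and here we simply call `torch.autograd.grad` at run time, whereas the framework traces the forward and backward passes together into a single FX graph:

```python
import torch

def pair_energy(pos: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
    # Toy pairwise (Lennard-Jones-like) energy summed over neighbor pairs;
    # a stand-in for the real NequIP/Allegro energy model.
    src, dst = edge_index
    r = (pos[dst] - pos[src]).norm(dim=-1)
    return ((1.0 / r) ** 12 - (1.0 / r) ** 6).sum()

def energy_and_forces(pos: torch.Tensor, edge_index: torch.Tensor):
    pos = pos.detach().requires_grad_(True)
    energy = pair_energy(pos, edge_index)
    # Forces are the exact negative gradient of the energy, which keeps them
    # conservative; create_graph=True lets force errors enter a training loss
    # and be differentiated again with respect to model parameters.
    (grad,) = torch.autograd.grad(energy, pos, create_graph=True)
    return energy, -grad
```

Because the toy energy depends only on interatomic displacement vectors, it is translation invariant and the resulting forces sum to zero, a basic sanity check for conservative force predictions.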
A key advantage of our TorchInductor approach is that it can accommodate dynamic tensor shapes, while many other neural network compilers enforce static tensor shapes. In MLIP-based calculations—molecular dynamics in particular—both the number of atoms and the number of pairs of neighboring atoms input to the model can vary significantly over the course of a simulation. TorchInductor generates a single set of optimized kernels that are applicable to the entire range of relevant dynamic tensor shapes.
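In user code, dynamic-shape compilation is requested through PyTorch's standard interface. The sketch below uses a trivial stand-in function and the "eager" backend so that it runs without a C++ toolchain; the framework itself uses the default Inductor backend, which emits optimized kernels:

```python
import torch

def total_energy(per_atom_energy: torch.Tensor) -> torch.Tensor:
    # Stand-in for a model whose input sizes (numbers of atoms and of
    # neighbor pairs) vary from MD step to MD step.
    return per_atom_energy.square().sum()

# dynamic=True asks the compiler to produce size-generic code rather than
# specializing on the first shape it sees; backend="eager" keeps this sketch
# free of a compiler toolchain dependency.
compiled = torch.compile(total_energy, dynamic=True, backend="eager")

for n_atoms in (24, 100, 5184):  # varying system sizes, as during a simulation
    compiled(torch.ones(n_atoms))
```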
to this graph to generate optimized kernels for both the forward pass, which computes all of the model's predictions, and the train-time backward pass, which computes gradients of the loss function with respect to model parameters.
To enable training on large datasets,13–16 we have added distributed multi-GPU training to the NequIP framework. We adopt a Distributed Data Parallel (DDP) paradigm46 via PyTorch Lightning.47 Instead of using PyTorch's standard DDP implementation, which employs a gradient bucketing strategy that overlaps computation with communication, we implemented a custom DDP approach that performs communication only after the full backward pass has completed. This design choice allows TorchInductor to compile the entire backward pass, which can reduce overhead and introduce additional opportunities for fusion, rather than splitting it into multiple subgraphs with communication operations in between. Practically, omitting gradient bucketing has not imposed meaningful limitations on our MLIP training, but a gradient bucketing strategy for larger models could be reintroduced in the future if needed.
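The deferred-communication idea can be sketched as follows. This is a minimal illustration, not the framework's implementation; the function name is ours, and the sketch degrades to a no-op in single-process runs:

```python
import torch
import torch.distributed as dist

def sync_gradients(model: torch.nn.Module) -> None:
    # Average gradients across ranks only after the full backward pass has
    # finished, so the compiler can treat the backward pass as one graph.
    # No-op when torch.distributed is not initialized.
    if not (dist.is_available() and dist.is_initialized()):
        return
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad.div_(world_size)

# Sketch of the training step:
#   loss.backward()          # full, uninterrupted backward pass
#   sync_gradients(model)    # single deferred communication phase
#   optimizer.step()
```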
Our existing LAMMPS integration provides the pair_nequip and pair_allegro plugins to load and call TorchScript-compiled models directly inside LAMMPS. This approach has subsequently been adopted by other deep learning interatomic potential frameworks such as MACE,32 SevenNet,22 and BAMBOO.49 The pair_allegro plugin also uses the Kokkos performance portability library50 to reduce overhead in the interface between Allegro and LAMMPS by eliminating CPU–GPU data transfers wherever possible.31 This feature also allows LAMMPS to use GPU-aware MPI communication.
Here, we introduce the first end-to-end use of AOT Inductor (AOTI) compilation for MLIPs as a replacement for TorchScript export. AOTI is a variant of TorchInductor specialized to export compiled PyTorch models as self-contained native code for use in non-Python environments. AOTI compilation allows us to realize the performance benefits of MLIP compilation, including advanced fusion, inlining, and loop/layout reordering, in high-performance codes like LAMMPS without resorting to complicated, high-overhead solutions such as embedded Python interpreters.51,52 Models compiled with AOTI can also still be used in Python-based codes like the Atomic Simulation Environment (ASE)53 and TorchSim.54
To overcome the fusion limitations of TorchInductor, we implemented fused custom kernels for the Allegro tensor product using Triton, a cross-platform language for writing high-performance compute kernels.37 Because Triton is the native backend for TorchInductor, the kernel can be seamlessly integrated with our compilation infrastructure. We implemented kernels for the tensor product and its first derivative to accelerate the inference-time prediction of energy and its first derivatives, like forces. Importantly, custom Triton kernels integrate seamlessly with AOTI, ensuring that our custom kernel can be exported for use in LAMMPS and other inference software. The additional second-derivative kernels required for training are left for future work and would also straightforwardly integrate with train-time TorchInductor compilation.
Our custom kernel takes advantage of the mathematical structure of the tensor product by using a compressed sparse format59 to represent the combined tensor of Wigner 3-j contraction coefficients described in previous work.31 It also avoids materializing intermediate outer products between the input tensors, which results in significant memory savings. We direct the reader to ref. 60 for more implementation details.
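The sparse-contraction idea can be illustrated in plain PyTorch (the actual kernel is written in Triton and also fuses surrounding operations; shapes and names here are hypothetical). Storing only the nonzero coefficient entries lets the contraction gather inputs, multiply, and scatter-add into the output without ever forming the full outer product:

```python
import torch

def sparse_tensor_product(x, y, idx, coeff, out_dim):
    """Contract z_bk = sum over nonzero (i, j, k): C_ijk * x_bi * y_bj.

    idx is a (3, nnz) index tensor and coeff a (nnz,) value tensor listing
    only the nonzero entries of the coefficient tensor C, so the dense
    (batch, I, J) outer product is never materialized.
    """
    i, j, k = idx
    contrib = coeff * x[:, i] * y[:, j]                # (batch, nnz)
    out = x.new_zeros(x.shape[0], out_dim)
    out.scatter_add_(1, k.expand(x.shape[0], -1), contrib)
    return out
```

A correctness check against the equivalent dense `einsum` contraction confirms that the sparse path computes the same result while touching only the nonzero coefficients.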
The SPICE 2 dataset contains systems with a range of total charge states, which means that a given atomic configuration can have multiple possible DFT energy and force labels depending on the total charge used to compute them. To avoid this degeneracy, the Allegro models in this work were trained on a subset of the SPICE 2 data limited to systems with neutral total charge. This training strategy is less restrictive than that of Kovács et al.,19 who trained only on systems with neutral per-atom formal charge, but more restrictive than the approach of Eastman et al.,23 who trained on the full SPICE 2 dataset.
Using this subset of SPICE 2, we prepared three Allegro models with different cost–accuracy trade-offs, which we refer to as “small”, “medium”, and “large”. The main differences between these models are the maximum rotation order ℓmax, the number of tensor features, and the number of Allegro layers, which are listed in Table 1. These hyperparameters control the cost and complexity of the Allegro tensor products that mix the models' equivariant features.30 All models employ a radial cutoff rmax of 6.0 Å, and the SI gives a full description of the hyperparameters and training procedure.

Table 1 Hyperparameters of the three Allegro models: maximum rotation order ℓmax, number of tensor features, and number of Allegro layers
Designed to assess the generalizability of MLIPs trained on SPICE 2, the SPICE 2 test set contains systems distinct from those in the training set, and it is divided into four categories: small ligands, large ligands, pentapeptides, and dimer pairs.23 To understand how our models (trained on neutral-total-charge data) perform on atomic configurations with different total- and per-atom-charge states, we evaluate the Allegro models on three subsets of the SPICE 2 test set: an unrestricted set that includes all systems in the test set regardless of charge state, a subset restricted to systems with neutral total charge (consistent with the data used to train the Allegro models), and a more restrictive subset of systems that contain only atomic species with neutral formal per-atom charge. Fig. 1 shows a consistent trend across all system types and charge schemes where accuracy improves from the small to medium to large model. While energy errors are large on the unfiltered test set, which includes molecular charge states that are entirely absent from the training data, the models perform well on both the neutral total charge subset and the neutral per-atom formal charge subset. Force errors on the unfiltered test set are only slightly larger than on the two filtered subsets. This disparity in performance across the three subsets is attributed to the fact that the present models have no explicit mechanism for representing global charge state. Approaches that incorporate the total charge as an input feature17 or that employ global charge-equilibration schemes that enforce global charge conservation67,68 could be integrated in future work to allow the models to distinguish geometrically similar configurations with different global charges.
Fig. 1 Allegro model results for the SPICE 2 test set.23 Energy and force mean absolute error (MAE) of the three Allegro models on three different subsets of the SPICE 2 test set: (1) the original unrestricted test set, (2) a subset containing systems with neutral total charge, and (3) a subset of systems that contain only atomic species with neutral per-atom formal charge. The error metrics are grouped by system type: large ligands, small ligands, peptides, and dimers.
The second benchmark dataset, TorsionNet 500, also considered by Kovács et al.19 and Eastman et al.,23 assesses an MLIP's ability to predict the relative energy differences between molecular conformers. It contains 12 000 atomic configurations scanning through different values of the torsion angle along one bond in each of 500 drug-like molecules.66 A key property of a torsion angle scan is the height of its energy barrier, which controls the likelihood of a conformational change involving that torsion angle. Table 2 shows that the three Allegro models have barrier height errors comparable to those of Kovács et al.19 and Eastman et al.23 (see Table 1 in the SI for a detailed comparison of model accuracies and hyperparameters).
Table 2 Torsion barrier height errors of the three Allegro models on the TorsionNet 500 benchmark

| Metric | Small model | Medium model | Large model |
|---|---|---|---|
| Barrier MAE [meV] | 22.75 | 15.42 | 11.37 |
| Barrier RMSE [meV] | 32.36 | 21.77 | 15.38 |
Additional accuracy benchmarks for a held-out 5% portion of the SPICE 2 training set, unused in our training, are presented in the SI.
We next benchmark training performance in a distributed multi-GPU setting. We performed the evaluation on both AMD MI250X GPUs using the Frontier supercomputer at the Oak Ridge Leadership Computing Facility (OLCF) and NVIDIA A100 GPUs using the Perlmutter supercomputer at the National Energy Research Scientific Computing Center (NERSC). To reflect realistic training conditions, all metrics, callbacks, logging, and other sources of overhead typical of production training runs were included in the timings.
Fig. 2 presents the training performance as the number of MPI ranks increases, where each MPI rank runs a fixed per-rank local batch size of eight atomic configurations. In this scaling regime, the effective total global batch size increases with the number of ranks, which allows the entire dataset to be processed (a single “epoch”) in a smaller number of stochastic gradient descent steps, reducing the wall time per epoch. Distributed training scales well up to 128 ranks. At 256 ranks, it remains reasonably efficient, achieving parallel efficiencies of 40% and 24% on the AMD and NVIDIA systems, respectively, when running with torch.compile and using the 2-rank performance as a baseline. We observe that training with torch.compile achieves speedups of 2.4–5.0 times over TorchScript on MI250X and A100 (80 GB) GPUs across the range of MPI ranks investigated.
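Parallel efficiency in this weak-scaling regime is per-rank throughput relative to the baseline run. A small worked example with hypothetical throughput numbers (the function name and figures are ours, for illustration only):

```python
def weak_scaling_efficiency(throughput, ranks, base_throughput, base_ranks=2):
    # Per-rank throughput relative to the baseline run; 1.0 means ideal
    # weak scaling (total throughput grows linearly with rank count).
    return (throughput / ranks) / (base_throughput / base_ranks)

# Hypothetical throughputs in training configurations per second: under ideal
# scaling, 256 ranks would deliver 128x the 2-rank throughput.
efficiency = weak_scaling_efficiency(throughput=1280.0, ranks=256,
                                     base_throughput=20.0)  # -> 0.5, i.e. 50%
```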
We used the nequip-package tool to portably archive the models, which were trained on Frontier, before compiling them on each inference platform using nequip-compile. All inference benchmarks were performed in LAMMPS using our Kokkos-based integration for Allegro models.
First, for a single AMD MI250X, NVIDIA A100 (80 GB), and NVIDIA H100 GPU, we compare molecular dynamics performance for both small and large systems using: (1) small-molecule benchmarks (Fig. 3), introduced by Eastman et al.,23 consisting of 25-, 50- and 100-atom organic molecules in vacuum, and (2) periodic boxes of water (Fig. 4) at a density of 1 g cm⁻³ and temperature of 300 K with system sizes ranging from 24 to 5184 atoms.
Fig. 3 Single-rank inference acceleration on small molecule systems. The inference speed in LAMMPS of the small, medium, and large Allegro models deployed using TorchScript, AOTI, or AOTI with the optimized tensor product kernel (AOTI + custom TP) on small molecule systems ranging from 25 to 100 atoms without periodic boundary conditions,23 for AMD MI250X, NVIDIA A100 (80 GB), and NVIDIA H100 GPUs. The inference speeds were averaged over three runs with different random seeds for the initial velocities generated by LAMMPS. One MPI rank was used. Note that one MPI rank corresponds to one of the two available graphics compute dies on an MI250X device.
For all systems, AOTI exhibits a consistent performance advantage over TorchScript, and further speedups are achieved with the custom Triton tensor product. The speedup factor tends to be greater for the larger Allegro models that have correspondingly more expensive tensor product operations for the custom kernel to accelerate. Across all hardware and models, the small molecule examples (Fig. 3) see accelerations ranging from 4–18 times when comparing the TorchScript baseline to AOTI with the optimized TP; the same range of speedups is seen in the water simulations (Fig. 4).
Fig. 4 demonstrates that AOTI improves memory efficiency and that the combination of AOTI and the optimized tensor product kernel dramatically increases the maximum number of atoms that can fit on one rank. While the large model runs out of memory at water system sizes of around 100 atoms on all GPU types when using TorchScript, AOTI compilation extends this limit to 288 atoms. The custom kernel then dramatically increases the maximum system size that the large model can run on one rank to 4320 atoms on an MI250X (specifically, one of the two available graphics compute dies), 4320 atoms on an A100 (80 GB), and 5184 atoms on an H100.
We also conducted multi-GPU strong scaling tests with the medium model on two all-atom water-solvated biomolecules: the 23 558-atom dihydrofolate reductase (DHFR) protein system and the 408 609-atom cellulose sugar polymer system, both from the Amber20 benchmark.69 Fig. 5 shows excellent strong scaling up to 256 nodes on both benchmark systems when run on Frontier (AMD MI250X GPUs) and Perlmutter (NVIDIA A100 40 GB GPUs). Inference throughput on Perlmutter begins to saturate between 256 and 512 nodes for AOTI-compiled models (both with and without the custom kernel), while runs on Frontier continue to improve up to 512 nodes, where the AOTI compiled model using the custom kernel achieves over 200 timesteps per second. We observe a maximum throughput of 205 timesteps per second on the 24k atom DHFR system using 256 nodes of Perlmutter with AOTI and the custom kernel. The largest observed speedup over TorchScript is 10.9 times for DHFR on 64 nodes of Perlmutter. For node counts where TorchScript inference fits in memory and can be run, we observe that the combination of AOTI and the kernel achieves an average speedup of 6.6 times on Frontier and 6.9 times on Perlmutter.
Fig. 5 Strong scaling of the medium Allegro model on biomolecular systems. The molecular dynamics throughput of the Allegro model deployed using TorchScript, AOTI, or AOTI with the optimized tensor product kernel (AOTI + custom TP) on the 23 558-atom dihydrofolate reductase (DHFR) and 408 609-atom cellulose systems from the Amber20 benchmark69 is measured on a number of nodes ranging from 1 to 512 on Frontier (AMD MI250X) and Perlmutter (NVIDIA A100 (40 GB)). Note that there are twice as many logical GPU devices and corresponding MPI ranks on an MI250X node (8) than on an A100 node (4). No TorchScript result is shown for cellulose on A100 GPUs because TorchScript required more GPU memory than was available on these nodes.
Similar to Fig. 4, the combination of AOTI compilation and the custom tensor product kernel dramatically improves memory efficiency. For the larger cellulose system, the TorchScript-compiled model requires too much GPU memory for the 40 GB A100s and fails to run even when using 512 nodes. The AOTI and custom kernel approach, in contrast, runs successfully on as few as 16 nodes, which corresponds to approximately 6384 atoms per GPU.
The NequIP framework implements a number of techniques that could be applied to further accelerate the Allegro models presented here, and an in-depth study of their strengths and weaknesses is an important topic for future work. In particular, the framework supports per-edge-type cutoff radii that depend on the atomic types of both the central and the neighbor atoms, which we used in ref. 31. Because the computational cost of Allegro scales linearly with the total number of edges, tuning per-edge-type cutoff radii can have a significant impact on speed. The framework also supports the faster but less numerically accurate TensorFloat32 floating-point precision for intermediate computations, which we did not enable in this work. Additionally, we observed that the throughput gains from the custom TP kernel vary across hardware platforms, and improving the performance of our custom kernels on MI250X systems is an area of ongoing optimization.
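The TensorFloat32 option mentioned above corresponds to standard PyTorch switches; a short sketch (these global flags are real PyTorch APIs, shown here as illustration rather than the framework's configuration mechanism):

```python
import torch

# Opt in to TensorFloat32 for float32 matrix multiplications on supported
# NVIDIA GPUs: faster, but with reduced mantissa precision in intermediates.
torch.set_float32_matmul_precision("high")

# Equivalent lower-level switches for matmul and cuDNN convolutions:
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```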
Finally, the updated NequIP framework is an extensible foundation for the development of MLIP architectures and training strategies. The modular structure it enforces facilitates the development of extension packages like Allegro and ensures that they can easily leverage both the computational acceleration techniques described here and the broader set of tools, integrations, and infrastructure offered by the framework.
Importantly, the accelerations introduced in this work are not specific to Allegro and can be applied to any model implemented within the NequIP framework. With the exception of the custom TP kernel specific to the Allegro TP, all other accelerations, including the compilation and distributed training infrastructure, are fully general. The NequIP model, while not presented in this work for clarity of focus, also benefits from these performance advances. More broadly, other recently developed architectures27,70,71 can also take advantage of these optimizations, provided they comply with the framework's infrastructural and compilation interfaces. This design makes it possible for future MLIP architectures built on top of the framework to benefit from these performance improvements with minimal additional engineering effort.
More recently, these infrastructural advancements have enabled the development of NequIP and Allegro foundation potentials pretrained on OMat24,15 and fine-tuned on MPTrj13 and Alexandria,14 now publicly available at https://www.nequip.net/ and highlighted on the Matbench Discovery benchmark.72 A detailed assessment of these models will be published elsewhere; we mention them here simply to underscore the scalability of the NequIP framework to large datasets beyond the SPICE 2 case study.
The software versions used in this work, nequip v0.7.0, allegro v0.4.0, and pair_nequip_allegro v0.7.0, are archived on Zenodo as a curated software release (DOI: https://doi.org/10.5281/zenodo.18211443). These correspond to the tagged releases available on GitHub at https://github.com/mir-group/nequip, https://github.com/mir-group/allegro, and https://github.com/mir-group/pair_nequip_allegro. Up-to-date development versions of the software remain available at these repositories. Scripts used for data processing, model training, benchmarking, and analysis are archived on Zenodo (DOI: https://doi.org/10.5281/zenodo.18193586) and are publicly available at https://github.com/mir-group/nequip-infra-paper-scripts. Details of and links to the datasets used in this work, as well as model training and inference details and additional experiments and benchmarks, are provided in the supplementary information. See DOI: https://doi.org/10.1039/d5dd00423c.
Footnotes

† Co-first author.

This journal is © The Royal Society of Chemistry 2026