Lingyu Konga,
Nima Shoghia,
Guoxiang Huc,
Pan Lib and
Victor Fung*a
aSchool of Computational Science and Engineering, Georgia Institute of Technology, USA. E-mail: victorfung@gatech.edu
bSchool of Electrical and Computer Engineering, Georgia Institute of Technology, USA
cSchool of Materials Science and Engineering, Georgia Institute of Technology, USA
First published on 23rd July 2025
Geometric machine learning models such as graph neural networks have achieved remarkable success in recent years in chemical and materials science research for applications such as high-throughput virtual screening and atomistic simulations. The success of these models can be attributed to their ability to effectively learn latent representations of atomic structures directly from the training data. Conversely, this also results in high data requirements for these models, hindering their application to data-sparse problems, which are common in this domain. To address this limitation, there is growing development in the area of pre-trained machine learning models which have learned general, fundamental, geometric relationships in atomistic data, and which can then be fine-tuned to much smaller application-specific datasets. In particular, models which are pre-trained on diverse, large-scale atomistic datasets have shown impressive generalizability and flexibility to downstream applications, and are increasingly referred to as atomistic foundation models. To leverage the untapped potential of these foundation models, we introduce MatterTune, a modular and extensible framework that provides advanced fine-tuning capabilities and seamless integration of atomistic foundation models into downstream materials informatics and simulation workflows, thereby lowering the barriers to adoption and facilitating diverse applications in materials science. In its current state, MatterTune supports a number of state-of-the-art foundation models such as ORB, MatterSim, JMP, MACE, and EquiformerV2, and hosts a wide range of features including a modular and flexible design, distributed and customizable fine-tuning, broad support for downstream informatics tasks, and more.
Almost all GNN models of this class operate on the general principle of taking the atomic identity and structure of a molecule or crystal as inputs, and mapping this geometric information to the corresponding property labels. In the case of GNNs, this information is encoded in the node and edge attributes of a graph, which are then processed through message passing operations to yield latent atom-level and system-level embeddings or representations. From this embedding, the property labels can then be obtained via non-message-passing layers, commonly referred to as a readout function or an output head. Starting from seminal examples such as SchNet10 and CGCNN,11 subsequent models have introduced increasingly sophisticated advancements, including many-body interactions,12–14 equivariant features,15–17 and transformer-like architectures,18,19 though nearly all still follow the same general principles described above. Although these GNNs have become increasingly accurate and scalable with these improvements, they are inherently data driven and invariably perform poorly when training data is sparse. This limitation prevents their widespread application to the majority of materials science problems, where datasets may number in the hundreds of samples or fewer.
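To make this pattern concrete, the sketch below illustrates the generic message-passing and readout scheme in schematic form; it does not correspond to any specific model discussed here, and the layer sizes and architectural choices are arbitrary placeholders.

```python
# Schematic message-passing GNN with a readout head (illustrative only).
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.message_mlp = nn.Sequential(nn.Linear(2 * dim + 1, dim), nn.SiLU())
        self.update_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.SiLU())

    def forward(self, h, edge_index, edge_dist):
        # h: (num_atoms, dim); edge_index: (2, num_edges); edge_dist: (num_edges, 1)
        src, dst = edge_index
        msg = self.message_mlp(torch.cat([h[src], h[dst], edge_dist], dim=-1))
        agg = torch.zeros_like(h).index_add_(0, dst, msg)   # sum messages per receiving atom
        return self.update_mlp(torch.cat([h, agg], dim=-1))

class SimpleGNN(nn.Module):
    def __init__(self, num_elements=100, dim=64, num_layers=3):
        super().__init__()
        self.embed = nn.Embedding(num_elements, dim)
        self.layers = nn.ModuleList(MessagePassingLayer(dim) for _ in range(num_layers))
        self.readout = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, 1))

    def forward(self, atomic_numbers, edge_index, edge_dist):
        h = self.embed(atomic_numbers)        # atom-level embeddings from atomic identity
        for layer in self.layers:
            h = layer(h, edge_index, edge_dist)
        return self.readout(h).sum()          # pooled, system-level property prediction
```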
A rapidly growing area of research towards greater data efficiency of GNNs is the pre-training of GNNs. This approach generally involves first training these models on large upstream datasets (the “pre-training” stage) before continuing the training on the smaller downstream dataset(s) of interest (the “fine-tuning” stage). This process enables the models to learn robust, transferable representations without requiring the final property labels. Two general strategies exist for pre-training: supervised and unsupervised. In supervised pre-training, GNNs are initially trained on explicit property labels which are sufficiently generalizable to downstream needs. Properties such as energies, forces, and sometimes stresses from quantum mechanical calculations have been found to be particularly effective for pre-training,20–22 among others.23,24 In unsupervised pre-training, unlabeled data are used instead, and the model is trained on objectives such as a contrastive loss or a denoising loss.25–28 While pre-training can be applied to datasets of any size and complexity, including artificial ones, there is a growing effort to pre-train GNNs on datasets which attempt to cover the full range of chemical and materials space. Once pre-trained, these models should, in theory, be generalizable to downstream datasets of arbitrary complexity and properties. We term these models “atomistic foundation models” (FMs) (Table 1). A growing number of studies has shown that atomistic FMs can significantly improve accuracy over GNNs trained from scratch (i.e. without pre-training), as well as reduce data requirements by an order of magnitude or more.20–22
Model | Release year | Num. params | Dataset size | Training obj. |
---|---|---|---|---|
MACE-MP-0 | 2023 | 4.69M | 1.58M | Energy, forces, stress |
GNoME | 2023 | 16.2M | 16.2M | Energy, forces |
MACE-MPA-0 | 2024 | 9.06M | 12M | Energy, forces, stress |
MatterSim-v1 | 2024 | 4.55M | 17M | Energy, forces, stress |
ORB-v1 | 2024 | 25.2M | 32.1M | Denoising + energy, forces, stress |
JMP-S | 2024 | 30M | 120M | Energy, forces |
JMP-L | 2024 | 235M | 120M | Energy, forces |
EqV2-S | 2024 | 31.2M | 1.58M | Energy, forces, stress |
EqV2-M | 2024 | 86.6M | 102M | Energy, forces, stress |
DPA3-v1-MPtrj | 2025 | 3.37M | 1.58M | Energy, forces |
DPA3-v1-OpenLAM | 2025 | 8.18M | 143M | Energy, forces |
Here, it is important to note the parallel development of universal interatomic potentials (UIPs), which are models trained to be broadly applicable force fields for systems of arbitrary complexity on compositions across the periodic table.12,21,28–32 Whereas UIPs are intended to be used out-of-the-box for one specific task (as force fields), pre-trained models require an additional fine-tuning step before they can be used, but are applicable to tasks beyond force fields. Nevertheless, the distinction between UIPs and pre-trained models can become blurred as in some cases, the training procedures and datasets for UIPs can be identical to those used in the creation of pre-trained atomistic models, namely when the pre-training objective is on energies and forces. Consequently, one can note that while not all pre-trained models can serve as UIPs, in general most UIPs should serve as capable pre-trained models.
Despite the demonstrated potential of atomistic FMs, general adoption by the broader scientific community is currently lacking, in large part due to the limitations of the available software infrastructure for their usage. While there is, to date, ample infrastructure for UIPs, it does not extend to tasks beyond their use as force fields, such as materials property prediction. There is also limited standardization across different UIPs and atomistic FMs, resulting in a different package being needed for each model, which hampers benchmarking and workflow development. Finally, there is limited support for customizing the fine-tuning procedure, which is often hard-coded as a black-box method. As such, these existing packages do not currently fulfill the role of servicing atomistic FMs for general-purpose usage.
To address these limitations, we developed a modular, integrated, and user-friendly framework, called MatterTune, for fine-tuning atomistic FMs to be applied to a broad range of materials science applications. The development of MatterTune follows several general design principles:
(1) Highly generalizable and flexible abstractions that enable systematic extension while enforcing the necessary standardization.
(2) Modular framework decoupling models, data, algorithms, and applications, enabling a high degree of adaptability and customizability for different materials informatics tasks.
(3) Intuitive and user-friendly interfaces that simplify model fine-tuning and their application to downstream tasks.
So far, MatterTune has integrated several open-source atomistic FMs including JMP,20 ORB,28 EquiformerV2,18 MatterSim-V1,21 and MACE.33 We fine-tuned these models using the MatterTune platform and evaluated them on representative materials informatics tasks, including molecular dynamics simulations, property screening, and materials discovery, demonstrating the performance and reliability of the MatterTune platform and its capabilities for data-efficient learning.
• Data abstraction: the purpose of data abstraction is to provide unified support for as many input formats as possible for training and inference. We develop a minimal data abstraction that defines a dataset as a mapping from integer indices to atomic structures represented in a standardized format. Given that different atomistic FMs require varying input formats, we choose the Atoms class in the ASE package34 as the standardized format. Individual atomistic FMs can then implement the necessary transformations from Atoms objects to their respective input formats. Since the Atoms format can store all structural and label information needed for training and prediction, this abstraction is broadly applicable.
• Property abstraction: we introduce a property schema system that formally separates the specification of physical properties from their model implementation, allowing users to focus solely on the types of properties they require from the model without concerning themselves with the details of how these properties are realized in FMs. This separation also enables the framework to handle both established properties like energy and forces as well as custom properties defined by users, and enforces type safety and physical constraints (e.g., energy conservation) in a property-specific manner.
• Backbone abstraction: the purpose of backbone abstraction is to provide a set of unified functional interfaces for using different backbones, regardless of the completely different architectures of the various FMs. Key functions include a forward function, which handles forward propagation during prediction, and a data-conversion function, which converts input structures from the ASE Atoms format into the format required by the model (a minimal code sketch of these abstractions follows this list). This abstraction ensures simplicity and consistency in model usage while enabling each model to retain its native internal representations and implementations.
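To make these abstractions concrete, the following is a minimal, hypothetical sketch of how they could be expressed in code; the names PropertyConfig, BackboneBase, atoms_to_data, and model_forward are illustrative placeholders and do not necessarily match MatterTune's actual API.

```python
# Hypothetical sketch of the data, property, and backbone abstractions.
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Sequence

import ase  # ase.Atoms serves as the standardized structure format

@dataclass
class PropertyConfig:
    """Property abstraction: what to predict, decoupled from how it is predicted."""
    name: str                   # e.g. "energy", "forces", or a user-defined property
    per_atom: bool = False
    conservative: bool = False  # e.g. forces obtained as -dE/dx to enforce energy conservation

class BackboneBase(ABC):
    """Backbone abstraction: a unified interface over heterogeneous FM architectures."""

    def __init__(self, properties: Sequence[PropertyConfig]):
        self.properties = list(properties)

    @abstractmethod
    def atoms_to_data(self, atoms: ase.Atoms) -> Any:
        """Convert the standardized ase.Atoms format into the model's native inputs."""

    @abstractmethod
    def model_forward(self, data: Any) -> dict:
        """Run the model and return a dict keyed by the requested property names."""

    def predict(self, atoms: ase.Atoms) -> dict:
        return self.model_forward(self.atoms_to_data(atoms))
```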
• The data subsystem follows the aforementioned data abstraction and handles conversion between various materials science formats and a universal internal representation used by the MatterTune framework. Currently we have provided built-in support for common formats like XYZ, JSON, and ASE databases, which can be readily expanded to include additional formats as needed.
• The model subsystem is designed around the backbone and property abstractions, allowing users to simply specify the type of atomistic FM and the desired properties to predict in order to declare and construct a model. All implementation details, such as loading checkpoints, constructing output heads, handling input data, and performing forward passes, are automatically managed by MatterTune, respecting the original implementation of each atomistic FM. This approach enables users to leverage atomistic FMs without requiring in-depth knowledge of their underlying architecture and implementation.
• The trainer subsystem handles the general training, validation, and checkpointing of FMs. A key design choice is made to integrate the training subsystem with PyTorch Lightning,35 a widely used and feature-rich training platform. This enables a range of critical capabilities while maintaining a clean separation of concerns between the model implementation and the training process. The integration of Lightning's abstractions allows MatterTune to maintain a modular and extensible architecture while still providing a simple, high-level interface for end users. Currently, MatterTune provides support for various optimizers and learning rate schedulers on the Lightning platform. It also includes implementations of data preprocessing statistics, exponential moving average, and other fine-tuning techniques, allowing users to select them freely. In addition, Lightning's callback features allow for ample flexibility for implementing more advanced fine-tuning strategies.
• The property prediction subsystem provides the means for users to access the trained FMs in an easy-to-use and intuitive manner, enabling quick integration into downstream materials tasks. This is accomplished by providing flexible wrapper classes for both general and targeted use-cases, without having to deal with model architecture complexities or Lightning internals. As a starting point, we have implemented a calculator wrapper that inherits from the ASE34 calculator interface, enabling direct use with established molecular dynamics and structure optimization algorithms available within the ASE package (a minimal sketch of this pattern follows this list). For high-throughput material property prediction tasks, we have designed a property predictor wrapper around atomistic FMs, enabling batched prediction in parallel. We are also working on implementing interfaces for additional materials informatics simulation and computation software, such as LAMMPS.
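For illustration, the sketch below shows one way a fine-tuned FM could be exposed through the ASE calculator interface as described above; FineTunedFMCalculator and predict_fn are hypothetical names, not MatterTune's actual classes, and predict_fn stands in for whatever callable the fine-tuned model exposes.

```python
# Hedged sketch: wrapping a fine-tuned FM as an ASE calculator so that it can be
# dropped into ASE's MD and structure optimization workflows. predict_fn is
# assumed to return energy (eV) and forces (eV/Angstrom) for an ase.Atoms object.
from ase.calculators.calculator import Calculator, all_changes
from ase.optimize import BFGS

class FineTunedFMCalculator(Calculator):
    implemented_properties = ["energy", "forces"]

    def __init__(self, predict_fn, **kwargs):
        super().__init__(**kwargs)
        self.predict_fn = predict_fn

    def calculate(self, atoms=None, properties=("energy",), system_changes=all_changes):
        super().calculate(atoms, properties, system_changes)
        out = self.predict_fn(self.atoms)      # e.g. {"energy": float, "forces": (N, 3) array}
        self.results["energy"] = out["energy"]
        self.results["forces"] = out["forces"]

# Usage: relax a structure with the fine-tuned model driving the forces.
# atoms.calc = FineTunedFMCalculator(predict_fn=model.predict)
# BFGS(atoms).run(fmax=0.02)
```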
• A variety of optimizers and learning rate schedulers. So far, MatterTune supports the Adam,36 AdamW,37 and SGD optimizers, as well as various learning rate schedulers including linear, step, exponential, cosine, and reduce-on-plateau. In addition, MatterTune enables customization of optimizers and learning rate schedulers based on user needs. We support the application of different learning rates to different parts of the model, a technique that has been shown to be beneficial in the fine-tuning of models of the JMP series. Furthermore, we support combining multiple learning rate schedulers to achieve more sophisticated dynamic adjustments, such as cosine annealing with warm-up (a minimal sketch follows this list).
• Training generalization techniques such as Exponential Moving Average (EMA). Although the ablation studies in ref. 20 suggest that EMA does not significantly improve fine-tuning performance on datasets such as MD17: aspirin, MD22: stachyose, QM9: Δε, MatBench: MP E form, QMOF: band gap, and SPICE: solvated amino acids, we believe these datasets are still not small enough in scale. In our experiments described in Section 3, where models are fine-tuned using only 30 data points, we observed that EMA actually helps improve both the stability and performance of fine-tuning.
• A comprehensive normalization system that handles both standard statistical normalization and physics-informed normalization schemes. Fine-tuning of FMs may involve multiple targets—for example, training a force field model typically involves three targets: energy, forces, and stress. Proper normalization helps balance the loss scales of these targets, ensuring that the training process converges more smoothly without being dominated by any single target. MatterTune currently supports not only standard normalization methods such as mean-std and root-mean-square, but also composition-based normalization using element-wise regression. Additionally, the normalization system is designed to be composable, allowing multiple normalization schemes to be applied in sequence.
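As an illustration of the optimizer and scheduler options described in the first item above, the following sketch uses plain PyTorch to assign different learning rates to a pre-trained backbone and a freshly initialized output head, and to chain a linear warm-up with a cosine schedule. The model layout shown is a toy stand-in, not MatterTune's internal structure.

```python
# Illustrative sketch: per-module learning rates plus a warm-up + cosine schedule.
import torch
import torch.nn as nn

# Toy stand-in for a fine-tuned FM: a pre-trained "backbone" plus a new output head.
model = nn.ModuleDict({
    "backbone": nn.Linear(16, 16),
    "output_head": nn.Linear(16, 1),
})

optimizer = torch.optim.AdamW(
    [
        {"params": model["backbone"].parameters(), "lr": 1e-5},     # smaller LR for pre-trained layers
        {"params": model["output_head"].parameters(), "lr": 1e-4},  # larger LR for the new head
    ],
    weight_decay=0.01,
)

# Chain two schedulers: 1000 warm-up steps, then cosine annealing.
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.1, total_iters=1000)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50_000)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[1000]
)

# Inside the training loop: optimizer.step(); scheduler.step()
```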
We first note that MatterTune maintains strict adherence to the original implementations of each integrated atomistic FM. However, many models do not provide openly available details on fine-tuning parameters and techniques for specific tasks. Given that hyperparameter tuning is both complex and computationally expensive, we did not perform exhaustive hyperparameter optimization for the benchmarks shown below. As a result, we cannot guarantee that each atomistic FM achieves its best possible performance on these tasks. Nonetheless, for tasks with publicly accessible reference results, we have made a dedicated effort to reproduce them.
The performance of JMP-S, ORB-V2, and Equiformer-31M-mp fine-tuned on Matbench is shown in Table 2. In our current tests, we perform fine-tuning on fold 0 of each dataset. In the table, we also list the fine-tuning performance on Matbench from the original JMP-S paper, as well as the best performance on the Matbench leaderboards. It should be noted that, since model fine-tuning can be a delicate process, variations in fine-tuning methods and hyperparameter configurations can lead to significant differences in the results. In our experiments, all models across all tasks were fine-tuned using the same configuration, so we cannot guarantee that the results reported with MatterTune represent the optimal performance of the models. Nonetheless, by comparing the fine-tuning results of JMP-S on MatterTune with those reported in the original paper, we found that we reproduced the reported accuracy on most tasks, with the only exception being formation energy, where our fine-tuning result was inferior to the original. Moreover, of the three fine-tuned models (JMP-S, ORB-V2, and Equiformer-31M-mp), the best model on each task significantly outperforms the current leading models trained from scratch on the Matbench leaderboard.
Task (units) | Best on leaderboards (mean) | JMP-S-baseline (fold0) | JMP-S (fold0) | ORB-V2 (fold0) | EqV2-31M-mp (fold0) |
---|---|---|---|---|---|
Dielectric (unitless) | 0.271 | 0.133 | 0.146 | 0.142 | 0.111 |
JDFT2D (meV per atom) | 33.19 | 20.72 | 19.42 | 21.44 | 23.45 |
Log GVRH (log10 GPa) | 0.067 | 0.06 | 0.059 | 0.053 | 0.056 |
Log KVRH (log10 GPa) | 0.049 | 0.044 | 0.033 | 0.046 | 0.046 |
MP E_form (meV per atom) | 17.0 | 13.6 | 25.2 | 9.4 | 24.5 |
MP gap (eV) | 0.156 | 0.119 | 0.119 | 0.093 | 0.098 |
Perovskites (eV per unit cell) | 0.027 | 0.029 | 0.029 | 0.033 | 0.026 |
Although Matbench provides a train-test split for evaluating fine-tuned models, the two splits are drawn from the same original dataset distribution, which prevents them from accurately reflecting the models' performance on unseen new materials. To address this, we further performed high-throughput property predictions on approximately 404,763 new structures provided by the GNoME dataset (release 2024-11-21),39 which are distinct from the original Materials Project dataset. For demonstrative purposes, we also screened for structures with band gaps between 1 eV and 3 eV and compared the classification performance with the ground truth. The results are shown in Table 3, and indicate that the ORB-V2 model, which achieved the highest test accuracy on the band gap task in Matbench, also delivered the best performance in band gap property screening on the GNoME dataset.
Metric | JMP-S | ORB-V2 | Equiformer-31M-mp |
---|---|---|---|
MAE (eV) | 0.052 | 0.039 | 0.044 |
Accuracy (%) | 98.16 | 98.80 | 98.53 |
Recall (%) | 86.25 | 90.33 | 89.92 |
F1 | 0.826 | 0.884 | 0.861 |
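As a sketch of the screening step described above (the array names are hypothetical, not MatterTune code), the 1-3 eV band-gap window classification and the metrics reported in Table 3 can be computed as follows.

```python
# Band-gap window screening and classification metrics (illustrative sketch).
import numpy as np
from sklearn.metrics import mean_absolute_error, accuracy_score, recall_score, f1_score

def screen_band_gap(pred_gap: np.ndarray, true_gap: np.ndarray,
                    low: float = 1.0, high: float = 3.0) -> dict:
    # Structures whose predicted gap lies in [low, high] eV are flagged as candidates.
    pred_in_window = (pred_gap >= low) & (pred_gap <= high)
    true_in_window = (true_gap >= low) & (true_gap <= high)
    return {
        "MAE (eV)": mean_absolute_error(true_gap, pred_gap),
        "Accuracy": accuracy_score(true_in_window, pred_in_window),
        "Recall": recall_score(true_in_window, pred_in_window),
        "F1": f1_score(true_in_window, pred_in_window),
    }
```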
To demonstrate the few-shot capability of atomistic FMs in general, we followed the same experimental setup described in the original MatterSim paper. Out of the 1000 available ambient water structures,40–42 we uniformly sampled 100 structures based on the energy distribution as a validation set and used the rest as the 900-sample dataset. We then randomly selected 30 structures from the 900-sample dataset and repeated them until a new dataset comprising 900 entries was obtained; we refer to this as the 30-sample dataset. We fine-tuned various FMs on both the 900-sample and the 30-sample datasets and evaluated the models' mean absolute errors on the validation set. The results are shown in Table 4.
Metric | MatterSim V1-1M (900-sample) | MatterSim V1-1M (30-sample) | JMP-S (900-sample) | JMP-S (30-sample) | ORB-V3 Omat-conserv. (900-sample) | ORB-V3 Omat-conserv. (30-sample) | EqV2-31M mp (900-sample) | EqV2-31M mp (30-sample) | MACE-MP-0a medium (900-sample) | MACE-MP-0a medium (30-sample) |
---|---|---|---|---|---|---|---|---|---|---|
MAE energy (meV per atom) | 1.21 | 1.20 | 3.06 | 5.65 | 2.50 | 1.15 | 2.76 | 4.98 | 1.19 | 3.01 |
MAE forces (meV Å−1) | 38.37 | 40.65 | 19.98 | 30.17 | 33.73 | 34.04 | 22.41 | 35.21 | 48.65 | 51.24 |
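For illustration, the few-shot dataset construction described above could be realized roughly as follows; the file name, the exact energy-based validation sampling, and the random seed are assumptions made for the sake of the example.

```python
# Sketch of the few-shot data construction: hold out 100 structures for validation,
# draw 30 structures from the remaining 900, and tile them back up to 900 entries.
import numpy as np
from ase.io import read

structures = read("ambient_water.xyz", index=":")            # 1000 labeled structures (assumed file)
energies = np.array([s.get_potential_energy() for s in structures])

order = np.argsort(energies)                                  # spread the validation set over the energy range
val_idx = order[np.linspace(0, len(order) - 1, 100).astype(int)]
val_set = set(val_idx.tolist())
train_pool = [structures[i] for i in range(len(structures)) if i not in val_set]

rng = np.random.default_rng(0)
few_shot = [train_pool[i] for i in rng.choice(len(train_pool), size=30, replace=False)]
train_30 = (few_shot * 30)[:900]                              # repeat the 30 samples up to 900 entries
```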
We further conducted 200 ps molecular dynamics (MD) simulations of a water structure with 192 atoms per unit cell at 298 K using the FMs fine-tuned on the 30-sample dataset described above. The MD simulations employ the NPT dynamics implemented in ASE, run without external stress and with the cell kept fixed. The results of the radial distribution function (RDF) analysis are shown in Fig. 2. Interestingly, we observed that although all five models performed well in terms of MAE, as shown in Table 4, the results of the MD simulations varied significantly. MatterSim-V1-1M fits the experimental data best, and the results from MACE-MP-0a-medium, EquiformerV2-31M-mp, and ORB-V3-Omat-Conserv remain broadly acceptable, though they show some discrepancies. In contrast, the JMP-S model, which has the lowest force MAE on the validation set, produces the RDF curve with the largest deviation. This observation echoes the statement in ref. 43, which cautions that evaluating models solely based on force MAE can lead to misleading conclusions. One possible explanation is that MatterSim-V1-1M, MACE-MP-0a-medium, and ORB-V3-Omat-Conserv are energy-conserving force-field models, whereas JMP-S employs direct force prediction, which makes its MD simulations less stable. However, EquiformerV2-31M-mp also uses direct force prediction, yet still yields reasonably accurate RDF results.
Fig. 2 Oxygen–oxygen radial distribution functions of ambient water at 298 K obtained from foundation-model-based MD simulations. Black dots represent experimental references from ref. 44 and 45. All models shown are versions fine-tuned on the 30-sample dataset. The legend also reports the root-mean-square error between each model's MD-derived RDF curve and the experimental data.
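A minimal sketch of the MD setup described above is given below. It assumes a 192-atom water cell stored in a hypothetical file and a fine-tuned FM already wrapped as an ASE calculator (e.g. as in the earlier sketch); with pfactor=None, ASE's NPT class applies only the Nosé-Hoover thermostat, so the cell stays fixed as in the simulations reported here.

```python
# Sketch of the 200 ps ambient-water MD run with ASE's NPT dynamics.
from ase import units
from ase.io import read
from ase.io.trajectory import Trajectory
from ase.md.npt import NPT
from ase.md.velocitydistribution import MaxwellBoltzmannDistribution

atoms = read("water_192.xyz")              # 192-atom ambient water cell (assumed file name)
atoms.calc = fine_tuned_calculator         # a fine-tuned FM wrapped as an ASE calculator (assumed)

MaxwellBoltzmannDistribution(atoms, temperature_K=298)
dyn = NPT(                                 # note: ASE's NPT requires an upper-triangular cell matrix
    atoms,
    timestep=0.5 * units.fs,
    temperature_K=298,
    externalstress=0.0,
    ttime=25 * units.fs,
    pfactor=None,                          # no barostat: the cell stays fixed
)
dyn.attach(Trajectory("water_md.traj", "w", atoms).write, interval=100)
dyn.run(400_000)                           # 400 000 steps x 0.5 fs = 200 ps
# The oxygen-oxygen RDF shown in Fig. 2 can then be computed from the saved trajectory.
```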
During these MD simulations of the ambient-water system, we assessed whether the FMs integrated and trained within MatterTune incur any systematic runtime overhead in either training or prediction relative to their original implementations. The results show that MatterTune introduces no systematic overhead. Full numerical details are provided in Section S2 of the ESI.†
Metric | EqV2 S DeNS baseline | EqV2 S DeNS | MatterSimV1 5M baseline | MatterSimV1 5M | ORB-V2 baseline | ORB-V2 |
---|---|---|---|---|---|---|
F1 ↑ | 0.815 | 0.792 | 0.862 | 0.842 | 0.880 | 0.866
DAF ↑ | 5.042 | 4.718 | 5.852 | 5.255 | 6.041 | 5.395
Prec ↑ | 0.771 | 0.756 | 0.895 | 0.876 | 0.924 | 0.899
Acc ↑ | 0.941 | 0.925 | 0.959 | 0.949 | 0.965 | 0.957
MAE ↓ | 0.036 | 0.035 | 0.024 | 0.024 | 0.028 | 0.027
R2 ↑ | 0.788 | 0.780 | 0.863 | 0.848 | 0.824 | 0.817
Metric definitions: all metrics listed in the table follow the same definitions as their counterparts on the Materials Discovery leaderboard; upward (↑) and downward (↓) arrows denote that higher or lower values are preferred, respectively. F1: harmonic mean of precision and recall for stable/unstable materials classification; DAF: discovery acceleration factor, measuring how much better the models classify thermodynamic stability compared to random guessing; Prec: precision of classifying thermodynamic stability; Acc: accuracy of classifying thermodynamic stability; MAE: mean absolute error of predicted vs. DFT convex hull distance; R2: coefficient of determination.
In Fig. 3, the representations of four different atomistic FMs are visualized using the t-SNE algorithm. The results show that the MatterSim and JMP models clearly capture the clustering of elements within the same group. This is to be expected from a chemical perspective, and suggests some level of transferable chemical knowledge is trained into these models. In contrast, the clustering for Equiformer and ORB models is less pronounced, especially for ORB. These results highlight the remarkable diversity in the internal representations of current atomistic FMs, which may arise due to differences in training objectives, training data, and model architectures.
We further visualized the representation spaces of the fine-tuned models (Fig. 4). We selected the MP_E_form dataset because the fine-tuning results of the three models on this dataset showed notable differences (as detailed in Table 2). The visualizations reveal apparent similarities between the fine-tuned JMP-S and Equiformer models, which cluster around element type, whereas ORB shows no clear clustering, similar to the non-fine-tuned case. This pattern is consistent with JMP-S and Equiformer reaching similar MP_E_form errors, while ORB differs markedly, achieving a notably lower error (Table 2). However, the underlying link between these differences in representation and the fine-tuning performance is still unclear and deserves further investigation, which can be readily facilitated with MatterTune.
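For reference, a sketch of how such element-level representation maps can be produced is shown below; get_atom_embeddings is a hypothetical helper standing in for whichever mechanism a given FM exposes for extracting per-atom embeddings, and the t-SNE settings are arbitrary.

```python
# Sketch: average per-atom embeddings by element and project them to 2D with t-SNE.
from collections import defaultdict

import numpy as np
from sklearn.manifold import TSNE

def element_embeddings(model, structures):
    per_element = defaultdict(list)
    for atoms in structures:
        emb = get_atom_embeddings(model, atoms)        # (N_atoms, dim) array, assumed helper
        for z, e in zip(atoms.get_atomic_numbers(), emb):
            per_element[int(z)].append(e)
    zs = sorted(per_element)                           # atomic numbers present in the dataset
    X = np.stack([np.mean(per_element[z], axis=0) for z in zs])
    return zs, X

# zs, X = element_embeddings(model, structures)
# coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(X)
# coords[i] gives the 2D position of element zs[i] for plotting (cf. Fig. 3 and 4).
```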
To summarize, MatterTune is an effort to standardize and unify atomistic FMs while providing user-friendly interfaces for fine-tuning and applications. MatterTune also serves as a playground for experimenting and applying advanced fine-tuning algorithms to atomistic FMs. By lowering the barrier to the use of atomistic FMs, we aim to make them broadly applicable across a wide range of materials science challenges, especially in materials simulations and informatics. Furthermore, we hope that the MatterTune platform can provide a foundation for exploring how to fine-tune atomistic FMs more effectively to meet the increasingly demanding requirements of materials science research.
Footnote |
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d5dd00154d |