Open Access Article

Davide Bidoggia,†*a Nataliia Manko,†ab Maria Peressi,a and Antimo Marrazzo*b

a Dipartimento di Fisica, Università di Trieste, I-34151 Trieste, Italy. E-mail: davide.bidoggia@units.it
b Scuola Internazionale Superiore di Studi Avanzati (SISSA), I-34136 Trieste, Italy. E-mail: amarrazz@sissa.it
First published on 9th April 2026
Crafting neural-network interatomic potentials (NNIPs) remains a complex task, demanding specialized expertise in both machine learning and electronic-structure calculations. Here, we introduce AiiDA-TrainsPot, an automated, open-source, and user-friendly workflow that streamlines the creation of accurate NNIPs by orchestrating density-functional-theory calculations, data augmentation strategies, and classical molecular dynamics. Our active-learning strategy leverages on-the-fly calibration of committee disagreement against ab initio reference errors to ensure reliable uncertainty estimates. We use electronic-structure descriptors and dimensionality reduction to analyze the efficiency of this calibrated criterion, and show that it minimizes both false positives and false negatives when deciding what to compute from first principles. AiiDA-TrainsPot has a modular design that supports multiple NNIP backends, enabling both the training of NNIPs from scratch and the fine-tuning of foundation models. We demonstrate its capabilities through automated training campaigns targeting pristine and defective carbon allotropes, including amorphous carbon, as well as structural phase stability in monolayer WxMo1−xTe2 alloys.
Machine learning (ML) and neural-network interatomic potentials (NNIPs) have revolutionized the field by bridging the gap between ab initio accuracy and computational efficiency.2 These models are trained on high-fidelity density functional theory (DFT) data, allowing them to capture complex quantum mechanical interactions with far greater accuracy than traditional empirical potentials. Unlike fixed functional forms, ML potentials can flexibly generalize to diverse chemical environments while maintaining computational costs significantly lower than AIMD. This breakthrough has enabled large-scale and long-timescale simulations of materials with near-quantum accuracy, essentially extending the scope of what can be simulated from first principles.
In recent years, the accuracy and data efficiency of ML interatomic potentials have improved remarkably, often at the price of increased algorithmic complexity. In this context, NNIPs based on equivariance have emerged as particularly promising, and several architectures have been proposed, including NequIP,3 Allegro,4 MACE5 and the Point Edge Transformer (PET).6 More recently, foundation models leveraging large-scale pretraining on diverse chemical datasets have provided a path towards improved transferability, data efficiency, and accuracy across a wide range of materials and molecular systems.7,8
As of today, the training of an accurate NNIP remains a complex and time-consuming task. First, high-quality NNIPs require training datasets of the order of thousands of supercell single-point ab initio calculations with hundreds of atoms in the unit cell; millions if foundation models for the entire periodic table are targeted. Most notably, the accuracy and extrapolation capabilities of NNIPs hinge on the careful choice of the training structures, which have to be sufficiently diverse and abundant to avoid overfitting. While foundation models promise transferability across chemical space, fine-tuning them to high accuracy for a given material family still requires curated datasets.
In this work, we introduce AiiDA-TrainsPot, an automated, open-source, and user-friendly framework for training neural-network interatomic potentials (NNIPs). It integrates automated workflows for DFT calculations with neural-network training and classical MD to systematically explore the potential-energy surface (PES) via random distortions, strain, interfaces, neutral vacancies, and trajectories across a range of temperatures and pressures.
Widely used platforms such as DP-GEN,9 SchNetPack,10 FLARE,11 AL4GAP,12 ASPARAGUS,13 and others implement or support active-learning strategies—including on-the-fly learning during molecular dynamics, uncertainty-driven sampling, and ensemble-based selection criteria—thereby demonstrating their effectiveness for ML interatomic potentials. Active learning is nowadays widely used, also thanks to key contributions such as the extrapolation-grade framework for moment tensor potentials14 and on-the-fly force-field learning during MD.15,16 More recent efforts have started incorporating these methods into standardized open-source workflow implementations, also improving exploration strategies to broaden configurational coverage, including uncertainty-driven dynamics for sampling17 and the integration of enhanced sampling (e.g., metadynamics) into active-learning loops.18 As of today, an ecosystem of active-learning tools for ML interatomic potentials exists (including DP-GEN, SchNetPack, and others), spanning different model families and exploration protocols. At the same time, most existing implementations remain closely coupled to specific potential families or training engines, while typically lacking automatic restart strategies and systematic provenance tracking across the full training pipeline—including dataset generation, model selection, and first-principles calculations. In addition, while committee-based methods are widespread in the community, they are not guaranteed to provide an accurate quantitative proxy for the true deviation from the reference quantum mechanical calculations.19,20 Here we make a complementary contribution that fills this gap: a provenance-tracked, restartable, and automated workflow spanning all stages of training and simulation, which calibrates committee disagreement against ab initio errors on the fly to provide quantitative selection thresholds and uncertainty estimates.
We provide in Table S1 of the SI a detailed feature-by-feature comparison between AiiDA-TrainsPot and the widely used active-learning frameworks DP-GEN, SchNetPack, FLARE, and AL4GAP.
Indeed, AiiDA-TrainsPot is conceived as a modular and extensible automation layer rather than just as an active-learning protocol. It offers (i) code-agnostic modularity across quantum engines, ML architectures, and MD codes; (ii) an extensive suite of automated dataset-augmentation strategies (defects, slabs, clusters, substitutions, alloys); and (iii) a calibrated committee-disagreement scheme that provides quantitative uncertainty estimates, including in production runs. The architecture deliberately decouples sampling strategies from model training and first-principles backends, facilitating the future integration of additional quantum engines, ML frameworks, and MD protocols within the same automated workflow.
We validate the method through fully automated NNIP training and fine-tuning campaigns on a diverse set of carbon allotropes, including amorphous carbon, and on structural phase stability in monolayer WxMo1−xTe2 alloys, achieving state-of-the-art accuracy and data efficiency. Combined with AiiDA's reproducibility infrastructure, these features position AiiDA-TrainsPot as both an empowering tool for domain scientists and a robust platform for future foundation-model development.
The active learning loop continues until errors on energy, forces and stress tensor are below a user-defined threshold. We emphasize that AiiDA-TrainsPot supports multiple use cases depending on the available input data, which goes beyond the generation of a NNIP from scratch and includes the fine-tuning of foundation models (see Sec. 2.2). In the following, we discuss in detail each step of the workflow.
The workflow takes as input a set of user-provided structures, each determined by boundary conditions (periodic vs. open), cell parameters, atomic species, and atomic positions. The number and diversity of input structures should reflect the target applications: for example, the study of temperature-dependent properties of diamond might require a single input structure, the development of an NNIP for all carbon allotropes would probably include at least all known crystalline prototypes of carbon, while universal (a.k.a. foundation) models for the entire periodic table might require tens of thousands of input structures, which could be obtained from computational materials databases such as Materials Cloud,21 Materials Project,22 or crystallographic databases such as ICSD,23 COD,24,25 and MPDS.26–28 While the user is responsible for providing these fundamental structures, the workflow progresses with automatic data augmentation to enhance dataset diversity without requiring exhaustive manual curation.
The initial structures then undergo a series of automated manipulations. All manipulations can be controlled through customizable parameters to tailor the augmentation process according to specific user needs; we group them in the following categories:
• Supercells: initial structures are replicated, aiming to ensure cells larger than a minimum threshold value (default: 18 Å, corresponding to twice the MACE default receptive field) while keeping the total atom count below a user-defined maximum (default: 450 atoms).
• Random distortions: atomic positions are perturbed with random displacements, where the magnitude follows a uniform distribution up to a user-defined fraction (default: 30%) of the original nearest-neighbor distance. This introduces configurations away from equilibrium while preventing unphysically close atoms, which could lead to large forces that are difficult—and potentially uninformative—for NNIPs to learn.
• Strain: strain can be applied to crystal structures by rescaling lattice parameters by a factor randomly sampled from a uniform distribution within a user-defined range (default: from −20% to +60%). This is key for predicting elastic properties.
• Vacancies: vacancies are created by removing atoms at randomly selected sites, to learn about defect energetics and local relaxations around missing atoms. By default, vacancies are introduced in 30% of the randomly distorted structures, with 2 atoms removed for each configuration.
• Clusters: atomic clusters are constructed by assembling atoms with a nearest-neighbor distance between 1 and a user-configurable factor times a user-specified distance (default: 1.5 Å). This creates non-periodic environments that are useful for training potentials capable of describing isolated molecules or clusters and, more generally, surface or edge terminations.
• Slabs: slabs are created by cutting bulk supercells along selected crystallographic directions (default: (111), (110) and (100)) ensuring a minimum slab thickness (default: 10 Å) unless a maximum number of atoms (default: 450) is reached. Those structures allow the NNIPs to learn and predict surface energetics, surface-specific forces, reconstructions and other relaxation phenomena.
• Isolated atoms: single-atom configurations are included to establish reference energies for accurate calculations of the dissociation limit. These structures are computed using the same DFT settings as the rest of the dataset to ensure consistency, though we note that the procedure does not take into account the actual magnetic configuration for some atomic species.
• Atomic substitutions: for multi-species systems, randomly selected atoms of different elements are swapped to create chemical disorder and explore different local chemical environments. The user can define both the fraction of previously generated structures to undergo substitutions (default: 20%) and the number of swaps per structure as a fraction of total atom count (default: 20%). This helps to explore various chemical environments, substitutional defects, and atomic site preferences, ensuring robustness across different chemical compositions within the same structural motif.
• Alloys: alloy configurations are generated by randomly mixing atomic species (specified via alloy_species), while optionally keeping some species fixed (specified via fixed_species). If no target compositions are specified, the workflow samples random alloy configurations with concentrations spanning the full range from 0 to 1.
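The augmentation operations above can be sketched in a few lines of Python. This is an illustrative stand-in, not the actual DatasetAugmentationWorkChain (which operates on ASE Atoms objects); the function names and the toy structure representation (a list of Cartesian triples plus a 3×3 cell) are hypothetical:

```python
import random

def rattle(positions, max_frac=0.3, nn_dist=1.5):
    """Randomly displace each atom by up to max_frac * nearest-neighbor
    distance (mirrors the workflow's default of 30%)."""
    out = []
    for x, y, z in positions:
        d = [random.uniform(-1, 1) for _ in range(3)]          # random direction
        norm = sum(c * c for c in d) ** 0.5 or 1.0
        r = random.uniform(0, max_frac * nn_dist)              # random magnitude
        out.append((x + r * d[0] / norm, y + r * d[1] / norm, z + r * d[2] / norm))
    return out

def make_vacancies(positions, n_remove=2):
    """Remove n_remove randomly chosen atoms (default 2, as in the text)."""
    keep = sorted(random.sample(range(len(positions)), len(positions) - n_remove))
    return [positions[i] for i in keep]

def strain_cell(cell, low=-0.20, high=0.60):
    """Rescale lattice vectors by a factor sampled in [1+low, 1+high]."""
    f = 1.0 + random.uniform(low, high)
    return [[f * c for c in vec] for vec in cell], f
```

In the real workflow these operations are chained, with vacancies introduced only in a fraction of the rattled structures.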
The presented dataset augmentation techniques are applied by default; however, users can choose to apply only a subset of them or even skip the dataset-augmentation stage altogether. After this stage, the resulting dataset integrates structures of different dimensionalities and boundary conditions: fully periodic (bulk), partially periodic (surfaces, nanowires, or 2D materials), and non-periodic (molecules, clusters) configurations. However, since DFT and MD calculations are performed in full periodic boundary conditions, for all structures that are non-periodic along at least one direction the workflow ensures the presence of an appropriate vacuum buffer (default: 15 Å thick) along those directions, in order to eliminate spurious interactions between periodic images.
Each structure $\mathcal{S}$ in the augmented dataset is labeled through DFT calculations to obtain high-fidelity reference values for energies, forces, and stress tensors. We use the compact notation $y(\mathcal{S})$ to represent these computed properties and, in subsequent sections, denote specific quantities of interest as $y^{\alpha}(\mathcal{S})$, where α ∈ {E, F, σ}. Ultimately, ab initio calculations directly determine an upper bound for the accuracy and precision of the trained NNIPs. While the overall accuracy is typically limited by considerations of computational efficiency and resources, precision can be substantially improved by enforcing well-converged calculations and a consistent choice of key simulation parameters over cells of different sizes and dimensions. In this context, even though the workflow allows users full control over the DFT level of theory, by default AiiDA-TrainsPot enforces the use of well-established simulation protocols originally introduced for high-throughput calculations. In particular, PBE pseudopotentials and cutoff parameters are given by version 1.3 of the branch of the Standard Solid State Pseudopotentials (SSSP) library optimized for precision,29–31 while the reciprocal-space k-point density and smearing follow the stringent protocol defined by Nascimento et al.32 By default, the workflow does not incorporate additional van der Waals (vdW) corrections at the DFT level, since NNIPs would in principle require rather large radial cutoffs to accurately learn long-range interactions from the training data. However, for systems where dispersion forces are critical (e.g., layered materials or molecular crystals), users can enable empirical vdW corrections (such as Grimme-D2 or D3) during subsequent MD simulations.33
The labeled dataset is used to train a committee of M NNIPs $\{\Phi_j\}_{j=1}^{M}$, each with identical architecture but initialized with different random seeds. Prior to training, all structures are systematically partitioned into three subsets, ensuring representative sampling across different structural motifs while maintaining similar distributions of atomic environments:
- Training set (default: 80%): used for model parameter optimization through gradient-based learning;
- Validation set (default: 10%): used for hyperparameter tuning, early stopping decisions, and selection of optimal checkpoints during training;
- Test set (default: 10%): reserved exclusively for final model evaluation, providing an unbiased assessment of generalization performance.
Throughout the active learning iterations, each structure remains in its initially assigned set, ensuring that the test set remains completely independent from the training and validation sets for reliable performance assessment. The training is performed using either MACE or Metatrain34 with default hyperparameters, though these can be fully customized by the user of AiiDA-TrainsPot to suit specific requirements.
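A deterministic, hash-based assignment is one simple way to guarantee that each structure keeps its subset across active-learning iterations; the sketch below illustrates the idea under that assumption (the workflow's actual partitioning logic may differ):

```python
import hashlib

def assign_split(structure_id, fractions=(0.8, 0.1, 0.1)):
    """Deterministically map a structure identifier to train/validation/test.
    Hashing the identifier keeps each structure in the same subset across
    iterations, so the test set stays independent of training."""
    h = int(hashlib.sha256(str(structure_id).encode()).hexdigest(), 16)
    u = (h % 10**8) / 10**8  # pseudo-uniform value in [0, 1)
    if u < fractions[0]:
        return "train"
    if u < fractions[0] + fractions[1]:
        return "validation"
    return "test"
```

Because the assignment depends only on the identifier, re-running the split after new structures are added never moves an old structure between subsets.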
At each iteration, one NNIP (Φ1) is randomly selected from the committee to perform the MD simulations. By default, a set of 20 structures is randomly sampled from the initial augmented dataset to serve as starting configurations. This selection strategy deliberately uses structures from the initial dataset rather than from the most recent active-learning iteration, to ensure sampling of a configuration space that is not biased by previous explorations and remains representative of the user's target application domain. Users can customize this selection by specifying either the number of structures to sample or an explicit list of starting configurations.
All MD simulation parameters are fully customizable, including ensemble type, temperature, pressure, simulation time, and timestep. By default, simulations run for 10 ns with a 1 fs timestep, while temperatures and pressures range from 0 to 1000 K and −5 to 5 kbar respectively, systematically sampling diverse thermodynamic conditions. AiiDA-TrainsPot automatically selects an isothermal-isobaric (NPT) ensemble with barostats acting only on periodic directions for bulk systems, or an isochoric-isothermal (NVT) ensemble for non-periodic systems. Additionally, even though dispersion corrections are disabled by default, users can activate Grimme's D2 van der Waals correction35 during MD simulations, with coupling parameters automatically selected based on the atomic species present in the system.
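The ensemble-selection rule can be illustrated as follows; the treatment of partially periodic systems (barostat only along the periodic axes) is our reading of the text, and the function name is hypothetical:

```python
def choose_ensemble(pbc):
    """Pick the MD ensemble from the periodicity flags (pbc: three booleans).
    Fully or partially periodic systems get NPT with the barostat coupled
    only to the periodic directions; non-periodic systems get NVT."""
    periodic_axes = [axis for axis, periodic in zip("xyz", pbc) if periodic]
    if periodic_axes:
        return {"ensemble": "NPT", "barostat_axes": periodic_axes}
    return {"ensemble": "NVT", "barostat_axes": []}
```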
To minimize correlations between sampled configurations, trajectory frames are extracted at regular intervals (default: 1 ns), ensuring the collection of a statistically uncorrelated dataset that efficiently represents the accessible regions of the PES.
Each structure $\mathcal{S}$ sampled from the MD trajectories is evaluated by all NNIPs of the committee, and we quantify the prediction uncertainty through the committee-disagreement metric $\tau_\alpha(\mathcal{S})$, which is calculated separately for each property α (energy, forces—averaged over all the atoms and components—and the stress tensor):
$$\tau_\alpha(\mathcal{S}) = \sqrt{\frac{1}{M}\sum_{j=1}^{M}\left[y_j^{\alpha}(\mathcal{S}) - \bar{y}^{\alpha}(\mathcal{S})\right]^{2}} \qquad (1)$$
where $y_j^{\alpha}(\mathcal{S})$ represents the prediction of property α from potential Φj for structure $\mathcal{S}$, and $\bar{y}^{\alpha}(\mathcal{S})$ denotes the mean prediction across all committee members. Structures exhibiting disagreement beyond a specific threshold τthrα are flagged as uncertain and prioritized for quantum mechanical labelling. The error $\varepsilon_\alpha(\mathcal{S})$ with respect to the ab initio labels is
$$\varepsilon_\alpha(\mathcal{S}) = \left|\bar{y}^{\alpha}(\mathcal{S}) - y^{\alpha}_{\mathrm{DFT}}(\mathcal{S})\right| \qquad (2)$$
and is not, at least in principle, strictly correlated with the committee disagreement, although empirical evidence suggests they might be linearly related19,20 (as we show and discuss in more detail in Sec. 2.3 through a validation study on carbon allotropes). After linear regression, the user-defined error-tolerance threshold εthrα (default: 1 meV per atom, 100 meV Å−1, and 1 meV Å−3 for energy, forces, and stress tensor, respectively) is transformed into an equivalent disagreement threshold τthrα, which serves as the selection criterion for new structures, via τthrα = aαεthrα, where the coefficients aα are the slopes determined from the fit. The calibration, performed at each iteration on the already-labeled structures by comparing the errors with $\tau_\alpha$ computed with the latest-generation potentials, ensures that the uncertainty-quantification mechanism remains effective throughout the iterative learning process, as the NNIP committee's overall accuracy improves with each active-learning cycle. Notably, the calibrated committee disagreement $\tau_\alpha$ can be used not only in the active-learning scheme for the selection of the worst-predicted structures, but also in production runs, where it provides an estimate of the uncertainty of the NNIP predictions.
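The calibration step can be sketched as follows: the committee disagreement of eq. (1), an origin-constrained least-squares slope aα fitted on already-labeled structures, and the conversion of a user tolerance εthrα into τthrα. The regression details here are an assumption for illustration:

```python
def committee_disagreement(preds):
    """Eq. (1): standard deviation of the committee predictions for one
    property of one structure (preds: one value per committee member)."""
    mean = sum(preds) / len(preds)
    return (sum((p - mean) ** 2 for p in preds) / len(preds)) ** 0.5

def calibrate_slope(disagreements, errors):
    """Slope a_alpha of tau = a * eps through the origin, fitted on
    labeled structures (illustrative reconstruction of the calibration)."""
    num = sum(t * e for t, e in zip(disagreements, errors))
    den = sum(e * e for e in errors)
    return num / den

def disagreement_threshold(a_alpha, eps_thr):
    """Convert a user error tolerance into the equivalent tau threshold."""
    return a_alpha * eps_thr
```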
For large exploration sets, computational constraints often necessitate limiting the number of structures labeled in each active-learning iteration. However, selecting only the structures with the highest disagreement values may not be optimal, as these could represent configurations far from the original dataset and from the PES region of interest to the user. While such structures cannot be discarded a priori—since they may represent poorly predicted configurations that are nonetheless close to relevant regions of the PES—a more balanced selection strategy is needed. Therefore, we randomly select structures from those that exceed the threshold τthrα for any property α. This approach naturally prioritizes structures closer to the original dataset, as they are more likely to be represented in the configurations extracted from MD trajectories, while still capturing the most uncertain predictions. By default, a maximum of 1000 new structures is selected at each active-learning iteration. The active-learning loop continues until one of two termination criteria is met: either all structures exhibit disagreement below the specified threshold τthrα, indicating the achievement of the desired confidence across the configuration space of interest, or the maximum number of iterations L is reached. The final output of AiiDA-TrainsPot includes the latest NNIP committee $\{\Phi_j\}$, the full ab initio dataset, and quantitative RMSE performance metrics for each potential.
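The balanced selection rule can be sketched as a random draw among all structures whose disagreement exceeds the calibrated threshold for any property; the data layout (a dict of per-property disagreements) is hypothetical:

```python
import random

def select_for_labelling(taus, tau_thr, max_new=1000, seed=None):
    """Randomly pick up to max_new structures whose disagreement exceeds
    the calibrated threshold for any property alpha.
    taus: {structure_id: {property: disagreement}}
    tau_thr: {property: threshold}"""
    rng = random.Random(seed)
    flagged = [sid for sid, t in taus.items()
               if any(t[a] > tau_thr[a] for a in t)]
    rng.shuffle(flagged)          # uniform random draw among flagged structures
    return flagged[:max_new]      # cap at the per-iteration budget
```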
The workflow architecture follows a hierarchical design of nested AiiDA WorkChains (Fig. 2), enabling both end-to-end automation and selective execution of individual components. The top-level TrainsPotWorkChain coordinates five specialized sub-processes that correspond to the major stages of a training campaign:
• DatasetAugmentationWorkChain for generation of highly-diverse training structures
• AbInitioLabellingWorkChain for quantum mechanical calculations
• TrainingWorkChain for training NNIP committees
• ExplorationWorkChain for exploration of the PES based on MD
• EvaluationCalculation for committee-based error estimation and identification of structures for which the NNIP yields predictions with low accuracy.
A key advantage of our implementation strategy and of the use of AiiDA as a workflow engine is the extensive reuse of existing AiiDA plugins, minimizing duplication of software and ensuring robustness through well-tested components. The AbInitioLabellingWorkChain calls PwBaseWorkChain from the AiiDA-Quantum ESPRESSO plugin,39,40 while the ExplorationWorkChain leverages LammpsBaseWorkChain from the AiiDA-LAMMPS plugin.41 For specialized functionality not available in existing plugins, we develop custom components such as the DatasetAugmentationWorkChain, which implements various structure manipulation techniques through dedicated AiiDA calculation functions based on the Atomic Simulation Environment (ASE).42 Similarly, the TrainingWorkChain interfaces with neural-network and ML training codes and can be configured by the user to choose the training engine, either MaceTrainWorkChain for MACE or MetatrainWorkChain for Metatrain, handling all preprocessing, training, and postprocessing steps in a fully automated manner. We note that Metatrain currently implements both the PET architecture and other ML architectures not all based on neural networks, such as Sparse Gaussian Approximation Potentials (GAP)43 and Behler–Parrinello neural networks with SOAP features (SOAP BPNN).44
To optimize computational efficiency, AiiDA-TrainsPot implements parallel execution strategies within three main WorkChains (AbInitioLabellingWorkChain, TrainingWorkChain, and ExplorationWorkChain). Multiple DFT calculations, neural-network training sessions, and MD simulations are submitted concurrently, with results collected and analyzed collectively before proceeding to the next workflow stage.
This modular design provides several advantages beyond efficient execution. First, it offers maximum flexibility through multiple entry points, allowing users to bypass specific stages depending on their needs (see Fig. 3). For instance, users who would like to leverage access to existing datasets of labeled structures can proceed directly to training. The workflow also supports fine-tuning pretrained models, including foundation models.
Second, the architecture enables selective execution of individual components, permitting users to run specific stages independently. This is particularly valuable for scenarios such as structure labelling with DFT calculations without proceeding to the training phase, or for evaluating the accuracy of existing potentials on new trajectories using the committee disagreement.
Finally, the architecture maintains extensibility through AiiDA's plugin system, facilitating future integration with additional quantum engines, classical MD codes, or emerging ML frameworks. Individual components like MaceTrainCalculation, MetaTrainCalculation or EvaluationCalculation can be used as standalone tools outside the high-level WorkChain, making AiiDA-TrainsPot both a comprehensive platform and a flexible toolkit for specialized tasks in interatomic potential development.
To efficiently manage large datasets within the AiiDA framework, we introduce a custom AiiDA data type, PESData, a subclass of aiida.orm.Data. This specialized data type enables the storage and manipulation of extensive sets of atomic structures along with their associated properties, while maintaining full compatibility with AiiDA's provenance tracking system.
PESData offers significant advantages over existing data types such as TrajectoryData, which is limited to structures containing identical numbers of atoms. Our implementation can seamlessly handle datasets with varying numbers of atoms per structure, making it ideal for diverse training sets that include bulk materials, surfaces, clusters, and defect structures. Beyond structural information and labeled properties, PESData can also store custom metadata including references to the original data sources, committee evaluation results, accuracy metrics, computational parameters used for labelling, and other application-specific information.
For performance optimization, the dataset is stored in the AiiDA repository as an HDF5 file using the h5py library.45 The class implements Python iterators to read data in chunks, enabling efficient handling of large datasets without overwhelming memory resources. This approach is crucial when working with training sets containing thousands of structures with hundreds of atoms each, and to maintain high performance and usability throughout the NNIP development workflow.
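The chunked-iteration access pattern can be illustrated with a minimal in-memory stand-in; the real PESData class reads chunks from an HDF5 file via h5py, which this sketch deliberately omits, and the class name is hypothetical:

```python
class ChunkedDataset:
    """Minimal stand-in for PESData's chunked iteration: structures with
    varying atom counts are read a chunk at a time, so a large dataset
    never has to fit in memory all at once."""

    def __init__(self, structures, chunk_size=100):
        # each entry may have a different number of atoms
        self._structures = structures
        self.chunk_size = chunk_size

    def __len__(self):
        return len(self._structures)

    def iter_chunks(self):
        """Yield successive slices of at most chunk_size structures."""
        for i in range(0, len(self._structures), self.chunk_size):
            yield self._structures[i:i + self.chunk_size]
```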
We conduct two independent validation runs, both initiated from the same set of 48 carbon structures primarily sourced from the Materials Cloud 2D and 3D databases21 (44 structures), supplemented with 4 selected low-dimensional structures (nanoribbons and fullerenes) from the dataset developed by Drautz et al.33
The first run, designated as run A or “fast exploration”, is designed to assess the workflow's capability to explore the PES and to study the convergence of the active-learning scheme for increasingly large datasets. The run starts by augmenting structures near equilibrium configurations (see Methods Sec. 4 for more details), followed by MD over a wide range of temperatures, from 0 up to 5000 K. This also tests the model's ability to capture extreme conditions such as melting and the formation of amorphous phases. As we want to explore convergence for rather large training datasets, which involve up to 10^4 single-point DFT calculations, we employ here the AiiDA-QuantumESPRESSO fast protocol32 for DFT calculations.39,40
The second run, designated as run B or “accuracy and data-efficiency”, focuses, instead, on achieving rather accurate predictions for equilibrium and near-equilibrium conditions with small training datasets (around 2 × 103 calculations). The settings of data augmentation and the MD runs are refined for more efficient exploration, including smaller distortions of atomic positions, lower temperatures (up to 1000 K) and larger range of pressures (see Methods Sec. 4). Run B employs the AiiDA-QuantumESPRESSO stringent protocol, which uses more accurate pseudopotentials and stricter convergence criteria. While this differs from the fast protocol used in run A, it ensures higher numerical precision and reduced noise in the reference dataset.
The fast-exploration campaign runs for 10 iterations of active learning; Fig. 4 reports the evolution of the model performance in cold colors. The top panels track the RMSE for energy, forces, and stress tensor components across the training, validation, and test sets, as a function of the active-learning step and, in parallel, the dataset size, which ranges from the initial 1177 structures (iteration 1) to 9537 structures in the final iteration. For energy and stress tensor components, we observe a consistent decrease in prediction errors as the active learning progresses. Interestingly, errors on forces increase from the first to the second iteration and then decrease monotonically for all following iterations, suggesting that the active-learning strategy first explores novel regions of the PES that require more data to learn. This is supported by the data analysis based on dimensionality reduction discussed later, which shows that early iterations sample an increasing number of distinct structural prototypes. The final RMSE values reach 4.3 meV per atom for energies, 293.0 meV Å−1 for forces, and 3.7 meV Å−3 for stress tensor components on the test set, comparable to those reported in ref. 33 for an ML potential trained and tested on all carbon allotropes. The error bars in Fig. 4 represent the standard deviation across the model committee, which also decreases with iterations as the model becomes more consistent and robust. The bottom panels display parity plots comparing DFT reference values against NNIP predictions for the test set after the final iteration. The tight clustering of points along the diagonal line, together with the error distribution histograms (insets), demonstrates excellent agreement between the ML predictions and DFT calculations across all evaluated properties.
To better understand how our active learning strategy explores the potential energy landscape, we analyze data diversity in the training dataset by using the kinetic Spectral Operator Representation (SOREP) descriptor.46 The kinetic SOREP provides a compact electronic-structure fingerprint based on the density of states computed for the kinetic energy operator, which is evaluated on a basis set made of a customized version of the atomic natural orbitals (ANO) in terms of contracted Gaussian-type orbitals (cGTO).47–53 We then use t-SNE (t-distributed Stochastic Neighbor Embedding) to visualize the high-dimensional SOREP descriptors in 2D space, with points colored according to the active learning iteration in which they were generated (left panel of Fig. 5). This depicts the evolution of the training set in electronic-structure space during the active-learning process. The clustering pattern reveals that early iterations (iterations 1–2) sample distinctly different and broad regions of the configuration space compared to the initial dataset (depicted as iteration 0 in the left panel of Fig. 5). This explains the initial increase in force errors, as the model encounters novel atomic environments with rather different electronic structures, requiring additional data to be learned accurately. In later iterations, after thorough sampling of these initial regions, the workflow begins exploring new domains. While the decrease in errors after the initial iterations can be attributed to good sampling of primary regions, the reduction is modest as exploration of new regions continues concurrently. Although the final model achieves rather good accuracy, further active learning iterations could be performed to explore a wider region of the PES and enhance even further model performance.
Beyond understanding the exploration strategy through SOREP analysis, we evaluate the effectiveness of using committee disagreement as an uncertainty metric for steering the growth of the training dataset. The right panel of Fig. 5 demonstrates the correlation between the committee disagreement $\tau_\alpha$ and the true prediction errors $\varepsilon_\alpha$ at the final active-learning iteration. The analysis reveals an approximately linear relationship between these quantities, confirming that committee disagreement can serve as a reliable proxy for accuracy,19 although the proportionality constant is far from unity. This aspect, anticipated in Sec. 2.1.6 and consistent with recent discussions on uncertainty quantification in ML atomistic models,20 is addressed here by introducing a calibration factor aα determined via linear regression. This factor transforms user-defined error tolerances εthrα into equivalent disagreement thresholds τthrα; the calibrated committee disagreement is then employed to select the structures to be labeled at the first-principles level. This calibration procedure ensures that the committee-disagreement threshold appropriately reflects the true prediction errors, addressing cases where the uncalibrated disagreement metric might either overestimate (as observed for forces in the right panel of Fig. 5) or underestimate (as seen for energies and stress in our validation test) the actual deviations from DFT.
To quantitatively assess the reliability of this approach, we analyze two key metrics shown in the insets of the right panel of Fig. 5: the True Positive Rate (TPR) and Positive Predictive Value (PPV) as a function of the disagreement and the true error thresholds. The TPR is defined as:
| TPR = TP/(TP + FN) | (3) |
where TP is the number of true positives and FN is the number of false negatives. The PPV is defined as:
| PPV = TP/(TP + FP) | (4) |
where FP is the number of false positives. These metrics quantify the reliability of using committee disagreement for structure selection. A high TPR indicates the approach successfully identifies structures with large true errors, while a high PPV confirms that selected structures genuinely require additional training. The analysis shows that along the fitted correlation line (dashed red line), i.e., using a calibrated committee disagreement, both TPR and PPV remain close to unity, particularly for force predictions up to several hundred meV Å−1—which is the typical range for accuracy thresholds and where most data points lie. Furthermore, we find that the performance of the calibrated disagreement criterion is robust with respect to the progression of the active-learning procedure and to variations in the committee size, with no significant degradation of either the TPR or the PPV (see Fig. 1 in the SI). This indicates not only that calibrated committee disagreement effectively identifies the structures requiring additional ab initio calculations, but also that it is an optimal and robust strategy that simultaneously maximizes both TPR and PPV, hence minimizing the number of both false positives and false negatives.
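Given per-structure true errors and (calibrated) disagreements, the TPR and PPV of the selection rule can be computed as follows; this is an illustrative sketch, not the workflow's implementation:

```python
import numpy as np

def tpr_ppv(true_error, disagreement, eps_thr, tau_thr):
    """TPR and PPV for selecting structures by committee disagreement.

    A structure is 'selected' if its disagreement exceeds tau_thr, and it
    'truly' needs labeling if its reference error exceeds eps_thr.
    """
    err = np.asarray(true_error) > eps_thr    # ground truth (needs labeling)
    sel = np.asarray(disagreement) > tau_thr  # selected by disagreement
    tp = np.sum(err & sel)    # correctly selected
    fn = np.sum(err & ~sel)   # missed despite large true error
    fp = np.sum(~err & sel)   # selected despite small true error
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return tpr, ppv

# Toy check: 3 true positives, 1 false negative, 1 false positive.
err = np.array([0.9, 0.8, 0.7, 0.9, 0.1, 0.1])
dis = np.array([0.9, 0.8, 0.7, 0.1, 0.9, 0.1])
tpr, ppv = tpr_ppv(err, dis, eps_thr=0.5, tau_thr=0.5)
```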
While the first validation study demonstrates the effectiveness of our active learning strategy, we observe diminishing returns in RMSE improvement as the number of iterations and the size of the dataset increase, due to the exploration of larger and larger regions of the PES. This suggests that strategic optimization of the initial structures and data-augmentation parameters can enhance the NNIP accuracy for the target application while requiring very few active-learning iterations. Therefore, for the second validation—focused on accuracy and data efficiency—we target the description of polymorphs near equilibrium with a refined data-augmentation strategy combined with constrained temperature ranges in MD simulations: the strain range is increased with respect to the previous run, while atomic rattling is reduced and MD temperatures range up to 1000 K (see Methods Sec. 4 for details). This approach delivers high-quality potentials in just two active-learning iterations and about 1800 training configurations, making the use of the stringent, computationally more expensive protocol for the DFT calculations more affordable and sustainable.
The accuracy metrics are reported as warm colors in Fig. 4: the NNIPs achieve test-set RMSEs of 15.9 meV per atom for energies, 319.8 meV Å−1 for forces, and 15.1 meV Å−3 for stress tensor components. While the overall errors on the test set are comparable to those of the first run, here we target accurate energetic and vibrational properties, which are shown below to be in good agreement with DFT.
Although our training data are obtained with the semi-local Perdew–Burke–Ernzerhof (PBE) functional,54 we efficiently include long-range van der Waals interactions by adding Grimme's D2 dispersion corrections35 on top of the NNIPs. We check that the approach works in practice by calculating the energy profile as a function of the interlayer distance in graphite (see Fig. 6), comparing the NNIP with PBE—both with and without D2 corrections—where the in-plane lattice parameters are fixed to their DFT-optimized values. Except for defect formation energies, the following validation tests are all performed with Grimme's D2 dispersion correction applied on top of the NNIP and compared with DFT calculations that include D2 corrections as implemented in Quantum ESPRESSO.
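For illustration, a sketch of the damped pairwise term that D2-type corrections add on top of the NNIP energy; the C6 and R0 values below are placeholders rather than Grimme's tabulated carbon parameters, and the s6 = 0.75 default is the scaling commonly used with PBE:

```python
import math

def d2_pair_energy(r, c6, r0, s6=0.75, d=20.0):
    """Damped London term of a Grimme D2-style correction for one atom pair.

    E = -s6 * c6 / r**6 * f_damp(r), with a Fermi-type damping function that
    switches the correction off at short range to avoid double counting.
    """
    f_damp = 1.0 / (1.0 + math.exp(-d * (r / r0 - 1.0)))
    return -s6 * c6 / r ** 6 * f_damp

# Illustrative parameters (placeholders, not tabulated D2 values).
c6, r0 = 15.0, 2.9  # eV*Å^6 and Å, hypothetical
e_short = d2_pair_energy(1.0, c6, r0)  # strongly damped at short range
e_long = d2_pair_energy(6.0, c6, r0)   # essentially the bare -s6*C6/r^6 tail
```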
Fig. 7 shows the equations of state (EOS) for various carbon allotropes, including graphite, graphene, diamond, dimer, simple cubic (sc), face-centered cubic (fcc), and body-centered cubic (bcc) structures. Binding energies are evaluated with reference to the energy of isolated atoms. This benchmarking approach is widely adopted in the literature,33,55 as it offers a compact yet informative way to assess how accurately a potential reproduces bonding behavior across diverse local geometries.
The left panel of Fig. 7 shows the excellent agreement between NNIP predictions and DFT references around equilibrium; discrepancies become more noticeable at extreme bond compressions or expansions, which correspond to configurations that are underrepresented in the training data. Notably, the EOS for the dimer is reasonably accurate, even though no dimer configurations were included in the initial training set. The right-hand panel compares the EOS for graphite, graphene, and diamond, focusing on their relative energetic ordering. This comparison is especially informative because graphite and diamond exhibit nearly degenerate formation energies in DFT. The NNIP correctly reproduces the energy hierarchy, demonstrating its ability to capture subtle thermodynamic trends. Similar benchmarking practices have been applied in the development of the ACE interatomic potentials.33
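As a simplified illustration of how such EOS curves are post-processed, the sketch below fits a parabola to synthetic E(V) points to extract the equilibrium volume, energy and a bulk-modulus estimate (a Birch–Murnaghan fit would be used in practice):

```python
import numpy as np

def parabolic_eos(volumes, energies):
    """Fit E(V) near the minimum with a parabola and return (V0, E0, B0).

    B0 ≈ V0 * d²E/dV² evaluated at the minimum; only valid close to
    equilibrium, where the parabolic approximation holds.
    """
    c2, c1, c0 = np.polyfit(volumes, energies, 2)
    v0 = -c1 / (2.0 * c2)                 # equilibrium volume
    e0 = np.polyval([c2, c1, c0], v0)     # equilibrium energy
    b0 = v0 * 2.0 * c2                    # V0 * second derivative
    return v0, e0, b0

# Synthetic E(V) curve with known minimum (V0 = 11.0, E0 = -9.0).
v = np.linspace(9.0, 13.0, 9)
e = -9.0 + 0.5 * 0.04 * (v - 11.0) ** 2
v0, e0, b0 = parabolic_eos(v, e)
```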
To assess the transferability of our NNIP to defective structures, we compute the formation energies of three representative point defects in graphene: the monovacancy, divacancy, and Stone–Wales defect. The formation energy is defined as:
| Eform = Edefected − Epristine + n·E0, |
where Edefected and Epristine are the total energies of the defective and pristine supercells, n is the number of atoms removed to create the defect, and E0 is the energy per atom of pristine graphene.
As summarized in Table 1, we evaluate defect formation energies using four complementary approaches: structures relaxed with the NNIP and subsequently evaluated with either the NNIP or DFT, and structures relaxed with DFT and evaluated with either the NNIP or DFT. This protocol allows us to separately assess the contributions of structural relaxation versus energy evaluation to the overall accuracy.
| Relax-energy (eV) | Monovacancy | Divacancy | Stone–Wales |
|---|---|---|---|
| NNIP-NNIP | 7.30 | 7.91 | 4.74 |
| NNIP-DFT | 8.03 | 7.47 | 4.67 |
| DFT-NNIP | 7.72 | 8.00 | 4.74 |
| DFT-DFT | 7.72 | 7.39 | 4.64 |
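All four relax-energy combinations reduce to the same bookkeeping; a minimal sketch of the formation-energy formula with purely hypothetical energies (not the values of Table 1):

```python
def defect_formation_energy(e_defected, e_pristine, n_removed, e0_per_atom):
    """Eform = Edefected - Epristine + n*E0, with n the number of removed
    atoms and E0 the energy per atom of the pristine reference."""
    return e_defected - e_pristine + n_removed * e0_per_atom

# Hypothetical numbers for illustration only (not taken from Table 1):
e0 = -9.20                       # energy per atom of pristine graphene, eV
e_pristine = 200 * e0            # 200-atom pristine supercell
e_mono = e_pristine - e0 + 7.70  # constructed so Eform comes out to 7.70 eV
ef = defect_formation_energy(e_mono, e_pristine, n_removed=1, e0_per_atom=e0)
```

For a Stone–Wales defect no atoms are removed, so `n_removed=0` and the formula reduces to a plain energy difference.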
For all the cases considered, the calculated formation energies agree well with literature values33,56 and with our DFT references. For the divacancy and Stone–Wales defects, residual NNIP-DFT differences stem mainly from the energy evaluation rather than from the relaxation method. For the monovacancy, however, differences in the relaxed structures matter more: spin-unpolarized DFT yields a Jahn–Teller-like reconstruction with a slight out-of-plane displacement of one dangling atom. This subtle rearrangement is not fully reproduced by the NNIP, which indeed returns highly accurate energetics on the DFT-relaxed structure but describes the local PES curvature less accurately: the NNIP instead predicts a completely flat monovacancy configuration to be energetically more stable. However, we should note that a physically accurate description of vacancies in graphene—particularly monovacancies—would in any case require spin-polarized DFT calculations to properly account for the unpaired π orbitals,57 which were not used for our reference dataset.
We present in Fig. 8 the phonon dispersion and density of states for graphene, graphite, and diamond. The comparison between NNIP predictions and density-functional perturbation theory (DFPT)58 calculations demonstrates good agreement across all three carbon allotropes, confirming the potential's ability to accurately capture vibrational properties. The small imaginary ZA phonon branches observed near Γ are a well-known numerical issue for 2D materials, which can be cured only by adopting prohibitively tight parameters, in particular very high plane-wave cutoffs.59 Notably, despite being trained on DFT data exhibiting this artifact, the NNIP does not display such unphysical behavior.
Fig. 8 Phonon dispersion and density of states for graphene, graphite, and diamond. The comparison between NNIP predictions and DFPT calculations shows good agreement for all three carbon allotropes.
In order to further assess the transferability of our NNIP, we consider amorphous carbon, which features a disordered network of mixed sp2 and sp3 bonds that was not explicitly included in the rattled and strained crystalline configurations included in the training set. We compute the radial distribution function (RDF), g(r), which characterizes short-range order in disordered systems, on independent amorphous configurations at a density of 3.5 g cm−3 that have been obtained using the melt-and-quench procedure described in ref. 60. Fig. 9 compares the RDFs for the committee of potentials and the committee average with reference ab initio molecular dynamics (AIMD) data from ref. 61. The NNIPs closely reproduce the AIMD reference, accurately capturing the position and height of the first and second peaks, confirming the ability of our potentials to reliably describe the local structure of amorphous carbon.
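A minimal g(r) implementation for an orthorhombic periodic cell, shown here only to make the definition concrete (production RDFs would be averaged over many frames):

```python
import numpy as np

def radial_distribution(positions, box, r_max, nbins):
    """Minimal g(r) for an orthorhombic periodic box (minimum-image convention).

    positions: (N, 3) array; box: (3,) array of box lengths; r_max < min(box)/2.
    """
    pos = np.asarray(positions, float)
    box = np.asarray(box, float)
    n = len(pos)
    diff = pos[:, None, :] - pos[None, :, :]
    diff -= box * np.round(diff / box)                 # minimum image
    dists = np.sqrt((diff ** 2).sum(-1))[np.triu_indices(n, k=1)]
    hist, edges = np.histogram(dists, bins=nbins, range=(0.0, r_max))
    centers = 0.5 * (edges[1:] + edges[:-1])
    shell_vol = 4.0 * np.pi * centers ** 2 * (edges[1] - edges[0])
    rho = n / box.prod()
    g = hist / (shell_vol * rho * n / 2.0)             # normalize to ideal gas
    return centers, g

# Two atoms 1 Å apart in a large box: g(r) should peak at r ≈ 1 Å.
centers, g = radial_distribution([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
                                 box=[10.0, 10.0, 10.0], r_max=4.0, nbins=40)
```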
As a further transferability test, we evaluate one NNIP on the SACADA dataset, which contains 1635 carbon allotropes collected from the scientific literature.62 Unlike the standard test sets, which are generated using the same protocols as the training sets (run A and run B) and therefore include structures obtained from data augmentation and MD sampling, the SACADA structures can differ substantially from the training set and span a wide variety of densities and space groups. This makes the SACADA evaluation a genuine test of the model's ability to generalize beyond its training domain. Out of the 1635 structures, we exclude 13 due to unconverged DFT labeling. We then discard 19 structures for which at least one of the potentials yields a committee-based disagreement larger than the following loose thresholds: 0.2 eV for energy, 50 eV Å−1 for forces, and 1 eV Å−3 for the stress tensor. The remaining 1603 structures (98% of the SACADA dataset) represent the overlap between converged DFT simulations and successful NNIP predictions, as required for a meaningful comparison. Here, we also evaluate the universal PET-MAD potential (v1.0.2),8 both in its original form and after fine-tuning on the carbon dataset generated in run B. Since run B was obtained with the PBE functional, while PET-MAD was trained with PBEsol,63 we relabel the SACADA dataset with PBEsol when evaluating the accuracy of the PET-MAD potentials. In passing, we note that the modular structure of AiiDA-TrainsPot permits a seamless exchange of training engines (MACE or Metatrain) and architectures, the reuse of existing labeled datasets, and uniform evaluation procedures across models. We report the parity plots and RMSE in Fig. 10: the MACE potential trained from scratch with AiiDA-TrainsPot on 1795 structures performs similarly to the PET-MAD foundation model, with comparable errors on forces and stress tensors while doing slightly worse on energies (89 vs. 54 meV).
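The loose-threshold screening described above amounts to a simple mask over per-structure disagreements; a sketch with hypothetical values:

```python
import numpy as np

# Loose committee-disagreement thresholds quoted in the text.
THRESHOLDS = {"energy": 0.2, "forces": 50.0, "stress": 1.0}  # eV, eV/Å, eV/Å^3

def keep_mask(disagreements):
    """True for structures whose disagreement stays below every threshold."""
    n = len(next(iter(disagreements.values())))
    mask = np.ones(n, dtype=bool)
    for key, thr in THRESHOLDS.items():
        mask &= np.asarray(disagreements[key]) <= thr
    return mask

# Hypothetical disagreements for 4 structures (structure 2 fails on forces).
d = {
    "energy": [0.01, 0.05, 0.03, 0.19],
    "forces": [1.0, 2.0, 80.0, 10.0],
    "stress": [0.1, 0.2, 0.1, 0.9],
}
mask = keep_mask(d)
```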
Notably, fine-tuning PET-MAD on the PBEsol-relabeled run B datasets lowers the error on all metrics (energies, forces and stress tensors) both compared to the MACE potential trained from scratch and to PET-MAD.
For each W fraction x, the formation energy of a given configuration is defined as:
| Ef(x) = E(WxMo1−xTe2) − xE(WTe2) − (1 − x)E(MoTe2), | (5) |
and we monitor the formation-energy difference between the H and T′ phases, which directly determines the relative structural phase stability at zero temperature.
To generate validation configurations, we employ special quasirandom structures (SQS)65 to approximate random W/Mo distributions on the metal sublattice at a given composition x. For each intermediate composition shown in Fig. 11, we construct and evaluate 10 independent SQS realizations to ensure statistical robustness of the estimated formation-energy differences.
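Eq. (5) is straightforward to evaluate once per-formula-unit energies are available; a sketch with hypothetical numbers (an ideal linear mixture gives Ef = 0 by construction):

```python
def formation_energy(x, e_alloy, e_wte2, e_mote2):
    """Ef(x) = E(WxMo1-xTe2) - x*E(WTe2) - (1-x)*E(MoTe2), per formula unit."""
    return e_alloy - x * e_wte2 - (1.0 - x) * e_mote2

# Hypothetical per-formula-unit energies (eV), for illustration only.
e_wte2, e_mote2 = -18.0, -19.0
x = 0.25
e_ideal = x * e_wte2 + (1.0 - x) * e_mote2   # ideal linear mixture
ef = formation_energy(x, e_ideal, e_wte2, e_mote2)
```

In the benchmark, Ef(x) of the H and T′ phases would each be averaged over the 10 SQS realizations before taking their difference.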
As shown in Fig. 11, existing foundation models struggle with this task. MATTERSIM (v1.0.0–1M) and MACE (MATPES–PBE–0) incorrectly predict the T′ phase to be the ground state for all compositions, while PET–MAD (v1.0.2), although it captures the existence of a phase-stability crossover and reproduces the formation-energy difference of the end-members quite well, places the critical concentration around x ∼ 0.7—far from the DFT reference value of about x ∼ 0.3.
Then, we employ AiiDA-TrainsPot to train three additional MACE models, all starting from the same four pristine input structures (H- and T′-phase MoTe2 and WTe2) taken from the Materials Cloud 2D database:59
1. Alloy model trained from scratch. A dataset of 1687 alloy configurations is generated by randomly sampling the W/Mo distribution in both phases during the dataset-augmentation stage, and a committee of models is trained from scratch on this dataset.
2. Fine-tuned alloy model. The same alloy dataset is used to fine-tune the MACE MATPES–PBE–0 foundation model.
3. End-member model with active learning. A fully independent model is trained from scratch using only configurations of the pure end-members in both phases (no alloys, clusters, surfaces, or substitutional configurations were generated during dataset augmentation). Twelve active-learning iterations are performed, yielding a final dataset of 6965 structures.
All three models markedly outperform the existing foundation models, in particular accurately capturing both the composition dependence of the formation-energy difference and the H–T′ phase-stability crossover. Remarkably, the best-performing model is the one trained only on end-member configurations: despite never having seen alloy structures, it correctly reproduces the formation-energy trend and predicts the phase-stability crossover close to the DFT reference value. This result highlights the effectiveness of active learning with calibrated committee disagreement—and, more generally, of the AiiDA-TrainsPot protocol—in sampling the relevant regions of configuration space and constructing a compact dataset that yields a highly transferable NNIP.
The SOREP-based dimensionality reduction (t-SNE) has shown that subsequent MD-based active-learning steps explore the PES through a dual mechanism: dense sampling of already-known regions and, simultaneously, exploration of entirely new basins. A compelling example is the spontaneous formation of carbon nanotubes during the MD simulations in the active-learning cycle: carbon nanotubes were absent from the original dataset but are then automatically incorporated by AiiDA-TrainsPot into subsequent training iterations, suggesting the ability of the automated workflow to find novel and quite different metastable or stable structures without prior knowledge. Indeed, the automated training strategy delivered potentials that were also able to reproduce the correct RDF of amorphous carbon.
A key aspect is the use of calibrated committee disagreement to guide the selection of new training structures. This strategy improves the efficiency and reliability of active learning, while ensuring that the model is exposed only to configurations that enhance its predictive power. Notably, the relationship between committee disagreement and actual error appears to be linear over all active-learning iterations and across different properties (e.g., energies, forces, stress tensors): this supports the reliable use of calibrated committee disagreement also in production simulations, i.e., when reference ab initio simulations typically cannot be performed.
Furthermore, as demonstrated for both the carbon allotropes and the WxMo1−xTe2 alloy benchmarks, AiiDA-TrainsPot can be effectively employed to fine-tune existing foundation models to specific applications. Fine-tuning can be performed either in a single shot on a fixed dataset or embedded into subsequent active-learning iterations, further improving model accuracy and robustness.
It would be interesting to investigate other data-augmentation approaches—such as random structure search,66 non-diagonal supercells,67 or generative models68—as well as advanced exploration strategies beyond NPT and NVT MD, such as metadynamics,69 by interfacing AiiDA-TrainsPot with PLUMED.70–72 More generally, we hope that AiiDA's plugin system and the modular structure of AiiDA-TrainsPot will encourage and facilitate future upgrades, as well as the integration of new tools. An example would be supporting multiple NNIP backends (beyond MACE and Metatrain) and electronic-structure codes (beyond Quantum ESPRESSO), in the spirit of previous efforts on code-agnostic common workflows for EOS and dissociation curves.73 Powerful upgrades would be enabled by interfacing AiiDA-TrainsPot with existing specialized AiiDA workflows, for instance using DFT+U calculations where the Hubbard U can be automatically calculated for each configuration, either with DFPT74–76 through the AiiDA-Hubbard workflow77 or even more efficiently through ML methods.78
As a side note, AiiDA-TrainsPot inherits from the AiiDA infrastructure the tracking of data-provenance graphs, enabling external validation and assessment of the published training data. This capability is crucial for public foundation models and, more broadly, for the reuse of training datasets and their corresponding NNIPs by the community. In the SI, we include a representative provenance graph from a small test run to illustrate how AiiDA automatically records the complete data lineage throughout a training campaign. SI Fig. 2 shows the full provenance graph, whereas SI Fig. 3 presents a simplified view that retains only the most relevant nodes and connections.
While AiiDA-TrainsPot can operate autonomously for general-purpose NNIP development, domain experts retain full flexibility to incorporate their physical and chemical intuition into the automation strategy. The modular architecture is designed to enable full customization of all key components: initial structure selection, dataset-augmentation parameters, MD simulation conditions, and computational settings for the integrated codes—Quantum ESPRESSO, MACE, Metatrain and LAMMPS. This level of customization allows users to balance the power of automation with the flexibility needed to support a wide range of applications, which require tailoring the active-learning process to specific research objectives. Indeed, while AiiDA-TrainsPot automates the entire process—traditionally long, tedious and prone to human error—of developing NNIPs, optimal results still benefit from careful consideration of the system of interest. In other words, the selection of initial structures, augmentation strategies, and MD conditions can—and often should—be tailored to reflect the target application and desired properties: AiiDA-TrainsPot makes that effort straightforward. On top of that, the enforcement of standardized protocols (either already established, e.g., SSSP pseudopotentials29 or "fast"/"stringent" QE protocols,32 or introduced in this work) contributes to precision, reproducibility and seamless integration with future efforts in training larger models.
AiiDA-TrainsPot democratizes the access to high-quality NNIPs tailored to the application of interest, hopefully encouraging computational scientists with limited expertise in electronic structure and ML to tackle challenging phenomena and materials, pushing the frontier of what can be simulated, understood and designed with ab initio accuracy.
For both runs, structures were replicated up to a maximum of 600 atoms and a minimum cell length of 24 Å. A total of 80 cluster structures (up to 30 atoms each with minimum interatomic distance 1.5 Å) were generated. Slab configurations were created with a minimum thickness of 10 Å and a maximum of 600 atoms, along the (100), (110), (111), (001), (011), (010), and (101) directions. Non-periodic directions were padded with 15 Å of vacuum. Vacancies (2 per structure) were created in 30% of the structures.
For run A, random distortions and strains were introduced with a rattle_fraction of 0.4, a max_compressive_strain of 0.2 and a max_tensile_strain of 0.2. DFT calculations used the fast protocol32 for k-point grid (λ = 0.30 Å−1) and smearing (σcold = 0.0275 Ry). MD simulations explored temperatures ranging from 0 to 5000 K and pressures from −5 to 5 kbar.
For run B, dataset augmentation parameters were optimized for near-equilibrium conditions: rattle_fraction was reduced to 0.3, while strain ranges were increased to max_compressive_strain of 0.3 and max_tensile_strain of 0.6 to better sample elastic deformations. DFT calculations employed the stringent protocol for enhanced accuracy (λ = 0.1 Å−1, σcold = 0.0125 Ry). MD exploration was constrained to temperatures from 0 to 1000 K and pressures from −20 to 20 kbar.
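The run A/run B settings above can be collected in a plain summary dictionary. This is only an illustration: the parameter names mirror those quoted in the text (rattle_fraction, max_compressive_strain, max_tensile_strain), not the actual AiiDA-TrainsPot input schema.

```python
# Illustrative summary of the two carbon training campaigns; NOT the actual
# AiiDA-TrainsPot input format, only a plain dict mirroring the parameters
# named in the text.
runs = {
    "A": {
        "rattle_fraction": 0.4,
        "max_compressive_strain": 0.2,
        "max_tensile_strain": 0.2,
        "dft_protocol": "fast",        # λ = 0.30 Å^-1, σ_cold = 0.0275 Ry
        "md_temperature_K": (0, 5000),
        "md_pressure_kbar": (-5, 5),
    },
    "B": {
        "rattle_fraction": 0.3,
        "max_compressive_strain": 0.3,
        "max_tensile_strain": 0.6,
        "dft_protocol": "stringent",   # λ = 0.10 Å^-1, σ_cold = 0.0125 Ry
        "md_temperature_K": (0, 1000),
        "md_pressure_kbar": (-20, 20),
    },
}
```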
Both runs utilized the SSSP PBE precision library v1.3 pseudopotentials,29–31 total energy convergence threshold of 10−8 Ry, MACE training with radial cutoff of 4.5 Å, two message-passing layers, batch size of 1, and up to 500 epochs. MD simulations were performed in NPT (for fully or partially periodic systems) or NVT (for non-periodic systems) ensembles using a 1 fs timestep and extracting trajectory frames every 1 ns. Since van der Waals interactions were not included at the DFT level, Grimme's D2 dispersion correction35 was enabled in MD simulations via the momb pair style84 in LAMMPS. Active learning thresholds on energy, forces, and stress tensor were set to 2 meV, 50 meV Å−1, 10 meV Å−3, respectively, with a maximum of 1000 structures selected per iteration.
During dataset augmentation, structures were replicated up to a minimum cell length of 18 Å. The non-periodic direction was padded with 15 Å of vacuum. Vacancies (2 per structure) were created in 30% of the structures. No substitutional configurations, clusters, or surfaces were generated during dataset augmentation. Only for the alloy dataset, random W/Mo distributions on the metal sublattice were sampled by setting, in the dataset-augmentation stage, Te as fixed_species and Mo and W as alloy_species.
The same DFT computational settings as in run B for the carbon allotropes were used also in this case, while MD simulations, performed in the NPT ensemble (with the barostat applied only to the in-plane directions), explored temperatures ranging from 0 to 1800 K and pressures from −10 to 10 kbar, without additional vdW corrections. MACE training runs were performed with a radial cutoff of 6.0 Å, two message-passing layers, a batch size of 5, and up to 500 epochs, except for the fine-tuning run, for which the same parameters as the original MACE MATPES–PBE–0 model were used.
All the calculations in this work were performed on the CINECA Leonardo supercomputer, using nodes equipped with 4 NVIDIA A100 GPUs. Taking as reference the active-learning run used to train the MACE potential for MoTe2 and WTe2 (alloy benchmark in Fig. 11), where each structure contains about 108 atoms, we estimate an average computational cost of ∼1.2 GPU hours per structure for DFT calculations, ∼25 GPU hours for each MACE training run, and ∼0.1 GPU hours for each LAMMPS MD simulation. After 12 active-learning iterations, resulting in a final dataset of 6965 structures, the total computational cost for training this committee of MACE potentials amounts to about 11 000 GPU hours, 84% of which was spent on DFT labelling.
Supplementary information (SI) is available. See DOI: https://doi.org/10.1039/d6dd00005c.
Footnote
† These authors contributed equally to this work.
This journal is © The Royal Society of Chemistry 2026