Open Access Article
Marlen Neubert,ab Patrick Reiser,ab Frauke Gräter*cde and Pascal Friederich*ab
aInstitute of Theoretical Informatics, Karlsruhe Institute of Technology, Kaiserstr. 12, 76131 Karlsruhe, Germany. E-mail: pascal.friederich@kit.edu
bInstitute of Nanotechnology, Karlsruhe Institute of Technology, Kaiserstr. 12, 76131 Karlsruhe, Germany
cMax Planck Institute for Polymer Research, Mainz 55128, Germany. E-mail: graeter@mpip-mainz.mpg.de
dHeidelberg Institute for Theoretical Studies, Heidelberg 69117, Germany
eInterdisciplinary Center for Scientific Computing, Heidelberg University, Heidelberg 69120, Germany
First published on 30th March 2026
Hydrogen atom transfer (HAT) reactions are essential in many biological processes, such as radical migration in damaged proteins, but their mechanistic pathways remain incompletely understood. Simulating HAT processes is challenging due to the conflicting requirements of quantum chemical accuracy and biologically relevant time and length scales; thus, neither classical force fields nor DFT-based molecular dynamics simulations are applicable. Machine-learned potentials offer an alternative, with the ability to learn potential energy surfaces (PESs) that capture reactions and transitions with near-quantum accuracy. However, training such models to generalize across diverse HAT configurations—especially at radical positions in proteins—requires tailored data generation strategies and careful model selection. In this work, we systematically generate HAT reaction configurations in peptides to build large datasets using semiempirical methods as well as DFT. We benchmark three atomistic machine-learned potential architectures, SchNet, Allegro, and MACE, on their ability to learn HAT potential energy surfaces and indirectly predict reaction barriers through direct energy predictions. MACE consistently outperforms the other models in energy, force, and reaction barrier prediction accuracy, achieving a mean absolute error of 1.13 kcal mol−1 on DFT barrier predictions. Short molecular dynamics simulations indicate that the learned potential is numerically stable at finite temperature and can sustain reactive HAT sampling under moderate biasing, serving as a feasibility check for downstream simulation workflows. We show that the trained MACE potential generalizes well beyond our training data by performing out-of-distribution evaluations and analysis of HAT barriers in collagen I snapshots. 
This level of accuracy can enable integration of machine-learning-based barrier predictions into large-scale simulation workflows to compute reaction rates from predicted barriers, advancing the mechanistic understanding of HAT and radical migration in peptides. We analyze scaling laws, model transferability, and cost-performance trade-offs, and outline strategies for improvement through the combination of machine-learned potentials with transition state search algorithms and active learning. The presented approach is generalizable to other biomolecular systems, offering a method toward quantum-accurate barrier predictions of chemical reactivity in complex biological environments.
| AH + B˙ → A˙ + BH | (1) |
Load-bearing proteins can form radicals under mechanical stress via homolytic bond scission. Experiments and simulations on collagen I tendons under mechanical load reveal the formation of radicals and subsequent production of reactive oxygen species, suggesting that mechanical load contributes to redox signaling and stress.5 While the specific microscopic steps are not fully assigned by Zapp et al., HAT again provides a plausible route for radical migration among nearby donor and acceptor sites.
Whether programmed (enzymes) or stimulus-driven (oxidative, mechanical), protein radicals follow pathways shaped by structure and microenvironment. In all cases, HAT reaction barriers and the pre-organization of donor and acceptor sites determine kinetics and product distributions. Because the radical intermediates involved in HAT are short-lived, we rely on computational models to understand the fundamentals of radical reactivity and transport in proteins.
Recent efforts have focused on approaches that include machine learning (ML). Riedmiller et al.6 trained a graph neural network to directly predict HAT barriers in protein environments, focusing on small systems cut out from collagen simulation snapshots. Their model achieved a prediction error of 2.4 ± 2.5 kcal mol−1 on configurations inferred from classical molecular dynamics trajectories and synthetic peptide systems, and an error of 4.6 ± 4.8 kcal mol−1 on out-of-distribution data. Using the same data, Ulanov et al.7 demonstrated that Gaussian process regression models can be employed to predict HAT reaction barriers in low-data regimes, achieving an MAE of 3.23 kcal mol−1. The limited prediction accuracy of both models restricts their use in subsequent simulation schemes and in understanding the HAT reactions themselves. Furthermore, since both approaches predict the reaction barriers directly, the models cannot be used in direct molecular dynamics (MD) simulations or enhanced sampling methods.
Barrier heights depend on precise knowledge of reaction paths, which in turn require an understanding of the potential energy surface (PES). The PES describes the functional relationship between potential energy and atomic positions. If an accurate PES is known, a molecular system's equilibrium structures or transition states can be found since these correspond to the PES's minima or saddle points. When an ML model is directly trained on barrier heights, information on the reaction path and topology of the PES is thus not included.
In this work, we model the PESs of HAT reactions in peptides using transferable ML models, which allow us to indirectly predict reaction barrier heights via direct energy predictions. By learning the full PES of HAT reactions in peptide systems, we capture the energetics and forces that govern barrier heights and reaction pathways, and thus can predict more accurate reaction barrier heights. An accurate model of the PES will also enable further investigations and a deeper understanding of reaction dynamics. Beyond barrier prediction, the trained ML models, possibly refined through active learning, can also be utilized in MD simulations, enhanced sampling, optimization, and transition state search algorithms, as they are differentiable.
Traditionally, there have been two approaches to calculating the energy and forces of molecular systems, i.e., the PES. Ab initio methods, while accurate, are infeasible for large system sizes due to their computational cost. Classical force fields, by contrast, are very fast due to their analytical form. The terms in classical force field functions contain many empirical parameters describing bonded and non-bonded interactions (e.g., electrostatic or van der Waals interactions). However, force fields do not allow bonds to break or form, i.e., no chemical reactions can be simulated. Reactive force fields have been developed to address this; however, their accuracy is generally lower than that of ab initio calculations due to the underlying empirical approximations.8
Since the PES is a multidimensional function, an analytical expression can also be obtained by fitting to data of ab initio accuracy. Framing the construction of the PES as a function approximation problem makes clear where machine learning (ML) methods can serve as efficient tools: if the relationship between potential energy (including its analytical derivatives) and the atomic positions of a system is constructed using ML methods, the resulting analytical expression is referred to as a machine-learned (ML) potential. The training data for ML potentials consist of the coordinates and element types of all atoms of a system together with the corresponding energies and, often, forces. Since the quality of the resulting ML potential depends directly on the quality and quantity of the training data, the latter are typically calculated using an accurate but affordable ab initio method, e.g. density functional theory (DFT).
Compared to classical force field methods, ML methods offer the advantage that no constraining assumptions about the functional form of the PES or bonds are needed – the chemical behavior, including long-range interactions and chemical reactions, is learned from the reference data alone.9 ML potentials allow the modeling of the PES of a system with high accuracy and reasonable computational costs. Therefore, their field of application is vast and theoretically ranges over any material in any state, from biomolecules to crystalline systems. Due to its many potential use cases, the development of ML potentials is a very active research field. Successful ML potentials are based on several different ML architectures, from neural networks to kernel-based methods to graph neural networks (GNNs), with specific advantages and disadvantages.10 Regardless of the exact model architecture, training data for ML potentials initially consist of (Cartesian) atom coordinates, the element type of the atoms of a system, and the associated energies in context-dependent configurations. Often, information about forces is also part of the training data, as adding them can increase the accuracy of the models and reduce the required training set size11 since there are 3 N forces for N atoms instead of just one energy label.12 The calculation of energies and forces requires an ab initio method to guarantee the accuracy of the learned potential, but it can also become a bottleneck if a large training data set is needed.
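The role of forces as training labels can be illustrated with a toy pair potential: for N atoms, the negative gradient of the energy yields 3N force components per configuration, compared to a single energy label. The Lennard-Jones form and all parameters below are illustrative placeholders, not part of the datasets used in this work.

```python
import numpy as np

def pair_energy(R, eps=1.0, sigma=1.0):
    """Toy Lennard-Jones total energy for coordinates R of shape (N, 3)."""
    E = 0.0
    N = len(R)
    for i in range(N):
        for j in range(i + 1, N):
            r = np.linalg.norm(R[i] - R[j])
            E += 4 * eps * ((sigma / r) ** 12 - (sigma / r) ** 6)
    return E

def forces_fd(R, h=1e-5):
    """Forces as the negative numerical gradient of the energy, F = -dE/dR."""
    F = np.zeros_like(R)
    for idx in np.ndindex(R.shape):
        Rp, Rm = R.copy(), R.copy()
        Rp[idx] += h
        Rm[idx] -= h
        F[idx] = -(pair_energy(Rp) - pair_energy(Rm)) / (2 * h)
    return F

R = np.array([[0.0, 0.0, 0.0], [1.2, 0.0, 0.0], [0.0, 1.2, 0.0]])
F = forces_fd(R)
print(F.shape)  # (3, 3): 3N force components for N = 3 atoms, vs. one energy label
```

Because the toy energy depends only on interatomic distances, the net force sums to zero, a quick sanity check that also applies to reference data for ML potentials.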
The PES exhibits symmetries, which the ML model should also reflect. For example, the total energy is invariant if a molecule is translated or rotated, or if two atoms of the same element type exchange. These invariances can either be explicitly satisfied by choosing a representation of the geometry (e.g. inverse distances), by including them in the functional form of the machine learning model (inductive bias), or by learning them (e.g. through data augmentation). Currently, the most popular ML model architecture for ML potentials is the graph neural network (GNN), which utilizes the natural graph structure of molecules.13,14 Since the topology of the molecular structure can be considered as an undirected graph, atoms can be associated with nodes and chemical bonds with edges. At first, atom feature vectors contain properties such as element types and positions.15 Information or ‘messages' are then exchanged between atoms through message-passing layers, and the model iteratively learns feature representations of the individual atoms' local environments, including information about neighbors and more long-range interactions after several message-passing steps.
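The invariances discussed above can be checked numerically. The sketch below builds a simple descriptor from sorted inverse pairwise distances (one of the invariant representations mentioned) and verifies that it is unchanged under a random rotation, a translation, and a permutation of atoms of the same type; the descriptor is a minimal illustration, not one used by the models benchmarked here.

```python
import numpy as np

rng = np.random.default_rng(0)

def descriptor(R):
    """Invariant descriptor: sorted inverse pairwise distances of R (N, 3)."""
    diff = R[:, None, :] - R[None, :, :]
    d = np.linalg.norm(diff, axis=-1)
    iu = np.triu_indices(len(R), k=1)  # unique atom pairs only
    return np.sort(1.0 / d[iu])

def random_rotation():
    # QR decomposition of a random Gaussian matrix yields an orthogonal matrix
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    return Q

R = rng.normal(size=(5, 3))
R_moved = R @ random_rotation().T + np.array([1.0, -2.0, 0.5])  # rotate + translate
print(np.allclose(descriptor(R), descriptor(R_moved)))  # True
```

By contrast, raw Cartesian coordinates change under all three operations, which is why models either use invariant inputs like these or build the symmetries into their functional form.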
One of the first message-passing neural networks to learn PESs was SchNet,16 which is based on invariant convolutions over scalars. The model consists of convolutional interaction blocks in which the initial features are updated and the final atom embeddings are learned. It ensures rotational invariance of the output by constructing only scalar features and operating on (scalar) interatomic distances rather than Cartesian atom coordinates. While SchNet was successfully employed in various chemical applications, its substantial demand for ab initio training data was found to be a bottleneck at larger length scales.
More recently, equivariant GNNs gained popularity as ML potential models, outperforming previous invariant architectures and displaying higher data efficiency. Equivariant GNNs can encode more physical information about an atomic system by directly acting on vector quantities while preserving known physical symmetries. More specifically, the models are equivariant with respect to transformations under the 3D Euclidean group (rotation, inversion, and translation). This is relevant for preserving force vectors under rotation of the atomic system. Equivariance is achieved by not only learning scalar node representations but also higher-order geometric tensor features.
Examples of equivariant GNNs include NequIP,17 Allegro,18 and MACE.19 NequIP utilizes learned scalar and tensor features, and information is propagated via message-passing over relative position vectors. While achieving state-of-the-art accuracies on several benchmark datasets, computational performance, specifically training and evaluation speed, constitutes a drawback when scaling to larger systems. The main disadvantage in this context is the message-passing step since it constructs many neighbors for each atom, hindering the parallelizability of the model.
To combat this, Allegro,18 based on NequIP, learns strictly local equivariant tensor features between edges and employs no message-passing, resulting in an O(N) scaling with respect to the number of atoms. Embedding only the local environment leads to fixed-size receptive fields, which does not reduce accuracy on benchmark datasets. MACE,19 on the other hand, employs a higher-order message-passing strategy to reduce computational costs. It explicitly models higher-order interactions by constructing many-body features from radial and spherical harmonics basis functions based on the multi-atomic cluster expansion framework.20 Equivariant messages from these features are then constructed hierarchically via tensor operations. With this construction, the authors show that message-passing can be reduced to two iterations, compared to 4–6 for other equivariant models.17 Evaluations on benchmark datasets show that both Allegro and MACE achieve high accuracies and good transferability to out-of-distribution data.
While benchmark data allows a wide range of comparisons between the latest models, many applications of interest, especially in biochemistry, are often more complex and more specific than common benchmark datasets, and it is not immediately clear which architecture is the most suitable or what kind and how much training data is required. Thus, trade-offs between training times, data requirements, and accuracy are not intuitively apparent.
One of the challenges in the context of HAT in peptides is the increased complexity that a reaction entails. In addition to training data on equilibrium configurations, global information on the PES, i.e., the reaction's intermediate steps and transition states, is also required. Accurately training a model thus requires more data, which at the same time needs to be informative to allow a model to capture the increased complexity. Generating this data constitutes a challenge in itself and requires an efficient data generation workflow. This is especially true if not only one specific reaction configuration but, as in our case, various combinations of peptides, radical positions, and reaction paths should be learned. Higher data requirements to learn the PES of a reaction also mean that we need more ab initio calculations for energy and force labels. The chosen model must, therefore, be data-efficient. Otherwise, the number of ab initio calculations represents a bottleneck.
In this work, we trained ML potentials to learn potential energy surfaces of HAT reactions in peptides. Guided by previous studies on collagen and broader protein-radical chemistry, we construct amino-acid and dipeptide systems that incorporate intramolecular (within a peptide) and intermolecular (between two peptides) HAT motifs. We developed a workflow to generate training data for HAT reaction configurations in peptides. Using this workflow, we generated a dataset of 172 000 data points with energies and forces calculated using the semi-empirical method GFN-xTB.21 Additionally, we generated a dataset comprising 125 365 data points at the DFT/bmk/def2-TZVPD level of theory. We explored the performance of the models SchNet, Allegro, and MACE, estimated a scaling law, and investigated their transferability from small to large systems, both on semi-empirical and DFT data. We used the trained models to indirectly predict HAT reaction barriers through direct energy predictions (see Fig. 1). Models trained directly on the PES can capture more complexity than previous state-of-the-art direct barrier predictions. Our best MACE model, trained on 65 514 DFT/bmk/def2-TZVPD configurations, achieved an MAE of 1.13 kcal mol−1 in barrier predictions on out-of-distribution data, compared to a previously reported MAE of 2.4 ± 2.5 kcal mol−1 (ref. 6). The significant increase in accuracy achieved here renders the ML model suitable for use in barrier predictions, for example, in radical migration in damaged proteins. Our short MD simulations show that the learned potential is stable and reactive under thermal conditions and can serve as an engine for sampling HAT reactions or for enhanced-sampling calculations of HAT free energies, potentially requiring active learning to ensure reliability. We show that our trained ML potential generalizes well beyond our training data by performing out-of-distribution evaluation and analysis of HAT barriers in collagen I snapshots.
We used RDKit22 to generate 3D coordinates from SMILES representations corresponding to the I molecules to be included in the final training data. We then optimized the initial coordinates using xTB21 to obtain minimum energy structures. These served as the starting point for a conformer search performed with CREST23 (Fig. 2a). Since this typically results in numerous structures per molecule, we selected the five lowest-energy conformers and five randomly sampled higher-energy conformers, resulting in C = 10 conformer configurations {Ric} per molecule i, where i ∈ I and c ∈ C.
We applied normal mode sampling (Fig. 2b) to the chosen conformers in the next step to obtain J physically relevant non-equilibrium structures per molecule {Rij}, where j ∈ J. This allowed us to sample the PES around minima up to a maximum relative energy. In this step, we distorted the molecules along their normal modes based on methods employed by Rupp et al.24 and Smith et al.25,26 We used xTB to calculate normal mode coordinates qic,m and force constants kic,m for m eigenmodes of each conformer configuration c per molecule i. The force constants were then used to calculate displacements Ric,m (Fig. 2b), with which the sampled configuration was generated according to eqn (2):
| Rij = Ric + Σm Ric,m | (2) |
The normal mode sampled geometries Rij are thus superpositions of perturbed normal mode coordinates qic,m that pass relative bond length and total energy checks. This ensures that no bonds are broken and that the new configuration's total energy lies within a set range. Normal mode sampling is only an approximation, valid within the harmonic regime; in the context of generating training data for ML potentials, it is nonetheless beneficial since it allows fast sampling of structures that cover physically relevant PES regions. The perturbed molecular coordinates Rij served as initial structures to build radical systems in the subsequent steps.
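A minimal sketch of the normal-mode displacement step might look as follows. The mode vectors and force constants are random placeholders here, and the per-mode energy partition and amplitude formula follow the general scheme of Smith et al.; they are an assumed approximation of the exact procedure, not a reproduction of it.

```python
import numpy as np

rng = np.random.default_rng(1)
kB = 3.1668e-6  # Boltzmann constant in Hartree / K

def normal_mode_sample(R0, modes, k, T=300.0):
    """Displace an equilibrium geometry R0 (N, 3) along its normal modes.

    modes: (M, N, 3) normal mode displacement vectors
    k:     (M,) force constants (Hartree / Bohr^2)
    """
    N = R0.shape[0]
    M = len(k)
    c = rng.random(M)
    c = c / c.sum() * rng.random()  # per-mode energy fractions, sum(c) <= 1
    # harmonic amplitude such that 0.5 * k_m * a_m^2 = c_m * 3 N kB T / 2
    amp = np.sqrt(3.0 * N * kB * T * c / k)
    signs = rng.choice([-1.0, 1.0], size=M)
    return R0 + np.einsum("m,mij->ij", signs * amp, modes)

# toy example: 3 atoms, 3 placeholder modes normalized to unit length
R0 = rng.normal(size=(3, 3))
modes = rng.normal(size=(3, 3, 3))
modes /= np.linalg.norm(modes, axis=(1, 2), keepdims=True)
k = np.full(3, 0.5)
R_new = normal_mode_sample(R0, modes, k)
print(R_new.shape)  # (3, 3)
```

In the actual workflow, a sampled geometry would additionally be rejected if it fails the bond-length or total-energy checks described above.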
To create radical systems, we transformed a molecule into a radical by removing a hydrogen atom, creating a radical at position r0 (Fig. 2c). We consider two types of radical systems in which reactions occur – intramolecular and intermolecular HAT. For intra-HAT in peptides, we assume that a transfer occurs within the same molecule, while for inter-HAT, we assume that a hydrogen atom at position rH moves between two distinct peptides towards a radical at position r0, thus creating a radical at position r1. We implemented a function g that creates the radical systems by performing a selection and geometry modification in the case of inter-HAT systems. For both system types, the function analyzes given molecules and randomly chooses a hydrogen atom for transfer and a radical position, i.e., the start and end position of the reaction. The selection function performs distance checks to prevent clashes and includes conditions under which hydrogen atoms can be transferred, depending on the molecule type and atom environment. The function generates inter-HAT systems by translating and rotating one randomly chosen molecule and one radical, while the distance between hydrogen atom at rH and radical at r0 is randomly drawn from a χ2-distribution with a maximum distance of 4 Å. The two configurations were arranged so that no clashes occurred, and no other hydrogen atom was closer to the radical position than the atom designated for transfer. This step results in the creation of radical system configurations {Ra} for both inter- and intra-HAT. In the last step, we created reaction configurations from the generated radical systems. A function f modifies the geometry of a system by moving the designated hydrogen atom between the start and end positions. 
This displacement function takes a generated radical system Rinter, intraa and translates the hydrogen atom rH to a point on a sphere with a randomly sampled radius around the center of the start and endpoints of the reaction (Fig. 2d). To avoid outlier geometries, we checked again for clashes and energy outliers, i.e., we defined a maximum difference between the minimum energy and the energy of the generated configuration. This scheme creates a diverse set of reaction systems {Rr} with corresponding energies and forces {Er, Fr} that differ in geometry, transfer type (intra or inter), type of peptide, as well as the hydrogen and radical positions.
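The geometric placement for inter-HAT systems can be sketched as follows. Only the χ2-distribution and the 4 Å maximum distance are specified above; the degrees of freedom and the rejection-based truncation are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_h_radical_distance(df=3, d_max=4.0):
    """Draw the H...radical distance (Angstrom) from a chi-squared
    distribution, rejecting samples beyond the maximum distance."""
    while True:
        d = rng.chisquare(df)
        if d <= d_max:
            return d

def random_point_on_sphere(center, radius):
    """Uniformly sample a point on a sphere around `center`."""
    v = rng.normal(size=3)
    return center + radius * v / np.linalg.norm(v)

# place a radical site at a sampled distance from a hydrogen at the origin
r_H = np.zeros(3)
d = sample_h_radical_distance()
r_0 = random_point_on_sphere(r_H, d)
print(abs(np.linalg.norm(r_0 - r_H) - d) < 1e-9)  # True
```

The clash and nearest-hydrogen checks described above would then accept or reject the resulting arrangement.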
Since our goal is to predict reaction barriers indirectly using the direct energy predictions from trained models, we generated additional data by linear interpolation of the hydrogen atom at rH designated for transfer (Fig. 2d), similar to what has been done previously in the literature.6 We generated configurations by moving the hydrogen atom on a linear path between a system's start and end positions, with 10 interpolations per system. The reaction barriers for a system were then calculated as the energy difference between the highest-energy configuration on the path and the start and end positions, respectively. We applied the linear interpolation scheme to the radical systems in the evaluation data and, for additional analysis, to part of the training data.
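The interpolation and barrier extraction can be sketched as follows; the energies along the path are placeholder values rather than xTB or DFT results.

```python
import numpy as np

def interpolate_h_path(r_start, r_end, n=10):
    """Linearly interpolated hydrogen positions between donor and acceptor."""
    t = np.linspace(0.0, 1.0, n)
    return r_start[None, :] + t[:, None] * (r_end - r_start)[None, :]

def barriers_from_profile(energies):
    """Forward/backward barriers: highest point on the path relative to
    the start and end energies, respectively."""
    e_max = np.max(energies)
    return e_max - energies[0], e_max - energies[-1]

# toy energy profile along the transfer coordinate (placeholder values)
path = interpolate_h_path(np.zeros(3), np.array([2.5, 0.0, 0.0]), n=10)
E = np.array([0.0, 0.3, 0.9, 1.4, 1.6, 1.5, 1.1, 0.7, 0.4, 0.2])
fwd, bwd = barriers_from_profile(E)
print(fwd, bwd)  # forward ~1.6, backward ~1.4
```

In practice, each interpolated position would be evaluated with the reference method (or the trained ML potential) to obtain the profile.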
In total, this workflow yielded 172 042 reaction configurations, of which 45 724 correspond to linear interpolations between hydrogen donor and acceptor positions (see Section 2.1.2). The scaling law analysis (Section 3.2) enabled us to approximate the training set size required to achieve mean absolute errors (MAEs) below 40 meV (1 kcal mol−1) for reaction barrier predictions. Based on this analysis, we selected a subset of the xTB dataset for more accurate density functional theory (DFT) calculations. Energies and forces were recalculated at the bmk/def2-TZVPD level using Turbomole, resulting in a DFT dataset with 125 365 configurations, including the full set of 45 724 linear interpolations.
In total, we generated 126 318 unique radical systems. To create intramolecular HAT systems, we sampled normal-mode-sampling configurations equally across all molecule types (amino acids, capped/uncapped dipeptides). We considered all possible pairs of molecule types for intermolecular HAT systems, i.e., HAT between two amino acids, an amino acid and a dipeptide, and two dipeptides (capped and uncapped), all equally weighted. From each radical system, we generated one reaction configuration by randomly sampling a hydrogen position rH using the method described in Section 2.1. Our preliminary tests showed that including a larger variety of systems improves model generalization more effectively than including multiple configurations per system. As a result, only one configuration (either a start, end, or intermediate hydrogen position) was retained per system, yielding a dataset of 126 318 single-point reaction configurations. System sizes ranged from 15 atoms (e.g., uncapped single amino acids) to approximately 130 atoms (e.g., capped dipeptide–dipeptide pairs). Energy and distance distributions are provided in the SI (see SI Fig. 2). For training and evaluation, the xTB dataset was split into subsets while preserving the distribution of system types (intra- vs. inter-HAT and molecule combinations). The maximum training set size used was 112 191. Detailed dataset statistics and splits are provided in Table 1 in the SI.
661 small configurations
| Model | Energy MAE (meV) | Force MAE (meV Å−1) | Barrier MAE (meV) |
|---|---|---|---|
| SchNet (≤50 atoms) | 100 | 74 | 100 |
| Allegro (≤50 atoms) | 60 | 45 | 58 |
| MACE (≤50 atoms) | 50 | 33 | 47 |
| SchNet (>50 atoms) | 234 | 71 | 146 |
| Allegro (>50 atoms) | 120 | 44 | 94 |
| MACE (>50 atoms) | 100 | 31 | 66 |
104 and 24 620 configurations, respectively. These datasets were not used for training but only for the evaluation of indirect barrier predictions.
The DFT dataset comprises 79 641 single-point configurations recalculated from the xTB reaction-configuration dataset, plus the linear interpolations. For consistency, we preserved the same configuration indices and data splits across both theory levels. Distribution plots and statistics are provided in SI Table 1.
The full linear-interpolation set (45 724 configurations) was recalculated at the DFT level, preserving the same system identities and splits (1861 training, 2164 test). As shown in Fig. 3, the DFT barriers are consistently higher than those calculated by xTB, indicating that xTB systematically underestimates HAT barrier heights in peptides (see SI Fig. 3).
661 DFT configurations with fewer than 50 atoms and tested on a dataset of 34 853 configurations with more than 50 atoms. Performance was also measured on a 3411-configuration test set of small systems for comparison. All models exhibit limited transferability to larger systems in terms of energy and barrier MAEs (Table 1). Force MAEs, however, remain stable or even improve with increasing system size (Fig. 5b). MACE again outperforms both Allegro and SchNet across all metrics. A deeper analysis of MACE across system sizes shows that energy MAEs increase with system size while force MAEs decrease, reflecting the additive nature of energy prediction errors and stable local force accuracy. For intermediate-sized systems (60–70 atoms), we observe a slight peak in per-atom energy MAEs, followed by a decrease for even larger systems. No clear trend is visible for barrier estimations based on the energy predictions (SI Fig. 5).
All models were trained on 65 514 DFT configurations and tested on 6836 unseen configurations. Reaction barriers were derived from direct energy predictions for 2164 HAT test systems, comprising 24 620 single-point evaluations. MACE achieves the best performance, with the lowest energy (68 meV), force (28 meV Å−1), and barrier (49 meV) MAEs (Table 2). Despite higher errors in energy predictions, all models exhibit lower errors in barrier predictions, likely due to systematic error cancellation when a model consistently over- or underestimates energies along a path. In some cases, predicted energy profiles closely match DFT reference values (Fig. 7a), while in others, consistent over- or underestimation across the pathway still yields accurate relative energies and barriers (Fig. 7b).
Table 2 Model performance for models trained on 65 514 DFT configurations and tested on 6836 unseen configurations and 2164 barrier evaluations
| Model | Energy MAE (meV) | Force MAE (meV Å−1) | Barrier MAE (meV) |
|---|---|---|---|
| SchNet | 97 | 58 | 96 |
| Allegro | 79 | 36 | 60 |
| MACE | 68 | 28 | 49 |
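The error-cancellation effect noted above can be illustrated with synthetic numbers: a constant per-system offset inflates the energy MAE but largely cancels when barriers are computed as energy differences along a path. All values below are synthetic and chosen only to make the mechanism visible.

```python
import numpy as np

rng = np.random.default_rng(3)

# toy reference energy profiles for a few HAT paths (10 points each)
E_ref = rng.normal(size=(5, 10))

# model predictions: per-system constant offset plus small profile noise
offsets = rng.normal(scale=0.5, size=(5, 1))   # systematic, per system
noise = rng.normal(scale=0.01, size=(5, 10))   # small error in profile shape
E_pred = E_ref + offsets + noise

def barrier(E):
    """Forward barrier per path: maximum relative to the start energy."""
    return E.max(axis=1) - E[:, 0]

energy_mae = np.abs(E_pred - E_ref).mean()
barrier_mae = np.abs(barrier(E_pred) - barrier(E_ref)).mean()
print(energy_mae > barrier_mae)  # offsets cancel in the barrier differences
```

This is the same mechanism that makes the collagen-snapshot outliers discussed later dominate the energy MAE while barely affecting barrier errors.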
Barrier analysis across 74 paired full (parent) and reduced (child) peptide systems reveals that the parent and child barrier distributions largely overlap. The median child–parent shift is 0, and mean per-pair deviations are small (0.08 eV). The same conclusion holds when splitting into intra vs. inter subsets. Smaller capped systems that keep only the residues directly connected to the donor and acceptor yield barrier estimates comparable to those of the full systems, with only minor case-by-case differences.
Our trained MACE model achieves a barrier MAE of 76 meV over all systems, an energy MAE of 331 meV, and a force MAE of 47 meV Å−1. After excluding 22 outlier systems, the barrier MAE decreases to 72 meV, the energy MAE to 90 meV, and the force MAE to 38 meV Å−1. The flagged systems all show a near-constant energy offset along the reaction path, indicating that most of their energy error stems from system-specific constant shifts rather than from errors in the shape of the reaction profile. Compared to the synthetic data, the errors are higher in absolute terms, as expected from an OOD test, but barriers remain comparatively robust due to cancellation of systematic offsets within each trajectory. Larger parent and smaller child systems exhibit similar barrier MAEs (76 meV each), indicating that the model retains its barrier accuracy even when the environment is increased, as long as the local H-transfer motif is preserved. On realistic systems derived from collagen I snapshots, our trained MACE model thus preserves barrier shapes well and achieves near-in-distribution barrier accuracy. Once outliers with per-system offsets are removed, it also approaches near-in-distribution energy and force errors.
MACE consistently outperforms the other models regarding energy, force, and barrier MAEs but is also the most computationally demanding. Allegro achieves slightly lower accuracy and has comparable training costs. SchNet trains quickly but suffers from higher prediction errors, especially when both the dataset size and the training compute budget are large. Models trained on xTB data achieve lower errors than their DFT-trained counterparts, suggesting that xTB PESs are inherently easier to learn, at least for the HAT/peptide systems investigated here; the approximations underlying xTB likely yield a less complex PES than DFT. DFT-trained models may better reflect physical reality despite higher energy and force errors. As dataset sizes increase, this difference becomes more pronounced, particularly for MACE, indicating that DFT PESs likely require more complex models or additional model capacity (e.g. higher-body terms). Though hyperparameter searches were conducted for both data types, they were limited to small datasets due to training costs. Improved performance on DFT data might be achievable with more extensive hyperparameter tuning on larger datasets.
Our MD experiments demonstrate that the learned MACE–DFT potential is dynamically stable and supports reactive HAT at finite temperature. Under unbiased NVT dynamics, spontaneous HAT events occurred in a minority of test systems, whereas physically motivated steering increased the event frequency. This suggests that the model could, in principle, serve as a starting point for enhanced-sampling calculations of HAT free energies or for further refinement toward broader coverage of reactive peptide configurations. The steering coordinate q used here could be reused directly as the collective variable for umbrella windows or as the bias in well-tempered metadynamics. However, enhanced sampling can deliberately drive the system into high-energy and thus rarely sampled regions, where the current model may be forced to extrapolate. Quantitative free-energy estimates are therefore not validated by the present MD tests alone and might require active learning. Some regions of configuration space, e.g. near transition states, are expected to be poorly sampled. Incorporating an active-learning strategy that monitors model uncertainty during MD and triggers targeted DFT re-labeling in those regions could improve sampling, reduce bias, and accelerate convergence of the free-energy estimates.
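As a sketch of how such a coordinate could feed into umbrella sampling, the snippet below uses a difference-of-distances coordinate and a harmonic window restraint. Both the definition of q and the force constant are assumptions for illustration; the exact steering protocol is not reproduced here.

```python
import numpy as np

def cv_q(r_D, r_H, r_A):
    """Difference-of-distances collective variable, q = |r_H - r_D| - |r_H - r_A|.
    (A common choice for HAT; the exact definition of q is an assumption here.)"""
    return np.linalg.norm(r_H - r_D) - np.linalg.norm(r_H - r_A)

def harmonic_bias(q, q0, k=50.0):
    """Umbrella-window restraint: energy U(q) and its derivative dU/dq."""
    return 0.5 * k * (q - q0) ** 2, k * (q - q0)

# donor and acceptor fixed; hydrogen placed along the D-A axis per window
r_D, r_A = np.zeros(3), np.array([3.0, 0.0, 0.0])
for q0 in np.linspace(-1.0, 1.0, 5):   # umbrella window centers along q
    r_H = np.array([1.5 + q0 / 2, 0.0, 0.0])
    U, dU = harmonic_bias(cv_q(r_D, r_H, r_A), q0)
    print(round(U, 6))  # zero when the window sits at its own center
```

In a real run, dU/dq would be projected onto atomic positions via the chain rule and added to the ML forces at each MD step.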
Energy prediction transferability to larger systems is challenging for all model architectures we investigated, likely because global, extensive properties such as the total energy suffer from additive errors. The relatively stable or even improved force MAEs suggest that the models generalize well locally, and that force predictions benefit from larger local environments or averaging effects in bigger systems. For very large systems, long-range interactions beyond the chosen cutoff may become important. Because SchNet, Allegro, and MACE are primarily local (finite-cutoff) models, such contributions cannot be fully captured unless the architecture is augmented36–38 or the cutoff substantially increased.
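The additive-error argument can be made concrete with a toy model: even fully independent per-atom energy errors make the total-energy MAE grow like √N, and correlated errors grow faster, up to linearly in N. The numbers below are synthetic and only illustrate the scaling.

```python
import numpy as np

rng = np.random.default_rng(4)

def total_energy_mae(n_atoms, sigma=0.01, n_samples=2000):
    """MAE of a total energy assembled from independent per-atom errors."""
    per_atom_err = rng.normal(scale=sigma, size=(n_samples, n_atoms))
    return np.abs(per_atom_err.sum(axis=1)).mean()

# system sizes spanning the range studied here (15 to ~130 atoms)
maes = {n: total_energy_mae(n) for n in (15, 60, 130)}
print(maes[15] < maes[60] < maes[130])  # total-energy error grows with size
```

The per-atom force error, by contrast, has no such accumulation, which is consistent with the stable force MAEs observed for larger systems.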
Compared to our in-distribution tests, the collagen I MD-snapshot evaluation is an explicit out-of-distribution stress test introducing finite-temperature strain, context truncation, and a shifted geometry distribution. Our generated training data spans all residues in idealized contexts, whereas the collagen snapshots sample a subset of chemistries in specific packed environments and lie away from minima, with thermally distorted bonds, angles, and torsions. Despite this mismatch, our model achieves a barrier MAE of 76 meV on the full extracted snapshot set, and overall barrier shapes and TS locations are retained. Our analysis of reaction barriers in parent/child systems indicates that barriers remain essentially unchanged when trimming the environment. Larger parent and smaller child systems exhibit similar barrier MAEs, highlighting size transferability and showing that we can, in practice, use smaller capped systems as long as the local H-transfer motif and its immediate contacts are preserved. A small number of matched parent–child systems were flagged as outliers, all of which exhibit a nearly constant offset between predicted and DFT energies along the reaction path. These outliers occur in both intramolecular and intermolecular HAT motifs and span neutral as well as charged residue combinations, indicating that they are not confined to a single topology or simple residue type. Instead, the model reproduces the barrier shape for these cases but assigns an incorrect absolute energy reference to the capped fragment, consistent with a fragment-specific calibration error rather than a failure to describe the reaction coordinate itself. In practice, these cases dominate the global energy MAE but have limited impact on barrier errors, since the constant offset largely cancels along the path.
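The offset-cancellation argument can be made concrete with a toy example (all energy values here are hypothetical): a constant shift along the reaction path inflates the energy MAE but leaves the barrier, a difference of energies, unchanged:

```python
import numpy as np

# Toy reaction path: reference (DFT-like) energies in eV at
# reactant, transition state, and product (hypothetical numbers).
e_ref = np.array([0.00, 0.80, 0.10])

# Model prediction with a fragment-specific constant offset of 0.30 eV.
e_ml = e_ref + 0.30

energy_mae = np.mean(np.abs(e_ml - e_ref))   # large: equals the offset
barrier_ref = e_ref[1] - e_ref[0]
barrier_ml = e_ml[1] - e_ml[0]
barrier_err = abs(barrier_ml - barrier_ref)  # ~0: offset cancels in the difference
```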
Applying MACE at the collagen scale could be made possible using hybrid schemes rather than single end-to-end models. One option could be ML/MM embeddings that treat only the reactive HAT region at ML accuracy, while the protein/solvent environment remains classical. Alternatively, the ML potential could be used as a barrier emulator within large-scale kinetics (e.g., kinetic Monte Carlo39,40) driven by on-the-fly computed ML barriers along representative pathways. For small datasets (<10 k configurations), SchNet offers a practical trade-off between training time and accuracy. If computational resources are limited, it can therefore be more efficient to train SchNet on more data for fewer GPU hours (Fig. 6b). For larger datasets (>30 k configurations), MACE becomes more advantageous due to its data efficiency, despite longer training times. Allegro falls in between in terms of both cost and performance. All models were trained on a single GPU for consistency, but MACE in particular supports parallel training, which could significantly reduce training time. All models achieve more accurate barrier predictions than direct energy predictions, likely due to error cancellation within reaction pathways. Overall, our results show that machine-learned potentials can accurately predict DFT-level reaction barriers from direct energy predictions, with MACE providing the most reliable and generalizable performance across the tasks considered. In our tests, we used linear interpolations for barrier estimation, which would need to be refined via geometry optimization and transition-state searches to obtain more accurate barriers. However, our scheme of indirectly predicting reaction barriers should work equally well with refined reaction configurations. More accurate barriers could also be obtained via transition-state searches using the trained models directly, since they provide a cheaper way to obtain Hessians via auto-differentiation.
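As a sketch of the barrier-emulator idea, a single rejection-free kinetic Monte Carlo step driven by Arrhenius rates from ML-predicted barriers might look as follows; the barrier values and prefactor are hypothetical placeholders:

```python
import math
import random

KB_EV = 8.617333262e-5  # Boltzmann constant in eV/K

def arrhenius_rate(barrier_ev, temperature=300.0, prefactor=1e13):
    """Rate from a (hypothetically ML-predicted) barrier via Arrhenius."""
    return prefactor * math.exp(-barrier_ev / (KB_EV * temperature))

def kmc_step(barriers_ev, temperature=300.0, rng=None):
    """Standard rejection-free KMC: pick one event proportionally to its
    rate and draw an exponentially distributed waiting time."""
    rng = rng or random.Random(0)
    rates = [arrhenius_rate(b, temperature) for b in barriers_ev]
    total = sum(rates)
    r = rng.random() * total
    acc, chosen = 0.0, len(rates) - 1
    for i, k in enumerate(rates):
        acc += k
        if r < acc:
            chosen = i
            break
    dt = -math.log(rng.random()) / total  # waiting time until the event
    return chosen, dt

# Hypothetical ML-predicted barriers (eV) for three candidate HAT events.
event, dt = kmc_step([0.45, 0.60, 0.52])
```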
As MACE provides differentiable energy landscapes, it could also be integrated into such optimization schemes. Further tests would be needed to ensure accurate Hessian predictions, but initial investigations already suggest that this works well.41 Finally, combining our pretrained models with transition-state search algorithms is well suited to an active learning approach that retrains models on relevant PES regions for HAT reactions.
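A minimal sketch of obtaining forces and a Hessian by auto-differentiation is shown below; a toy harmonic energy stands in for the trained MACE potential, and PyTorch is used purely as an illustration of the mechanism:

```python
import torch

def toy_energy(coords):
    """Stand-in for a differentiable ML potential (e.g. MACE):
    a simple harmonic energy on consecutive atom pairs."""
    diff = coords[1:] - coords[:-1]
    return 0.5 * (diff ** 2).sum()

coords = torch.randn(5, 3, requires_grad=True)
energy = toy_energy(coords)

# Forces are minus the gradient of the energy w.r.t. the coordinates.
forces = -torch.autograd.grad(energy, coords)[0]

# Full Cartesian Hessian (15 x 15) from a second differentiation pass,
# as needed for transition-state searches and vibrational analysis.
hessian = torch.autograd.functional.hessian(
    lambda x: toy_energy(x.view(5, 3)), coords.detach().reshape(-1)
)
```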
MACE consistently outperforms SchNet and Allegro in energy, force, and reaction barrier prediction accuracy, albeit at a higher computational cost. Our best MACE model achieved a mean absolute error of 1.13 kcal mol−1 in indirectly predicting DFT-calculated HAT reaction barriers, substantially improving upon previous machine learning approaches. This accuracy is critical for reliable reaction rate predictions, as errors in barriers propagate exponentially into rate constants. Our approach leverages direct energy predictions to model complex PESs and estimate reaction barriers without the need for explicit transition-state data, offering a generalizable alternative to direct barrier prediction schemes. Our analysis of scaling laws, transferability, and training costs highlights the importance of balancing model complexity, dataset size, and computational resources. While xTB PESs are easier to learn and allow fast prototyping, they may limit generalization to high-accuracy regimes. In contrast, DFT-trained models better reflect physical reality and are essential for robust and transferable predictions.
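The exponential propagation of barrier errors into rates can be quantified with a simple Arrhenius estimate: a barrier error ΔE rescales the rate by exp(ΔE/kBT), so the 1.13 kcal mol−1 MAE corresponds to roughly a factor of 6–7 in the rate constant at 300 K:

```python
import math

R_KCAL = 1.987204259e-3  # gas constant in kcal/(mol*K)

def rate_error_factor(delta_barrier_kcal, temperature=300.0):
    """Multiplicative rate error caused by a barrier error of the given size,
    assuming a simple Arrhenius rate law."""
    return math.exp(delta_barrier_kcal / (R_KCAL * temperature))

factor = rate_error_factor(1.13)  # barrier MAE reported in this work
```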
The trained models show good transferability of force predictions across system sizes, though total energy errors grow with system size, likely due to a lack of out-of-distribution generalization to larger systems and missing long-range interactions.
Our short MD experiments indicate that the learned potential remains numerically stable under finite-temperature dynamics and can sustain reactive sampling under moderate biasing, supporting its feasibility as an engine for exploratory HAT sampling. Moreover, because reaction barriers can be extracted from the learned PES, the trained model can serve as an emulator in large-scale simulation workflows, for example, as input to kinetic Monte Carlo or hybrid MC/MD schemes that invoke reactive events. Building on this, umbrella sampling along the H-transfer coordinate or metadynamics on the same CV are promising next steps toward HAT free-energy profiles at MACE cost. However, quantitative free-energy estimates will require additional validation because biasing may drive the system into regions outside the training distribution. Coupling enhanced sampling to an active-learning loop that identifies high-uncertainty configurations for selective DFT relabeling provides a practical route to improve reliability and obtain accurate free-energy predictions. We evaluated the DFT-MACE potential on collagen I MD–derived HAT snapshots, which is an explicit out-of-distribution setting relative to our benchmark, to test transfer to a realistic peptide environment. Our MACE potential generalizes well beyond its synthetic training set. It preserves barrier shapes and TS locations and achieves barrier errors that remain within a few tens of meV of its in-distribution performance. The similar accuracy on larger parent and reduced child systems demonstrates size transferability, enabling the use of compact capped fragments without sacrificing barrier fidelity. The few remaining outliers are dominated by fragment-specific constant energy offsets rather than distorted reaction profiles, pointing to clear avenues for improvement via targeted retraining or simple fragment-level corrections that reduce system-specific shifts while retaining the already robust barrier predictions.
Transition-state optimization with ML-predicted Hessians could further refine barrier predictions. An active learning strategy that combines pre-trained models with automated transition state searches may help systematically improve accuracy and broaden the configurational diversity of training data.42
The presented workflow is not limited to HAT in peptides and can be translated to other reactive motifs in biomolecular fragments. More broadly, such PES-based models enable quantum-accurate, data-driven barrier prediction and reactive-event modeling that can be integrated into large-scale biomolecular simulation workflows.
Supplementary information (SI): dataset statistics, more information on the comparative analysis of the trained ML potentials and details on the performed molecular dynamics simulations. See DOI: https://doi.org/10.1039/d6dd00054a.
This journal is © The Royal Society of Chemistry 2026