Open Access Article
This Open Access Article is licensed under a Creative Commons Attribution-Non Commercial 3.0 Unported Licence

Data-driven parametrization of molecular mechanics force fields for expansive chemical space coverage

Tianze Zheng *a, Ailun Wang b, Xu Han a, Yu Xia a, Xingyuan Xu a, Jiawei Zhan b, Yu Liu b, Yang Chen a, Zhi Wang b, Xiaojie Wu b, Sheng Gong b and Wen Yan *b
aByteDance Research, Beijing, Beijing, 100098, China. E-mail: zhengtianze@bytedance.com
bByteDance Research, Bellevue, Washington 98004, USA. E-mail: wen.yan@bytedance.com

Received 30th September 2024 , Accepted 25th December 2024

First published on 31st December 2024


Abstract

A force field is a critical component in molecular dynamics simulations for computational drug discovery. It must achieve high accuracy within the constraints of molecular mechanics' (MM) limited functional forms, which offers high computational efficiency. With the rapid expansion of synthetically accessible chemical space, traditional look-up table approaches face significant challenges. In this study, we address this issue using a modern data-driven approach, developing ByteFF, an Amber-compatible force field for drug-like molecules. To create ByteFF, we generated an expansive and highly diverse molecular dataset at the B3LYP-D3(BJ)/DZVP level of theory. This dataset includes 2.4 million optimized molecular fragment geometries with analytical Hessian matrices, along with 3.2 million torsion profiles. We then trained an edge-augmented, symmetry-preserving molecular graph neural network (GNN) on this dataset, employing a carefully optimized training strategy. Our model predicts all bonded and non-bonded MM force field parameters for drug-like molecules simultaneously across a broad chemical space. ByteFF demonstrates state-of-the-art performance on various benchmark datasets, excelling in predicting relaxed geometries, torsional energy profiles, and conformational energies and forces. Its exceptional accuracy and expansive chemical space coverage make ByteFF a valuable tool for multiple stages of computational drug discovery.


1 Introduction

Drug discovery involves identifying potential therapeutic candidates within the vast and intricate landscape known as chemical space, which encompasses all possible molecular structures.1–3 As a pivotal tool in this process, molecular dynamics (MD) simulations offer insights into dynamical behaviors and physical properties of molecules, as well as interactions in the molecular systems at an atomic level.4–8 Central to the accuracy and reliability of these simulations is the force field, a mathematical model that describes the potential energy surface (PES) of the molecular system as a function of the positions of the atoms involved.9–12 Recent advances in synthetic chemistry and high-throughput screening technologies have significantly expanded the chemical space for drug candidates,13–17 which necessitates the development of force fields that can provide accurate predictions of PES for diverse molecules in expansive chemical space.

Generally, force fields can be classified into two main categories: conventional molecular mechanics force fields (MMFFs),9–12,18,19 which parameterize a fixed analytical form to approximate the energy landscape, and machine learning force fields (MLFFs),20–23 that aim to map the atomistic and molecular features and coordinates to the PES using neural networks without being limited by the fixed functional forms. Most conventional MMFFs, such as Amber,24,25 GAFF9 and OPLS,11,12,26 describe the molecular PES by decomposing it into various degrees of freedom, including bonded (i.e., bonds, angles, and torsions) and non-bonded interactions (i.e., electrostatics and dispersion). These conventional MMFFs benefit from the computational efficiency of these terms, while suffering inaccuracies due to the inherent approximation, especially when non-pairwise additivity of non-bonded interactions are of significant importance. Emerging recently, MLFFs have shown great promise for modeling PES due to their ability to capture subtle interactions and complex behaviors that may be overlooked or oversimplified by classical models.20–23 Despite their outstanding accuracies, several drawbacks limit their applications in drug discovery. Owing to the complexity of the machine learning models involved, the computational efficiency of MLFFs is relatively low. Meanwhile, the amount of data required to train an effective MLFF is extremely large, which imposes constraints on their ability to cover the chemical space comprehensively. Consequently, conventional MMFFs remain the most reliable and commonly used tool for MD simulations involving biological systems to this day.7,8

In the past few years, several efforts have been made to improve the quality of MMFFs for predicting the PES of small molecules. Following the traditional look-up table approach, OPLS3e increased the number of torsion types to 146[thin space (1/6-em)]669 to enhance accuracy and expand chemical space coverage.11 Furthermore, OPLS3e and its successor OPLS4 were empowered with FFBuilder11 to refine the torsion terms for molecules beyond the coverage of the pre-determined torsion list. OpenFF18,19,27 took a different approach by utilizing SMIRKS patterns to describe the chemical environment of both bonded and Lennard-Jones terms. However, these discrete descriptions of the chemical environment have inherent limitations that hamper the transferability and scalability of these force fields. In the past decade, ML techniques have been employed to improve the parameterization process of molecular force fields, including genetic algorithms for parameterizing polarizable force field,28 neural network potentials for improving dihedral angle potentials29 and Bayesian optimization for coarse-grained force fields.30 Such early attempts provide excellent proof-of-concepts of the potential of ML methods for improving the development of force fields. Building on this, Espaloma31,32 introduced a novel end-to-end workflow where the MMFF parameters are predicted by graph neural networks (GNN), opening new avenues for the advancement of MMFFs.33 Despite promising results, these early attempts are constrained by the ML techniques and the training data, where significant improvements can be achieved.

In this work, we propose ByteFF, a data-driven MMFF trained on a large-scale, high-diversity, and high-quality quantum mechanics (QM) dataset with sophisticated ML techniques. ByteFF is designed to leverage both atom and bond features using a state-of-the-art GNN model, while preserving molecular symmetry. To construct the dataset for training ByteFF, we employ novel fragmentation methods and a rigorous QM calculation workflow, generating 2.4 million optimized molecular fragment geometries with Hessian matrices, along with 3.2 million torsion profiles. Additionally, we introduce a differentiable partial Hessian loss and an iterative optimization-and-training procedure to effectively train ByteFF on the dataset. Finally, we demonstrate the performance of ByteFF on various benchmarks, showing its expansive chemical space coverage and exceptional accuracy on intra-molecular conformational PES.

2 Methods

2.1 Molecular mechanics force field

In this work, we follow the analytical forms in GAFF9 and OpenFF:27
 
EMM = EMMbonded + EMMnon-bonded(1)
 
image file: d4sc06640e-t1.tif(2)
where the bond lengths r, angles θ, proper torsion angles ϕ and improper torsion angles ψ are rotationally invariant internal coordinates. The force field parameters include bonded parameters (equilibrium values r0, θ0 and ϕ0, and force constants kr, kθ, kϕ and kψ), and non-bonded parameters (van der Waals (vdW) parameters σ and ε, and partial charges q). A well-trained transferrable MMFF shall predict these parameters accurately for any given molecules. In this article, “torsion” refers to proper torsion unless otherwise specified. We also fix the phase angles ϕijklnϕ,0 at 0 for odd nϕ and π for even nϕ, and therefore ensure that the torsional energy is independent of the order being ijkl or lkji.

Force field parameters should adhere to several physical constraints: (1) force field parameters should be permutationally invariant, e.g., the force constant of bond (i, j) should be equal to that of bond (j, i). (2) Force field parameters should be in accordance with chemical symmetries of molecules, e.g., force constants of the two C–O bonds in a carboxyl group (–C([O–])[double bond, length as m-dash]O) must be equal to each other, due to their chemical equivalency, even though they may have different bond orders assigned when written as a SMILES or SMARTS string.34 (3) Charge conservation should be guaranteed, i.e., the summation of assigned partial charges of atoms in one molecule should be consistent with the molecule's net charge, which avoids net charge gain/loss for the molecule in the parameter determination. These constraints are naturally satisfied in the traditional look-up table approaches, which should also serve as essential guidelines when force field parameters are inferred by a machine-learning model.

Additionally, there are two key philosophies worth considering when building general small molecule MMFFs: (1) force field parameters should be dominated by local structures, so that the parameters trained from small molecules can be consistently transferred to similar structures in relatively large molecules.9,11 (2) Torsional energy profiles should be accurately captured, as the quality of torsion parameters significantly affects the conformational distribution of small molecules, thereby influencing the prediction of properties such as protein-ligand binding affinity.11,35,36

2.2 Dataset construction

2.2.1 Molecular fragments generation. The curated dataset is mostly built from the ChEMBL database37 with some additions from the ZINC20 database38 to further enhance the diversity. From these datasets, a subset of molecules was initially selected by several criteria, including number of aromatic rings, polar surface area (PSA), quantitative estimate of drug-likeness (QED), element types, and hybridization types. These selected molecules were then cleaved into fragments with less than 70 atoms using our in-house graph-expansion algorithm (details in ESI), such that the local chemical environments were well-preserved.11,39 In brief, this fragmentation algorithm traverses over each bond, angle, and non-ring torsion in a molecule, retains the relevant atoms and their conjugated partners, then trims the rest and caps the cleaved bonds. Next, these fragments were expanded to various protonation states within a pKa range of 0.0 to 14.0, calculated by Epik 6.5,40 to cover most possible protonation states that might appear in aqueous solutions. Finally, 2.4 million unique fragments were selected for QM calculations after deduplication.
2.2.2 Quantum chemistry methods and workflow. From the 2.4 million fragments, we created two QM datasets, namely the optimization dataset and the torsion dataset, at the B3LYP-D3(BJ)/DZVP level of theory. The same QM method is also used by OpenFF to generate their training data.18,19 This method achieves a good balance between accuracy (relative to CCSD(T)/CBS) and computational cost in recent benchmarks.41,42 For molecular conformational PES, more advanced methods such as ωB97M-V are significantly more expensive but only marginally more accurate.

The optimization dataset was generated from the entire 2.4 million fragments. The 3D conformations of each fragment was initially generated by RDKit from its SMILES string, and then optimized using the geomeTRIC43 optimizer at the chosen QM level. The relaxed structure is verified where no accidental bond breaking or formation happened during the structural relaxation. Finally, the Hessian matrix is calculated for each fragment using Q-Chem 6.1.44 The hessian matrix is further verified by checking all the eigenvalues (except for six near-zero values corresponding to translational and rotational modes) to ensure that a true local minimum is captured. More details about the workflow and the filtering criteria are provided in the ESI.

The torsion dataset consists of two subsets, non-ring torsions and in-ring torsions, curated separately, comprising 2.2 million and 1.0 million frames respectively. For each non-ring torsion fragment, we first rotate the torsion angle in 15° increments from the optimized conformation, creating 24 initial frames that were then separately relaxed using the geomeTRIC optimizer with the rotated torsion angle constrained. While the in-ring torsions were scanned using a sequential frame-by-frame approach, with an early stopping strategy if the conformational energy is more than 20 kcal mol−1 higher than the unconstrained relaxed conformation, since those high-energy regimes are irrelevant in room-temperature molecular dynamics simulations. For both non-ring and in-ring subsets, the resultant conformers were also carefully filtered by the bond breaking or formation criteria. Due to the high computational cost of torsion scan, these calculations were performed with GPU4PySCF45,46 for significant acceleration and cost reduction. The results were also verified to be consistent with Q-Chem.

2.2.3 Diversity of chemical space. Evaluating the diversity of chemical space composed of numerous molecules often involves representing molecules with descriptors or fingerprints followed by measuring of their distances. One widely used fingerprint is the Morgan fingerprint, which captures the local environment of each atom within a specified radius defined by the number of covalent bonds. To emphasize the local structure of torsions, Morgan fingerprints were calculated with a radius of 2, focusing exclusively on the neighborhoods of the two central atoms in the selected torsion. This torsional fingerprint was used to evaluate the diversity of different datasets, including SPICE, GEOM, and our torsion dataset, as well as in the creation of a diverse torsion dataset (BDTorsion).

2.3 ByteFF model

ByteFF predicts both the bonded and non-bonded force field terms in one-pass. It leverages the state-of-the-art Edge-augmented Graph Transformer (EGT)47 as its backbone to achieve accuracy and transferability. The output of EGT is further processed by a series of MLPs to derive force field parameters while preserving both the total charge and the physical symmetry of each force field term. The ML model structure is shown in Fig. 1 and consists of three major parts: featurization, GNN, and output modules. They are briefly described here, and more details are given in the ESI.
image file: d4sc06640e-f1.tif
Fig. 1 Model structure of ByteFF. ByteFF predicts MMFF parameters in three steps. First, atom and bond features are extracted from the molecular graph and then projected into embeddings. Then, an edge-augmented graph transformer (EGT)47 is used to synergize the edge embeddings with the node-based attention mechanism. Lastly, the output module derives force field parameters while preserving the molecular symmetry and total charge.
2.3.1 Featurization module. For each molecule, we first convert the molecule structure into a graph, and detect chemically equivalent atoms and bonds with bond orders ignored. Atomic features are then detected from this graph, including the element type, formal charge, ring connectivity and minimum ring size. Additionally, to better describe the chemical environment of each atom, bond features including bond order and whether the bond is in ring, are also extracted. Replacing the discrete atom-typing rules used in conventional MMFFs, atom and bond features are projected into continuous vectorial embedding spaces, and concatenated into node and edge feature embeddings, respectively. Importantly, embeddings of chemically equivalent atoms and bonds are averaged before passed to the next module, which helps to preserve the chemical symmetry of the molecule.
2.3.2 GNN module. In the GNN module, a modified EGT architecture is employed to learn the chemical environment of molecules from both atom and bond features. This architecture synergizes the node and edge features in the attention mechanism to efficiently capture the structural information of the molecule. To ensure locality and speed up training and inference, the original global self-attention in EGT is modified such that attention is applied only to a local neighborhood of each node.
2.3.3 Output module. After a few layers of localized EGT, atom and bond hidden states generated by the GNN module are passed to the output module to predict each force field term individually. For each bonded term, the corresponding output module is designed to preserve the necessary symmetry, respectively. For the partial charge term, the output module is designed to preserve the total charge of the molecule via a bond-charge-correction (BCC) style postprocessing.

Additionally, an ensemble of five models randomly initialized is trained for improved predictive performance and uncertainty quantification.

2.4 Training procedures

To efficiently use the curated optimization and torsion datasets, while enhancing the robustness and performance of the ByteFF model, we took an ingenious training strategy with three stages: pre-training, training, and fine-tuning. The loss functions (details in ESI) were carefully designed in each stage, such that the improvement of the force field was properly reflected by the minimization of loss functions. In each stage, distinct loss function forms were chosen for individual terms in the analytical form of MMFF (eqn (2)), and then combined to form the overall loss function for the stage.

In the pre-training stage, the optimization dataset were used as the training set. In this stage, the non-bonded parameters were fitted to GAFF-2.2 vdW parameters and AM1-BCC charges with the mean squared error (MSE) losses. Force constants of proper torsions were also fitted to the GAFF-2.2 values with MSE losses, and the equilibrium values of bonds and angles were refined with energy-based loss functions. Meanwhile, the force constants of bonded terms in eqn (2) were trained using a partial Hessian loss, evaluated by the mean absolute percentage error (MAPE) of Hessian blocks corresponding to bonds, angles and improper torsions, namely partial Hessian blocks. It has been reported that the partial Hessian blocks can be used to derive accurate force constants for single molecules.48–50 Our design of the differentiable partial Hessian loss enables fitting force constants in batches, combining training accuracy with computational efficiency.

Given the significance of torsion profiles in the quality of force fields, in the training stage, we incorporated the curated torsion dataset to fit the force constants of proper torsions, while the other parameters were trained using optimization dataset following the same manner in the pre-training stage. Here, we employed the Boltzmann MSE loss functions51 into an iterative optimization-and-training process to train the force constants of proper torsions using the torsion dataset. In each iteration, the QM-optimized geometries were refined by the force field, with the torsion angle constrained and atom positions restrained.18 The parameters were then trained on both the optimization and torsion datasets. As the force field's accuracy improved, the positional restraint force constant was gradually reduced. Additionally, L1-norm regularization was applied to the force constants of proper torsions to restrain redundant degrees of freedom. The combination of pre-training and training stages yields the ByteFF-gopt model.

In the final fine-tuning stage, we incorporated part of the training set in Espaloma-0.3.0,32 named off-equilibrium dataset in this work, to refine the force field parameters with the QM energy and forces. Combining all the three stages (pre-training, training, and fine-tuning), the ByteFF-joint model was obtained.

The detailed settings of the three-stage training procedure are summarized in Table 1.

Table 1 Summary of training settings for different stages
Pre-training Training Fine-tuning
Model parameters initialized Randomly initialized parameters Parameters from pre-training stage Parameters from training stage
Datasets and fitting targets Optimization dataset: Optimization dataset: Optimization dataset:
- GAFF-2.2 σ/ε/q/kϕ - GAFF-2.2 σ/ε/q same as training
- EMMbond/EMMangle/EMMimproper - EMMbond/EMMangle/EMMimproper Torsion dataset:
- QM hessian matrix - QM hessian matrix same as training
Torsion dataset: Off-equilibrium dataset:
- QM energy - QM energy/force
- L1-norm of kϕ
Optimization-and training strategy No force field optimization Three optimization-and-trainiterations on torsion dataset Using coordinates from the last iteration of training stage
Optimizer RAdam RAdam RAdam
Learning rate 10−4 10−4 2 × 10−5
Scheduler ReduceLROnPlateau ReduceLROnPlateau ReduceLROnPlateau
Factor: 0.2 Factor: 0.2 Factor: 0.2
Patience: 10 Patience: 4 Patience: 4
Threshold: 10−6 Threshold: 10−6 Threshold: 10−6
Early stop patience 50 10 10
Yield ByteFF-gopt ByteFF-joint


2.5 Benchmark datasets and metrics

To be compatible with the Amber family of protein force fields, the non-bonded terms of ByteFF were trained to reproduce the AM1-BCC charges52,53 and GAFF-2.2 vdW parameters. Therefore, in this work we focus on the intramolecular PES performance relative to QM references of ByteFF, and leave the further improvement of non-bonded terms as a future work.

Three different types of benchmarks were used to provide a thorough assessment of ByteFF. First, the large-scale industrial collaborative OpenFFBenchmark public dataset54 was used to quantify how well the force field-relaxed geometries and energies reproduced the QM-relaxed references, without constraints. Three different metrics were used, namely root-mean-square deviation of atomic positions (RMSD), torsion fingerprint deviation (TFD),55 and relative energy difference (ΔΔE) defined by OpenFF team.56 Second, the crucial torsional energy profile performance was benchmarked on two different datasets, namely the publicly available TorsionNet500 dataset57 and the BDTorsion dataset curated in-house. For consistency, energies of conformations in TorsionNet500 (ref. 57) dataset were re-calculated with the B3LYP-D3(BJ)/DZVP level of theory. On these datasets, the root mean squared error (RMSE) and Boltzmann-weighted RMSE of each torsional energy profile were calculated and compared.

Beyond the constrained (torsional energy) and unconstrained (OpenFFBenchmark) optimization benchmarks, the performance of ByteFF was further benchmarked on the QM references energies and forces calculated on the off-equilibrium dataset, which includes subsets of the SPICE dataset58 and the RNA Structure Atlas.59 On this dataset, RMSE of energies and forces were calculated to quantify if the PES is properly captured beyond local minima.

Furthermore, due to the significance of validating the accuracy of ByteFF in MD simulations, metadynamics (MetaD) simulations60,61 were conducted on several drug-like fragments to compare the conformational free energy surface (FES) predicted by various MMFFs to the reference predicted by QM.62,63 In MetaD simulations, SPONGE (Simulation Package tOward Next GEneration molecular modelling)64,65 was used as the simulation engine, employing the GPU4PySCF and OpenMM66 as the backends for QM and MMFF calculations, respectively.

Rigorous procedures were followed in the benchmarking of ByteFF models, which ensured no data leakage between the training and testing data. More details of these benchmarks can be found in ESI.

3 Results and discussion

3.1 Diversity of various datasets

Using the Morgan-based torsional fingerprint defined above, we quantify the diversity of SPICE, GEOM and BDTorsion datasets in terms of torsional local environment. Specifically, in all three datasets, only torsions with circular standard deviation67 greater than 0.3 were included and used in fingerprint analysis. The diversity of each dataset was then visualized using the standard t-SNE (t-distributed stochastic neighbor embedding) algorithm. In Fig. 2, the scatter dots are colored by the element types of the two center atoms (the rotatable bond) in each torsion. The most frequent rotatable bond is C–C, followed by C–N rotatable bonds and other bonds including C–O, C–S, etc. It is evident that our torsion dataset excels in the diversity and comprehensiveness of its chemical space coverage, while the SPICE and GEOM datasets are more sparse. In comparison, both SPICE and GEOM datasets show obvious vacancies, corresponding to the absence of certain chemical torsion patterns. For example, the uncovered areas around (−45, −20) in Fig. 2(a) and (b) correspond to in-ring torsions of protonated pyridine and imidazole, as well as their derivatives, which are indispensable for the training of force fields for bio-organic molecules. Such an expansive coverage of the chemical space in the training dataset provides a solid foundation for the coverage and transferability of ByteFF.
image file: d4sc06640e-f2.tif
Fig. 2 t-SNE analysis of different datasets, (a) SPICE, (b) GEOM and (c) ByteFF torsion dataset. The Morgan-based torsional fingerprint analysis results are illustrated using the t-SNE algorithm. Every scatter dot corresponds to a torsion profile being analyzed in the corresponding dataset, which is colored by the element types of the two center atoms in the torsion.

3.2 Torsional potential energy surfaces

To evaluate the accuracy of ByteFF in predicting torsional PES, and directly compare with the performance of other force fields, we performed comprehensive benchmarks using the TorsionNet500 and BDTorsion datasets. TorsionNet500 (ref. 57) is a benchmark dataset consisting of 500 chemically diverse fragments relevant to biological and pharmaceutical applications. The BDTorsion dataset is curated in this work. It is divided into non-ring and in-ring torsion subsets and each contains 1000 chemically diverse fragments. For both TorsionNet500 and BDTorsion datasets, dozens of conformations are sampled by scanning torsion angle, then QM-relaxed, and tagged with QM energy labels for each fragment. Here, two metrics were used to evaluate the accuracy of the predictions from various force fields with respect to the ground truth from QM: the RMSE and Boltzmann RMSE, with RMSE assessing the overall discrepancy while Boltzmann RMSE emphasizing the deviations near the energy minimum. When calculating the Boltzmann RMSE, we used 2.0 kcal mol−1 for both cutoff and scaling factors.

As illustrated in Fig. 3, both ByteFF-gopt and ByteFF-joint significantly outperform competitors on all three benchmark datasets. On all the three datasets, ByteFF-gopt exhibits the best performance on both metrics, with the highest population of predictions with lower RMSE and Boltzmann RMSE. When tested on TorsionNet500, the Boltzmann RMSE of most predictions by ByteFF-gopt are within 0.5 kcal mol−1, while none of the Boltzmann RMSE exceeds 1.33 kcal mol−1, demonstrating the exceptional accuracy of ByteFF-gopt in terms of predicting torsional energy profiles. On the more challenging BDTorsion datasets, both RMSE and Boltzmann RMSE are generally higher for all the force fields being tested. Nevertheless, ByteFF-gopt and ByteFF-joint still outperformed competitors significantly, with most Boltzmann RMSE values below 1.0 kcal mol−1 (chemical accuracy). In these torsion profile benchmarks, the performance of ByteFF-joint is compromised by the addition of off-equilibrium energies and forces training data. This is inevitable due to the very limited fixed functional form of MMFFs.


image file: d4sc06640e-f3.tif
Fig. 3 Density distribution of the discrepancy of torsional PES between predictions of QM and force fields. As a comprehensive benchmark, two metrics including Boltzmann RMSE and RMSE were used to assess the accuracy of force field-predicted torsional energy profiles with respect to the QM results. Three datasets were included in this benchmark: TorsionNet500 (a and b), BDTorsion-NonRing (c and d), and BDTorsion-InRing (e and f). The density distributions were smoothed using kernel density estimation (KDE).

One key feature of organic molecules is the presence of a great variety of rings,68 and in-ring torsion parameters have crucial impact on the conformational PES of rings. However, this is much less discussed in the development of MMFFs. In this work, we carefully constructed the in-ring torsion scan training dataset to improve the performance of ByteFF on in-ring torsions. As shown in Fig. 3(e) and (f), both ByteFF-gopt and ByteFF-joint significantly outperform competitors, with both RMSE and Boltzmann RMSE mostly better than chemical accuracy, respectively. As an example to illustrate the accuracy of ByteFF-gopt and ByteFF-joint in predicting in-ring torsional energy profile, we present a molecule from the BDTorsion-InRing dataset that contains a four-membered aliphatic ring (Fig. 4(a)). On this molecule, we calculated the torsional energy profile for the in-ring torsion atoms highlighted in Fig. 4(a), and compared PES predictions from various force fields with the QM references (Fig. 4(b)). It is evident that both Espaloma-0.3.0 (green) and OpenFF-2.0.0 (blue) failed to predict the energy landscape in the vicinity of the local minimum, which could lead to incorrect predictions of ring conformations in geometric optimization or MD simulations. Though GAFF-2.2 captured the approximate shape of this torsional energy landscape, it overestimated the barrier near −130° while significantly underestimated the energy near −160°, which led to significantly incorrect shifts of the predicted location of global energy minimum. In contrast, both ByteFF-gopt and ByteFF-joint predictions align well with QM references, which not only captured the shape of the energy landscape, but also accurately predicted the positions and energy differences of local minima. In addition to accurately predicting in-ring torsional energy profiles, ByteFF models also excel in the prediction of non-ring torsions, as illustrated with the example molecule in Fig. 4(c). As shown in Fig. 4(d), ByteFF-gopt and ByteFF-joint successfully predicted the torsional energy landscape of the highlighted non-ring torsion, which is highly consistent with the QM reference, while other force fields failed to predict both the positions of energy minima and the height of the energy barrier. More example molecules are provided in Fig. S2 and S3 including both in-ring and non-ring torsions, where ByteFF models predicted the torsional energy profiles with exceptional accuracy against the QM reference, while other force fields failed to achieve.


image file: d4sc06640e-f4.tif
Fig. 4 Example of the in-ring and non-ring torsion prediction accuracy of various force fields. As examples to show the accuracy of ByteFF models in predicting torsional energy profiles, an in-ring (a and b) and a non-ring (c and d) example molecule are provided. The torsional energy profiles predicted by various force fields are compared with the QM references and shown for each example molecule.

The effectiveness and accuracy of ByteFF-gopt and ByteFF-joint stems from the synergy of two components: comprehensive in-house training datasets and carefully tweaked training strategy. The iterative optimization-and-training strategy allows the model to accurately capture the energy landscape along torsional degrees of freedom, minimizing interference from complex interactions such as bond-torsion coupling that is beyond the limited description of the chosen functional form. The predictions are further enhanced by the simple but effective model-ensemble averages.

3.3 Equilibrium conformations

In addition to predict the torsional PES of molecules, the OpenFFBenchmark54 dataset is used to benchmark the performance of various force fields in terms of predicting unconstrained equilibrium conformations and associated energies.

As shown in Fig. 5(a) and (b), the distribution of ByteFF-gopt and ByteFF-joint in terms of RMSD and TFD are both concentrated near zero, with higher peak values near zero and narrower distributions than other force fields, demonstrating superior consistencies with QM-optimized molecular conformations. Particularly, both ByteFF-gopt and ByteFF-joint significantly outperform the state-of-the-art “OPLS4 cst” model, which is OPLS4 (ref. 12) with parameters refined for each molecule individually using the FFBuilder tool. Additionally, we calculated the relative energy differences (ΔΔE) to assess energetic agreement between force field-optimized conformations and the QM-optimized ones (Fig. 5(c)). Both ByteFF-gopt and ByteFF-joint exhibit sharp peaks around zero in their ΔΔE distribution, significantly outperform all other force fields, indicating that both of them reproduce correctly the QM relative energies between conformers. The accurate predictions of equilibrium conformations ensures the correct conformations being sampled in MD simulations, while the precise prediction of ΔΔE guarantees the Boltzmann distribution in the conformational space is properly reproduced.


image file: d4sc06640e-f5.tif
Fig. 5 Histograms of different metrics on OpenFFBenchmark dataset. The accuracy of force field-relaxed geometry relative to QM-relaxed references is quantified with (a) RMSD and (b) Torsion Fingerprint Deviation (TFD) scores. The energetic accuracy is quantified by (c) ΔΔE distributions. All benchmark results for “OPLS4 cst”, and the detailed protocols to calculate the benchmark results are obtained from ref. 54.

Fig. 6 illustrates a more detailed example of the performance of reproducing equilibrium geometries for various force fields. The QM-relaxed reference geometry shows that the four-membered ring takes a “butterfly” conformation due to significant torsional strain. The predictions relaxed using various force fields, including ByteFF-gopt (green), Espaloma-0.3.0 (yellow), and GAFF-2.2 (cyan), are superimposed on the QM reference. It is evident that Espaloma-0.3.0 (yellow) predicts the four-member ring as a planer conformation, while GAFF-2.2 (cyan) overestimates the bending of the ring. Being at the center of the molecule, a minor misprediction of the four-member ring conformation results in more significant geometric deviations on the two ends of the molecule, leading to overall RMSD values of 0.9 Å for Espaloma-0.3.0 and 0.4 Å for GAFF-2.2, respectively. Notably, the geometry relaxed by ByteFF-gopt (green) achieved superior alignment with the QM prediction with an RMSD of 0.2 Å. With increasing ring sizes and increasing number of neighboring rings, the complexity of the conformational space expands, making precise predictions of in-ring torsion profiles increasingly critical.


image file: d4sc06640e-f6.tif
Fig. 6 Example of the equilibrium conformation predictions of molecules. (a) 2D representation of the example molecule. (b–d) The equilibrium conformations predicted by ByteFF-gopt (green), Espaloma-0.3.0 (yellow), and GAFF-2.2 (cyan), superimposed to the equilibrium conformations predicted by QM (grey).

3.4 Validation in molecular dynamics

Several MD relevant tests were conducted to verify the performance of ByteFF. Firstly, we comprehensively assessed the prediction accuracy of various force fields on off-equilibrium conformations, quantified by energy and force RMSE relative to QM references (Table 2). Since most of the tested conformations were sampled from MD simulations, the accurate predictions will lead to better performance in MD. Among all five subsets, ByteFF-gopt achieved higher accuracy than both GAFF-2.2 and OpenFF-2.0, but less accurate than Espaloma-0.3.0. Importantly, ByteFF-gopt achieved this accuracy with relaxed geometry, energy, and partial hessian, without using any force labels in the training process. With additional off-equilibrium energy and forces training data, we further trained ByteFF-joint to improve the force prediction capabilities of ByteFF-gopt. Remarkably, after fine-tuning with just around 10k data in the off-equilibrium dataset—a significantly smaller dataset than that used by Espaloma-0.3.0—the performance of ByteFF-joint surpassed that of its competitors.
Table 2 Comparison of energy RMSE (in kcal mol−1) and force RMSE (in kcal mol−1 Å−1) for different datasets and force fields
Dataset GAFF-2.2 OpenFF-2.0 Espaloma-0.3 ByteFF-gopt ByteFF-joint
SPICE-Pubchem Energy 4.8 4.4 2.3 3.4 2.3
Force 14.1 14.0 6.5 9.9 6.7
SPICE-DES-monomers Energy 2.7 2.7 1.4 1.9 1.3
Force 10.7 12.5 6.0 8.0 5.4
SPICE-dipeptide Energy 5.2 4.5 3.3 4.0 2.3
Force 12.5 12.4 8.1 9.9 5.3
RNA-diverse Energy 6.7 5.4 3.9 4.6 3.5
Force 15.8 18.4 4.5 9.7 3.7
RNA-trinucleotide Energy 6.2 6.0 3.8 4.4 3.4
Force 16.5 18.7 4.4 10.0 3.6


As a result, ByteFF-joint achieves state-of-the-art performance in almost all subsets, except for the SPICE-Pubchem dataset, where ByteFF-joint shows marginally higher force RMSE compared to Espaloma-0.3.0. As discussed earlier, it is inherently challenging of an MMFF to perfectly replicate the QM PES. There is a trade-off between predicting the PES of relaxed-geometry and off-equilibrium conformations, and we focused more on the relaxed-geometry performances during the training process, since this is more relevant to the sampling of local minima in molecular dynamics simulations and is more stressed in the latest large-scale collaborative industrial benchmark.54 Despite this, ByteFF-joint achieves overall state-of-the-art performance in all benchmarks considered in this study.

In addition to performance benchmarking on the static small molecule conformations, ByteFF was further validated through MD simulations. From the ligands in an FEP benchmark dataset,69 10 drug-like fragments were chosen for these simulations, as illustrated in Fig. 7 and S4. In each fragment, a critical torsion angle was selected as the collective variable (CV) for the MetaD simulations.60,61 During the MetaD simulations, Gaussian bias potentials were applied along the selected CV, and the corresponding torsional FES was obtained. Unlike torsion profiles, which are obtained from locally minimized geometries, the torsional FES is averaged from all geometries following the Boltzmann distribution, making it directly related to experimental results. To illustrate the performance of ByteFF in MetaD simulations, a common fragment in the PTP1B subset from the FEP+ R-group dataset is presented in Fig. 7(a), with the highlighted torsion serving as the CV. As shown in Fig. 7(b), ByteFF-gopt and ByteFF-joint excel in predicting the torsional FES for the highlighted torsion, achieving high consistency with the QM references in both the peak/valley positions and the relative heights of the peaks. Additional results for the other 9 fragments are provided in Fig. S4, where ByteFF-gopt and ByteFF-joint outperform other force fields in most cases. These MetaD simulation results not only confirm the reliability of ByteFF in the MD simulations but also support the rationale behind fitting torsion profiles in the development of ByteFF.


image file: d4sc06640e-f7.tif
Fig. 7 Example of the torsional FES computed by metadynamics. (a) 2D representation of the example molecule. (b) The torsional FES of the highlighted torsion predicted by various force fields compared with the QM references.

3.5 Limitations

Despite the excellent performance of ByteFF compared to other force fields, several limitations still exist. As an MMFF with a minimal analytical functional, ByteFF inherently lacks the capability of modeling more complicated effects in intramolecular PES, such as bond-torsion and angle-torsion couplings. Moreover, while intramolecular parameters are carefully trained against QM references, non-bonded parameters are simply fitted to GAFF-2.2 vdW parameters and AM1-BCC charges to ensure Amber compatibility, leaving room for further refinement. Addressing these limitations presents a path forward for the further improvements of ByteFF.

4 Conclusion

In this work, we developed ByteFF, an Amber-compatible MMFF with broad chemical space coverage and state-of-the-art accuracy. Through Morgan-based torsional fingerprints and t-SNE analysis, we demonstrated that our dataset covers superior chemical space. Based on this strong data foundation, ByteFF leverages an edge-augmented, symmetry-preserving GNN model to predict MMFF parameters. We also carefully designed our training strategy, enhanced by techniques such as partial Hessian fitting, an iterative optimization-and-training scheme, and effective model ensembling for prediction improvements and uncertainty quantification. Benefitting from the comprehensive datasets and exquisitely designed training strategy, ByteFF outperforms existing MMFFs across multiple benchmarks, delivering accurate predictions of torsional energy profiles, equilibrium conformations, and off-equilibrium energies and forces. We believe ByteFF will significantly benefit biomolecular simulations for various purposes.

Data availability

The software package used to train ByteFF models, along with the curated BDTorsion dataset, is available at https://github.com/bytedance/byteff.

Author contributions

Conceptualization: T. Z., A. W., J. Z., Z. W., S. G., and W. Y.; data curation: T. Z., Y. X., X. H. and X. W.; methodology: T. Z., X. H., J. Z., A. W. and Y. L.; investigation: T. Z., A. W., and X. X.; software: T. Z., Y. X. and Y. C.; supervision: W. Y.; writing: T. Z., A. W., X. H., Y. X., X. X., and W. Y.

Conflicts of interest

ByteDance Inc. holds intellectual property rights pertinent to the research presented herein.

Acknowledgements

We thank Weihao Gao for beneficial discussions about the GNN backbone architecture and training strategy. This work is supported by ByteDance Research.

Notes and references

  1. J.-L. Reymond and M. Awale, ACS Chem. Neurosci., 2012, 3, 649–657 Search PubMed.
  2. W. A. Warr, M. C. Nicklaus, C. A. Nicolaou and M. Rarey, J. Chem. Inf. Model., 2022, 62, 2021–2034 CrossRef CAS PubMed.
  3. G. Caron, J. Kihlberg, G. Goetz, E. Ratkova, V. Poongavanam and G. Ermondi, ACS Med. Chem. Lett., 2021, 12, 13–23 CrossRef CAS PubMed.
  4. X. Liu, D. Shi, S. Zhou, H. Liu, H. Liu and X. Yao, Expert Opin. Drug Discovery, 2018, 13, 23–37 CrossRef CAS PubMed.
  5. M. De Vivo, M. Masetti, G. Bottegoni and A. Cavalli, J. Med. Chem., 2016, 59, 4035–4061 CrossRef CAS PubMed.
  6. P. Das, T. Sercu, K. Wadhawan, I. Padhi, S. Gehrmann, F. Cipcigan, V. Chenthamarakshan, H. Strobelt, C. Dos Santos and P.-Y. Chen, et al. , Nat. Biomed. Eng., 2021, 5, 613–623 CrossRef CAS PubMed.
  7. I. Muegge and Y. Hu, ACS Med. Chem. Lett., 2023, 14, 244–250 CrossRef CAS PubMed.
  8. L. F. Song and K. M. Merz Jr, J. Chem. Inf. Model., 2020, 60, 5308–5318 CrossRef CAS PubMed.
  9. J. Wang, R. M. Wolf, J. W. Caldwell, P. A. Kollman and D. A. Case, J. Comput. Chem., 2004, 25, 1157–1174 CrossRef CAS PubMed.
  10. E. Harder, W. Damm, J. Maple, C. Wu, M. Reboul, J. Y. Xiang, L. Wang, D. Lupyan, M. K. Dahlgren, J. L. Knight, J. W. Kaus, D. S. Cerutti, G. Krilov, W. L. Jorgensen, R. Abel and R. A. Friesner, J. Chem. Theory Comput., 2016, 12, 281–296 Search PubMed.
  11. K. Roos, C. Wu, W. Damm, M. Reboul, J. M. Stevenson, C. Lu, M. K. Dahlgren, S. Mondal, W. Chen, L. Wang, R. Abel, R. A. Friesner and E. D. Harder, J. Chem. Theory Comput., 2019, 15, 1863–1874 Search PubMed.
  12. C. Lu, C. Wu, D. Ghoreishi, W. Chen, L. Wang, W. Damm, G. A. Ross, M. K. Dahlgren, E. Russell, C. D. Von Bargen, R. Abel, R. A. Friesner and E. D. Harder, J. Chem. Theory Comput., 2021, 17, 4291–4300 CrossRef CAS PubMed.
  13. M. J. Buskes and M.-J. Blanco, Molecules, 2020, 25, 3493 CrossRef CAS PubMed.
  14. L. Guillemard, N. Kaplaneris, L. Ackermann and M. J. Johansson, Nat. Rev. Chem, 2021, 5, 522–545 Search PubMed.
  15. N. Brandenberg, S. Hoehnel, F. Kuttler, K. Homicsko, C. Ceroni, T. Ringel, N. Gjorevski, G. Schwank, G. Coukos and G. Turcatti, et al. , Nat. Biomed. Eng., 2020, 4, 863–874 Search PubMed.
  16. T. Mohammad, Y. Mathur and M. I. Hassan, Briefings Bioinf., 2021, 22, bbaa279 Search PubMed.
  17. F. Touret, M. Gilles, K. Barral, A. Nougairède, J. Van Helden, E. Decroly, X. De Lamballerie and B. Coutard, Sci. Rep., 2020, 10, 13093 CrossRef CAS PubMed.
  18. Y. Qiu, D. G. A. Smith, S. Boothroyd, H. Jang, D. F. Hahn, J. Wagner, C. C. Bannan, T. Gokey, V. T. Lim, C. D. Stern, A. Rizzi, B. Tjanaka, G. Tresadern, X. Lucas, M. R. Shirts, M. K. Gilson, J. D. Chodera, C. I. Bayly, D. L. Mobley and L.-P. Wang, J. Chem. Theory Comput., 2021, 17, 6262–6280 CrossRef CAS PubMed.
  19. S. Boothroyd, P. K. Behara, O. C. Madin, D. F. Hahn, H. Jang, V. Gapsys, J. R. Wagner, J. T. Horton, D. L. Dotson, M. W. Thompson, J. Maat, T. Gokey, L.-P. Wang, D. J. Cole, M. K. Gilson, J. D. Chodera, C. I. Bayly, M. R. Shirts and D. L. Mobley, J. Chem. Theory Comput., 2023, 19, 3251–3275 CrossRef CAS PubMed.
  20. I. Batatia, D. P. Kovacs, G. Simm, C. Ortner and G. Csanyi, Adv. Neural Inf. Process. Syst., 2022, 35, 11423–11436 Search PubMed.
  21. S. Batzner, A. Musaelian, L. Sun, M. Geiger, J. P. Mailoa, M. Kornbluth, N. Molinari, T. E. Smidt and B. Kozinsky, Nat. Commun., 2022, 13, 2453 CrossRef CAS PubMed.
  22. M. Rezaee, S. Ekrami and S. M. Hashemianzadeh, Sci. Rep., 2024, 14, 11791 CrossRef CAS PubMed.
  23. R. P. Pelaez, G. Simeon, R. Galvelis, A. Mirarchi, P. Eastman, S. Doerr, P. Thölke, T. E. Markland and G. De Fabritiis, J. Chem. Theory Comput., 2024, 20, 4076–4087 CrossRef CAS PubMed.
  24. V. Hornak, R. Abel, A. Okur, B. Strockbine, A. Roitberg and C. Simmerling, Proteins: Struct., Funct., Bioinf., 2006, 65, 712–725 CrossRef CAS PubMed.
  25. J. A. Maier, C. Martinez, K. Kasavajhala, L. Wickstrom, K. E. Hauser and C. Simmerling, J. Chem. Theory Comput., 2015, 11, 3696–3713 Search PubMed.
  26. R. C. Rizzo and W. L. Jorgensen, J. Am. Chem. Soc., 1999, 121, 4827–4836 CrossRef CAS.
  27. L. Wang, P. K. Behara, M. W. Thompson, T. Gokey, Y. Wang, J. R. Wagner, D. J. Cole, M. K. Gilson, M. R. Shirts and D. L. Mobley, J. Phys. Chem. B, 2024, 7043–7067 CrossRef CAS PubMed.
  28. Y. Li, H. Li, F. C. Pickard IV, B. Narayanan, F. G. Sen, M. K. Chan, S. K. Sankaranarayanan, B. R. Brooks and B. Roux, J. Chem. Theory Comput., 2017, 13, 4492–4503 CrossRef CAS PubMed.
  29. R. Galvelis, S. Doerr, J. M. Damas, M. J. Harvey and G. De Fabritiis, J. Chem. Inf. Model., 2019, 59, 3485–3493 Search PubMed.
  30. J. L. McDonagh, A. Shkurti, D. J. Bray, R. L. Anderson and E. O. Pyzer-Knapp, J. Chem. Inf. Model., 2019, 59, 4278–4288 CrossRef CAS PubMed.
  31. Y. Wang, J. Fass, B. Kaminow, J. E. Herr, D. Rufa, I. Zhang, I. Pulido, M. Henry, H. E. B. Macdonald, K. Takaba and J. D. Chodera, Chem. Sci., 2022, 13, 12016–12033 RSC.
  32. K. Takaba, A. Friedman, C. Cavender, P. Behara, I. Pulido, M. Henry, H. MacDermott-Opeskin, C. Iacovella, A. Nagle, A. Payne, M. Shirts, D. L. Mobley, J. D. Chodera and Y. Wang, Chem. Sci., 2024, 12861–12878 Search PubMed.
  33. G. Chen, T. Jaffrelot Inizan, T. Plé, L. Lagardère, J.-P. Piquemal and Y. Maday, J. Chem. Theory Comput., 2024, 5558–5569 CrossRef CAS PubMed.
  34. J. March et al. , Advanced organic chemistry: reactions, mechanisms, and structure, McGraw-HillNew York, 1977 Search PubMed.
  35. S.-L. J. Lahey, T. N. Thien Phuc and C. N. Rowley, J. Chem. Inf. Model., 2020, 60, 6258–6268 CrossRef CAS PubMed.
  36. J. T. Horton, S. Boothroyd, J. Wagner, J. A. Mitchell, T. Gokey, D. L. Dotson, P. K. Behara, V. K. Ramaswamy, M. Mackey, J. D. Chodera, J. Anwar, D. L. Mobley and D. J. Cole, J. Chem. Inf. Model., 2022, 62, 5622–5633 Search PubMed.
  37. B. Zdrazil, E. Felix, F. Hunter, E. J. Manners, J. Blackshaw, S. Corbett, M. de Veij, H. Ioannidis, D. M. Lopez, J. F. Mosquera, M. P. Magarinos, N. Bosc, R. Arcila, T. Kizilören, A. Gaulton, A. P. Bento, M. F. Adasme, P. Monecke, G. A. Landrum and A. R. Leach, Nucleic Acids Res., 2024, 52, D1180–D1192 CrossRef CAS PubMed.
  38. J. J. Irwin, K. G. Tang, J. Young, C. Dandarchuluun, B. R. Wong, M. Khurelbaatar, Y. S. Moroz, J. Mayfield and R. A. Sayle, J. Chem. Inf. Model., 2020, 60, 6065–6073 CrossRef CAS PubMed.
  39. S. Jinsong, J. Qifeng, C. Xing, Y. Hao and L. Wang, Commun. Chem., 2024, 7, 20 CrossRef CAS PubMed.
  40. J. C. Shelley, A. Cholleti, L. L. Frye, J. R. Greenwood, M. R. Timlin and M. Uchimaya, J. Comput.-Aided Mol. Des., 2007, 21, 681–691 Search PubMed.
  41. M. K. Kesharwani, A. Karton and J. M. L. Martin, J. Chem. Theory Comput., 2015, 12, 444–454 Search PubMed.
  42. P. K. Behara, H. Jang, J. T. Horton, T. Gokey, D. L. Dotson, S. Boothroyd, C. I. Bayly, D. J. Cole, L.-P. Wang and D. L. Mobley, J. Phys. Chem. B, 2024, 128, 7888–7902 Search PubMed.
  43. L.-P. Wang and C. Song, J. Chem. Phys., 2016, 144, 214108 Search PubMed.
  44. E. Epifanovsky, A. T. Gilbert, X. Feng, J. Lee, Y. Mao, N. Mardirossian, P. Pokhilko, A. F. White, M. P. Coons and A. L. Dempwolff, et al. , J. Chem. Phys., 2021, 155, 084801 CrossRef CAS PubMed.
  45. R. Li, Q. Sun, X. Zhang and G. K.-L. Chan, Introducing GPU-acceleration into the Python-based Simulations of Chemistry Framework, arXiv, 2024, preprint, arXiv:2407.09700,  DOI:10.48550/arXiv.2407.09700.
  46. X. Wu, Q. Sun, Z. Pu, T. Zheng, W. Ma, W. Yan, X. Yu, Z. Wu, M. Huo, X. Li, W. Ren, S. Gong, Y. Zhang and W. Gao, Enhancing GPU-acceleration in the Python-based Simulations of Chemistry Framework, arXiv, 2024, preprint, arXiv:2404.09452,  DOI:10.48550/arXiv.2404.09452.
  47. M. S. Hussain, M. J. Zaki and D. Subramanian, Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 2022, pp. 655–665 Search PubMed.
  48. J. M. Seminario, Int. J. Quantum Chem., 1996, 60, 1271–1277 Search PubMed.
  49. A. E. Allen, M. C. Payne and D. J. Cole, J. Chem. Theory Comput., 2018, 14, 274–281 Search PubMed.
  50. R. Wang, M. Ozhgibesov and H. Hirao, J. Comput. Chem., 2016, 37, 2349–2359 Search PubMed.
  51. M. K. Dahlgren, P. Schyman, J. Tirado-Rives and W. L. Jorgensen, J. Chem. Inf. Model., 2013, 53, 1191–1199 Search PubMed.
  52. A. Jakalian, B. L. Bush, D. B. Jack and C. I. Bayly, J. Comput. Chem., 2000, 21, 132–146 CrossRef CAS.
  53. A. Jakalian, D. B. Jack and C. I. Bayly, J. Comput. Chem., 2002, 23, 1623–1641 CrossRef CAS PubMed.
  54. L. D'Amore, D. F. Hahn, D. L. Dotson, J. T. Horton, J. Anwar, I. Craig, T. Fox, A. Gobbi, S. K. Lakkaraju, X. Lucas, K. Meier, D. L. Mobley, A. Narayanan, C. E. M. Schindler, W. C. Swope, P. J. in 't Veld, J. Wagner, B. Xue and G. Tresadern, J. Chem. Inf. Model., 2022, 62, 6094–6104 CrossRef PubMed.
  55. T. Schulz-Gasch, C. Schärfer, W. Guba and M. Rarey, J. Chem. Inf. Model., 2012, 52, 1499–1512 CrossRef CAS PubMed.
  56. V. T. Lim, D. F. Hahn, G. Tresadern, C. I. Bayly and D. L. Mobley, F1000Research, 2020, 9, 1–22 Search PubMed.
  57. B. K. Rai, V. Sresht, Q. Yang, R. Unwalla, M. Tu, A. M. Mathiowetz and G. A. Bakken, J. Chem. Inf. Model., 2022, 62, 785–800 CrossRef CAS PubMed.
  58. P. Eastman, P. K. Behara, D. L. Dotson, R. Galvelis, J. E. Herr, J. T. Horton, Y. Mao, J. D. Chodera, B. P. Pritchard, Y. Wang, G. De Fabritiis and T. E. Markland, Sci. Data, 2023, 10, 11 CrossRef CAS PubMed.
  59. L. G. Parlea, B. A. Sweeney, M. Hosseini-Asanjan, C. L. Zirbel and N. B. Leontis, Methods, 2016, 103, 99–119 Search PubMed.
  60. A. Laio and M. Parrinello, Proc. Natl. Acad. Sci. U. S. A., 2002, 99, 12562–12566 CrossRef CAS PubMed.
  61. G. Bussi and D. Branduardi, Rev. Comput. Chem., 2015, 28, 1–49 Search PubMed.
  62. X. Biarnés, A. Ardevol, A. Planas, C. Rovira, A. Laio and M. Parrinello, J. Am. Chem. Soc., 2007, 129, 10686–10693 CrossRef PubMed.
  63. J. Iglesias-Fernández, L. Raich, A. Ardèvol and C. Rovira, Chem. Sci., 2015, 6, 1167–1177 Search PubMed.
  64. Y.-P. Huang, Y. Xia, L. Yang, J. Wei, Y. I. Yang and Y. Q. Gao, Chin. J. Chem., 2022, 40, 160–168 CrossRef CAS.
  65. J. Zhang, D. Chen, Y. Xia, Y.-P. Huang, X. Lin, X. Han, N. Ni, Z. Wang, F. Yu and L. Yang, et al. , J. Chem. Theory Comput., 2023, 19, 4338–4350 CrossRef CAS PubMed.
  66. P. Eastman, R. Galvelis, R. P. Peláez, C. R. Abreu, S. E. Farr, E. Gallicchio, A. Gorenko, M. M. Henry, F. Hu and J. Huang, et al. , J. Phys. Chem. B, 2023, 128, 109–116 Search PubMed.
  67. K. V. Marida, in In Statistics of Directional Data, Academic Press, 1972, pp. 18–24 Search PubMed.
  68. J. Shearer, J. L. Castro, A. D. G. Lawson, M. MacCoss and R. D. Taylor, J. Med. Chem., 2022, 65, 8699–8712 CrossRef CAS PubMed.
  69. G. A. Ross, C. Lu, G. Scarabelli, S. K. Albanese, E. Houang, R. Abel, E. D. Harder and L. Wang, Commun. Chem., 2023, 6, 222 CrossRef PubMed.

Footnotes

Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4sc06640e
Work done as intern at ByteDance Research.

This journal is © The Royal Society of Chemistry 2025
Click here to see how this site uses Cookies. View our privacy policy here.