Open Access Article
Jakub Poziemski (a), Artur Yurkevych (b) and Pawel Siedlecki* (a)
aInstitute of Biochemistry and Biophysics, Polish Academy of Sciences, Warsaw, Poland. E-mail: pawel@ibb.waw.pl
bInstitute of Chemistry, University of Silesia in Katowice, Katowice, Poland
First published on 19th March 2026
The advancements in computational methods in drug discovery, particularly through the use of machine learning (ML) and deep learning (DL), have significantly enhanced the precision of binding affinity predictions. However, accurate prediction of binding affinity remains a challenge due to the complex, non-linear character of molecular interactions. Limited generalizability continues to constrain current models, with performance discrepancies noted between training datasets and external test conditions. This study explores the integration of molecular dynamics (MD) simulations with ML to assess their predictive performance and limitations. In particular, MD simulations offer a dynamic perspective by depicting the temporal interactions within protein–ligand complexes, potentially providing additional information for affinity and specificity estimates. By generating and analyzing over 800 unique protein–ligand MD simulations, we evaluate the utility of MD-derived, time-series-based descriptors in enhancing predictive accuracy. The findings identify specific and generalizable features derived from MD data and propose approaches to augment current in silico affinity prediction methods.
Molecular dynamics (MD) simulations introduce a vital temporal dimension to protein–ligand complex studies. Such simulations allow for more detailed observations of how drug molecules interact with biological targets over time.11–13 Over the last few years, integration of MD simulations with ML and DL has been applied with varying success in different drug discovery tasks and specific campaigns. In the case of affinity prediction, Ash and Fourches in 2017 (ref. 14) analyzed 87 ERK2-docked ligand complexes by computing chemical descriptors derived from 20 ns molecular dynamics (MD) trajectories. They showed that models trained on MD-derived descriptors were able to distinguish the most active ERK2 inhibitors from the moderate/weak actives and inactives. They claimed that the descriptors extracted from MD trajectories are highly informative and, having little correlation with classical 2D/3D descriptors, could augment chemical library screening tasks, candidate design and lead prioritization. A similar conclusion was presented by the authors of ref. 15, who performed molecular docking of 43 compounds associated with Caspase-8, with consecutive 10 ns MD simulations of the top-scoring complex for each ligand. They investigated 770 2D and 115 3D descriptors together with 4 descriptors extracted from MD simulations: solvent accessible surface area (SASA), radius of gyration (Rg), potential energy and total energy, in the form of mean and standard deviation (8 descriptors in total). They reported that ML models trained on MD data had the most balanced accuracies and AUC values, compared to the 2D and 3D descriptor models, and that models using a combination of 3D and MD descriptors had the best performance.
A counter-experiment was performed16 using the BCR-ABL tyrosine kinase and 15 ns MD simulations of Imatinib and a large series of its derivatives. The authors concluded that incorporating MD-based descriptor matrices could not improve the binding affinity prediction ability of either deep NN or random forest (RF) QSAR models. Moreover, their models showed increasing prediction error as the number of snapshots grew, indicating that the negative effect of simulation noise becomes stronger with denser sampling. Another approach comparing different ML models trained on descriptors obtained from MD trajectories was presented in ref. 10. Using three different targets and a maximum of 433 complexes predicted by docking per target, the results for MD-augmented approaches were greatly dependent on the target. The paper concludes that the use of MD does not generally improve screening results and may only be justified in certain cases. Given that the models were trained with descriptors generated from every frame, the large number of frames relative to the small number of examples may have been a challenge for simple ML models. In addition, the low MM/PBSA and Glide scores suggest that the analyzed collections were rather difficult. In conclusion, current research indicates the complexity of leveraging molecular dynamics (MD) data, suggesting that its utility is target specific and can depend on the noise-to-signal ratio, e.g. the number of frames, the length of the MD simulation, etc. It is difficult, however, to draw definite conclusions, as only a handful of targets have been tested so far.
In this study, we have generated the largest set of MD simulations to date, encompassing a broad array of protein–ligand complexes. We developed a comprehensive set of descriptors that utilize various aspects of MD-derived data and implemented a rigorous feature selection mechanism to tailor the number of features to the analytical methods employed. By training ML models on MD-derived time series descriptors from over 800 unique protein–ligand complexes, we seek answers to the following questions: (1) What are the complex-specific and simulation-specific features influencing the affinity prediction outcomes of the models? (2) Are time series representations of MD beneficial, and how do they generalize? (3) Can the MD-derived descriptors augment and/or replace current crystallography-derived descriptors? Based on our findings, we present general guidelines and assessments on how such an approach influences in silico affinity prediction on a large scale.
Only complexes with experimentally determined affinity values (Ki, Kd and IC50) were considered. More details on the filtering procedure, functional composition and similarity assessments of MDD targets and ligands are available in the SI.
The crystallographic (static) representation is composed of 63 descriptors: 24 pocket descriptors calculated with RDKit and MDAnalysis,19,20 11 interaction descriptors calculated with ProLIF21 and 28 ligand descriptors calculated with RDKit and SciPy19,22 (for details see the "List of descriptors" section in the SI).
In the case of MD simulation data, for each complex we calculate 51 descriptors: 24 pocket descriptors, 11 interaction descriptors, 9 ligand descriptors (19 ligand property descriptors are omitted as they do not change during simulation), and an additional 7 motion descriptors not present in static representations. These 51 descriptors are calculated for each frame of the MD simulation. For each descriptor sequence, we then extract a single representative time series value using a multistep procedure:
(1) For each descriptor, we calculate all 788 time series descriptors (ts_descriptors) supported by tsfresh (see Fig. 2).
(2) For each ts_descriptor, its p-value is determined in terms of statistical significance against experimentally determined affinity (sourced from PDBBind), using a univariate test with FDR set to 0.001.
(3) Next, a trimming procedure is used. For each of the 75 feature types,23 the ts_descriptor with the lowest p-value is selected. Some ts_descriptors are parametric in nature; for such cases, we use different thresholds to generate several versions of that ts_descriptor and select the one with the lowest p-value. At this stage, there could be a maximum of 75 ts_descriptors per single descriptor.
(4) To avoid redundancy during training, correlation-based trimming was applied. All descriptors and ts_descriptors were tested for pairwise correlations. Correlated descriptors were dropped if PCC was ≥0.8. Among a group of correlated ts_descriptors, the one with the highest cardinality was retained.
(5) In the final filtering step, for each descriptor, the ts_descriptor with the highest mutual information score between itself and the experimentally determined affinity value was selected. As a result, each descriptor is described with at most one ts_descriptor (Fig. 2, bottom left).
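The multistep procedure above can be sketched in code. The sketch below is a simplified, self-contained illustration with our own function names: a handful of hand-rolled features stand in for the 788 tsfresh ts_descriptors, a permutation test replaces tsfresh's FDR-controlled univariate test, |PCC| replaces mutual information in the final step, and the parametric trimming of step 3 is omitted.

```python
import numpy as np

def ts_features(series):
    """A tiny stand-in for the 788 tsfresh ts_descriptors (step 1)."""
    s = np.asarray(series, dtype=float)
    return {
        "abs_energy": float(np.sum(s ** 2)),
        "autocorr_lag1": float(np.corrcoef(s[:-1], s[1:])[0, 1]),
        "mean": float(s.mean()),
        "std": float(s.std()),
    }

def perm_pvalue(feature, affinity, n_perm=999, seed=0):
    """Permutation p-value for |Pearson r| against affinity (step 2);
    the study uses tsfresh's univariate test with FDR = 0.001 instead."""
    rng = np.random.default_rng(seed)
    r_obs = abs(np.corrcoef(feature, affinity)[0, 1])
    r_perm = [abs(np.corrcoef(rng.permutation(feature), affinity)[0, 1])
              for _ in range(n_perm)]
    return (1 + sum(r >= r_obs for r in r_perm)) / (1 + n_perm)

def select_ts_descriptor(trajectories, affinity, p_max=0.05, pcc_max=0.8):
    """Steps 2, 4 and 5 in miniature: significance filter, correlation
    trimming at PCC >= 0.8, then keep the single best remaining feature
    (ranked by |PCC| here; the study uses mutual information)."""
    affinity = np.asarray(affinity, dtype=float)
    names = list(ts_features(trajectories[0]))
    table = {n: np.array([ts_features(t)[n] for t in trajectories])
             for n in names}
    # step 2: drop features not significantly related to affinity
    significant = [n for n in names
                   if perm_pvalue(table[n], affinity) <= p_max]
    # step 4: among correlated features, keep only the first encountered
    kept = []
    for n in significant:
        if all(abs(np.corrcoef(table[n], table[k])[0, 1]) < pcc_max
               for k in kept):
            kept.append(n)
    # step 5: one ts_descriptor per descriptor sequence
    return max(kept, key=lambda n: abs(np.corrcoef(table[n], affinity)[0, 1]))
```

In practice step 1 would be `tsfresh.extract_features` over all frames of all complexes; the sketch only mirrors the shape of the selection cascade.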
In total, each crystallographic complex is described by 63 descriptors. For the MD data representation the same complex is described by a maximum of 51 ts_descriptors and 19 ligand property (static) descriptors, for which the time series cannot be generated. Both representations are augmented with the ECFP4 fingerprint (1024 bits) to directly incorporate molecular connectivity information of ligands.
Complexes were divided into training and test collections at a ratio of 4 : 1 (80% training collection and 20% test collection). In the random split, complexes were randomly allocated to the training and test sets. In the target split, complexes were separated by UniProt ID, so no target appears in both the training and test sets. The target split can therefore be used to approximate the generalization potential.
Ligands were split using DeepChem's ScaffoldSplitter,24 which divides molecules into groups based on their Bemis-Murcko scaffolds; the smallest scaffold groups form the test set. Although not without its caveats, this type of ligand split is more challenging than a random split, as it tests more thoroughly the generalizability to new or less abundant areas of chemical space.
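The grouping logic of a scaffold split can be sketched as follows; the scaffold computation itself (done with RDKit's MurckoScaffold inside DeepChem) is assumed to be precomputed, and the function name is ours:

```python
from collections import defaultdict

def scaffold_split(scaffolds, test_frac=0.2):
    """Bemis-Murcko scaffold split: molecules sharing a scaffold stay
    together, and the smallest scaffold groups form the test set.

    `scaffolds` maps a molecule id to its scaffold key (in practice a
    scaffold SMILES computed with RDKit, as DeepChem does internally).
    """
    groups = defaultdict(list)
    for mol_id, scaffold in scaffolds.items():
        groups[scaffold].append(mol_id)
    # Fill the training set with the largest scaffold groups first;
    # whatever no longer fits (the smallest groups) becomes the test set.
    n_train = int(round((1.0 - test_frac) * len(scaffolds)))
    train, test = [], []
    for group in sorted(groups.values(), key=len, reverse=True):
        (train if len(train) + len(group) <= n_train else test).extend(group)
    return train, test
```

Because whole scaffold groups move together, no Bemis-Murcko scaffold ever appears in both sets, which is what makes this split harder than a random one.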
Three different ML models were trained using the MDD: Random Forest, SVM25 and XGBoost.26 Model parameters were selected using 5-fold cross-validation (CV). The models were fitted to the training sets and evaluated on the test sets. Throughout this work, boxplots represent results obtained on the test set, with mean (triangle) and median (horizontal line inside the boxplot) values. More details on the Random Forest models and on the descriptor model (XGB) are presented in the SI.
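The evaluation loop is model-agnostic and can be sketched as below; `fit`/`predict` are placeholders for any of the three models, and the PCC/RMSE definitions match those reported in the tables (a sketch, not the study's code):

```python
import numpy as np

def kfold_indices(n, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    idx = np.random.default_rng(seed).permutation(n)
    for fold in np.array_split(idx, k):
        yield np.setdiff1d(idx, fold), fold

def pcc(y_true, y_pred):
    """Pearson correlation coefficient."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])

def rmse(y_true, y_pred):
    """Root mean square error."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def cv_score(fit, predict, X, y, k=5):
    """Mean validation PCC over k folds for one hyperparameter setting;
    the setting with the best score would be refit on the full training set."""
    scores = []
    for tr, va in kfold_indices(len(y), k):
        model = fit(X[tr], y[tr])
        scores.append(pcc(y[va], predict(model, X[va])))
    return float(np.mean(scores))
```

Hyperparameter selection then amounts to calling `cv_score` once per candidate parameter set and keeping the best.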
| Model name | Description | PCC | RMSE | Training size | Code and feature reference |
|---|---|---|---|---|---|
| OnionNet2 | CNN trained on contact descriptors | 0.75 | 1.26 | 13 546 | 6 |
| PLEC-NN | Extended connectivity FP & neural network | 0.74 | 1.54 | 11 203 | 28 |
| Descriptor model (XGB) | XGBoost trained on 63 descriptors + ECFP4 | 0.72 | 1.29 | 12 349 | This work |
| NN-score | Feed-forward neural network | 0.67 | 1.40 | 4647 | 29 |
| RF-score v3 | Random forest with spatial distance count | 0.65 | 1.45 | 4647 | 4 |
| Vina | Hybrid empirical scoring function | 0.49 | — | — | 30 |
| Model name | Description | PCC (CASF / CoreSet) | RMSE (CASF / CoreSet) | Training size | Original reference |
|---|---|---|---|---|---|
| OnionNet2 | CNN trained on contact descriptors | 0.86 / 0.82 | 1.16 / 1.36 | — | 6 |
| TopBP | Topological descriptors with GBT | 0.86 / 0.81 | 1.19 / 1.95 | 3767 | 31 |
| SS-GNN | Graph neural network | 0.85 / 0.82 | 1.18 / 1.35 | 15 394 | 32 |
| Descriptor model (XGB) | XGBoost | 0.85 / 0.81 | 1.20 / 1.37 | 12 866 | This work |
| EBA-AY | Ensemble attention based | 0.86 / 0.79 | 1.20 / 1.44 | 10 324 | 33 |
| OPRC-GBT | Ollivier persistent Ricci curvature | 0.84 / 0.79 | 1.25 / 2.01 | 3772 | 34 |
| DCML | Dowker complex based machine learning | 0.84 / 0.78 | 1.25 / 1.43 | 3772 | 35 |
| CAPLA | Sequence based cross-attention with 1D CNN | 0.84 / 0.77 | 1.20 / 1.36 | 11 906 | 36 |
| PLANET | Graph neural network | 0.82 / N/A | 1.24 / N/A | 15 616 | 37 |
| KDEEP | Convolutional neural network | 0.82 / N/A | 1.27 / N/A | 3767 | 38 |
| PLEC-NN | Extended connectivity FP & neural network | 0.82 / 0.77 | 1.25 / 1.43 | 12 906 | 28 |
| OnionNet | Convolutional neural network | 0.82 / 0.78 | 1.27 / 1.50 | 11 906 | 39 |
| RF-score v3 | Random forest | 0.80 / 0.74 | 1.39 / 1.51 | 3767 | 4 |
| Pafnucy | Convolutional neural network | 0.78 / 0.70 | 1.42 / 1.62 | 11 906 | 5 |
| Vina | Hybrid empirical scoring function | 0.60 / 0.56 | 1.75 / 1.86 | — | 40 |
Both Pearson Correlation Coefficient (PCC) and root mean square error (RMSE) values (Table 1) render the MDD a more difficult dataset compared to CASF2016 (Table 2). All tested models show a rather consistent drop in performance, independent of their complexity. The observed decrease in performance may be multifaceted: CASF2016 is a relatively small dataset compared to MDD (285 vs. 862 complexes), therefore, it may be easier to optimize or overtrain the models. Also, excluding MDD complexes from the training data may have influenced the availability of information necessary for higher affinity prediction performance.
Our descriptor model (see the Representation section in the Materials and methods section) shows the same consistent performance drop as seen with other models. Interestingly, the affinity prediction performance of our models is on par with some of the best, highly sophisticated methods. This result highlights that a carefully selected set of descriptors and a relatively simple ML model can show a level of performance comparable to specialized neural networks.
We assessed the difference in affinity correlations between individual time series descriptors (ts_descriptors) derived from 20 ns compared to 200 ns MD simulations. We filtered all cross-correlated ts_descriptors from the two simulation lengths, and compared the shared 36 ts_descriptors. The results show a significant gain in around 40% of the tested ts_descriptors (14 out of 36, with Δ > 0.1) in favor of the 20 ns simulation length. Comparable correlations are registered for 22 ts_descriptors (Δ < 0.1). Interestingly, there was no descriptor that had more than 0.1 correlation difference in favor of the 200 ns simulation.
In the assessed timescale, the obtained results indicate that short MD simulations may capture useful steric conformational changes and that longer MD simulations may contain more random movements or noise, which may lower the individual correlation of some of the ts_descriptors. Similar results have been presented in studies concerning single targets.10,14,15 Taken together, our results show that time series descriptors from longer MD do not provide an advantage over short MD runs, possibly due to higher noise content and increased noise fitting during model training (Fig. 4).
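The 20 ns vs. 200 ns comparison reduces to a simple per-descriptor tally, sketched below with hypothetical correlation values (the function name and the inputs are ours):

```python
def compare_lengths(pcc_short, pcc_long, delta=0.1):
    """Per-descriptor tally of which simulation length gives a
    meaningfully higher |PCC| with affinity (difference > delta).
    Inputs map ts_descriptor name -> PCC for each simulation length."""
    shared = pcc_short.keys() & pcc_long.keys()
    short_wins = sum(abs(pcc_short[n]) - abs(pcc_long[n]) > delta
                     for n in shared)
    long_wins = sum(abs(pcc_long[n]) - abs(pcc_short[n]) > delta
                    for n in shared)
    return {"shared": len(shared), "short_wins": short_wins,
            "long_wins": long_wins,
            "comparable": len(shared) - short_wins - long_wins}
```

Applied to the 36 shared ts_descriptors of the study, this tally gives 14 short wins, 0 long wins and 22 comparable cases.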
We also tested whether the frequency of saving trajectory frames influenced model performance. We observed no significant improvement in results when training on an extensive number of frames (see SI Fig. S3). From a practical standpoint, in similar setups we recommend less frequent recording (a larger time interval), which can significantly reduce storage requirements without decreasing prediction quality.
Next, we compared each crystallographic descriptor with its time series counterpart to assess their correlations with affinity. Fig. 6 shows the results obtained for the 6 different groups of descriptors used. In the group of interaction descriptors, all time series descriptors have increased correlations compared to their crystallographic counterparts (with the exception of the anionic term). For both types of descriptors, the hydrophobic and vdW interactions show the highest affinity correlation, and importantly the time series correlation at least doubles compared to crystallographic descriptors. In the case of a single descriptor, a correlation around 0.5 PCC would be considered fairly high. The correlations for the rest of the time series interaction descriptors are higher than those of crystallographic descriptors; nevertheless, overall their correlation values are low for both types of data.
With respect to pocket descriptors, both property and geometric, the results show a substantial number of them with PCC close to or above 0.5. Overall, pocket property descriptors show the largest difference between static and dynamic treatment. Out of the 8 best-correlating time series descriptors, only one static descriptor (pocket_hydrophobic) has a comparable correlation. The results show that simple pocket composition features, such as atom or bond type counts, which change with ligand movement, correlate strongly with small-molecule affinity.
In the case of geometric descriptors, correlation values are comparable between time series and static treatment. Similar results are obtained for ligand geometric descriptors. Here, molecular eccentricity and molecular asphericity descriptor correlations gain the most from a time series representation; however their PCC values are rather low.
Surprisingly, motion descriptors, which we thought would substantially contribute to the affinity prediction performance, show only minor gains compared to static, crystallographic descriptors. Only two of them (pocket_contact_new and pocket_contacts_old) have PCC around 0.5 or above. The results suggest that pocket internal contacts (both preserved and new) are more correlated with affinity than e.g. pocket–ligand interactions and changes in RMSD of both molecular entities.
Taken together, although the individual performance of the ts_descriptors is favorable, it does not appear to carry over to the final model performance (Fig. 5). This effect is probably due to non-linear cross-correlations present between the ts_descriptors, further amplified by noise in the data. Common correlation methods such as Pearson's, Spearman's or Kendall's find linear and monotonic relationships between variables, assuming little noise. Methods that measure potential non-linear correlations test only the independence of variables without quantifying the strength of the relationship, and are also very sensitive to noise and outliers. Such nonlinearity is therefore currently difficult to filter beforehand.41
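This failure mode is easy to demonstrate: a noiseless quadratic dependence yields a near-zero Pearson correlation, while even a crude binned mutual-information estimate detects it (the binning scheme below is our illustrative choice; real MD descriptors carry noise that blurs such estimates further):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 5000)
y = x ** 2  # y is perfectly determined by x, yet...

# ...Pearson's r is near zero: the relationship is symmetric, not monotonic
pearson = np.corrcoef(x, y)[0, 1]

def mi_binned(a, b, bins=20):
    """Crude binned mutual information (nats); it detects the dependence
    Pearson misses, but is itself sensitive to binning and noise."""
    pab, _, _ = np.histogram2d(a, b, bins=bins)
    pab /= pab.sum()
    pa = pab.sum(axis=1, keepdims=True)   # marginal of a
    pb = pab.sum(axis=0, keepdims=True)   # marginal of b
    nz = pab > 0
    return float(np.sum(pab[nz] * np.log(pab[nz] / (pa @ pb)[nz])))
```

The asymmetry (strong dependence, vanishing linear correlation) is exactly the kind of relationship that pairwise PCC trimming cannot see among ts_descriptors.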
| | Static model | MD-derived model |
|---|---|---|
| Screening task (GCR target) | | |
| EF10% | 1.92 | 2.82 |
| ROC AUC | 0.615 | 0.675 |
| Pose selection (7 target structures) | | |
| Top-1 | 2/7 | 4/7 |
| Top-3 | 3/7 | 5/7 |
| Mean std | 0.22 | 0.28 |
The MD-derived model showed better performance in early enrichment (EF10%) compared to the static model (2.82 vs. 1.92). This result was accompanied by a higher ROC AUC score (0.675 vs. 0.615). Overall the MD-derived model improved the ability to discriminate between active and inactive compounds, indicating a potential benefit of using the dynamic information in the training process.
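Both screening metrics in Table 3 follow standard definitions and can be computed as sketched below (generic implementations, not the study's code):

```python
import numpy as np

def enrichment_factor(scores, is_active, frac=0.10):
    """EF@frac: active rate in the top `frac` of the score-ranked
    library divided by the active rate in the whole library."""
    scores = np.asarray(scores, dtype=float)
    is_active = np.asarray(is_active, dtype=bool)
    n_top = max(1, int(round(frac * len(scores))))
    top = np.argsort(-scores)[:n_top]          # highest scores first
    return float(is_active[top].mean() / is_active.mean())

def roc_auc(scores, is_active):
    """ROC AUC via the rank-sum (Mann-Whitney U) identity, with
    tied scores assigned their average rank."""
    scores = np.asarray(scores, dtype=float)
    is_active = np.asarray(is_active, dtype=bool)
    order = scores.argsort()
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    for s in np.unique(scores):                # average ranks over ties
        m = scores == s
        ranks[m] = ranks[m].mean()
    n_pos, n_neg = is_active.sum(), (~is_active).sum()
    return float((ranks[is_active].sum() - n_pos * (n_pos + 1) / 2)
                 / (n_pos * n_neg))
```

An EF10% of 2.82 therefore means the top-ranked 10% of the library is 2.82 times richer in actives than the library as a whole.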
To evaluate the ability of the two models to select the correct ligand binding pose, we conducted an experiment using a subset of targets derived from the CASF-2016 dataset. We selected 3 targets: FXa (1lpg, 1z6e and 2xbv), CDK2 (1pxn and 4eor) and CAH (2weg and 3dd0). For each of the 7 target–ligand complexes, a docking run was performed following the same procedure as described for the screening experiment (i.e. Vina, default parameters, exhaustiveness = 32). To minimize bias, ligand structures used for docking were generated from SMILES strings with Open Babel, keeping the lowest-energy conformer. Each docking run resulted in nine putative conformations for every protein target. The conformation derived directly from the crystallographic structure was included, resulting in 10 poses per target. Finally, each of the obtained poses was subjected to a 20 ns MD simulation using the same parametrization procedure, to obtain MD-derived descriptors (see the Materials and methods section). The results of the ability of the two types of models to correctly identify the native crystallographic structure (i.e. assign the highest affinity value) among the docking results are shown in Table 3.
The model based on MD trajectories performs better than the model based on crystallographic structures only. However, it is also noticeable that the standard deviation is higher for the MD model (0.28 vs. 0.22), suggesting a greater sensitivity of this approach to the choice of initial conformation. This may reflect the complex nature of the molecular dynamics process and indicates that accounting for conformational variation affects the stability of the prediction. Ultimately, these results suggest that models based on MD simulation data may be an effective alternative to approaches based only on static crystallographic structures.
| Descriptors | Target split, PCC (MD) | Target split, PCC (Crystal) | Target split, r2 (MD) | Target split, r2 (Crystal) | Random split, PCC (MD) | Random split, PCC (Crystal) | Random split, r2 (MD) | Random split, r2 (Crystal) |
|---|---|---|---|---|---|---|---|---|
| All descriptors | 0.60 (0.07) | 0.58 (0.10) | 0.32 (0.10) | 0.29 (0.14) | 0.73 (0.04) | 0.73 (0.05) | 0.53 (0.06) | 0.51 (0.06) |
| Pocket property | 0.54 (0.07) | 0.52 (0.07) | 0.26 (0.10) | 0.23 (0.10) | 0.68 (0.04) | 0.71 (0.04) | 0.45 (0.05) | 0.50 (0.06) |
| Motion | 0.53 (0.07) | — | 0.24 (0.10) | — | 0.58 (0.05) | — | 0.33 (0.06) | — |
| Pocket geometric | 0.53 (0.07) | 0.53 (0.08) | 0.23 (0.09) | 0.22 (0.12) | 0.58 (0.04) | 0.58 (0.05) | 0.33 (0.05) | 0.33 (0.07) |
| Ligand property + ECFP4 | 0.52 (0.09) | 0.52 (0.09) | 0.21 (0.13) | 0.21 (0.13) | 0.69 (0.04) | 0.69 (0.04) | 0.46 (0.04) | 0.46 (0.04) |
| Interaction | 0.45 (0.08) | 0.26 (0.14) | 0.15 (0.11) | −0.01 (0.10) | 0.51 (0.05) | 0.46 (0.06) | 0.25 (0.05) | 0.20 (0.05) |
| Ligand geometric | 0.44 (0.11) | 0.42 (0.11) | 0.12 (0.17) | 0.10 (0.15) | 0.51 (0.06) | 0.49 (0.06) | 0.25 (0.06) | 0.23 (0.06) |

Ligand property + ECFP4 descriptors are identical in the MD and crystal representations, so a single value applies to both columns of each pair.
Interestingly, for random splits, models trained only on static pocket property descriptors perform better than models trained on their time series counterparts. The situation changes with target splits; the static models perform significantly worse compared to the MD-derived models, to which they were previously superior. One explanation of this result is the interdependence of pocket and ligand descriptors. Since pocket descriptors are calculated with respect to the ligand position, ligand information is implicitly contained, even more so with a dynamic representation of the pocket.
The highest difference between crystallographic and MD derived models is obtained when training with interaction descriptors (0.450 vs. 0.257). This result confirms our initial hypothesis that the interaction and motion descriptors when represented as time series provide novel and useful information in the context of affinity prediction. However, these ts_descriptors, when combined with other types, fail to make a substantial difference in performance. This might indicate the need for a more thorough representation of motion and interactions present in ligand–receptor complexes, compared to the setup tested in this work.
An interesting result was achieved by the models trained only with ECFP4 and ligand properties. These models, having no information about the target, show elevated performance in affinity prediction, suggesting that they mostly learn biases and random relationships in the data rather than predict affinity as a function of both the target and ligand complex. Similar conclusions in the context of protein-ligand affinity predictions have been noted in other studies as well.44–46
SHAP analysis presents the 20 most important features, together with their utilization counts, across all 20 models of each type, static- and time series-based (Table S7). Each model uses a slightly different set of descriptors (or ts_descriptors) depending on the data splits it was trained on. For reference, SHAP analysis of a single static-based and a single time series-based model is presented in Fig. S4 in the SI.
One striking difference between the static and time series models is the heavy reliance of the former on ligand descriptors. Out of the 20 most influential descriptors, nearly half (9/20, 45%) are ligand based. The same comparison for time series models shows that only 25% (5/20) of their most influential ts_descriptors are ligand based. Moreover, for static models, 5/9 of the most influential ligand descriptors are simple 2D physicochemical features; for time series models it is just 1/5. These results may explain the poorer performance of static models observed with the scaffold splits.
SHAP analysis shows that the MD-derived models rely mostly on a different set of features compared to models trained on static representations. There are three motion descriptors important for the time series models. The pocket contacts (both old and new) and ligand RMSD are especially interesting, as they are exclusive to the MD-derived representation.
Taken together, the SHAP results show that the time series models use different sets of features, rely on simulation exclusive motion descriptors, and use less simple ligand features, rendering them possibly less prone to small molecule bias present in datasets.
The novelty of our approach lies in a feature-centric representation of MD data, where selected protein–ligand interaction properties are tracked over time and summarized using time-series descriptors (ts_descriptors). Unlike snapshot-based augmentation strategies commonly employed in prior studies, this framework captures ensemble-level interaction tendencies through temporal statistics, enabling efficient learning from MD trajectories using tabular representations.
One of the central findings of this study is that longer MD simulations do not necessarily improve predictive performance. Across more than 800 simulations, models trained on descriptors derived from 20 ns trajectories consistently matched or outperformed those based on 200 ns simulations. This effect is likely attributable to increased variational noise introduced at longer timescales, which is difficult for machine learning models to filter. We further investigated whether the optimal MD simulation length depends on macroscopic protein–ligand properties, such as protein or ligand size or intrinsic flexibility. Despite extensive analysis across the MDD, we did not observe any consistent relationships between the best-performing simulation length and these features (see SI Fig. S6). Nevertheless, we note that more subtle dependencies may emerge when considering specific protein families, enzymatic classes, or systems involving pronounced allosteric motions, which remain important directions for future investigation.
Another important finding of this study is that even a rather generic MD protocol and a relatively small number of complexes can be used with success to achieve predictive accuracy on par with or better than highly complex models based on neural networks, with a much larger number of parameters. It is therefore expected that increasing the number of short MD simulations will further improve prediction performance. At scale, this conclusion brings hope to the inclusion of short MD simulations into protocols concerning diverse chemical library screening and hit prioritization and is also consistent with some previous studies done on single targets.10,14,15
The choice of traditional machine learning methods over deep learning was motivated by the tabular, physically interpretable nature of the descriptor set, as well as the need for transparency and robustness in the presence of correlated interaction features. While deep learning models have demonstrated strong performance in affinity prediction, the performance gap remains limited for engineered descriptors, and classical approaches provide a strong, interpretable baseline. However, the MDD provides MD trajectories that can be used to construct more sophisticated representations, including graph-based or sequence-aware encodings that are better suited for deep learning architectures for further comparison and benchmarking.
Comparisons between models trained on static crystallographic descriptors and those derived from MD trajectories reveal that the two approaches consider different descriptors to be most relevant. Interestingly, we note a number of time series derived descriptors with significantly better correlations compared to their static counterparts (Fig. 6). Their combined influence, however, translated only slightly into improved affinity prediction performance. Given that extensive cross-correlation filtering was performed, this suggests that non-linear correlations may decrease the overall performance. Ablation analysis and SHAP studies (see SI Fig. S3) further confirm these findings, showing that the two models employ different types of descriptors. In the case of the more challenging target splits, an advantage of MD-derived models can also be observed, highlighting the potential generalization advantages of the time series descriptors.
In both the screening and pose selection tasks, MD-based models showed better performance than models based solely on static crystallographic structures. However, the higher standard deviation in the results of the MD models suggests a greater sensitivity to the selection of the initial conformation, which may affect the reproducibility and stability of the prediction. This phenomenon may be due to the greater complexity of dynamic representations and should be further investigated, taking into account different classes of molecular targets and simulation conditions. Importantly, while the descriptors used in this study are derived from molecular dynamics trajectories, they represent statistical summaries of interaction properties sampled over time, rather than an explicit encoding of discrete dynamic events. The time series descriptors (ts_descriptors) capture ensemble-level tendencies, such as the persistence and variability of classical protein–ligand interaction features (van der Waals contacts, hydrophobic interactions, hydrogen bonding, etc.), aggregated across the simulation window. Our analysis shows that the model's performance emerges from multivariate combinations of interaction features rather than individual interaction terms.
Consequently, the presented representation should be viewed as an approximation of interaction dynamics, designed with physical interpretability, robustness, and scalability for machine-learning applications. Capturing true dynamic processes, such as interaction lifetimes, state transitions, or allosteric shifts, would require alternative time-resolved or event-based representations, which remain an important direction for future work. Within this context, the MDD and the descriptor framework introduced here provide a reproducible and extensible baseline upon which more mechanistically explicit modeling approaches can be developed and benchmarked.
In summary, we demonstrate that short MD simulations combined with time-series descriptor representations and classical machine learning models can achieve predictive performance on par with, and in some cases exceeding, models based on static structures, while offering improved generalization and scalability. Although the developed MDD remains limited in size relative to static structures, our findings already show the strong potential of MD aware models and lay the foundation for further development of larger and more diverse protein–ligand MD collections.
Supplementary information (SI) is available. See DOI: https://doi.org/10.1039/d5dd00452g.
| This journal is © The Royal Society of Chemistry 2026 |