Open Access Article
Lucas Andersen†
a,
Max Rausch-Dupont†
b,
Alejandro Martínez León
a,
Andrea Volkamer
c,
Jochen S. Hub
*a and
Dietrich Klakow
*b
aTheoretical Physics and Center for Biophysics, Saarland University, PharmaScienceHub (PSH), 66123 Saarbrücken, Germany. E-mail: jochen.hub@uni-saarland.de
bSpoken Language Systems, Center for Bioinformatics, Saarland Informatics Campus, Saarland University, PharmaScienceHub (PSH), 66123 Saarbrücken, Germany. E-mail: dietrich.klakow@lsv.uni-saarland.de
cData Driven Drug Design, Center for Bioinformatics, Saarland Informatics Campus, Saarland University, PharmaScienceHub (PSH), 66123 Saarbrücken, Germany
First published on 13th June 2026
Predicting protein–ligand binding affinity with high accuracy is critical in structure-based drug discovery. While docking methods offer computational efficiency, they often lack the precision required for reliable affinity ranking. In contrast, molecular dynamics (MD)-based approaches such as MMGBSA provide more accurate binding free energy estimates but are computationally intensive, limiting their scalability. To address this trade-off, we introduce an active learning framework that automates molecule selection for docking and MD simulations, replacing manual expert-driven decisions with a data-efficient, model-guided strategy. Our approach integrates fixed, partly pre-trained deep learning molecular embeddings (MolFormer, ChemBERTa-2, and Morgan fingerprints) with adaptive regression models (e.g. Bayesian Ridge and Random Forest) to iteratively improve binding affinity predictions. We evaluate this approach retrospectively on a new dataset of 59
356 chemically diverse compounds from ZINC-22 targeting the MCL1 protein using both AutoDock Vina and MMGBSA binding free energy scores. Validation against a subset of experimentally measured binding affinities demonstrates that MMGBSA scores exhibit a stronger ranking correlation than the docking scores. Our results show that incorporating MMGBSA scores into the active learning loop enables highly efficient compound selection, recovering 79.9% of the top 1% MMGBSA-ranked binders while screening only a fraction of the dataset. In contrast, docking-guided selection identifies a largely distinct set of compounds, recovering only 6.7% of these top MMGBSA-ranked binders, underscoring the critical impact of scoring function choice. Furthermore, we demonstrate that a one-at-a-time acquisition active learning strategy consistently outperforms traditional batched acquisition, with the latter achieving just 78.4% recovery with MolFormer and Bayesian Ridge. These findings underscore the potential of integrating deep learning-based molecular representations with MD-level accuracy in an active learning framework, offering a scalable and efficient path to accelerate virtual screening and improve hit identification in drug discovery.
To facilitate the identification of promising compounds, several extensive screening libraries are available. The ZINC22 (ref. 3) database contains over 50 billion chemical compounds that can be ordered from commercial suppliers. Another example is Enamine's Real database4 featuring building blocks that can be combined into 70 billion compounds, which can be synthesized on demand with a success rate of 80% within two weeks.
Different computational methods have been developed to search the vast chemical space, which vary in their computational cost and in their accuracy in correctly ranking ligands by binding affinity. Efficient molecular docking algorithms have been used with the aim to predict the optimal binding conformation of ligands within a protein binding pocket by minimizing an affinity scoring function5,6 and as an enrichment tool to prioritize ligands that are more likely to bind to a target. However, docking protocols typically neglect the flexibility of the protein and rely on highly simplified models for protein–ligand interactions and solvent contributions, which may lead to poor correlations between docking scores and experimental binding affinities.7 In contrast, methods based on molecular dynamics (MD) simulations can account for protein flexibility and explicit solvent effects, but they come at a significantly higher computational cost compared to docking. Among MD-based methods, free energy perturbation (FEP) may be considered as a gold standard for affinity predictions as it relies on a rigorous statistical framework along an alchemical binding pathway and treats the solvent explicitly throughout the simulations and affinity calculations.8 Simplified MD-based approaches are given by end-point free energy techniques such as the Molecular Mechanics Generalized-Born Surface Area (MMGBSA) method. MMGBSA involves explicit-solvent simulations of both the bound and the unbound state; however, it approximates the solvent and entropy contributions to the binding affinity with implicit models.9–11 In this study, we score ligands using MMGBSA, an MD-based binding affinity estimation technique,12 and AutoDock Vina,5 a tool built upon an empirical scoring function. Active learning approaches, such as Bayesian optimization,13 can build upon binding affinities from MD-based methods such as MMGBSA, enabling binding affinity predictions with MD-level accuracy while maintaining scalability for large-scale VS applications. In active learning, binding affinity predictions are performed iteratively, thereby avoiding the calculation of binding affinities for all available molecules and instead allowing the algorithm to focus on promising regions of the chemical space. During each iteration, a surrogate model selects a subset of molecules for simulation. The molecules are selected such that the likelihood of finding the molecule with the lowest binding free energy in the entire database is maximized. After each iteration, the surrogate model is updated with the newly computed binding affinities obtained from the simulation, thereby refining its predictions for subsequent rounds.
While the optimization of lead candidates using generative models has gained significant attention in recent years,14–17 active learning13 has proven highly effective for screening large datasets in drug design campaigns.18–24 Graff et al.22 analyzed several surrogate models in a pool-based active learning setting and found that the best performance is achieved when using multi-layer perceptrons (MLPs) and message-passing neural networks25 (MPNNs) as surrogate models. However, training such surrogate models is time-consuming. Cao et al.26 proposed a method in which the entire model, including the pretrained components, is updated during each training iteration. While continuing to train a pretrained model can enhance performance, it is computationally expensive and requires careful handling to prevent the new updates from overwriting previously learned knowledge.27
In this work, we demonstrate how combining Bayesian active learning with binding affinities from MD-based MMGBSA calculations can both accelerate the drug discovery process and bring MD-level accuracy to virtual screening applications. To this end, we compiled a large dataset of binding affinities for 59
356 chemically diverse compounds from the ZINC-22 database targeting the binding pocket of the myeloid cell leukemia 1 (MCL1) protein. MCL1 is a promising target for anti-cancer therapies that has been used previously to benchmark binding affinity calculations.28,29
We show that by using embeddings from a pretrained chemical language model without further fine-tuning, the surrogate model can be a simple model such as ordinary linear regression while maintaining the high level of performance. Additionally, retraining the model speeds up the active learning process during the early stages of learning. Specifically, we find that using a pretrained embedding model does not require retraining the full model; instead, updating a simple linear regression model after each iteration yields the same retrieval of top binders and also offers the possibility of one-at-a-time molecule acquisition. Using our active learning pipeline, we obtain 79.9% of the top-1% binders according to MMGBSA after querying 6% of the approximately 60
000 compounds. In addition, we find that learning docking scores is easier than learning MMGBSA scores, as shown by retrieving 97.1% of the top-1% binders according to docking scores after querying 6% of the compounds. However, since MMGBSA scores exhibit a much stronger correlation with experimental binding affinities compared to docking scores, our results suggest that the combination of active learning with MMGBSA scoring provides a balance between computational efficiency and predictive accuracy for identifying promising ligand candidates.
| ΔG = ΔGPLsolvation − ΔGPsolvation − ΔGLsolvation + ΔGgas |
Here, ΔGPLsolvation, ΔGPsolvation, and ΔGLsolvation denote the solvation free energies of the protein–ligand complex, the protein, and the ligand, respectively, computed with the Generalized Born (GB) method or the Poisson–Boltzmann (PB) method,30 together with the solvent-accessible surface area (SA). These methods operate within the framework of implicit solvent models, simplifying the representation of solvation effects.
ΔGgas denotes the binding affinity in the gas phase, which is computed as the difference in expected values of potential energy, derived from a molecular mechanics force field, between the products (protein–ligand complex) and the reactants (protein and ligand). This term may optionally include an entropic correction contribution.12,31 While the MD simulations are typically conducted in explicit solvent (Fig. 1a) to ensure realistic conformational sampling, the solvation free energy terms (ΔGXsolvation) are estimated using implicit solvation models (GB or PB combined with SA). Although absolute binding free energy values from MMGBSA often deviate significantly from experimental binding affinities, the method has been shown to provide reasonably accurate rankings of ligands binding to the same target protein.28,32–34
, BO mitigates the cost of labeling the entire domain by constructing a probabilistic surrogate model
of the objective function f, which subsequently evaluates the utility of labeling each unlabeled data point. Since the objective is to identify the optimum of f, the utility is typically defined as a measure of improvement relative to the currently labeled data points. Gaussian Processes (GPs)35 have traditionally been a popular choice for surrogate modeling, while Bayesian Neural Networks (BNNs)36,37 have emerged as a more recent alternative.
Surrogate models are typically Bayesian for two main reasons. First, in theory, Bayes' theorem allows efficient model updates when new labels are acquired, eliminating the need for costly full-model retraining. Second, Bayesian models provide access to the full posterior distribution, enabling more robust selection of the next evaluation point by considering, for example, the expected prediction rather than only the most likely prediction.
The utility of a new point can be assessed by acquisition functions such as expected improvement EI(x).38 This criterion assigns zero utility to points that are not expected to improve upon the best observed value, f★, while assigning the expected improvement to all other points. If the objective is to find a minimizer of f, EI is given by
![]() | (1) |
denotes the Iverson bracket, which evaluates to 1 if its argument is true and to 0 otherwise. The expectation is computed over all surrogate models
that have a non-zero posterior probability.
Assuming a Gaussian posterior distribution, the expected improvement permits the following closed-form expression:
| EI(x) = σ(x)[δ·Φ(δ) + ϕ(δ)] | (2) |
| δ = f★ − µ(x)/σ(x) | (3) |
The main difficulty in adapting such models to BO is that they are not Bayesian by default. Simple approaches, such as Monte Carlo dropout,54 cannot be easily applied to pretrained models not originally trained with dropout, since these models lack the necessary dropout layers to enable uncertainty estimation. To overcome this issue, we perform Bayesian linear regression, with the pretrained embeddings as input. Bayesian linear regression allows direct access to the posterior distribution, including uncertainty estimates. This approach has been shown to be a viable alternative to fully probabilistic methods.37,55
000 molecules is highly diverse, with many compounds differing substantially in their scaffold and functional groups.
We measure similarity using the maximum common substructure (MCS) metric.58 We leverage RDKit59 version 2024.3.1 to calculate the MCS and include all molecules above a similarity threshold of 0.4. Despite screening for similarity, most compounds show a low similarity to the query, and the average pairwise similarity in the dataset is just 0.22. A histogram of the similarity distribution can be found in the SI (Fig. S1). Moreover, the structural diversity of the dataset is further supported by the observation that many compounds achieve a significantly worse MMGBSA score than the experimentally validated binders (Fig. 2c).
Afterwards, to obtain an initial pose for the MMGBSA computation, we dock all molecules using AutoDock Vina 1.2.5.5 We assess the validity of the docking pipeline by redocking the available ligands from Friberg et al.57 and comparing the root mean-square deviation (RMSD) to the reference crystal structure (Fig. S2). For each ligand, we carry out MD simulations to obtain the binding free energy using MMGBSA (see Methods). The final size of our dataset comprises 59 356 compounds.
Fig. 2a/b show the correlation of the experimental binding affinities with MMGBSA and AutoDock Vina for the subset of experimentally known binders. The Vina score exhibits poor Pearson correlation with the experimental binding affinities (0.177 with a 95% confidence interval between −0.21 to 0.52 derived via bootstrapping) and poorly ranks the ligands as indicated by a Spearman correlation coefficient of 0.174 (95% confidence interval −0.21 to 0.52 according to bootstrapping) (Fig. 2a). As expected, MMGBSA overestimates the absolute binding affinity; nevertheless, it demonstrates substantially improved agreement with the experimental data, with Pearson and Spearman correlations of 0.610 and 0.607, respectively (Fig. 2b) (bootstrapped 95% confidence interval for Pearson correlation: 0.30 to 0.80; and for Spearman correlation: 0.26 to 0.82). Using MMPBSA instead of MMGBSA further improves the correlation with the experimental data, with Pearson and Spearman correlations of 0.689 and 0.684, respectively (Fig. S3) (bootstrapped 95% confidence interval for Pearson correlation: 0.41 to 0.85; and for Spearman correlation: 0.33 to 0.87). These findings support our assumption that the computationally more demanding MMGBSA method provides better ligand ranking than AutoDock Vina. Notably, among the 59
356 compounds of our augmented dataset, docking and MMGBSA scores correlate poorly (Fig. 2c), demonstrating that the MMGBSA and Vina scores are sensitive to different molecular properties. The orange and pink areas of Fig. 2c highlight the top 1% of molecules based on the MMGBSA score and the Vina score, respectively, demonstrating that the sets of top 1% binders according to MMGBSA and Vina are nearly non-overlapping. Only 1.6% of the top 1% binders according to MMGBSA also belong to the top 1% binders according to the Vina score. Overall, MMGBSA correlates more strongly with the experimental binding data subset and identifies a largely distinct set of top binders from Vina. These results suggest that Vina may not reliably distinguish the true highest-affinity compounds, making MMGBSA the preferred choice despite its higher computational cost. In addition, MMPBSA-based binding affinities were computed for the augmented dataset. Due to MMPBSA and MMGBSA scores exhibiting a comparable performance on the experimentally validated dataset, the computationally less expensive method, namely MMGBSA, is used in the following analyses. A separate analysis using MMPBSA scores is provided in the SI (Fig. S7). However, it is important to note that both methods may be used interchangeably within the active learning pipeline.
240 molecules taken from the Enamine's Discovery Diversity Collection (
), docked against thymidylate kinase using AutoDock Vina, as described by Graff et al.22 Unlike previous studies, our work additionally incorporates MMGBSA scores.
The acquisition function is a mathematical function that computes the informational value of a molecule based on the model prediction and uncertainty estimation. Step (2) of Fig. 4 uses the acquisition function to compute the informational value of each molecule and selects a predefined number of the most valuable molecules in a greedy fashion, for which the binding affinity will be computed. After a selection of promising molecules, step (3) computes the respective binding affinities, for instance using an MD-based technique such as MMGBSA, thereby expanding our dataset. Step (4) of Fig. 4 uses the expanded dataset to retrain our surrogate model for the next iteration. The process is repeated, starting at step (1).
In summary, in each iteration, we score all unlabeled molecules using the surrogate model and use the acquisition function to select the set of molecules with the highest utility value. In cases where we acquire not only one but a batch of b data points per iteration, we update the model only every b iterations. After the label is queried, the regression head of the surrogate model is retrained. The initial set of computed binding affinities comprises a single molecule taken as the molecule closest to the centroid in the embedding space. The pseudo-code of the active learning pipeline is shown in algorithm 1.
be the weights of the model; we assume
, with α−1 being the variance. As is standard in regression tasks, we assume that the target is corrupted by independent Gaussian noise, i.e. y(x) = f(x) + ε, where
, f(x) = wTx, and x is the embedding vector of the ligand. Let
be the training inputs with associated targets
. By Bayes' theorem, the posterior distribution of the weights is a Gaussian with mean ŵ = βΣXTy and inverse covariance matrix Σ−1 = αI + βXTX. Marginalizing the posterior distribution over the weights yields the predictive posterior distribution as a Gaussian with mean ŵTx and variance σ2(x) = β + xTΣx. For BO applications, the true values of α and β are typically not known. We infer them by maximizing the marginal likelihood.75 For more information, we refer to Bishop75 or Murphy.76 Notably, the combination with a pretrained model is essentially the approach taken by Snoek et al.,37 and the weights of the embedding model may be considered as hyperparameters of the basis function.
. Graff et al.22 and Cao et al.26 used docking scores against thymidylate kinase obtained with Autodock Vina (see Datasets). To compare our pipeline with previous work, we follow the experimental setting by querying a batch of b molecules, where b is approximately 1% of the dataset size, corresponding to b = 500 for the
dataset. We perform six rounds, such that only 6% of the entire dataset is screened. The results are summarized in Table 1.
dataset based on Autodock Vina docking scores against thymidylate kinase. The first column shows the retrieval rate upon querying a batch of 500 molecules per iteration for six iterations, totaling 3000 molecules. The second column presents the retrieval rate upon querying 3000 molecules individually, one after another. We report results for both the greedy and upper confidence bound (UCB) acquisition strategies from Cao et al.26 (rows 1–4) and compare them with the results obtained by our method when using expected improvement (rows 5–10)
| Batched query | Single molecule query | |
|---|---|---|
| MolFormer (Greedy)26 | 78.36 | ✗ |
| MolFormer (UCB)26 | 79.24 | ✗ |
| RF (Greedy)26 | 54.52 | ✗ |
| RF (UCB)26 | 37.08 | ✗ |
| ChemBERTa-2 + linear | 78.23 | 81.85 |
| ChemBERTa-2 + RF | 65.73 | 73.99 |
| MolFormer + linear | 74.39 | 78.02 |
| MolFormer + RF | 40.52 | 58.26 |
| Morgan fingerprints + linear | 65.12 | 73.79 |
| Morgan fingerprints + RF | 39.91 | 52.62 |
To select new molecules for evaluation, we score the molecule's utility using the Expected Improvement (EI) criterion.38 Prior work has also considered alternative acquisition functions such as the Upper Confidence Bound (UCB) and surrogate model predictions (greedy selection).22,26 In contrast to UCB, EI does not require tuning of additional hyperparameters. Following Cao et al.,26 we assess the model performance using the retrieval rate of the top-1% of molecules according to the binding affinity score. Cao et al.26 reported a top-1% retrieval rate of 79.24% on
using the MolFormer model and updating all parameters after querying each batch.49 In comparison, our approach combining MolFormer with a linear model (
) achieves a slightly lower rate of 74.39%. However, substituting MolFormer with ChemBERTa-2 embeddings (
) improves performance, yielding a retrieval rate of 78.23%. The similar performance of the batched acquisition and one-at-a-time acquisition models indicates that updating the regression model, instead of updating the entire pipeline, including the regression model and the much larger embedding model, has minimal impact on the overall retrieval rate. Thus, it is not required to further train the embedding model. Furthermore, we find that using a Random Forest model instead of the linear regression model significantly worsens performance in all cases (Table 1). A similar trend is observed with Morgan fingerprints, which exhibit comparatively lower retrieval performance than models based on pretrained embeddings. Consequently, we focus our further analysis on results obtained using the ChemBERTa-2 and MolFormer embedding models in combination with the linear regression model. Additional results involving the RF model and Morgan fingerprints are presented in the SI (Fig. S4–S6). In conclusion, these findings suggest that full model fine-tuning is not necessary; instead, updating only the regression model after each iteration is sufficient to achieve strong retrieval performance.
By querying a single molecule per iteration, we achieve a top-1% retrieval rate of 81.85% using
on the
dataset, surpassing the previous best performance of 79.24% (Table 1). The advantage of querying single molecules is more evident during the early stages of the active learning process (Fig. 5): here, the top-1% retrieval rate increases more rapidly with a one-at-a-time acquisition as compared to querying batches of 500 molecules, irrespective of the choices for the surrogate and embedding model.
dataset, we next apply our pipeline to our newly derived dataset comprising 59
356 compounds taken from the ZINC22 database binding to MCL1 (see Datasets). Accordingly, we analyze the performance of active learning on both the docking scores (
) and the MD-based MMGBSA binding affinity (
), while again comparing the batched-query active learning approach with the one-at-a-time variant. Considering the superior performance of the linear model, the following analysis focuses exclusively on this model. We provide results with the Random Forest model and additional batch sizes in the SI.
Focusing first on the
dataset, the batched model results in comparable scores between the MolFormer and ChemBERTa-2 models, both achieving a retrieval rate over 90% (Table 2). Querying a single molecule, instead of a batch, greatly improves the top-1% retrieval during the early stages of the active learning process (Fig. 6, left) and, after 3600 iterations, improves the top-1% retrieval rate in both cases, with
reaching 95.8% and
reaching 97.1%. These advantages of the one-at-a-time acquisition are in line with the findings on the docking-based
dataset presented in the last section. Furthermore, the finding that using MolFormer and ChemBERTa-2 yields similar learning rates aligns with the findings on the
dataset.
| Batched query | Single molecule query | |
|---|---|---|
| (a) MCL1-Docking | ||
| ChemBERTa-2 | 94.6 | 97.1 |
| MolFormer | 93.8 | 95.8 |
![]() |
||
| (b) MCL1-MMGBSA | ||
| ChemBERTa-2 | 59.9 | 62.7 |
| MolFormer | 78.4 | 79.9 |
Applying our pipeline to the
dataset shows successful learning of MMGBSA scores. A key finding of our study is that the pipeline achieves top-1% retrieval rates of up to 79.9% upon querying 6% of the compounds when using single molecule query (SMQ) and MolFormer.
However, active learning of the
scores also revealed striking differences from the
scores. We obtained lower top-1% retrieval rates when using MMGBSA scores (62.7% and 79.9%, SMQ) as compared to using docking scores (97.1% and 95.8%, SMQ; Table 2 and Fig. 6, compare the left and right panels). In addition, while the top-1% retrieval rate based on docking scores increases rapidly during early stages of the process, the top-1% retrieval rate based on MMGBSA increases irregularly and requires far more queries before reaching significant retrieval rates. These findings suggest that learning docking scores is easier as compared to learning MMGBSA scores. We attribute these findings to the relative simplicity of the docking scoring algorithm, which we hypothesize renders the energy landscape smoother and its patterns easier for the model to learn. In contrast, MMGBSA scores may have a more complex structure owing to the physically more detailed representation of the ligand–protein interactions, rationalizing the need for a larger number of queries to achieve good top-1% retrieval rates.
As a first analysis in this direction, we compare by how much the activity score, either Vina or MMGBSA, changes between two points in embedding space. To this end, we compute pairwise differences, as well as the Euclidean distance between the datapoints in the MolFormer embedding space. Since the values of MMGBSA and Vina scores themselves differ significantly in magnitude, we normalize the differences by their respective standard deviations. In Fig. 7 we show the average activity difference for a range of distance values. After normalization, we can observe that for essentially all distances, the MMGBSA scores show larger differences than the Vina scores. Indicating that, in the embedding space, the MMGBSA energy landscape is indeed less smooth than the Vina landscape. While our present analysis provides sufficient insight into the optimization process and learnability, a more detailed study of the effect of smoothness could offer valuable directions for future work.
Additionally, the choice of the embedding plays a different role during learning of MMGBSA scores as compared to docking scores (Fig. 6, right, blue and orange curves). Upon learning docking scores with the linear surrogate model, learning based on ChemBERTa-2 slightly outperforms learning based on MolFormer. In contrast, upon learning MMGBSA scores, using MolFormer leads to 20% better learning rates compared to using ChemBERTa-2. To test whether these findings are consistent across independent active learning campaigns, we repeated training of our surrogate model, yet starting from different initial molecules§ (Fig. 8). Evidently, retrieval rates differ considerably when using different initial molecules. However, top-1% retrieval rates after 3600 iterations using MolFormer consistently outperform the learning rates using ChemBERTa-2 or Morgan fingerprints, as shown by top-1% retrieval rates averaged over seven campaigns of 84%, 62%, and 50%, respectively. We hypothesize that the structural context embedded in MolFormer through the use of masked language modeling as a pretraining objective helps our surrogate model in learning MMGBSA scores. Thus, systematic testing of different embedding strategies for learning MMGBSA scores may be insightful in future studies.
Lastly, an important consideration is that docking and MMGBSA scores rank compounds very differently (Section 3), such that the choice of the scoring function substantially impacts active learning outcomes. While docking-guided selection may appear more effective based on retrieval metrics for docking scores themselves, when, e.g., MolFormer embeddings are used with docking scores, only 6.7% of the identified top compounds overlap with those selected using the same model with MMGBSA scores, suggesting that apparent performance gains do not necessarily translate to identifying the same compounds. Considering that MMGBSA scores correlate more strongly with experimental binding affinities than docking scores, we infer that MMGBSA-based selection may better identify candidates with stronger experimental binding affinities.
Recently, Sultan et al.50 suggested that the MTR objective achieves superior results in property prediction tasks. The binding affinity depends, in addition to properties such as ligand hydrophobicity, also on structural information such as the linker length between two chemical subgroups and packing properties of interface atoms.57,77 We hypothesize that subtle differences between the structural properties of molecules are not necessarily reflected in some chemical properties and therefore can lead to similar embeddings of these molecules. As a result, accurate prediction of binding free energies using simple linear regression is nearly impossible. The MLM objective on the other hand is not biased towards embeddings that encode domain specific knowledge (such as physicochemical properties of molecules), but rather implicitly learns the structure of molecules and possibly information relevant to the task at hand. It is important to note that additional protein targets must be tested to draw a definitive conclusion, as the observed results may also originate from MolFormer embeddings more effectively capturing ligand characteristics that may be particularly relevant to the MCL1 protein target and thereby resulting in the observed significantly higher top-1% retrieval rates.
Across all datasets, selecting a single molecule per iteration consistently yields higher retrieval rates, especially during early learning, regardless of the embedding or regression model used (Fig. 5 and 6). This is likely due to more frequent model updates, which enhance predictive performance. Unlike batched acquisition, one-at-a-time acquisition allows the model to retrain after each new data point, which is particularly advantageous when training data are still scarce.
While one-at-a-time acquisition improves early retrieval, it limits parallelization because each selection depends on the outcome of the previous simulation. In contrast, batched strategies enable concurrent evaluations, improving resource utilization on parallel architectures and overall throughput. Future work could explore dynamic batching, where new samples are incorporated as soon as computations are complete, to balance frequent updates with parallel efficiency.
This plateau suggests that, upon learning MMGBSA scores, the model may get trapped in a specific region of the chemical space before eventually identifying a new set of molecular properties that help to escape to other promising regions. Notably, this initial plateau is not observed when using the batched acquisition strategy, presumably because the sampling is too coarse.
In the SI, we provide results that empirically support that initialization with a diverse set or even a slightly larger batch size can help to mitigate this effect.
Although higher top-1% retrieval rates are achieved when using docking scores, it is important to consider how well these scores actually reflect the true binding affinity. To assess this, we have compared predicted binding affinities, based on both docking and MMGBSA scores, with experimentally measured values for a set of known binders (see Fig. 2a and b). Our analysis reveals that MMGBSA scores exhibit a much stronger correlation with the experimental data (Pearson: 0.610) compared to docking scores (Pearson: 0.177). Furthermore, Fig. 2c shows no apparent correlation between MMGBSA and docking scores across our dataset. Notably, the sets of top-1% binders identified by each scoring method differ significantly, with only a 1.6% overlap (represented by the green region in the figure). This may come as a result of MMGBSA accounting for the dynamics of both protein and ligand, in contrast AutoDock Vina relies on simplified empirical scoring functions.5 We hypothesize that this makes the relationship between ligand structure and predicted binding affinity more complex, making it more difficult to capture with machine learning models. This added complexity may help explain the observed irregular improvement of top-1% retrieval rates over the iterations.
Moreover, the embedding models used in this study work solely with the SMILES representation of molecules as input and therefore lack explicit information about their three-dimensional structures and spatial orientation within the protein's binding pocket. Such structural context may be critical for accurately capturing the interactions that influence MMGBSA scores. The absence of this information could be a key factor limiting the model's ability to achieve top-1% retrieval rates with MMGBSA scores comparable to those observed with docking scores.
Given the high computational demands of MD-based binding free energy estimations, existing datasets typically contain only a few thousand data points. In this study, we computed binding free energies with MMGBSA for approximately 60
000 ligands targeting MCL1. To test our active learning framework, we compared its performance upon using MMGBSA scores relative to using docking scores. After querying only 6% of the dataset, the model successfully identified 79.9% of the top-1% ligands according to MMGBSA, laying the groundwork for large-scale virtual screening with MMGBSA-level accuracy. Additionally, our results indicate that docking scores are easier to predict than MMGBSA scores, which may be attributed to the more dynamic transformation process of the protein and ligand when conducting MD simulations for MMGBSA. We hypothesize that enriching SMILES-based embeddings with additional information, particularly regarding ligand poses within the binding pocket, may help address this information gap and alleviate the observed performance difference. While MMGBSA scores are more challenging for the model to learn, they exhibit a stronger correlation with experimental binding affinities than docking scores in the subset of experimentally known MCL1 binders. This suggests that top binders selected based on MMGBSA may represent more promising candidates than those selected with docking. Based on our results, this active learning pipeline is now ready to incorporate additional scores into the learning loop such as affinities from free energy perturbation simulations or from wet lab experiments.
Supplementary information (SI): additional results and experiments. See DOI: https://doi.org/10.1039/d5dd00522a.
Footnotes |
| † These authors contributed equally to this work. |
‡ The query's SMILES is . |
| § We provide results for other initialization schemes in the SI. |
| This journal is © The Royal Society of Chemistry 2026 |