Open Access Article
Mohit Pandey
ab,
Tanvir Sajedab,
Ivan Semenovab,
Renfei Zhangc,
Fuqiang Banab,
Ekaterina Manskaiaab,
Martin Esterc and
Artem Cherkasov
*ab
aVancouver Prostate Centre, University of British Columbia, Vancouver, British Columbia, Canada. E-mail: acherkasov@prostatecentre.com
bFaculty of Medicine, University of British Columbia, Vancouver, BC, Canada
cSchool of Computing Science, Simon Fraser University, Burnaby, BC, Canada
First published on 28th May 2026
The exponential growth of accessible chemical space represents a significant computational challenge for structure-based virtual screening. Hence, active-learning and machine-learning approaches, such as Deep Docking, have been introduced to significantly speed up this process; yet even such methods became computationally prohibitive as docking libraries expanded into and beyond billion-entries levels. To address this challenge, we herein introduce the Deep Docking Ultra (DDU) approach, which integrates advanced acquisition functions with a pre-trained molecular large language model (MLLM). We demonstrate that such a combination improves accuracy of docking score emulations, while significantly reducing their computational costs. Through 384 virtual screening experiments involving 12 proteins from all major target classes, we systematically benchmarked DDU performance to identify optimal configurations that reduce required computations by up to 45-fold compared to the original Deep Docking method, and by up to 28
500-fold, compared to brute-force docking, without compromising predictive accuracy. We further demonstrate that DDU is able to screen 10.1 billion ligands against the phosphoglycerate kinase 2 target in just 10 days using 50 tesla V100 GPUs, and yields an overall docking enrichment factor of 12
000.
Broadly, strategies addressing navigation of billion-scale chemical space fall into two categories. The first, exemplified by ultra-high-throughput docking (UHTD) and AL-accelerated VS approaches9–11 operates over pre-enumerated and synthetically accessible vendor libraries. The second-category methods such as fragment-based generative approaches12 or structure-based diffusion models,13,14 construct molecules de novo, guided by learned representations of bioactivity or target geometry. While such generative methods can theoretically extend beyond the existing libraries, they face the distinct challenge of synthetic accessibility.15 These two paradigms are therefore complementary. UHTD and AL-based approaches offer immediate access to synthetically tractable hits, while generative approaches extend the boundaries of accessible chemical novelty. As chemical libraries themselves continue to grow toward and beyond the trillion-molecule scale, efficient navigation of pre-enumerated space remains an essential and practical challenge.
Thus, in 2020 we introduced Deep Docking (DD)16,17 – an AL-driven approach that trains deep neural networks to predict docking scores from an iteratively sampled docking database. DD demonstrated significant acceleration over exhaustive brute-force docking and led to numerous successful hit discovery campaigns involving billions of molecules, including the identification of inhibitors for SARS-CoV-2 papain-like protease (PLpro),18 Lin28 protein,19 WDR domain of LRRK2 protein,20 Macrodomain 1 (Mac1) of non-structural protein 3 (NSP3),21 human Androgen Receptor,22 and A2A adenosine receptor23 among many others. The use of DD facilitated our winning strategies in two worldwide hit discovery competitions CACHE-1 (ref. 24) and CACHE-3. (ref. 25)
Nonetheless, DD also remains relatively resource-intensive and does not fully allow optimal training set selection as chemical libraries continue to expand exponentially. To address these limitations, we introduce Deep Docking Ultra (DDU) – an enhanced AL framework that integrates more informative acquisition strategies with a pre-trained molecular large language model (MLLM) tailored for small molecules. The DDU uses the MLLM as a feature extractor within a deep architecture incorporating multi-head attention and residual layers (Fig. 1), enabling accurate identification of high-scoring compounds through binary classification while requiring substantially fewer explicit docking calculations.
We conducted an extensive benchmarking of DDU method across 384 virtual screening (VS) experiments under diverse conditions and identified configurations that achieved up to a 45-fold reduction in docking computations compared to the original DD protocol, using a library of one million compounds (approximating the smallest size on which DD can be accurately trained). We further validated the scalability of DDU on an ultra-large chemical library, screening over 10.1 billion molecules from the Enamine REAL database within 10 days, while maintaining strong enrichment within the top one percent and achieving approximately a 28
500-fold reduction in docking computations compared to exhaustive docking.26 These results position DDU as an efficient and practical tool for large-scale SBVS, demonstrating that the integration of language models and optimized sampling strategies can significantly enhance computational drug discovery at reduced costs.
Our original DD method relied on a feedforward neural network trained on Morgan molecular fingerprints and was augmented iteratively with newly docked molecules.16 While very effective, this approach still required docking of millions of molecules during the training phase and relied on precomputing computationally expensive 1024-bit molecular fingerprints, which incurred significant time and storage costs. In contrast to the traditional DD model, DDU operates directly on SMILES representations and achieves comparable or superior performance while reducing docking computations by up to 45-fold. Following the standard 11-iteration AL training cycle described in our original DD publication,16 DDU requires docking only a total of 21
000 molecules: 20
000 in the first iteration and 100 additional molecules in each of the subsequent 10 iterations. The original DD approach, by comparison, requires docking approximately 11 million molecules per target for efficient training and active learning. This translates to a speedup of more than three orders of magnitude compared to DD16 (Fig. 2).
Using the developed DDU protocol, we further performed comparative benchmarking across 12 proteins representing four major drug-target classes that were used in the original DD publication. This panel included nuclear receptors, represented by androgen receptor (AR), estrogen receptor-alpha (ERα), and peroxisome proliferator-activated receptor gamma (PPARγ); kinases, including calcium/calmodulin-dependent protein kinase 2 (CAMKK2), cyclin-dependent kinase 6 (CDK6), and vascular endothelial growth factor receptor 2 (VEGFR2); G protein-coupled receptors (GPCRs), such as adenosine A2A receptor (ADORA2A), thromboxane A2 receptor (TBXA2R), angiotensin II receptor type 1 (AT1R); and ion channels, represented by Nav1.7 sodium channel (Nav1.7), Gloeobacter ligand-gated ion channel (GLIC), and gamma-aminobutyric acid type A receptor (GABAA).
The results on those 12 targets demonstrated that while both DD and DDU successfully enriched for high-scoring compounds significantly above the 1st percentile threshold of randomly selected training compounds, DDU exhibited superior performance across multiple metrics. Thus, the docking score distributions (Fig. 3) reveal that DDU consistently achieved taller peaks with reduced variance, indicating more effective enrichment for high-affinity compounds, while the t-SNE projections and corresponding entropy calculations (H) demonstrated that DDU identified substantially more diverse chemical space compared to DD across all targets. These results indicate that DDU overcomes the traditional enrichment-diversity trade-off in VS by simultaneously maintaining superior scoring performance and exploring broader chemical diversity. This represents a significant methodological advancement for drug discovery, where scaffold diversity is critical for identifying novel bioactive compounds.
In addition, DDU eliminated the need to compute and store molecular fingerprints, significantly lowering the memory usage. For instance, generating and storing Morgan fingerprints for a billion-compound library such as ZINC20 requires over 200 GB of disk space and 11 days of processing on a single CPU core. By using a SMILES-based model, DDU bypassed this bottleneck altogether. In practical terms, the peak RAM footprint of the DDU orchestrator is just around 4.5 GB during model training on a 1-million-molecule screen (SI Table S1), with no dependency on pre-stored fingerprint data.
000 training, 50
000 validation, and 100
000 test molecules drawn from this pool; the validation and test sets were held fixed throughout all subsequent iterations.
Among the 10 benchmarked DL architectures, which varied in model capacity from simple feedforward neural networks (FFNN) to modified transformer architectures known to perform well in molecular modeling tasks,27,29,30 MoLFormer-DR was finally selected because of its stable performance, minimal preprocessing requirements and effectiveness in large-scale screening (Fig. 4A). Although the residual FFNN and attention-based FFNN alternatives such as mlp6K and mlpTx offer slightly higher F1 scores, they require extensive molecular fingerprint computation, which limits scalability. MoLFormer-DR and MoLFormer-AH,27 while somewhat sensitive to overfitting at larger training sizes, achieve optimal performance with only around 50
000 training examples. It is also noteworthy that beyond 20
000 samples, the performance gains are marginal, indicating that such a modest training set is sufficient and very suitable for data-efficient VS pipelines (Fig. 4B).
We further established that the DDU performance was highly sensitive to the applied docking score thresholds across the studied 12 targets. A fixed threshold of the 1% quantile of the docking scores yielded significantly more stable and accurate classification compared to the dynamic AL threshold (Fig. 5A). The dynamic threshold recalculates the docking score cutoff at each AL iteration as the top 1% quantile of the current (growing) training set. While the active-to-inactive ratio within the training set remains approximately constant, the absolute cutoff value becomes progressively more negative. Thus, the performance decline observed under the dynamic threshold conditions is attributable to the concept drift. i.e. as virtual screening progresses, fewer molecules qualify as actives, reducing the positive class fraction and making the classification task progressively more difficult. Simultaneously, previously labeled active molecules may no longer satisfy the updated threshold criterion, causing the training objective to shift from one iteration to the next. This progressive misalignment between iteratively redefined labels and the fixed held-out test set is the primary driver of the F1 score deterioration visible in Fig. 5A.
Moreover, retaining predicted inactive molecules in the sampling pool of DDU (rather than removing them as was done in the original DD) resulted in further improved model performance and allows avoiding premature pool depletion and early termination of AL runs (Fig. 5B and S6). It should be noted, however, that such a benefit was not consistent across all 12 targets and depends on the overall false-negative rate, pool size and cut-off strategies implemented. Therefore, we recommend adopting a database retention approach for libraries of manageable size, whereas for ultra-large-scale VS campaigns, a database removal strategy is more practical.
Finally, we evaluated eight most widely adopted acquisition functions in AL-accelerated VS literature7,8 to guide AL sampling, indicating that uncertainty-based methods such as BALD28 and margin sampling consistently outperform greedy or random selection employed in the original DD method (Fig. 5C). The BALD sampling was most consistently among the top performers across nearly all targets when used with dynamic thresholding and database removal. Under the recommended configuration (fixed 1% quantile threshold with database retention), BALD acquisition consistently attained the highest or near-highest F1 scores across the majority of the 12 benchmark targets (SI Fig. S7 and S8) and is the recommended default acquisition function for DDU. For standard screens of up to ∼1 M compounds, 30 AL iterations is recommended as a practical default (∼10 hours on a single GPU, docking ∼21
000 molecules in total). For ultra-large campaigns where per-iteration docking cost is substantially higher, 5 iterations is a pragmatic alternative, as demonstrated in the PGK2 10.1B prospective campaign.
A detailed analysis of all benchmarking efforts is presented in SI Sections S2, S3 and SI Fig. S5–S9. These findings characterize the Deep Docking Ultra approach as an effective, fast, scalable, and accurate alternative to the previously reported Deep Docking, with strong potential for real-world application in ultra-large drug screening campaigns. Detailed computational benchmarks substantiating the scalability claims are provided in SI Table S1.
It is important to note that DDU is released as a fully configurable framework rather than a fixed protocol. All methodological choices evaluated in this work, including model architecture, acquisition function, thresholding strategy, database management approach, active learning budget parameters, and compute environment are all exposed as user-defined parameters in a single configuration file and do not require modification of the source code. The recommended default configuration is MoLFormer-DR with BALD acquisition, a fixed one percent quantile threshold, and database retention, based on the benchmarking results presented in this study. Users may adopt this default or adjust individual parameters to match the scale, computational resources, and scientific goals of their screening campaign.
Notably, the DUD-E compounds were not explicitly docked. Instead, the trained DDU and DD models were used to rank these molecules based solely on their molecular structures. This exercise aimed to test whether the learned representations transfer to the discrimination of experimentally confirmed bioactive molecules. It therefore provides a more stringent assessment of generalization than the in-distribution docking score prediction evaluated in the benchmarking experiments.
For each of these five targets, the DDU model achieved a higher area under the receiver operating characteristic curve (ROC-AUC) compared to the original DD model (Fig. 6A). This result demonstrates that, across the entire ranked list of screened compounds, DDU provides a more accurate and consistent separation between active and inactive molecules. Further comparison of enrichment factors at different cutoffs (EF@K for K = 10, 100, and 1000) revealed complementary strengths between the two models. The DD exhibited higher enrichment at lower K values, suggesting that it is particularly sensitive to a few top-ranking compounds. In contrast, DDU performed better at larger K values, such as K = 1000, which represents more realistic VS conditions (Fig. 6B). The training strategy and BALD acquisition policy used in DDU promote chemical diversity and information gain by intentionally sampling molecules that enhance model generalization rather than focusing exclusively on the highest docking scorers. This approach improves overall ranking performance, as reflected in higher ROC-AUC values, but may slightly reduce early-stage enrichment by trading immediate precision for broader exploration of chemically promising regions of the search space.
Within a fixed 10 days compute budget on 50 Nvidia Tesla V100 GPUs, DDU was executed for five AL iterations, retraining the model at each cycle using BALD-based uncertainty acquisition to select informative candidates among the top-scoring molecules. On a held-out test set constructed by randomly sampling the Enamine REAL 10.1 billion dataset, the best model achieved AUC-ROC = 0.71 and F1 = 0.31, reflecting strong early recognition performance despite the extreme class imbalance inherent to billion-scale VS. Notably, this AUC-ROC is comparable to the best-performing models in the original DD study (e.g., AR = 0.77, PPARγ = 0.79), while screening a dataset nearly seven times larger (10.1B vs. 1.3B molecules) within the same computational budget.
To further quantify the enrichment by the DDU, we computed virtual Enrichment Factor (EF) of virtual actives defined as compounds with docking scores within the top 1% (lowest energies) for top-K = 10
000, 100
000, and 1
000
000 ranked predictions (Fig. 7A). Hence, DDU exhibited pronounced early enrichment, with EF@10
000 ≈ 1.2 × 104, representing more than a 10
000-fold concentration of virtual actives relative to random expectation. As expected, EF values decreased with increasing K, consistent with progressive inclusion of less informative compounds in the top-ranked pool. The corresponding docking score distributions (Fig. 7B) further illustrate the selectivity of the DDU method, compared to a randomly sampled subset of one million molecules. Fig. 7B clearly demonstrates that the DDU docking score distribution is markedly left-shifted toward more favorable docking scores and exhibits a sharper, higher-density peak beyond the virtual-active threshold (−8.5 kcal mol−1). Future work will focus on assessing experimental translatability by incorporating the official CACHE Challenge #7 (PGK2) results upon their release in 2026, followed by the corresponding publications.
000-molecule seed set across 50 GPU sub-jobs (22.8 min; 6.6 mol s−1 per GPU), MoLFormer-DR model retraining on one GPU (10.0 min; 1331 samples per s), and pool inference scoring of 830
000 unlabeled molecules across 50 GPUs (1.3 min; 1759 mol s−1 per GPU). Peak GPU memory demand was 14.8 GB for docking, 11.8 GB for training, and 1.2 GB for inference, all within the capacity of a single V100-32 GB or equivalent ≥16 GB consumer GPU. The orchestrator process required at most 4.5 GB of system RAM, and per-iteration disk output was approximately 1 GB. Complete hardware specifications and throughput figures are provided in SI Table S1. As such, we demonstrate that DDU, as an AL-based accelerator for ultra-large docking screens, performs as intended, and we report its computational benchmarking results here.
000 docked molecules, compared to the approximately 13 million dockings required for optimal performance of the original DD.
On the other hand, by augmenting both high-scoring molecules and uncertain regions of chemical space through BALD-based acquisition, DDU enhances the docking surrogate model's exploration–exploitation balance and improves predictive robustness. In addition, it is worth noting that DDU allows retention of the predicted inactives in the entire unlabelled molecular pool across iterations, which are completely discarded in the original DD model refinement from one iteration to the other. This strategy ensures further improvement of robustness and recall, since the molecules erroneously predicted as inactives by the model of the previous iterations of DDU will be automatically corrected by the refined prediction models in subsequent active learning rounds.
Despite these advances, DDU shares limitations common to surrogate model-based VS paradigms, particularly regarding class imbalance, as typically fewer than 1% of molecules qualify as actives. This imbalance can reduce generalizability and affect model F1 scores. Since DDU prioritizes compounds based on model predictions, improving its classification robustness remains critical. Future research could explore integrating protein structural features into the surrogate model to better capture target-specific interactions and extending the pretraining of MLLM models to broader, multi-source chemical datasets to enhance their generalization. It should also be noted that DDU, as implemented here, is designed specifically as a surrogate accelerator for standard small-molecule docking campaigns against defined binding pockets. Extension to more challenging binding site geometries such as protein–protein interfaces, which typically present flat contact surfaces that challenge conventional docking scoring functions, would require re-evaluation of the docking oracle's suitability in that context and represents an important direction for future work.
We implemented three MoLFormer variants in PyTorch, each with progressively increasing architectural complexity, aimed at improving the performance of the docking score classifier by building upon the pretrained model.
| Number of training epochs | 40 |
| Optimizer | Adam |
| Learning rate | 0.001 |
| Scheduler | Scheduler with step size 4 and gamma 0.1 |
| Weight decay | 0.01 |
| Patience | 5 |
| Loss function | Cross entropy |
| Batch size | 2048 |
| Number of training epochs | 10 |
| Optimizer | Adam |
| Learning rate | 0.001 |
| Scheduler | Cosine scheduler |
| Step size | 40 |
| Gamma | 0.1 |
| Weight decay | 0.01 |
| Loss function | Focal loss |
| Batch size | 2048 |
In order to dock the 12 targets, protein structures were prepared in PDBQT format, and ligands were docked into a predefined binding site defined by a rectangular search box. For each target, the docking search space was specified by the box center coordinates (x, y, z) and box dimensions (sizex, sizey, sizez), and the corresponding receptor PDBQT file (Table 3). Ligands were prepared from SMILES by generating 3D conformers using the RDKit ETKDGv3 (ref. 33) procedure, followed by explicit hydrogen addition and conversion to PDBQT format prior to docking. The QuickVina2-GPU configuration used a fixed OpenCL binary34 and a fixed thread setting (8000) across targets; target specificity is entirely captured by the receptor PDBQT and the box center/size parameters. Docking scores were parsed from the Vina output and used as the ground-truth values for downstream thresholding, labeling, and benchmarking.
| Target | PDB ID | center_x | center_y | center_z | size_x | size_y | size_z |
|---|---|---|---|---|---|---|---|
| AR | 1t7r | −0.252 | 30.301 | 38.95 | 22.456 | 25.328 | 18.771 |
| ERα | 1err | 67.276 | 33.964 | 76.012 | 21.081 | 26.879 | 28.435 |
| PPARγ | 1nyx | 16.562 | 63.569 | 15.1 | 25.14 | 21.336 | 28.261 |
| CAMKK2 | 2zv2 | 1.076 | −7.19 | −25.943 | 20.517 | 26.708 | 18.8 |
| CDK6 | 5l2s | 22.198 | 38.486 | −8.657 | 24.205 | 29.091 | 29.61 |
| VEGFR2 | 4ag8 | 19.892 | 25.629 | 38.443 | 29.78 | 23.803 | 23.32 |
| ADORA2A | 5mzj | −38.176 | 5.319 | 22.425 | 20.713 | 33.058 | 21.556 |
| TBXA2R | 6iiu | 25.175 | 164.914 | 148.502 | 26.289 | 26.333 | 24.027 |
| AT1R | 4yay | −16.726 | 9.992 | 41.794 | 24.057 | 27.611 | 24.281 |
| Nav1.7 | 5ek0 | −85.815 | −12.83 | −14.228 | 27.526 | 28.223 | 25.905 |
| GLIC | 4f8h | 16.815 | −15.092 | 25.306 | 20.866 | 21.622 | 22.089 |
| GABAA | 6d6t | 119.613 | 169.091 | 154.264 | 24.698 | 22.391 | 21.573 |
000, a validation set of size 50
000, and a test set of size 100
000 molecules are randomly sampled from a molecule pool derived from the Enamine REAL database.26 The validation and test datasets remain fixed throughout the AL pipeline. Each molecule in these datasets is docked against the target of interest using QuickVina2 (the docking oracle), which provides ground-truth docking scores used to generate binary activity labels. The docking scores across all three datasets are converted into binary classification labels using the lowest 1% quantile score from the initial training dataset as the fixed cutoff. As a result, 1% of the training dataset is initially classified as hits (virtual active compounds), while the remaining 99% are considered non-hits (virtual inactive compounds). The MoLFormer-DR model is then trained on molecular embeddings of SMILES sequences extracted from a pre-trained MoLFormer model, alongside their corresponding binary labels. The model learns to predict binary activity labels based on the SMILES representations of input molecules within the training dataset.
From the second iteration onward, the training dataset is iteratively expanded by adding the top 100 molecules according to the BALD acquisition strategy. Additionally, an uncertainty metric is computed across the entire unlabeled pool; inactive molecules with low acquisition scores are discarded to accelerate the process, while the 100 molecules with top scores are sent to the oracle for docking. Finally, after a pre-decided user-defined number of iterations, the molecules predicted as virtual actives by the model are returned to the user. Although DDU operates as a binary classifier, compounds in the unlabeled pool can be meaningfully ranked by their predicted probability of belonging to the active class, as output by the softmax layer. The acquisition function (BALD in the recommended configuration) uses this probability alongside a model uncertainty estimate derived from Monte Carlo dropout to assign an acquisition score to each unlabeled compound. The 100 compounds with the highest acquisition scores are selected for docking at each iteration, providing a principled internal ranking within the binary classification framework.
It is noteworthy that the 100
000-molecule held-out test set used throughout the benchmarking experiments in this work is an optional component of the DDU pipeline, controlled by the test_rand_samples parameter in the configuration file. For prospective screening campaigns where maximizing library coverage is the priority, this parameter can be set to false, allowing all sampled molecules to contribute to the AL pool. The test set is recommended when quantitative model performance estimates are required, such as in benchmarking studies or when reporting screening results for publication.
000 randomly sampled molecules from the screening library, as described in the Evaluation metrics section of the Methods. This test set is held out from the beginning of training in iteration 1 and is never used for model selection or hyperparameter tuning across any of the active learning iterations. The validation set, consisting of 50
000 molecules randomly sampled from the same screening library at the start of iteration 1, is used exclusively for early stopping.
Precision is computed as
Recall is computed as
F1 metric, the harmonic mean of precision and recall, is computed as
The Top N enrichment score is computed as
| Randam(x) ∼ U(0,1) |
| Greedy(x) = û(x) |
| UCB(x) = û(x) + βσ$(x) |
| UNC(x) = σ$(x) |
represents the entropy of the average model prediction given input
and training data Dtrain, and the expectation term
accounts for the average of the entropies of predictions made by individual sampled models drawn from posterior distribution of model parameters 
| Margin(x) = p(y1|x) − p(y2|x) |
Data points with large entropy occur when the model's prediction probabilities are spread across multiple classes, signaling high uncertainty about the correct label.
| Least confidance(x) = 1 − argmaxcp(y = c|x) |
Overall, BALD, margin, entropy and least confidence sampling strategies focus on individual point metrics without ensuring diversity of selected points across the feature space.
All the data produced in this work will be made available on the public repository https://github.com/diamondspark/DDU.
Supplementary information (SI) is available. See DOI: https://doi.org/10.1039/d5sc09599a.
| This journal is © The Royal Society of Chemistry 2026 |