Open Access Article
Alexander Trachtenberg, Alexander Spelkov and Barak Akabayov*
Department of Chemistry and Data Science Research Center, Ben-Gurion University of the Negev, Beer-Sheva 8410501, Israel. E-mail: akabayov@bgu.ac.il
First published on 27th April 2026
Ultra-large-scale structure-based virtual screening (SBVS) for identifying novel bioactive compounds poses significant computational challenges. These challenges arise from the size of available chemical libraries, which can contain billions of molecules that require exhaustive docking and scoring, placing prohibitive demands on CPU/GPU resources. Small- and mid-sized laboratories often lack access to the high-performance computing clusters or cloud resources necessary to process such workloads in a timely manner. Furthermore, managing and analyzing the resulting terabytes of docking data requires robust data-handling pipelines and expertise that are not universally accessible. Here, we present a data-driven drug development pipeline that leverages a subset of molecules from a database with a common scaffold, reducing the chemical search space by tens of orders of magnitude. In this case, the common scaffold that enables this reduction is the 2-phenylthiazole moiety, identified through NMR fragment screening. We started with a subset of over 400 000 drug-sized 2-phenylthiazole-containing molecules selected from the ZINC database and trained a random forest regression model on about 1% of this data to predict binding scores for the entire library. For this purpose, we used a distribution-preserving sampling approach based on KMeans clustering and binning, and we evaluated its statistical fidelity using KS, Wasserstein, JS, and KL divergence metrics. Our approach preserved the distribution of docking scores, demonstrating the utility of data-driven strategies for scalable virtual screening and establishing a benchmark dataset for machine learning in drug discovery.
Virtual screening complements HTS by enabling in silico exploration of vastly larger, more diverse chemical spaces, thereby enhancing the efficiency and scope of early-stage drug discovery. In virtual screening, a library of small molecules is evaluated in silico to determine the likelihood that one or more will bind to a biological target.5 Virtual screens are categorized into ligand-based virtual screening (LBVS) and structure-based virtual screening (SBVS). In LBVS, screening is guided by the properties of known active compounds and is effective when high-quality activity data is available.6 However, LBVS methods are often biased toward chemical structures similar to those of known ligands and do not provide information about binding poses.7 In contrast, SBVS leverages the 3D structure of the target receptor to predict binding free energies and docking poses. SBVS is particularly valuable for revealing novel scaffolds, since it explores chemical space without the bias of known ligands.8 Linking LBVS with SBVS, therefore, creates a powerful approach that combines the speed and pattern-recognition strengths of LBVS with the structure-level precision of SBVS. Such an integration enhances hit identification by utilizing both ligand features and target structural information, thereby improving accuracy and increasing the likelihood of discovering new, high-affinity binders.7 However, even though the advantages of an integrated approach are clear, a key gap remains in systematically understanding how best to integrate and weight the outputs from LBVS and SBVS, particularly across diverse target classes and chemical spaces, which limits the generalizability and predictive power of current hybrid approaches.
Among the various techniques used in SBVS, molecular docking is the most widely used due to its computational efficiency. While docking provides a rapid estimate of ligand–receptor interactions, its scoring functions are often inaccurate,9 leading to a high rate of false positives,10 particularly in screens involving millions of compounds. To address the accuracy problem, virtual screens are often conducted in multi-stage pipelines, in which an initial rapid docking screen evaluates millions of candidates, followed by the application of increasingly accurate (and computationally expensive) methods, such as molecular mechanics/generalized Born surface area (MM/GBSA) rescoring or receptor flexibility modeling, to the top-ranked compounds.11 Yet as screening libraries grow to include billions of molecules, the initial stages of docking pose a significant computational challenge that demands extensive CPU and GPU resources.12 Even when the necessary hardware is available, various factors such as storage and memory constraints, network bandwidth limitations for transferring large datasets, and the need for efficient parallelization create additional hurdles. Furthermore, managing and analyzing the resulting terabytes of docking data requires robust data-handling pipelines and specialized expertise. Consequently, this process is primarily limited to laboratories with access to multi-million-dollar computational facilities.
A promising addition to the virtual screening toolbox is machine learning (ML), which has been incorporated into SBVS workflows to enhance docking prioritization while lowering computational burden. In ML-enhanced SBVS, a model is trained on docking scores obtained from a subset of compounds and then used to predict scores for the remaining library, substantially reducing the number of explicit docking calculations required.13,14 To ensure that the training subset adequately represents the underlying chemical space, we implemented a distribution-preserving sampling strategy. Conceptually, this approach is an adaptation of classical stratified sampling, designed to maintain the distributions of descriptors and scaffolds rather than introduce a fundamentally new sampling paradigm. What sets our framework apart is not just the sampling principle, but its integration within fragment-based virtual screening (FBVS). By combining distribution-aware subset selection with fragment-informed construction of chemical space, the model is trained on chemically meaningful regions. This synergy improves generalization across scaffold series, reduces bias toward overrepresented chemotypes, and stabilizes predictive performance in ultra-large virtual screening (ULVS) campaigns, where maintaining chemical diversity is critical for reliable extrapolation. Notably, ULVS methods achieve high hit accuracy—identifying most true binders—while reducing docking costs by several orders of magnitude.
In two representative examples published at the beginning of this decade, application of Bayesian active learning to dock only ∼2.4% of a 100-million compound library enabled the identification of ∼95% of top hits,15 and deep learning was used to prioritize 1000 high-quality leads from over a billion ZINC15 molecules without docking them all.13,14 Recently, a reduction of over 1000-fold in docking calculations was reported, with minimal loss in hit recovery.13 The above studies demonstrate that ML-guided virtual screening not only accelerates the screening process but also enables access to chemical spaces that would otherwise be computationally intractable. However, in many cases, docking remained supported solely by computational validation, underscoring the need for experimental confirmation.
Sparsity—where only a small fraction of the vast chemical space contains active compounds—poses a major challenge in virtual screening. To overcome this, we implemented a two-step fragment-based virtual screening (FBVS)16 strategy designed to systematically narrow the chemical search space and enhance the efficiency of identifying promising candidates (Fig. 1). In this strategy, we applied virtual filtration to obtain a dataset of molecules with a common scaffold, previously identified by NMR fragment screening, and grew the fragment molecules (∼100 Da) into drug-sized molecules by computational optimization. FBVS thus identifies small, low-complexity molecular fragments that bind to a particular target and then expands them into more complex scaffolds.16 While FBVS is a promising and rational strategy for identifying small-molecule binders, especially for challenging targets such as RNA, it remains underutilized due to its dependence on expert-driven fragment expansion and the limited availability of fragment libraries.17 In our case, a 2-phenylthiazole fragment hit, identified by NMR fragment screening, was expanded via FBVS into potent inhibitors of the bacterial ribosome.18,19 In the current in-house study, the hit molecule, 2-phenylthiazole, served as the basis for creating a focused virtual library of over 400 000 derivatives, all sharing the same common molecular scaffold. Docking of the 400 000 molecules (the scaffold-based dataset) to the hairpin 91 RNA target was then performed using the same parameters and docking protocol we had used previously.19 Importantly, hairpin 91 is a 29-nucleotide RNA hairpin within the functional core of the ribosome, the peptidyl transferase center (PTC, Fig. 1). Hairpin 91 is relatively small and has a simple 3D structure with defined secondary and tertiary interactions, making it an excellent model for evaluating ligand binding to folded RNA architectures.19
Fig. 1 Generating a subset of molecules with a common molecular scaffold, 2-phenylthiazole, as a strategy to decrease the chemical search space. (a) The scaffold molecule, 2-phenylthiazole, that binds to hairpin 91 was identified by NMR (T2 relaxation) screening from a commercial fragment library.19 (b) The estimated search space for all possible combinations of 20 substituents across seven positions yielded (20⁷) ∼1.3 billion new drug-sized molecules. Filtering by molecular weight constraints (e.g., ≤350 Da) and incorporating additional complexity such as stereochemistry (e.g., three chiral centers) reduced the filtered search space to approximately 2.5 million molecules. (c) This targeted approach reduced the chemical search space from a fragment to a drug-sized molecule by approximately (10³)⁷ = 10²¹, i.e., 21 orders of magnitude, enabling us to focus on a manageable, chemically relevant subset of candidates. (d) Docking was performed using AutoDock-GPU on a workstation with an NVIDIA RTX 3080 GPU. The compound library, comprising approximately 400 000 compounds, was divided into batches of around 30 000 for concurrent processing. Each batch took around 3.3 days, but due to resource contention, the total time extended to about one week, with final times varying based on GPU utilization and thermal conditions. In contrast, distribution-preserved sampling was performed on a minute time scale for both training and testing on the same computational setup.
This scaffold-based dataset (virtual library) provides a unique model that can be leveraged to demonstrate how a chemical constraint (i.e., reduced structural diversity) facilitates efficient ML-based docking-score prediction. Our working hypothesis was that, due to the lower variance in molecular features across this scaffold-constrained chemical space, an ML model using only a small training subset—potentially as little as 1% of the entire dataset—could achieve high predictive accuracy. This approach offers not only computational advantages (reducing model training time by over 100-fold compared to large subsets) but also a practical solution for research groups lacking access to a high-performance computing infrastructure or cloud resources. Moreover, this approach provides a strategy that involves docking only a small subset of the library, with the scores for the remaining library predicted accurately using the trained model, thereby facilitating rapid hit prioritization. In addition to presenting our findings on the benefits of scaffold-based sampling, we introduce the underlying dataset as a potential benchmark for further exploring efficient docking-score prediction in realistic low-resource scenarios.
The scaffold-based dataset comprises over 400 000 small molecules, all sharing a common phenylthiazole core scaffold, sourced from the ZINC15 database using the filtering function embedded in the ZINC15 website (https://zinc15.docking.org/). These compounds were docked in 3D, using AutoDock-GPU,20 against a 29-nucleotide RNA hairpin target within the 23S rRNA of the crystal structure of the Staphylococcus aureus ribosome (PDB ID: 4WCE).21 Importantly, hairpin 91 represents an experimentally resolved RNA structural motif within the functional core of the ribosome, the peptidyl transferase center (PTC).21 Hairpin 91 is relatively small and has a simple 3D structure with defined secondary and tertiary interactions, making it an excellent model for evaluating ligand binding to folded RNA architectures.19 Such motifs are increasingly being investigated as therapeutic targets because structured RNAs can form ligand-binding pockets analogous to those in proteins. All ligand structures were prepared and stored in Tripos Mol2 format for compatibility with the docking software. Following docking and subsequent outlier removal (e.g., extremely high docking scores), the final dataset comprised 413 109 molecules with their corresponding docking scores. All structures and docking results for this dataset are available in a public GitHub repository (https://github.com/csbarak/POC).
The DOCKSTRING benchmark dataset comprises approximately 260 000 drug-like molecules docked against 58 pharmaceutically relevant protein targets (e.g., various kinases and nuclear receptors), making it a versatile resource for benchmarking virtual screening workflows. For our analysis, we selected a target from the enzyme class, 11β-hydroxysteroid dehydrogenase type 1 (HSD11B1), and extracted all ligand–docking score pairs for that target. HSD11B1 is a well-characterized protein target widely used in virtual screening benchmarks due to its pharmacological relevance and extensive docking reference data. To reduce heterogeneity and minimize the impact of extreme docking scores, we applied interquartile range (IQR)-based filtering. Molecules with docking scores below Q1 − 2 × IQR or above Q3 + 2.5 × IQR were excluded, along with entries lacking docking values. Although IQR-based filtering removes statistical outliers to stabilize distributional comparisons, such filtering is not always applied in real-world screening pipelines, where extreme values may represent rare but genuine high-affinity hits. Here, filtering was used solely to standardize statistical analysis between datasets rather than to simulate a production virtual screening workflow. Following this filtering step, 255 085 molecules were retained, yielding a narrower, more representative dataset for distribution-preserving subset selection and predictive modeling.
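The IQR-based filtering rule described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the column name `docking_score` and the use of pandas are assumptions, while the asymmetric fence multipliers (2 below Q1, 2.5 above Q3) follow the thresholds stated in the text.

```python
import pandas as pd

def iqr_filter(df: pd.DataFrame, score_col: str = "docking_score",
               lower_k: float = 2.0, upper_k: float = 2.5) -> pd.DataFrame:
    """Drop rows with missing scores or scores outside the IQR fences.

    Rows are excluded if score < Q1 - lower_k*IQR or score > Q3 + upper_k*IQR.
    """
    df = df.dropna(subset=[score_col])          # remove entries lacking docking values
    q1, q3 = df[score_col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - lower_k * iqr, q3 + upper_k * iqr
    return df[(df[score_col] >= lower) & (df[score_col] <= upper)]
```

Applied to a frame of docking results, the function returns the retained subset; asymmetric multipliers tolerate slightly more spread on the high-score (weak-binding) side, as in the text.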
The inclusion of both RNA and protein targets enables assessment of whether subset-selection strategies behave consistently across structurally distinct biomolecular classes.
Subsets were generated from both the scaffold-based dataset of over 400 000 molecules and the DOCKSTRING benchmark dataset, for the following fractions of the full data: 1%, 5%, 10%, 25%, 50%, and 75%. Two sampling strategies were employed to create these subsets: random sampling and distribution-preserving sampling. For the random sampling, we used the train_test_split method from the scikit-learn library to obtain independent and identically distributed (i.i.d.) samples for each fraction. This method draws samples from the same probability distribution as the full dataset and provides an unbiased random subset of the desired size. Each subset was selected without replacement from the full dataset. For the distribution-preserving sampling, we employed a hybrid sampling approach to ensure that each subset preserved both the feature space distribution and the target (docking score) distribution of the original dataset. In other words, this “preserved” sampling aimed to keep the subset's characteristics consistent with those of the full dataset.
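The hybrid distribution-preserving sampling can be sketched as stratified selection over the cross-product of feature clusters and target-value bins. This is a minimal sketch of the idea, not the exact implementation: the cluster count, bin count, and per-stratum allocation rule are assumptions, since the text does not specify them.

```python
import numpy as np
from sklearn.cluster import KMeans

def preserved_sample(X, y, fraction, n_clusters=10, n_bins=10, seed=0):
    """Distribution-preserving subsample: stratify on (feature cluster, score bin).

    Returns sorted indices of a subset of roughly `fraction` of the data in
    which both the feature-space clusters and the docking-score bins are
    represented proportionally.
    """
    rng = np.random.default_rng(seed)
    # Cluster the feature space to capture its distribution.
    clusters = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(X)
    # Bin the target (docking score) at its quantiles.
    edges = np.quantile(y, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(y, edges)
    # Each (cluster, bin) pair is one stratum; sample proportionally within it.
    strata = clusters * n_bins + bins
    idx = []
    for s in np.unique(strata):
        members = np.flatnonzero(strata == s)
        k = max(1, round(fraction * members.size))  # keep at least one per stratum
        idx.append(rng.choice(members, size=k, replace=False))
    return np.sort(np.concatenate(idx))
```

Because every non-empty stratum contributes at least one molecule, very small fractions may slightly exceed the nominal size while still covering the joint feature/score distribution.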
We used the following two-stage evaluation setup, designed to simulate a fivefold cross-validation-like analysis for each dataset fraction: (1) Random subsets: for each fraction f (1%, 5%,…, 75%), we used the five independent random training subsets (described above) as five different training sets. Each model was trained on a subset of size f and tested on the complementary portion (the remaining 99%, 95%, etc., of the data not seen during training). We recorded the performance metrics for each of these five runs. (2) Preserved subset: for each fraction f, we trained a model on a single preserved subset of size f and tested it on the remaining data. (We used the same full test set as in the random sampling case for that fraction to enable fair comparison.) Model performance was evaluated using root mean squared error (RMSE) and the coefficient of determination (R2) on the test set, which provided measures of the magnitude of the prediction error and of the explained variance, respectively. We also recorded the training time (in minutes) for each model as an indicator of computational efficiency relative to the size of the training set. For the random sampling strategy, the five runs for each fraction allowed us to compute an average RMSE and R2, as well as their variability (standard deviation), to account for the uncertainty due to random subset selection. We report these average performance metrics and compare them with the single-run performance of the distribution-preserving subset of the same size.
This experimental design ensured a fair, size-matched comparison between the two sampling methods. By always performing the evaluation on the full complement of data not used for training, we maintained a consistent evaluation set for a given fraction f. Comparing the accuracy of the random forest technique when trained on a preserved subset vs. on multiple random subsets revealed whether distribution-preserving sampling leads to better predictive performance (i.e., lower RMSE or higher R2) for a given training set size. In addition, tracking the model performance as the training fraction increased provided insight into how quickly each method approached the performance that would have been obtained by using the full dataset. The distributional fidelity of the subsets was evaluated along with their practical utility in modeling tasks. By examining the distributional divergence and model predictive performance side by side, we could determine whether preserving the data's intrinsic structure translated into tangible improvements in learning outcomes, compared to traditional random sampling.
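The random-subset arm of this evaluation can be sketched as follows. It is an illustrative sketch under stated assumptions: the random forest is left at scikit-learn defaults, and the function name and return format are hypothetical.

```python
import time
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

def evaluate_fraction(X, y, fraction, n_runs=5, seed=0):
    """Train on n_runs random subsets of a given fraction; test on the complement.

    Returns a list of (rmse, r2, training_minutes) tuples, one per run.
    """
    results = []
    for run in range(n_runs):
        # train_size=fraction puts the complement (1 - fraction) into the test set.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=fraction, random_state=seed + run)
        model = RandomForestRegressor(n_estimators=100, random_state=0, n_jobs=-1)
        t0 = time.perf_counter()
        model.fit(X_tr, y_tr)
        train_min = (time.perf_counter() - t0) / 60.0
        pred = model.predict(X_te)
        rmse = mean_squared_error(y_te, pred) ** 0.5
        results.append((rmse, r2_score(y_te, pred), train_min))
    return results
```

Averaging the five tuples per fraction yields the mean RMSE, mean R2, and their standard deviations reported in the comparison; the preserved-subset arm is a single run with the same test split convention.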
The bimodal distribution observed for the scaffold dataset may reflect two dominant ligand–target interaction regimes. In structured RNA systems, ligands can bind either within compact groove-like pockets or along exposed surface regions, resulting in distinct score distributions. Because in the scaffold dataset all ligand molecules share the same core scaffold, their binding poses may cluster into a limited number of interaction modes, which can manifest as separated score peaks. In contrast, the benchmark dataset contains structurally diverse molecules lacking a common scaffold, leading to a broader and more continuous distribution of docking scores.
For the KS statistic, the preserved subsets consistently showed markedly narrower distributions of divergence values than their randomly sampled counterparts across all subset sizes, even for the smallest fraction (1%). This observation held for both datasets and indicated that preserved subsets more reliably retain the cumulative structure of the original feature distributions.
For the Wasserstein distance, distributions of preserved subsets had similar widths to their randomly sampled counterparts across all subset sizes (except for the 25% subset in the scaffold-based dataset and the 1% and 10% subsets in the benchmark dataset, where randomly sampled subsets had narrower distributions than preserved). Nevertheless, the scale of the divergence values was notably different across the two datasets: the y-axis range for the benchmark dataset was approximately four times larger than that of the scaffold-based dataset. This wider spread in the benchmark data indicated greater heterogeneity and less stability in feature distribution shifts across subsets, likely due to the chemical diversity of the molecules and the higher overall descriptor variance.
For JS divergence, the divergence distributions were overall quite similar between the two datasets. Nonetheless, for the scaffold-based dataset, random subsets tended to exhibit slightly more variability, especially at lower subset fractions. Starting from the 50% fraction, preserved subsets began to show a modest narrowing of the distribution, reinforcing the sampling method's effect at larger data sizes.
For the KL divergence, a more dataset-specific trend was evident. In the scaffold-based dataset, random subsets displayed somewhat broader divergence distributions than preserved subsets across most fractions. Notably, for fractions of 50% or higher, the preserved subsets became distinctly narrower. In contrast, for the benchmark dataset, the difference between the preserved and random subsets was less pronounced, possibly due to its greater chemical diversity and baseline variability in descriptor distributions.
Taken together, these results indicate that the distribution-preserving sampling method, based on KMeans clustering and target-value binning, effectively retained the statistical structure of the full dataset, particularly for the scaffold-based dataset. The narrower distributions of divergence values across multiple metrics and subset sizes provide strong evidence that the method worked as intended. This robustness in distributional preservation is a critical foundation for downstream ML tasks, as it ensures that the training data remains representative of the original chemical and biological space.
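The four divergence metrics compared above can be computed per descriptor with SciPy. This sketch is illustrative: the bin count, the epsilon smoothing, and the KL direction (subset relative to full) are assumptions not fixed by the text.

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance, entropy
from scipy.spatial.distance import jensenshannon

def feature_divergences(full, subset, n_bins=50, eps=1e-10):
    """Divergence of one descriptor's subset distribution from the full dataset.

    KS and Wasserstein work directly on the samples; JS and KL are computed
    on shared-edge histograms with epsilon smoothing to avoid empty bins.
    """
    edges = np.histogram_bin_edges(full, bins=n_bins)
    p = np.histogram(full, bins=edges)[0] + eps    # full-data histogram
    q = np.histogram(subset, bins=edges)[0] + eps  # subset histogram
    return {
        "KS": ks_2samp(full, subset).statistic,
        "Wasserstein": wasserstein_distance(full, subset),
        "JS": jensenshannon(p, q) ** 2,  # jensenshannon returns the distance, so square it
        "KL": entropy(q, p),             # KL(subset || full); entropy() normalizes inputs
    }
```

Running this over every RDKit descriptor column for a given subset yields the per-feature divergence distributions whose widths are compared between the random and preserved strategies.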
In the case of JS divergence, preserved scaffold-based subsets showed greater divergence than benchmark subsets at lower fractions (1–25%, except for 5%, which was slightly negative), but this trend was reversed at higher fractions (50–75%). For random subsets, the scaffold-based subsets diverged to a greater extent for nearly all fractions, as indicated by positive values. For preserved subsets, small fluctuations around zero, with both positive and negative values, were observed across subset sizes, indicating that the relative divergence between datasets depended on the fraction rather than following a single monotonic trend. For KL divergence, scaffold-based subsets initially diverged to a greater extent than benchmark subsets for small to moderate subset sizes (1–25%), but benchmark divergence exceeded scaffold-based divergence for larger fractions (50–75%). To summarize, mean differences between the two datasets were metric-dependent and typically small in magnitude, suggesting that neither dataset consistently exhibited greater divergence across all conditions. Importantly, these plots summarize mean divergence values; as shown in Fig. 4, feature-level distributions reveal that preserved sampling generally produces narrower divergence distributions across descriptors, supporting the idea that scaffold-constrained chemical space enables more stable and representative subsampling, especially at lower fractions. This observation supports the broader premise of the study that for chemically consistent datasets, such as the scaffold-based dataset, smaller training subsets can still maintain important distributional properties, potentially resulting in more accurate and computationally efficient modeling.
In practical virtual screening terms, RMSE reflects how accurately predicted docking scores approximate computed scores used for ranking compounds. Because screening decisions typically depend on relative ranking rather than absolute score values, modest RMSE values can still be sufficient for prioritizing top candidates, provided ranking consistency is preserved. These findings demonstrate that, within a chemically constrained scaffold-defined space, a small and well-sampled subset can be sufficient for accurate prediction of docking scores. In comparison, the benchmark dataset, which exhibits much higher structural diversity, required substantially larger training sets to achieve comparable performance.
A key advantage of using smaller subsets is the reduction in computational cost, the results of which are depicted in Table S1. For example, training a model on only 1% of the scaffold-based dataset required approximately 1.5 minutes, while training on 75% took over 105 minutes, representing a more than 70-fold speedup. Similarly, for the benchmark dataset, training time increased from just over 1 minute at 1% to more than 115 minutes at 75%, resulting in a more than 100-fold difference. This substantial improvement in computational efficiency underscores the value of intelligently selected small subsets in large-scale virtual screening pipelines.
Notably, model performance was generally comparable between random and preserved subsets, especially for the scaffold-based dataset. This finding can be attributed to the robustness of random forests, which are ensemble models that benefit from decorrelated decision trees and tend to perform well even on randomly drawn data, particularly when the underlying distribution is relatively homogeneous, as is the case in the scaffold-constrained dataset. Nonetheless, distribution-preserving sampling still offers advantages in maintaining diversity and reducing feature divergence, especially for very small fractions.
Our pipeline centered on a focused library of over 400 000 molecules, all derived from a 2-phenylthiazole scaffold identified through FBVS and experimentally validated. The focused library was compared to a widely used benchmark dataset from DOCKSTRING, which was generated using the same docking software family (AutoDock Vina and AutoDock-GPU) and shared similar docking score ranges. Notably, the scaffold-based dataset exhibited markedly lower variance in RDKit molecular descriptors (Fig. 3). This reduction in feature variance is a direct consequence of the common scaffold architecture, which constrains chemical diversity and reduces the degrees of freedom in the feature space. These properties suggest that the scaffold-based dataset is particularly well suited for ML tasks, where lower variance may lead to simpler and more generalizable models.
To assess the representativeness of training subsets selected from the full dataset, we compared random sampling against a distribution-preserving strategy based on KMeans clustering and binning of the docking score target. Feature-wise divergence metrics (KS, Wasserstein, JS, KL) showed that the preserved subsets generally produced narrower distributions of divergence values across most subset sizes (Fig. 4), indicating that this method effectively maintains the original feature distribution. This conservation of the original feature distribution was particularly evident for smaller subset fractions (e.g., 1–10%), which are the most relevant subsets for reducing computational effort. A comparison of mean divergence values between the two datasets (Fig. 5) suggested that benchmark subsets diverged more than scaffold-based subsets at small subset sizes, especially under random sampling, although differences between strategies and datasets were often minimal and occasionally reversed. These findings suggest that the scaffold-based dataset may be more amenable to accurate model training on minimal data, while also highlighting that preserved sampling does not universally outperform random sampling under all conditions and dataset configurations.
Model performance analysis (Fig. 6) further confirmed the utility of the scaffold-based dataset. A random forest regressor trained on just 1% of the scaffold-based dataset achieved an RMSE of ∼0.47 and R2 of ∼0.82, with a training time of only 1.5 min. Notably, training on 75% of the data improved the RMSE only marginally (∼0.42) but required 70 times more training time (∼105 min).
In virtual screening workflows, predictive models are primarily used to prioritize top-scoring candidates rather than reproduce exact docking values. Therefore, acceptable prediction error depends on whether highly ranked compounds remain near the top of the list. While enrichment metrics were not explicitly evaluated here, the observed predictive accuracy and preserved score distributions suggest that the approach is suitable for shortlist generation prior to downstream validation. Importantly, these findings demonstrate the practical advantage of scaffold-constrained libraries, namely, that accurate predictions can be obtained rapidly and with minimal resources. Interestingly, the performance of random and distribution-preserving splits was comparable for the two datasets. This finding may be attributed to the robustness of the random forest model, which samples features and instances during training and is less sensitive to skewed distributions than other models.
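Ranking consistency of the kind discussed above could be checked with a simple top-k recovery measure. Since enrichment metrics were not evaluated in this study, the following is only a hedged sketch of how such a check might look; the function name is hypothetical, and it assumes the usual docking convention that more negative scores indicate stronger predicted binding.

```python
import numpy as np

def topk_recall(y_docked, y_predicted, k=1000):
    """Fraction of the true top-k compounds (most negative docking scores)
    that also appear in the model-predicted top-k list."""
    true_top = set(np.argsort(y_docked)[:k].tolist())
    pred_top = set(np.argsort(y_predicted)[:k].tolist())
    return len(true_top & pred_top) / k
```

A recall near 1.0 would mean the subset-trained model preserves the shortlist even if its absolute RMSE is modest, which is the property that matters for hit prioritization.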
Notably, meaningful predictive accuracy was achieved without hyperparameter optimization for the random forest model, indicating that the observed performance gains arise primarily from dataset design and subset representativeness rather than model-specific tuning. This supports the general applicability of the proposed pipeline, since it does not depend on computationally intensive model selection or parameter searches. Specifically, our pipeline allows researchers to conduct docking calculations on a small subset of a scaffold-based library and then to use the resulting model to predict scores across the entire library. This advantage is particularly impactful for research groups lacking access to high-performance computing resources, as it offers a scalable alternative to exhaustive docking campaigns. Furthermore, the scaffold dataset introduced in this study may serve as a new community benchmark for evaluating sampling strategies and predictive models in scaffold-focused virtual screening. Overall, our findings highlight the synergistic benefits of combining FBVS-derived scaffold libraries with ML and distribution-preserving sampling. This approach offers a cost-effective pathway for exploring ultra-large chemical libraries, enhancing hit discovery, and democratizing access to structure-based virtual screening.
However, it must be noted that this study evaluated only a single scaffold (2-phenylthiazole), and therefore generalization to other scaffolds cannot be assumed. The magnitude of performance gains may depend on scaffold rigidity, substituent flexibility, or intrinsic descriptor variance, which can differ substantially across chemical classes. Because the scaffold was experimentally identified and generating large-scale additional scaffold-restricted datasets is resource-intensive, systematic validation across multiple scaffolds was beyond the scope of this work. Future studies should therefore examine diverse scaffolds using the same methodology to determine which structural families benefit most from subset-based learning and distribution-preserving sampling strategies. This can eventually lead to prospective applications, in which top-ranked compounds predicted by the model are prioritized for experimental validation via synthesis or purchase, followed by biochemical or biophysical assays. Such validation would provide a direct test of whether subset-trained models can successfully recover true binders while substantially reducing computational cost.
Supplementary information (SI) is available. See DOI: https://doi.org/10.1039/d6ra00279j.
This journal is © The Royal Society of Chemistry 2026