Distribution-preserved sampling (DPS) for smarter machine learning assisted ultra-large-scale virtual screening
Abstract
Ultra-large-scale structure-based virtual screening (SBVS) for identifying novel bioactive compounds poses significant computational challenges. These challenges arise from the size of available chemical libraries, which can contain billions of molecules that require exhaustive docking and scoring, placing prohibitive demands on CPU/GPU resources. Small- and mid-sized laboratories often lack access to the high-performance computing clusters or cloud resources necessary to process such workloads in a timely manner. Furthermore, managing and analyzing the resulting terabytes of docking data requires robust data-handling pipelines and expertise that are not universally accessible. Here, we present a data-driven drug development pipeline that leverages a subset of molecules from a database sharing a common scaffold, reducing the chemical search space by several orders of magnitude. In this case, the common scaffold enabling this reduction is the 2-phenylthiazole moiety, identified through NMR fragment screening. We started with a subset of over 400 000 drug-sized 2-phenylthiazole-containing molecules selected from the ZINC database and trained a random forest regression model on about 1% of this data to predict binding scores for the entire library. For this purpose, we used a distribution-preserving sampling approach based on k-means clustering and binning, and we evaluated its statistical fidelity using the Kolmogorov–Smirnov (KS) statistic, the Wasserstein distance, and the Jensen–Shannon (JS) and Kullback–Leibler (KL) divergences. Our approach preserved the distribution of docking scores, demonstrating the utility of data-driven strategies for scalable virtual screening and establishing a benchmark dataset for machine learning in drug discovery.
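The sampling-and-validation scheme described above can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: it uses synthetic stand-in "docking scores" (the Gaussian parameters, cluster count, and bin count are all illustrative assumptions), draws a ~1% sample proportionally from k-means clusters so the sample tracks the full score distribution, and then computes the four fidelity metrics named in the abstract with scikit-learn and SciPy.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.stats import ks_2samp, wasserstein_distance, entropy

rng = np.random.default_rng(0)
# Hypothetical stand-in for docking scores of a full library (not real data)
scores = rng.normal(loc=-7.0, scale=1.2, size=100_000).reshape(-1, 1)

# Cluster the scores, then sample ~1% proportionally from each cluster
k = 20  # illustrative cluster count
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(scores)
sample_idx = []
for c in range(k):
    members = np.flatnonzero(labels == c)
    n_take = max(1, int(round(0.01 * members.size)))
    sample_idx.extend(rng.choice(members, size=n_take, replace=False))
sample = scores[sample_idx].ravel()
full = scores.ravel()

# Statistical fidelity of the sample vs. the full distribution
ks_stat, _ = ks_2samp(full, sample)
w_dist = wasserstein_distance(full, sample)

# JS and KL divergences over a shared binning of the two distributions
bins = np.histogram_bin_edges(full, bins=50)
p, _ = np.histogram(full, bins=bins, density=True)
q, _ = np.histogram(sample, bins=bins, density=True)
p, q = p + 1e-12, q + 1e-12          # avoid log(0) in empty bins
p, q = p / p.sum(), q / q.sum()
kl = entropy(p, q)                   # KL(p || q)
m = 0.5 * (p + q)
js = 0.5 * entropy(p, m) + 0.5 * entropy(q, m)
```

Small values of all four metrics indicate that the ~1% training sample is distributionally faithful to the full library, which is the property the pipeline relies on when extrapolating predicted binding scores to the remaining ~99% of molecules.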