Issue 5, 2025

A workflow to create a high-quality protein–ligand binding dataset for training, validation, and prediction tasks

Abstract

Developing scoring functions (SFs) to predict protein–ligand binding energies requires high-quality 3D structures and binding assay data for training and testing their parameters. In this work, we show that one widely used dataset, PDBbind, suffers from several common structural artifacts in both proteins and ligands, which may compromise the accuracy, reliability, and generalizability of the resulting SFs. We have therefore developed a series of algorithms, organized in a semi-automated workflow called HiQBind-WF, that curates non-covalent protein–ligand datasets to fix these problems. We also used this workflow to create an independent dataset, HiQBind, by matching binding free energies from various sources, including BioLiP, Binding MOAD, and BindingDB, with co-crystallized ligand–protein complexes from the PDB. The resulting HiQBind workflow and dataset are designed to ensure reproducibility and to minimize human intervention, and both are open-source to foster transparency in the improvements made to this important resource for the biology and drug discovery communities.
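As a rough illustration of what matching binding data to crystal structures can involve, the sketch below converts a dissociation constant to a standard binding free energy with the textbook relation ΔG° = RT ln(Kd / 1 M) and joins affinity records to structure records on a shared key. It is not the HiQBind-WF code: the function names, the record layout, and the (PDB ID, ligand ID) key are assumptions made for this example, and the entries are made-up placeholders.

```python
import math

# Gas constant in kcal/(mol*K).
R_KCAL = 1.987204e-3


def kd_to_delta_g(kd_molar, temperature_k=298.15):
    """Standard binding free energy (kcal/mol) from Kd in mol/L.

    Uses dG = RT * ln(Kd / 1 M); tighter binders give more negative values.
    """
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return R_KCAL * temperature_k * math.log(kd_molar)


def match_affinities(structures, affinities):
    """Join affinity records to structure records on a shared key.

    Both arguments are dicts keyed by a hypothetical (pdb_id, ligand_id)
    tuple. Affinities with no matching structure are dropped.
    """
    matched = []
    for key, kd_molar in sorted(affinities.items()):
        if key in structures:
            matched.append(
                {
                    "pdb_id": key[0],
                    "ligand_id": key[1],
                    "structure_file": structures[key],
                    "delta_g_kcal_mol": kd_to_delta_g(kd_molar),
                }
            )
    return matched


# Placeholder records, not real database entries.
toy_structures = {("1abc", "LIG"): "1abc_LIG.pdb"}
toy_affinities = {("1abc", "LIG"): 1e-9, ("9zzz", "XYZ"): 1e-6}

toy_matches = match_affinities(toy_structures, toy_affinities)
```

Only the first placeholder affinity has a matching structure, so `toy_matches` contains a single record. A real curation pipeline would also need to reconcile units, assay types (Kd, Ki, IC50), and ligand identifiers across sources.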


Article information

Article type: Paper
Submitted: 05 Nov 2024
Accepted: 25 Mar 2025
First published: 02 Apr 2025
This article is Open Access (Creative Commons BY-NC license)

Digital Discovery, 2025, 4, 1209-1220

Y. Wang, K. Sun, J. Li, X. Guan, O. Zhang, D. Bagni, Y. Zhang, H. A. Carlson and T. Head-Gordon, Digital Discovery, 2025, 4, 1209 DOI: 10.1039/D4DD00357H

This article is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported Licence. You can use material from this article in other publications, without requesting further permission from the RSC, provided that the correct acknowledgement is given and it is not used for commercial purposes.

