Yingze Wang‡
a,
Kunyang Sun‡a,
Jie Li
a,
Xingyi Guan
a,
Oufan Zhanga,
Dorian Bagnia,
Yang Zhangdef,
Heather A. Carlsong and
Teresa Head-Gordon
*abc
aKenneth S. Pitzer Theory Center and Department of Chemistry, University of California, Berkeley, CA 94720, USA
bDepartment of Bioengineering, University of California, Berkeley, CA 94720, USA
cDepartments of Bioengineering and Chemical and Biomolecular Engineering, University of California, Berkeley, CA 94720, USA
dDepartment of Computer Science, School of Computing, National University of Singapore, 117417, Singapore
eCancer Science Institute of Singapore, National University of Singapore, 117599, Singapore
fDepartment of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, 117596, Singapore
gOdyssey Therapeutics Inc., 1350 Highland Dr, Ann Arbor, MI 48108, USA
First published on 2nd April 2025
Development of scoring functions (SFs) used to predict protein–ligand binding energies requires high-quality 3D structures and binding assay data for training and testing their parameters. In this work, we show that one of the widely-used datasets, PDBbind, suffers from several common structural artifacts of both proteins and ligands, which may compromise the accuracy, reliability, and generalizability of the resulting SFs. Therefore, we have developed a series of algorithms organized in a semi-automated workflow, HiQBind-WF, that curates non-covalent protein–ligand datasets to fix these problems. We also used this workflow to create an independent data set, HiQBind, by matching binding free energies from various sources including BioLiP, Binding MOAD and Binding DB with co-crystalized ligand–protein complexes from the PDB. The resulting HiQBind workflow and dataset are designed to ensure reproducibility and to minimize human intervention, while also being open-source to foster transparency in the improvements made to this important resource for the biology and drug discovery communities.
PDBbind has been an invaluable resource to the biomolecular community during its two-decade development, but a significant portion of the PDBbind dataset contains structural errors, statistical anomalies, and a sub-optimal organization of protein–ligand classes that can limit SF training and validation.34,35 These inconsistencies undermines the purpose of the refined set, which is intended to serve as a high-quality benchmark for evaluation of scoring functions and docking methods. Another concern in regards PDBbind is that the data processing procedure is neither open-sourced nor automated, potentially relying on individual groups needing to introduce their own manual intervention that may lead to inconsistencies. Furthermore, the PDBbind data curation process became more problematic in 2021 when PDBbind ceased to be freely available for data curated after 2020, which limits access and hinders the development and validation of new scoring functions (and other additional uses).
Fortunately, other curation efforts have created alternative protein–ligand structural and/or binding datasets that have increased the size and comprehensiveness of available data for drug discovery efforts. BindingDB is a database containing 2.9 million binding measurements spanning 1.3 million compounds for thousands of protein targets, which are curated from the literature and patents.36–38 Binding MOAD is a curated database of 41409 protein–ligand structural complexes, with binding affinity data available for 15
223 (37%) of them; Binding MOAD's curation involved extracting high-quality structures from the PDB and finding associated binding data from publications with the aid of an NLP-based annotation tool.39–42 BioLiP is a large database of over 900
000 biologically-relevant protein–ligand interactions curated from the PDB, and enriched with various functional annotations, including Enzyme Commission numbers, Gene Ontology terms, catalytic sites, and binding affinities from Binding MOAD, BindingDB, as well as manual surveys.43,44 Other related datasets that focuses more on the geometries of proteins and ligands, including PLINDER45 and DockGen,46 contain an expanded set of protein–ligand structural complexes but do not have annotations of binding affinity data. However, in general, these curation efforts have largely focused on increasing the size and comprehensiveness of protein–ligand data, rather than increasing the quality and reliability of the data themselves. Therefore, there is a pressing need for an open-source and systematic workflow to prepare protein–ligand binding datasets with well-defined binding affinity annotations and higher-quality structures in order to foster greater reproducibility, transparency, and accessibility.
In this work, we introduce HiQBind-WF, a workflow of algorithms for data cleaning and structural preparation that creates a curated dataset of high-quality, non-covalent protein–ligand complex structures with binding affinity annotations. This workflow contains several modules: (1) a curating procedure that rejects ligands covalently bonded to proteins, ligands with rarely-occurring elements, and structures containing severe steric clashes; (2) a ligand-fixing module to ensure the correctness of the ligand structure including correct bond order and reasonable protonation states; (3) a protein-fixing module to extract and, when necessary, add missing atoms to all chains involved in the protein–ligand binding; (4) a structure refinement module to simultaneously add hydrogens to both proteins and ligands in their complex state, as opposed to the current practice in PDBbind that completes the hydrogen chemistry for protein and ligand independently. The motivation for adding this hydrogen growth module is that although many SFs only take heavy atoms into consideration, future physics-based SFs could potentially benefit from explicit hydrogens to better model intermolecular interactions such as hydrogen bonding.
We utilized this workflow to optimize PDBbind v2020 and compared the processed structures. Analysis of the structural differences between the same PDB entry demonstrated that HiQBind-WF is able to correct for various observed structural imperfections. Further, to illustrate the applicability of the HiQBind-WF, we created HiQBind, a new dataset with high-quality protein–ligand binding structures and affinities by processing PDB entries included in BioLiP2 and Binding MOAD associated with binding affinities drawn from BindingDB. The HiQBind dataset includes >18000 unique PDB entries and >30
000 protein–ligand complex structures. We also confirmed that HiQBind shares similar properties with existing datasets like PDBbind, demonstrating its feasibility to be used for developing and validating SFs and other structure-based drug-design tools. The HiQBind-WF and HiQBind dataset are provided open-source to foster transparency and sustainability as new data appears, in order to maintain this important resource for the biology and drug discovery communities.
![]() | ||
Fig. 1 Schematic representation of the semi-automated HiQBind-WF to refine protein–ligand binding structures. HiQBind-WF downloads the pdb and mmCIF files from the RCSB PDB,47 followed by splitting each structure into three components—ligand, protein, and additives. A series of filters are then applied to remove covalent binders, ligands with rare elements, very small ligands, and complexes exhibiting steric clashes. Subsequently, the protein structure is fixed by adding missing atoms and residues (ProteinFixer) and the ligand structure is fixed by correcting bond orders, protonation states, and aromaticity (LigandFixer). Finally, the fixed protein and ligand structures are recombined and subjected to a constrained energy minimization to resolve potential unreasonable structures and to refine hydrogen positions. |
We define three classes of ligand(s) for any given protein–ligand complex structure: (1) any residue will be identified as a ligand if its name matches the Chemical Component Dictionary (CCD) code deposited in given reference datasets (PDBbind, BioLiP or Binding MOAD). Ligands identified in this manner are referred to as “small molecules”. (2) Otherwise, chains in the original PDB file that are less than 20 residues but more than one residue as ligands will be selected as ligands. For example, PDBbind entries that contain patterns such as “*-mer,” or symbols like “–“, “&” or “+“, MOAD entries with more than one CCD in the name column of its csv-formatted dataset, and BioLiP entries with Ligand CCD column to be “peptide”, “dna” or “rna”. These ligands are typically polypeptides, oligosaccharides, or oligonucleotides, collectively referred to as “polymers”. (3) For each identified ligand, we label any biopolymer chains within 10 Å as the associated protein structure. Then, for each protein structure, we labeled residues specified by the “HETATM” record in the pdb file within 4 Å as additives, which includes ions, solvents, and co-factors. The additives are saved in pdb format and directly deposited in the database, and the protein and ligand structure are ready to proceed to the next workflow steps.
After the splitting for the ligand categories and their associated protein–ligand complexes, we define an additional set of downselect filters, including some borrowed from the processing protocols of LP-PDBbind.34 The purpose of these filters are to exclude protein–ligand complex structures that specifically can interfere with training of SFs, with eliminations that would meet any of the following criteria:
• Covalent binder filter: excludes ligands covalently bound to the protein, as indicated by the “CONECT” record in the pdb file. When covalently-bound ligands are identified, they are eliminated. This is because covalent binding inherently is different from non-covalent binding which does not involve bond breaking, and thus requires special treatment in any SFs. Those covalent binder entries are included in the ESI† which may be helpful as a separate curation of the data or may be accessible from CovBinderInPDB.48
• Rare element filter: excludes ligands containing elements other than H, C, N, O, F, P, S, Cl, Br, I. For example, Te or Se are infrequently encountered, and their inclusion can make it challenging for SFs to learn key binding features giving data sparsity for these ligands. These ligand entries are also included in the ESI† which may be helpful as a separate curation of the data.
• Small ligand filter: excludes ligands containing less than 4 heavy atoms, which includes small inorganic binders like O2, NH3, CO2, NO2, N3−, which are beyond the scope of common protein–ligand binding studies.
• Steric clashes filter: excludes structures with protein–ligand heavy atom pairs closer than 2 angstroms. Such steric clashes often arise from electron density uncertainties or inaccurate structural reconstruction from electron densities and are not physically feasible non-covalent interactions. Including such structures in SF development could be detrimental, for example leading to an underestimation of the repulsion energy in physics-based SFs. Additionally, the steric clash filter helps to exclude covalent ligands if the covalent bond is not properly represented in the “CONECT” record.
For protein–ligand complexes that pass these filters, two structure-fixing modules are implemented separately for proteins and ligands. In the ProteinFixer module, we first use the sequence information from the mmcif file header to detect missing atoms and residues. Then, for missing residues or missing atoms within an existing residue, PDBFixer49 (version 1.9) is used to add them, except when the missing residues are longer than 10 amino acids or are located at the sequence terminals. Adding missing atoms to protein structure is essential near binding sites because incomplete structures can compromise accurate modeling of binding interactions, and any molecular dynamics or alchemical binding free energy calculation also require complete structures to ensure the correct structural ensemble are sampled during simulations. However, long missing segments or missing terminus residues in crystal structure are often attributable to intrinsically disordered regions (IDR),50 domains that are not expressed in the samples for crystallization, or his-tags introduced in the protein purification process.51 If far enough removed from the ligand binding site(s), we regard it safe to skip modeling these residues explicitly. Hence, we leave these regions in their original form and in themselves do not define a criterion for being discarded in the final dataset. The final step of the protein-fixing module is to add hydrogen atoms at pH = 7.4 with PDBFixer. At this pH, the protonation state assignment of titratable side-chains obeys the following rules: all lysine (LYS) and arginine (ARG) are positively charged and glutamic acid (GLU) and aspartic acid (ASP) are deprotonated. Histidine remains in a neutral form and whether the HID or HIE variant (the hydrogen is added to Nδ or Nε, respectively) is selected will be based on which one forms a better hydrogen bond, which is the default behavior of PDBFixer.
In the LigandFixer module, we first obtain an sdf file for each ligand instance either by downloading from the RCSB PDB (if possible) or converting from the native pdb format with OpenBabel.52 Since explicit atom connections may not be present in the pdb format, the bond orders in this converted sdf file are typically inferred from local atomic geometries and the resulting structure is herein referred to as “inferred structure”. Then, a reference SMILES is obtained, which is used to correct bond orders and aromaticity specifications that could sometimes be mislabeled in the inferred structure. The bond order assignment protocol is implemented as follows: if the inferred and reference structure are isomorphic, a one-to-one atom mapping will be generated by structure matching and then bond orders, atom hybridization and aromatic specifications will be assigned according to the reference. Otherwise, the bond order assignment will come to a failure point, which means that the inferred structure does not share the same number of atoms or bond connectivity as the reference, indicating that there are missing atoms or distorted geometries in the crystal structure. Therefore, such structures will be excluded.
After a correct structure is obtained, protonation states are assigned to the ligand. We acknowledge that it is a non-trivial task to correctly determine protonation states for titratable groups within a ligand at a given pH and many algorithms that use empirical rules, QM/MM calculations or machine learning have been reported.53–55 However, since our workflow is designed for high-throughput processing, we improve the efficiency using a simple set of predefined rules to determine the protonation states by relevant matching functional groups in SMARTS patterns. Acids, nitro groups, thiophenols, azides, and N-oxides are deprotonated. Aliphatic amines and guanidines/imines are protonated, while anilines are not protonated. There are other special considerations that should also be accounted for: amines will not be protonated if the nitrogen is directly bonded with atoms other than H or C. Only one nitrogen atom on diamines and piperaizines will be protonated to avoid two positive charged groups close to each other, which is not favorable at normal biologically-relevant pH. Enols with the motif OC–C
C–OH are deprotonated. The protonation state assignment is implemented by modifying the default behavior of the dimophite_dl package56 which can be found in the Github repository.
One thing that should be noted here is the source of the reference SMILES string. If the ligand is a small molecule with a CCD code or is a polymer with a BIRD (Biologically Interesting Molecule Reference Dictionary47) code, we will query RCSB PDB for its reference SMILES. If the ligand is a polymer consisting only of alpha-amino acids, we will assume it is a simple non-cyclic peptide and generate a SMILES string based on its sequence information and amide-bond formation rules. Apparently, for the latter case, any mismatch between the inferred and reference structure does not mean the inferred structure is wrong – the ligand may just be a cyclic peptide or contain disulfide bonds. However, such structures will also be excluded as we are unable to verify its correctness automatically at this stage. For such cases, human inspection will be inevitable and it's beyond the scope of the workflow. In addition, we found that some of the SMILES strings deposited in RCSB PDB are incorrect such that all the bonds are labeled as single bonds. Most of these errors were caught by a geometric check for sp3/sp2/sp carbons. For these cases, we manually corrected the SMILES according to the original literature and use the corrected one to do the ligand fixing. The list of manually corrected SMILES can be found in the public Github repository. The bond-order assignment, protonation state assignment, and added hydrogens in the ligand-fixing module are all performed with RDKit57 (version 2024.03.4).
The last part of the HiQBind-WF structure preparation is a refinement module in which the fixed ligand structure and protein structure are combined, followed by a constrained energy minimization with a well-established force field. AMBER14SB58 is used for the protein and OpenFF-2.1.0 (ref. 7) together with Gasteiger charges59 are used for the ligand. Coordinate constraints are applied to all atoms that are experimentally resolved, which means only positions of hydrogens (both on the ligand and protein) and atoms added by PDBFixer in the protein-fixing module are allowed to be optimized. We found this physically-based structural optimization is useful to resolve any remaining steric clashes between added atoms introduced by treating the protein and ligand structure separately in the previous structure fixing modules. Additionally, the hydrogen-bonding network between the ligand and protein is also optimized in this process. The constrained energy minimization was performed with OpenMM 8.1.1 (ref. 60) by setting masses of all constrained atoms to zero.
The binding affinity in terms of ΔG is directly related to the dissociation coefficient Kd or Ki through the standard relationship ΔG = RTln(Kd/i).61 However, a large portion of the data in the binding datasets is reported in terms of IC50, which cannot be easily translated to ΔGs due to its dependence on other experimental conditions and inhibition mechanisms.62 Furthermore, the IC50 values for the same protein–ligand complex can vary up to over order of magnitude in different assays,63 and some deposited binding data are not reported as exact values but just ranges. Therefore, the binding affinity data is reorganized into a machine-readable format (csv) with comments as to the form of the experimental binding free energy data: Kd, Ki, IC50 and EC50.
Of the original 19443 unique PDB entries for proteins with ligands in PDBbind v2020, 1330 entries were discarded by the filters and 2452 entries were discarded because they were unable to pass the structure fixing and refinement modules. Our final optimized PDBbind dataset contains 15
661 unique PDB IDs and 27
757 protein–ligand complexes structures. In addition, considering that the original PDBbind general set was further filtered to create a “refined” and “core” set based on structure quality, binding data quality, and redundancy reduction,27 the optimized PDBbind data yields totals of 4969 and 279 entries in the refined and core sets, respectively. Finally, the associated binding affinity data is reorganized into a machine-readable format (csv) with comments as to the form of the experimental binding free energy data: Kd, Ki, IC50 and EC50.
HiQBind-WF also fixes a small but practical problem in PDBbind. PDBbind provides two file formats for the ligand structure, mol2 and sdf. However, among all 19443 entries, 45 mol2 files and 3175 sdf files cannot be processed by RDKit57 (version 2024.03.4), a widely-used open-source cheminformatics tool. This may be due to the fact that these files are prepared by some other software and their sdf specifications are not compatible with RDKit. Examples are undesired aromaticity specification (oxygens tagged as aromatic to represent equivalent atoms in RSO3−, RCOO−, RPO32−) or formal charge specification (nitrogen with 4 explicit valence tagged to be neutral). HiQBind-WF naturally addresses this technical problem because it uses RDKit57 to process ligand structures.
We then applied HiQBind-WF to process all these starting entries, yielding 18160 unique PDB entries. For the 2189 entries that failed to pass HiQBind-WF, 761 of them are discarded by the filters and the remaining 1428 entries are due to the failure of structure fixing and refinement modules (Table S4†). A large portion of the discarded datapoints are “polymers” for which it is hard to verify their structural correctness because of the difficulty in obtaining a reliable reference SMILES string which is more suitable to small molecules. Almost certainly, human inspection and expertise will rescue some of the discarded data, but the design goal here is to automate the corrections with a high throughput procedure as much as possible.
At the same time we retain 32275 protein–ligand complex structures. The reasons behind the increase in the amount of data compared to PDBbind is that we have included multiple protein–ligand complexes from the same RCSB PDB entry because: (1) multiple conformers or stereoisomers can contribute to the binding (Fig. 5a); (2) the same ligand can bind to different protein pockets (Fig. 5b); (3) there is more than one ligand with a reported binding affinity (Fig. 5c); 131 PDB entries are of this reason; and (4) for structures containing homo-multimers, structural fluctuations between chains that share sequence identity are non-negligible.
![]() | ||
Fig. 5 Structural variations and more than one protein–ligand complexes in the same PDB entry. (a) Two stereoisomers (FCA and FCB) contribute to the binding for 1ABF. (b) The same ligand (16F) binds to two different pockets in 3MRV (c) Two different ligands (GDP) and (GNH) bind to the protein in different pockets for 1A4R. (d) Distribution of protein RMSD. (e) Distribution of ligand RMSD. (f) Percentage of rotamer state changes for residues around the ligand binding sites. Bottom row: visualization of changes in rotamer states between chain A (blue) and chain B (yellow) of PDB ID: 3GEP. (g) Structure overlay between two chains and their respective ligands. (h) Structural differences around the free loop regions between two chains visualized as sticks. (i) Rotamer comparisons of all 29 residues that change their states across chains. |
A moderate amount of PDB entries contain multiple records of the same ligand of interest in the deposited structures. The reason lies in the fact that proteins can form various quaternary structures using copies of the same chain, and ligands as binders can interact with the macromolecule at the tertiary or quaternary level. For example, when a ligand binds to a specific chain of a homodimer protein, two PDB entries are present. PDBbind27 usually keeps only one randomly-selected sample of the interacting protein–ligand complex. However, since different chains in PDB are resolved separately using their electron density maps, there are still some level of non-negligible structural variations among different copies, making them valuable data sources for training SFs.
Although the overall protein and ligand RMSD distributions between identical chains of the same entry do not show a great difference (Fig. 5d and e), there is a significant amount of side-chain rotamer state changes observed for different chains as shown in Fig. 5f. Here, following the common practice in the field,74 we used the angle cutoff of 60° to any of the four side-chain torsion angles to define a switch in the rotamer states. To illustrate, the protein chains A and B for PDB entry 3GEP75 in Fig. 5g and h shows that 29 out of 57 residues near the binding site have a change in their side-chain rotameric states, including 12 residues in the free loop area (L101, S103–I113). In particular, the distance between the side chain of D107 and the ligand in chain A (blue) is smaller than 4 Å, compared to chain B where the free loop is further from the ligand. Therefore, including multiple records of protein–ligand interactions with the same PDB entry can be informative and beneficial.
To characterize and validate the HiQBind dataset, Fig. 6 provides the distributions of binding affinities and drug-like properties compared to the PDBbind dataset. It is seen that the new HiQBind dataset shares a very close set of distributions of drug-like properties with PDBbind, especially for the binding affinities in which both datasets cover a large window of approximately 10 log units. We also noticed that HiQBind dataset is a bit more druglike (QED score) with smaller ligands having fewer rotatable bonds and better clog
P/hydrogen-bonding properties. This demonstrates the feasibility of the new HiQBind dataset as a useful resource for future SFs development, benchmarks and other structure-based drug design studies. It is also important to note that, since no standardized method exists for further filtering and data splitting, it is up to the users decide how to perform these additional operations. For example, one might filter out NMR structures and entries with IC50 or EC50 values, as done in the PDBbind refined set,24 or split the data based on time,33,45 sequence similarity,34 ECOD classifications,45,46 or protein–ligand interaction profiles.45 In the ESI,† we also provided analysis upon the time distribution of PDB entries in HiQBind as well as its overlapping with PDBbind v2020 (Fig. S1†). This shall benefit users who want to do time-based splitting or create independent test set for those models that have already been trained on PDBbind v2020.
We have developed an optimization workflow, HiQBind-WF that aims to improve the structural integrity in a semi-automated way and produce high-quality structures with binding affinity annotations. We compared PDBbind v2020 to the structures processed with HiQBind-WF. Differences between the complexes highlighted the strength of our workflow in assigning correct bond orders, protonation states, and protein structure refinements. We also used this workflow to prepare HiQBind, a newly compiled dataset based on BioLiP2 (ref. 44) and Binding MOAD41 that offers high-quality, non-covalent protein–ligand complexes with binding-affinity data. HiQBind provides more than 30000 protein–ligand structures spanning over 18
000 unique PDB entries and is feasible to be deployed in the development of scoring functions or force fields or related activities.
As an open source effort, we believe that HiQBind-WF provides a sustainable framework for continuously updating and refining protein–ligand binding datasets for drug discovery, by meeting scientific goals of ensuring transparency and reproducibility. We also envision that structure-based drug design studies can benefit from the new HiQBind data that has no overlap with PDBbind, and thus reporting evaluation metrics upon this new dataset that will become a common practice as part of future computational modeling efforts.
Footnotes |
† Electronic supplementary information (ESI) available: Details of the data formats and usage notes are provided. See DOI: https://doi.org/10.1039/d4dd00357h |
‡ Authors contributed equally. |
This journal is © The Royal Society of Chemistry 2025 |