PoseBusters: AI-based docking methods fail to generate physically valid poses or generalise to novel sequences

The last few years have seen the development of numerous deep learning-based protein–ligand docking methods. They offer huge promise in terms of speed and accuracy. However, despite claims of state-of-the-art performance in terms of crystallographic root-mean-square deviation (RMSD), upon closer inspection, it has become apparent that they often produce physically implausible molecular structures. It is therefore not sufficient to evaluate these methods solely by RMSD to a native binding mode. It is vital, particularly for deep learning-based methods, that they are also evaluated on steric and energetic criteria. We present PoseBusters, a Python package that performs a series of standard quality checks using the well-established cheminformatics toolkit RDKit. The PoseBusters test suite validates chemical and geometric consistency of a ligand including its stereochemistry, and the physical plausibility of intra- and intermolecular measurements such as the planarity of aromatic rings, standard bond lengths, and protein–ligand clashes. Only methods that both pass these checks and predict native-like binding modes should be classed as having “state-of-the-art” performance. We use PoseBusters to compare five deep learning-based docking methods (DeepDock, DiffDock, EquiBind, TankBind, and Uni-Mol) and two well-established standard docking methods (AutoDock Vina and CCDC Gold) with and without an additional post-prediction energy minimisation step using a molecular mechanics force field. We show that both in terms of physical plausibility and the ability to generalise to examples that are distinct from the training data, no deep learning-based method yet outperforms classical docking tools. In addition, we find that molecular mechanics force fields contain docking-relevant physics missing from deep-learning methods. 
PoseBusters allows practitioners to assess docking and molecular generation methods and may inspire new inductive biases still required to improve deep learning-based methods, which will help drive the development of more accurate and more realistic predictions.


Introduction
Docking, an essential step in structure-based drug discovery 1 , is the task of predicting the predominant binding modes of a protein-ligand complex given an experimentally solved or computationally modelled protein structure and a ligand structure 2 . The predicted complexes are often used in a virtual screening workflow to help select molecules from a large library of possible candidates 3 , or directly by medicinal chemists to understand the binding mode and to decide whether a small molecule is a suitable drug candidate 4 .
Docking methods are designed with the understanding that binding is enabled by interactions between target and ligand structures, but because these interactions are complex, methods tend to strike a balance between fast calculation and accuracy 5 .
Deep learning (DL) promises to disrupt the dominant design principle of classical docking software, and DL-based docking methods promise to unlock fast and accurate virtual screening for drug discovery. To this end, a handful of different DL-based docking methods have already been proposed [6][7][8][9][10] .
Classical non-DL-based docking methods include within their search and scoring functions terms that help ensure chemical consistency and physical plausibility, for example limiting the degrees of movement in the ligand to only the rotatable bonds in the ligand and including penalties if the protein and ligand clash 11,12 . Some current DL-based docking methods, as we will show, still lack such key "inductive biases", resulting in the creation of unrealistic poses despite obtaining root-mean-squared deviation (RMSD) values from the experimental binding mode that are less than the widely-used 2 Å threshold 13 . To assess such docking methods, an independent test suite is necessary to check the chemical consistency and physical plausibility alongside established metrics, such as the binding mode RMSD. Such a test suite would help the field to identify missing inductive biases required to improve DL-based docking methods, driving the development of more accurate and realistic docking predictions.
The problem of assessing the physical plausibility of docking predictions is akin to the structure validation of ligand data in the Protein Data Bank (PDB) 14,15 . Structure validation assesses the agreement of the ligand's bond lengths and angles with those observed in related chemical structures and the presence of steric clashes both within the ligand and between it and its surroundings 15 . While these tests were designed for users to select those ligand crystal structures which are likely to be correct 15 , docking methods are evaluated on their ability to recover crystal structures, so their output should pass the same physical plausibility tests.
Physical plausibility checks are also part of some workflows for conformation generation 16,17 . Friedrich et al. use geometry checks performed by NAOMI 18 , which measures, like the PDB tests mentioned above, the deviation from known optimal values for bond lengths and bond angles, and also tests for divergences from the planarity of aromatic rings 17 .
In addition to physical checks, chemical checks are also needed 19 . Chemical checks proposed for checking PDB structures include the identification of mislabelled stereo assignment, inconsistent bonding patterns, missing functional groups, and unlikely ionisation states 19 . The problem of checking chemical plausibility has also come up in de novo molecule generation, where Brown et al. proposed a test suite including checks for the chemical validity of any proposed molecule 20 . For docking, the focus is less on stability and synthetic accessibility of a molecular structure, as it is hoped that these have been tested prior to attempting docking, but more on chemical consistency and physical realism of the predicted bound conformation. Some comparisons of docking methods have included additional metrics based on volume overlap 21 or protein-ligand interactions 22 to supplement pose accuracy-based metrics such as RMSD of atomic positions and run time measurements, but the majority of comparisons of docking methods are predominantly based on binding mode RMSD 13,[23][24][25] .
The current standard practice of comparing docking methods based on RMSD-based metrics alone also extends to the introduction papers of recent new methods. The five DL-based docking methods we test in this paper [6][7][8][9][10] all claim better performance than standard docking methods, but these claims rest entirely on RMSD. None of these methods test their outputs for physical plausibility.
In this paper we present PoseBusters, a test suite that is designed to identify implausible conformations and ligand poses. We used PoseBusters to evaluate the predicted ligand poses generated by the five DL-based docking methods (DeepDock 6 , DiffDock 7 , EquiBind 8 , TankBind 9 , and Uni-Mol 10 ) and two standard non-DL-based docking methods (AutoDock Vina 12 and Gold 26 ). These poses were generated by re-docking the cognate ligands of the 85 protein-ligand crystal complexes in the Astex Diverse set 27 and 428 ligands of the protein-ligand crystal complexes in the PoseBusters Benchmark set, a new set of complexes released from 2021 onwards, into their cognate receptor crystal structures. On the commonly-used Astex Diverse set, the DL-based dock-

Methods
Five DL-based and two classical docking methods were used to re-dock known ligands into their respective proteins and the predicted ligand poses were evaluated with the PoseBusters test suite. The following section describes the docking methods, the data sets, and the PoseBusters test suite for checking physicochemical consistency and structural plausibility of the generated poses.

Docking methods
The selected five DL-based docking methods 6-10 cover a wide range of DL-based approaches for pose prediction. Table 1 lists the methods and their publications. In order to examine the ability of standard non-DL-based methods to predict accurate chemically and physically valid poses, we also included the well-established docking methods AutoDock Vina 28 and Gold 29 . The five DL-based docking methods can be summarised as follows. Full details of each can be found in their respective references. DeepDock 6 learns a statistical potential based on the distance likelihood between ligand heavy atoms and points of the mesh of the surface of the binding pocket. DiffDock 7 uses equivariant graph neural networks in a diffusion process for blind docking. EquiBind 8 applies equivariant graph neural networks for blind docking. TankBind 9 is a blind docking method that uses a trigonometry-aware neural network for docking in each pocket predicted by a binding pocket prediction method. Uni-Mol 10 carries out docking with SE3-equivariant transformers. All five DL-based docking methods are trained on subsets of the PDBbind General Set 30 as detailed in Table 2. DeepDock is trained on
The docking protocols that were used to generate predictions with each method and the software versions used are given in section S1 of the Supplementary Information. Table 3 lists the search space definitions that we used for each method. DeepDock and Uni-Mol require the definition of a binding site while DiffDock, EquiBind, and TankBind are 'blind' docking methods that search over the entire protein. We used the default search spaces for the DL-based methods but larger than default search spaces for AutoDock Vina and Gold such that they are more comparable with the blind docking DL-based methods. SI Figure S1 shows the search spaces for one example protein-ligand complex.

The PoseBusters test suite
The PoseBusters test suite is organised into three groups of tests. The first checks chemical validity and contains tests for the chemical validity and consistency relative to the input. The second group checks intramolecular properties and tests for the ligand geometry and the ligand conformation's energy computed using the universal force field (UFF) 32 . The third group considers intermolecular interactions and checks for protein-ligand and ligand-cofactor clashes. Descriptions of all the tests PoseBusters performs in the three sections are listed in Table 4. Molecule poses which pass all tests in PoseBusters are 'PB-valid'.
For evaluating docking predictions, PoseBusters requires three input files: an SDF file containing the re-docked ligands, an SDF file containing the true ligand(s), and a PDB file containing the protein with any cofactors. The three files are loaded into RDKit molecule objects with the sanitisation option turned off.
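As a minimal sketch of how the 'PB-valid' label can be combined with pose accuracy, the snippet below treats a pose as PB-valid only if every test passes and then counts the predictions that are both PB-valid and within an RMSD threshold. The check names and the tuple layout are hypothetical and are not the output format of the PoseBusters package.

```python
def is_pb_valid(test_results):
    """Return True only if every individual check passed."""
    return all(test_results.values())


def share_valid_and_accurate(predictions, rmsd_threshold=2.0):
    """Share of predictions that are PB-valid and within the RMSD threshold.

    predictions: list of (rmsd_in_angstrom, test_results) tuples.
    """
    hits = sum(
        1
        for rmsd, tests in predictions
        if rmsd <= rmsd_threshold and is_pb_valid(tests)
    )
    return hits / len(predictions)


# Hypothetical check names and values, for illustration only
predictions = [
    (1.2, {"sanitisation": True, "bond_lengths": True, "protein_clash": True}),
    (0.9, {"sanitisation": True, "bond_lengths": False, "protein_clash": True}),
    (3.5, {"sanitisation": True, "bond_lengths": True, "protein_clash": True}),
    (1.8, {"sanitisation": True, "bond_lengths": True, "protein_clash": True}),
]
share = share_valid_and_accurate(predictions)
```

In this toy input the second pose is accurate but fails a check and the third is valid but inaccurate, so only two of the four predictions count.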

Chemical validity and consistency
The first test in PoseBusters checks whether the ligand passes RDKit's sanitisation. RDKit's sanitisation processes information on the valency, aromaticity, radicals, conjugation, hybridization, chirality tags, and protonation to check whether a molecule can be represented as an octet-complete Lewis dot structure 33 . Passing RDKit's sanitisation is a commonly-used test for chemical validity in cheminformatics, for example in de novo molecular generation 20 .

Table 3 (fragment). Method: Gold. Search space: Sphere of radius

The next test in PoseBusters checks for docking-relevant chemical consistency between the predicted and the true ligands by generating 'standard InChI' strings 34 for the input and output ligands after removing isotopic information and neutralising charges by adding or removing hydrogens where possible. InChI is the de facto standard for molecular comparison 35 , and the 'standard InChI' strings generated include the layers for the molecular formula (/), molecular bonds (/c), hydrogens (/h), net charge (/q), protons (/p), tetrahedral chirality (/t), and double bond stereochemistry (/b). Standardisation of the ligand's protonation and charge state is needed because the stereochemistry layer is dependent on the hydrogen (/h), net charge (/q) and proton (/p) layers. These can unexpectedly change during docking even though most docking software considers the charge distribution and protonation state of a ligand as fixed 12,36 . The normalisation protocol also removes the stereochemistry information of double bonds in primary ketimines, which only depends on the hydrogen atom's ambiguous location.

Intramolecular validity
The first set of physical plausibility tests in the PoseBusters test suite validates bond lengths, bond angles, and internal distances between non-covalently bound pairs of atoms in the docked ligand against the corresponding limits in the distance bounds matrix obtained from RDKit's Distance Geometry module. To pass the tests, all molecular measurements must lie within the user-specified tolerances. The tolerance used throughout this manuscript is 25 % for bond lengths and bond angles and 30 % for non-covalently bound pairs of atoms. For example, a bond that is shorter than 75 % of the Distance Geometry lower bound for its length is treated as anomalous. This was selected as all but one of the crystal ligands in the Astex Diverse set and all of those in the

File loads
The input molecule can be loaded into a molecule object by RDKit.

Molecular formula
The molecular formula of the input molecule is the same as that of the true molecule.

Bonds
The bonds in the input molecule are the same as in the true molecule.

Tetrahedral chirality
The specified tetrahedral chirality in the input molecule is the same as in the true molecule.

Double bond stereochemistry
The specified double bond stereochemistry in the input molecule is the same as in the true molecule.

Bond lengths
The bond lengths in the input molecule are within 0.75 of the lower and 1.25 of the upper bounds determined by distance geometry.

Bond angles
The angles in the input molecule are within 0.75 of the lower and 1.25 of the upper bounds determined by distance geometry.

Planar aromatic rings
All atoms in aromatic rings with 5 or 6 members are within 0.25 Å of the closest shared plane.

Planar double bonds
The two carbons of aliphatic carbon-carbon double bonds and their four neighbours are within 0.25 Å of the closest shared plane.

Internal steric clash
The interatomic distance between pairs of non-covalently bound atoms is above 0.7 of the lower bound determined by distance geometry.

Energy ratio
The calculated energy of the input molecule is no more than 100 times the average energy of an ensemble of 50 conformations generated for the input molecule. The energy is calculated using the UFF 32 in RDKit and the conformations are generated with ETKDGv3 followed by force field relaxation using the UFF with up to 200 iterations.

Minimum protein-ligand distance
The distance between protein-ligand atom pairs is larger than 0.75 times the sum of the pair's van der Waals radii.

Minimum distance to organic cofactors
The distance between ligand and organic cofactor atoms is larger than 0.75 times the sum of the pair's van der Waals radii.

Minimum distance to inorganic cofactors
The distance between ligand and inorganic cofactor atoms is larger than 0.75 times the sum of the pair's covalent radii.

Volume overlap with protein
The share of ligand volume that intersects with the protein is less than 7.5 %. The volumes are defined by the van der Waals radii around the heavy atoms scaled by 0.8.

Volume overlap with organic cofactors
The share of ligand volume that intersects with organic cofactors is less than 7.5 %. The volumes are defined by the van der Waals radii around the heavy atoms scaled by 0.8.

Volume overlap with inorganic cofactors
The share of ligand volume that intersects with inorganic cofactors is less than 7.5 %. The volumes are defined by the van der Waals radii around the heavy atoms scaled by 0.5.
The PoseBusters test for flatness checks that groups of atoms lie in a plane by calculating the closest plane to the atoms and checking that all atoms are within a user-defined distance from this plane. This test is performed for 5- and 6-membered aromatic rings and non-ring non-aromatic carbon-carbon double bonds. The chosen threshold of 0.25 Å admits all Astex Diverse and PoseBusters Benchmark set crystal structures by a wide margin and, as with all other thresholds, can be adjusted by the user.
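The flatness check can be illustrated with a short standard-library sketch. It fits the least-squares plane through a set of atom coordinates and reports whether every atom lies within the threshold. The function names are hypothetical, and the power iteration used to find the plane normal is a choice made here to avoid dependencies; it is not necessarily how PoseBusters computes the plane.

```python
import math


def best_fit_plane(points, iterations=200):
    """Return (centroid, unit normal) of the least-squares plane through 3D points.

    Assumes the points are not all identical (otherwise the normal is undefined).
    """
    n = len(points)
    centroid = [sum(p[i] for p in points) / n for i in range(3)]
    centred = [[p[i] - centroid[i] for i in range(3)] for p in points]
    # 3x3 scatter matrix of the centred coordinates
    cov = [[sum(q[i] * q[j] for q in centred) for j in range(3)] for i in range(3)]
    trace = cov[0][0] + cov[1][1] + cov[2][2]
    # trace*I - cov has its largest eigenvalue along the direction of
    # smallest variance, which is the plane normal
    shifted = [
        [(trace if i == j else 0.0) - cov[i][j] for j in range(3)] for i in range(3)
    ]
    v = [0.3, 0.5, 0.8]  # arbitrary start vector for the power iteration
    for _ in range(iterations):
        w = [sum(shifted[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return centroid, v


def max_distance_to_plane(points):
    """Largest atom-to-plane distance in the same units as the coordinates."""
    centroid, normal = best_fit_plane(points)
    return max(
        abs(sum((p[i] - centroid[i]) * normal[i] for i in range(3))) for p in points
    )


def is_flat(points, threshold=0.25):
    """True if all atoms lie within `threshold` (Å) of the best-fit plane."""
    return max_distance_to_plane(points) <= threshold
```

A regular hexagon with all atoms at the same height passes, while moving one ring atom 1 Å out of the ring plane makes the check fail.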
The final test for intramolecular physicochemical plausibility carried out by PoseBusters is an energy calculation to detect unlikely conformations. Our metric for this is the ratio of the energy of the docked ligand conformation to the mean of the energies of a set of 50 generated unconstrained conformations as in Wills et al. 37 . The conformations are generated using RDKit's ETKDGv3 conformation generator 38 followed by a force field relaxation using the UFF 32 and up to 200 iterations. The test suite rejects conformations for which this ratio is larger than a user-specified threshold. Wills et al. set a ratio of 7 based on the value where 95 % of the crystal ligands in the PDBbind data set are considered plausible 37 . We selected a less strict ratio of 100 where only one structure each from the Astex Diverse and PoseBusters Benchmark set is rejected.
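The ratio test itself is simple arithmetic once the energies are available. The sketch below assumes the UFF energies of the docked pose and of the generated conformations have already been computed elsewhere and are positive; the function names are hypothetical.

```python
def energy_ratio(pose_energy, ensemble_energies):
    """Energy of the docked pose divided by the mean ensemble energy."""
    mean_energy = sum(ensemble_energies) / len(ensemble_energies)
    return pose_energy / mean_energy


def passes_energy_check(pose_energy, ensemble_energies, max_ratio=100.0):
    """True if the pose is no more than `max_ratio` times the ensemble mean."""
    return energy_ratio(pose_energy, ensemble_energies) <= max_ratio


# Made-up energies in arbitrary units, for illustration only
ensemble = [10.0, 20.0, 30.0]
ratio = energy_ratio(1500.0, ensemble)
```

With these made-up numbers the ratio is 75, which passes at the threshold of 100 used here but would fail at the stricter ratio of 7.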

Intermolecular validity
Intermolecular interactions are evaluated by two sets of tests in the PoseBusters test suite. The first set checks the minimum distance between molecules and the second checks the share of overlapping volume. Both sets of tests report on intermolecular interactions of the ligand with three types of molecules: the protein, organic cofactors, and inorganic cofactors.
For the distance-based intermolecular tests PoseBusters calculates the ratio of the pairwise distance between pairs of heavy atoms of two molecules and the sum of the two atoms' van der Waals radii. If this ratio is smaller than a user-defined threshold then the test fails. The default threshold is 0.75 for all pairings. For inorganic cofactor-ligand pairings the covalent radii are used. All crystal structures in the Astex Diverse set and all but one in the PoseBusters Benchmark set pass at this threshold.
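A minimal sketch of the distance-based test follows, assuming heavy atoms are passed as (element, coordinates) pairs. The radii table holds commonly quoted Bondi van der Waals radii for a few elements and is an illustrative assumption; it may differ from the radii PoseBusters takes from RDKit.

```python
import math

# Bondi van der Waals radii in Å for a few heavy elements (illustrative subset)
VDW_RADII = {"C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}


def min_relative_distance(ligand, partner, radii=VDW_RADII):
    """Smallest ratio of interatomic distance to the sum of the two radii.

    ligand, partner: lists of (element, (x, y, z)) heavy atoms.
    """
    return min(
        math.dist(xyz_l, xyz_p) / (radii[el_l] + radii[el_p])
        for el_l, xyz_l in ligand
        for el_p, xyz_p in partner
    )


def passes_distance_check(ligand, partner, threshold=0.75, radii=VDW_RADII):
    """The test fails if the smallest ratio is below the threshold."""
    return min_relative_distance(ligand, partner, radii) >= threshold
```

For inorganic cofactors the same function could be called with a table of covalent radii passed as `radii`.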
For the second set of intermolecular checks, PoseBusters calculates the share of the van der Waals volume of the heavy atoms of the ligand that overlaps with the van der Waals volume of the heavy atoms of the protein using RDKit's ShapeTverskyIndex function. The tests have a configurable scaling factor for the volume-defining van der Waals radii and a threshold that defines how much overlap constitutes a clash. A threshold is necessary because many crystal structures already contain clashes. For example, Verdonk et al. found that 81 out of 305 selected high-quality protein-ligand complexes from the PDB contain steric clashes 26 . The overlap threshold is 7.5 % for all molecule pairings and the scaling factor is 0.8 for protein-ligand and organic cofactor-ligand pairings and 0.5 for inorganic cofactor-ligand pairings.
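To make the quantity concrete, the sketch below estimates the overlapping share of the ligand volume by sampling a regular grid. This grid estimate is a stand-in chosen here for illustration and is not RDKit's ShapeTverskyIndex; the function name, the grid step, and the input layout are assumptions.

```python
import math


def overlap_share(ligand, other, scale=0.8, step=0.25):
    """Estimate the share of the ligand volume that lies inside `other`.

    ligand, other: lists of ((x, y, z), vdw_radius) heavy atoms.
    scale: factor applied to every radius; step: grid spacing in Å.
    """

    def inside(point, spheres):
        return any(math.dist(point, c) <= scale * r for c, r in spheres)

    # Bounding box of the scaled ligand spheres
    lo = [min(c[i] - scale * r for c, r in ligand) for i in range(3)]
    hi = [max(c[i] + scale * r for c, r in ligand) for i in range(3)]
    counts = [int((hi[i] - lo[i]) / step) + 1 for i in range(3)]

    in_ligand = in_both = 0
    for ix in range(counts[0]):
        for iy in range(counts[1]):
            for iz in range(counts[2]):
                p = (lo[0] + ix * step, lo[1] + iy * step, lo[2] + iz * step)
                if inside(p, ligand):
                    in_ligand += 1
                    if inside(p, other):
                        in_both += 1
    return in_both / in_ligand


def passes_overlap_check(ligand, other, scale=0.8, max_share=0.075):
    """True if less than `max_share` of the ligand volume overlaps."""
    return overlap_share(ligand, other, scale) < max_share
```

A finer grid step gives a better estimate at a cubic increase in cost, which is one reason a production implementation would use a dedicated shape routine instead.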

Quality of fit
PoseBusters calculates the minimum heavy-atom symmetry-aware root-mean-square deviation (RMSD) between the predicted ligand binding mode and the closest crystallographic ligand. Coverage, a metric often used for testing docking methods, is the share of predictions that are within a user-adjustable threshold, which by default is 2 Å RMSD. This value is arbitrary but commonly-used and recommended for regular-size ligands 13 .
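The two quantities can be sketched in plain Python. The symmetry handling below assumes that the caller supplies the list of equivalent atom orderings (for example from a substructure match); enumerating those orderings is not shown, and the function names are hypothetical.

```python
import math


def rmsd(pred, ref):
    """Root-mean-square deviation between two equally ordered coordinate lists."""
    squared = sum(math.dist(p, r) ** 2 for p, r in zip(pred, ref))
    return math.sqrt(squared / len(ref))


def symmetry_aware_rmsd(pred, ref, mappings):
    """Minimum RMSD over the supplied atom index permutations of `pred`."""
    return min(rmsd([pred[i] for i in mapping], ref) for mapping in mappings)


def coverage(rmsds, threshold=2.0):
    """Share of predictions with RMSD at or below the threshold (Å)."""
    return sum(1 for value in rmsds if value <= threshold) / len(rmsds)
```

Without the symmetry mappings, a pose that swaps two chemically equivalent atoms would be penalised even though it describes the same binding mode.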

Sequence identity
In this paper, sequence identity between two amino acid chains is the number of exact residue matches after sequence alignment divided by the number of residues of the query sequence. The sequence alignment used is the Smith-Waterman algorithm 39 implemented in Biopython 40 using an open gap score of −11 and an extension gap score of −1 and the BLOSUM62 substitution matrix. Unknown amino acid residues are counted as mismatches.
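The identity calculation can be sketched as follows, assuming the local alignment has already been computed and is available as two gapped strings of equal length. The alignment step itself is not shown, and the function name and the use of 'X' for unknown residues are assumptions made for illustration.

```python
def sequence_identity(aligned_query, aligned_target, query_length):
    """Exact residue matches in the alignment divided by the query length.

    aligned_query, aligned_target: gapped strings of equal length ('-' = gap).
    query_length: number of residues in the full, ungapped query sequence.
    Unknown residues ('X') never count as matches.
    """
    matches = sum(
        1
        for q, t in zip(aligned_query, aligned_target)
        if q == t and q not in "-X"
    )
    return matches / query_length
```

Because the denominator is the full query length rather than the alignment length, a short but perfect local alignment still yields a low identity.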

Molecular mechanics energy minimisation
Post-docking energy minimisation of the ligand structure in the binding pocket was performed using the AMBER ff14sb force field 41 and the Sage small molecule force field 42 in OpenMM 43 . The protein files were prepared using PDBfixer 43 and all protein atom positions were fixed in space, only allowing updates to the ligand atom positions. Minimisation was performed until energy convergence within 0.01 kJ mol⁻¹.

Astex Diverse set
The Astex Diverse set 27 published in 2007 is a set of hand-picked, relevant, diverse, and high-quality protein-ligand complexes from the PDB 14 . The complexes were downloaded from the PDB as MMTF files 44 and PyMOL 45 was used to remove solvents and all occurrences of the ligand of interest from the complexes before saving the proteins with the cofactors in PDB files and the ligands in SDF files.

PoseBusters Benchmark set
The PoseBusters Benchmark set is a new set of carefully-selected publicly-available crystal complexes from the PDB. It is a diverse set of recent high-quality protein-ligand complexes which contain drug-like molecules. It only contains complexes released since 2021 and therefore does not contain any complexes present in the PDBbind General Set v2020 used to train many of the methods. Table S2 lists the steps used to select the 428 unique proteins and 428 unique ligands in the PoseBusters Benchmark set. The complexes were downloaded from the PDB as MMTF files and PyMOL was used to remove solvents and all occurrences of the ligand of interest before saving the proteins with the cofactors in PDB files and the ligands in SDF files.

Results
The following section presents the results of applying the PoseBusters test suite to the poses produced by five DL-based docking methods and two standard non-DL-based docking methods when re-docking the 85 ligands of the Astex Diverse set and the 428 ligands of the PoseBusters Benchmark set into the receptors' crystal structures. The RMSD criterion alone (striped green bars in Figure 1) gives the impression that DiffDock (72 %) performs better than TankBind (59 %), Gold (67 %), AutoDock Vina (58 %) and Uni-Mol (45 %). However, when we look closer, accepting only ligand binding modes that are physically sensible, i.e., those predictions that pass all PoseBusters tests and are therefore PB-valid (solid green bars in Figure 1), many of the apparently impressive DL predictions are removed. The best three methods when considering RMSD and physical plausibility are Gold (64 %), AutoDock Vina (56 %), and DiffDock (47 %) followed by Uni-Mol (12 %), DeepDock (11 %) and TankBind (5.9 %). DiffDock is therefore the only DL-based method that has comparable performance to the standard methods on the Astex Diverse set when considering physical plausibility of the predicted poses.

Results on the Astex Diverse set
All five DL-based docking methods struggle with physical plausibility, but even the poses produced by the classical methods Gold and AutoDock Vina do not always pass all the checks. Figure 2 shows a waterfall plot that indicates how many predicted binding modes fail each test. The waterfall plots for the remaining methods are shown in SI Figure S5. The DL-based methods fail on different tests. TankBind habitually overlooks stereochemistry, Uni-Mol very often fails to predict valid bond lengths, and EquiBind tends to produce protein-ligand clashes. The classical methods Gold and AutoDock Vina pass most tests but also generated a few protein-ligand clashes. Figure 3 shows examples of poses generated by the methods illustrating various failure modes.
The results on the Astex Diverse set suggest that despite what the RMSD ⩽ 2 Å criterion would indicate, no DL-based method outperforms classical docking methods when the physical plausibility of the ligand binding mode is taken into account.However, DiffDock in particular is capable of making a large number of useful predictions.

Results on the PoseBusters Benchmark set
The results of the seven docking methods (AutoDock Vina, Gold, DeepDock, DiffDock, EquiBind, TankBind, Uni-Mol) on the
Figure 4 shows the results of the docking methods on the PoseBusters Benchmark set but stratified by the target protein receptor's maximum sequence identity with the proteins in the PDBbind 2020 General Set 30 . As the DL-based methods were all trained on subsets of the PDBbind 2020 General Set, this roughly quantifies how different the test set protein targets are from those that the methods were trained on. We bin the test cases into three categories: low [0, 30 %], medium (30 %, 90 %], and high (90 %, 100 %] maximum percentage sequence identity. Without considering physical plausibility (striped bars), the classical methods appear to perform as well on the three protein similarity bins while the DL-based methods perform worse on the proteins with lower sequence identity. This suggests that the DL-based methods are overfitting to the protein targets in their training sets.
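The three bins can be written as a small helper. The interval boundaries follow the text, with 30 % falling in the low bin and 90 % in the medium bin; the function name is hypothetical.

```python
def identity_bin(max_identity_percent):
    """Assign a test case to the low, medium, or high sequence identity bin."""
    if not 0 <= max_identity_percent <= 100:
        raise ValueError("sequence identity must be between 0 and 100 %")
    if max_identity_percent <= 30:
        return "low"  # [0, 30 %]
    if max_identity_percent <= 90:
        return "medium"  # (30 %, 90 %]
    return "high"  # (90 %, 100 %]
```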
We also compared the performance of the docking methods on the PoseBusters Benchmark set stratified by whether protein-ligand complexes contain cofactors (SI Figure S3). Here, we loosely define cofactors as non-protein non-ligand compounds such as metal ions, iron-sulfur clusters, and organic small molecules in the crystal complex within 4.0 Å of any ligand heavy atom. About 45 % of protein-ligand complexes in the PoseBusters Benchmark set have a cofactor (SI Figure S2). The classical methods perform slightly better when a cofactor is present while the DL-based docking methods perform worse on those systems.

Results with post-docking energy minimisation
In order to examine whether the outputs of the DL-based methods can be made physically plausible we performed an additional post-docking energy minimisation of the ligand structures in the binding pocket for the PoseBusters Benchmark set (Figure 5).

Discussion
We present PoseBusters, a test suite designed and built to identify chemically inconsistent and physically implausible ligand poses predicted by protein-ligand docking and molecular generation methods. We show the results of applying the PoseBusters test suite to the output of seven different docking methods: five current DL-based docking methods (DeepDock, DiffDock, EquiBind, TankBind, and Uni-Mol) and two standard methods (AutoDock Vina and Gold). We find that no DL-based docking method yet outperforms standard docking methods when both physical plausibility and binding mode RMSD are taken into account. Our work demonstrates the need for physical plausibility to be taken into account when assessing docking tools because it is possible to perform well on an RMSD-based metric while predicting physically implausible ligand poses (Figure 3). Using the tests in the PoseBusters test suite as an additional criterion when developing DL-based docking methods will help improve methods and drive the development of more accurate and realistic predictions.
In addition, the individual tests in the PoseBusters test suite highlight docking-relevant failure modes. The results show that Uni-Mol for example predicts non-standard bond lengths and TankBind creates internal ligand clashes. The ability to identify such failure modes in predicted ligand poses makes PoseBusters a helpful tool for developers to identify inductive biases that could improve their binding mode prediction methods.
Our results also show that, unlike classical docking methods, DL-based docking methods do not generalise well to novel data.The performance of the DL-based methods on the PoseBusters Benchmark set overall was poor and the subset of the PoseBusters Benchmark set with low sequence identity to PDBbind 2020 revealed that DL-based methods are prone to overfitting to the proteins they were trained on.Our analysis of the targets with sequence identity lower than 30 % to any member of PDBbind General Set v2020 revealed that across all of the DL-based docking methods almost no physically valid poses were generated within the 2 Å threshold.
The most commonly-used train-test approach for building DL-based docking models is time-based, i.e., complexes released before a certain date are used for training and complexes released later for testing. Based on our results, we argue that this is insufficient for testing generalisation to novel targets and that the sequence identity between the proteins in the training and test sets must be reported.
Post-docking energy minimisation of the ligand using force fields can considerably improve the docking poses generated by DL-based methods.However, even with an energy minimisation step, the best DL-based method, DiffDock, does not outperform classical docking methods like Gold and AutoDock Vina.This shows that at least some key aspects of chemistry and physics encoded in force fields are missing from deep learning models.
The PoseBusters test suite provides a new criterion, PB-validity, beyond the traditional "RMSD ⩽ 2 Å" rule to evaluate the predictions of new DL-based methods, and hopefully will help to identify inductive biases needed for the field to improve docking and molecular generation methods, ultimately resulting in more accurate and realistic predictions. PoseBusters is made available as a pip-installable Python package and as open source code under the BSD-3-Clause license at github.com/maabuu/posebusters. The next generation of DL-based docking methods should aim to outperform standard docking tools on both RMSD criteria and in terms of chemical consistency, physical plausibility, and generalisability.

Supplementary information
Martin Buttenschoen, Garrett M. Morris

S1 Docking protocols
The following protocols detail how the seven docking methods were used to re-dock the ligands into the crystal structures of the Astex Diverse set and the PoseBusters Benchmark set. Methods that require an initial ligand conformation were given identical starting conformations generated with RDKit's ETKDGv3 conformer generator 1 followed by an energy minimisation using the universal force field 2 .
Protein preparation Hydrogen atoms were added with reduce and then the PDBQT files were generated with the ADFR prepare_receptor script.
Parameters A bounding box with side-length 25 Å was created around the centroid of the crystal ligand. Vina was used to create 40 poses with an exhaustiveness setting of 32 and the top-ranked pose was selected.

CCDC Gold
Software version CCDC Python API version 3.0.14

Ligand preparation
The initial ligand conformations described above were prepared with LigandPreparation using the default settings which include adding missing hydrogens, removing unknown atoms, and rulebased protonation of the ligand.

Protein preparation
The protein and co-factors were loaded from separate files and all hydrogens were added.
Parameters A settings file was created for each complex using the Docker class default settings. The binding site was defined around the crystal ligand using BindingSiteFromLigand with radius 25 Å.
Ligand preparation The generated starting ligand conformations were used without further processing.
Protein preparation The steps in example notebook Docking_example.ipynb were used to generate protein surface meshes. The function compute_inp_surface generated binding site surfaces using the crystal ligands and the crystal protein structures with a distance threshold of 10 Å.
Parameters The protocol and settings in notebook Docking_example.ipynb in the DeepDock repository were used for docking.

DiffDock
Software version DiffDock commit hash fff8f0b from authors' public code repository https://github.com/gcorso/DiffDock
Ligand preparation The generated starting ligand conformations were used without further processing.
Protein preparation ESM was used to generate FASTA files.
Parameters The protocol in README.md was used to generate ESM embeddings and then to do inference. 40 poses were sampled using 20 inference steps with no noise on the final step. The top-ranked pose was selected.

Ligand preparation
The generated starting ligand conformations were processed with Open Babel and then with RDKit to add missing hydrogens.

Protein preparation
The receptors were processed with Open Babel. Then reduce was used to correct receptor residues and to add hydrogens. Finally, the protein chains that have at least one residue within 10 Å of the crystal ligand were selected.
Parameters The configuration file configs_clean/inference.yml in the repository was used.

TankBind
Software version TankBind commit hash 804e9fc from the authors' public code repository https://github.com/luwei0917/TankBind, p2rank 2.3
Ligand preparation The notebook prediction_example_using_PDB_6hd6.ipynb was used to renumber the ligand atoms and generate features from the ligands.
Protein preparation The notebook prediction_example_using_PDB_6hd6.ipynb was used to generate features from the crystal protein structures.
Parameters The steps in the notebook prediction_example_using_PDB_6hd6.ipynb were used for inference: first running p2rank to generate a list of binding pockets and then docking using the TankBind model.

Uni-Mol

Ligand preparation
The ligands were generated according to the protocol described in the README.md file in the top folder of the Uni-Mol repository.

Protein preparation
The binding pocket residues are those within 6 Å of any crystal ligand heavy atom.
Parameters The default arguments (recycling=3, batch_size=8, dist_threshold=8.0) were used. For more information refer to Table 3 in the main text.
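The pocket definition above amounts to a simple distance filter. The sketch below is a minimal illustration, assuming residues are given as a mapping from a residue identifier to its atom coordinates; the function name and data layout are hypothetical and do not come from the Uni-Mol code. It also assumes that a residue counts as "within" the cutoff when any of its atoms is.

```python
import math
from typing import Dict, List, Sequence, Tuple

Coord = Tuple[float, float, float]


def residues_within(
    residue_atoms: Dict[str, Sequence[Coord]],
    ligand_heavy_atoms: Sequence[Coord],
    cutoff: float = 6.0,
) -> List[str]:
    """Return the IDs of residues that have at least one atom within
    `cutoff` Å of any ligand heavy atom."""
    selected = []
    for res_id, atoms in residue_atoms.items():
        if any(
            math.dist(atom, lig) <= cutoff
            for atom in atoms
            for lig in ligand_heavy_atoms
        ):
            selected.append(res_id)
    return selected
```

The same pattern with a cutoff of 10 Å would describe a chain-level or surface-level selection such as those used for the other methods.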

S4 Data sets
The following sections list the Protein Data Bank (PDB)6 codes and Chemical Component Dictionary (CCD)7 codes for the protein-ligand complexes and the corresponding ligands of interest for the two data sets used.
Astex Diverse data set

Energy too high
Figure S13: Energy ratio distributions. The ratio is the energy of the predicted ligand conformation over the average energy of an ensemble of 50 conformations generated with ETKDGv3. The UFF implemented in RDKit was used. The dashed red line shows the cutoff value of 100. There is only one crystal ligand in each data set with a higher energy ratio than the cutoff, but all docking methods generate multiple high-energy conformations above the cutoff.
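The ratio described in this caption can be written down directly. The following is a minimal sketch that assumes the force field energies (for example UFF energies) have already been computed elsewhere and are positive; the function names are hypothetical and this is not the PoseBusters implementation.

```python
from statistics import mean
from typing import Sequence


def energy_ratio(pose_energy: float, ensemble_energies: Sequence[float]) -> float:
    """Energy of the predicted conformation divided by the mean energy
    of the reference conformer ensemble."""
    if not ensemble_energies:
        raise ValueError("the ensemble must contain at least one energy")
    return pose_energy / mean(ensemble_energies)


def energy_too_high(ratio: float, cutoff: float = 100.0) -> bool:
    """Flag a conformation whose energy ratio exceeds the cutoff."""
    return ratio > cutoff
```

For instance, a pose energy of 500 against an ensemble averaging 50 (same units) gives a ratio of 10, well below the cutoff of 100.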

Figure 1 shows the overall results of the seven docking methods (AutoDock Vina,12 Gold,26 DeepDock,6 DiffDock,7 EquiBind,8 TankBind,9 and Uni-Mol10) on the Astex Diverse set in ocean green. The striped bars show the performance only in terms of RMSD coverage (RMSD ⩽ 2 Å) and the solid bars show the performance after also considering physical plausibility, i.e., only predictions which in addition pass all tests in PoseBusters and are therefore PB-valid. The Astex Diverse set is a well-established and commonly used benchmark for evaluating docking methods. Good performance on this set is expected because the five DL-based methods evaluated here have been trained on most of these complexes. 47 of the 81 complexes in the Astex Diverse set are in the PDBbind 2020 General Set and 67 out of the 81 of the Astex Diverse set proteins have more than 95 % sequence identity with proteins found in the PDBbind 2020 General Set. AutoDock Vina may also perform well on this data set because the linear regression model behind the scoring function was trained on an earlier version of PDBbind,12 which already included most of the Astex Diverse set.
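The two bar heights discussed here (RMSD coverage, and RMSD coverage combined with PB-validity) can be computed as follows. This is a hedged sketch: the input format, a list of (RMSD, passes-all-checks) pairs, and the function name are assumptions made for illustration.

```python
from typing import Iterable, Tuple


def success_rates(
    predictions: Iterable[Tuple[float, bool]],
    rmsd_cutoff: float = 2.0,
) -> Tuple[float, float]:
    """Given (rmsd, passes_all_checks) pairs, return the share of
    predictions with RMSD <= cutoff and the share that in addition
    pass all checks (PB-valid)."""
    preds = list(predictions)
    if not preds:
        raise ValueError("no predictions given")
    within = sum(1 for rmsd, _ in preds if rmsd <= rmsd_cutoff)
    within_and_valid = sum(
        1 for rmsd, ok in preds if rmsd <= rmsd_cutoff and ok
    )
    return within / len(preds), within_and_valid / len(preds)
```

The first value corresponds to a striped bar and the second to a solid bar; by construction the second can never exceed the first.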

Fig. 3 Examples of failure modes that PoseBusters is able to detect. Predictions are shown on the left with white carbons and the crystal structures on the right have cyan carbons. Oxygen atoms are red, nitrogen atoms are dark blue, and chlorine atoms are green. Most of the shown predictions have an RMSD within 2 Å but all are physically invalid.

Fig. 4 Comparative performance of docking methods on the PoseBusters Benchmark set stratified by sequence identity relative to the PDBBind General Set v2020. The sequence identity is the maximum sequence identity between all chains in the PoseBusters test protein and all chains in the PDBBind General Set v2020. The striped bars show the share of predictions of each method that have an RMSD within 2 Å and the solid bars show those predictions which in addition pass all PoseBusters tests and are therefore PB-valid. The DL-based methods perform far better on proteins that are similar to those they were trained on.
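The "maximum over all chain pairs" definition in this caption can be illustrated with a short sketch. The pairwise identity function is passed in as a parameter because the alignment tool is not specified here; `naive_identity` is a hypothetical stand-in that compares positions without alignment and is not how identities were computed for the figure.

```python
from typing import Callable, Iterable


def max_sequence_identity(
    test_chains: Iterable[str],
    reference_chains: Iterable[str],
    identity: Callable[[str, str], float],
) -> float:
    """Maximum pairwise identity between any test chain and any
    reference chain."""
    refs = list(reference_chains)
    return max(identity(t, r) for t in test_chains for r in refs)


def naive_identity(a: str, b: str) -> float:
    """Hypothetical stand-in: percentage of matching positions relative
    to the longer sequence, without any alignment."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / max(len(a), len(b))
```

A test protein is then assigned to a stratum according to this single maximum value, so one highly similar chain in the training data is enough to place it in the high-identity group.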

Fig. 5 Comparative performance of docking methods with post-docking energy minimisation of the ligand (while keeping the protein fixed) on the PoseBusters Benchmark set. The striped bars show the share of predictions of each method that have an RMSD within 2 Å of the crystal pose and the solid bars show those predictions which in addition pass all PoseBusters tests and are therefore PB-valid. Post-docking energy minimisation significantly improves the relative physical plausibility of the DL-based methods' predictions. This indicates that force fields contain docking-relevant physics which is missing from DL-based methods.
(a) Gold: Sphere of radius 25 Å centered on the geometric centre of the crystal ligand heavy atoms.
(b) AutoDock Vina: Cube with side length 25 Å centered on the geometric centre of the crystal ligand heavy atoms.
(c) DeepDock: Protein surface mesh nodes within 10 Å of any crystal ligand atom.
(d) Uni-Mol: Protein residues within 6 Å of any crystal ligand heavy atom.

Figure S1: Search spaces of the docking methods illustrated on PDB entry 1G9V for ligand RQ3. The search spaces for the blind docking methods DiffDock, EquiBind, and TankBind are the entire protein crystal structure. For more information refer to Table 3 in the main text.

Figure S3: Comparative performance of docking methods on the PoseBusters Benchmark set stratified by the presence of cofactors. Cofactors are loosely defined as non-protein, non-ligand compounds such as metal ions, iron-sulfur clusters, and organic small molecules that are within 4.0 Å of any ligand heavy atom. The striped bars show the share of predictions of each method that have an RMSD within 2 Å and the solid bars show those predictions which in addition pass all PoseBusters tests and are therefore PB-valid. The classical docking methods perform better on those systems with cofactors present while the DL-based methods perform worse on those systems.
[Waterfall plot panels. Horizontal axis labels, repeated for each panel: Sanitisation fails; Molecular formula not preserved; Bonds not preserved; Tetrahedral chirality changed; Double bond stereochemistry changed; Bond lengths out of bounds; Bond angles out of bounds; Internal steric clash; Deformed aromatic rings; Deformed double bonds; Energy too high; Minimum protein-ligand distance too small; Min. distance to organic cofactors too small; Min. distance to inorganic cofactors too small; Volume overlap with protein; Volume overlap with organic cofactors; Volume overlap with inorganic cofactors; Passing all tests.]

Figure S5: Waterfall plots showing test results for the Astex Diverse data set. The leftmost (dotted) bars show the number of complexes in the test set. The red bars show the number of predictions that fail with each additional test going from left to right. The rightmost (solid) bar indicates the number of predictions that pass all tests. As a reading example, panel (a) shows that out of AutoDock Vina's 85 predictions, 37 are not within 2 Å RMSD and one additional prediction fails the energy ratio check, so that overall 47 ligands have a low RMSD and pass all tests. AutoDock Vina and CCDC Gold pass the most tests.
[Waterfall plot panels. The horizontal axis labels of each panel are the PoseBusters test names, from "Sanitisation fails" through "Passing all tests".]

Figure S6: Waterfall plots showing test results for the PoseBusters Benchmark set. The leftmost (dotted) bar shows the number of complexes in the test set. The red bars show the number of predictions that fail with each additional test going from left to right. The rightmost (solid) bar indicates the number of predictions that pass all tests. Refer to the main article for a description of each test. As a reading example, panel (a) shows that out of AutoDock Vina's 428 predictions, 200 are not within 2 Å RMSD, three clash with the protein, and one clashes with an organic cofactor, leaving 224 predictions with a low RMSD that pass all tests. AutoDock Vina and CCDC Gold pass the most tests.
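The counting behind a waterfall plot, where each test only removes predictions that survived all earlier tests, can be sketched as below. The data layout (one dictionary of boolean test outcomes per prediction) and the test names are assumptions made for this illustration.

```python
from typing import Dict, List, Sequence, Tuple


def waterfall(
    results: Sequence[Dict[str, bool]],
    test_order: Sequence[str],
) -> Tuple[List[Tuple[str, int]], int]:
    """Apply the tests as successive filters.

    Returns the number of predictions newly removed by each test (in
    the given order) and the number of predictions passing every test.
    """
    remaining = list(results)
    failures = []
    for test in test_order:
        passed = [r for r in remaining if r[test]]
        failures.append((test, len(remaining) - len(passed)))
        remaining = passed
    return failures, len(remaining)
```

Applied to the reading example in the caption, 428 predictions with 200, 3, and 1 successive failures leave 428 − 200 − 3 − 1 = 224 predictions that pass all tests. Because the filters are sequential, a prediction failing several tests is counted only at the first one it fails.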

Figure S17 :
Figure S16: Minimum distances between ligand and inorganic cofactors. Distance is the smallest pairwise distance of heavy atoms of the ligand and inorganic cofactors normalized by their sum of covalent radii. The red area highlights the rejection zone below the cutoff of 0.75.
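The normalised distance in this caption can be illustrated with a small sketch. The atom representation and function names are hypothetical, and the covalent radii in the table are illustrative values only; a real check would take radii from a cheminformatics toolkit.

```python
import math
from typing import Dict, Sequence, Tuple

Atom = Tuple[str, Tuple[float, float, float]]  # (element symbol, xyz in Å)

# Illustrative covalent radii in Å (assumed values for this example).
COVALENT_RADII: Dict[str, float] = {"C": 0.76, "N": 0.71, "O": 0.66, "Zn": 1.22}


def min_relative_distance(ligand: Sequence[Atom], cofactor: Sequence[Atom]) -> float:
    """Smallest pairwise heavy-atom distance divided by the sum of the
    two atoms' covalent radii."""
    return min(
        math.dist(xyz_l, xyz_c) / (COVALENT_RADII[el_l] + COVALENT_RADII[el_c])
        for el_l, xyz_l in ligand
        for el_c, xyz_c in cofactor
    )


def too_close(relative_distance: float, cutoff: float = 0.75) -> bool:
    """Flag a ligand whose normalised minimum distance is below the cutoff."""
    return relative_distance < cutoff
```

Normalising by the sum of covalent radii makes one cutoff usable across element pairs of different sizes, which a fixed distance in Å would not.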

Table 1 Selected DL-based docking methods. The selection includes five methodologically different DL-based docking methods published over the last two years.

DiffDock appears to perform best in terms of RMSD alone but when taking physical plausibility into account, Gold and AutoDock Vina perform best. On the PoseBusters Benchmark set, a test set that is harder because it contains only complexes that the DL methods have not been trained on, Gold and AutoDock Vina are the best methods in terms of RMSD alone, when taking physical plausibility into account, and when proteins with novel sequences are considered. The DL-based methods make few valid predictions for the unseen complexes. Overall, we show that no DL-based method yet outperforms standard docking methods when physical plausibility is taken into account.
The PoseBusters test suite will enable DL method developers to better understand the limitations of current methods, ultimately resulting in more accurate and realistic predictions.

Table 2 Data sets used to train the selected five machine learning-based docking methods. All five DL-based methods were trained on subsets of the PDBBind 2020 General Set.

Table 3 Search spaces of the docking methods used.
25 Å centered on the geometric centre of the crystal ligand heavy atoms

Table 4 Description of the checks used in the PoseBusters test suite.

Fig. 1 Comparative performance of the docking methods. The Astex Diverse set (85 cases) was chosen as an easy test set containing many complexes the five DL-based methods were trained on, while the PoseBusters Benchmark set (428 cases) was chosen to be a difficult test set containing complexes none of the methods was trained on. The striped bars show the share of predictions of each method that have an RMSD within 2 Å and the solid bars show the subset that in addition have valid geometries and energies, i.e., pass all PoseBusters tests and are therefore 'PB-Valid'. DiffDock appears to outperform the classical methods on the Astex Diverse set when only binding mode RMSD is considered (striped teal bars). However, when physical plausibility is also considered (solid teal bars) or when presented with the PoseBusters Benchmark set (coral bars), AutoDock Vina and Gold outperform all DL-based methods.

Waterfall plot showing the PoseBusters tests as filters for the TankBind predictions on the Astex Diverse data set. The tests in the PoseBusters test suite are described in Table 4. The leftmost (dotted) bar shows the number of complexes in the test set. The red bars show the number of predictions that fail with each additional test going from left to right. The rightmost (solid) bar indicates the number of predictions that pass all tests, i.e. those that are 'PB-Valid'. For the 85 test cases in the Astex Diverse set, 50 (59 %) predictions have an RMSD within 2 Å and 5 (5.9 %) pass all tests. Figures S5 and S6 in the Supplementary Information show waterfall plots for all methods and both data sets.

Table S1: Selection process of the PDB entries and ligands for the PoseBusters Benchmark set. The filters are based on the PDB metadata, the PDB quality reports, and the PDB structure data. The final PoseBusters Benchmark set consists of 428 unique PDB entries containing 428 unique ligands.
Figure S2: Comparison of the 85 ligands in the Astex Diverse set and the 428 ligands in the PoseBusters Benchmark set in terms of molecular weight, number of heavy atoms, number of rotatable bonds, and number of rings.