Fred P.
Davis
*
Howard Hughes Medical Institute, Janelia Farm Research Campus, 19700 Helix Dr, Ashburn, VA 20147, USA. E-mail: davisf@janelia.hhmi.org; Tel: +1 571 209 4000 x3037
First published on 24th November 2010
Small molecules that modulate protein–protein interactions are of great interest for chemical biology and therapeutics. Here I present a structure-based approach to predict ‘bi-functional’ sites able to bind both small molecule ligands and proteins, in proteins of unknown structure. First, I develop a homology-based annotation method that transfers binding sites of known three-dimensional structure onto protein sequences, predicting residues in ligand and protein binding sites with estimated true positive rates of 98% and 88%, respectively, at 1% false positive rates. Applying this method to the human proteome predicts 8463 proteins with bi-functional residues and correctly recovers the targets of known interaction modulators. Proteins with significantly (p < 0.01) more bi-functional residues than expected were found to be enriched in regulatory and depleted in metabolism functions. Finally, I demonstrate the utility of the method by describing examples of predicted overlap and evidence of their biological and therapeutic relevance. The results suggest that combining the structures of known binding sites with established fold detection algorithms can predict regions of protein–protein interfaces that are amenable to small molecule modulation. Open-source software and the results for several complete proteomes are available at http://pibase.janelia.org/homolobind.
A combination of experimental2 and computational8 methods have been used to identify interaction modulators. For traditional targets, computational approaches for small molecule discovery typically begin with a crystal structure or homology model of the target protein. Next, a target site is identified using either pocket detection algorithms or the known location of an endogenous substrate. Finally, docking algorithms are used to virtually screen a small molecule library and identify candidate ligands. Virtual screening has been widely used to discover small molecule ligands, and recent work suggests it can be complementary to experimental high-throughput screens.9 This overall computational framework has also been applied, with some adaptations, to protein interaction targets.8 For example, the presence of cryptic cavities at protein interfaces has inspired the use of molecular dynamics simulations to sample the conformational space around protein–protein interfaces for transient druggable pockets that are then subjected to virtual screening.10
The identification of druggable sites on interaction targets is particularly challenging for two reasons. First, endogenous substrate binding sites, often used as starting points for traditional targets, are not typically available.3 Second, the flexible nature of protein interfaces can hide cryptic cavities in crystal structures of the target protein complex. Here I present an approach to predict druggable binding sites at protein interfaces, in proteins of unknown structure, by using structural information from homologs.
This approach builds on three related observations. First, proteins often physically sample conformational space in the same direction and magnitude as the conformational variability observed between homologs.11 This observation has been exploited in protein structure modeling and design procedures,12–14 and suggests that binding sites in homologous structures may complement molecular dynamics sampling for identifying cryptic druggable sites. Second, protein homologs often use similar surface regions to interact with their protein interaction partners.15,16 This observation has been useful in predicting binding sites for proteins of unknown function.16,17 Third, identifying ‘bi-functional’ positions that bind both ligands and proteins within families of protein structures recovers the targets of known interaction modulators, and can be used to predict the biological effects of small molecules.18 Here, I extend this approach to proteins of unknown structure, with the aim of predicting druggable interface regions that are suitable for follow-up with higher resolution, but more computationally demanding, methods.
I first describe a method to predict bi-functional sites in protein sequences of unknown structure and benchmark its performance on binding sites of known structure. Next, I use this method to predict bi-functional sites in several complete proteomes and examine their compositional and functional properties. I close by discussing the relevance of the results for small molecule modulation of protein interactions.
![]() | ||
Fig. 1 Overview of the method to predict overlapping ligand and protein binding sites. |
The boundaries and classification of domains in the target protein sequences were obtained from the SUPERFAMILY resource, which uses a hidden Markov model library of SCOP structural domains to annotate complete genomes.23 The ASTRAL alignments, described above, were then used to transfer template binding sites onto the SUPERFAMILY domains in the target protein sequences. The binding sites were transferred at sequence identity thresholds estimated to predict residues with a 1% false positive rate, using a benchmarking strategy described next.
![]() | ||
Fig. 2 The coverage and accuracy of predicted binding residues. Coverage refers to the fraction of known binding residues in each family that align to a template binding site in a homologous protein; accuracy refers to the true and false positive rates in predicting these covered residues. (a) The distributions of binding site residue coverage per domain family is shown for each kind of binding site. (b) The distributions of sequence identity thresholds (per template binding site) estimated to achieve a maximum false positive rate of 1% and (c) the resulting true positive rates in predicting binding residues in each domain family. (d) The distribution of 95% confidence interval widths for true positive rates, estimated using Bayesian bootstrap with 500 replicates.24 |
The accuracy of the method was established by first determining sequence identity thresholds for each template binding site that would achieve a 1% false positive rate, as estimated on a simulated set of negative binding residues (see Materials and methods). A wide distribution of sequence identity thresholds was observed, with an average of 31% for ligand, 31% for peptide, 25% for inter-molecular domain, and 24% for intra-molecular domain binding sites (Fig. 2B). The corresponding true positive rates were then estimated in a family-wide fashion by determining the number of known (and covered) binding residues that passed the sequence identity thresholds determined above to achieve 1% false positive rates (Fig. 2C and 2D). The average true positive rates were estimated to be 98% for ligand, 89% for peptide, 88% for inter-molecular domain, and 91% for intra-molecular domain binding residues. These estimates are in concordance with published benchmarks of homology-transfer procedures.17
Type | # Proteins | # Domains (# Families) | # Residues |
---|---|---|---|
Input data | |||
Complete proteome | 46![]() |
— (—) | 23![]() ![]() |
Annotated domains | 30![]() |
64![]() |
9![]() ![]() |
Predicted binding sites | |||
Peptide | 8091 | 11![]() |
516![]() |
Domain | 20![]() |
42![]() |
2![]() ![]() |
Ligand | 10![]() |
13![]() |
511![]() |
Bi-functional | 8463 | 10![]() |
294![]() |
All binding sites | 22![]() |
45![]() |
2![]() ![]() |
Amino acid | Propensity at ligand-only residues | (95% confidence interval) | Propensity at protein-only residues | (95% confidence interval) | Propensity at bi-functional residues | (95% confidence interval) |
---|---|---|---|---|---|---|
A | 1.006 | (0.99, 1.022) | 0.891 | *(0.884, 0.896) | 0.875 | *(0.861, 0.889) |
C | 0.974 | *(0.949, 1) | 1.147 | *(1.137, 1.157) | 0.753 | *(0.732, 0.77) |
D | 0.878 | *(0.862, 0.896) | 1.044 | *(1.037, 1.051) | 1.111 | *(1.095, 1.128) |
E | 0.675 | *(0.663, 0.69) | 1.085 | *(1.08, 1.092) | 0.971 | *(0.958, 0.986) |
F | 1.311 | *(1.287, 1.334) | 0.912 | *(0.904, 0.918) | 1.093 | *(1.074, 1.11) |
G | 1.392 | *(1.372, 1.411) | 1.049 | *(1.042, 1.054) | 1.235 | *(1.218, 1.25) |
H | 1.058 | *(1.031, 1.084) | 1.121 | *(1.111, 1.131) | 1.003 | (0.982, 1.026) |
I | 1.16 | *(1.139, 1.177) | 0.874 | *(0.867, 0.88) | 1.047 | *(1.032, 1.063) |
K | 0.83 | *(0.815, 0.845) | 1.055 | *(1.049, 1.062) | 1.061 | *(1.047, 1.077) |
L | 1.017 | *(1.004, 1.029) | 0.838 | *(0.834, 0.842) | 0.904 | *(0.894, 0.914) |
M | 1.309 | *(1.277, 1.342) | 0.879 | *(0.871, 0.891) | 1.247 | *(1.218, 1.277) |
N | 0.978 | *(0.956, 0.999) | 1.012 | *(1.002, 1.018) | 0.911 | *(0.891, 0.927) |
P | 0.773 | *(0.757, 0.79) | 1.089 | *(1.082, 1.097) | 0.795 | *(0.781, 0.81) |
Q | 0.697 | *(0.681, 0.715) | 1.064 | *(1.057, 1.072) | 0.814 | *(0.799, 0.83) |
R | 0.899 | *(0.883, 0.917) | 1.094 | *(1.088, 1.102) | 1.128 | *(1.113, 1.145) |
S | 0.98 | *(0.964, 0.996) | 1.027 | *(1.022, 1.033) | 0.903 | *(0.891, 0.916) |
T | 1.054 | *(1.036, 1.073) | 1.082 | *(1.076, 1.09) | 0.977 | *(0.963, 0.992) |
V | 1.018 | *(1.002, 1.033) | 0.894 | *(0.889, 0.899) | 0.98 | *(0.966, 0.994) |
W | 1.081 | *(1.047, 1.116) | 1.096 | *(1.083, 1.111) | 0.958 | *(0.93, 0.989) |
Y | 1.159 | *(1.134, 1.182) | 1.076 | *(1.066, 1.084) | 1.415 | *(1.391, 1.437) |
I next quantified the amino acid residue propensities of the predicted binding sites to facilitate comparison with bi-functional positions of known structure and previously described energetic hot-spots (eqn (2)). The predicted bi-functional residues exhibited a distinct amino acid residue propensity compared to predicted mono-functional residues (Fig. 3A; Table 2). The bi-functional residue propensities are mostly similar to those described previously for bi-functional positions of known structure.18 The most significant differences are that bi-functional positions of known structure exhibited enrichment for tryptophan and histidine, and near background levels of cysteine.
![]() | ||
Fig. 3 The composition, frequency, and functional propensity of bi-functional residues predicted in the human proteome. (a) The residue type propensity (eqn (2)) at residues predicted to bind both ligands and proteins (black; n = 294![]() ![]() ![]() ![]() ![]() ![]() |
The bi-functional residue propensities are also similar in several respects to previously described energetic ‘hot-spots’.5,25 Hot-spot residues have been found to exhibit the following compositional trends: (1) enrichment for tryptophan, arginine and tyrosine, (2) under-representation of leucine, serine, threonine and valine, (3) over-abundance of isoleucine relative to leucine, and (4) preference for aspartate and asparagine over glutamate and glutamine.25 The predicted bi-functional residues exhibit all of these trends except for near-background levels of tryptophan and only slight enrichment for arginine (Fig. 3A).
To quantify the levels of overlap predicted between ligand and protein binding sites, an odds ratio was computed for each protein that considers the number of residues predicted to bind ligands (nl), proteins (np), or both ligands and proteins (nb), as well as the number of solvent-exposed residues (ns):
![]() | (1) |
Most significant overlapping proteins | |||
---|---|---|---|
ID | Protein | Overlap | p-Value (left) |
ENSP00000269577 | Topoisomerase (DNA) IIα | 6.093 | 2.2 × 10−16 |
ENSP00000264998 | Transferrin | 5.130 | 2.2 × 10−16 |
ENSP00000261266 | Protein tyrosine phosphatase, receptor type, B | 4.895 | 2.2 × 10−16 |
ENSP00000359932 | TNNI3 interacting kinase | 4.809 | 2.2 × 10−16 |
ENSP00000264331 | Topoisomerase (DNA) IIβ | 4.501 | 2.2 × 10−16 |
ENSP00000231751 | Lactotransferrin | 4.495 | 2.2 × 10−16 |
ENSP00000370076 | Baculoviral IAP repeat-containing protein 1 | 4.124 | 2.2 × 10−16 |
ENSP00000371935 | ATP-binding cassette, sub-family C (CFTR/MRP), member 5 | 3.681 | 2.2 × 10−16 |
ENSP00000261714 | Bleomycin hydrolase | 3.537 | 2.2 × 10−16 |
ENSP00000319684 | Tensin 2 | 3.497 | 2.2 × 10−16 |
Most significant non-overlapping proteins | |||
---|---|---|---|
ID | Protein | Overlap | p-Value (right) |
ENSP00000223423 | Prostaglandin–endoperoxide synthase 1 | 0.094 | 9.78 × 10−16 |
ENSP00000376187 | Discs, large homolog 1 | 0.000 | 2.771 × 10−15 |
ENSP00000353047 | MAGUK p55 subfamily member 4 | 0.000 | 2.787 × 10−11 |
ENSP00000295550 | Collagen, type VI, α3 | 0.000 | 8.347 × 10−11 |
ENSP00000381234 | Cystathionine-β-synthase | 0.464 | 3.899 × 10−10 |
ENSP00000241052 | Catalase | 0.416 | 1.125 × 10−9 |
ENSP00000361049 | 3′-Phosphoadenosine 5′-phosphosulfate synthase 2 | 0.176 | 6.262 × 10−9 |
ENSP00000376708 | von Willebrand factor A domain containing 2 | 0.000 | 3.026 × 10−8 |
ENSP00000359210 | Dihydropyrimidine dehydrogenase | 0.679 | 3.728 × 10−8 |
ENSP00000367937 | Lysyl-tRNA synthetase | 0.405 | 3.848 × 10−8 |
Function | Propensity of proteins with significantly low bi-functional positions | (95% confidence interval) | Propensity of proteins with significantly high bi-functional positions | (95% confidence interval) |
---|---|---|---|---|
Homo sapiens | (n = 624) | (n = 3516) | ||
Information | 1.187 | (0.929, 1.442) | 0.795 | *(0.689, 0.897) |
Metabolism | 1.528 | *(1.394, 1.658) | 0.445 | *(0.412, 0.479) |
Extracellular processes | 1.075 | *(1.004, 1.154) | 1.184 | *(1.143, 1.227) |
Intracellular processes | 0.381 | *(0.305, 0.465) | 1.002 | (0.932, 1.072) |
Regulation | 0.807 | *(0.722, 0.89) | 1.271 | *(1.219, 1.322) |
General | 0.559 | *(0.489, 0.64) | 1.009 | (0.959, 1.062) |
Other | 3.91 | *(3.302, 4.509) | 0.507 | *(0.415, 0.609) |
Drosophila melanogaster | (n = 338) | (n = 1902) | ||
Information | 1.645 | *(1.201, 2.191) | 0.939 | (0.771, 1.141) |
Metabolism | 1.93 | *(1.764, 2.08) | 0.575 | *(0.527, 0.628) |
Extracellular processes | 0.59 | *(0.41, 0.787) | 0.693 | *(0.605, 0.788) |
Intracellular processes | 0.142 | *(0.077, 0.215) | 1.238 | *(1.152, 1.341) |
Regulation | 0.778 | *(0.648, 0.905) | 1.456 | *(1.365, 1.553) |
General | 0.903 | (0.776, 1.047) | 1.087 | *(1.012, 1.168) |
Other | 0.417 | *(0.154, 0.729) | 0.551 | *(0.401, 0.725) |
Caenorhabditis elegans | (n = 303) | (n = 1678) | ||
Information | 1.345 | (0.913, 1.852) | 0.661 | *(0.507, 0.833) |
Metabolism | 1.916 | *(1.74, 2.109) | 0.482 | *(0.436, 0.532) |
Extracellular processes | 1.521 | *(1.28, 1.771) | 1.082 | (0.972, 1.19) |
Intracellular processes | 0.364 | *(0.224, 0.503) | 1.085 | (0.963, 1.212) |
Regulation | 0.464 | *(0.365, 0.558) | 1.438 | *(1.358, 1.521) |
General | 0.658 | *(0.513, 0.796) | 1.022 | (0.937, 1.113) |
Other | 0.129 | *(0, 0.341) | 0.74 | *(0.545, 0.992) |
Saccharomyces cerevisiae | (n = 124) | (n = 493) | ||
Information | 1.118 | (0.667, 1.643) | 0.794 | (0.599, 1.033) |
Metabolism | 1.672 | *(1.497, 1.855) | 0.563 | *(0.493, 0.644) |
Extracellular processes | 0 | (0, 1) | 1.248 | (0, Inf) |
Intracellular processes | 0.305 | *(0.125, 0.53) | 1.205 | *(1.012, 1.442) |
Regulation | 0.649 | *(0.359, 0.991) | 1.733 | *(1.429, 2.072) |
General | 0.503 | *(0.351, 0.675) | 1.324 | *(1.185, 1.484) |
Other | 0.396 | (0, 1.524) | 0.699 | (0.22, 1.469) |
Escherichia coli | (n = 117) | (n = 288) | ||
Information | 1.181 | (0.642, 1.78) | 0.835 | (0.537, 1.207) |
Metabolism | 1.088 | (0.945, 1.231) | 0.681 | *(0.595, 0.769) |
Extracellular processes | 0 | (0, 1) | 0 | (0, 1) |
Intracellular processes | 0.755 | (0.41, 1.153) | 1.134 | (0.831, 1.464) |
Regulation | 1.869 | *(1.252, 2.583) | 0.627 | *(0.388, 0.928) |
General | 0.609 | *(0.417, 0.824) | 1.789 | *(1.553, 2.028) |
Other | 0.696 | (0, 1.625) | 0.798 | (0.277, 1.651) |
NCBI viral sequence set | (n = 203) | (n = 2324) | ||
Information | 0.546 | *(0.4, 0.697) | 1.119 | *(1.052, 1.182) |
Metabolism | 1.444 | *(1.242, 1.664) | 0.491 | *(0.445, 0.535) |
Extracellular processes | 0.146 | *(0, 0.515) | 0.274 | *(0.17, 0.403) |
Intracellular processes | 1.043 | (0.758, 1.339) | 1.34 | *(1.225, 1.468) |
Regulation | 0.826 | (0.479, 1.206) | 1.554 | *(1.382, 1.75) |
General | 1.128 | (0.935, 1.347) | 1.178 | *(1.105, 1.253) |
Other | 1.079 | (0.788, 1.382) | 0.858 | *(0.774, 0.953) |
These results largely agree with the functional analysis of bi-functional positions of known 3D structure, although the propensity values presented here are more statistically significant due to a larger sample size.18 The trends for the regulation and metabolism functions were similarly found in the previous analysis of protein families. The only significant difference is that the previous analysis found overlapping proteins to be enriched in intracellular processes, while that category is near-background in the present analysis. One possible reason for these differences is the level of analyses: the previous analysis was performed at the level of individual domains in contrast to the results presented here at the protein level, which consider all component domains.
These functional trends were further explored by predicting bi-functional residues in several other species. Similar trends were observed for the metabolism and regulation functions in nearly all the other proteomes tested: Mus musculus, Drosophila melanogaster, Caenorhabditis elegans, Saccharomyces cerevisiae, Escherichia coli, and the NCBI viral sequence set (Table 4). The sole exception was the reversal of the regulatory function in E. coli, with overlapping proteins exhibiting a depletion, while non-overlapping proteins were enriched.
To determine the predictive utility of the method, I next examined examples of predicted overlap between ligand and protein binding sites. Below, I describe four examples of therapeutically relevant targets and small molecules. These examples involve protein–protein interactions from several distinct functional classes, including enzyme–substrate/inhibitor, regulatory, and structural interactions.
TopoIIa belongs to a broad family of proteins that share a homologous ATPase domain and includes several chemotherapeutic targets: bacterial DNA gyrase, topoisomerase IV, and topoisomerase VI are antibiotic targets; topoIIa and Hsp-90 are antineoplastic targets.27 Small molecules have shown cross-reactivity between these family members, and this feature has been exploited to discover inhibitors. For example, radicicol was initially discovered as an antifungal antibiotic and was later shown to inhibit both Hsp-90 and mammalian topoIIa.30 Novobiocin, a natural product with antibacterial and weak anti-mammalian topoisomerase activity, has been derivatized to yield selective Hsp-90 inhibitors.32 Both of these compounds act by binding to the ATP substrate pocket. In addition, radicicol prevents ATP-mediated topoisomerase VI homo-dimerization;27 a coumarin antibiotic structurally related to novobiocin interferes with Hsp90 dimerization.33 This example illustrates the utility of distant homologs for predicting binding sites: The folds are shared across prokaryotic and eukaryotic species, and the ligands exhibit cross-reactivity across this evolutionary range. It also indicates that even well-established drug classes – novobiocin was discovered over 50 years ago34 – that target traditional targets like enzyme active sites, may also disrupt protein interactions.
This example highlights the issue of ligand/family member specificity. XIAP, cIAP1, and cIAP2 all have three BIR domains, each of which interacts with different proteins. For example, XIAP interacts with caspase-9 through its BIR3 domain and with capsases-3 and -7 through its BIR2 domain.38 Most small molecules have been designed against BIR3, although at least one study has targeted BIR2. No small molecules have been described that bind to the BIR1 domain. Cross-reactivity has also been observed between XIAP BIR2 and BIR3.36 As expected, HOMOLOBIND predicted bi-functional residues for all three BIR domains of cIAP1, using mostly the same template ligand, peptide, and domain–domain binding sites. The prediction of binding specificity between homologous domains (both within the same protein and in different proteins) is largely beyond the scope of the method, which only aims to predict the binding site.
Several small molecules have been synthesized with varying selectivity for the anti-apoptotic Bcl-2 members including Bcl-2 itself, Bcl-XL, and Mcl-1. Recent studies suggest that both the Mcl-1/Bcl2A1 and the Bcl-2/Bcl-XL/Bcl-w sub-families of anti-apoptotic proteins must be inhibited for effective induction of cancer cell apoptosis.42 Given the structural similarity between Bcl-2 members, and the observed cross-reactivity of small molecules,43 the predicted Mcl-1 bi-functional residues are likely relevant targets.
This example again highlights the issue of specificity. The Bcl-2 ligand used to predict the ligand binding site on Mcl-1 is an acyl-sulfonamide compound designed to bind both Bcl-2 and Bcl-XL.44 Although this compound was not tested against Mcl-1, an earlier study testing a similar compound found weak Mcl-1 binding.45 Recently, small molecules have been designed with activity against multiple anti-apoptotic Bcl-2 members, including Bcl-2, Bcl-XL, Bcl-w, and Mcl-1.46 As mentioned previously, the method presented here does not aim to predict actual ligands, as this requires estimation of binding affinities using explicit structural models at a much higher resolution than the fold detection sequence alignments used here.
![]() | ||
Fig. 4 The method correctly recovers the targets of known interaction modulators.2 Predicted binding sites are depicted as colored tick-marks within larger boxes representing SUPERFAMILY domain annotations. Template ligand and protein binding sites are shown as ribbon diagrams, produced by UCSF Chimera. |
![]() | ||
Fig. 5 Examples of overlap predicted between ligand and protein binding sites on human proteins. Predicted binding sites are depicted as colored tick-marks within larger boxes representing SUPERFAMILY domain annotations. Template ligand and protein binding sites are shown as ribbon diagrams, produced by UCSF Chimera. Full-length topoIIA is nearly 1600 residues; only the first 400 residues, where binding sites were predicted, are shown. |
Homology transfer of binding sites has been extensively demonstrated in a variety of systems and applied to the annotation of protein sequences, structures, and their interactions.17,47–49 Beyond its specific application to the prediction of bi-functional residues, HOMOLOBIND is a systematically benchmarked method that integrates several well-established protein structure resources to facilitate comprehensive prediction of binding sites in complete proteomes. The underlying structural domain assignment algorithm (SUPERFAMILY) has been rigorously benchmarked, the domain definitions and classification (SCOP) are considered gold standard, and the binding site libraries (PIBASE, LIGBASE) are comprehensive and regularly updated. The automated nature of the method makes it suitable for large-scale studies of binding sites. As illustrated above, it can be run on a genomic scale using either the pre-computed genomic domain assignments available from SUPERFAMILY, or domain assignments made for a newly sequenced genome using SUPERFAMILY software.
The bi-functional residues predicted with HOMOLOBIND are suitable candidates for evaluation with higher resolution computational methods that are also more computationally expensive.8 For example, a first step would be to build an explicit structural model of the target protein by comparative modeling26 and high resolution refinement techniques.52 Next, flexible docking algorithms could be used to predict small molecules that bind to the predicted bi-functional region.53 Finally, computationally validated targets and predicted ligands could then be evaluated using the extensive repertoire of experimental biophysical and biochemical techniques used to identify and evaluate interaction inhibitors.3
Experimental high-throughput screening data, such as bioassay data available through ChEMBL (http://www.ebi.ac.uk/chembldb/) and PubChem,54 might also be useful in evaluating the predictions. In the most simple usage, bioassay data could be searched to determine if ligands whose template binding sites were used to annotate a target protein sequence in fact bind to the target. Although clearly confirmatory, this scenario is likely to occur in only a few cases, as HOMOLOBIND only predicts binding sites rather than actual ligands, due to the complexity and difficulty in predicting ligand specificity. However, even in the absence of experimental data for a particular ligand, the bioactivity profiles for the target and template protein sequences might be useful as a pharmacological similarity measure, to complement the sequence similarity thresholds that HOMOLOBIND uses to make the binding site predictions. The (untested) hypothesis is that if a pair of proteins bind similar or identical ligands, then binding sites transferred by homology from one to the other protein are likely to be correct. For example, several Mcl-1 inhibitors in PubChem also demonstrate activity against both Bcl-2 and Bcl-XL (e.g., CIDs 406171 and 1002248). Although these cross-reactive compounds don't include the actual ligands used for the HOMOLOBIND prediction, this cross-reactivity gives more confidence in the Mcl-1 binding site predicted by homology from Bcl-2 (Fig. 5D). The concept of ligand-based protein similarity has been successfully used to develop statistical models of polypharmacology and to predict off-target effects of drugs.55,56
Besides predictions for individual proteins, the proteome-wide compositional and functional properties of bi-functional residues, and the similar functional trends observed across a wide phylogenetic range, suggest that these residues are a biologically relevant phenomena (Fig. 3A and C; Table 2 and 4). Analyzing their biophysical properties could further clarify their relevance for modulating protein interactions. In particular, the relationship of bi-functional residues to energetic hot-spot residues remains an open question. The predicted bi-functional residues described here, as well as the structurally characterized bi-functional positions reported previously,18 exhibit residue propensities similar in many ways to energetic hot-spots (Fig. 3A).5,25 Comparing HOMOLOBIND predictions to experimentally observed58 and computationally predicted hot-spots59–61 will establish how frequently bi-functional residues are also energetic hot-spots.
Another property that has been observed at successfully targeted protein interfaces is that the conformation ‘captured’ by a small molecule is often distinct from that involved in the protein–protein interaction.6,10 The flexibility of bi-functional residues can be investigated using temperature factors from crystallographically determined structures, order parameters of structures determined by nuclear magnetic resonance spectroscopy, or through molecular dynamics simulations. If a clear difference in flexibility is observed at bi-functional residues, this feature might provide an additional means to predict and evaluate the relevance of bi-functional residues. At the extreme of flexibility, the past decade has seen a growing number of studies that demonstrate the importance of intrinsic disorder in protein interaction networks.62 Tools that have been developed to predict protein binding residues in disordered proteins could shed light on the occurrence of bi-functional residues in disordered regions.63
The distinctive features of bi-functional residues, such as their residue propensities, suggest an alternative template-independent approach for their prediction. Previous studies have explored the utility of structural, physicochemical, and evolutionary features for predicting protein binding sites directly from sequence or structure.64 A similar approach might be useful for predicting bi-functional sites in proteins whose structures are unknown, do not have detectable similarity to a protein of known structure, or for which template binding sites are not available. Although the predictive accuracy of these feature-based methods remains to be explored in the context of bi-functional positions, such a method would be complementary to homology- and physics-based predictions.
In summary, I have presented a method that aims to maximize the utility of experimentally determined protein structures for identifying potentially druggable regions of protein–protein interfaces. The method is implemented in open-source software that integrates with well-established protein structure resources. The results provide a protein structure resource for targeting interactions and is complementary to a growing number of computational methods that catalog, characterize, and predict small molecule interaction modulators.8,10,65–70
The procedure is implemented in a Perl program, HOMOLOBIND, that is licensed under GPL v3 and runs on a single CPU or a Sun Grid Engine computing cluster. The program takes as input a SUPERFAMILY assignment file describing the domains found in a set of target protein sequences. The output is a list of target residues with similarities to a template binding site. HOMOLOBIND can also generate diagrams depicting the locations of predicted binding sites relative to the SUPERFAMILY domain architecture of a target protein. Annotation of all human proteins with a SUPERFAMILY domain assignment (n = 30712) takes 2.5 h on a single 2 GHz Xeon processor. HOMOLOBIND is compatible with the current SUPERFAMILY version (v 1.73), and its binding site library will be updated as SUPERFAMILY transitions to newer versions of SCOP.
The accuracy of the method was characterized by the false and true positive rates achieved as a function of the sequence identity threshold used to transfer binding sites from templates onto target sequences. Transferring binding sites without imposing a sequence identity threshold would yield a low false negative rate, correctly identifying all (correctly aligned) binding residues, but at the cost of a high false positive rate. At the other extreme of a 100% sequence identity threshold, all predicted residues would be correct (low false positive rate), but many real binding site residues would be omitted (high false negative rate). Ideally, these error rates would be estimated using a benchmark set of protein sequences where all individual residues are systematically known either to be involved (positive set) or not involved (negative set) in binding ligands and proteins. Such an ideal benchmark set is of course not available because of the vast number of possible ligand and protein binding partners and the currently sparse experimental sampling of this space.
The best source for positive binding residues are the PDB structures used in the template library. A negative set of residues that are not involved in binding is more difficult to construct because the absence of binding in the PDB for a particular protein residue does not rule out all possible binding events. An artificial negative set was constructed using a sequence-shuffling method originally developed for fold recognition,72 and later applied to protein–protein interaction prediction,73 that performs comparably to a more physical model of structural sampling. A set of negative binding residues was defined for each template binding site by creating shuffled sequences (n = 10000) from the family-wide alignment, while preserving gap structure, and selecting those shuffled residues in alignment columns corresponding to the actual template binding site. These negative binding residues were used during benchmarking to estimate the false positive rates achieved for each template binding site, at varying sequence identity thresholds.
Briefly, sequence identity thresholds were first established for each binding site that achieved a 1% residue-level false positive rate (FPR) and the resulting true positive rates (TPR) were estimated for each family by cross-validation. FPR was calculated by counting the number of negative binding residues that passed progressively higher sequence identity thresholds. The lowest sequence identity value that achieved a FPR ≤ 1% was chosen as the threshold. If such a threshold was not identified for a binding site, it was not used as a template. This happened for 18 ligand, 0 peptide, 20 inter-molecular domain, and 8 intra-molecular domain binding sites.
These binding site-specific sequence identity thresholds were then used to estimate the corresponding TPR in a family-wide fashion by cross-validation of the template binding sites. Briefly, each sequence in the family with a binding site was annotated using the other template binding sites in the family at the previously established sequence identity thresholds. Known binding residues that were aligned to template binding residues were considered the positive set: those that passed the sequence identity threshold were considered true positives, and the remainder considered false negatives. Confidence intervals for the TPR were estimated using Bayesian bootstrap resampling with 500 replicates.24
![]() | (2) |
![]() | (3) |
This journal is © The Royal Society of Chemistry 2011 |