Small-molecule binding sites to explore protein–protein interactions in the cancer proteome †

The Cancer Genome Atlas (TCGA) offers an unprecedented opportunity to identify small-molecule binding sites on proteins with overexpressed mRNA levels that correlate with poor survival. Here, we analyze RNA-seq and clinical data for 10 tumor types to identify genes that are both overexpressed and correlate with patient survival. Protein products of these genes were scanned for binding sites that possess shape and physicochemical properties that can accommodate small-molecule probes or therapeutic agents (druggable). These binding sites were classified as enzyme active sites (ENZ), protein–protein interaction sites (PPI), or other sites whose function is unknown (OTH). Interestingly, the overwhelming majority of binding sites were classified as OTH. We find that ENZ, PPI, and OTH binding sites often occurred on the same structure suggesting that many of these OTH cavities can be used for allosteric modulation of enzyme activity or protein–protein interactions with small molecules. We discovered several ENZ (PYCR1, QPRT, and HSPA6) and PPI (CASC5, ZBTB32, and CSAD) binding sites on proteins that have been seldom explored in cancer. We also found proteins that have been extensively studied in cancer that have not been previously explored with small molecules that harbor ENZ (PKMYT1, STEAP3, and NNMT) and PPI (HNF4A, MEF2B, and CBX2) binding sites. All binding sites were classified by the signaling pathways to which the protein that harbors them belongs using KEGG. In addition, binding sites were mapped onto structural protein–protein interaction networks to identify promising sites for drug discovery. Finally, we identify pockets that harbor missense mutations previously identified from analysis of TCGA data. The occurrence of mutations in these binding sites provides new opportunities to develop small-molecule probes to explore their function in cancer.


Introduction
Cancer is a collection of more than 100 diseases that share a number of characteristics as defined by Hanahan and Weinberg: 1 self-sufficiency in growth signals, insensitivity to growth inhibitory signals, evasion from programmed cell death (apoptosis), ability to undergo limitless cycles of cell growth, sustained ability to be supplied by blood (angiogenesis), and tissue invasion and spread to other parts of the body (metastasis). Large-scale sequencing studies of human tumors such as The Cancer Genome Atlas project (TCGA) provide an opportunity to uncover the genetic basis of the processes that drive cancer. Analysis of these genomic data has revealed that the complex phenotypes that define cancer are driven by tens of somatic mutations that occur on proteins across the cellular network. 2 Recent whole genome sequencing studies have profiled the molecular signatures of individual tumors, including ovarian, 3 colorectal, 4 breast, 5 renal, 6 and lung 7,8 cancer, to identify underlying driver mutations of each disease. Tumors were found to harbor tens of mutations. Whole-genome gene expression profiling studies have been instrumental not only in classifying tumors and uncovering genetic alterations in cancer cells (mutations, copy number, and rearrangements), but also as a rich source of potential targets in cancer. 9,10 A growing list of three-dimensional protein structures makes it now possible to rationally develop small-molecule probes to explore these targets. Small-molecule probes can also provide leads for drug-discovery validation.
TCGA is an ongoing effort that aims to catalog clinical and molecular profiles of tumor samples from over 30 cancer types to discover cancer-causing alterations in large cohorts through integrated multi-platform analyses. The project aims and lead compounds for drug discovery. In addition to detecting binding sites, algorithms have been developed to score these binding sites based on whether they can accommodate a small molecule. Both SiteMap and fpocket provide descriptors to assess binding sites that are suitable for small-molecule ligands based on the amino acid composition of the binding site and its collective physicochemical properties. SiteMap uses the hydrophobicity and accessibility of a detected binding site to assess how likely a small-molecule inhibitor is to bind. It provides two scores, SiteScore and DrugScore. The latter score goes beyond just assessing a binding site for ligand binding. It measures whether a binding site is druggable, that is, whether it possesses similar properties to other binding sites that have led to FDA-approved drugs. fpocket provides a measure called the druggability score, which is a general logistic model based on the local hydrophobic density of the binding site, as well as a hydrophobicity and normalized polarity score. The discovery of binding sites within structures that are encoded by overexpressed genes with clinical relevance is highly significant, as these binding sites can be used to develop novel cancer therapeutics that are likely to exhibit greater efficacy in humans.
In addition to druggability, the binding sites must be functionally important to serve as targets for small molecules. For example, small molecules that bind at enzyme active sites or at the interface of a protein-protein complex are expected to disrupt protein function. Protein kinases are one example of an enzyme class with druggable binding sites that occur at the enzyme active site. 27 The ATP binding site of kinases is highly druggable with a SiteMap SiteScore and DrugScore above 1.1. 28 There are fewer small-molecule inhibitors of protein-protein interactions, which is partly due to the lack of druggable binding sites at protein-protein interfaces. The only examples of PPI inhibitors that have shown in vivo efficacy, such as those targeting MDM2/p53 or Bcl-xL, possess druggable binding sites (DrugScore of 0.92 and 0.82, respectively). 29 Therefore, the identification of binding sites that are considered druggable at protein-protein interaction interfaces can provide new avenues to develop chemical probes and cancer therapeutics. Finally, it is worth mentioning that binding sites located outside an enzyme active site or protein-protein interface can also be functionally relevant. 30–34 Allosteric inhibitors have been successfully used to inhibit kinase activity and in some cases, such as AKT, have shown more promise than competitive inhibitors.
Here, we collect gene expression profiles for 10 cancer types from TCGA and compare the expression profiles between cancer and normal samples to identify genes that are overexpressed in each cancer type. We search the Protein Data Bank for crystal structures of the protein products of these genes. We scan the surface of these proteins and identify binding sites. The functional relevance of these binding sites is explored by classifying them into known enzyme active sites, protein-protein interaction sites, or other sites that may lie outside of functional sites. To further explore the biological outcome of small molecules that bind to these binding sites, proteins harboring binding sites are further characterized in the context of a global PPI network and cancer signaling pathways to gain insight into the biological effect of binding at these binding sites. Patient data is used to investigate the correlation of overexpressed genes with clinical outcome. Our analysis uncovered new unexplored and potentially druggable and clinically-relevant protein targets. The study also provides new avenues for the rational design of small-molecule probes for well-established oncogenes. This is the first study that maps binding pockets on three-dimensional structures of the PDB within the context of cancer genomic data.

Three-dimensional structures of proteins encoded by differentially-expressed genes
We collected mRNA gene expression profiles of 10 cancer types from TCGA: breast invasive carcinoma (BRCA), colon adenocarcinoma (COAD), glioblastoma multiforme (GBM), head-and-neck squamous cell carcinoma (HNSC), kidney renal clear cell carcinoma (KIRC), lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC), thyroid adenocarcinoma (THCA), triple-negative breast cancer (TNBC), and uterine corpus endometrioid carcinoma (UCEC). For each cancer type, we collected the gene expression profiles of both normal and tumor samples from RNA sequencing platforms using TCGA's Level 3 data. A search from among the 20 192 reference proteins using UniProt 35 identifiers led to 7044 proteins that are encoded by TCGA overexpressed genes (Table 1 and Table S1, ESI †). For each cancer type, we identified the number of overexpressed genes with protein products having at least one high-resolution crystal structure by mining the Protein Data Bank (PDB). A total of 5069 unique protein chains on 2758 crystal structures from the PDB mapped to at least one of the 7044 overexpressed genes. In cases where more than one crystal structure was identified for a protein, the computer program CD-HIT was used to cluster the protein sequences of the crystal structures to find a set of non-redundant representative structures for the given protein. This resulted in 1624 unique crystal structures of proteins encoded by overexpressed genes. The total number of proteins encoded by overexpressed genes ranged from 839 for TNBC to 2096 for LUSC (Table 2). Overall, the percentage of differentially-expressed genes with at least one crystal structure spanning at least a portion of the gene sequence ranges from 20% in LUSC to 34% in GBM. Additionally, we introduce more stringent cutoffs to distinguish between proteins whose binding sites can be probed and those that feature druggable binding sites by increasing cutoffs of both the log2 fold change and the druggability property of a binding site. Using these increased cutoffs, we identify 5218 overexpressed proteins in TCGA, with only 1218 having a high quality crystal structure at the PDB (Table 1).
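The overexpression filter described above can be illustrated with a minimal sketch. The text does not state how fold changes were computed from the Level 3 data, so the mean-ratio estimate, the pseudocount, the function names, and the toy values below are all assumptions made for illustration only.

```python
import math
from statistics import mean

def log2_fold_change(tumor, normal, pseudocount=1.0):
    """log2 ratio of mean tumor to mean normal expression.

    The pseudocount (an assumption) avoids taking the log of zero.
    """
    return math.log2((mean(tumor) + pseudocount) / (mean(normal) + pseudocount))

def overexpressed_genes(expression, cutoff=1.5):
    """expression maps gene -> (tumor values, normal values).

    Returns a dict of genes whose log2 fold change is at least `cutoff`.
    """
    result = {}
    for gene, (tumor, normal) in expression.items():
        lfc = log2_fold_change(tumor, normal)
        if lfc >= cutoff:
            result[gene] = lfc
    return result

# Toy values with hypothetical gene names, not TCGA data
toy = {
    "GENE_A": ([31.0, 31.0], [3.0, 3.0]),  # (31 + 1) / (3 + 1) = 8, so log2 FC = 3
    "GENE_B": ([7.0, 7.0], [3.0, 3.0]),    # (7 + 1) / (3 + 1) = 2, so log2 FC = 1
}
hits = overexpressed_genes(toy, cutoff=1.5)
```

Raising `cutoff` to 2.0 mirrors the more stringent setting; a real analysis would more likely rely on a dedicated differential-expression method than on a simple ratio of means.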

Identification of binding sites on protein structures at the PDB
Using the three-dimensional structures of the proteins encoded by overexpressed genes for each disease, we scanned their surfaces for binding sites using the SiteMap computer program. SiteMap identifies binding sites by overlaying a three-dimensional grid around the entire protein to determine the van der Waals energies at each point of the grid (site point). By linking together site points on the protein surface that are protected from the solvent, SiteMap identifies potential binding sites on a protein surface. Each binding site identified by SiteMap is evaluated based on its ability to bind a ligand (SiteScore) and its druggability (DrugScore). Both SiteScore and DrugScore use the weighted sums of the same parameters, namely the (i) number of site points in the binding site; (ii) enclosure score that is a measure of how open the binding site is to solvents; and (iii) hydrophilic character of the binding site (hydrophilic score). Unlike DrugScore, SiteScore limits the impact of hydrophilicity in charged and highly polar sites. A binding site with SiteScore and DrugScore of 0.8 is considered to be able to fit a small-molecule ligand. SiteScore and DrugScore values closer to 0.8 are considered 'difficult' to drug, while binding sites with SiteScore and DrugScore closer to 1.1 are classified as highly 'druggable'. 28 In this work, we consider a binding site with SiteScore and DrugScore values of 0.8 or greater as able to be probed and a binding site with DrugScore greater than 1.0 as druggable.
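The two score cutoffs used in this work can be expressed as a small decision rule. The sketch below is illustrative only: the function name and labels are hypothetical, and requiring a SiteScore of at least 0.8 for a druggable site is an assumption, since the text states only the DrugScore condition for that class.

```python
def classify_site(site_score, drug_score):
    """Label a binding site from its SiteMap scores.

    'druggable': DrugScore > 1.0 (SiteScore >= 0.8 is also required here, by assumption)
    'probe':     SiteScore >= 0.8 and DrugScore >= 0.8
    'neither':   everything else
    """
    if site_score >= 0.8 and drug_score > 1.0:
        return "druggable"
    if site_score >= 0.8 and drug_score >= 0.8:
        return "probe"
    return "neither"

# Toy score pairs, not values from the study
examples = [(1.05, 1.10), (0.85, 0.90), (0.70, 0.75)]
labels = [classify_site(s, d) for s, d in examples]
```

Note that a DrugScore of exactly 1.0 falls in the 'probe' class, because the druggable cutoff is a strict inequality.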
Among 1624 overexpressed proteins with at least one high-resolution human crystal structure, 1044 (~64%) had at least one binding site (Table 1). Similarly, among the 1218 highly overexpressed proteins with crystal structures, 405 (~33%) had at least one druggable binding site. For individual diseases, roughly 30% of proteins with crystal structures corresponding to highly overexpressed genes possessed at least one druggable binding site (Table 2). For example, 51 proteins with a crystal structure from among 211 in TNBC had a druggable binding site, while 114 proteins with a crystal structure in LUAD were found to have a binding site among 363. Generally, we found more binding sites than proteins with crystal structures, suggesting that although many of the proteins harbored more than one binding site, a large portion of these sites might only be suitable for probes rather than being druggable. An average of about 0.38 druggable binding sites were identified per protein with crystal structures. For example, a total of 145 druggable binding sites were identified on the 429 proteins with crystal structures corresponding to differentially-expressed GBM genes. Among the most frequently overexpressed proteins with druggable binding sites are the members of the matrix metalloproteinases (MMPs) and protein kinases related to cell signaling (Fig. S1, ESI †).

Classification of binding sites
To characterize the potential functional impact of each of these binding sites, we classified each binding site by its functional role based on its structural features and location on the protein surface, particularly whether it corresponds to a catalytic site or to a binding site located at a protein-protein interaction interface. Using the proximity of known structural features and the functional annotations of key residues, we assign each binding site on the protein structures of overexpressed genes from TCGA to one of three groups: enzyme (ENZ), protein-protein interaction (PPI), and other (OTH). Fig. S2 (ESI †) shows examples of each of the three binding sites. For example, the ATP binding site of a protein kinase is classified as enzyme (ENZ), while a binding site at the interaction interface between two members of the protein families CDKs and cyclins is classified as PPI.
All other binding sites are referred to as 'other' (OTH). Within the binding sites that we identified, there is a wide distribution of binding site functions for each cancer type (Tables 1 and 2). Overall, there are many more OTH binding sites than ENZ and PPI across all tumors. OTH binding sites constitute approximately 70% of the binding sites observed, while ENZ and PPI are observed in about 20 and 10% of structures, respectively. Among those binding sites that we classify as druggable, the distributions are 25, 11, and 66% for the ENZ, PPI, and OTH binding sites, respectively. OTH binding sites may correspond to uncharacterized enzyme active sites or may occur at PPI interfaces that have not been characterized.

Cavities at enzyme active sites
Enzyme active site binding sites were identified by first mapping known catalytic residues from Catalytic Site Atlas (CSA) 36 and UniProtKB 35 onto the identified structures of each protein.
CSA identifies catalytic residues as those that (i) are directly involved in a catalytic mechanism; (ii) alter the pKa of another residue or water involved in the catalytic mechanism; (iii) stabilize a transition or intermediary state; and/or (iv) activate a substrate. 36 UniProt defines these residues as being directly involved in catalysis. 35 If one of the catalytic residues was within the binding site, we classify the binding site as ENZ. In total, we identified 434 unique enzyme active site binding sites and 126 druggable binding sites on proteins that are encoded by overexpressed genes at TCGA (Table 1). The number of druggable ENZ binding sites ranged from 10 for HNSC to 49 for LUSC. For example, there were 34, 21, and 38 druggable enzyme binding sites for GBM, TNBC and LUAD, respectively (Table 2). We further classify enzymes by their catalytic function and distinguish between the druggability of the binding site (Fig. S3, ESI †). We treat kinases separately from the transferases. When kinases and transferases are combined, they, along with the hydrolases, are the largest group among the enzyme active site binding sites. There were 70, 91, 83, and 141 oxidoreductases, transferases, kinases, and hydrolases, respectively. Lyases, isomerases, and ligases, on the other hand, were the least common among proteins with ENZ binding sites (26, 16, and 9, respectively).
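The ENZ rule (at least one annotated catalytic residue lies within the binding site) reduces to a set-overlap test. The following is a minimal sketch; the residue representation as (chain, residue number) pairs, the function name, and the toy residues are assumptions, and retrieval of annotations from CSA or UniProtKB is not shown.

```python
def is_enzyme_site(site_residues, catalytic_residues):
    """Return True if any catalytic residue lines the binding site.

    Both arguments are iterables of (chain_id, residue_number) pairs.
    """
    return not set(site_residues).isdisjoint(catalytic_residues)

# Toy residues for illustration, not taken from any PDB entry
site = [("A", 45), ("A", 48), ("A", 102)]
catalytic = [("A", 102), ("A", 150)]
enz = is_enzyme_site(site, catalytic)       # shares residue ("A", 102)
not_enz = is_enzyme_site(site, [("B", 102)])  # same number, different chain
```

Keying residues by chain as well as number avoids false matches in multi-chain structures.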

Cavities at protein-protein interaction interfaces
Despite the fact that protein-protein interactions play a crucial role in a range of diseases including cancer, few successful PPI inhibitors have been developed to date. This is attributed to the fact that PPI interfaces are usually large and devoid of well-defined binding cavities. Druggable binding sites that occur at protein-protein interfaces could be used to develop small molecules to disrupt the protein-protein interaction. PPI binding sites were identified by looking at the crystal structures with protein complexes with respect to the representative structures for a given protein. For each representative structure of a given protein, we went back to our sequence-based clustering approach in CD-HIT and identified the set of protein structures that shared significant sequence identity with the representative structure.
We then aligned all the crystal structures from this alternative set of structures back onto the representative structure. This superimposition resulted in the identification of PPI interfaces that might not have appeared in the reference structure and their positions with respect to the previously identified binding sites. In total, we identified 231 unique binding sites located at protein-protein interaction interfaces, of which only 55 were druggable. As expected, there were significantly fewer binding sites that occurred at PPI interfaces than any of the other classes of binding sites. These ranged from 4 for HNSC to 19 for KIRC. For example, there were 13, 10, and 15 druggable PPI binding sites identified for GBM, TNBC, and LUAD proteins, respectively (Table 2).
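Once a complex has been superimposed onto the representative structure, deciding whether a binding site sits at the interface can be sketched as a distance test. The text does not give the criterion that was used, so the 4.5 Å cutoff, the data layout, and the function names below are hypothetical choices for illustration.

```python
import math

def interface_residues(chain_atoms, partner_atoms, cutoff=4.5):
    """Residues of the representative chain with an atom near the partner chain.

    chain_atoms:   list of (residue_id, (x, y, z)) for the representative chain
    partner_atoms: list of (x, y, z) for the superimposed binding partner
    cutoff:        distance in angstroms (an assumed value)
    """
    found = set()
    for residue_id, xyz in chain_atoms:
        if any(math.dist(xyz, other) <= cutoff for other in partner_atoms):
            found.add(residue_id)
    return found

def is_ppi_site(site_residues, interface):
    """A site is labeled PPI if it shares at least one residue with the interface."""
    return bool(set(site_residues) & set(interface))

# Toy coordinates, not from a crystal structure
chain = [(("A", 10), (0.0, 0.0, 0.0)), (("A", 50), (20.0, 0.0, 0.0))]
partner = [(3.0, 0.0, 0.0)]
iface = interface_residues(chain, partner)
ppi = is_ppi_site([("A", 10), ("A", 11)], iface)
```

The all-pairs loop is quadratic in the number of atoms; for whole structures a spatial index would be the usual replacement.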

Proteins with binding sites located at both enzyme active sites and protein-protein interaction interfaces
While OTH binding sites were predominant among the different cancer types, the ENZ and PPI binding sites give greater insight into the binding site's function. Interestingly, there are proteins that contain binding sites that are classified as both ENZ and PPI (Table 3). Of these 24 proteins, 10 have binding sites that are druggable and are part of the enzyme active site and a PPI interface. Among these are proteins that are implicated in cancer progression and metastasis, such as CDA 37 (Fig. 1A), MMP14 38 and DDR1. 39 In these cases, the binding site at the catalytic site is also part of a PPI interface. Many of the cases where the ENZ and PPI binding sites overlap correspond to binding sites that occur at the active site of proteases. The binding partner is usually a protease inhibitor, for example, AGT and TIMP1 in ANPEP and MMP14, respectively. Generally, these interactions may not be promising targets since proteolytic activity may contribute to tumor invasion and metastasis. However, the overexpression of protease inhibitors such as TIMPs and serpins suggests that inhibition of proteases may oppose growth and metastasis of a tumor.
Other proteins contain distinct enzyme and PPI binding sites (Table 4). Of these 24 proteins, only ALOX12 and NR1L2 feature both druggable ENZ and PPI binding sites. These proteins can be placed into two categories based on whether or not the binding sites are on the same protein domains. Some have ENZ and PPI binding sites on the same domain such as the decarboxylase GAD1, which has a catalytic site as well as a PPI binding site at its homodimer interface. Another example is the phosphoribosyltransferase NAMPT, which is implicated in cancer metabolism, 40 and has an ENZ binding site with an inhibitor bound as well as a PPI binding site at the interface of the homodimer structure (Fig. 1B). Other proteins have ENZ and PPI binding sites on separate domains. For example, the serine/threonine-protein kinase PLK1 has both an enzyme ATP binding site on its protein kinase domain and a binding site at the PPI interface at its POLO-box domain. Another similar example is the receptor tyrosine kinase EPHB4, which has an enzyme ATP binding site on its protein kinase domain (Fig. 1C) and a binding site at the PPI interface with an ephrin ligand EFNB2 on its ligand binding domain (Fig. 1D). These binding sites may be used to develop allosteric modulators. Small molecules that bind to the PPI binding site may alter substrate binding to the active site. A small-molecule inhibitor of enzyme activity may affect the protein-protein interaction of the protein.
This journal is © The Royal Society of Chemistry 2016

Unclassified binding sites
Binding sites that were neither enzyme active sites nor located at protein-protein interactions were classified as OTH. In total, more than 1500 of these binding sites were identified on proteins that are encoded by differentially-expressed genes. These binding sites could potentially be either unassigned enzyme active sites, part of structurally unresolved protein-protein interaction sites, or allosteric sites. A binding site is considered allosteric only if it occurs on a protein that has enzyme activity or that engages other ligands at sites that are distant from the allosteric binding site. Among the 782 proteins with OTH binding sites, 323 also have at least one ENZ or PPI binding site. These binding sites offer an opportunity to design allosteric small-molecule modulators of enzyme activity or protein-protein interactions. Allosteric regulation of enzyme activity has been successfully achieved with small molecules in several systems. 41 For example, small-molecule kinase inhibitors have been developed to bind to allosteric binding sites to inhibit the enzyme activity of the protein kinase. 42 More recently, small molecules that bind to an allosteric binding site on the Ral GTPase were shown to modulate the distal interaction with its effector protein. 43 Many OTH binding sites occur on proteins with existing ENZ and/or PPI binding sites, which may be potential allosteric sites for protein inhibition. When the enzyme active site is well characterized on a protein surface, additional binding sites represent opportunities for allosteric inhibition of the protein's function. For example, the sulfotransferase SULT2B1 has four binding sites on its protein surface (Fig. 2A). The ENZ binding site is not shown on the figure but encompasses the adenosine nucleotide. Three additional OTH binding sites were detected on the surface of the protein and represent potential allosteric sites. Another example of a protein with both ENZ and OTH binding sites is the protein kinase RET (Fig. 2B). In this structure, a known inhibitor occupies the ENZ ATP binding site, while an additional allosteric binding site is formed near the αC helix. Similarly, there are proteins with both PPI and OTH binding sites. One example is the PPI between CHN2 and SLC9A1 (Fig. 2C), where an α-helix from SLC9A1 occupies two PPI binding sites on CHN2. An additional potentially allosteric OTH binding site is formed on the backside of CHN2. Another example is the protein complex formed between PLAUR, PLAU, and VTN (Fig. 2D). In this example, binding sites were found on

A search of protein-protein interaction networks to identify OTH binding sites located at PPI interfaces
The majority of OTH binding sites occur on proteins with no discernible ENZ or PPI binding sites. To determine whether these binding sites could potentially be located at protein-protein interaction interfaces, a database of predicted protein-protein complexes known as PrePPI was explored.
44 The PrePPI method uses both structural and non-structural evidence to predict whether two proteins form a complex. For complexes predicted based on structural information, PrePPI superimposes monomeric crystal structures onto a reference complex based on the structural similarities of the monomeric structures with the two structures forming the interaction interface. This model is then evaluated based on how well the individual residues of the predicted interaction interface overlap with the structural model. If the likelihood ratio of this structural modeling is above a given cutoff, PrePPI provides the identifiers of both the individual proteins and the reference structure for further evaluation. For the 458 proteins that contained only binding sites classified as OTH, we evaluated the structural models given by PrePPI to determine whether or not OTH binding sites overlapped with potential PPI interfaces. These 458 proteins are represented by 395 unique crystal structures consisting of 806 binding sites of unknown function. Of these 806 OTH binding sites, 48 were on proteins without models of structural complexes in PrePPI. Among the remaining 758 OTH binding sites, we identified 17 OTH binding sites on 13 proteins that are likely binding sites at protein-protein interfaces (Table 5). In each of these 17 cases, a previously classified OTH binding site was predicted by PrePPI to be part of a known protein-protein interaction interface, and perhaps directly contributing to the PPI itself. It is interesting to note that several of these predicted protein-protein interactions are well-established despite the lack of a co-crystal structure: these include the ANK1-ILK 45 and CHN1-RAC1 46 interactions. In each of these cases, there was a high degree of homology between the structure containing the OTH binding site and the PrePPI protein-protein complex to which it was superimposed. In most cases, however, the protein containing the OTH binding site did not show any homology with a protein in a PrePPI complex. In these cases, the similarity between the interaction interfaces of the two proteins and a model protein complex was used. The NCS1-PPP3CA, LCN1-OVCH1, and ZBTB32-BCL6 interactions are examples in which the interaction was uncharacterized in both the literature and existing PPI databases. These three interactions were predicted based on the structural complementarity of both the interaction interface and the crystal structure. Overall, we predict approximately 2% of OTH binding sites with unknown function to be part of a previously uncharacterized PPI interface.

Cancer signaling pathways
Pathways reveal signaling transduction across a cascade of proteins that elicit a variety of cell phenotypes. Individual targets in these pathways are potential sites through which small-molecule inhibition is expected to enhance or alter the subsequent cell phenotype. Alteration of individual genes within these signaling pathways leads to cancer-related processes such as cell growth and adhesion. We have identified 27 cancer-related signaling pathways in KEGG 47 and their respective proteins. Using the members in each of these signaling pathways, we map binding sites onto these individual proteins. We distinguish between binding sites with DrugScore greater than 0.8 on proteins with log2 fold change greater than 1.5 (i.e., able to be probed) (Fig. 3A) and those with DrugScore greater than 1.0 and log2 fold change greater than 2 (i.e., druggable binding sites) (Fig. 3B). While some signaling pathways like the cell cycle contained binding sites of all functional types, no binding sites could be identified for the Hedgehog pathway on differentially-expressed genes. To address cross-talk between signaling pathways, binding sites were also evaluated as being either unique to that signaling pathway or on proteins that occur in multiple signaling pathways. In a majority of cancer signaling pathways, there were more binding sites that occurred in multiple signaling pathways than in a single signaling pathway, revealing protein targets that are involved in multiple signaling processes. Only the Citrate Cycle, HIF-1, and PPAR signaling pathways had many more binding sites that were unique to the signaling pathway itself than in multiple signaling pathways. In signaling pathways such as focal adhesion and cytokine-cytokine receptor interactions, almost all of the druggable binding sites belonged to proteins that were involved in cross-talk across cancer signaling pathways. Finally, signaling pathways such as the cell cycle and Hippo pathways have an even mix of binding sites on unique and overlapping proteins.
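The split between binding sites unique to one pathway and those on proteins shared across pathways can be sketched as a simple tally. The pathway names, protein identifiers, and site counts below are placeholders rather than KEGG content, and the function name is hypothetical.

```python
from collections import defaultdict

def split_unique_shared(pathway_members, sites_per_protein):
    """Count binding sites per pathway on unique versus shared proteins.

    pathway_members:   dict of pathway name -> set of protein identifiers
    sites_per_protein: dict of protein identifier -> number of binding sites
    """
    membership = defaultdict(set)
    for pathway, proteins in pathway_members.items():
        for protein in proteins:
            membership[protein].add(pathway)

    summary = {}
    for pathway, proteins in pathway_members.items():
        unique = shared = 0
        for protein in proteins:
            count = sites_per_protein.get(protein, 0)
            if len(membership[protein]) == 1:
                unique += count  # protein belongs to this pathway only
            else:
                shared += count  # protein participates in cross-talk
        summary[pathway] = {"unique": unique, "shared": shared}
    return summary

# Placeholder data for illustration only
pathways = {"pathway_1": {"P1", "P2"}, "pathway_2": {"P2", "P3"}}
sites = {"P1": 2, "P2": 3, "P3": 1}
summary = split_unique_shared(pathways, sites)
```

Sites on a shared protein are counted once in every pathway that contains it, which matches a per-pathway view but means the shared totals should not be summed across pathways.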

Correlation with patient survival for proteins encoded by differentially-expressed genes
We collected patient survival data from TCGA clinical records for each disease to identify the impact of gene expression on overall survival of cancer patients. To determine the overall survival rate, we first identified the date of death or date of the last checkup for deceased and living patients, respectively. For each differentially-expressed gene among the 10 diseases we considered, the median expression value was used to divide patient tumors into two groups, high and low expression. For a given gene, we then paired a patient's gene expression with their survival outcome to build a Cox proportional hazards regression model for differentially-expressed genes. The ratio of the hazard rates between the high and low expression groups is summarized by a metric known as the hazard ratio. The hazard rate in the regression model defines the probability that an event will occur in the next time interval, given survival up to that interval. In this model, this time interval is made sufficiently small that the hazard rate is considered instantaneous. Therefore, the hazard ratio is used to describe the ratio between the hazard rates of two groups, that is, the survival of patients expressing a gene at high and low levels. In total, we identified 1343 differentially-expressed genes across all 10 diseases with a hazard ratio above 1 and log2 fold change above 1.5. Among them, 202 contained at least one binding site (Fig. 4A). KIRC (121 total) and LUAD (57 total) had the largest numbers of proteins that were both overexpressed and correlated with patient outcome. There were 45 druggable genes that were found to be both overexpressed and correlated with patient outcome in more than one cancer type. The most frequently occurring are MELK and RRM2 in 4 separate cancers, while another 9 proteins have significant fold changes and hazard ratios in 3 cancers. The binding sites on these 202 proteins show a wide distribution in both their druggability and binding site type (Fig. 4B). Of the 601 unique binding sites on these proteins, 102 are ENZ, 46 are PPI, 444 are OTH, and 9 are classified as both ENZ and PPI.
Both the SiteScore and DrugScore of the PPI binding sites have upper limits of about 1.1 for both metrics, while there are many ENZ and OTH binding sites that exceed this cutoff. Similarly, we focused on the subset of the proteins that were highly overexpressed and featured druggable binding sites. In total, we identified 60 proteins with at least one druggable binding site across 10 diseases with a log2 fold change greater than 2.0 and hazard ratio greater than 1.0 (Fig. 4D). Similarly, there are far fewer binding sites among proteins that fit these criteria. Of the 92 binding sites, 20 are ENZ, 6 are PPI, 65 are OTH, and 1 is both ENZ and PPI (Fig. 4E).
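The survival analysis for a single gene (median split followed by a Cox model with one binary covariate) can be sketched without external libraries. This is a minimal illustration under stated assumptions, not the software used in the study: patients at exactly the median are placed in the low group, ties in survival time are handled with the Breslow approximation, the Newton step is clamped for stability, and the toy data are invented.

```python
import math
from statistics import median

def median_split(expression):
    """Return 1 for samples above the median expression (high group), else 0."""
    cut = median(expression)
    return [1 if value > cut else 0 for value in expression]

def cox_hazard_ratio(times, events, groups, iterations=50, tol=1e-10):
    """Hazard ratio (high vs. low group) from a one-covariate Cox model.

    Maximizes the Breslow partial likelihood with Newton-Raphson steps.
    events[i] is 1 for an observed death and 0 for a censored patient.
    """
    beta = 0.0
    for _ in range(iterations):
        gradient = 0.0
        information = 0.0
        for i, (t_i, died) in enumerate(zip(times, events)):
            if not died:
                continue
            # Risk set: every patient still under observation at time t_i
            s0 = s1 = 0.0
            for t_j, x_j in zip(times, groups):
                if t_j >= t_i:
                    weight = math.exp(beta * x_j)
                    s0 += weight
                    s1 += x_j * weight
            p = s1 / s0                   # weighted share of the high group
            gradient += groups[i] - p
            information += p * (1.0 - p)  # variance of a 0/1 covariate
        if information == 0.0:
            break
        step = max(-1.0, min(1.0, gradient / information))  # clamped Newton step
        beta += step
        if abs(step) < tol:
            break
    return math.exp(beta)

# Toy cohort (not TCGA data): high expressers tend to die earlier
times = [2, 3, 4, 5, 6, 7, 8, 9]
events = [1, 1, 1, 1, 1, 1, 1, 1]
expression = [9, 8, 1, 7, 2, 6, 3, 4]
groups = median_split(expression)
hazard_ratio = cox_hazard_ratio(times, events, groups)
```

A hazard ratio above 1 indicates worse survival in the high-expression group. The estimate does not converge when one group always fails first (complete separation), and a production analysis would also report confidence intervals and p-values.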

Protein-protein interaction network
In addition to looking at differentially-expressed genes in the context of their expression, we addressed their impact on the global protein-protein interaction network. Networks have been used not only to model biological relationships, such as the relationship between drugs and diseases 48 or genes and diseases, 49 to understand their underlying mechanisms, but also to identify new drug targets by identifying the relationships between a drug's side effects 50 or gene expression profile. 51 Using experimental data, a global protein-protein interaction network was constructed from physical interactions in humans by integrating data from seven major interaction databases. This resulted in 203 068 non-redundant protein-protein interactions. To improve the robustness of the network, we kept only those interactions that appeared in at least two of the seven databases. This resulted in a network with 38 164 non-redundant protein-protein interactions. We then computed the network properties of each protein to measure its centrality and essentiality to the overall network. Among the topological properties of a given protein are its degree, which is the number of interactions formed by that protein, and its betweenness centrality, which describes the number of shortest paths that pass through the protein. In a biological context, betweenness centrality is a measure of the available paths that a signal can travel through a given network. 52 Thus, proteins with high betweenness are thought to be essential to biological function and are frequently targeted in drug discovery. 53 For example, TP53 has a betweenness centrality and degree of 4.1 × 10⁻² and 236, respectively, while EGFR has values of 2.3 × 10⁻² and 181 for the same properties. We examined the topological properties of all proteins that are overexpressed (log₂ FC ≥ 1.5) and whose expression correlates with patient outcome (Fig. 4C). Of these 1343 proteins, 1001 (~75%) did not have a high quality crystal structure and an additional 141 (~10%) had a structure but no binding sites. Of the remaining proteins, 117 (9%) and 84 (6%) have binding sites and druggable binding sites, respectively. When the differential-expression cutoff is increased to 2 and the minimum DrugScore is increased to 1.0, 60 proteins have at least one druggable binding site (Fig. 4F). Among the proteins with the highest centrality and degree are PLK1, KPNA2, AURKA, and AURKB.
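The two topological properties described above can be illustrated with a small, dependency-free sketch. It computes degree and betweenness centrality with Brandes' algorithm on an unweighted, undirected graph. The function names, the toy graph, and the normalization by (n-1)(n-2) are assumptions made for illustration and are not taken from the analysis reported here.

```python
from collections import deque


def build_adjacency(edges):
    """Build an undirected adjacency map from (a, b) pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def degree(adj):
    """Degree: number of interaction partners of each node."""
    return {node: len(neighbors) for node, neighbors in adj.items()}


def betweenness_centrality(adj):
    """Brandes' algorithm for unweighted, undirected graphs.

    Scores are normalized (an assumption of this sketch) so that a node
    lying on every shortest path between all other pairs scores 1.0.
    """
    score = dict.fromkeys(adj, 0.0)
    for source in adj:
        # Breadth-first search counting shortest paths from the source.
        order = []
        preds = {v: [] for v in adj}
        n_paths = dict.fromkeys(adj, 0)
        n_paths[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)
        # Accumulate pair dependencies from the farthest node inward.
        delta = dict.fromkeys(adj, 0.0)
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += n_paths[v] / n_paths[w] * (1.0 + delta[w])
            if w != source:
                score[w] += delta[w]
    n = len(adj)
    # Every unordered pair is visited twice (once from each endpoint).
    norm = (n - 1) * (n - 2) if n > 2 else 1
    return {v: s / norm for v, s in score.items()}


# Hypothetical toy network: one hub connected to three leaves.
toy = build_adjacency([("HUB", "A"), ("HUB", "B"), ("HUB", "C")])
toy_degree = degree(toy)
toy_betweenness = betweenness_centrality(toy)
```

In this toy graph the hub lies on every shortest path between the leaves, so it receives the maximum normalized betweenness while the leaves score zero.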

New unexplored targets for the development of small-molecule probes and cancer therapeutics
For each of the previously identified 60 targets, we integrated their structural, genomic, biological, and clinical data to examine their druggability. We divide these targets into those that are already established in cancer (Table 6) and those that are uncommon or novel (Table 7) based on the number of citations found in PubMed. Similarly, we analyzed the 202 proteins that were identified using the lower cutoffs in fold change and binding site DrugScore (Table S2, ESI †). We rank-ordered the top targets for each cancer based on their interconnectivity in the PPI network. Among these potential targets, we see a variety of biological processes represented, including many involved in the immune response, metabolism, homeostasis and cell cycle. Some are well-studied in cancer but lack small-molecule inhibitors, while others have no co-crystallized small-molecule inhibitors although inhibitors have been reported in the literature. For example, the well-studied transcription regulator TOP2A is altered in cancer cells, resulting in chromosome instability, and is among the genes that are overexpressed and correlate with survival, but has many topoisomerase-specific inhibitors available. 54 Other genes may act as markers for cancer and indicate late progression into cancer or are vital to the immune response against tumorigenesis. However, there are many targets whose biology and lack of known inhibitors may make them interesting for future consideration.
We highlight examples of proteins with ENZ binding sites that have seldom been considered in cancer and lack therapeutics (e.g. PYCR1, QPRT, HSPA6), or are well-studied in cancer but lack small-molecule inhibitors (e.g. PKMYT1, STEAP3, NNMT) (Fig. S4, ESI †). Similarly, we highlight examples of proteins with PPI binding sites that have not been previously targeted by small-molecule inhibitors and are either seldom considered in cancer (e.g. CASC5, ZBTB32, and CSAD), or are well-studied in cancer but lack small-molecule inhibitors (e.g. HNF4A, MEF2B, and CBX2) (Fig. S5 and Table S3, ESI †). OTH binding sites can provide an avenue to modulate either enzymatic function or protein-protein interactions of the target. Compounds that bind to OTH sites could act either in an orthosteric manner if the binding site happens to be the binding site of a substrate or protein, or allosterically if the binding site is outside an enzyme active site or protein binding site. Among the genes whose overexpression strongly correlated with patient outcome and that possessed an OTH binding site, several have never been studied in cancer, nor do they have small-molecule inhibitors either in the literature or in co-crystallized complexes.
We highlight four examples that span a variety of tumors: a protein of unknown function FAM83A, a water channel AQP2, a serine protease inhibitor SERPIND1, and a protein associated with the immune response TNFAIP8L2 (Fig. S6, ESI †). Among these targets, 26 have been previously probed with small-molecule ligands and X-ray crystallography (Table S4, ESI †). Interestingly, many of these co-crystallized structures occur at binding sites at or below our higher DrugScore cutoff of 1.0, suggesting that a more stringent cutoff may discard otherwise druggable binding sites. Additionally, we mapped these druggable binding sites to conserved protein domains, and find that these binding sites are mainly parts of the protein kinase, serpin, kinesin, and peptidase domains (Table S5, ESI †). When we consider only those without co-crystallized small-molecule inhibitors, protein kinases and trypsin domains are removed. The majority of binding sites across both targeted and untargeted proteins are classified as OTH. In well-studied systems where the active site is known, these OTH sites represent opportunities for allosteric regulation.
We next looked at the secondary structure of the residues that compose the individual binding sites of these proteins across their binding site annotations. By examining the residues around a binding site, we generalized the type of secondary structure that forms the binding site itself (Fig. S7, ESI †). Among all proteins, with or without small-molecule inhibitors, the majority of binding sites were either a mixture of secondary structures or random coils. Combined, these two categories generally make up the large majority of binding sites of each binding site type. In each case, the least frequently observed secondary structures among these binding sites were the helix-like (i.e. α-helix, 3₁₀ helix, or π-helix) and sheet-like structures (i.e. beta bridges and beta bulges). We then examined the secondary structures of the residues of the binding partner inside PPI binding sites. About 27 and 46% of the residues of the binding partners in the binding site were coil-like and helical (α-helix, 3₁₀ helix, or π-helix), respectively. Only 10% of the binding sites were characterized by strand-like structures (β-sheet or β-bridge). The remaining PPI binding sites were a combination of these.

Missense mutations on protein structures
A set of somatic mutations was obtained from a recent study from TCGA's Pan-Cancer initiative. 55 We matched missense mutations from this study to patients in 7 of the 10 diseases and mapped them to protein structures. We classified these mutations as being (i) adjacent to a binding site; (ii) elsewhere on the protein surface; or (iii) buried in the interior of the protein (Fig. 5A). We find that the majority of these missense mutations are found on the surface of proteins but not within a predicted binding site. The frequency of mutations occurring in the interior of a protein is higher than the frequency of mutations that occur at binding sites. We explored some of the proteins with mutations occurring most frequently in the binding site (Fig. 5B). They include well-known genes that have been previously reported to be heavily mutated in cancer such as PIK3CA, 56 SI, 57 and PTEN. 58 On the most commonly mutated target, PIK3CA, mutation rates are approximately five-fold lower at the binding site than across the entire protein. 61 We matched these proteins with missense mutations with their gene expression levels and correlation with patient outcome. We find 29 binding sites on 26 proteins that are (i) overexpressed (log₂ fold change ≥ 2); (ii) correlate with patient outcome (hazard ratio > 1); and (iii) have a missense mutation adjacent to a binding site in a given disease (Table 8). These 29 binding sites include 9 ENZ, 3 PPI, and 17 OTH pockets. Among these mutations adjacent to binding sites is the W167L mutation on the PPI interface between MAD2L1 and MAD1L1 in LUAD (Fig. 5C). This interaction is part of the spindle assembly checkpoint in the cell cycle. 62 Considering the significant reduction in contact area upon replacing tryptophan with leucine, and the fact that tryptophan residues often occur at protein-protein interaction interfaces, we expect that this mutation may impair the protein-protein interaction. Another example is the R121P mutation adjacent to the DNA-binding OTH binding site on EXO1 in LUAD (Fig. 5D). This DNA-binding protein is also involved in DNA repair during cell cycle regulation. 63 Unlike the previous mutation, this substitution changes the charge of the residue: arginine contains a positively charged group while proline is a neutral non-polar amino acid.
Fig. 5 Proteins with missense mutations. (A) Missense mutations were mapped to patients in 7 of 10 diseases (COAD, THCA, and UCEC not included). Individual mutations were mapped to the protein structure and classified as being adjacent to the binding site, elsewhere on the protein surface, or buried in the interior of the protein structure. (B) Percentage of samples with missense mutations adjacent to a binding site in a given disease, showing the top 20 proteins rank-ordered by the sum of frequencies. (C) The W167L (green stick) mutation on the PPI interface between MAD2L1 (white) and MAD1L1 (cyan) is shown in cartoon (PDB ID: 1GO4). The PPI binding site is shown as transparent spheres. (D) The R121P (green stick) mutation adjacent to the DNA-binding OTH site (tan, transparent spheres) on EXO1 (white cartoon) (PDB ID: 3QEB). DNA in the binding site from the crystal structure is also shown as cartoon. (E) The counts of missense mutations at the amino acid level, classified as being adjacent to the binding site, elsewhere on the surface of the protein, or buried in the protein interior. The original amino acid is listed row-wise and the subsequent mutation is listed column-wise.
This journal is © The Royal Society of Chemistry 2016
We examined the mutation rates of individual amino acids by looking at the wild-type and mutated amino acids resulting from a mutation at each of the three locations on the protein (Fig. 5E). We find differences in the relative frequencies of specific point mutations between locations. For example, mutations to alanine are less favored in the pocket or on the surface of the protein than in the interior, especially at charged or polar residues. Among the most common mutations in the binding site and on the surface is that from lysine to glutamic acid, which occurs at a much lower frequency in the interior of the protein.
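The per-location substitution counts summarized above can be tallied along the following lines. This is a minimal sketch: the function names and the location labels are hypothetical, and the records are illustrative placeholders rather than data from the study, apart from the two mutation labels W167L and R121P, which are discussed in the text as adjacent to binding sites.

```python
import re
from collections import Counter

# One-letter wild-type residue, sequence position, one-letter mutant residue.
MISSENSE_PATTERN = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_missense(label):
    """Split a label such as 'W167L' into ('W', 167, 'L')."""
    match = MISSENSE_PATTERN.match(label)
    if match is None:
        raise ValueError(f"not a missense label: {label!r}")
    wild_type, position, mutant = match.groups()
    return wild_type, int(position), mutant


def substitution_counts(records):
    """Count (location, wild type, mutant) triples.

    `records` is an iterable of (mutation label, location) pairs, where
    location is one of 'adjacent', 'surface', or 'interior'.
    """
    counts = Counter()
    for label, location in records:
        wild_type, _, mutant = parse_missense(label)
        counts[(location, wild_type, mutant)] += 1
    return counts


# Illustrative records only; the last two entries are made up.
toy_records = [
    ("W167L", "adjacent"),
    ("R121P", "adjacent"),
    ("K10E", "surface"),
    ("K55E", "surface"),
]
toy_counts = substitution_counts(toy_records)
```

Each location then yields a 20 × 20 matrix with the original amino acid as the row and the mutant amino acid as the column.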

Discussion
The sequencing of the genomes of human tumors has provided access to an unprecedented number of new opportunities for the development of cancer therapeutics. While biological methods such as siRNA or CRISPR/Cas9 are useful tools to explore the role of potential targets, chemical tools provide a complementary approach to interrogate new targets. Small molecules do not affect the expression of the target, thereby causing little disruption to the signaling networks. In addition, small molecules have significantly greater precision, as they can be designed to bind to a single cavity within a protein and modulate the function of the protein by disrupting protein-protein interactions or enzyme activity. Small molecules can work in an orthosteric manner if they directly interfere with the binding of a protein or a substrate. They can also work in an allosteric manner by binding to cavities located outside protein-protein and protein-substrate binding interfaces and modulating the conformation and dynamics of the target.
For small molecules to engage their targets with high affinity, a well-defined cavity that possesses suitable shape and physicochemical properties is required. The lack of such cavities is partly responsible for the difficulty in developing small-molecule therapeutic agents that bind directly to highly promising cancer targets such as mutated RAS GTPase or transcription factors such as c-MYC. Conversely, the success of kinases as oncology targets can be attributed to the well-defined ATP-binding site. Using binding sites of kinases and other druggable targets, several algorithms have been developed to predict the druggable nature of a binding site using the three-dimensional structure of the protein that harbors it. 64 Among them are SiteScore and DrugScore, which have been developed using data from binding sites occupied by approved drugs. 25,28 Although kinases possess druggable sites, the highly conserved nature of the ATP-binding site has been the main impediment in the development of kinase drugs. Developing highly selective kinase inhibitors is notoriously difficult, although some successes have been reported. Identifying novel targets with unique druggable binding sites located on potential cancer targets may lead to cancer therapeutics with greater efficacy and lower toxicity.
Here, in an effort to facilitate the chemical probing of new targets in cancer, we explore RNA-seq data of 10 tumor types at TCGA to identify unique and druggable binding sites on proteins encoded by overexpressed genes. The large-scale effort of TCGA to sequence the genome of tumors from more than 30 cancers provides an unprecedented opportunity to uncover new targets for the development of cancer therapeutics. We identified genes whose mRNA levels are overexpressed in tumors compared with normal tissue. Patient data provided by TCGA were used to further narrow the list of targets to genes whose overexpression correlates strongly with patient survival. This was accomplished by constructing survival curves and evaluating a hazard ratio for each overexpressed gene. Genes with a hazard ratio greater than 1 were considered to correlate with worse patient survival. For each of the 10 diseases that we have considered in this work, we identified protein products of genes whose mRNA levels are differentially expressed and strongly correlate with patient survival. Additionally, we explored these targets in the context of cancer related signaling pathways and the protein-protein interaction network.
The exponentially growing list of three-dimensional structures of proteins prompted us to search the PDB to identify structures for protein products of the up-regulated genes that we identified. We used a stringent threshold for the SiteScore and DrugScore of binding sites to ensure that small molecules that bind to the druggable binding sites have the potential to be developed into therapeutic agents. Among all up-regulated genes we found that 23% of their protein products had a structure at the PDB. Among the 1218 proteins with structures, 405 (33%) had druggable binding sites. A similar ratio was found among individual diseases. For example, 51 proteins with a crystal structure from among 211 in TNBC had a druggable binding site, while 114 proteins with a crystal structure in LUAD were found to have a binding site among a total of 363. When overexpressed genes are further filtered by hazard ratio, a total of 54 proteins with druggable binding sites and 65 with binding sites are identified among 1344 differentially-expressed genes. There were 15 druggable proteins that are present in multiple tumor types. The most frequently occurring was MELK, in 4 tumors.
The presence of a binding site is not sufficient for it to serve as a suitable target site for chemical probe development and drug discovery. The binding site must possess functional relevance. It must be located at a site such that the binding of a small molecule will impair the function of the protein harboring the binding site. For example, small molecules that bind to a binding site located at an enzyme active site or protein-protein interface will disrupt enzyme activity or protein-protein interactions and thereby impair the function of the target protein. Small molecules that bind to sites located outside an enzyme active site or protein-protein interface may or may not modulate the activity of a protein. We classified all binding sites into enzyme active sites, protein-protein interaction sites, or other sites with yet unknown function that may provide an opportunity to modulate protein function through an allosteric mechanism.
Many of the enzyme active sites occur on well-established oncology targets or have been inhibited by small molecules. However, there were several examples of enzymes whose function was explored in cancer but that were never targeted with small molecules; these include PKMYT1, STEAP3, and NNMT. There were also several druggable active sites that occurred on enzymes that have seldom been considered in cancer, such as PYCR1, HSPA6, and QPRT. We also identified binding sites at protein-protein interfaces on several proteins whose overexpression correlates with patient outcome. This discovery is highly significant as protein-protein interactions have historically been challenging targets due to the lack of well-defined binding sites at protein-protein interfaces. 65,66 Protein-protein interfaces can offer an opportunity to develop highly selective compounds since many of these interfaces are structurally unique. Among all differentially-expressed proteins with binding sites, 18% have binding sites that occurred at protein-protein interfaces. For the proteins encoded by genes that correlate with patient survival, we identified 28 binding sites (7 druggable) on 25 proteins that occurred at protein-protein interfaces. Among these proteins, 13 have been studied in cancer. Examples include MEF2B, HNF4A, and CBX2. The remaining 15 proteins have seldom been studied in cancer, such as CASC5 and ZBTB32. Interestingly, several protein structures possess both PPI and ENZ binding sites either on the same domain (e.g. GAD1, NAMPT, and NR1I2) or on different domains (e.g. EPHB2, PLK1, and NTRK1). Small molecules that bind to a binding site on these proteins may serve as allosteric modulators of protein-protein interactions.
We found that the majority of binding sites were not located either at an enzyme active site or protein-protein interaction site. We refer to these binding sites as other (OTH). Of the 601 unique binding sites on the 202 proteins encoded by genes whose overexpression correlates with patient survival, 102 are ENZ, 46 are PPI, 444 are OTH, and 9 have been classified as both ENZ and PPI. It is likely that many of these OTH binding sites occur at protein-protein interfaces. To explore this possibility, we searched protein-protein interaction databases such as PrePPI for binding partners. Among 759 OTH binding sites located on overexpressed proteins, we identified 17 candidates that have the potential to be located at PPI interfaces. Examples of these proteins include ANK1, CHN1, and NCS1. While OTH binding sites that occur at enzyme active sites or protein-protein interaction sites can be used to develop probes that directly modulate the function of the target harboring these binding sites, the remaining OTH binding sites can provide an opportunity to modulate receptors through an allosteric mechanism. 31,67 Whether a small molecule that binds to a binding site will allosterically modulate enzyme function or a protein-protein interaction is difficult to predict. Small molecules can serve as positive or negative allosteric regulators. 34,68,69 These OTH binding sites can also be used for the development of small molecules that can be attached to probes for proteasome degradation. 70
Finally, we mapped mutations that were previously identified at TCGA 55 onto the three-dimensional structure of proteins that are encoded by overexpressed genes that correlate with patient outcome. A recent study explored the role of mutations in tumorigenesis, 71 and more recent studies have done so using a structural genomics based approach. 72,73 Our work complements these studies by identifying druggable binding pockets and classifying pockets by whether they occur at enzyme active sites or protein-protein interaction sites. Mutations that occur within these pockets are expected to have direct consequences for the function of a protein. These pockets could provide promising targets for the development of small-molecule therapeutic agents. Interestingly, several mutations occurred in enzyme active sites. These mutations may either enhance or inhibit enzyme activity. Most of the enzyme mutations appear to involve dramatic changes in physico-chemical properties, such as H113Q, G568W, R140L, and M80R for CA6, KIFC1, NEK2, and SULT4A1, respectively. Others involved subtler mutations such as V46A, A287S, and M52T for CHEK1, PCK1, and PSPH, respectively. Since we have focused on proteins that are expected to be overexpressed, it is likely that these mutations will further enhance the activity of these enzymes. Three mutations were identified at protein-protein interfaces, R293P, W167L, and Q107H, which correspond to ADORA2A, MAD2L1, and RHCG, respectively. The first two may have disruptive effects considering that proline residues tend to disrupt secondary structures and tryptophan residues are generally believed to tighten protein-protein interactions. The overwhelming majority of mutations occurred at OTH binding sites. These mutations provide an opportunity to validate the importance of these pockets, and suggest that these pockets may be located at unknown active sites or protein-protein interfaces. Considering that many of these OTH pockets occur on enzymes, it is more likely that they are located at a protein-protein interface and could be useful targets for the disruption of protein-protein interactions.

Gene expression
Level 3 gene expression data obtained using RNA-seq (RNASeq Version 2) technology for ten cancer types were retrieved from The Cancer Genome Atlas (TCGA). Triple-negative breast cancer (TNBC) patients were identified from a subset of patients in BRCA by filtering clinical records for breast cancer patients who were negative for estrogen receptor (ER), progesterone receptor (PR), and Her2/neu. The gene expression data were used to build a matrix of read counts for each sample against each mapped gene. Only samples designated as either primary solid tumor or solid tissue normal were kept in this matrix. Differential expression analyses between cancer and normal samples in the RNA-seq expression profiles were conducted using default parameters in the edgeR 74 package in R. 75 Differentially-expressed (overexpressed) genes were defined as those genes with p < 0.001 and Q < 0.05. Two log₂ fold change cutoffs, ≥2.0 and ≥1.5, were used to filter genes for further analysis. Gene symbols provided by TCGA were mapped to their respective UniProt IDs using UniProt's mapping tool (http://www.uniprot.org/mapping/).
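The filtering step applied to the differential expression results can be sketched as follows. This is only an illustration of the stated cutoffs, not the edgeR workflow itself; the record type, function name, gene labels, and numerical values are hypothetical.

```python
from typing import NamedTuple


class GeneResult(NamedTuple):
    """One row of a differential expression table (hypothetical layout)."""
    gene: str
    log2_fc: float
    p_value: float
    q_value: float


def overexpressed(results, min_log2_fc, max_p=0.001, max_q=0.05):
    """Return genes with p < max_p, Q < max_q, and log2 FC >= min_log2_fc."""
    return [
        r.gene
        for r in results
        if r.p_value < max_p and r.q_value < max_q and r.log2_fc >= min_log2_fc
    ]


# Made-up values chosen to exercise each cutoff.
toy_results = [
    GeneResult("GENE_A", 2.4, 1e-6, 1e-4),   # passes both fold change cutoffs
    GeneResult("GENE_B", 1.7, 5e-4, 0.01),   # passes only the 1.5 cutoff
    GeneResult("GENE_C", 3.0, 0.02, 0.20),   # fails the significance cutoffs
    GeneResult("GENE_D", -2.1, 1e-8, 1e-6),  # significant but underexpressed
]

strict_set = overexpressed(toy_results, min_log2_fc=2.0)
relaxed_set = overexpressed(toy_results, min_log2_fc=1.5)
```

Running the filter twice, once per fold change cutoff, gives the stricter and the more permissive gene lists used in the downstream analysis.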

Protein structures
An annotated set of 20 192 reference human protein identifiers was retrieved from UniProtKB/SwissProt. 35 The FASTA sequences were retrieved for each of these proteins and used to identify structures in the RCSB Protein Data Bank (PDB). 76 Each FASTA sequence was queried against the pdbaa dataset using BLASTP (protein-protein BLAST v2.2.25+). 77 To limit the search to protein structures that possess significant sequence identity and coverage to the query sequence, only structures with E-value < 10⁻⁵, >90% sequence identity, and PDB sequence coverage >80% were kept. We then identified the experimental methodology, taxonomy of the identified protein chain, and the structural resolution if the structure was from X-ray diffraction. Previously identified structures were then filtered for only crystal structures from human proteins with a resolution better than 3 Å. To reduce the number of redundant structures identified by BLASTP and generate a representative set of crystal structures associated with each protein, CD-HIT (v4.6.1) 78 was used with default parameters to cluster the FASTA sequences of the PDB structures identified for each of the proteins. Only cluster centers identified by CD-HIT were used to locate binding sites on the structures for the protein. In total, 4124 proteins had at least one crystal structure that met all of these criteria.
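The structure selection criteria above amount to a conjunction of simple thresholds. A minimal sketch is given below, assuming each BLASTP hit has already been annotated with its experimental method, source organism, and resolution; the class, field names, method string, and example hits are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BlastHit:
    """A BLASTP hit against a PDB chain, annotated with structure metadata."""
    pdb_chain: str
    evalue: float
    identity_pct: float
    coverage_pct: float
    method: str
    organism: str
    resolution: Optional[float]  # in angstroms; None if not applicable


def passes_filters(hit):
    """Apply the sequence and structure quality cutoffs described in the text."""
    return (
        hit.evalue < 1e-5
        and hit.identity_pct > 90.0
        and hit.coverage_pct > 80.0
        and hit.method == "X-RAY DIFFRACTION"
        and hit.organism == "Homo sapiens"
        and hit.resolution is not None
        and hit.resolution < 3.0
    )


# Placeholder hits, not real PDB entries.
toy_hits = [
    BlastHit("hit1_A", 1e-80, 99.0, 95.0, "X-RAY DIFFRACTION", "Homo sapiens", 2.1),
    BlastHit("hit2_A", 1e-80, 99.0, 95.0, "SOLUTION NMR", "Homo sapiens", None),
    BlastHit("hit3_B", 1e-80, 85.0, 95.0, "X-RAY DIFFRACTION", "Homo sapiens", 1.8),
    BlastHit("hit4_A", 1e-80, 99.0, 95.0, "X-RAY DIFFRACTION", "Homo sapiens", 3.4),
]
kept_hits = [hit.pdb_chain for hit in toy_hits if passes_filters(hit)]
```

The surviving chains would then be clustered (CD-HIT in the text) so that only representative structures are carried forward.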

Binding site identification
Identification of druggable binding sites on the crystal structures was carried out using the Schrödinger Software Suite. For each cluster identified by CD-HIT, the cluster centers (i.e. the representative structures) were used to identify binding sites. Structures were first retrieved from the PDB and binding partners were removed to obtain the monomeric representative structures. All other heteroatoms, including solvent molecules and bound ligands, were removed. Selenomethionine residues were converted to methionines. These preprocessed PDB monomeric structures were then processed using the Protein Preparation Wizard workflow. Missing side chains and loops were added with the Prime 79 module. Disulfide bonds were added and each crystal structure was protonated using PROPKA at pH 7.0. Binding sites were identified using the SiteMap 25 module in Schrödinger on the processed structure. Up to 10 binding sites were kept, while all other parameters were left at their defaults. Only binding sites with SiteScore and DrugScore above 0.8 were kept. 28 The average coordinates of the SiteMap spheres were used to identify the centroid of the binding site. Druggable binding sites were distinguished as those with a DrugScore above 1.0. In total, we identified 5498 binding sites on 2607 proteins.
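The score cutoffs and the centroid calculation described above can be expressed compactly. The sketch below assumes the site points and scores have already been exported from SiteMap into plain Python values; it does not call any Schrödinger API, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]


@dataclass
class Site:
    """A predicted binding site with its scores and site-point coordinates."""
    site_score: float
    drug_score: float
    spheres: List[Point]


def centroid(points):
    """Average the x, y, and z coordinates of the site points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def classify_site(site) -> Optional[str]:
    """Return None if the site is discarded, else its druggability label."""
    if site.site_score <= 0.8 or site.drug_score <= 0.8:
        return None  # both scores must exceed 0.8 for the site to be kept
    return "druggable" if site.drug_score > 1.0 else "binding site"


# Made-up site used only to exercise the functions.
toy_site = Site(site_score=1.05, drug_score=1.10,
                spheres=[(0.0, 0.0, 0.0), (2.0, 4.0, 6.0)])
toy_label = classify_site(toy_site)
toy_center = centroid(toy_site.spheres)
```

The centroid is what later steps use as a single reference point for the pocket, while the label separates the permissive set (above 0.8) from the stricter druggable set (DrugScore above 1.0).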

Binding site annotation
PyMOL 80 scripts were generated to create individual sessions for each protein with druggable binding sites. The unprocessed protein structure, including all bound ligands and other non-solvent molecules, was overlaid back onto the crystal structure. In addition, all redundant structures from the CD-HIT clustering were added and aligned to the druggable protein. The locations of enzymatic binding residues were retrieved from UniProt 35 and Catalytic Site Atlas 36 and highlighted on the processed protein structures.
Mol. BioSyst., 2016, 12, 3067–3087
Each binding site identified by SiteMap was visually inspected and manually annotated to determine its functional role in the protein. If an enzymatic residue was in contact with the SiteMap spheres, or if an enzymatic molecule or inhibitor occupied the space of the spheres, the binding site was labeled 'enzymatic' (ENZ). If the binding site was at a protein-protein interaction (PPI) interface on the original structure or on any of the aligned structures, the binding site was labeled 'PPI'. Otherwise, if the binding site was neither enzymatic nor part of the interaction interface, it was labeled 'Other' (OTH). Binding sites at the recognition site of human leukocyte antigens (HLAs) and the heme cofactor binding site of Cytochrome P450s were labeled 'Other'.
Secondary structures for each of the binding sites and their interaction partners were retrieved from DSSP. 81 The secondary structure of each residue of a crystal structure is classified as helix, sheet, or coil in DSSP. The number of residues falling into each category was retrieved for the residues within 5 Å of the binding site. If there was at least a 60% consensus in the secondary structures of these residues, the binding site was assigned to that category. Otherwise, the binding site was considered mixed.
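The 60% consensus rule can be sketched as follows. The reduction of DSSP's one-letter codes to three classes (H, G, I as helix; E, B as sheet; everything else as coil) is an assumption of this sketch, as are the function names; the exact mapping used in the analysis is not stated.

```python
from collections import Counter

# Assumed three-state reduction of DSSP one-letter codes.
HELIX_CODES = {"H", "G", "I"}   # alpha-helix, 3-10 helix, pi-helix
SHEET_CODES = {"E", "B"}        # extended strand, isolated beta-bridge


def reduce_code(dssp_code):
    """Map a DSSP code to 'helix', 'sheet', or 'coil'."""
    if dssp_code in HELIX_CODES:
        return "helix"
    if dssp_code in SHEET_CODES:
        return "sheet"
    return "coil"


def site_secondary_structure(dssp_codes, consensus=0.6):
    """Label a binding site from the DSSP codes of its nearby residues.

    Returns the class that covers at least `consensus` of the residues,
    or 'mixed' if no class reaches that share (or no residues are given).
    """
    classes = [reduce_code(code) for code in dssp_codes]
    if not classes:
        return "mixed"
    label, count = Counter(classes).most_common(1)[0]
    return label if count / len(classes) >= consensus else "mixed"


# Codes for residues within 5 angstroms of a hypothetical site.
helical_site = site_secondary_structure("HHHHHHHTSE")   # 7 of 10 helix
mixed_site = site_secondary_structure("HHHHHEEETS")     # 5 of 10 helix
```

Because the threshold is above 50%, at most one class can satisfy it, so the label is unambiguous whenever a consensus exists.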

Survival analysis
Kaplan-Meier curves were built using the survival 82 package in R. 75 For each disease, each patient's time to last follow-up or time to death was collected from the clinical data depending on whether or not the patient was deceased. A patient's overall survival was paired with their respective log₂ CPM for diseases using RNA-seq. Expression levels for each gene were separated into 'high expression' and 'low expression' groups using the median expression of the gene across all patients for a given disease. A Cox proportional hazards regression model was fitted to the survival profile to determine the hazard ratio (HR) of each gene. Genes were filtered using p < 0.05 and HR > 1.0.
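The median split and the Kaplan-Meier estimate can be illustrated without any statistics package. The sketch below is not the R survival workflow used here and does not fit a Cox model; the function names, the rule that patients exactly at the median fall in the low group, and the toy values are assumptions.

```python
from statistics import median


def median_split(expression):
    """Assign each patient to 'high' or 'low' relative to the median.

    `expression` maps patient ID to log2 CPM. Patients exactly at the
    median are placed in the 'low' group (an assumption of this sketch).
    """
    cutoff = median(expression.values())
    return {pid: ("high" if value > cutoff else "low")
            for pid, value in expression.items()}


def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    `times` are follow-up times; `events` are True for death and False
    for censoring at last follow-up. Returns (time, survival) pairs at
    each distinct death time.
    """
    death_times = sorted({t for t, e in zip(times, events) if e})
    survival = 1.0
    curve = []
    for t in death_times:
        at_risk = sum(1 for ti in times if ti >= t)
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Made-up cohort: patient ID -> log2 CPM of one gene.
toy_expression = {"p1": 5.0, "p2": 1.0, "p3": 3.0, "p4": 7.0}
toy_groups = median_split(toy_expression)

# Made-up follow-up: death at t=1, censored at t=2, death at t=3.
toy_curve = kaplan_meier([1, 2, 3], [True, False, True])
```

In practice one curve is computed per expression group, and the hazard ratio between the groups comes from the Cox regression rather than from the curves themselves.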

Signaling pathway
27 cancer related signaling pathways were collected from KEGG. 47 Individual proteins within each of these pathways were collected and mapped to their respective UniProt IDs using the REST API in KEGG. Any protein that could not be mapped to a UniProt entry from the reference protein identifiers was filtered out.

Protein-protein interaction network
A protein-protein interaction network was constructed using the NetworkX 83 module in Python by retrieving human PPI data with experimental evidence from seven major interaction databases: Biomolecular Interaction Network Database (BIND), 84 BioGRID, 85 Database of Interacting Proteins (DIP), 86 Human Protein Reference Database (HPRD), 87 IntAct, 88 Molecular INTeraction database (MINT), 89 and Reactome. 90 Only those interactions with at least two occurrences among the seven databases were kept. The resulting network featured 9665 nodes and 38 164 edges.
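The consensus filter (keeping interactions reported by at least two databases) can be sketched with the standard library alone; the resulting edge list could then be loaded into a graph library such as NetworkX. The function names, the exclusion of self-interactions, and the toy interactions with placeholder protein IDs are assumptions of this sketch.

```python
from collections import Counter, defaultdict


def consensus_edges(databases, min_support=2):
    """Keep undirected interactions found in at least `min_support` databases.

    `databases` maps a database name to an iterable of (protein, protein)
    pairs. Each database counts at most once per interaction, and
    self-interactions are dropped (an assumption of this sketch).
    """
    support = Counter()
    for pairs in databases.values():
        unique = {tuple(sorted(pair)) for pair in pairs if pair[0] != pair[1]}
        support.update(unique)
    return {edge for edge, n in support.items() if n >= min_support}


def adjacency(edges):
    """Convert an edge set into an undirected adjacency map."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return dict(adj)


# Placeholder interactions; P1..P5 are not real identifiers.
toy_databases = {
    "BioGRID": [("P1", "P2"), ("P2", "P3")],
    "IntAct": [("P2", "P1"), ("P3", "P4")],
    "MINT": [("P4", "P3"), ("P1", "P2"), ("P5", "P5")],
}
toy_edges = consensus_edges(toy_databases)
toy_network = adjacency(toy_edges)
```

Sorting each pair before counting ensures that the same interaction listed in either orientation is recognized as one undirected edge.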

Missense mutations
Mutations were obtained from a recent study by Kandoth and coworkers. 55 The work identified somatic variants from 12 cancers as part of TCGA's Pan-Cancer initiative. We only use missense mutation data as other mutations result in the insertion or deletion of amino acids from the protein sequence, which would be very difficult to model onto the three-dimensional structure of the protein. Mutations were mapped using the sample ID barcode provided by TCGA to match patients with both mutation and gene expression data. The data for three diseases were not used since THCA was not included in the original study, while COAD and UCEC had low numbers of patient samples with matched gene expression data. Genes were mapped from Ensembl Transcript IDs to UniProt IDs using UniProt's mapping tool. For each protein, the amino acid position on the protein sequence was mapped to the protein structure using the pairwise function in BLASTP. Each mutation was then assigned to its closest binding site by minimizing the Euclidean distance from the alpha carbon of the mutated residue to the site points (grid spheres) of each binding site on the protein structure. In addition, the solvent-accessible surface area (SASA) of the mutated residue was calculated using NACCESS. 91 We used the SASA and distance to the closest binding site to classify each mutation as being (i) adjacent to a binding site; (ii) elsewhere on the protein surface; or (iii) buried in the interior of the protein. If the distance between the mutation and the closest binding site was less than 4 Å, the mutation was classified as being adjacent to the binding pocket. Otherwise, if the SASA of the mutated residue was greater than 10 Å², the mutation was classified as being on the surface of the protein. If the mutation did not fit into either of these criteria, it was classified as located in the interior of the protein.
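The three-way classification described above reduces to a distance test followed by a SASA test. A minimal sketch follows, assuming the alpha-carbon coordinates, the site points of each binding site, and the residue SASA are already available as plain numbers; the function and parameter names are hypothetical.

```python
import math


def classify_mutation(ca_xyz, binding_sites, sasa, max_dist=4.0, min_sasa=10.0):
    """Classify a mutated residue as 'adjacent', 'surface', or 'interior'.

    ca_xyz        -- (x, y, z) of the residue's alpha carbon, in angstroms
    binding_sites -- list of binding sites, each a list of (x, y, z) site points
    sasa          -- solvent-accessible surface area of the residue, in square angstroms
    """
    # Distance to the nearest site point over all binding sites.
    nearest = min(
        (math.dist(ca_xyz, point) for site in binding_sites for point in site),
        default=math.inf,
    )
    if nearest < max_dist:
        return "adjacent"
    if sasa > min_sasa:
        return "surface"
    return "interior"


# Made-up coordinates used only to exercise each branch.
one_site = [[(10.0, 0.0, 0.0), (12.0, 0.0, 0.0)]]
near_pocket = classify_mutation((7.0, 0.0, 0.0), one_site, sasa=35.0)
on_surface = classify_mutation((0.0, 0.0, 0.0), one_site, sasa=35.0)
buried = classify_mutation((0.0, 0.0, 0.0), one_site, sasa=2.0)
```

The distance test takes precedence, so a residue close to a pocket is labeled adjacent regardless of how exposed it is.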

Fig. 1 Examples of proteins with both ENZ and PPI binding sites. Proteins are represented in cartoon format. The monomer structure with identified binding sites is in white. SiteMap binding sites are shown as spheres; bound ligands are shown as ball-and-sticks. (A) The homodimeric structure of CDA (PDB: 1mq0.B) with a bound inhibitor at a binding site classified as both ENZ and PPI. (B) The homodimeric structure of NAMPT (PDB: 4o0z.B) with an ENZ (peach, bound inhibitor) and a PPI (blue) binding site on the same domain. (C and D) The protein kinase (PDB: 2vwy.A) and ligand binding domain (PDB: 2hle.A) of EPHB4 featuring an ENZ and a PPI binding site on separate domains. The binding site on the protein kinase domain is not shown as spheres, but is occupied by the bound inhibitor (green).
the monomer structure of the apo protein. After superimposition of additional crystal structures back onto the representative structure, two of the three detected binding sites were classified as PPI. The two separate PPI binding sites occupy the respective interfaces between PLAUR-PLAU and PLAUR-VTN. An additional OTH binding site was also detected on the protein surface and represents an allosteric site.

Fig. 2 Examples of proteins with potentially allosteric OTH binding sites. Proteins are represented in cartoon format. The monomer structure with identified binding sites is in white. SiteMap binding sites are shown as spheres; bound ligands are shown as ball-and-sticks. (A) SULT2B1 (PDB: 1q1q.A) with an ENZ binding site occupied by a nucleotide and three additional OTH binding sites (green, blue, yellow). (B) RET (PDB: 2iiv.A) with an ENZ binding site occupied by the bound inhibitor and an additional OTH binding site (green). (C) CHP2 (PDB: 2bec.A) with two PPI binding sites (green, blue) at the interface with SLC9A1 (PDB: 2bec.B) and an additional OTH binding site (peach). (D) The superimposed structure of PLAUR (PDB: 1ywh.M) with two PPI binding sites at the interfaces with VTN (PDB: 3bt1.B, green) and PLAU (PDB: 3bt1.A, yellow) and an additional OTH binding site (peach).

Fig. 3 Binding sites in cancer-related signaling pathways. Proteins with binding sites were mapped to 27 cancer-related signaling pathways in KEGG. Identified binding sites were divided based on whether the protein was exclusive to one signaling pathway or occurred in multiple signaling pathways. (A) Identified binding sites had DrugScore greater than 0.8 on proteins with log2 fold change greater than 1.5. (B) Identified binding sites had DrugScore greater than 1.0 and log2 fold change greater than 2.
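The two panel cutoffs and the exclusive-versus-multiple split described in this caption can be sketched in a few lines of Python. This is only an illustrative sketch: the record layout, the function names, and the toy proteins and pathways (`P1`, `pathway_x`, and so on) are assumptions made for the example and do not come from the paper. Only the numerical cutoffs are taken from the caption.

```python
from collections import defaultdict

# Cutoffs quoted in the Fig. 3 caption (both are strict "greater than").
PANEL_CUTOFFS = {
    "A": {"drugscore": 0.8, "log2fc": 1.5},
    "B": {"drugscore": 1.0, "log2fc": 2.0},
}


def passes_panel(site, panel):
    """True if a binding-site record clears both cutoffs of the given panel."""
    cut = PANEL_CUTOFFS[panel]
    return site["drugscore"] > cut["drugscore"] and site["log2fc"] > cut["log2fc"]


def split_by_pathway_membership(sites, protein_pathways, panel):
    """Count, per pathway, binding sites on proteins that belong to exactly
    one pathway ("exclusive") versus more than one pathway ("multiple")."""
    counts = defaultdict(lambda: {"exclusive": 0, "multiple": 0})
    for site in sites:
        if not passes_panel(site, panel):
            continue
        pathways = protein_pathways.get(site["protein"], set())
        if not pathways:
            continue  # protein not mapped to any pathway
        key = "exclusive" if len(pathways) == 1 else "multiple"
        for pathway in pathways:
            counts[pathway][key] += 1
    return {pathway: dict(c) for pathway, c in counts.items()}


# Hypothetical toy records, for illustration only (not data from the paper).
TOY_SITES = [
    {"protein": "P1", "drugscore": 0.9, "log2fc": 1.8},
    {"protein": "P2", "drugscore": 1.1, "log2fc": 2.4},
    {"protein": "P3", "drugscore": 0.7, "log2fc": 3.0},
]
TOY_PATHWAYS = {
    "P1": {"pathway_x"},
    "P2": {"pathway_x", "pathway_y"},
    "P3": {"pathway_y"},
}

panel_a = split_by_pathway_membership(TOY_SITES, TOY_PATHWAYS, "A")
panel_b = split_by_pathway_membership(TOY_SITES, TOY_PATHWAYS, "B")
```

In this sketch a site on a multi-pathway protein is counted once in every pathway the protein belongs to. That is one possible convention; the caption does not say how such sites were tallied.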

Fig. 4 Proteins with binding sites that are both overexpressed and correlate with patient outcome. (A) Fold change versus hazard ratio across all cancer types on proteins with log2 FC ≥ 1.5, HR > 1.0, and DrugScore > 0.8. (B) SiteScore and DrugScore of binding sites by functional annotation for proteins in A. (C) Degree versus betweenness centrality from PPI network for all proteins with log2 FC ≥ 1.5 and HR > 1. Proteins are color coded based on whether there was a high quality crystal structure (blue), a crystal structure but no identifiable binding sites (orange), binding sites with DrugScore between 0.8 and 1.0 (gray), and druggable binding site with DrugScore greater than 1.0 (yellow). (D) Fold change versus hazard ratio across all cancer types on proteins with druggable binding sites with log2 FC ≥ 2.0, HR > 1.0, and DrugScore > 1.0. (E) SiteScore versus DrugScore of druggable binding sites with log2 FC ≥ 2.0, HR > 1.0, and DrugScore > 1.0. (F) Degree versus betweenness centrality from PPI network for all proteins with log2 FC ≥ 2.0, HR > 1.0, and DrugScore > 1.0.
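For readers unfamiliar with the two network measures plotted in panels C and F, the following is a minimal sketch of how degree and betweenness centrality can be computed for an undirected, unweighted interaction network. It uses Brandes' algorithm and reports unnormalized betweenness. The caption does not state which software or normalization the authors used, so this choice, the function name, and the toy edge list are assumptions for illustration only.

```python
from collections import deque


def degree_and_betweenness(edges):
    """Return ({node: degree}, {node: betweenness}) for an undirected,
    unweighted graph given as an iterable of (u, v) pairs.
    Betweenness is unnormalized (number of shortest paths through a node,
    with ties split fractionally)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    degree = {node: len(nbrs) for node, nbrs in adj.items()}
    betweenness = dict.fromkeys(adj, 0.0)

    for source in adj:
        # Breadth-first search from the source, counting shortest paths
        # (sigma) and recording predecessors on those paths.
        sigma = dict.fromkeys(adj, 0)
        dist = dict.fromkeys(adj, -1)
        preds = {node: [] for node in adj}
        sigma[source], dist[source] = 1, 0
        order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:  # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)

        # Accumulate pair dependencies from the farthest nodes back.
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                betweenness[w] += delta[w]

    # Each unordered pair was counted from both endpoints.
    return degree, {node: b / 2.0 for node, b in betweenness.items()}


# Hypothetical four-node chain A-B-C-D, for illustration only.
toy_degree, toy_betweenness = degree_and_betweenness(
    [("A", "B"), ("B", "C"), ("C", "D")]
)
```

In the toy chain, the two inner nodes each lie on two shortest paths between other nodes, while the end nodes lie on none. This is the sense in which a protein with high betweenness acts as a bridge in the network.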

Table 7
Uncommon cancer targets with druggable binding site. a Has a binding site that has a co-crystallized small-molecule inhibitor. This journal is © The Royal Society of Chemistry 2016. Mol. BioSyst., 2016, 12, 3067-3087

Table 1
Structural coverage of TCGA and the human proteome

Table 2
Distribution of protein structures and druggable binding sites among cancer types (log2 FC ≥ 2.0, DS ≥ 1.0)

Table 5
Proteins with potential PPI binding sites identified from search against PrePPI. a The binding site is druggable (DS ≥ 1.0).