Open Access Article
David Xu ab, Shadia I. Jalal c, George W. Sledge Jr. d and Samy O. Meroueh *aef
a Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, Indiana 46202, USA
b Department of BioHealth Informatics, Indiana University School of Informatics and Computing, Indianapolis, Indiana 46202, USA
c Department of Medicine, Indiana University School of Medicine, Indianapolis, Indiana 46202, USA
d Division of Oncology, Stanford University Medical Center, USA
e Department of Biochemistry and Molecular Biology, Indiana University School of Medicine, 410 W. 10th Street, HITS 5000, Indianapolis, Indiana 46202, USA. E-mail: smeroueh@iu.edu; Fax: +1 (317) 278-9217; Tel: +1 (317) 274-8315
f Stark Neuroscience Research Institute, Indiana University School of Medicine, Indianapolis, Indiana 46202, USA
First published on 5th July 2016
The Cancer Genome Atlas (TCGA) offers an unprecedented opportunity to identify small-molecule binding sites on proteins with overexpressed mRNA levels that correlate with poor survival. Here, we analyze RNA-seq and clinical data for 10 tumor types to identify genes that are both overexpressed and correlated with patient survival. Protein products of these genes were scanned for binding sites whose shape and physicochemical properties can accommodate small-molecule probes or therapeutic agents (druggable). These binding sites were classified as enzyme active sites (ENZ), protein–protein interaction sites (PPI), or other sites whose function is unknown (OTH). Interestingly, the overwhelming majority of binding sites were classified as OTH. We find that ENZ, PPI, and OTH binding sites often occur on the same structure, suggesting that many of these OTH cavities can be used for allosteric modulation of enzyme activity or protein–protein interactions with small molecules. We discovered several ENZ (PYCR1, QPRT, and HSPA6) and PPI (CASC5, ZBTB32, and CSAD) binding sites on proteins that have been seldom explored in cancer. We also found ENZ (PKMYT1, STEAP3, and NNMT) and PPI (HNF4A, MEF2B, and CBX2) binding sites on proteins that have been extensively studied in cancer but have not previously been explored with small molecules. Using KEGG, all binding sites were classified by the signaling pathways to which the proteins that harbor them belong. In addition, binding sites were mapped onto structural protein–protein interaction networks to identify promising sites for drug discovery. Finally, we identify pockets that harbor missense mutations previously identified from analysis of TCGA data. The occurrence of mutations in these binding sites provides new opportunities to develop small-molecule probes to explore their function in cancer.
TCGA is an ongoing effort that aims to catalog clinical and molecular profiles of tumor samples from over 30 cancer types to discover cancer-causing alterations in large cohorts through integrated multi-platform analyses. The project aims to integrate the clinical and molecular profiles of at least 500 tumors for each disease and to determine its underlying molecular mechanism. Multiple platforms capture the clinical, pathological, genomic, epigenomic, transcriptomic, and proteomic profiles of cancers in the TCGA project. Among these platforms, RNA-seq is a widely used technology for the characterization of mRNA expression. RNA-seq uses high-throughput short reads that offer several distinct advantages over its array-based predecessors. RNA-seq is not limited by the set of predetermined probes used in microarrays, and is superior in its ability to identify low abundance transcripts, biological isoforms, and genetic variants.11 RNA-seq was performed for both tumor and normal tissue for each disease at TCGA. Comparison of tumor and normal mRNA levels can be used to identify overexpressed genes and their corresponding protein products that may contribute to tumor formation, progression, and metastasis. Patient information that accompanies the genomic data affords further analyses to assess the correlation of mRNA levels with patient outcome. Survival curves constructed by plotting patient outcome against time can be used to generate metrics such as hazard ratios and other coefficients to determine the correlation between overexpression of individual genes and clinical outcome. This analysis has been widely used in clinical trials, where Kaplan–Meier survival curves are used to determine the time-to-event differences between placebo and drug groups.12
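As an illustration of the survival analysis described above, the following is a minimal sketch of the Kaplan–Meier product-limit estimator in plain Python. The function name `kaplan_meier`, the toy follow-up times, and the event flags are all hypothetical and are not taken from TCGA; the sketch does not compute hazard ratios, which would typically require a Cox proportional hazards fit.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one observed death.

    times  -- follow-up time for each patient
    events -- 1 if death was observed at that time, 0 if the patient was censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    exits = Counter(times)          # everyone leaving the risk set at time t
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving time t
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]         # deaths and censored patients both leave
    return curve


# Toy cohort (hypothetical): five patients, two of them censored
toy_times = [5, 8, 8, 12, 15]
toy_events = [1, 1, 0, 1, 0]
toy_curve = kaplan_meier(toy_times, toy_events)
```

In practice one would build one such curve for patients with high expression of a gene and another for patients with low expression, then compare the two.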
Whether overexpressed genes contribute to the cancer phenotype must be confirmed in follow-up studies in vitro and in vivo, especially since studies have shown that there is not always a direct correlation between the levels of gene expression and the proteins that they encode.13–15 This is typically accomplished using molecular biology approaches such as RNAi16 or CRISPR/Cas917 technologies. A complementary approach is the use of small organic molecules that work by binding to well-defined cavities or binding sites on the surface of a protein and compete with the target's ligands either in an orthosteric or allosteric manner. Binding sites that are located at enzyme active sites, protein–protein interfaces, or known allosteric sites, have particular functional relevance. Identification of binding sites is accomplished by analyzing the three-dimensional structure of a protein. Several computational methods have been developed to scan the surface of proteins for binding sites.18 Binding site detection algorithms, such as CavBase,19 fpocket,20 and LIGSITECSC,21 often represent the protein structure through the use of points on a three-dimensional grid. Other algorithms, such as Q-SiteFinder,22 PocketFinder,23 and SiteHound,24 employ energy-based approaches to calculate interaction potentials at points in the grid and cluster favorable points together into binding sites. Finally, ensemble or combinatorial algorithms, such as SiteMap25 and metaPocket,26 use a combination of geometric and energy-based methods to identify potential binding sites.
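To make the grid-based idea concrete, here is a deliberately simplified sketch of a geometric buriedness scan on a voxel grid. It is not the implementation of CavBase, fpocket, or LIGSITECSC; it only illustrates the general principle of flagging empty grid points that are enclosed by protein on both sides along several axes. The function name `buried_voxels`, the `min_axes` threshold, and the toy "protein" block are assumptions made for illustration.

```python
def buried_voxels(occupied, shape, min_axes=2):
    """Return empty voxels enclosed by protein on both sides along >= min_axes axes.

    occupied -- set of (x, y, z) voxels filled by protein atoms
    shape    -- (nx, ny, nz) grid dimensions
    """
    def hits_protein(start, step):
        # Walk from `start` in direction `step` until protein or the grid edge
        x, y, z = start
        while True:
            x, y, z = x + step[0], y + step[1], z + step[2]
            if not (0 <= x < shape[0] and 0 <= y < shape[1] and 0 <= z < shape[2]):
                return False
            if (x, y, z) in occupied:
                return True

    axes = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
    pocket = set()
    for x in range(shape[0]):
        for y in range(shape[1]):
            for z in range(shape[2]):
                voxel = (x, y, z)
                if voxel in occupied:
                    continue
                enclosed = sum(
                    hits_protein(voxel, a) and hits_protein(voxel, (-a[0], -a[1], -a[2]))
                    for a in axes
                )
                if enclosed >= min_axes:
                    pocket.add(voxel)
    return pocket


# Toy "protein": a 3x3x3 block inside a 5x5x5 grid with a well opening toward +z
toy_protein = {(x, y, z) for x in range(1, 4) for y in range(1, 4) for z in range(1, 4)}
toy_protein -= {(2, 2, 2), (2, 2, 3)}
toy_pocket = buried_voxels(toy_protein, (5, 5, 5))
```

Energy-based methods replace the boolean enclosure test with an interaction potential evaluated at each grid point, followed by clustering of favorable points.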
Extensive data from TCGA combined with the exponentially growing structural data at the Protein Data Bank (PDB) offer a unique opportunity to identify protein structures of overexpressed or clinically relevant genes in cancer. These structures can be used to scan for binding sites to develop chemical probes and lead compounds for drug discovery. In addition to detecting binding sites, algorithms have been developed to score these binding sites based on whether they can accommodate a small molecule. Both SiteMap and fpocket provide descriptors to assess binding sites that are suitable for small-molecule ligands based on the amino acid composition of the binding site and its collective physicochemical properties. SiteMap uses the hydrophobicity and accessibility of a detected binding site to assess how likely a small-molecule inhibitor is to bind. It provides two scores, SiteScore and DrugScore. The latter score goes beyond just assessing a binding site for ligand binding. It measures whether a binding site is druggable, that is, whether it possesses properties similar to those of binding sites that have led to FDA-approved drugs. fpocket provides a measure called the druggability score, which is a general logistic model based on the local hydrophobic density of the binding site, as well as a hydrophobicity and normalized polarity score. The discovery of binding sites within structures that are encoded by overexpressed genes with clinical relevance is highly significant, as these binding sites can be used to develop novel cancer therapeutics that are likely to exhibit greater efficacy in humans.
In addition to druggability, the binding sites must be functionally important to serve as targets for small molecules. For example, small molecules that bind at enzyme active sites or at the interface of a protein–protein complex are expected to disrupt protein function. Protein kinases are one example of an enzyme class with druggable binding sites that occur at the enzyme active site.27 The ATP binding site of kinases is highly druggable with a SiteMap SiteScore and DrugScore above 1.1.28 There are fewer small-molecule inhibitors of protein–protein interactions, which is partly due to the lack of druggable binding sites at protein–protein interfaces. The only examples of PPI inhibitors that have shown in vivo efficacy, such as those targeting MDM2/p53 or Bcl-xL, bind to druggable binding sites (DrugScore of 0.92 and 0.82, respectively).29 Therefore, the identification of binding sites that are considered druggable at protein–protein interaction interfaces can provide new avenues to develop chemical probes and cancer therapeutics. Finally, it is worth mentioning that binding sites located outside an enzyme active site or protein–protein interface can also be functionally relevant. These binding sites may modulate protein function in an allosteric manner through long-range interactions that involve dynamic changes of the target protein.30–34 Allosteric inhibitors have been successfully used to inhibit kinase activity and in some cases, such as AKT, have shown more promise than competitive inhibitors.
Here, we collect gene expression profiles for 10 cancer types from TCGA and compare the expression profiles between cancer and normal samples to identify genes that are overexpressed in each cancer type. We search the Protein Data Bank for crystal structures of the protein products of these genes. We scan the surface of these proteins and identify binding sites. The functional relevance of these binding sites is explored by classifying them into known enzyme active sites, protein–protein interaction sites, or other sites that may lie outside of functional sites. To further explore the biological outcome of small molecules that bind to these binding sites, proteins harboring binding sites are further characterized in the context of a global PPI network and cancer signaling pathways to gain insight into the biological effect of binding at these binding sites. Patient data is used to investigate the correlation of overexpressed genes with clinical outcome. Our analysis uncovered new unexplored and potentially druggable and clinically-relevant protein targets. The study also provides new avenues for the rational design of small-molecule probes for well-established oncogenes. This is the first study that maps binding pockets on three-dimensional structures of the PDB within the context of cancer genomic data.
192 reference proteins using UniProt35 identifiers led to 7044 proteins that are encoded by TCGA overexpressed genes (Table 1 and Table S1, ESI†). For each cancer type, we identified the number of overexpressed genes with protein products having at least one high-resolution crystal structure by mining the Protein Data Bank (PDB). A total of 5069 unique protein chains on 2758 crystal structures from the PDB mapped to at least one of the 7044 overexpressed genes. In cases where more than one crystal structure was identified for a protein, the computer program CD-HIT was used to cluster the protein sequences of the crystal structures to find a set of non-redundant representative structures for the given protein. This resulted in 1624 unique crystal structures of proteins encoded by overexpressed genes. The total number of proteins encoded by overexpressed genes ranged from 839 for TNBC to 2096 for LUSC (Table 2). Overall, the percentage of differentially-expressed genes with at least one crystal structure spanning at least a portion of the gene sequence ranges from 20% in LUSC to 34% in GBM. Additionally, we introduce more stringent cutoffs to distinguish between proteins whose binding sites are suited to chemical probes and those that feature druggable binding sites by increasing the cutoffs of both the log2 fold change and the druggability property of a binding site. Using these increased cutoffs, we identify 5218 overexpressed proteins in TCGA, with only 1218 having a high-quality crystal structure at the PDB (Table 1).
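The two cutoff tiers used here (log2 FC ≥ 1.5 with DS ≥ 0.8, and log2 FC ≥ 2.0 with DS ≥ 1.0) can be sketched as follows. This is an illustrative sketch only: the pseudocount in `log2_fold_change`, the function names, and the tier labels are assumptions, and the actual differential-expression analysis in the study may have used a dedicated statistical package rather than a simple ratio of means.

```python
import math


def log2_fold_change(tumor_mean, normal_mean, pseudocount=1.0):
    """log2 ratio of mean tumor to mean normal expression.

    The pseudocount (an assumption of this sketch) avoids division by zero.
    """
    return math.log2((tumor_mean + pseudocount) / (normal_mean + pseudocount))


def cutoff_tier(log2_fc, drug_score):
    """Assign a binding site to one of the two cutoff tiers listed in Table 1."""
    if log2_fc >= 2.0 and drug_score >= 1.0:
        return "druggable binding site"   # stringent tier
    if log2_fc >= 1.5 and drug_score >= 0.8:
        return "binding site"             # relaxed tier
    return "below cutoffs"


# Hypothetical expression values and DrugScores, for illustration only
example_fc = log2_fold_change(tumor_mean=31.0, normal_mean=7.0)
example_tier = cutoff_tier(example_fc, drug_score=1.05)
```

A site that clears the stringent tier necessarily clears the relaxed one, which is why the stringent check comes first.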
| | TCGA druggable binding sites (log2 FC ≥ 2.0, DS ≥ 1.0) | TCGA binding sites (log2 FC ≥ 1.5, DS ≥ 0.8) | All proteins |
|---|---|---|---|
| Total number of proteins | 5218 | 7044 | 20 192 |
| Proteins with structure | 1218 | 1624 | 4124 |
| Proteins with druggable binding sites | 405 | 1044 | 2607 |
| Number of druggable binding sites | 502 | 2214 | 5498 |
| ENZ | 126 | 434 | |
| PPI | 55 | 231 | |
| OTH | 331 | 1576 | |
FC ≥ 2.0, DS ≥ 1.0)
| Cancer type | Cancer name | Total number of proteins | Proteins with structure | Proteins with druggable binding sites | Number of druggable binding sites | ENZ binding sites | PPI binding sites | OTH binding sites |
|---|---|---|---|---|---|---|---|---|
| BRCA | Breast invasive carcinoma | 1314 | 280 | 79 | 93 | 29 | 14 | 54 |
| COAD | Colon adenocarcinoma | 971 | 187 | 47 | 64 | 15 | 8 | 45 |
| GBM | Glioblastoma multiforme | 1168 | 429 | 161 | 145 | 34 | 13 | 99 |
| HNSC | Head and neck squamous cell carcinoma | 697 | 128 | 28 | 34 | 10 | 4 | 21 |
| KIRC | Kidney renal clear cell carcinoma | 1437 | 376 | 132 | 158 | 32 | 19 | 109 |
| LUAD | Lung adenocarcinoma | 1780 | 363 | 114 | 169 | 38 | 15 | 117 |
| LUSC | Lung squamous cell carcinoma | 2096 | 402 | 111 | 158 | 49 | 16 | 96 |
| THCA | Thyroid adenocarcinoma | 888 | 207 | 65 | 103 | 27 | 7 | 72 |
| TNBC | Triple-negative breast carcinoma | 839 | 211 | 51 | 64 | 21 | 10 | 38 |
| UCEC | Uterine corpus endometrioid carcinoma | 1449 | 332 | 95 | 136 | 37 | 17 | 86 |
Among the 1624 overexpressed proteins with at least one high-resolution human crystal structure, 1044 (∼64%) had at least one binding site (Table 1). Similarly, among the 1218 highly overexpressed proteins with crystal structures, 405 (∼33%) had at least one druggable binding site. For individual diseases, roughly 30% of proteins with crystal structures corresponding to highly overexpressed genes possessed at least one druggable binding site (Table 2). For example, 51 of the 211 proteins with a crystal structure in TNBC had a druggable binding site, as did 114 of the 363 in LUAD. At the lower cutoffs, we found more binding sites than proteins with crystal structures (2214 sites on 1624 proteins; Table 1), suggesting that although many of the proteins harbor more than one binding site, a large portion of these sites might only be suitable for chemical probes rather than drugs. At the more stringent cutoffs, an average of about 0.38 druggable binding sites were identified per protein with a crystal structure. For example, a total of 145 druggable binding sites were identified on the 429 proteins with crystal structures corresponding to differentially-expressed GBM genes. Among the most frequently overexpressed proteins with druggable binding sites are members of the matrix metalloproteinase (MMP) family and protein kinases related to cell signaling (Fig. S1, ESI†).
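The per-disease figures quoted above can be checked directly against Table 2. The short calculation below copies the relevant Table 2 columns and computes the number of druggable binding sites per protein with a crystal structure, both as a mean of the per-cancer ratios and as a pooled ratio. Which of the two the quoted average of about 0.38 refers to is not stated in the text, so treating it as the per-cancer mean is an assumption; the variable names are illustrative.

```python
# cancer type: (proteins with structure, proteins with druggable sites, druggable sites)
# Values copied from Table 2 (log2 FC >= 2.0, DS >= 1.0)
TABLE2 = {
    "BRCA": (280, 79, 93),
    "COAD": (187, 47, 64),
    "GBM": (429, 161, 145),
    "HNSC": (128, 28, 34),
    "KIRC": (376, 132, 158),
    "LUAD": (363, 114, 169),
    "LUSC": (402, 111, 158),
    "THCA": (207, 65, 103),
    "TNBC": (211, 51, 64),
    "UCEC": (332, 95, 136),
}

# Druggable sites per structured protein, cancer by cancer
sites_per_protein = {c: sites / structs for c, (structs, _, sites) in TABLE2.items()}

# Mean of the ten per-cancer ratios
mean_ratio = sum(sites_per_protein.values()) / len(sites_per_protein)

# Pooled ratio: total sites over total structured proteins
pooled_ratio = sum(v[2] for v in TABLE2.values()) / sum(v[0] for v in TABLE2.values())

# Fraction of structured proteins with at least one druggable site
druggable_fraction = {c: drug / structs for c, (structs, drug, _) in TABLE2.items()}
```

With these inputs the per-cancer mean is roughly 0.38 and the pooled ratio roughly 0.39, while the druggable fraction is about 24% for TNBC and 31% for LUAD, in line with the "roughly 30%" stated above.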

a The identified binding site is druggable (DS ≥ 1.0).

| Symbol | Name | PDB | Interaction partner symbol | Interaction partner name |
|---|---|---|---|---|
| ANPEP | Aminopeptidase N | 4FYSC | AGT | Angiotensinogen |
| CDA | Cytidine deaminase | 1MQ0A | CDA | Cytidine deaminase |
| CTSV | Cathepsin L2 | 3KFQCa | CSTA | Cystatin-A |
| DDR1 | Epithelial discoidin domain-containing receptor 1 | 3ZOSA | DDR1 | Epithelial discoidin domain-containing receptor 1 |
| DNM1 | Dynamin-1 | 2X2ED | DNM1 | Dynamin-1 |
| GAPDH | Glyceraldehyde-3-phosphate dehydrogenase | 1ZNQRa | GAPDH | Glyceraldehyde-3-phosphate dehydrogenase |
| GLA | Alpha-galactosidase A | 3HG3B | GLA | Alpha-galactosidase A |
| GSG2 | Serine/threonine-protein kinase haspin | 4OUCBa | HIST2H3A | Histone H3.2 |
| HDC | Histidine decarboxylase | 4E1OEa | HDC | Histidine decarboxylase |
| HOGA1 | 4-Hydroxy-2-oxoglutarate aldolase, mitochondrial | 3SO5Aa | HOGA1 | 4-Hydroxy-2-oxoglutarate aldolase, mitochondrial |
| KIF3C | Kinesin-like protein KIF3C | 3B6VB | KIF3C | Kinesin-like protein KIF3C |
| MMP14 | Matrix metalloproteinase-14 | 3MA2B | TIMP1 | Metalloproteinase inhibitor 1 |
| PCSK9 | Proprotein convertase subtilisin/kexin type 9 | 3BPSPa | PCSK9 | Proprotein convertase subtilisin/kexin type 9 |
| PGC | Gastricsin | 1AVFQ | PGC | Gastricsin |
| PGD | 6-Phosphogluconate dehydrogenase, decarboxylating | 2KJVC | PGD | 6-Phosphogluconate dehydrogenase, decarboxylating |
| PKLR | Pyruvate kinase PKLR | 4IMAC | PKLR | Pyruvate kinase PKLR |
| PNLIPRP2 | Pancreatic lipase-related protein 2 | 2PVSBa | PNLIPRP2 | Pancreatic lipase-related protein 2 |
| PNP | Purine nucleoside phosphorylase | 4ECEEa | PNP | Purine nucleoside phosphorylase |
| REN | Renin | 3G72Aa | REN | Renin |
| RNASE2 | Non-secretory ribonuclease | 2BEXB | RNH1 | Ribonuclease inhibitor |
| RRM1 | Ribonucleoside-diphosphate reductase large subunit | 2HNCB | RRM1 | Ribonucleoside-diphosphate reductase large subunit |
| SEPT3 | Neuronal-specific septin-3 | 3SOPB | SEPT3 | Neuronal-specific septin-3 |
| TDO2 | Tryptophan 2,3-dioxygenase | 4PW8Ea | TDO2 | Tryptophan 2,3-dioxygenase |
| UCHL1 | Ubiquitin carboxyl-terminal hydrolase isozyme L1 | 3IFWB | UBC | Polyubiquitin-C |

Fig. 1 Examples of proteins with both ENZ and PPI binding sites. Proteins are represented in cartoon format. The monomer structure with identified binding sites is in white. SiteMap binding sites are shown as spheres, bound ligands are shown as ball-and-sticks. (A) The homodimeric structure of CDA (PDB: 1mq0.B) with a bound inhibitor at a binding site classified as both ENZ and PPI. (B) The homodimeric structure of NAMPT (PDB: 4o0z.B) with an ENZ (peach, bound inhibitor) and a PPI (blue) binding site on the same domain. (C and D) The protein kinase (PDB: 2vwy.A) and ligand binding domain (PDB: 2hle.A) of EPHB4 featuring an ENZ and a PPI binding site on separate domains. The binding site on the protein kinase domain is not shown as spheres, but is occupied by the bound inhibitor (green).

Other proteins contain distinct enzyme and PPI binding sites (Table 4). Of these 24 proteins, only ALOX12 and NR1I2 feature both druggable ENZ and PPI binding sites. These proteins can be placed into two categories based on whether or not the binding sites are on the same protein domain. Some have ENZ and PPI binding sites on the same domain, such as the decarboxylase GAD1, which has a catalytic site as well as a PPI binding site at its homodimer interface. Another example is the phosphoribosyltransferase NAMPT, which is implicated in cancer metabolism,40 and has an ENZ binding site with an inhibitor bound as well as a PPI binding site at the homodimer interface (Fig. 1B). Other proteins have ENZ and PPI binding sites on separate domains. For example, the serine/threonine-protein kinase PLK1 has both an enzyme ATP binding site on its protein kinase domain and a binding site at the PPI interface on its POLO-box domain. A similar example is the receptor tyrosine kinase EPHB4, which has an enzyme ATP binding site on its protein kinase domain (Fig. 1C) and a binding site at the PPI interface with the ephrin ligand EFNB2 on its ligand binding domain (Fig. 1D). These binding sites may be used to develop allosteric modulators. Small molecules that bind to the PPI binding site may alter substrate binding to the active site. Conversely, a small-molecule inhibitor of enzyme activity may affect the protein–protein interaction of the protein.

a Both ENZ and PPI binding sites are druggable (DS ≥ 1.0).

| Symbol | Name | PDB | Interaction partner symbol | Interaction partner name |
|---|---|---|---|---|
| ACMSD | 2-Amino-3-carboxymuconate-6-semialdehyde decarboxylase | 4IH3A | ACMSD | 2-Amino-3-carboxymuconate-6-semialdehyde decarboxylase |
| ADH1C | Alcohol dehydrogenase 1C | 1HSOA | ADH1C | Alcohol dehydrogenase 1C |
| ALOX12 | Arachidonate 12-lipoxygenase, 12S-type | 3D3LBa | ALOX12 | Arachidonate 12-lipoxygenase, 12S-type |
| AOC1 | Amiloride-sensitive amine oxidase [copper-containing] | 3MPHB | AOC1 | Amiloride-sensitive amine oxidase [copper-containing] |
| BHMT | Betaine-homocysteine S-methyltransferase 1 | 1LT7B | BHMT | Betaine-homocysteine S-methyltransferase 1 |
| CTSE | Cathepsin E | 1TZSP | CTSE | Cathepsin E |
| DDC | Aromatic-L-amino-acid decarboxylase | 3RBFB | DDC | Aromatic-L-amino-acid decarboxylase |
| DDX39A | ATP-dependent RNA helicase DDX39A | 1T6NB | DDX39A | ATP-dependent RNA helicase DDX39A |
| EPHB2 | Ephrin type-B receptor 2 | 2QBXD | Antagonist peptide (Ephrin binding site) | |
| EPHB4 | Ephrin type-B receptor 4 | 2HLEB | EFNB2 | Ephrin-B2 |
| GAD1 | Glutamate decarboxylase 1 | 3VP6A | GAD1 | Glutamate decarboxylase 1 |
| GPI | Glucose-6-phosphate isomerase | 1JIQB | GPI | Glucose-6-phosphate isomerase |
| HK2 | Hexokinase-2 | 2NZTA | HK2 | Hexokinase-2 |
| HMGCS2 | Hydroxymethylglutaryl-CoA synthase, mitochondrial | 2WYAD | HMGCS2 | Hydroxymethylglutaryl-CoA synthase, mitochondrial |
| NAMPT | Nicotinamide phosphoribosyltransferase | 4O0ZA | NAMPT | Nicotinamide phosphoribosyltransferase |
| NR1I2 | Nuclear receptor subfamily 1 group I member 2 | 3CTBBa | NR1I2 | Nuclear receptor subfamily 1 group I member 2 |
| NTRK1 | High affinity nerve growth factor receptor | 1WWWV | NGF | Beta-nerve growth factor |
| PLK1 | Serine/threonine-protein kinase PLK1 | 1Q4KE | Phosphopeptide | |
| PYGL | Glycogen phosphorylase, liver form | 2ZB2B | PYGL | Glycogen phosphorylase, liver form |
| RHOC | Rho-related GTP-binding protein RhoC | 3KZ1A | ARHGEF11 | Rho guanine nucleotide exchange factor 11 |
| SULT1C2 | Sulfotransferase 1C2 | 3BFXA | SULT1C2 | Sulfotransferase 1C2 |
| TH | Tyrosine 3-monooxygenase | 2XSNC | TH | Tyrosine 3-monooxygenase |
| TPH2 | Tryptophan 5-hydroxylase 2 | 4VO6B | TPH2 | Tryptophan 5-hydroxylase 2 |
| UPP1 | Uridine phosphorylase 1 | 3EUFB | UPP1 | Uridine phosphorylase 1 |
Many OTH binding sites occur on proteins with existing ENZ and/or PPI binding sites, and may be potential allosteric sites for protein inhibition. When the enzyme active site is well characterized on a protein surface, additional binding sites represent opportunities for allosteric inhibition of the protein's function. For example, the sulfotransferase SULT2B1 has four binding sites on its protein surface (Fig. 2A). The ENZ binding site is not shown in the figure but encompasses the adenosine nucleotide. Three additional OTH binding sites were detected on the surface of the protein and represent potential allosteric sites. Another example of a protein with both ENZ and OTH binding sites is the protein kinase RET (Fig. 2B). In this structure, a known inhibitor occupies the ENZ ATP binding site, while an additional potentially allosteric binding site is formed near the αC helix. Similarly, there are proteins with both PPI and OTH binding sites. One example is the PPI between CHP2 and SLC9A1 (Fig. 2C), where an α-helix from SLC9A1 occupies two PPI binding sites on CHP2. An additional potentially allosteric OTH binding site is formed on the backside of CHP2. Another example is the protein complex formed between PLAUR, PLAU, and VTN (Fig. 2D). In this example, binding sites were found on the monomer structure of the apo protein. After superimposition of additional crystal structures back onto the representative structure, two of the three detected binding sites were classified as PPI. The two separate PPI binding sites occupy the respective interfaces between PLAUR–PLAU and PLAUR–VTN. An additional OTH binding site was also detected on the protein surface and represents a potential allosteric site.

Fig. 2 Examples of proteins with potentially allosteric OTH binding sites. Proteins are represented in cartoon format. The monomer structure with identified binding sites is in white. SiteMap binding sites are shown as spheres, bound ligands are shown as ball-and-sticks. (A) SULT2B1 (PDB: 1q1q.A) with an ENZ binding site occupied by a nucleotide and three additional OTH binding sites (green, blue, yellow). (B) RET (PDB: 2iiv.A) with an ENZ binding site occupied by the bound inhibitor and an additional OTH binding site (green). (C) CHP2 (PDB: 2bec.A) with two PPI binding sites (green, blue) at the interface with SLC9A1 (PDB: 2bec.B) and an additional OTH binding site (peach). (D) The superimposed structure of PLAUR (PDB: 1ywh.M) with two PPI binding sites at the interfaces with VTN (PDB: 3bt1.B, green) and PLAU (PDB: 3bt1.A, yellow) and an additional OTH binding site (peach).

a The binding site is druggable (DS ≥ 1.0).

| Symbol | Name | Binding site | Predicted PPI model | Predicted PPI partner symbol | Predicted PPI partner name |
|---|---|---|---|---|---|
| AK3 | GTP:AMP phosphotransferase AK3, mitochondrial | 1ZD8A2 | 2BWJ | AK5 | Adenylate kinase isoenzyme 5 |
| ANK1 | Ankyrin-1 | 1N11A3 | 2JAB | ILK | Integrin-linked protein kinase |
| CHN1 | N-chimaerin | 3CXLA3 | 1OW3 | RAC1 | Ras-related C3 botulinum toxin substrate 1 |
| HOGA1 | 4-Hydroxy-2-oxoglutarate aldolase, mitochondrial | 3S5OA1a | 3DAQ | HOGA1 | 4-Hydroxy-2-oxoglutarate aldolase, mitochondrial |
| HPD | 4-Hydroxyphenylpyruvate dioxygenase | 3ISQA1a | 1SQI | HPDL | 4-Hydroxyphenylpyruvate dioxygenase-like protein |
| HPD | 4-Hydroxyphenylpyruvate dioxygenase | 3ISQA5 | 1SQI | HPDL | 4-Hydroxyphenylpyruvate dioxygenase-like protein |
| LCN | Lipocalin-1 | 3EYCA1a | 2F91 | OVCH1 | Ovochymase-1 |
| NCS1 | Neuronal calcium sensor 1 | 1G8IB1 | 1AUI | PPP3CA | Serine/threonine-protein phosphatase 2B catalytic subunit alpha isoform |
| NCS1 | Neuronal calcium sensor 1 | 1G8IB2a | 1AUI | PPP3CA | Serine/threonine-protein phosphatase 2B catalytic subunit alpha isoform |
| RAP1GAP | Rap1 GTPase-activating protein 1 | 1SRQA1 | 3BRW | RAP1A | Ras-related protein Rap-1A |
| RHCG | Ammonium transporter Rh type C | 3HD6A1 | 2NUU | RHAG | Ammonium transporter Rh type A |
| RHCG | Ammonium transporter Rh type C | 3HD6A4 | 2NUU | RHAG | Ammonium transporter Rh type A |
| RHCG | Ammonium transporter Rh type C | 3HD6A5 | 2NUU | RHAG | Ammonium transporter Rh type A |
| SHMT2 | Serine hydroxymethyltransferase, mitochondrial | 3OU5A1 | 3GBX | SHMT2 | Serine hydroxymethyltransferase, mitochondrial |
| STXBP2 | Syntaxin-binding protein 2 | 4CCAA2 | 3C98 | STX1A | Syntaxin-1A |
| THEM5 | Acyl-coenzyme A thioesterase THEM5 | 4AE7A1 | 1Q4T | THEM4 | Acyl-coenzyme A thioesterase THEM4 |
| ZBTB32 | Zinc finger and BTB domain-containing protein 32 | 3M5BB1 | 3BIM | BCL6 | B-cell lymphoma 6 protein |
068 non-redundant protein–protein interactions. To address the robustness of the network, we further filtered the interactions by keeping only those that appeared in at least two of the seven databases. This resulted in a network with 38 164 non-redundant protein–protein interactions. We then identified the network properties of each protein within this network to measure the centrality and essentiality of each protein to the overall network. Among the topological properties of a given protein are its degree, which describes the number of interactions that are formed by that protein, and its betweenness centrality, which describes the number of shortest paths that go through the given protein. In a biological context, betweenness centrality is a measure of the available paths that a signal can travel through a given network.52 Thus, proteins with high betweenness are thought to be essential to biological function and are frequently targeted in drug discovery.53 For example, TP53 has a betweenness centrality and degree of 4.1 × 10⁻² and 236, respectively, while the corresponding values for EGFR are 2.3 × 10⁻² and 181. We examine the topological properties of all proteins that are overexpressed (log2 FC ≥ 1.5) and whose expression correlates with patient outcome (Fig. 4C). Of these 1343 proteins, 1001 (∼75%) did not have a high-quality crystal structure and an additional 141 (∼10%) had a structure but no binding sites. Of the remaining proteins, 117 (9%) and 84 (6%) have binding sites and druggable binding sites, respectively. When the differential-expression cutoff is increased to 2 and the minimum DrugScore is increased to 1.0, 60 proteins have at least one druggable binding site (Fig. 4F). Among the proteins with the highest centrality and degree are PLK1, KPNA2, AURKA, and AURKB.
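The consensus filtering and the two topological properties described above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the function names, the toy node labels A to D, and the three toy "databases" are hypothetical, and the betweenness routine is a standard Brandes-style computation for unweighted, undirected graphs that returns raw shortest-path counts. The values reported in the text appear to be normalized, and the normalization convention used by the authors is not stated here.

```python
from collections import Counter, deque


def consensus_edges(databases, min_support=2):
    """Keep undirected interactions reported by at least `min_support` databases."""
    support = Counter()
    for edges in databases:
        # frozenset treats (A, B) and (B, A) as one interaction;
        # the set comprehension counts each database at most once per pair
        for pair in {frozenset(e) for e in edges}:
            if len(pair) == 2:      # ignore self-interactions
                support[pair] += 1
    return [tuple(sorted(p)) for p, n in support.items() if n >= min_support]


def build_graph(edges):
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def betweenness(graph):
    """Unnormalized betweenness: shortest paths passing through each node."""
    score = dict.fromkeys(graph, 0.0)
    for s in graph:
        order, preds = [], {v: [] for v in graph}
        sigma = dict.fromkeys(graph, 0)     # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:                        # breadth-first search from s
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(graph, 0.0)
        for w in reversed(order):           # accumulate dependencies
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair of endpoints was counted from both ends
    return {v: c / 2 for v, c in score.items()}


# Toy data (hypothetical): three "databases" over placeholder nodes A-D
toy_databases = [
    [("A", "B"), ("B", "C"), ("A", "D")],
    [("B", "A"), ("C", "B"), ("C", "D")],
    [("C", "D")],
]
toy_edges = consensus_edges(toy_databases)
toy_graph = build_graph(toy_edges)
toy_degree = {v: len(nbrs) for v, nbrs in toy_graph.items()}
toy_betweenness = betweenness(toy_graph)
```

In the toy network the interaction A–D is dropped because only one database reports it, leaving the path A–B–C–D, in which the two inner nodes carry all the shortest paths between the others.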
a Has a binding site that has a co-crystallized small molecule inhibitor.

| Symbol | Name | Degree | Betweenness (×10⁶) | Signaling pathways | Disease | FC | HR (95% CI) | Publications | Structure | All sites | ENZ | PPI | OTH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PLK1 | Serine/threonine-protein kinase PLK1 | 74 | 5876.7 | Cell cycle | BRCA | 3.7 | 1.8 (1.2–2.7) | 776 | 2OWBA | 1a | 1 | 0 | 0 |
| | | | | | KIRC | 2.3 | 2.4 (1.7–3.3) | | | | | | |
| | | | | | LUAD | 3.5 | 1.8 (1.2–2.6) | | | | | | |
| CDC20 | Cell division cycle protein 20 homolog | 37 | 1379.5 | Cell cycle, ubiquitin mediated proteolysis | LUAD | 4.0 | 1.6 (1.1–2.3) | 296 | 4GGDB | 1 | 0 | 0 | 1 |
| CCNB1 | G2/mitotic-specific cyclin-B1 | 34 | 934.3 | Cell cycle, p53 | BRCA | 2.6 | 1.6 (1.1–2.3) | 2214 | 2B9RB | 1 | 0 | 0 | 1 |
| | | | | | LUAD | 3.0 | 1.8 (1.2–2.6) | | | | | | |
| AURKA | Aurora kinase A | 28 | 1684.1 | | BRCA | 3.2 | 1.6 (1.1–2.3) | 686 | 2J4ZB | 1a | 1 | 0 | 0 |
| | | | | | LUAD | 2.8 | 1.5 (1.0–2.1) | | | | | | |
| MAD2L1 | Mitotic spindle assembly checkpoint protein MAD2A | 27 | 1194.3 | Cell cycle | LUAD | 2.7 | 1.7 (1.2–2.4) | 221 | 2V64F | 1 | 0 | 1 | 0 |
| AURKB | Aurora kinase B | 25 | 1199.7 | | KIRC | 2.9 | 2.7 (1.9–3.8) | 500 | 4AF3A | 2a | 1 | 0 | 1 |
| | | | | | LUAD | 3.8 | 1.5 (1.0–2.1) | | | | | | |
| NEK2 | Serine/threonine-protein kinase Nek2 | 21 | 606.0 | | LUAD | 3.9 | 1.8 (1.2–2.6) | 115 | 2XK4A | 1 | 0 | 0 | 1 |
| BUB1 | Mitotic checkpoint serine/threonine-protein kinase BUB1 | 21 | 401.1 | | BRCA | 3.6 | 1.6 (1.1–2.4) | 397 | 4R8QA | 3 | 1 | 0 | 2 |
| | | | | | KIRC | 2.3 | 2.1 (1.5–2.9) | | | | | | |
| | | | | | LUAD | 3.0 | 1.8 (1.2–2.6) | | | | | | |
| ZAP70 | Tyrosine-protein kinase ZAP-70 | 19 | 278.3 | Ras | KIRC | 3.2 | 1.6 (1.2–2.3) | 702 | 4K2RA | 2 | 0 | 0 | 2 |
| CHEK1 | Serine/threonine-protein kinase Chk1 | 18 | 580.9 | Cell cycle, p53 | LUAD | 2.2 | 1.5 (1.0–2.1) | 77 | 2R0UA | 1a | 1 | 0 | 0 |
| CCNE1 | G1/S-specific cyclin-E1 | 13 | 111.8 | Cell cycle, p53, PI3K-Akt | BRCA | 3.0 | 2.1 (1.4–3.2) | 280 | 1W98B | 1 | 0 | 1 | 0 |
| | | | | | LUAD | 3.5 | 1.5 (1.1–2.2) | | | | | | |
| C3 | Complement C3 | 9 | 1191.2 | | KIRC | 3.4 | 1.5 (1.1–2.0) | 3468 | 2WIIA; 2WIIB | 6 | 0 | 2 | 4 |
| F2 | Prothrombin | 9 | 443.3 | | KIRC | 3.8 | 2.5 (1.8–3.4) | 3338 | 4NZQA | 1 | 0 | 0 | 1 |
| TF | Serotransferrin | 9 | 353.3 | HIF-1 | KIRC | 4.4 | 2.0 (1.5–2.8) | 2556 | 3V8XB | 2 | 0 | 0 | 2 |
| CCNA2 | Cyclin-A2 | 9 | 80.6 | Cell cycle | KIRC | 2.1 | 2.2 (1.6–3.0) | 312 | 2BPMD | 1 | 0 | 0 | 1 |
| | | | | | LUAD | 3.0 | 2.1 (1.4–3.0) | | | | | | |
| HNF4A | Hepatocyte nuclear factor 4-alpha | 9 | 55.7 | | HNSC | 2.3 | 1.4 (1.0–1.9) | 173 | 4IQRE | 1 | 0 | 0 | 1 |
| CTLA4 | Cytotoxic T-lymphocyte protein 4 | 8 | 46.5 | | KIRC | 3.0 | 1.7 (1.2–2.4) | 1603 | 2X44D | 1 | 0 | 0 | 1 |
| TTK | Dual specificity protein kinase TTK | 6 | 11.2 | Cell cycle | BRCA | 3.1 | 1.5 (1.0–2.3) | 155 | 2ZMDA | 1a | 1 | 0 | 0 |
| | | | | | LUAD | 3.8 | 1.7 (1.2–2.5) | | | | | | |
| ALOX5 | Arachidonate 5-lipoxygenase | 5 | 268.6 | | KIRC | 2.1 | 1.7 (1.2–2.3) | 403 | 3O8YA | 5 | 1 | 0 | 4 |
| KIF11 | Kinesin-like protein KIF11 | 4 | 0.9 | | LUAD | 2.9 | 1.6 (1.1–2.3) | 144 | 4AP0A | 1 | 1 | 0 | 0 |
| ITGAM | Integrin alpha-M | 3 | 1606.8 | | KIRC | 2.1 | 1.4 (1.0–1.9) | 1206 | 4M76B | 1 | 0 | 0 | 1 |
| ADA | Adenosine deaminase | 3 | 50.2 | | KIRC | 2.4 | 2.1 (1.5–2.9) | 1838 | 3IARA | 1 | 0 | 0 | 1 |
| TOP2A | DNA topoisomerase 2-alpha | 3 | 13.2 | | KIRC | 2.1 | 1.7 (1.3–2.4) | 367 | 4FM9A | 3 | 0 | 0 | 3 |
| | | | | | LUAD | 4.1 | 1.5 (1.0–2.1) | | | | | | |
| NNMT | Nicotinamide N-methyltransferase | 2 | 1.8 | | KIRC | 4.0 | 1.7 (1.3–2.4) | 76 | 2IIPA | 1 | 1 | 0 | 0 |
| MMP9 | Matrix metalloproteinase-9 | 2 | 0.0 | | KIRC | 4.3 | 1.8 (1.3–2.5) | 6230 | 1L6JA | 1 | 1 | 0 | 0 |
| ITGAX | Integrin alpha-X | 1 | 0.0 | | KIRC | 3.4 | 1.5 (1.1–2.1) | 733 | 4NENA | 1 | 0 | 0 | 1 |
| FABP5 | Fatty acid-binding protein, epidermal | 1 | 0.0 | PPAR | KIRC | 2.1 | 1.7 (1.3–2.4) | 88 | 4LKPA | 1 | 0 | 0 | 1 |
| AKR1B10 | Aldo-keto reductase family 1 member B10 | 1 | 0.0 | | KIRC | 2.8 | 2.0 (1.4–2.7) | 118 | 4JIIX | 1a | 1 | 0 | 0 |
| CYP2D6 | Cytochrome P450 2D6 | 0 | 0.0 | | KIRC | 2.8 | 1.7 (1.2–2.3) | 915 | 3QM4A | 4a | 0 | 0 | 4 |
| MMP12 | Macrophage metalloelastase | 0 | 0.0 | | KIRC | 2.9 | 1.7 (1.2–2.3) | 113 | 3BA0A | 1 | 0 | 0 | 1 |

| Protein | Network | Cancer | Binding site | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Symbol | Name | Degree | Betweenness (×10⁶) | Signaling pathways | Disease | FC | HR (95% CI) | Publications | Structure | All | ENZ | PPI | OTH |
| a Has a binding site that has a co-crystallized small molecule inhibitor. | |||||||||||||
| KPNA2 | Importin subunit alpha-1 | 50 | 3152.3 | | BRCA | 2.2 | 1.8 (1.2–2.7) | 55 | 4E4VB | 1 | 0 | 0 | 1 |
| | | | | | LUAD | 2.0 | 1.5 (1.0–2.1) | | | | | | |
| FBP1 | Fructose-1,6-bisphosphatase 1 | 16 | 820.8 | | GBM | 2.6 | 1.5 (1.0–2.3) | 34 | 2FHYA | 1 | 0 | 0 | 1 |
| KIF23 | Kinesin-like protein KIF23 | 10 | 61.5 | | LUAD | 2.9 | 1.6 (1.1–2.4) | 25 | 3VHXH | 1 | 0 | 0 | 1 |
| ORM1 | Alpha-1-acid glycoprotein 1 | 7 | 59.0 | | KIRC | 6.1 | 1.4 (1.0–1.9) | 25 | 3KQ0A | 1 | 0 | 0 | 1 |
| PTPN22 | Tyrosine-protein phosphatase non-receptor type 22 | 6 | 12.7 | | KIRC | 2.2 | 1.5 (1.1–2.1) | 23 | 4J51B | 1 | 0 | 0 | 1 |
| EPHA8 | Ephrin type-A receptor 8 | 5 | 99.1 | | UCEC | 5.4 | 4.7 (1.0–21.5) | 14 | 3KULB | 1 | 1 | 0 | 0 |
| ERO1L | ERO1-like protein alpha | 4 | 10.7 | | LUAD | 2.2 | 1.7 (1.2–2.4) | 13 | 3AHQA | 2 | 1 | 0 | 1 |
| PKMYT1 | Membrane-associated tyrosine- and threonine-specific cdc2-inhibitory kinase | 4 | 6.1 | Cell cycle | KIRC | 2.9 | 1.9 (1.4–2.7) | 14 | 3P1AA | 1 | 1 | 0 | 0 |
| TDO2 | Tryptophan 2,3-dioxygenase | 3 | 403.6 | | KIRC | 2.7 | 1.5 (1.1–2.1) | 52 | 4PW8F | 1 | 1 | 1 | 0 |
| GCKR | Glucokinase regulatory protein | 2 | 200.8 | | KIRC | 2.9 | 2.5 (1.8–3.5) | 15 | 4OLHA | 3 | 0 | 0 | 3 |
| SERPINB3 | Serpin B3 | 2 | 4.6 | | LUAD | 2.2 | 1.7 (1.2–2.4) | 35 | 2ZV6A | 3 | 0 | 0 | 3 |
| ADAMTS4 | A disintegrin and metalloproteinase with thrombospondin motifs 4 | 2 | 2.2 | | KIRC | 2.0 | 1.6 (1.2–2.2) | 18 | 2RJPC | 1a | 1 | 0 | 0 |
| SERPINB4 | Serpin B4 | 2 | 1.7 | | LUAD | 4.0 | 1.5 (1.0–2.1) | 7 | 2ZV6A | 3 | 0 | 0 | 3 |
| | | | | | UCEC | 5.0 | 4.6 (1.2–17.1) | | | | | | |
| MELK | Maternal embryonic leucine zipper kinase | 1 | 0.0 | | BRCA | 3.9 | 1.6 (1.0–2.3) | 63 | 4UMUA | 2a | 1 | 0 | 1 |
| | | | | | LUAD | 2.7 | 1.7 (1.2–2.3) | | | | | | |
| | | | | | KIRC | 4.0 | 1.5 (1.0–2.1) | | | | | | |
| | | | | | UCEC | 3.9 | 3.6 (0.9–3.6) | | | | | | |
| PLCB2 | 1-Phosphatidylinositol 4,5-bisphosphate phosphodiesterase beta-2 | 1 | 0.0 | Calcium, Wnt, Phosphatidylinositol | KIRC | 2.2 | 1.5 (1.1–2.0) | 8 | 2ZKMX | 4 | 0 | 0 | 4 |
| PCK1 | Phosphoenolpyruvate carboxykinase, cytosolic [GTP] | 1 | 0.0 | Citrate, PPAR, PI3K-Akt | LUAD | 3.8 | 1.5 (1.0–2.1) | 15 | 2GMVA | 1 | 0 | 0 | 1 |
| TNFAIP8L2 | Tumor necrosis factor alpha-induced protein 8-like protein 2 | 1 | 0.0 | | KIRC | 2.4 | 1.6 (1.1–2.2) | 3 | 3F4MA | 1 | 0 | 0 | 1 |
| ANXA8L2 | Annexin A8-like protein 2 | 0 | 0.0 | | KIRC | 2.1 | 1.7 (1.3–2.4) | 0 | 1W45B | 1 | 0 | 0 | 1 |
| GSG2 | Serine/threonine-protein kinase haspin | 0 | 0.0 | | BRCA | 2.0 | 1.6 (1.1–2.4) | 5 | 3DLZA | 1a | 1 | 1 | 0 |
| RNASET2 | Ribonuclease T2 | 0 | 0.0 | | KIRC | 3.2 | 1.9 (1.3–2.6) | 27 | 3T0OA | 1 | 0 | 0 | 1 |
| NCF1C | Putative neutrophil cytosol factor 1C | 0 | 0.0 | | GBM | 2.5 | 1.8 (1.2–2.7) | 0 | 1NG2A; 1KQ6A | 2 | 0 | 2 | 0 |
| XDH | Xanthine dehydrogenase/oxidase | 0 | 0.0 | | KIRC | 2.2 | 1.5 (1.0–2.1) | 34 | 2E1QD | 9 | 1 | 0 | 8 |
| | | | | | LUAD | 4.5 | 1.9 (1.4–2.7) | | | | | | |
| CHI3L2 | Chitinase-3-like protein 2 | 0 | 0.0 | | GBM | 4.8 | 1.9 (1.3–2.9) | 5 | 4P8XA | 1 | 0 | 0 | 1 |
| | | | | | KIRC | 2.5 | 2.0 (1.5–2.8) | | | | | | |
| ALDH1L1 | Cytosolic 10-formyltetrahydrofolate dehydrogenase | 0 | 0.0 | | LUSC | 2.5 | 1.4 (1.0–1.9) | 23 | 2CFIA | 1 | 1 | 0 | 0 |
| AKR1D1 | 3-Oxo-5-beta-steroid 4-dehydrogenase | 0 | 0.0 | | KIRC | 4.5 | 1.6 (1.2–2.2) | 5 | 3UZWB | 1 | 1 | 0 | 0 |
| SERPIND1 | Heparin cofactor 2 | 0 | 0.0 | | KIRC | 3.8 | 1.6 (1.2–2.2) | 5 | 1JMOA | 2 | 0 | 0 | 2 |
| TCN1 | Transcobalamin-1 | 0 | 0.0 | | LUAD | 6.9 | 1.5 (1.0–2.1) | 10 | 4KKJA | 1 | 0 | 0 | 1 |
| PARP15 | Poly [ADP-ribose] polymerase 15 | 0 | 0.0 | | KIRC | 3.1 | 1.5 (1.1–2.1) | 5 | 3GEYA | 1a | 1 | 0 | 0 |
Among these targets, 26 have been previously probed with small-molecule ligands and X-ray crystallography (Table S4, ESI†). Interestingly, many of these co-crystallized structures occur at binding sites at or below our higher DrugScore cutoff of 1.0, suggesting that a more stringent cutoff may discard otherwise druggable binding sites. Additionally, we mapped these druggable binding sites to conserved protein domains and found that they mainly belong to the protein kinase, serpin, kinesin, and peptidase domains (Table S5, ESI†). When we consider only those without co-crystallized small-molecule inhibitors, protein kinases and trypsin domains are removed. The majority of binding sites across both targeted and untargeted proteins are classified as OTH. In well-studied systems where the active site is known, these OTH sites represent opportunities for allosteric regulation.
We next examined the secondary structure of the residues that compose the individual binding sites of these proteins, grouped by binding site annotation. By examining the residues around a binding site, we generalized the type of secondary structure that forms the binding site itself (Fig. S7, ESI†). Among all proteins, with or without small-molecule inhibitors, the majority of binding sites were either a mixture of secondary structures or random coils. Combined, these two categories generally make up the large majority of binding sites of each type. In each case, the least frequently observed secondary structures among these binding sites were the helix-like (i.e. α-helix, 3₁₀ helix, or π-helix) and sheet-like structures (i.e. beta bridges and beta bulges). We then examined the secondary structures of the binding-partner residues inside PPI binding sites. About 27% and 46% of the binding-partner residues in the binding site were coil-like and helical (α-helix, 3₁₀ helix, or π-helix), respectively. Only 10% of the binding sites were characterized by strand-like structures (β-sheet or β-bridge). The remaining PPI binding sites were a combination of these.
Fig. 5 Proteins with missense mutations. (A) Missense mutations were mapped to patients in 7 of 10 diseases (COAD, THCA, and UCEC not included). Individual mutations were mapped to the protein structure and classified as being adjacent to the binding site, elsewhere on the protein surface, or buried in the interior of the protein structure. (B) Percentage of samples with missense mutations adjacent to a binding site in a given disease, showing the top 20 proteins rank-ordered by the sum of frequencies. (C) The W167L (green stick) mutation on the PPI interface between MAD2L1 (white) and MAD1L1 (cyan) is shown in cartoon (PDB ID: 1GO4). The PPI binding site is shown as transparent spheres. (D) The R121P (green stick) mutation adjacent to the DNA-binding OTH site (tan, transparent spheres) on EXO1 (white cartoon) (PDB ID: 3QEB). DNA in the binding site from the crystal structure is also shown as cartoon. (E) The counts of missense mutations at the amino acid level, classified as being adjacent to the binding site, elsewhere on the surface of the protein, or buried in the protein interior. The original amino acid is listed row-wise and the subsequent mutation is listed column-wise.
We matched the proteins that carry missense mutations with their gene expression levels and correlation with patient outcome. We find 29 binding sites on 26 proteins that are (i) overexpressed (log2 fold change ≥2); (ii) correlate with patient outcome (hazard ratio >1); and (iii) have a missense mutation adjacent to a binding site in a given disease (Table 8). These 29 binding sites include 9 ENZ, 3 PPI, and 17 OTH pockets. Among these mutations adjacent to binding sites is the W167L mutation on the PPI interface between MAD2L1 and MAD1L1 in LUAD (Fig. 5C). This interaction is part of the spindle assembly checkpoint in the cell cycle.62 Considering the significant reduction in contact area upon replacing tryptophan with leucine, and the fact that tryptophan residues often occur at protein–protein interaction interfaces, we expect that this mutation may impair the protein–protein interaction. Another example is the R121P mutation adjacent to the DNA-binding OTH binding site on EXO1 in LUAD (Fig. 5D). This DNA-binding protein is also involved in DNA repair during cell cycle regulation.63 Unlike the previous mutation, this substitution changes the charge of the residue: arginine carries a positively charged group, whereas proline is a neutral, non-polar amino acid.
| Symbol | Name | Cancer type | Mutation | Binding site | Type |
|---|---|---|---|---|---|
| ADH1C | Alcohol dehydrogenase 1C | LUAD | G205C | 1HSZA1 | ENZ |
| ADORA2A | Adenosine receptor A2a | BRCA | R293P | 3VG9A5 | PPI |
| C3 | Complement C3 | KIRC | C873Y | 2WIIB4 | OTH |
| CA6 | Carbonic anhydrase 6 | LUSC | H113Q | 3FE4A1 | ENZ |
| CCNA2 | Cyclin-A2 | LUAD | L341F | 2BPMD1 | OTH |
| CCNE1 | G1/S-specific cyclin-E1 | BRCA | A338T | 1W98B2 | OTH |
| CHEK1 | Serine/threonine-protein kinase Chk1 | LUAD | V46A | 2R0UA1 | ENZ |
| CYP2A6 | Cytochrome P450 2A6 | LUAD | V306I | 2PG6B1 | OTH |
| CYP2D6 | Cytochrome P450 2D6 | KIRC | L213P | 3QM4A1 | OTH |
| EXO1 | Exonuclease 1 | LUAD | R121P | 3QEBZ1 | OTH |
| F2 | Prothrombin | KIRC | R543L | 4NZQA3 | OTH |
| KIF15 | Kinesin-like protein KIF15 | LUSC | G41A | 4BN2C2 | OTH |
| KIFC1 | Kinesin-like protein KIFC1 | LUAD | G568W | 2REPA1 | ENZ |
| MAD2L1 | Mitotic spindle assembly checkpoint protein MAD2A | LUAD | W167L | 2V64F1 | PPI |
| MELK | Maternal embryonic leucine zipper kinase | BRCA | Q115R | 4UMUA2 | OTH |
| | | LUAD | V271A | 4UMUA2 | OTH |
| NEK2 | Serine/threonine-protein kinase Nek2 | LUAD | R140L | 2XK4A1 | ENZ |
| PCK1 | Phosphoenolpyruvate carboxykinase, cytosolic [GTP] | LUAD | R137H | 2GMVA3 | OTH |
| | | | A287S | 2GMVA1 | ENZ |
| | | | G289W | | |
| PSPH | Phosphoserine phosphatase | LUSC | M52T | 1L8OA1 | ENZ |
| RHCG | Ammonium transporter Rh type C | LUAD | Q107H | 3HD6A1 | PPI |
| RRM2 | Ribonucleoside-diphosphate reductase subunit M2 | LUAD | E207Q | 2UW2A2 | OTH |
| SERPINB3 | Serpin B3 | LUAD | A45T | 2ZV6A3 | OTH |
| SERPINB4 | Serpin B4 | LUAD | S33N | 2ZV6A2 | OTH |
| SULT4A1 | Sulfotransferase 4A1 | KIRC | M80R | 1ZD1A1 | ENZ |
| TOP2A | DNA topoisomerase 2-alpha | LUAD | E712V | 4FM9A4 | OTH |
| | | | R736L | 4FM9A7 | OTH |
| TTK | Dual specificity protein kinase TTK | LUAD | C604F | 2ZMDA1 | ENZ |
| | | BRCA | G666E | 2ZMDA1 | ENZ |
| XDH | Xanthine dehydrogenase/oxidase | LUAD | C43F | 2E1QD3 | OTH |
| | | | N461T | 2E1QD8 | OTH |
We examined the mutation rates of individual amino acids by comparing the wild-type and mutated amino acids at each of the three locations on the protein (Fig. 5E). We find differences in the relative frequencies of specific point mutations among the three locations. For example, mutations to alanine are less favored in the pocket or on the surface of the protein than they are in the interior, especially at charged or polar residues. Among the most common mutations in the binding site and on the surface is the substitution of lysine with glutamic acid, which occurs at a much lower frequency in the interior of the protein.
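The tabulation behind a substitution matrix of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the location labels are hypothetical, mutation codes are assumed to follow the single-letter form used in Table 8 (e.g. W167L), and the two example mutations are the ones reported adjacent to binding sites in the text.

```python
from collections import Counter


def parse_mutation(code):
    """Split a code such as 'W167L' into (wild-type, mutant) residues."""
    return code[0], code[-1]


def substitution_counts(mutations):
    """Count (wild-type, mutant) pairs separately for each location.

    `mutations` is an iterable of (code, location) pairs; the location
    labels are assumed, e.g. 'binding_site', 'surface', 'interior'.
    """
    counts = {}
    for code, location in mutations:
        counts.setdefault(location, Counter())[parse_mutation(code)] += 1
    return counts


# Two mutations reported adjacent to binding sites (Table 8).
example = [("W167L", "binding_site"), ("R121P", "binding_site")]
matrix = substitution_counts(example)
```

Each per-location `Counter` can then be laid out with the original residue as the row and the mutated residue as the column.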
For small molecules to engage their targets with high affinity, a well-defined cavity that possesses suitable shape and physicochemical properties is required. The lack of such cavities is partly responsible for the difficulty in developing small-molecule therapeutic agents that bind directly to highly promising cancer targets such as mutated RAS GTPase or transcription factors such as c-MYC. Conversely, the success of kinases as oncology targets can be attributed to the well-defined ATP-binding site. Using binding sites of kinases and other druggable targets, several algorithms have been developed to predict the druggable nature of a binding site using the three-dimensional structure of the protein that harbors it.64 Among them are SiteScore and DrugScore, which were developed using data from binding sites occupied by approved drugs.25,28 Although the ATP-binding site is druggable, its highly conserved nature has been the main impediment in the development of kinase drugs. Developing highly selective kinase inhibitors is notoriously difficult, although some successes have been reported. Identifying novel targets with unique druggable binding sites located on potential cancer targets may lead to cancer therapeutics with greater efficacy and lower toxicity.
Here, in an effort to facilitate the chemical probing of new targets in cancer, we explore RNA-seq data of 10 tumor types at TCGA to identify unique and druggable binding sites on the protein products of overexpressed genes. The large-scale effort of TCGA to sequence the genome of tumors from more than 30 cancers provides an unprecedented opportunity to uncover new targets for the development of cancer therapeutics. We identified genes whose mRNA levels are overexpressed in tumors compared with normal tissue. Patient data provided by TCGA was used to further narrow the list of targets to genes whose overexpression correlates strongly with patient survival. This was accomplished by constructing survival curves and evaluating a hazard ratio for each overexpressed gene. Genes with a hazard ratio greater than 1 were considered to correlate with worse patient survival. For each of the 10 diseases that we have considered in this work, we identified protein products of genes whose mRNA levels are differentially-expressed and strongly correlate with patient survival. Additionally, we explored these targets in the context of cancer-related signaling pathways and the protein–protein interaction network.
The exponentially growing list of three-dimensional structures of proteins prompted us to search the PDB to identify structures for protein products of the up-regulated genes that we identified. We used stringent thresholds for SiteScore and DrugScore to ensure that small molecules that bind to the druggable binding sites have the potential to be developed into therapeutic agents. Among all up-regulated genes we found that 23% of their protein products had a structure at the PDB. Among the 1218 proteins with structures, 405 (33%) had druggable binding sites. A similar ratio was found among individual diseases. For example, 51 of the 211 proteins with a crystal structure in TNBC had a druggable binding site, while 114 of the 363 proteins with a crystal structure in LUAD were found to have a binding site. When overexpressed genes are further filtered by hazard ratio, a total of 54 proteins that possess druggable binding sites, and 65 that possess binding sites, are identified among 1344 differentially-expressed genes. There were 15 druggable proteins that are present in multiple tumor types. The most frequently occurring was MELK, which was found in 4 tumor types.
The presence of a binding site is not sufficient for it to serve as a suitable target site for chemical probe development and drug discovery. The binding site must also possess functional relevance: it must be located such that the binding of a small molecule will impair the function of the protein harboring it. For example, small molecules that bind to a binding site located at an enzyme active site or protein–protein interface will disrupt enzyme activity or protein–protein interactions and thereby impair the function of the target protein. Binding sites located outside an enzyme active site or protein–protein interface may or may not modulate the activity of a protein. We classified all binding sites into enzyme active sites, protein–protein interaction sites, or other sites with yet unknown function that may provide an opportunity to modulate protein function through an allosteric mechanism.
Many of the enzyme active sites occur on well-established oncology targets or have been inhibited by small molecules. However, there were several examples of enzymes whose function was explored in cancer but that were never targeted with small molecules; these include PKMYT1, STEAP3, and NNMT. There were also several druggable active-site binding sites that occurred on enzymes that have seldom been considered in cancer, such as PYCR1, HSPA6, and QPRT. We also identified several binding sites at protein–protein interfaces on proteins whose overexpression correlates with patient outcome. This discovery is highly significant, as protein–protein interactions have been historically challenging to target due to the lack of well-defined binding sites at protein–protein interfaces.65,66 Protein–protein interfaces can offer an opportunity to develop highly selective compounds since many of these interfaces are structurally unique. Among all differentially-expressed proteins with binding sites, 18% have binding sites that occurred at protein–protein interfaces. For the proteins encoded by genes that correlate with patient survival, we identified 28 binding sites (7 druggable) on 25 proteins that occurred at protein–protein interfaces. Among these proteins, 13 have been studied in cancer. Examples include MEF2B, HNF4A, and CBX2. The remaining 15 proteins have seldom been studied in cancer, such as CASC5 and ZBTB32. Interestingly, several protein structures possess both PPI and ENZ binding sites either on the same domain (e.g. GAD1, NAMPT, and NR1I2) or on different domains (e.g. EPHB2, PLK1, and NTRK1). Small molecules that bind to a binding site on these proteins may serve as allosteric modulators of protein–protein interactions.
We found that the majority of binding sites were located neither at an enzyme active site nor at a protein–protein interaction site. We refer to these binding sites as other (OTH). Of the 601 unique binding sites on the 202 proteins encoded by genes whose overexpression correlates with patient survival, 102 are ENZ, 46 are PPI, 444 are OTH, and 9 have been classified as both ENZ and PPI. It is likely that many of these OTH binding sites occur at protein–protein interfaces. To explore this possibility, we searched protein–protein interaction databases such as PrePPI for binding partners. Among 759 OTH binding sites located on overexpressed proteins, we identified 17 candidates that have the potential to be located at PPI interfaces. Examples of these proteins include ANK1, CHN1, and NCS1. While binding sites that occur at enzyme active sites or protein–protein interaction sites can be used to develop probes that directly modulate the function of the target harboring them, the remaining OTH binding sites can provide an opportunity to modulate receptors through an allosteric mechanism.31,67 Whether a small molecule that binds to a binding site will allosterically modulate enzyme function or a protein–protein interaction is difficult to predict. Small molecules can serve as positive or negative allosteric regulators.34,68,69 These OTH binding sites can also be used for the development of small molecules that can be attached to probes for proteasome degradation.70
Finally, we mapped mutations that were previously identified at TCGA55 onto the three-dimensional structure of proteins that are encoded by overexpressed genes that correlate with patient outcome. A recent study explored the role of mutations in tumorigenesis,71 and more recent studies have done so using a structural genomics-based approach.72,73 Our work complements these studies by identifying druggable binding pockets and classifying pockets according to whether they occur at enzyme active sites or protein–protein interaction sites. Mutations that occur within these pockets are expected to have direct consequences for the function of a protein. These pockets could provide promising targets for the development of small-molecule therapeutic agents. Interestingly, several mutations occurred in enzyme active sites. These mutations may either enhance or inhibit enzyme activity. Most of the enzyme mutations appear to involve dramatic changes in physico-chemical properties, such as H113Q, G568W, R140L, and M80R for CA6, KIFC1, NEK2, and SULT4A1, respectively. Others involved subtler mutations such as V46A, A287S, and M52T for CHEK1, PCK1, and PSPH, respectively. Since we have focused on proteins that are expected to be overexpressed, it is likely that these mutations will further enhance the activity of these enzymes. Three mutations were identified at protein–protein interfaces, R293P, W167L, and Q107H, which correspond to ADORA2A, MAD2L1, and RHCG, respectively. The first two may have disruptive effects considering that proline residues tend to disrupt secondary structures and tryptophan residues are generally believed to tighten protein–protein interactions. The overwhelming majority of mutations occurred at OTH binding sites. These mutations provide an opportunity to validate the importance of these pockets. Their occurrence suggests that these pockets may be located at unknown active sites or protein–protein interfaces. Considering that many of these OTH pockets occur on enzymes, it is more likely that they are located at a protein–protein interface and could be useful targets for the disruption of protein–protein interactions.
192 reference human protein identifiers were retrieved from UniProtKB/SwissProt.35 The FASTA sequences were retrieved for each of these proteins and used to identify structures in the RCSB Protein Data Bank (PDB).76 Each FASTA sequence was queried against the pdbaa dataset using BLASTP (protein–protein BLAST v2.2.25+).77 To limit the search to protein structures that possess significant sequence identity and coverage to the query sequence, only structures with E-value <10⁻⁵, >90% sequence identity, and PDB sequence coverage >80% were kept. We then identified the experimental methodology, the taxonomy of the identified protein chain, and the structural resolution if the structure was from X-ray diffraction. The identified structures were then filtered to retain only crystal structures of human proteins with a resolution better than 3 Å. To reduce the number of redundant structures identified by BLASTP and generate a representative set of crystal structures associated with each protein, CD-HIT (v4.6.1)78 was used with default parameters to cluster the FASTA sequences of the PDB structures identified for each of the proteins. Only cluster centers identified by CD-HIT were used to locate binding sites on the structures for the protein. In total, 4124 proteins had at least one crystal structure that met all of these criteria.
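The structure-selection thresholds above can be expressed as a single predicate. The following Python sketch is illustrative only: the `Hit` record, its field names, and the string values used for the experimental method and organism are assumptions made for this example, not the format of BLASTP or PDB output, and the example entry is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hit:
    """One BLASTP hit annotated with PDB metadata (hypothetical record)."""
    pdb_chain: str                 # PDB ID plus chain
    evalue: float
    identity_pct: float            # sequence identity to the query, in %
    coverage_pct: float            # PDB sequence coverage, in %
    method: str                    # experimental methodology
    organism: str                  # taxonomy of the protein chain
    resolution: Optional[float]    # in Å; None if not from X-ray diffraction


def keep_hit(hit: Hit) -> bool:
    """Apply the thresholds stated in the text to one hit."""
    return (
        hit.evalue < 1e-5
        and hit.identity_pct > 90.0
        and hit.coverage_pct > 80.0
        and hit.method == "X-RAY DIFFRACTION"   # assumed label
        and hit.organism == "Homo sapiens"      # assumed label
        and hit.resolution is not None
        and hit.resolution < 3.0                # "better than 3 Å"
    )


# Hypothetical hit, for illustration only.
example = Hit("1ABCA", 1e-50, 99.0, 95.0, "X-RAY DIFFRACTION", "Homo sapiens", 2.1)
selected = [h for h in [example] if keep_hit(h)]
```

Redundancy reduction with CD-HIT would follow this step and is not reproduced here.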
Each binding site identified by SiteMap was visually inspected and manually annotated to determine its functional role in the protein. If an enzymatic residue was in contact with the SiteMap spheres, or if an enzymatic molecule or inhibitor occupied the space of the spheres, the binding site was labeled ‘enzymatic’ (ENZ). If the binding site was at a protein–protein interaction (PPI) interface on the original structure or on any of the aligned structures, the binding site was labeled ‘PPI’. Otherwise, if the binding site was neither enzymatic nor part of the interaction interface, it was labeled ‘Other’ (OTH). Binding sites of the recognition site of human leukocyte antigens (HLAs) and heme cofactor binding site of Cytochrome P450s were labeled ‘Other’.
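Although the annotation was carried out by visual inspection, the decision rules can be summarized in code. The Python sketch below is a hypothetical encoding of those rules; the function and argument names are assumptions, and the boolean inputs stand in for judgments that were made manually. A set is returned because a site can carry both labels.

```python
def annotate_site(contacts_enzymatic_residue,
                  occupied_by_enzymatic_molecule_or_inhibitor,
                  at_ppi_interface,
                  hla_or_p450_heme_site=False):
    """Return the set of labels for one binding site.

    The boolean inputs represent the outcome of manual inspection.
    """
    # HLA recognition sites and P450 heme cofactor sites are labeled 'Other'.
    if hla_or_p450_heme_site:
        return {"OTH"}
    labels = set()
    if contacts_enzymatic_residue or occupied_by_enzymatic_molecule_or_inhibitor:
        labels.add("ENZ")
    if at_ppi_interface:  # on the original or any aligned structure
        labels.add("PPI")
    # Neither enzymatic nor at an interface: 'Other'.
    return labels or {"OTH"}
```

For instance, a site that contacts an enzymatic residue and also lies at an interface would be returned as `{"ENZ", "PPI"}`.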
Secondary structures for each of the binding sites and their interaction partners were retrieved from DSSP.81 The secondary structure of each residue of a crystal structure is classified as helix, sheet, or coil in DSSP. The number of residues falling into each category was counted for the residues within 5 Å of the binding site. If at least 60% of these residues shared a secondary structure class, the binding site was assigned to that category. Otherwise, the binding site was considered mixed.
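The 60% consensus rule can be sketched as follows. This minimal Python example assumes that the residues within 5 Å of the site have already been reduced to one class label each (helix, sheet, or coil); the function name and the input format are hypothetical.

```python
from collections import Counter


def classify_site(residue_classes, threshold=0.6):
    """Assign a binding site to a secondary structure class by consensus.

    `residue_classes` holds one label per residue near the site,
    e.g. 'helix', 'sheet', or 'coil'. Returns 'mixed' when no class
    reaches the consensus threshold.
    """
    if not residue_classes:
        raise ValueError("binding site has no residues")
    label, count = Counter(residue_classes).most_common(1)[0]
    if count / len(residue_classes) >= threshold:
        return label
    return "mixed"
```

A site lined by seven helical and three coil residues would be labeled helix, whereas an even split between the two would be labeled mixed.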
CPM and for diseases using RNA-seq. Expression levels for each gene were separated into ‘high expression’ and ‘low expression’ groups using the median expression of the gene across all patients for a given disease. A Cox proportional hazards regression model was fitted to the survival profile to determine the hazard ratio (HR) of each gene. Genes were filtered using p < 0.05 and HR > 1.0.
164 edges.
Footnote
† Electronic supplementary information (ESI) available. See DOI: 10.1039/c6mb00231e
This journal is © The Royal Society of Chemistry 2016