Richard J. Edwards *a, Norman E. Davey b, Kevin O'Brien c and Denis C. Shields c
aCentre for Biological Sciences, University of Southampton, UK. E-mail: r.edwards@southampton.ac.uk; Fax: + 44 23 8059 4459; Tel: + 44 2380 594344
bStructural and Computational Biology Unit, European Molecular Biology Laboratory, 69117 Heidelberg, Germany
cUCD Complex and Adaptive Systems Laboratory & UCD Conway Institute of Biomolecular and Biomedical Sciences, University College Dublin, Ireland
First published on 30th August 2011
Many of the specific functions of intrinsically disordered protein segments are mediated by Short Linear Motifs (SLiMs) interacting with other proteins. Well known examples include SLiMs that interact with 14-3-3, PDZ, SH2, SH3, and WW domains but the true extent and diversity of SLiM-mediated interactions is largely unknown. Here, we attempt to expand our knowledge of human SLiMs by applying in silico SLiM prediction to the human interactome. Combining data from seven different interaction databases, we analysed approximately 6000 protein-centred and 1600 domain-centred human interaction datasets of 3+ unrelated proteins that interact with a common partner. Results were placed in context through comparison to randomised datasets of similar size and composition. The search returned thousands of evolutionarily conserved, intrinsically disordered occurrences of hundreds of significantly enriched recurring motifs, including many that have never been previously identified (http://bioware.soton.ac.uk/slimdb/). In addition to True Positive results for at least 25 different known SLiMs, a striking number of “off-target” proteins/domains also returned significantly enriched known motifs. Often, this was due to the non-independence of the datasets, with many proteins sharing interaction partners or contributing interactions to multiple domain datasets. The majority of these motif classes, however, were also found to be significantly enriched in one or more randomised datasets. This highlights the need for care when interpreting motif predictions of this nature but also raises the possibility that SLiM occurrences may be successfully identified independently of interaction data. Although not as compositionally biased as previous studies, patterns matching known SLiMs tended to cluster into a few large groups of similar sequence, while novel predictions tended to be more distinctive and less abundant. Whether this is due to ascertainment bias or a true functional composition bias of SLiMs is not clear and warrants further investigation.
Many of the specific functions of intrinsically disordered protein segments are mediated by Short Linear Motifs (SLiMs). SLiMs are functional peptide microdomains, typically 3–10 amino acids in length, which usually occur in regions of intrinsic disorder.8,9 They are known to mediate many important protein-protein interactions in a variety of scenarios, including protein scaffolding (e.g. 14-3-3), intra- (e.g. PDZ) and extra-cellular signalling (e.g. integrin-binding RGD), control of gene expression (e.g. PBX Homeobox ligand), subcellular localisation (e.g. Golgi to ER retrieval), post-translational modification (e.g. phosphorylation) and cleavage (e.g. Taspase1).9 Through transient and low-affinity interactions, SLiMs can function as molecular switches and cooperatively regulate dynamic cell signalling events.10 Their ubiquity and importance have made them critical molecular targets for pathogens and predators, particularly viruses, which are known to mimic over 50 different eukaryotic host SLiMs.11 As key players in signalling pathways, SLiMs also represent important targets for diseases, both in terms of causal mutations and potential therapeutics.12
Annotation efforts over the last decade have provided high quality data for known SLiMs, with databases specifically focusing on phosphorylation13 and cleavage sites,14,15 in addition to classical ligand-binding SLiMs.4,5 With the exception of a few well-studied examples, however, we still know comparatively little about the abundance and variety of functional motifs. It is therefore of great interest to discover new interaction motifs that may form the basis of future reagents, including drugs, to disrupt or regulate important interactions.
Currently there is a disproportionate number of known domains (∼10000) compared to known SLiMs (∼200), suggesting that the difficulty involved in SLiM discovery is reflected in our knowledge of them. It was estimated that 15–40% of protein-protein interactions may be mediated by SLiMs8 but protein-protein interaction data do not reflect this; only 1% of interactions detected in genome-scale human yeast-2-hybrid experiments12 and as little as 5% of all interactions contained in the Human Protein Reference Database (HPRD),16 which includes data derived from many low throughput SLiM discovery experiments, are mediated through known SLiMs.12 Previous attempts could explain only 19% of known interactions by known domain-domain interactions.17 This proportion will undoubtedly increase as more complex structures are solved experimentally but the capacity for SLiM-mediated interactions remains extensive. Furthermore, it is not unrealistic to hypothesise that a larger proportion of the undiscovered interactome may be SLiM-mediated than current trends suggest, since their low affinity and temporally transient activity may make them much more difficult to discover experimentally by current methods than domain-mediated interactions.
Despite these challenges, advances in motif statistics,18,19 motif enrichment,20 dataset design21 and motif classification22,23 are enabling rapid motif discovery with ever-increasing accuracy. These tools are ideally suited to aid in the annotation of interaction data. The potential of interactome-wide in silico predictions of interaction motifs was demonstrated by Neduva et al.24 when they applied their LMD (a.k.a. DILIMOT25) motif prediction tool to the known interactomes of human, Drosophila melanogaster, Caenorhabditis elegans and yeast. Of the potential motifs returned, they validated two of the predictions using fluorescence polarisation to demonstrate specific binding between hub proteins and peptides corresponding to the predicted motif. This pioneering study, however, had several shortcomings. Firstly, LMD does not allow amino acid ambiguity or flexible lengths in its returned motifs. Secondly, it returns the probability of a given motif occurring by chance, but not the probability of any motif occurring by chance. More recent software, SLiMFinder,19 addresses these issues directly by incorporating ambiguity into SLiM predictions and calculating a significance value for each motif, which estimates with reasonable accuracy the probability of the dataset returning an apparently convergently-evolved motif of the same or greater over-representation by chance. This method has recently been improved by incorporating evolutionary information to mask residues based on their relative conservation.20
In this paper, we describe an attempt to mine the known interaction data for interacting modules by focusing on the discovery and rediscovery of SLiMs using these latest developments in SLiM prediction. We also enlarge the search space in humans by incorporating additional interaction data with eight distinct strategies of dataset compilation. We highlight important issues to be considered during in silico SLiM discovery and have made our results available as a navigable online resource, which can be mined for predictions for specific proteins and provides an invaluable reference for future studies.
PPI type | Protein hubs: datasets a | Protein hubs: analysed a | Protein hubs: significant a | Domain hubs: datasets a | Domain hubs: analysed a | Domain hubs: significant a |
---|---|---|---|---|---|---|
ppi | 12207 | 7346 | 590 (8.0%) | 1759 | 1660 | 458 (27.6%) |
y2h | 7392 | 2956 | 116 (3.9%) | 1255 | 1129 | 166 (14.7%) |
bin | 10247 | 4880 | 193 (4.0%) | 1572 | 1539 | 212 (13.8%) |
com | 8853 | 4832 | 266 (5.5%) | 1468 | 1342 | 294 (21.9%) |

a Numbers of datasets for each PPI compilation strategy: Datasets, in total; Analysed, analysed with SLiMFinder (≤ 1000 sequences, 3+ unrelated); Significant, returning 1+ significant results (p < 0.05).
The proportion of datasets returning motif predictions with a SLiMChance significance of p < 0.05 varied from 3.9% to 8.0% for protein-centred datasets, and from 13.8% to 27.6% for domain-centred datasets (Table 1). Expected numbers of motifs returned at a given p value can be estimated by a simple product of the p value and the number of datasets analysed. Enrichment can then be defined as the ratio of observed results at a given p value to this random expectation. This expectation assumes that the SLiMChance algorithm is completely accurate in its estimation of significance on real data. In reality, SLiMChance is slightly stringent and has a tendency to under-estimate significance.18 Therefore, we also analysed datasets of equal size to the real data constructed using two different strategies: “rseq”, in which sequences were selected from the human proteome at random, and “rupc”, in which clusters of related sequences from the “real” data were randomly shuffled to make new datasets. (See Methods for details.)
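The expectation and enrichment defined above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, using the “ppi” counts from Table 1; the function names are hypothetical and are not part of SLiMFinder or SLiMChance.

```python
import math

def expected_significant(n_datasets: int, p_threshold: float) -> float:
    """Random expectation: p value threshold times number of datasets analysed."""
    return n_datasets * p_threshold

def log2_enrichment(observed: int, n_datasets: int, p_threshold: float) -> float:
    """Log2 ratio of observed significant datasets to the random expectation."""
    if observed == 0:
        return float("-inf")  # nothing observed, so the ratio is zero
    return math.log2(observed / expected_significant(n_datasets, p_threshold))

# (significant at p < 0.05, datasets analysed) for the "ppi" strategy in Table 1.
table1_ppi = {
    "protein hubs": (590, 7346),
    "domain hubs": (458, 1660),
}

for label, (observed, analysed) in table1_ppi.items():
    expected = expected_significant(analysed, 0.05)
    enrichment = log2_enrichment(observed, analysed, 0.05)
    print(f"{label}: expected {expected:.1f}, observed {observed}, "
          f"log2 enrichment {enrichment:.2f}")
```

Because this expectation assumes SLiMChance p values are exactly calibrated, the comparison with the “rseq” and “rupc” randomised datasets remains the more reliable baseline.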
For all protein-centred PPI compilation strategies, the number of significant SLiMFinder predictions (p < 0.01) dramatically exceeded random expectation for the real data, whereas for random data it generally did not (Fig. 1). Apparent enrichment at p < 0.001 in random data was due to a very small number of datasets and was exceeded in every case by the corresponding set of real data. Domain-centred datasets returned a greater proportion of significant motifs than protein-centred datasets, although this difference diminished with increasing significance. Randomised domain-centred datasets show a similar pattern, indicating that dataset size may influence results. This is not surprising as SLiMChance has been shown to be more sensitive for larger datasets.19 This observation could also account for the apparent increased effectiveness of “ppi” and “com” datasets, which in turn tend to be larger than the “y2h” and “bin” datasets. To investigate this further, we compared the size distribution (in terms of Unrelated Protein Clusters (UPC)) of datasets returning significant motifs (p < 0.05) with those that do not. As predicted, within each class of dataset (PPI combination strategy and real/rseq/rupc), the datasets returning significant motifs tend to be larger than those that do not (data not shown). This is especially pronounced in random data.
Fig. 1 Enrichment of significant results vs. expectation. Log2 enrichment of datasets returning significant motifs vs. SLiMChance expectation is plotted against decreasing SLiMChance significance for Real (gold), RUPC (cyan) and RSeq (blue) datasets for each ppi compilation strategy.
Motif | Description | Full PPI a | Y2H a | Binary a | Complex a |
---|---|---|---|---|---|
LIG_1433 | Dominant 14-3-3 ligand motif [HKR][ST].[ST].P | 6/1 | 2/1 | 2/1 | 6/1 |
LIG_AP_GAE | Gamma-adaptin ear ligand motif [DE][DE][DE]F.[DE]F | 2/1 | 0/0 | 0/0 | 0/0 |
LIG_BRCT | S..F phosphomotif interacting with BRCA1 | 1/0 | 0/0 | 0/0 | 0/0 |
LIG_CtBP | P.DLS CtBP interaction ligand | 1/1 | 2/0 | 2/0 | 1/1 |
LIG_CYC | Cyclin recognition motif, [RK].L | 0/1 | 0/0 | 0/0 | 0/1 |
LIG_Dyn | K.TQT Dynein Light Chain ligand | 1/0 | 0/0 | 0/0 | 0/0 |
LIG_EH | Canonical Eps15 homology (EH) domain binding motif, NPF | 2/0 | 0/0 | 0/0 | 0/0 |
LIG_GoLoco | Part of G-protein G-alpha domain binding motif | 0/0 | 0/1 | 0/1 | 0/0 |
LIG_PABPC1 | PABPC1 binding region | 1/1 | 0/0 | 1/1 | 1/1 |
LIG_PCNA | Q...[IL]...FF PCNA ligand | 1/0 | 0/0 | 0/0 | 1/0 |
LIG_PDZ | Canonical C-terminal PDZ motif [ST].[ILV]$ | 24/1 | 10/2 | 18/2 | 8/2 |
LIG_PP1 | PP1 docking motif [RK].{0,1}[IV][^P][FW] | 1/0 | 1/0 | 1/0 | 0/0 |
LIG_PTB | NP.Y Phosphotyrosine binding (PTB) motif | 1/0 | 0/0 | 0/0 | 1/0 |
LIG_SH2 | SH2 domain ligand. Strongest Y.N, Y..Q and Y..P motifs only | 4/0 | 0/0 | 0/0 | 5/1 |
LIG_SH3 | Canonical P..P SH3 ligand motif | 1/0 | 0/0 | 0/0 | 0/0 |
LIG_WW_1 | PP.Y WW ligand motif | 0/1 | 0/0 | 0/1 | 0/0 |
MOD_CAAXbox | Generic CAAX box prenylation motif C.[ILMV].$ | 1/1 | 0/0 | 1/0 | 0/0 |
MOD_CK1 | S..[ST] Motif recognised by CK1 for Ser/Thr phosphorylation | 2/0 | 0/0 | 0/0 | 1/1 |
MOD_CK2 | CK2 phosphorylation motif. [ST]...[DE] | 2/1 | 0/0 | 0/0 | 0/1 |
MOD_GSK3 | [ST]...[ST] Site recognised by GSK3 for Ser/Thr Phosphorylation | 0/0 | 0/0 | 0/0 | 2/0 |
MOD_PKB | R.R...[ST][^P] PKB Phosphorylation site | 2/1 | 0/0 | 0/0 | 1/0 |
MOD_PKC | PKC phosphorylation motif, [ST].[KR] | 0/0 | 0/0 | 1/1 | 0/0 |
MOD_STP | Common recurring phosphorylation motif [ST]P | 7/2 | 1/1 | 2/1 | 2/1 |
MOD_SUMO | Canonical sumoylation motif, [AILMV]K.E (ELM MOD_SUMO) | 1/0 | 0/0 | 0/0 | 0/0 |
Yxx# | Multifunctional Y..[ILMVF] motif, which includes ITAM, ITIM, ITSM, SH2 and endocytic targeting motifs | 8/1 | 2/0 | 2/0 | 5/2 |

a No. gene/domain hubs returning significant predicted SLiM matching known motif.
Another typical statistic of interest is the return of known “False Positives” (FP), which are predictions known to be incorrect. At face value, this analysis returns a great number of motifs that appear to fall into this category. We have identified numerous different groups of motifs that we have classed as “Off-target or generic recurring” motifs, which are either returned by multiple datasets, or are known ELMs returned by the “wrong” dataset (Table 3). However, these are not FP motifs in the true sense of the term, in that many of them are either known, or highly likely, to be real SLiMs of biological importance. The “false” aspect of these predictions lies in the assumption that they are responsible for SLiM-mediated interactions with the PPI hub that returned the motif. This is explored further in subsequent sections.
Motif | ppi a | y2h a | bin a | com a | Real b | RSeq c | RUPC c |
---|---|---|---|---|---|---|---|
LIG_1433 | 4/7 | 1/2 | 2/4 | 1/2 | 8/15 | 4/18 | 3/16 |
LIG_AP_GAE | 3/1 | 0/0 | 0/0 | 2/2 | 5/3 | 0/0 | 6/2 |
LIG_CYC | 5/3 | 0/0 | 2/1 | 5/1 | 12/5 | 0/0 | 2/2 |
LIG_CtBP | 0/2 | 0/2 | 0/2 | 0/2 | 0/8 | 0/1 | 0/0 |
LIG_EH | 1/0 | 0/0 | 0/0 | 0/0 | 1/0 | 1/1 | 2/0 |
LIG_FHA | 8/3 | 2/0 | 2/0 | 3/0 | 15/3 | 7/11 | 6/10 |
LIG_IQ | 3/9 | 3/3 | 3/5 | 1/5 | 10/22 | 11/18 | 3/7 |
LIG_PABPC1 | 0/0 | 0/0 | 0/1 | 0/1 | 0/2 | 0/0 | 0/1 |
LIG_PCNA | 1/0 | 0/1 | 0/1 | 0/0 | 1/2 | 1/0 | 0/0 |
LIG_PDZ | 3/15 | 2/11 | 1/10 | 2/6 | 8/42 | 5/15 | 1/3 |
LIG_PP1 | 0/0 | 1/1 | 0/1 | 0/0 | 1/2 | 0/1 | 0/0 |
LIG_PTB | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | 0/1 | 1/0 |
LIG_RGD | 1/1 | 0/0 | 0/0 | 1/1 | 2/2 | 0/0 | 0/0 |
LIG_SCF | 0/0 | 0/0 | 0/0 | 1/1 | 1/1 | 0/0 | 0/0 |
LIG_SH2 | 2/1 | 1/0 | 2/0 | 0/4 | 4/5 | 1/3 | 2/0 |
LIG_SH3 | 9/22 | 2/9 | 2/18 | 3/10 | 16/59 | 1/18 | 7/16 |
LIG_WW_1 | 1/1 | 0/0 | 0/1 | 0/0 | 1/2 | 0/0 | 0/0 |
MOD_CAAXbox | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | 0/5 | 1/1 |
MOD_CK1 | 5/8 | 1/3 | 1/3 | 2/9 | 9/23 | 5/22 | 21/9 |
MOD_CK2 | 10/18 | 1/5 | 2/6 | 6/10 | 19/49 | 16/36 | 10/14 |
MOD_CamKII | 2/1 | 2/1 | 2/2 | 3/0 | 9/4 | 3/5 | 3/1 |
MOD_GSK3 | 9/12 | 0/3 | 2/6 | 1/17 | 12/38 | 7/24 | 12/20 |
MOD_NGLC | 5/1 | 1/0 | 1/0 | 3/2 | 10/3 | 7/24 | 5/6 |
MOD_PIKK | 0/0 | 0/1 | 0/0 | 1/0 | 1/1 | 2/3 | 0/0 |
MOD_PKA | 2/0 | 0/0 | 0/0 | 1/0 | 3/0 | 1/0 | 1/0 |
MOD_PKB | 6/5 | 1/1 | 1/3 | 5/2 | 13/11 | 0/4 | 0/6 |
MOD_PKC | 1/1 | 0/1 | 0/0 | 1/0 | 2/2 | 3/0 | 1/1 |
MOD_PLK | 1/0 | 0/0 | 0/0 | 0/0 | 1/0 | 2/1 | 3/1 |
MOD_SDE | 10/20 | 3/5 | 2/7 | 5/17 | 20/49 | 11/27 | 8/8 |
MOD_STP | 16/31 | 5/10 | 4/21 | 4/16 | 29/78 | 3/26 | 10/35 |
MOD_SUMO | 12/17 | 5/4 | 5/2 | 6/11 | 28/34 | 1/1 | 0/1 |
TRG_KDEL | 12/10 | 2/3 | 1/2 | 10/12 | 25/27 | 0/0 | 0/0 |
Yxx# | 3/4 | 1/0 | 1/0 | 0/1 | 5/5 | 7/4 | 5/4 |
CxxC | 3/6 | 1/3 | 1/4 | 1/4 | 6/17 | 3/3 | 3/2 |
RGR | 1/13 | 0/4 | 0/2 | 3/7 | 4/26 | 1/1 | 0/2 |
WALKER | 6/16 | 0/0 | 0/0 | 5/8 | 11/24 | 4/5 | 1/5 |
diKR | 39/72 | 3/8 | 6/20 | 10/46 | 58/146 | 24/82 | 18/41 |
pST | 75/38 | 14/20 | 29/18 | 25/29 | 143/105 | 147/137 | 135/93 |
pY | 16/5 | 7/2 | 10/3 | 7/1 | 40/11 | 59/52 | 34/29 |
TOTAL d | 275/343 | 59/103 | 82/143 | 118/227 | 533/826 | 337/549 | 304/336 |

a No. gene/domain hubs returning significant predicted SLiM matching known motif. b Total number of gene/domain hubs from real interaction datasets. c No. random gene/domain datasets returning significant predicted SLiM matching known motif. d Note that due to overlapping motifs, this total is an over-estimate.
| PPI a | N b | Cloud c | Motif d | Rank d | Sig d | Support d |
|---|---|---|---|---|---|---|
| Full PPI | 112 (74) | 1 (9/8) | Q...[IL]...FF | 1 | 4.3 × 10−8 | 8/8/7 |
| | | | Q.[ST].[IL]...FF | 2 | 4.3 × 10−4 | 4/4/4 |
| | | | TL.SFF | 3 | 0.044 | 3/3/3 |
| Complex | 91 (62) | 1 (19/16) | Q...[IL]...FF | 1 | 1.5 × 10−8 | 8/8/7 |
| | | | Q.[ST].[IL]...FF | 2 | 2.7 × 10−4 | 4/4/4 |
| | | | I...FF | 3 | 0.003 | 7/7/7 |
| | | | [ILV]...[FWY]F | 5 | 0.006 | 20/18/15 |
| | | | [ILV]...[FHY]F | 6 | 0.019 | 20/18/15 |
| | | | TL.SFF | 7 | 0.030 | 3/3/3 |
| | | | Q...L...FF | 8 | 0.034 | 5/5/4 |
| | | 2 (13/13) | D[FILV].N | 4 | 0.005 | 14/13/13 |
| Binary | 24 (18) | — | — | — | — | — |
| Y2H | 22 (17) | — | — | — | — | — |

a PPI compilation strategy. b Number of PCNA-interactors in dataset. The number of UP clusters is given in brackets. c Cloud of overlapping motifs. Numbers in brackets indicate numbers of different sequences and UP clusters containing motifs in the cloud. d Predicted motifs: returned pattern, rank in dataset, SLiMChance significance, motif support (no. occurrences/no. sequences/no. UP).
Fig. 2 Subset of PCNA interaction dataset returning LIG_PCNA ELM. Selected proteins that interact with proliferating cell nuclear antigen (PCNA) with evidence types. Black double-line, “complex-enriched” interaction; green dashed line, yeast-two hybrid; single black line, other interaction evidence. With the exception of APEX1, interactions between spokes are not shown. Spokes containing annotated occurrences of the ELM LIG_PCNA are highlighted in green. Variants of the LIG_PCNA SLiM returned by SLiMFinder analysis are shown next to each hub. *, WRN returns LIG_PCNA variants but the positions of the two occurrences do not match that in the ELM database.
Fig. 3 Predicted LIG_PCNA occurrence in WRN. The upper panel shows an alignment of a region of human Werner syndrome, RecQ helicase-like (WRN) with predicted vertebrate orthologues, centred on the SLiM occurrence. The lower panel plots Relative Local Conservation (RLC) and IUPred disorder prediction scores for each residue. Residues designated “unconserved” (RLC < 0) and/or “ordered” (IUPred < 0.2) were masked out of the analysis (X in the “Masking” row of the alignment).
In addition to the five Q...[IL]...FF occurrences that are known ELM occurrences, the PCNA interactome returns three additional Q...[IL]...FF occurrences in PCNA-associated factor p15PAF (KIAA0101), DNA mismatch repair protein MSH3 and A/G-specific adenine DNA glycosylase (MUTYH) (Fig. 2). Given the size and composition of the PCNA interactome, the expected number of occurrences of the Q...[IL]...FF motif is 0.035 (data not shown); it is highly probable that these represent true PCNA ligand occurrences. Indeed, one of the three, MSH3, is clearly homologous to an occurrence annotated in ELM in the yeast MSH3 protein (data not shown), while MSH3 itself is homologous to MSH6, which is known to interact with PCNA through the PCNA ligand SLiM (Fig. 2).
The PCNA dataset is also informative about how small changes to the composition or masking of the interaction datasets can affect the SLiM predictions returned. Whilst both the “ppi” and “com” datasets return the core Q...[IL]...FF motif and two other variants (Q.[ST].[IL]...FF and TL.SFF) with similar significance, the “com” dataset also returns another four variants of the PCNA ligand as well as an unknown motif D[FILV].N (Table 4, and ESI, Table S2‡). Since, by definition, the full PPI dataset contains all the proteins returning these motifs, the shift into significance for these motifs is purely the result of a reduction in the number of proteins not containing the SLiM. On the other hand, the “bin” and “y2h” datasets do not return any significant motifs, including the Q...[IL]...FF TP, which is a result of losing a number of ligand-containing interactors through the interaction filtering (Table 4, Fig. 2).
In an attempt to investigate this effect, four different PPI compilation strategies were applied to the human PPI data (ESI, Fig. S3‡). Datasets constructed with all four strategies successfully return known motifs (Table 2, and ESI, Table S1‡). The simplest approach (“ppi”), in which all interactors are included, seems to be the most successful, returning the most True Positives (Table 2) and the highest proportion of significant results in total (Table 1, Fig. 1). These datasets obviously have more proteins than the other strategies and this result indicates that SLiMFinder is more sensitive to a reduction in signal than a reduction of noise.
The increased return of significant results from larger datasets also raises the possibility that many of the datasets are right on the cusp of motif detection. This is exemplified by the return of the CAAX box motif (C.[IL].$) from PDE6D (Table 2, p = 3.33 × 10−6, FDR = 0.001). Of the 26 PDE6D interactors, nine returned occurrences of the CAAX box motif (Fig. 4). However, only one of the three subset strategies returns the motif with significance (p < 0.05, Table 2). This is because, in each case, interactors returning the motif are removed by the filtering process (ESI, Fig. S3‡) and, even though a number of occurrences remain in each case (“bin”, 6/23; “y2h”, 5/22; “com”, 3/6), there is not enough signal to overcome the random expectation. This emphasises the importance of interactome coverage and the need for care to be taken when filtering proteins out of the interaction data.
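The motif notation used here (“.” for any residue, square brackets for alternatives, “$” for the C-terminus) can be read directly as a regular expression, so the support of a pattern such as C.[IL].$ in a dataset is easy to count. The sketch below does this with Python's `re` module; the sequences and their names are invented for illustration and are not PDE6D interactors.

```python
import re

# CAAX box variant as written in the text; "$" anchors the match to the C-terminus.
CAAX = re.compile(r"C.[IL].$")

def sequences_with_motif(sequences: dict, pattern: re.Pattern) -> list:
    """Return the names of sequences containing at least one match to the pattern."""
    return [name for name, seq in sequences.items() if pattern.search(seq)]

# Hypothetical toy sequences (not real proteins).
toy_dataset = {
    "toy_hit": "MSKDGKKKSKTKCVIM",       # ends in C-V-I-M, so the pattern matches
    "toy_internal": "MSCVIMKDGKKKSKTK",  # same residues, but not at the C-terminus
    "toy_miss": "MSKDGKKKSKTKCVAM",      # A is not in [IL]
}

hits = sequences_with_motif(toy_dataset, CAAX)
print(f"support: {len(hits)}/{len(toy_dataset)} sequences", hits)
```

Counting matches in this way gives only the raw support (such as the 6/23 quoted for “bin”); whether that support is significant depends on the SLiMChance expectation for the dataset.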
Fig. 4 Protein-protein interaction network for PDE6D. Full compiled interactome for retinal rod rhodopsin-sensitive cGMP 3′,5′-cyclic phosphodiesterase subunit delta (PDE6D) with evidence types. Black double-line, “complex-enriched” interaction; green dashed line, yeast-two hybrid; single black line, other interaction evidence. Spokes returning the CAAX box motif are highlighted in green.
The second factor in the enrichment of motifs in the real domain-based datasets is the multi-domain nature of many human proteins and recurring domain architecture, which means that many domain-based datasets will return motifs that interact with different domains that co-occur in proteins. This explains the even more pronounced return of off-target and generic recurring motifs from domain datasets compared to protein datasets; even though there are fewer domain datasets, more of them (in absolute terms) return such motifs (Table 3).
Although it is obviously tempting to equate SLiM predictions from randomised datasets with “random noise”, this is not strictly true. Just as the results from real datasets are dominated by “off-target” and “generic” motifs that represent genuine SLiMs, albeit SLiMs that do not (as far as we know) interact with the specified hub protein/domain, it is important to conceptually distinguish stochastic over-representation of a genuine SLiM versus pure noise in randomised datasets. In the former case, the random element driving the false discovery is the combination by chance of a number of proteins containing the same real SLiM (e.g. the WALKER motif). In the latter case, the random element is coincidental combinations of amino acids. Because over-representation of a SLiM in a whole proteome is going to increase its chances of stochastic over-representation in a subset of proteins, it is not surprising that a substantial proportion of results returned by randomised datasets correspond to “off-target” motifs. This includes 77% of the random results at FDR < 0.05. Interestingly, three of the remaining eleven random motifs (FDR < 0.05) are the LQxxL motif, returned by different random datasets. In total, this motif is returned by seven different random datasets and ten real domain interactome datasets. It is highly probable, therefore, that this represents another recurring motif of genuine biological significance. The motif itself shows similarity to part of the core alpha helical section of the Ubiquitin Interaction Motif (UIM) PFam domain (ESI, Fig. S4‡) and LQxxL is the top ranked motif returned by the ubiquitin domain binary-enriched dataset. Occurrences in ubiquitin interactors are generally lacking the characteristic charged flanking regions of the UIM, however, and only one confirmed UIM protein (ATXN3) is among the 33 LQxxL containing spokes in this dataset. Given the overall abundance of this motif, which in total is returned in 435 different spoke proteins across the 17 significant datasets, it is unlikely to be a specific ligand although, given the ubiquitous nature of ubiquitin, we cannot rule out the possibility that it represents a novel core ubiquitin binding motif that is related to the UIM sequence.
Another important implication of these observations is that the return of a particular motif from random data does not necessarily mean that the motif is a stochastic false positive when it is returned in real datasets. This is illustrated by the PABPC1 interaction motif (S.L...NA.EF) that, in addition to being returned by the PABPC1 interactome and the two domains found in PABPC1 (PABP and RRM_1), is returned by one random dataset (p = 0.015) that happens to contain three otherwise unrelated interactors of PABPC1 (ATXN2, PAIP2, TOB1) (ESI, Fig. S5‡). A future challenge of interpretation will be taking candidate SLiMs and predicting their true functional significance. To aid this endeavour, all motif predictions from this analysis have been made available as ESI‡ and an interactive database (http://bioware.soton.ac.uk/slimdb/). This resource will continue to be updated and annotated as literature and/or experimental support for given motifs becomes available. Clues to function may also be gained by searching the motif against the whole proteome and looking for enriched biological functions associated with evolutionarily conserved occurrences.28
1. Ascertainment bias. It is inevitable that known SLiMs are likely to have more examples in the PPI network. This is both because more abundant SLiMs are more likely to be discovered and, once discovered, knowledge of SLiMs can be used to identify additional interactors. It is also likely that functional studies are enriched in regions with an existing known function, increasing the chance of discovering a second motif in the same place.
2. Physicochemical bias. A more interesting explanation is that there is an inherent bias in the combinations of amino acids that can be successfully employed as a functional SLiM. If true, many motifs share the same core signature, which might make it easier to distinguish true SLiMs from randomly occurring patterns. At the same time, however, it will make distinguishing motifs from each other much harder as there will be fewer distinct residues conferring specificity.
3. Regulatory bias. Molecular signalling switches might rely on competitive binding for overlapping SLiMs. Such motifs will not only share some common residues but will also co-occur in the very same proteins, which might make them even harder to distinguish.
In an attempt to get a better handle on the relationship between motifs, a network analysis was performed using CompariMotif26 relationships between the patterns returned by “Real” datasets (p < 0.01) (Fig. 5). As expected from the motif “cloud” data (which, in contrast to CompariMotif, requires co-occurrence in addition to pattern similarity), several clusters of patterns were formed (ESI, Table S3‡). The largest of these are dominated by TP motifs. While a subset of novel motifs do cluster with the TP SLiMs, the majority either form small clusters with each other or do not cluster with any other motifs. The contrast between the TP and novel motifs is seen more clearly when their networks are investigated separately (Supplemental Fig 7). This favours one of two explanations:
Fig. 5 MCL clustering of TP and novel motifs based on CompariMotif similarity. Each node represents a motif. Circles, TP; Triangles, Novel. Each colour is a different MCL cluster. Details of clusters can be found in the ESI, Table S3.‡
1. There is heavy ascertainment bias in terms of motif composition for the known motifs; novel motifs represent entirely new classes of SLiM.
2. Certain amino acid combinations are enriched for functional reasons; novel motifs with these amino acids are more likely to be functional SLiMs, while the motifs with a very different composition are more likely to be false positives.
Resolving this issue will need more data on the nature of SLiM-mediated interactions and whether any specific physical or chemical properties are universally favoured. Such analysis is beyond the scope of this paper. It is not surprising that the largest clusters of motifs, spanning many interaction hubs, have been identified by biochemical means in the past, while the remaining novel motifs are members of much smaller groupings. A subset of the novel motifs likely represent false positives, however, which may be occupying regions of motif space that are not favourable for ligand-binding motifs. These data do highlight an important question that has widespread consequences for future motif discovery: common as they are, are SLiM-mediated PPI dominated by a handful of common motif types? Or, is the current repertoire of known SLiMs just the tip of the iceberg? In other words, should we concentrate on looking for more of the same, or are there whole new classes of SLiM out there waiting to be discovered? Furthermore, if there are no strong constraints on which sequences can potentially function as SLiMs, can we expect large numbers of unique motifs that only mediate a single PPI pairing? Mining the natural interactome for recurring patterns will never recover such motifs; instead, PPI networks will need to be supplemented by phage display or peptide library screens to identify other, non-native, sequences that can bind the same targets. Alternatively, methods that look for SLiM-like sequence fingerprints in individual proteins6,29 may be able to identify singleton SLiMs, even if they cannot predict the PPI partners.
It is also clear that the results of Neduva et al. are biased in a way that the results presented here are not, with a strong tendency to return proline-rich motifs. 436/690 (63%) DILIMOT predictions (226/422 (54%) patterns) contain one or more prolines, compared to 379/3990 (9.5%) SLiMFinder patterns. Serine enrichment is also strong, with 244 (35%) predictions (149 (35%) patterns) versus 955 (24%) for SLiMFinder. Only 118/422 (28%) DILIMOT patterns have neither a proline nor a serine, compared to 2862/3990 (72%) SLiMFinder patterns. The reasons for this are not clear but at least part of the explanation no doubt lies in the fact that a low complexity filter was applied to this analysis and both prolines and serines have a tendency to occur in low complexity runs. Low complexity motifs of this nature are also more likely to be rediscovered in homologous proteins than more specific motifs and so are probably further enriched in the DILIMOT analysis, which uses rediscovery in mouse to weight results.
As these methods are honed, the potential of SLiM discovery will continue to improve by increasing the motif signal. Furthermore, as the level and quality of annotation and cross-talk between databases increases with the implementation of data standards (HUPO, PSI etc.), it should be possible to improve things further by reducing noise and increasing the quality of datasets by including only directly interacting proteins. Background noise can be reduced further still by focusing analysis on the specific protein regions known to be responsible for interacting with the hub protein/domain. Again, this will become increasingly possible as the quantity and quality of interaction data continue to improve. A particular problem with eukaryotic datasets is the presence of many multi-domain proteins that can draw together several sub-networks of the interactome into large overlapping interaction datasets. This contributes greatly to the return of off-target motifs (Table 3). Methods that can partition protein interactions by domain should greatly enhance both the sensitivity and specificity of motif prediction for domain-centred interaction datasets.
1. The human query sequence had a minimum global (GABLAM ordered) similarity of 40% with the orthologue.
2. The query has a higher percentage similarity to the orthologue than any other sequence of the same species.
3. The orthologue has a higher percentage similarity to the query than to any other human sequence.
4. If (3) is not met, the orthologue must be ancestral to a duplication event involving the query. In this case, the closest human paralogue to the sequence and the query must be more similar to each other than either is to the orthologue. (See GOPHER documentation for details.)
Putative orthologues were then aligned using MAFFT.39
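The four criteria above amount to a reciprocal best-hit test with a fallback for lineage-specific duplications. The sketch below is a minimal illustration under stated assumptions, not GOPHER's implementation: the `sim` lookup of pairwise percentage similarities and all function and argument names are hypothetical, and criterion 4 is read as comparing the query with the human paralogue closest to the candidate orthologue.

```python
from typing import Dict, List, Tuple

# Hypothetical lookup of pairwise percentage similarity, keyed by (seq_a, seq_b).
Sim = Dict[Tuple[str, str], float]


def similarity(sim: Sim, a: str, b: str) -> float:
    """Symmetric lookup; pairs with no recorded similarity score 0."""
    return sim.get((a, b), sim.get((b, a), 0.0))


def is_putative_orthologue(query: str, candidate: str,
                           same_species: List[str], human_seqs: List[str],
                           sim: Sim, min_sim: float = 40.0) -> bool:
    q_c = similarity(sim, query, candidate)
    # 1. Minimum similarity between the human query and the candidate.
    if q_c < min_sim:
        return False
    # 2. The candidate must be the query's best hit within its own species.
    if any(similarity(sim, query, other) >= q_c
           for other in same_species if other != candidate):
        return False
    # 3. The query must be the candidate's best hit among human sequences.
    paralogues = [h for h in human_seqs if h != query]
    if all(similarity(sim, candidate, h) < q_c for h in paralogues):
        return True
    # 4. Fallback (assumed reading): the query and the human paralogue closest
    #    to the candidate must be more similar to each other than either is
    #    to the candidate, implying a duplication after the speciation.
    closest = max(paralogues, key=lambda h: similarity(sim, candidate, h))
    q_p = similarity(sim, query, closest)
    return q_p > q_c and q_p > similarity(sim, candidate, closest)
```

Sequences passing this test for a given query would then be collected and aligned, as described above.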
1. All PPI (“ppi”). All evidence codes were used, except “indirect_complex” and “neighbouring_reaction”.
2. Yeast-2-Hybrid (“y2h”). Only PPI supported by the following evidence codes were retained: “2 hybrid”, “two hybrid”, “two hybrid array”, “two hybrid pooling”, “two hybrid pooling approach”, “two hybrid fragment pooling approach”, “two-hybrid” and “yeast 2-hybrid”.
3. Binary-enriched (“bin”). All mutually interacting sets of three proteins (where A interacts with B and C, B also interacts with C, and A, B and C are all different) were removed from the PPI dataset. Any Yeast-2-Hybrid PPI that had been removed were added back in.
4. Complex-enriched (“com”). Only PPI supported by the following evidence codes were retained: “affinity capture-luminescence”, “affinity capture-ms”, “affinity capture-western”, “anti bait coimmunoprecipitation”, “anti bait coip”, “anti tag coimmunoprecipitation”, “anti tag coip”, “coimmunoprecipitation”, “coip”, “complex”, “direct_complex”, “gst pull down”, “his pull down”, “mass spectrometry studies of complexes”, “pull down”, “reconstituted complex”, “tandem affinity purification” and “tap”.
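The binary-enriched filter in item 3 can be sketched as follows. This is an illustrative reading rather than the code used in the study: the data structure (a dictionary mapping each unordered protein pair to its set of evidence codes) and the function name are assumptions, and an interaction is treated as part of a "mutually interacting set of three" whenever its two proteins share at least one further interaction partner.

```python
# Evidence codes treated as Yeast-2-Hybrid support (taken from item 2 above).
Y2H_CODES = {
    "2 hybrid", "two hybrid", "two hybrid array", "two hybrid pooling",
    "two hybrid pooling approach", "two hybrid fragment pooling approach",
    "two-hybrid", "yeast 2-hybrid",
}


def binary_enriched(ppi):
    """Return the binary-enriched subset of `ppi`.

    `ppi` maps frozenset({a, b}) -> set of evidence codes (hypothetical layout).
    """
    # Build an adjacency map, ignoring self-interactions.
    partners = {}
    for pair in ppi:
        if len(pair) != 2:
            continue
        a, b = tuple(pair)
        partners.setdefault(a, set()).add(b)
        partners.setdefault(b, set()).add(a)

    # An interaction A-B lies in a triangle if A and B share a partner C.
    in_triangle = set()
    for pair in ppi:
        if len(pair) != 2:
            continue
        a, b = tuple(pair)
        if partners[a] & partners[b]:
            in_triangle.add(pair)

    # Drop triangle interactions, but keep any with Yeast-2-Hybrid support.
    return {pair: codes for pair, codes in ppi.items()
            if pair not in in_triangle or codes & Y2H_CODES}
```

In this sketch self-interactions pass through unchanged, because they cannot form a triangle of three different proteins.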
• N-terminal methionines were masked, as were alanines at position 2, which are also very common [metmask=T posmask=2:A].
• Because of the large number of datasets being analysed, the default complexity filter of 5 identical amino acids (aa) in an 8 aa window was made slightly stricter for this analysis at 4 aa in a 6 aa window [compmask=4,6].
• Disorder masking of regions using IUPred.30 Disorder was predicted at an IUPred cut-off ≥ 0.2, with a minimum (dis)ordered region length of five consecutive residues [dismask=T iucut=0.2 minregion=5].
• Conservation masking using the Relative Local Conservation strategy of Davey et al.20 for sequences with 3+ GOPHER orthologues (see above). Only orthologues with fewer than 10 undefined (“X”) residues and fewer than 20% gaps relative to the query were used [consmask=T conscore=rel vnematrix=blosum62.bla minhom=3 homfilter=T maxx=10 maxgap=0.2].
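To make the complexity filter concrete, the sketch below masks any window in which a single amino acid reaches the stated count (4 identical residues in a 6 aa window). It is a simplified illustration, not SLiMFinder's code: the function name is hypothetical, and masking the entire flagged window with "X" is an assumption about which positions are masked.

```python
from collections import Counter


def mask_low_complexity(seq, min_same=4, window=6, mask_char="X"):
    """Mask every window holding >= min_same copies of one residue."""
    masked = list(seq)
    for start in range(max(len(seq) - window + 1, 0)):
        # Count residues in the unmasked sequence so that earlier masking
        # does not influence later windows.
        counts = Counter(seq[start:start + window])
        if counts.most_common(1)[0][1] >= min_same:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)
```

For example, `mask_low_complexity("MKPPAPPLDE")` masks the proline-rich stretch, whereas the default setting (`min_same=5, window=8`) leaves the same sequence untouched, which shows how the stricter setting removes more proline- and serine-rich runs.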
SLiMBuild used the default settings, allowing wildcard spacers of up to two consecutive wildcards and extending motifs up to five defined positions [maxwild=2 slimlen=5]. Ambiguous positions were permitted with the following “equivalency groups” of amino acids that could be combined into a single ambiguous position: ILMVF, FYW, FYH, KRH, DE, ST. Each ambiguous motif variant had to occur in at least two unrelated sequences, while each predicted SLiM returned had to occur in at least three unrelated proteins [ambocc=2 minocc=3]. Flexible-length wildcards were not permitted in this analysis [wildvar=F]. The SLiMFinder walltime was disabled [walltime=0] but analyses were limited to datasets with fewer than 1000 proteins [maxseq=1000].
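The constraints on motif shape described above can be expressed as a small validity check. The sketch below is an illustration only: the regular-expression-like motif syntax, the function names, and the rule that an ambiguous position must be a subset of one equivalency group are assumptions, and the occurrence thresholds (ambocc, minocc) are not modelled.

```python
import re

EQUIV_GROUPS = ["ILMVF", "FYW", "FYH", "KRH", "DE", "ST"]
# One motif position: an ambiguous set, a fixed residue, or a wildcard.
TOKEN = re.compile(r"\[([A-Z]+)\]|([A-Z])|(\.)")


def parse_motif(pattern):
    """Split a motif such as 'P.[IL]..[ST]' into per-position tokens."""
    tokens, pos = [], 0
    while pos < len(pattern):
        match = TOKEN.match(pattern, pos)
        if match is None:
            raise ValueError(f"Cannot parse motif at position {pos}: {pattern}")
        tokens.append(match.group(1) or match.group(2) or ".")
        pos = match.end()
    return tokens


def is_allowed_motif(pattern, max_wild=2, max_defined=5):
    tokens = parse_motif(pattern)
    # Assumption: a motif must start and end on a defined position.
    if not tokens or tokens[0] == "." or tokens[-1] == ".":
        return False
    defined = [t for t in tokens if t != "."]
    if len(defined) > max_defined:
        return False
    # No run of consecutive wildcards may exceed max_wild.
    run = 0
    for token in tokens:
        run = run + 1 if token == "." else 0
        if run > max_wild:
            return False
    # Each ambiguous position must fit inside a single equivalency group.
    return all(len(t) == 1 or any(set(t) <= set(g) for g in EQUIV_GROUPS)
               for t in defined)
```

Under these assumptions `P.[IL]..[ST]` is a permitted motif, while `P...P` (three consecutive wildcards) and `[KE]..P` (K and E share no equivalency group) are not.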
FDR = pN/np
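As a worked illustration of this formula, the sketch below computes the estimate for made-up inputs. The symbol definitions do not accompany the formula here, so the reading used is an assumption: p is the significance threshold, N is the number of datasets analysed, and np is taken to be n_p, the number of datasets returning a result at significance p or better. The function name, the cap at 1.0 and the example numbers are likewise hypothetical.

```python
def estimate_fdr(p, n_datasets, n_returned):
    """Estimate FDR as expected chance results (p * N) over observed results.

    Assumed reading: p = significance threshold, n_datasets = N,
    n_returned = n_p (datasets returning a result at threshold p).
    """
    if n_returned <= 0:
        raise ValueError("FDR is undefined when no datasets return results")
    # Capping at 1.0 is a convenience added here, not part of the formula.
    return min(1.0, p * n_datasets / n_returned)


# Hypothetical numbers, purely for illustration: 1000 datasets at p = 0.05
# would yield about 50 chance results; if 200 datasets return motifs, the
# estimated FDR is 50 / 200.
example_fdr = estimate_fdr(0.05, 1000, 200)
```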
• The number of UPCs in the random datasets matched the number of UPCs from the original datasets.
• Each UPC from the original data was used once (i.e. random selection without replacement).
In reality, UPCs from different hubs are not necessarily unrelated and so some of the random UPC datasets will actually be reduced in size compared to the original data (in terms of UPC numbers and, sometimes, even in terms of absolute sequence numbers if spokes from different hub datasets are randomly selected for the same random dataset).
This then produced two random datasets with different attributes and biases: the Random Sequence data tending to have more UPCs than the real data and the Random UPC datasets tending to have fewer. By framing the real data in this fashion, any inherent bias due to the underlying UPC distribution should be apparent.
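The Random UPC construction described above (matching UPC counts, each UPC used exactly once) can be sketched as a shuffle-and-deal procedure. This is a minimal illustration under stated assumptions, not the code used in the study: the input layout (hub mapped to a list of UPCs, each UPC a tuple of protein identifiers), the function name and the `rand_` prefix are all hypothetical.

```python
import random


def random_upc_datasets(real, seed=None):
    """Redistribute UPCs across datasets, preserving each dataset's UPC count.

    `real` maps hub -> list of UPCs; each UPC is a tuple of protein IDs.
    """
    rng = random.Random(seed)
    # Pool every UPC once, then shuffle: selection without replacement.
    pool = [upc for upcs in real.values() for upc in upcs]
    rng.shuffle(pool)

    randomised, start = {}, 0
    for hub, upcs in real.items():
        # Deal out as many UPCs as the original dataset contained.
        randomised[f"rand_{hub}"] = pool[start:start + len(upcs)]
        start += len(upcs)
    return randomised
```

The sketch does not re-cluster the dealt UPCs. As noted above, UPCs drawn from different hubs may be related to one another, so re-clustering a random dataset can leave it with fewer UPCs than its real counterpart.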
This analysis is the largest de novo in silico SLiM prediction in humans and has identified a number of candidates for novel functional motifs and/or motif occurrences. This work represents a major step forward by predicting a number of statistically over-represented and evolutionarily conserved motifs outside of globular domains. While we have confidence that many of these represent motifs of true biological functional significance, the interdependent nature of the data makes categorical functional assignment difficult. Nevertheless, this study clearly identifies a number of areas that can be targeted to further enhance predictions and move closer to the ultimate goal of motif prediction with assigned function.
We conclude that SLiMFinder can be an effective tool for motif discovery. Users should be careful to assemble protein datasets that are enriched for direct interactors of the proposed motif binding protein. Ideally, such analyses will be performed in conjunction with data specifically supporting at least one such direct interaction, which can be used as a filter for motif predictions. This information could be provided a priori, using the “query” option in SLiMFinder to restrict results to a specific protein or disordered region. Alternatively, coupling the in silico predictions to in vitro interaction screening will greatly increase the power of such analyses. The extensive database of results provided here represents a starting point for exploring motifs in human proteins. Users can traverse this database in terms of a protein of interest, a motif of interest, or a domain of interest. While the most strongly significant motifs generally represent previously well known motifs, this should not come as a surprise: the most complete interaction datasets represent well-studied proteins and experimental protein biochemists are adept at interpreting evidence for motifs in such sequences; thus, computational predictions are likely to be more useful for motifs for which the evidence is less glaringly obvious. These predicted motifs can then be further explored in follow-up experiments to validate their significance.
Footnotes
† Published as part of a Molecular BioSystems themed issue on Intrinsically Disordered Proteins: Guest Editor M Madan Babu.
‡ Electronic supplementary information (ESI) available. See DOI: 10.1039/c1mb05212h
This journal is © The Royal Society of Chemistry 2012