Open Access Article
Faraz K. Mardakheh‡*a, Heba Z. Sailem‡ab, Sandra Kümper a, Christopher J. Tape ac, Ryan R. McCully a, Angela Paul a, Sara Anjomani-Virmouni a, Claus Jørgensen d, George Poulogiannis a, Christopher J. Marshall§a and Chris Bakal*a
aInstitute of Cancer Research, Division of Cancer Biology, 237 Fulham Road, London SW3 6JB, UK. E-mail: chris.bakal@icr.ac.uk; mardakheh@icr.ac.uk
bInstitute of Biomedical Engineering, University of Oxford, Old Road Campus Research Building, Oxford OX3 7DQ, UK
cDepartment of Biological Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
dCancer Research UK Manchester Institute, University of Manchester, Wilmslow Road, Manchester M20 4BX, UK
First published on 1st November 2016
Localisation and protein function are intimately linked in eukaryotes, as proteins are localised to specific compartments where they come into proximity of other functionally relevant proteins. Significant co-localisation of two proteins can therefore be indicative of their functional association. We here present COLA, a proteomics based strategy coupled with a bioinformatics framework to detect protein–protein co-localisations on a global scale. COLA reveals functional interactions by matching proteins with significant similarity in their subcellular localisation signatures. The rapid nature of COLA allows mapping of interactome dynamics across different conditions or treatments with high precision.
Recently, a novel proteomics-based approach has been utilised to reveal interactions by assessing the co-behaviour of interacting proteins when biochemically fractionated.10 Havugimana et al. used three parallel chromatography methods in conjunction with proteomics, separating soluble cellular complexes based on charge, pKa, and density.11 Kristensen et al., on the other hand, used size-exclusion chromatography in conjunction with proteomics to separate complexes by size.12 Proteins with similar elution profiles were then matched as likely constituents of the same complexes. The key advantage of these approaches is that they allow simultaneous determination of biochemical interactions from a fixed number of mass spectrometry runs.12,13 However, a downside of both approaches is that they are limited to soluble proteins, and are therefore not well suited for detecting insoluble complexes. Moreover, while matching proteins based on biochemical co-fractionation can reveal strong associations that survive such fractionations, many physiologically relevant functional interactions occur only transiently, and are thus often missed during stringent biochemical separations.9
Eukaryotic cells are highly compartmentalised assemblies of organelles, macro-molecular complexes, and spatially organised subcellular functional regions. As a result, where a protein is localised inside a eukaryotic cell can be indicative of its function. More importantly, colocalisation of two or more proteins can be indicative of their functional interaction.14 We hypothesised that, similar to multi-variate phenotypic signatures used in high-throughput genetic interaction analyses,2 quantitative multi-dimensional subcellular localisation signatures can be used to match functionally interacting proteins on the basis of their colocalisation. We here present COLA, a streamlined proteomics–bioinformatics strategy to infer functional interactions from significant similarities in subcellular localisation patterns. COLA uses complete subcellular fractionation in conjunction with quantitative proteomics to generate a quantitative, multi-dimensional, subcellular localisation signature for each identified cellular protein. Bootstrapped hierarchical clustering is then used to match proteins with significant similarity in their localisation signatures. Crucially, COLA is not limited to soluble protein complexes, and can reveal functional interactions on a global scale based on subcellular proximity with high Precision and Sensitivity. Finally, by utilising Tandem-Mass-Tagging for quantitative profiling of different subcellular fractions, we developed a multiplexed version of COLA, named iCOLA, that can be used to rapidly map interactomes across different conditions and treatments, thereby revealing interactome dynamics.
840) was used with modification. Briefly, cells were scraped off (1 × 15 cm dish) in cold PBS and pelleted before being solubilised serially into 5 fractions according to the kit's protocol (cytosol, membrane, nuclear soluble, nuclear chromatin, and cytoskeleton). A pellet remaining at the end of the procedure was also solubilised in 2% SDS, Tris-pH 7.6, which constituted a second cytoskeleton fraction.
000 × g for 30 min (4 °C) to pellet the cellular membranes. The remaining supernatant (cytosol + microsomes) was taken away, and the pellet was resuspended in 200 μl of the ‘Upper Phase’ aqueous biphasic extraction solution, according to the kit's protocol. An equal volume of ‘Lower Phase’ solution (200 μl) was then added to the mix, vortexed thoroughly, and incubated on ice for 5 minutes, before being centrifuged at 1000 × g for 5 minutes to separate the two phases. In parallel, a fresh tube of mixed upper and lower phase solutions without any sample was similarly prepared and centrifuged to separate the two phase solutions. The upper phase of the tube with samples was then carefully taken away from the lower phase and put in a new tube. The two phases were then extracted again by adding 100 μl of the separated lower or upper phase solutions from the tube without samples to each (lower to upper and vice versa) as before (mixing thoroughly, incubating on ice for 5 minutes, before centrifugation at 1000 × g for 5 minutes). The second separated upper phase from the initial upper phase (plasma membrane), and the second separated lower phase from the initial lower phase (intracellular membranes), were then moved to new tubes, diluted in 5× volume of ice-cold water, and kept on ice for 5 minutes to precipitate the extracted proteins. The proteins were then pelleted by centrifugation at 16 000 × g in a micro-centrifuge for 10 minutes (4 °C). The supernatants were then removed and discarded and the pellets (intracellular or plasma membrane fractions) were solubilised in 2% SDS, Tris-pH 7.6. Next, protein concentrations of the fractions were measured by BCA assay (Pierce), and balanced.
000g for 10 min to clear any cell debris, followed by concentrating ∼20 fold using Amicon ultra centrifugal filter units (10 kDa cut-off), and solubilising the concentrated proteins in 2% SDS, Tris-pH 7.6. As a whole cell lysate control, the remaining cells after removal of the conditioned media were directly lysed by 2% SDS, Tris-pH 7.6. Again, protein concentrations for both extracellular fraction and the whole cell lysates were measured by BCA assay (Pierce), followed by balancing.
:4 to 50:50 buffer A:B (t = 0 min 4% B, 0.5 min 4% B, 40.0 min 10% B, 170.0 min 25% B, 240.0 min 50% B) (buffer A: 1% acetonitrile/3% dimethyl sulfoxide/0.1% formic acid; buffer B: 80% acetonitrile/3% dimethyl sulfoxide/0.1% formic acid) at 250 nL min⁻¹. Peptides were ionised by electrospray ionisation using 1.8 kV applied immediately pre-column via a microtee built into the nanospray source. Sample was injected into an LTQ Velos Orbitrap mass spectrometer (Thermo Fisher Scientific, Hemel Hempstead, UK) directly from the end of the tapered tip silica column (6–8 μm exit bore). The ion transfer tube was heated to 275 °C and the S-lens set to 60%. MS/MS were acquired using data dependent acquisition based on a full 30 000 resolution FT-MS scan with preview mode disabled. The top 20 most intense ions were fragmented by collision-induced dissociation and analysed using normal ion trap scans. Precursor ions with unknown or single charge states were excluded from selection. Automatic gain control was set to 1 000 000 for FT-MS and 30 000 for IT-MS/MS, full FT-MS maximum inject time was 500 ms and normalised collision energy was set to 35% with an activation time of 10 ms. Wideband activation was used to co-fragment precursor ions undergoing neutral loss of up to −20 m/z from the parent ion, including loss of water/ammonia. MS/MS was acquired for selected precursor ions with a single repeat count acquired after 8 s delay followed by dynamic exclusion with a 10 ppm mass window for 60 s based on a maximal exclusion list of 500 entries.
000 at m/z 400 and FT target value of 1 × 10⁶ ions. The 20 most abundant ions were selected for MS2 fragmentation (isolation window 1.2 m/z) using collision-induced dissociation (CID), dynamically excluded for 30 seconds, and scanned in the ion trap at 30 000 at m/z 400. MS3 multi-notch isolated ions (10 notches)18 were fragmented using higher-energy collisional dissociation (HCD) and scanned in the Orbitrap (from m/z 100–500) at 60 000 at m/z 400. For accurate mass measurement, the lock mass option was enabled using the polydimethylcyclosiloxane ion (m/z 445.12003) as an internal calibrant. Four serial technical replicate injections were performed per TMT sample set to boost the identification coverage.
(1) STRING-all interactions:23 all STRING interactions were downloaded (Oct 2014). STRING gene IDs were mapped to the corresponding gene IDs using the UNIPROT ID mapping tool. Interactions with at least medium confidence (combined score >0.4) were considered, where the score is based on neighbourhood, gene fusion, co-occurrence, co-expression, experiments (physical interactions), databases and text mining methods.
(2) STRING-physical interactions:23 interactions that have experimental evidence in STRING.
(3) Pathway Commons:24 based on Pathway Commons 7 (May 2015) with exclusion of BioGrid as this database includes the studies we used for benchmarking our methods against (see below).
(4) CORUM25 (protein complex database): gene IDs were mapped using UNIPROT.
The overlap was calculated as the percentage of identified interactions in COLA or iCOLA that were also reported in the above-mentioned databases. The significance of the overlap was calculated using a right-tailed Fisher's exact test (R) and hypergeometric probability. For the number of interactions in the reference databases, only interactions between proteins that were quantified in our fractionation experiments were considered. We benchmarked our method against Kristensen et al.12 (7204 interactions) and Rolland et al.6 (13 944 interactions). To calculate the significance of overlap for these studies, the number of interactions in each reference database was restricted to the proteins that were identified by the respective method.
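The overlap test described above can be sketched as a right-tailed hypergeometric probability, which corresponds to a one-sided Fisher's exact test on the matching 2 × 2 table. This is a minimal illustration and not the authors' R code: the function names and the toy numbers in the usage note are assumptions, and the sum is evaluated in log space so that individual terms stay representable for large counts.

```python
import math

def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def overlap_pvalue(total_pairs, known_pairs, reported_pairs, overlap):
    """Right-tailed hypergeometric p-value, P(X >= overlap).

    total_pairs    -- candidate pairs among the quantified proteins
    known_pairs    -- candidate pairs listed in the reference database
    reported_pairs -- pairs reported by the method
    overlap        -- reported pairs that are also in the database
    """
    # k cannot drop below the number of reported pairs that must be known
    start = max(overlap, reported_pairs - (total_pairs - known_pairs), 0)
    stop = min(known_pairs, reported_pairs)
    log_denominator = log_comb(total_pairs, reported_pairs)
    p = 0.0
    for k in range(start, stop + 1):
        log_term = (log_comb(known_pairs, k)
                    + log_comb(total_pairs - known_pairs, reported_pairs - k)
                    - log_denominator)
        p += math.exp(log_term)
    return min(p, 1.0)
```

For example, `overlap_pvalue(10, 5, 5, 5)` evaluates the probability that all five reported pairs are known when half of ten candidate pairs are known, which is 1/252.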
To maximise acquisition of novel, digitised information on subcellular protein distributions that would be suitable for hierarchical clustering, we used four parallel independent fractionation procedures with highly distinct individual fractions (Fig. 1B). The majority of subcellular fractions came from two main fractionation procedures, one based on serial solubilisation using successive solubilising buffers, and the other based on serial centrifugation (Fig. 1B). Both of these procedures are fast, reproducible, require little optimisation, and cover major subcellular compartments. To further expand the subcellular information, we also collected two additional independent fractions, the actin-rich cellular protrusions and the extracellular compartment (Fig. 1B). Protrusions are purified using transwell-based physical separation of cell protrusions from the cell-body,15 and the extracellular compartment is collected by removing and concentrating conditioned media (Fig. 1B). Overall, we identified 4950 proteins from human Retinal Pigment Epithelial (RPE) cells, out of which 1886 had a full subcellular localisation profile with no missing values (Dataset S1, ESI†). The quality of fractionations was examined by western blotting for markers of specific fractions (Fig. 1C–F), as well as category enrichment analysis using the Gene Ontology Cellular Compartment (GOCC) database (Table 1 and Dataset S2, ESI†). To assess the reproducibility of fractionations, we determined Pearson's correlation coefficients across all fraction replicates. While the correlation coefficient within fraction replicates was on average 0.65, suggestive of good reproducibility of fractionations, the coefficient between different fractions was only 0.1 on average (Fig. 1G and H), indicating that fractions provide unique information towards the localisation signatures. The distribution and variance of different fractions were comparable (Fig. 1I), and importantly, no single fraction contributed disproportionately to the overall variance (Fig. 1J), ruling out potential bias towards a specific compartment in downstream analysis. The correlation between the overall subcellular localisation signatures across two independent biological replicates was calculated to be 0.70 (Fig. 1K).
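The comparison of within-replicate and between-fraction correlations can be sketched as follows. The data layout is an assumption made for illustration: a dictionary mapping each fraction name to a list of replicate profiles, where every profile lists per-protein ratios in the same protein order. The fraction names and values in the usage note are placeholders, not measurements from this study.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def mean_correlations(fractions):
    """Return (mean within-fraction r, mean between-fraction r)."""
    within, between = [], []
    # Replicates of the same fraction
    for replicates in fractions.values():
        within += [pearson(a, b) for a, b in combinations(replicates, 2)]
    # Every replicate of one fraction against every replicate of another
    for f1, f2 in combinations(fractions, 2):
        between += [pearson(a, b)
                    for a in fractions[f1] for b in fractions[f2]]
    return sum(within) / len(within), sum(between) / len(between)
```

A call such as `mean_correlations({"cytosol": [[1, 2, 3, 4], [2, 4, 6, 8]], "membrane": [[4, 3, 2, 1], [8, 6, 4, 2]]})` returns a high within-fraction value and a low between-fraction value for this toy input.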
| Fraction | Significantly enriched GOCC terms (FDR < 0.02) |
|---|---|
| 1 | Cytosol; intracellular |
| 2 | Membrane part; integral component of plasma-membrane; organelle membrane |
| 3 | Nuclear part; nucleoplasm part; transcription factor complex; chromatin remodelling complex; chromatin |
| 4 | Nuclear part; protein–DNA complex; nucleosome; chromosome; nuclear chromosome part; nuclear body |
| 5 | Intermediate filament; nuclear membrane; nucleolus |
| 6 | Intermediate filament; nuclear membrane; nucleolus |
| 7 | Protein–DNA complex; chromatin remodelling complex; nucleoid; organelle part; membrane part |
| 8 | Organelle membrane; membrane enclosed lumen; respiratory chain complex |
| 9 | Plasma membrane; cell junction; coated pit; endosome membrane; membrane part; protein–DNA complex |
| 10 | Cytosolic small ribosomal subunit; cytosolic large ribosomal subunit; proteasome complex; MCM complex |
| 11 | Actin cytoskeleton; ruffle; cell projection; cell junction; adherens junction; synapse; plasma membrane |
| 12 | Extracellular space; extracellular matrix; basement membrane; secretory granule lumen |
To detect proteins that have significantly similar localisation signatures, and are therefore expected to functionally interact, we used hierarchical clustering (Fig. 1A). Hierarchical clustering is well suited to multi-variate assessment of functional interactions as it matches proteins into discrete functional units.3 However, a common shortcoming of standard hierarchical clustering is its sensitivity to sample order, and the variation in clustering results depending on which samples are included.22 To ensure the significance and robustness of our clustering to permutations, we performed bootstrapped clustering to reveal proteins that group together with high confidence, and are thus likely to be truly co-localising (Fig. 1A). We used three different bootstrapping stringency cut-offs of p-value <0.05, 0.01, or 0.001, revealing 365, 271, or 101 bootstrapped clusters (Dataset S3, ESI†), which correspond to 4415, 3087, or 1487 pair-wise co-localisations, respectively (Dataset S4, ESI†). While over 50% of identified bootstrapped clusters contain only 2 or 3 proteins, clusters of 10 or more proteins were also detected (Fig. 1L), suggesting that COLA can detect both small and large complexes.
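The step from bootstrapped clusters to pair-wise co-localisations can be sketched as below: a cluster of n proteins contributes n(n − 1)/2 unordered pairs. The function name and input format (a list of clusters, each a list of protein identifiers) are assumptions for illustration, and the identifiers in the usage note are placeholders.

```python
from itertools import combinations

def clusters_to_pairs(clusters):
    """Return the set of unordered protein pairs implied by cluster membership.

    Each pair is stored as a sorted tuple so that (A, B) and (B, A)
    are counted once.
    """
    pairs = set()
    for members in clusters:
        for a, b in combinations(sorted(set(members)), 2):
            pairs.add((a, b))
    return pairs
```

For instance, `clusters_to_pairs([["A", "B", "C"], ["D", "E"]])` yields three pairs from the first cluster and one from the second.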
To verify that COLA predicts real interactions, we assessed its performance in detecting known interactions using three major mammalian functional interaction databases as reference: (1) CORUM, a highly curated database of mammalian protein complexes;25 (2) STRING, a larger database of both physical and functional associations;23 and (3) Pathway Commons, a comprehensive collection of functional and physical interactions integrated from multiple publicly available databases.24 We quantified the proportion of co-localising proteins in our method that appear as known functional interactors in each database (Dataset S4, ESI†). More than half of the COLA-identified interactions are annotated as known interactions in the Pathway Commons database, the largest collection of functional protein interactions (Fig. 2A). Moreover, around one in five COLA interactions are annotated as known in the STRING database (Fig. 2B), and a similar percentage of overlap with the CORUM interaction database was also detected (Fig. 2C). All of these degrees of overlap are statistically highly significant (p-value <1 × 10⁻³⁰⁰). For comparison, we also evaluated the performance of two published large-scale studies which used different approaches to map human protein–protein interactions as a proxy for functional associations: (1) Rolland et al., which used Y2H screening,6 and (2) Kristensen et al., which used size-exclusion chromatography coupled with proteomics (SEC-MS).12 Although both studies detected a significant proportion of known interactions, they were consistently outperformed by COLA in every database comparison (Fig. 2A–C). Interestingly, the three approaches overlapped very little with each other despite their significant overlap with the reference databases, suggesting that each method reveals complementary information with regard to functional associations (Fig. 2D). These results demonstrate that significant similarity in protein localisation signatures is strongly reflective of a functional interaction, and that COLA outperforms Y2H and SEC-MS methods in revealing functional interactions.
Fig. 2 COLA reveals known functional associations and outperforms two current global interactome analysis studies. (A) Percentage of overlap with Pathway Commons between the identified interactions in each bootstrap significance setting, vs. in Kristensen et al.,12 vs. in Rolland et al.6 ***: Fisher's exact test p < 1 × 10⁻³⁰⁰; **: Fisher's exact test p < 2 × 10⁻¹⁵⁰; *: Fisher's exact test p < 1 × 10⁻⁷⁵. (B) As in (A) but for overlap with STRING (darker blue indicates physical interactions only). (C) As in (A) but for overlap with CORUM. (D) Venn diagram of the overlap between binary protein–protein interactions revealed by COLA versus Kristensen et al., size-exclusion chromatography profiling12 and Rolland et al., Y2H screening.6 Only 2 interactions are shared across the three methods.
Fig. 3 Analysis of interactome dynamics by iCOLA. (A) Outline of the iCOLA methodology. 9 fractions from serial solubilisation and serial centrifugation protocols, along with a 2% SDS solubilised whole cell lysate total control, were digested and isobarically labelled using TMT 10-plex kit as indicated. The labelled peptides were then pooled together and analysed by LC-MS3. Averaged normalised fraction/lysate ratios for every fraction were used to create a multi-variate subcellular localisation signature for each protein (T-complex protein 1 subunit alpha was used as example here). Signatures were then subjected to unsupervised hierarchical clustering with Euclidean average linkage and bootstrapping was used to reveal clustering matches with high confidence (in color) from the rest (blacked out). All members of the TCP1 chaperonin ring complex (CCT1 to 8) are detected as significant interactors of CCT1, and are shown as an example of a bootstrapped cluster. (B) Graph of percentage of total bootstrapped clusters (p < 0.05) vs. the number of proteins per cluster from iCOLA analysis of A375P cells. The majority of clusters are constituted of 2–4 proteins, yet very large clusters are still detectable by iCOLA. (C) Heat map of Pearson correlation coefficients between the two replicate iCOLA series of fractionations. Cells were fractionated in duplicate. Collected fractions show high similarity with their corresponding replicate, but low similarity with other fractions. (D) Plotted averaged Pearson's correlation coefficients within replicate fractions versus averaged Pearson's correlation coefficients between different fractions. While a high degree of similarity exists within replicate iCOLA fractions suggestive of high reproducibility, similarity between different fractions is very low, indicating that each fraction is likely providing unique information. (E) Analysis of the reproducibility of the overall localisation signatures between two biological replicate iCOLA experiments. Euclidean distances between the signatures from two independent iCOLA fractionation experiments in A375P cells were calculated and plotted against each other, showing a highly significant correlation (p < 1.0 × 10⁻¹⁵). The Pearson correlation coefficient (CC) is displayed on the graph. (F) Comparison of the percentage of known interactions according to Pathway Commons that were detected in A375P and A375M2 cells (p < 0.05). Percentage of overlap with Pathway Commons was very similar in A375P and A375M2 cells, and comparable to SILAC based COLA (Fig. 2A). ***: Fisher's exact test p < 1 × 10⁻³⁰⁰; **: Fisher's exact test p < 2 × 10⁻²⁰⁰. (G) Comparison of the percentage of known interactions according to STRING that were detected in A375P and A375M2 cells (p < 0.05). Light blue bars show all STRING interactions. Dark blue bars show only physical interactions. Percentage of overlap with STRING was very similar in A375P and A375M2 cells, and comparable to SILAC based COLA (Fig. 2B). ***: Fisher's exact test p < 1 × 10⁻³⁰⁰; **: Fisher's exact test p < 2 × 10⁻²⁰⁰. (H) Comparison of the percentage of known interactions according to CORUM that were detected in A375P and A375M2 cells (p < 0.05). Percentage of overlap with CORUM was very similar in A375P and A375M2 cells, and comparable to SILAC based COLA (Fig. 2C). ***: Fisher's exact test p < 1 × 10⁻³⁰⁰; **: Fisher's exact test p < 2 × 10⁻²⁰⁰. (I) Venn diagram of the overlap between binary protein–protein interactions detected in A375P and A375M2 cells (bootstrap cut off = 0.05). A core of 1269 interactions was conserved between the two isogenic cell lines, while over 3000 unique interactions were detected in each cell-line. (J) Analysis of mitochondrial respiratory flux in A375P and A375M2 cells. Oxygen consumption (OCR) was measured in real-time, with serial addition of oligomycin to inhibit ATP synthase, FCCP to uncouple oxygen consumption from ATP production, and rotenone/antimycin (R/A) to completely inhibit electron transport chain, at indicated timepoints. Values were normalised to total seeded cell numbers. A375P show a significantly higher basal mitochondrial respiration (blue arrow), as well as a higher maximal mitochondrial respiratory capacity (red arrow), while levels of non-mitochondrial oxygen consumption measured after R/A addition (three ending time points) are equal between both cells.
Next, we applied iCOLA to reveal interactome differences between our previously analysed A375P cells, which are weakly metastatic, and their highly metastatic isogenic derivative, the A375M2 cells.30 We identified 1442 proteins with complete subcellular localisation profiles in A375M2 cells (Dataset S8, ESI†). At the bootstrap cut-off of p-value <0.05, a total of 279 bootstrap clusters were revealed for A375M2 cells (Dataset S9, ESI†), corresponding to 4779 pair-wise co-localisations (Dataset S10, ESI†). First, to test whether the performance of iCOLA is similar to COLA in terms of identifying true functional interactions in both cell-types, we assessed the overlap of the identified co-localisations with CORUM, STRING, and Pathway Commons interaction databases, as before (Datasets S7 and S10, ESI†). A highly significant proportion of the revealed co-localisations were amongst known functional interactors in both A375P and A375M2 cells (Fig. 3F–H), with the degree of overlap being comparable to that of the SILAC based COLA method, suggesting that reducing the number of fractions from 12 to 9 does not significantly affect the ability of our approach to reveal true interactions. Next, we assessed the degree of overlap between the interactomes of the two cell-lines. 1269 of the interactions identified in total were seen in both A375P and A375M2 cells, whilst more than 3000 interactions were detected in only one cell-type (Fig. 3I). Category enrichment analysis revealed that most conserved interactions belonged to core cellular complexes such as the nucleosome, chaperonin complex, and the ribosome, suggesting that these core interactions do not change much across the two cell-types (Dataset S11, ESI†). In contrast, mitochondrial protein complexes were significantly enriched amongst proteins with changing interactions (Table 2 and Dataset S11, ESI†), suggestive of a substantial rewiring of the mitochondrial interactome between the two cell types. As a result, we hypothesised that mitochondrial activity is likely to be significantly altered between the two cell types. Accordingly, both basal and spare mitochondrial respiratory capacity were significantly reduced in A375M2 cells compared to A375P cells (Fig. 3J). Collectively, these results demonstrate that iCOLA can be used for comparison of functional interactomes between different conditions, and that variations between interactomes can inform on functional differences between different cellular settings.
| Category database | Category name |
|---|---|
| Corum | 55S ribosome, mitochondrial |
| Keywords | Ligase |
| GSEA | RESPIRATORY_ELECTRON_TRANSPORT |
| GOBP | Coenzyme metabolic process |
| GOBP | Cofactor metabolic process |
| GOCC | Mitochondrial matrix |
| Keywords | Mitochondrion |
| Keywords | Transitpeptide |
| GOCC | Mitochondrial part |
| GSEA | TCA_CYCLE_AND_RESPIRATORY_ELECTRON_TRANSPORT |
| GSEA | WONG_MITOCHONDRIA_GENE_MODULE |
| GSEA | MOOTHA_MITOCHONDRIA |
| GOBP | Oxidation–reduction process |
| GSEA | MOOTHA_HUMAN_MITODB_6_2002 |
| GSEA | MITOCHONDRION |
| GOCC | Mitochondrion |
| GSEA | LEE_BMP2_TARGETS_DN |
| GOBP | Small molecule metabolic process |
500 proteins in human genome, the total size of the possible interactome is equal to 22 500 × 22 500/2 = 253 125 000. In COLA, we identified 1886 proteins, meaning that 1886 × 1886/2 = 1 778 498 possible interactions were tested, which is ∼0.7% of the total possible interactome space. This is comparable to Kristensen et al.'s SEC-MS method for mining of interactions,12 but is less than most Y2H screens, which cover a larger ORFome space.6,27
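The interactome-space arithmetic above can be reproduced directly. The sketch follows the n × n/2 convention used in the text (the exact number of unordered pairs, n(n − 1)/2, differs only marginally at this scale); the function name is an illustrative assumption. With the two protein counts quoted above, the tested fraction of the space evaluates to roughly 0.7%.

```python
def interactome_space(n_proteins):
    """Approximate number of candidate protein pairs (n * n / 2 convention)."""
    return n_proteins * n_proteins // 2

total_space = interactome_space(22500)   # whole-proteome pair space
tested_space = interactome_space(1886)   # pairs among fully profiled proteins
coverage_percent = 100 * tested_space / total_space

print(total_space, tested_space, round(coverage_percent, 2))
```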
In a binary interactome analysis, assuming the null hypothesis (H0) equates to proteins A and B not interacting (false interaction), and the alternative hypothesis (H1) to A and B interacting (true interaction), various types of possible interactions can be defined, which are listed in Table 3.
| | H0 is correct | H1 is correct | Sum |
|---|---|---|---|
| Reported interactions | V (false positives) | S (true positives) | R |
| Unreported interactions | U (true negatives) | T (false negatives) | R′ |
| Sum | M′ (total false interactions) | M (total interactions) | A |

Note: V + S = R; U + T = R′; V + U = M′; S + T = M; M′ + M = A.
Assuming that the null hypothesis (H0) in a protein–protein interaction detection assay is no interaction, and the alternative hypothesis (H1) is existence of an interaction, a global binary interactome analysis method can report two types of interaction: true positive (S), and false positive (V). The total number of reported interactions (R) is therefore the sum of S and V. Conversely, the unreported interactions consist of true negatives (U) and false negatives (T), with the total number of unreported interactions (R′) being the sum of U and T. In addition, all true interactions in the interactome space (M), whether reported or not by the method, can be defined as the sum of S and T. Similarly, all false interactions in the interactome space (M′), whether reported or not, can be defined as the sum of V and U. Finally, the total size of the hypothetical interactome space (A) can be defined as the sum of M and M′, or of R and R′.
Accordingly, FPR, Sensitivity, and Precision can be defined as a function of these types of interactions:
• Sensitivity = S/M
• FPR = V/M′
• Precision = S/R
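These three definitions translate directly into code. The sketch below uses the symbols of Table 3 as field names; the class name and the counts in the usage note are illustrative assumptions rather than values from this study.

```python
from dataclasses import dataclass

@dataclass
class InteractionCounts:
    V: int  # reported, H0 correct (false positives)
    S: int  # reported, H1 correct (true positives)
    U: int  # unreported, H0 correct (true negatives)
    T: int  # unreported, H1 correct (false negatives)

    @property
    def sensitivity(self):
        """S / M, where M = S + T is the number of true interactions."""
        return self.S / (self.S + self.T)

    @property
    def fpr(self):
        """V / M', where M' = V + U is the number of false interactions."""
        return self.V / (self.V + self.U)

    @property
    def precision(self):
        """S / R, where R = S + V is the number of reported interactions."""
        return self.S / (self.S + self.V)
```

For example, `InteractionCounts(V=10, S=30, U=950, T=10)` has a Sensitivity of 0.75, an FPR of 10/960, and a Precision of 0.75.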
CORUM is a highly curated database of well-known protein interactions, which can be regarded as almost ‘True’.25 To estimate Sensitivity, we can therefore simply calculate the proportion of CORUM interactions that were reported by COLA, for the list of identified input proteins. COLA and iCOLA have a Sensitivity of ∼3% to 6%, depending on the bootstrapping cut-off (Fig. 4A). In comparison, based on CORUM, Kristensen et al.'s SEC-MS method12 and Rolland et al.'s Y2H study6 had Sensitivities of ∼8% and ∼2%, respectively (Fig. 4A). Thus, COLA is roughly similar in terms of its Sensitivity to both of these existing methods.
Fig. 4 Analysis of Sensitivity, FPR, and Precision for COLA and iCOLA. (A) Comparison of Sensitivity, also termed Recall, between COLA (at bootstrapping cut off p-values of 0.05, 0.01, and 0.001), iCOLA (at cut off p-value of 0.05), Kristensen et al.'s SEC-MS method,12 and Rolland et al.'s Y2H screen.6 Sensitivity was defined as percentage of interactions present in CORUM that were identified by each method. (B) Comparison of FPR between COLA (at bootstrapping cut off p-values of 0.05, 0.01, and 0.001), iCOLA (at cut off p-value of 0.05), Kristensen et al.'s SEC-MS method,12 and Rolland et al.'s Y2H screen.6 FPR was defined as the percentage of interactions from a list of randomly selected 200 binary interactions not previously reported in any database (see Materials and methods) which were identified by each method. The sampling of 200 unknown interactions was performed 1000 times, and percentage values were averaged and displayed. (C) An alternative comparison of FPR, between COLA (at bootstrapping cut off p-values of 0.05, 0.01, and 0.001), iCOLA (at cut off p-value of 0.05), Kristensen et al.'s SEC-MS method,12 and Rolland et al.'s Y2H screen.6 FPR was defined as percentage of interactions present in the anti-CORUM dataset (see Materials and methods) that were identified by each method. (D) Comparison of Precision between COLA (at bootstrapping cut off p-values of 0.05, 0.01, and 0.001), iCOLA (at cut off p-value of 0.05), Kristensen et al.'s SEC-MS method,12 and Rolland et al.'s Y2H screen.6 As the total size of the hypothetical interactome space (A), as well as the total number of reported (R) and unreported (R′) interactions are known for each method, the measures of Sensitivity (S/M) and FPR (V/M′) can be used to estimate S, M, M′, T, and U values. Precision was then calculated as S/R, and displayed as a percentage value. COLA and iCOLA have comparable or slightly better Precision than Kristensen et al.'s SEC-MS method, but both vastly outperform Rolland et al.'s Y2H.
Estimation of FPR is difficult because no reference database of false interactions (protein–protein interactions that definitely do not occur) exists. To circumvent this problem, we used two alternative approaches to generate lists of interactions that are likely to be false. In the first approach, we generated a list of 200 random protein pairs that were not reported to interact in any known protein interaction database (and are thus at least enriched in false interactions compared to the background), and calculated the percentage of these pairs that were detected by COLA and iCOLA. This method has been utilised by Vidal and colleagues to estimate the FPR of various Y2H screens, which were reported at 0.5 to 2%.27 The downside of such an approach is the potential error introduced by random sampling. To counteract such bias, we repeated our sampling of the 200 unreported interactions a thousand times, calculating the FPR each time and averaging the resulting values. Based on this approach, COLA and iCOLA were estimated to have an FPR of 0.02 to ∼0.1% depending on the bootstrapping cut-off (Fig. 4B). Using the same estimation strategy, the FPR of Kristensen et al.'s SEC-MS was calculated to be ∼0.03% while Rolland et al.'s Y2H had an FPR of ∼0.1% (Fig. 4B).
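The repeated-sampling estimate described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: pairs are represented as sorted tuples of protein identifiers, `known_pairs` stands for the union of all reference databases, and the function name, default seed, and toy inputs are hypothetical.

```python
import random
from itertools import combinations

def sampled_fpr(proteins, known_pairs, reported_pairs,
                n_pairs=200, n_rounds=1000, seed=0):
    """Average percentage of randomly drawn unreported pairs that a method detects.

    All pair sets must hold sorted tuples, e.g. ("A", "B"), never ("B", "A").
    """
    rng = random.Random(seed)
    # Candidate likely-false pairs: not annotated in any reference database
    unreported = [pair for pair in combinations(sorted(proteins), 2)
                  if pair not in known_pairs]
    if not unreported:
        raise ValueError("no unreported pairs to sample from")
    size = min(n_pairs, len(unreported))
    rates = []
    for _ in range(n_rounds):
        sample = rng.sample(unreported, size)
        hits = sum(1 for pair in sample if pair in reported_pairs)
        rates.append(hits / size)
    return 100 * sum(rates) / len(rates)
```

Averaging over many rounds reduces the dependence of the estimate on any single random draw, which is the motivation given in the text for repeating the sampling.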
A major issue with the aforementioned strategy is that many unreported interactions may be as yet undiscovered true interactions rather than false ones. This is particularly likely if a given protein is not well studied in terms of its interactions, and is thus poorly annotated in interaction databases. As an alternative approach to generate a library of interactions that are likely to be false, we therefore used a method based on a strategy recently proposed by Foster and colleagues.31 The reasoning is that, as CORUM is a highly curated database of well-known protein interactions, the interactions of a protein that is already annotated in CORUM are likely to be better defined; if two such CORUM-annotated proteins are not reported to interact with each other, the pair is more likely to be a false interaction. Based on this rationale, we created a library of possible false interactions, which we named anti-CORUM (proteins which are listed in CORUM but not known to interact), and used it to estimate the FPR by calculating the percentage of these anti-CORUM interactions that were identified by COLA and iCOLA. Depending on the bootstrapping cut-off stringency, the FPR of COLA and iCOLA was estimated at 0.02 to ∼0.19% (Fig. 4C). Using the same strategy, the FPR for Kristensen et al.'s SEC-MS was ∼0.25% while Rolland et al.'s Y2H had an FPR of ∼0.14% (Fig. 4C). Thus, the FPR of COLA and iCOLA is better than that of SEC-MS, and better than or comparable with that of Y2H, depending on the bootstrapping cut-off.
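The anti-CORUM construction can be sketched as below. The input format (a list of complexes, each a list of protein identifiers) and the function names are assumptions for illustration; in practice the proteins would presumably also be restricted to those quantified by the method under test, which the sketch leaves to the caller.

```python
from itertools import combinations

def build_anti_corum(complexes):
    """Pairs of complex-annotated proteins that never share a complex."""
    annotated = set()
    co_complex = set()
    for members in complexes:
        members = sorted(set(members))
        annotated.update(members)
        co_complex.update(combinations(members, 2))
    # Pairs are sorted tuples, matching the representation in co_complex
    return {pair for pair in combinations(sorted(annotated), 2)
            if pair not in co_complex}

def percent_detected(reference_pairs, reported_pairs):
    """Percentage of reference pairs that appear among the reported pairs."""
    if not reference_pairs:
        return 0.0
    return 100 * len(reference_pairs & reported_pairs) / len(reference_pairs)
```

Applied to the anti-CORUM set, `percent_detected` gives an FPR estimate; applied to the co-complex pairs themselves, the same function gives the Sensitivity estimate.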
Finally, using the anti-CORUM estimated Sensitivity and FPR, we estimated V, S, U, and T values (Table 3), which allowed calculation of Precision for COLA and iCOLA. Depending on the bootstrapping cut-off stringency, the Precision of COLA and iCOLA was calculated at ∼61 to 79%. In comparison, SEC-MS had a Precision of ∼65%, while Y2H had a Precision of ∼10% (Fig. 4D). Thus, COLA and iCOLA have comparable or slightly better Precision than Kristensen et al.'s SEC-MS, and vastly outperform Rolland et al.'s Y2H screen in terms of Precision, which is the key measure of assay specificity. Collectively, these results demonstrate that significant similarity in protein localisation signatures can be confidently used to reveal interactions.
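The definitions of V, S, U, and T are given in Table 3 and are not reproduced here. As a generic illustration of why Precision depends so strongly on the FPR, the sketch below uses the standard relationship between Precision, Sensitivity, FPR, and the fraction of all candidate pairs that truly interact. The `prevalence` parameter is an assumption of this example, not a value from the paper, and the function is not a reconstruction of the Table 3 calculation.

```python
def precision_from_rates(sensitivity, fpr, prevalence):
    """Precision = TP / (TP + FP), expressed per candidate pair.

    sensitivity: fraction of true interactions that are detected.
    fpr: fraction of non-interacting pairs that are (wrongly) detected.
    prevalence: assumed fraction of candidate pairs that truly interact.
    """
    for name, value in (("sensitivity", sensitivity), ("fpr", fpr),
                        ("prevalence", prevalence)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie between 0 and 1")
    true_pos = sensitivity * prevalence        # expected true positives
    false_pos = fpr * (1.0 - prevalence)       # expected false positives
    if true_pos + false_pos == 0.0:
        return float("nan")                    # nothing detected at all
    return true_pos / (true_pos + false_pos)
```

Because true interactions are a small minority of all possible protein pairs, the false-positive term is weighted by a much larger population than the true-positive term. Even a small absolute difference in FPR can therefore translate into a large difference in Precision.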
Although subcellular localisation of proteins has been studied by proteomics before,28,32–37 the focus of most of these studies has been on assigning proteins to different organelles rather than revealing protein–protein interactions. Comprehensive subcellular localisation profiling to reveal interactions has been performed by microscopy, using high-throughput fluorescent tagging in combination with high-content imaging.14 A related non-microscopy method, known as proximity-based biotin labelling, functions by tagging bait proteins with a promiscuous biotin ligase, which then biotinylates any closely localising proteins in vivo, allowing their subsequent affinity purification and identification by mass spectrometry. However, similar to Y2H or AP-MS, both these approaches suffer from the labour-intensive need to tag every target protein in a cell-type of interest, and are prone to potential artefacts caused by the addition of a fluorescent or a biotin ligase tag. In contrast, COLA can be applied to any cell-type, and in a fraction of the time required for other methods. The fractionations, sample preparations, and computational methods used in COLA are all well established, making it a readily available tool for a wide range of biologists across diverse fields. Finally, our benchmarking allows the quality of COLA and iCOLA interactome data to be compared with that of existing methods of interactome mining. With regard to Sensitivity, COLA and iCOLA are comparable with SEC-MS, and perform better than Y2H (Fig. 4A). More importantly, COLA and iCOLA perform comparably to or better than SEC-MS in terms of Precision, while greatly outperforming Y2H (Fig. 4D). COLA and iCOLA therefore compare favourably with some of the existing global methods for reliable unbiased mining of interactomes.
Footnotes
† Electronic supplementary information (ESI) available. See DOI: 10.1039/c6mb00701e
‡ These authors contributed equally to the work.
§ Deceased.
This journal is © The Royal Society of Chemistry 2017