Anna C. F. Lewis (a), Ramazan Saeed (b) and Charlotte M. Deane* (c)
(a) Department of Statistics and Systems Biology DTC, University of Oxford, UK
(b) Department of Statistics, University of Oxford, UK
(c) Department of Statistics, University of Oxford, UK. E-mail: deane@stats.ox.ac.uk
First published on 28th September 2009
Here we review the methods for the prediction of protein interactions and the ideas in protein evolution that relate to them. The evolutionary assumptions implicit in many of the protein interaction prediction methods are elucidated. We draw attention to the caution needed in deploying certain evolutionary assumptions, in particular cross-organism transfer of interactions by sequence homology, and discuss the known issues in deriving interaction predictions from evidence of co-evolution. We also conjecture that there is evolutionary knowledge yet to be exploited in the prediction of interactions, in particular the heterogeneity of interactions, the increasing availability of interaction data from multiple species, and the models of protein interaction network growth.
Anna C. F. Lewis: Anna Lewis is studying for her DPhil through the Systems Biology Doctoral Training Centre, and is supervised by Drs Charlotte Deane, Mason Porter and Nick Jones. Her thesis will be in the area of applied biological networks. Her undergraduate degree was in Physics and Philosophy at the University of Oxford.
Ramazan Saeed: Dr Ramazan Saeed completed his DPhil in 2009 in the Department of Statistics, University of Oxford, supervised by Dr Charlotte Deane. His thesis was on investigating protein structure and evolution through the protein interaction network. He was a student in the Life Sciences Interfaces Doctoral Training Centre. He did his undergraduate degree in software engineering at Manchester University, and has worked as a business consultant at Fujitsu Services.
Charlotte M. Deane: Dr Charlotte Deane is a University Lecturer in Statistics at Oxford. She is currently the Director of both the Systems Biology and the new industry focussed Systems Approaches to Biomedical Sciences Doctoral Training Centres (DTCs). She is also Deputy Director of the Doctoral Training Centre at Oxford. In addition she is a member of the management team for Oxford’s Centre for Integrative Systems Biology. All of her research is interdisciplinary and involves application of mathematical and computational techniques to problems across biology and biochemistry, in particular in protein structure, function and interaction. She aims to understand how proteins evolved their properties as well as to develop software to predict these properties.
In this review we have two broad aims. The first is to outline those aspects of protein evolution which are of potential relevance in predicting protein interactions. Our understanding of protein evolution is thus a very broad one, covering sequence evolution, structural evolution, evolution of a protein’s context within and across genomes and location within the protein interaction network (PIN). Our second aim is to clearly state the evolutionary assumptions underlying some of the interaction prediction methods, and ask how accurate and general these assumptions are.
As a motivating example, consider the common practice of transferring protein interactions by orthology from an organism where we have a lot of protein interaction data (e.g. yeast), to one where we have considerably less (e.g. mouse). The assumption is that if proteins A and B interact in yeast, then orthologous proteins A′ and B′ in mouse probably also interact. So, below some threshold of sequence divergence, we can predict an interaction. The threshold at which we can do this with confidence has been shown to be far higher (70% sequence identity) than that which is used for transferring other properties such as function (16–30%),11 demonstrating the care needed in making use of evolutionary assumptions in making interaction predictions.
There is a vast literature on protein interaction networks, notably models for how they evolve, which has not fully engaged with the literature on the properties of individual proteins and how they evolve. As a second motivating example, consider that network evolution models have yet to be fully exploited for making protein interaction predictions.
This review is divided into three main sections. In the first, we summarise the protein interaction data we currently have, including the known problems associated with the different types of data. We discuss the methods that are used to assess the quality of protein interaction data sets and the ability of protein interaction prediction algorithms. The second section reviews some of the aspects of protein evolution most relevant to protein–protein interactions. We organise this literature around three broad themes: how protein interactions are thought to constrain evolution, the co-evolution of interacting proteins, and the evolution of the whole PIN. The third section reviews several different protein interaction prediction methods, and the extent to which they rely on and allow for evolutionary considerations. We conclude by discussing how the available evolutionary information has been utilised in predicting protein–protein interactions, whether this is always justified, and whether further applications of this information are possible.
The two most common high throughput assays are yeast two-hybrid (Y2H) and tandem affinity purification (TAP). In the classical Y2H system, a bait is expressed as a DNA binding domain fusion and the prey is expressed as a transcriptional activation domain fusion. The interaction is detected when the two domains are brought together to form a functional transcriptional activator, which activates a reporter gene.17 TAP methods infer an interaction when a bait protein is affinity captured from cell extracts by either polyclonal antibody or epitope tag and the associated interaction partner is identified, usually by mass spectrometric methods.18 A number of variations on both assays exist.19,20
Errors in both methodologies arise because the protein fusions can restrict mobility and in some cases increase the affinity of a protein for certain targets.21 Both approaches are also very poor at identifying self-interactions.22 In Y2H the baits interact with each other and the preys interact with each other, resulting in reduced concentrations of bait/prey interactions. In TAP assays no self-interactions are reported due to the lack of untagged baits.
False positives in the Y2H approach arise because the bait and prey are over-expressed, often in a non-native cellular localisation.15 Consequently any observed interaction may not be present in wild-type cells, where the concentrations are often significantly lower and where the bait and prey may not normally meet. Auto-activation can also lead to false positives, as the reporter gene is activated by other factors, in some cases the bait protein itself.23 This shortcoming, however, has been addressed in more recent Y2H assays,1 where the individual fusions are tested to see if they cause auto-activation.24
TAP methods are also afflicted by false positives, particularly as a reported complex does not actually imply any direct physical interactions between the complex partners. This issue is addressed, but not resolved, by adopting a ‘spoke’ or ‘matrix’ model:25 the spoke model pairs the bait protein with each of the other proteins in the complex, whereas the matrix model assumes interactions between all pairs of proteins in the complex.
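The difference between the two expansions can be made concrete with a short sketch. The function names and the toy purification below are illustrative assumptions, not part of any published pipeline; the point is only that the matrix model always predicts a superset of the spoke model's pairs.

```python
from itertools import combinations


def spoke_pairs(bait, preys):
    """Spoke model: pair the bait with each co-purified protein."""
    return {frozenset((bait, prey)) for prey in preys if prey != bait}


def matrix_pairs(bait, preys):
    """Matrix model: pair every protein in the purification with every other."""
    members = sorted({bait, *preys})
    return {frozenset(pair) for pair in combinations(members, 2)}


# Hypothetical purification: bait "A" pulls down "B", "C" and "D".
spoke = spoke_pairs("A", ["B", "C", "D"])
matrix = matrix_pairs("A", ["B", "C", "D"])
print(len(spoke), len(matrix))
```

For a purification of n proteins the spoke model yields n - 1 pairs and the matrix model n(n - 1)/2, which is why the matrix model tends to recover more true interactions at the price of more false positives.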
There are several low throughput technologies that offer more accurate but far fewer interactions, such as crystallised complexes found in the Protein Data Bank.26
The abundance of proteins has some effect on the number of interactions that they participate in, and experimental methods have been shown to be biased towards counting more interactions for abundant proteins.27 Fraser and Hirsh suggest that this relationship with expression is an intrinsic characteristic of yeast rather than an experimental bias.28
When using such assessment methodologies on computationally predicted interactions it may be that biased results are obtained due to the possible circular nature of the prediction method and the assessment methodology.
For a review of protein evolution in more general contexts see Pál et al.,47 and for a review of the evolution of protein interactions see Levy and Pereira-Leal.48
Initial work appeared to confirm (a),52 but since then this claim has been challenged, on the basis that it relies on biases in interaction data,27 on confounding independent variables (notably expression rate53), is not particular to number of interactions but to other network features,54 and does not stand up to the data.32,55,56 A review of the methodologies used concluded that the number of translation events (which is well indicated by expression level, protein abundance and codon adaptation index), rather than the number or patterns of protein interactions, was the key determinant of evolutionary rate.57
Work based on structures of proteins crystallised together has been more conclusive with regard to (b): interface residues are more evolutionarily conserved than other surface residues,58,59 although this effect is moderate. Considering an individual interface, some residues, known as hot spots, are thought to dominate the interaction binding energy,60,61 and these are found to be even more conserved than other interface residues.62 Comparing protein interfaces, we can distinguish two broad classes: transient and obligate, where the latter describes two proteins that are never found out of complex with each other.63 Proteins involved in obligate interactions are more evolutionarily conserved than those involved in transient interactions, which are in turn more evolutionarily conserved than those not known to be involved in any interactions.64
The genomes of different organisms can be compared to give us information about likely functional association between proteins. If the same genes tend to occur as neighbours in multiple organisms, then we can infer functional association between them (e.g. ref. 66): if two proteins cannot perform their cellular function without each other, then when one is lost, there will be no evolutionary advantage to keeping the other, and they will be lost from the genome as a pair. Patterns of presence and absence of genes in different organisms, termed phylogenetic profiles, are the simplest clue of protein co-evolution. Similarly, profiles encoding the presence and absence of protein domains can be used to detect functional associations.67 Additional patterns in phylogenetic profiles, such as anti-correlation,68 and correlations between triplets of proteins69 can also give information about functional associations.
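As a minimal sketch of how phylogenetic profiles might be compared, each gene can be encoded as a tuple of presence (1) and absence (0) flags across a fixed list of genomes. The gene names, the toy profiles and the choice of simple fractional agreement as the similarity measure are all assumptions made for illustration; published methods use a variety of scores.

```python
def profile_agreement(profile_a, profile_b):
    """Fraction of genomes in which two genes share the same presence/absence state."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    matches = sum(x == y for x, y in zip(profile_a, profile_b))
    return matches / len(profile_a)


# Hypothetical profiles over six genomes (1 = gene present, 0 = gene absent).
profiles = {
    "geneA": (1, 1, 0, 1, 0, 1),
    "geneB": (1, 1, 0, 1, 0, 1),
    "geneC": (0, 1, 1, 0, 1, 1),
}

same = profile_agreement(profiles["geneA"], profiles["geneB"])
different = profile_agreement(profiles["geneA"], profiles["geneC"])
print(same, different)
```

Here geneA and geneB are gained and lost together and so score as candidate functional partners, whereas geneA and geneC agree in only two of the six genomes.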
Interacting proteins are often transcribed as a single unit (operon) in bacteria, whereas in eukaryotes they tend to be co-regulated or regulated by the same protein (regulon). Studies have shown that the products of 63–75% of co-regulated genes tend to interact physically.70 This suggests that the regulation of proteins co-evolves with the proteins and their interactions.
It is also possible to compare the phylogenetic trees of proteins. The motivation for the approach is that the phylogenetic trees of, for example, ligands and their receptors are more similar than would be expected under the standard molecular clock hypothesis, which indicates some degree of co-evolution.71 This relationship is even more striking when the phylogenetic trees are built from the sequence of the interacting interfaces, rather than the whole protein.72 The same study shows that residues at the interface of obligate complexes tend to evolve slowly, allowing co-evolution of the partner interface, whereas transient interfaces tend to have an increased rate of residue substitution, leaving little evidence of correlated mutations across the interface.72 It is also possible to concentrate on the domains within proteins, and this can be used to infer which domains are responsible for a given interaction.73 As these approaches rely on generating reliable phylogenetic trees, they are well placed to take advantage of the growing amount of sequence information available.
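One way to quantify how similar two phylogenetic trees are, assumed here for illustration rather than taken from the studies cited, is to correlate the pairwise evolutionary distances between species for the two protein families. The species labels and distance values below are invented toy data.

```python
from itertools import combinations
from math import sqrt


def pairwise_vector(distances, species):
    """Flatten a distance lookup into a vector with a fixed species-pair order."""
    return [distances[frozenset(pair)] for pair in combinations(species, 2)]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


species = ["s1", "s2", "s3", "s4"]
pairs = [frozenset(p) for p in combinations(species, 2)]
# Hypothetical inter-species distances for a ligand family and a receptor family.
ligand = dict(zip(pairs, [0.10, 0.40, 0.50, 0.35, 0.45, 0.20]))
receptor = dict(zip(pairs, [0.20, 0.70, 1.10, 0.80, 0.90, 0.30]))

tree_similarity = pearson(
    pairwise_vector(ligand, species), pairwise_vector(receptor, species)
)
print(round(tree_similarity, 3))
```

A high correlation on its own does not separate co-adaptation from a shared species phylogeny, so in practice the score would need to be compared against a background of non-interacting pairs.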
A generalisation of this approach comes from the acknowledgement that, as proteins can interact with many different partners, considering only pairwise interactions will never give us the whole picture when it comes to protein co-evolution. Rather, this depends on all the different interactions a protein may be engaged in, so comparing proteins not only pair-wise, but against all other proteins can give a better idea of the co-evolution of a given pair.74
Generalisations in a different direction come from investigating the similarities of whole protein structures, not just sequences. In Williams and Lovell75 the authors offer an integrated view of sequence and structural divergence, claiming that both co-evolution following sequence changes and structural accommodation of non-compensated substitutions can be accommodated in the same framework. The methods above assess similarity based on nucleotide substitutions, but there is some evidence that the role of insertions and deletions of short stretches of nucleotides (indels) is important. Indels are particularly common on the surfaces of proteins (e.g. ref. 76), and are thus suspected to play a large role in ‘re-wiring’ the protein interaction network. The importance of indels has been highlighted by a study that finds that proteins that participate in indel alignments have high centrality measures within the network.77
It is not clear under what circumstances we can infer direct physical interactions from the observation of co-evolution.65 In Hakes et al.78 the authors argue that there is no evidence that co-evolution is due to co-adaptation, arguing for the importance of common evolutionary forces, notably expression levels, as responsible for the co-evolution. In Kann et al.79 it is argued that there is some evidence for co-adaptation, alongside more general evolutionary forces. A better understanding of the molecular mechanisms underpinning compensatory changes will help distinguish these correlations, and perhaps work out the conditions under which co-adaptation can be inferred. A better understanding of how other protein features can influence functional association will be vital.
The field was re-vitalised in 1999 with the proposal of the preferential attachment model of Barabasi and Albert,80 which picked up on ideas dating back to the 1950s.81,82 It was observed that, if new nodes attached themselves to old nodes with a probability proportional to the number of interactions of the old nodes, a power law distribution of node degree would result (often called a scale free distribution in the literature). Such distributions were, at the time, being found in many different types of network,83 including PINs.84 This would fit the observation that older proteins have more interactions.85 The problem with the model of growth by preferential attachment as a model of PIN growth is that it does not obviously correspond to a known biological mechanism, with the possible exception of horizontal gene transfer in the prokaryotic case.
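The growth rule described above can be sketched in a few lines. This is a simplified illustration, assuming each new node makes exactly one link and that the network is seeded with a single linked pair; the function names are hypothetical.

```python
import random
from collections import Counter


def preferential_attachment(n_nodes, seed=0):
    """Grow a network in which each new node links to one existing node,
    chosen with probability proportional to that node's current degree."""
    rng = random.Random(seed)
    edges = [(0, 1)]
    # Every node appears in `stubs` once per link it has, so a uniform draw
    # from `stubs` selects a node in proportion to its degree.
    stubs = [0, 1]
    for new in range(2, n_nodes):
        target = rng.choice(stubs)
        edges.append((new, target))
        stubs.extend((new, target))
    return edges


def degree_counts(edges):
    """Number of links attached to each node."""
    counts = Counter()
    for u, v in edges:
        counts[u] += 1
        counts[v] += 1
    return counts


edges = preferential_attachment(200, seed=42)
degree = degree_counts(edges)
print(degree.most_common(3))
```

Because older nodes have had more opportunities to be chosen, the highest-degree nodes in such a simulation tend to be the earliest ones.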
What are the likely factors underlying network evolution? Errors in replication can result in a change in copy number of proteins, from individual genes being duplicated or lost (reviewed in e.g. Zhang86), to the whole genome being duplicated (reviewed in e.g. Kasahara,87 and Scannell et al.88). After a gene duplication event, divergence of function is possible. There are two main competing models for such divergence: sub-functionalisation (partitioning of ancestral function between gene duplicates) and neo-functionalisation (the de novo acquisition of function by one duplicate).86 Gene duplication was hypothesised to be disadvantageous in complexes in particular, and evidence for fewer single gene duplication events in gene families encoding complexes has been found in support of this.89
Many PIN evolution models have been based on this idea of duplication followed by divergence90,91 (see Fig. 1(a)). There are many different variants, but all share in common that a node is selected to be copied (duplication), some fraction of the node's links are replicated in the duplicated node, and some more added (divergence). Recently, models which allow for whole genome duplication events have also been proposed.92 Both preferential-attachment and duplication-divergence models match the power law degree distributions found in PINs, and also both match the data in that nodes of high degree tend to connect to nodes of low degree (node disassortativity).93 Duplication-divergence models generate some level of hierarchical modularity (whereby small densely connected groups of proteins, termed modules, join together into larger modules), though not as much as observed in the data.93
Fig. 1 PIN evolution models. (a) Duplication-divergence. A node is chosen to be copied. The new node is given links to the same set of nodes as the chosen node (duplication). Some fraction of links are then lost (divergence). (b) Asymmetric gain and loss of interactions. Three move types are possible. (i) Addition of a link. A link is made between one node chosen at random and another chosen proportional to node degree. (ii) Removal of link. A protein is chosen uniformly at random, and one of its links is chosen uniformly at random to be removed. (iii) A new node is added with zero links. The probabilities of these three moves are chosen such that the mean node degree stays the same and the network grows at some empirically inferred rate. (c) Crystal growth model. After an initial seeding phase, either (i) modules are computed, one is chosen, and a new node is added to this chosen module or (ii) a new node is put into its own module, and connects to other modules which have few links (anti-preferential attachment rule).
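The duplication-divergence move of Fig. 1(a) can be sketched as follows. This is one simple variant chosen for illustration: the copied node's links are each retained independently with a fixed probability, and no new links are added. The parameter name `retain_prob`, the seed network and the integer node labels are assumptions.

```python
import random


def duplicate_and_diverge(adjacency, retain_prob, rng):
    """Copy a randomly chosen node, then keep each copied link with
    probability `retain_prob` (the rest are lost to divergence)."""
    parent = rng.choice(sorted(adjacency))
    child = max(adjacency) + 1
    kept = {nb for nb in sorted(adjacency[parent]) if rng.random() < retain_prob}
    adjacency[child] = set(kept)
    for neighbour in kept:
        adjacency[neighbour].add(child)  # keep the network undirected
    return parent, child


rng = random.Random(1)
# Hypothetical seed network: a triangle of three interacting proteins.
network = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
for _ in range(20):
    duplicate_and_diverge(network, retain_prob=0.6, rng=rng)
print(len(network))
```

In this variant a duplicate can lose all of its links and become isolated; models in the literature differ in how they treat such nodes and in whether the duplicate is also linked to its parent.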
Despite the successes of duplication-divergence models in capturing many aspects of the empirical PIN, it has been claimed that gene duplication and divergence may in fact have played only a limited role in the evolution of PINs, as the dynamics of the gain and loss of individual interactions is thought to happen at a much faster time-scale.94,95 In Berg et al.,96 the authors propose a model based on the addition and loss of individual links. They found that the rate of addition and loss of links depends on the number of connections of both interacting partners asymmetrically, which is to be expected because when a new link is formed, typically only one node undergoes a mutation with the other remaining unchanged (see Fig. 1(b)). By building a stochastic model based on these observations they are able to match the degree distribution and node disassortativity found in the data.
In a separate critique of the popular models, Kim and Marcotte93 investigate the age-dependent evolution of proteins, and claim this cannot be accounted for by duplication-divergence or preferential attachment models. Instead, they propose a crystal growth model based on (a) interaction probability increasing with availability of unoccupied interaction surface, (b) tightly connected groups of proteins developing as the network grows, (c) once a protein is committed to such a group, further connections tend to be made with other members of that group (see Fig. 1(c)). In this model, proteins are more likely to link to proteins of a similar age, as observed in real PINs. The model uses network modularity as a key idea in PIN evolution, an idea that is not that well explored elsewhere (though see Li and Maini97 for an abstract model of network growth based on modules).
It is not clear how best to assess different models of network evolution and growth, though it is clear that these have matured significantly beyond matching degree distributions.93,98–100
As more and more PIN data become available, models that use and explain the patterns of interaction found in different organisms will become increasingly prevalent. The alignment of PINs is a rapidly growing field (for examples see Guo and Hartemink,101 Liao et al.,102 and Zaslavskiy et al.,103 for an early review see Sharan and Ideker104). Such alignments can enable the location of evolutionarily conserved sets of interactions, which is of great potential in the field of predicting interactions.
Fig. 2 Different interaction prediction methods.
The main category of prediction methods we do not cover are those based on machine learning approaches, as these tend not to be underpinned by biological ideas, explicit or implicit. Two reviews of protein interaction prediction which include these approaches are given in Shoemaker and Panchenko105 and Pitre et al.106
Clauset et al.108 used the hierarchical structure (that is, the structures present at different scales) of networks to predict ‘missing’ links between nodes. Interaction probabilities were assigned between hierarchical groups and a pair of nodes were suggested to be possibly linked if they possessed a high average probability of connection within these hierarchical groups, but were observed as unconnected. Whilst this method was tested on various different networks, including a metabolic network, it has yet to be applied to PINs. It could potentially be successful, if PINs are found to have biologically meaningful hierarchical structure (in Yook et al.109 the authors claim this is the case). This in turn rests on our understanding of the evolution of hierarchical modularity.
A common shortcoming of such network based methodologies is that the results can vary depending on the state of the network, which in some species is sparse and error prone.
Here we assume that there is enough selective pressure that if two proteins interact to perform their cellular function, and one is lost, the other protein will also be lost (for example, this would be the case for some complexes). In the case of horizontal gene transfer in bacteria, genes will only be kept if they are transferred with other genes that they need to interact with to perform a fitness enhancing function.
Matthews et al.115 exploited this principle to predict novel interactions in the worm C. elegans based upon homology inference from high throughput assays of S. cerevisiae. This method was later developed by Jonsson and Bates42 to include a scoring scheme based upon the different experimental sets that an interaction was sourced from.
Interologs are often used in species where there is little PIN data but much genetic data, for example in mouse and human.42,116 A comparison of the PINs from different organisms found that such inferences are surprisingly inaccurate.117 Whereas the degree of sequence similarity needed before an interolog is inferred is usually in the range 16–30%,115,118 this study showed that a much larger degree of sequence similarity, about 70%, is required to make an accurate prediction using an interolog. This and more recent studies have shown that paralogous interactions tend to be more conserved than orthologous interactions.11,37,119
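The effect of the identity cut-off on interolog transfer can be illustrated with a small sketch. The protein identifiers and identity values are invented, and requiring both orthologous pairs to exceed the cut-off (rather than, say, a joint score) is an assumption made for simplicity.

```python
def transfer_interologs(source_interactions, orthologs, min_identity):
    """Predict target-species interactions from source-species interactions.

    `orthologs` maps each source protein to (target protein, percent identity).
    An interaction is transferred only if both orthologous pairs meet the cut-off.
    """
    predicted = set()
    for a, b in source_interactions:
        if a not in orthologs or b not in orthologs:
            continue
        target_a, identity_a = orthologs[a]
        target_b, identity_b = orthologs[b]
        if min(identity_a, identity_b) >= min_identity:
            predicted.add(frozenset((target_a, target_b)))
    return predicted


# Hypothetical yeast interactions and yeast-to-mouse orthologue assignments.
yeast_interactions = [("Y1", "Y2"), ("Y2", "Y3")]
orthologs = {"Y1": ("M1", 85.0), "Y2": ("M2", 75.0), "Y3": ("M3", 40.0)}

permissive = transfer_interologs(yeast_interactions, orthologs, min_identity=30.0)
strict = transfer_interologs(yeast_interactions, orthologs, min_identity=70.0)
print(len(permissive), len(strict))
```

Raising the cut-off from 30% to 70% in this toy example discards the prediction that depends on the weakly conserved orthologue, trading coverage for reliability.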
The evolutionary assumption behind interologs is clear: if the two proteins have not diverged considerably in sequence from their common ancestor, then the interaction is likely to have been preserved. However, it is often claimed that the small amount of molecular divergence between proteins found in multiple species is not enough to explain the phenotypic divergence between species.120 Re-wiring of the networks, caused by small changes in protein sequence, is hypothesised to make the difference. If this is right, then it is a further reason to use interologs with utmost care.
As discussed above in section 3.2, the main issue is in the inference from functional association to physical interaction (from co-evolution to co-adaptation).
In analogy with sequence based interologs, structure based interologs have been investigated. Aloy et al.123 found that proteins with the same folds or structural domains tended to participate in similar interactions if the sequence identity of the proteins was above approximately 30%. Below this percentage, there is a twilight zone where proteins may or may not share similar interactions. The evolutionary assumptions and potential problems are similar to those for sequence based interologs (see above). Their advantage is that they are more accurate, their disadvantage is the relative paucity of structural information.
Protein-docking methods,124,125 where two protein structures are rigidly combined and then refined, can also be used to predict interactions. However, due to the computational cost of such methods they are in general more useful for providing information on the interacting interface of two structurally defined subunits.126 Other methods predict interactions based on surface patch comparison127 and oligomeric protein structure networks.128 Carugo and Franzot127 divided atoms on the surface of each protein into small partially overlapping sets (patches). The shapes of each pair of patches belonging to different proteins were compared, and a statistical analysis of the shape complementarity values was used to discriminate interacting and non-interacting protein pairs with an accuracy of up to 80%. Brinda and Vishveshwara128 attempted to understand the factors involved in protein interactions by analysing interactions between amino acids based on the number of non-covalent bonds, which are known to play a role in mediating protein interactions.129
One lesson from the field of protein docking in general is that supplementing physics and chemistry considerations with information deduced from sequence and structural databases can improve predictions greatly.130 The use of these databases rests on the assumption that similar structures imply similar interactions because the proteins are related to each other through evolution.
Association methods are a group of prediction methods that look for blocks of sequence or structural motifs that distinguish interacting proteins from non-interacting proteins. In one such study, Sprinzak and Margalit134 looked for sequence domains that were found to interact more often than expected by chance. They used such domains as signatures to predict new interactions.
A problem with association models is that they only consider one domain at a time and ignore the effect of other domains on the interaction. This was addressed by Deng et al.,135 who estimated the probabilities of interactions between every pair of domains, and used these to predict interactions between proteins. Rare interactions between two domains can be missed by this method. To compensate for this, Riley et al.136 developed measures based on the reduction in the likelihood of the protein–protein interaction network, caused by disallowing a given domain–domain interaction. This can give some indication of which domain–domain interaction is more likely to be responsible for a given protein–protein interaction.
In all these approaches domains are assumed to interact independently, though this can depend on other domains within a protein pair, and remains a severe limitation of these methods.
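Under the independence assumption just described, domain-pair probabilities can be combined into a protein-pair probability by asking that at least one domain pair interacts. The sketch below assumes this combination rule; the domain names and probability values are invented, and estimating the domain-pair probabilities from data is not shown.

```python
from itertools import product


def interaction_probability(domains_a, domains_b, domain_pair_prob):
    """Probability that two proteins interact, assuming each domain pair
    interacts independently and one interacting pair is sufficient."""
    prob_no_pair_interacts = 1.0
    for pair in product(domains_a, domains_b):
        p = domain_pair_prob.get(frozenset(pair), 0.0)
        prob_no_pair_interacts *= 1.0 - p
    return 1.0 - prob_no_pair_interacts


# Hypothetical domain-domain interaction probabilities.
domain_pair_prob = {
    frozenset(("SH3", "PRM")): 0.8,
    frozenset(("kinase", "PRM")): 0.5,
}

# Protein X carries two domains, protein Y carries one.
p_xy = interaction_probability(["SH3", "kinase"], ["PRM"], domain_pair_prob)
print(round(p_xy, 3))
```

The limitation noted in the text is visible in the code: each factor depends only on one domain pair, so any influence of a protein's other domains on that pair is ignored.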
These methods could potentially be powerful in predicting whole PINs of organisms for which interaction data is not available.106 This is because surprisingly few domains have been duplicated and recombined to form proteins across the tree of life: 50% of domain structure annotations in each organism are to fewer than 200 domain families common to all kingdoms. There is cause for caution here, however, as we need to ascertain that domain interactions are not organism specific, remembering that domain combinations tend to be specific to organisms.137 In Basu et al.,138 the authors demonstrate both that domains that occur in diverse domain architectures tend to have more interactions, and that which domains end up in diverse architectures is organism specific.
Key to the reliable use of interologs and co-evolution will be a better understanding of the molecular mechanisms underpinning the evolution of interactions. There is evolutionary knowledge that has yet to be exploited in this regard, notably the large documented differences between transient and obligate interactions. These differences can potentially be detected on the basis of sequence alone.139 An investigation of the different sequence cut-offs that should be employed for interolog prediction for transient and obligate interactions would be a useful starting point.
The theories that state that indels have a key role to play in the re-wiring of the network could be tested, and if found to hold, methods which include the effect of indels could be used both in predicting interologs and for separating out co-adaptation from co-evolution.
Another step for improving interolog and co-evolution based predictions is to make use of the comparative PIN data that is now becoming available. Alignments of entire PINs, rather than just pairs of proteins, can give us additional information as to when an interolog inference is acceptable. If proteins A, B and C all interact in species S, and proteins A′ and B′ and A′ and C′ interact in species S′, then we have better evidence to predict that B′ and C′ also interact.
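The three-protein example can be written as a simple rule over two aligned networks: when a fully connected triple in one species maps onto a triple in the other species with exactly one link missing, that missing link is predicted. The function name, the orthologue mapping and the brute-force search over triples are illustrative assumptions, suitable only for small networks.

```python
from itertools import combinations


def predict_from_triangles(source_edges, target_edges, orthologs):
    """Predict target-species links that would complete a conserved triangle."""
    source = {frozenset(edge) for edge in source_edges}
    target = {frozenset(edge) for edge in target_edges}
    proteins = sorted({p for edge in source for p in edge})
    predictions = set()
    for a, b, c in combinations(proteins, 3):
        triangle = [frozenset((a, b)), frozenset((a, c)), frozenset((b, c))]
        if not all(edge in source for edge in triangle):
            continue  # the three proteins do not all interact in the source
        if not all(p in orthologs for p in (a, b, c)):
            continue  # cannot map the whole triangle to the target species
        mapped = [frozenset(orthologs[p] for p in edge) for edge in triangle]
        missing = [edge for edge in mapped if edge not in target]
        if len(missing) == 1:
            predictions.add(missing[0])
    return predictions


# Species S: A, B and C all interact. Species S': only A'-B' and A'-C' are known.
species_s = [("A", "B"), ("A", "C"), ("B", "C")]
species_s_prime = [("A'", "B'"), ("A'", "C'")]
orthologs = {"A": "A'", "B": "B'", "C": "C'"}

predicted = predict_from_triangles(species_s, species_s_prime, orthologs)
print(predicted)
```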
There is potential for models of network evolution to be used in the prediction of protein interactions. At present, we have no clear consensus as to which are the good models of network evolution. It is clear that there is evolutionary information that can be incorporated into these models to make them more realistic. For example, no model proposed, to our knowledge, distinguishes between transient and obligate interactions, which is perhaps surprising given the known differences between the evolution of these different interaction types. Good models of PIN growth, which have been assessed against multiple datasets, will be helpful in predicting new interactions. These models could be used to generate ensembles of interaction networks matching the observed statistics of the empirical PIN. The frequency with which a given pair of proteins interact in the ensemble can be used to predict likely interactors,108 although this has yet to be put into practice.
An important observation is that it is not easy to assess the relative success of different prediction methods. It is likely that different methods are more successful on certain types of proteins, which would be related to the different evolutionary assumptions underlying the different methods. Through direct comparison of which methods do best on which proteins, interaction prediction could be tailored to what else we know about the proteins involved.
Footnote
† These can be found in publicly available databases.10
This journal is © The Royal Society of Chemistry 2010