G. S. Nido, R. Méndez, A. Pascual-García, D. Abia and U. Bastolla*
Centro de Biología Molecular “Severo Ochoa”, (CSIC-UAM), Cantoblanco, 28049 Madrid, Spain. E-mail: ubastolla@cbm.uam.es
First published on 11th November 2011
Here we study the properties and the evolution of proteins that constitute the Centrosome, the complex molecular assembly that regulates the division and differentiation of animal cells. We found that centrosomal proteins are predicted to be significantly enriched in disordered and coiled-coil regions, more phosphorylated and longer than control proteins of the same organism. Interestingly, the ratio of these properties in centrosomal and control proteins tends to increase with the number of cell-types. We reconstructed indels evolution, finding that indels significantly increase disorder in both centrosomal and control proteins, at a rate that is typically larger along branches associated with a large growth in cell-types number, and larger for centrosomal than for control proteins. Substitutions show a similar trend for coiled-coil, but they contribute less to the evolution of disorder. Our results suggest that the increase in cell-types number in animal evolution is correlated with the gain of disordered and coiled-coil regions in centrosomal proteins, establishing a connection between organism and molecular complexity. We argue that the structural plasticity conferred to the Centrosome by disordered regions and phosphorylation plays an important role in its mechanical properties and its regulation in space and time.
Recently, a large scale proteomic experiment has identified 114 proteins localized in the human centrosome.11 Motivated by this study, Nogales-Cadenas et al.12 retrieved from public databases such as Ensembl,13 the Human Protein Reference Database (HPRD)14 and MiCroKit15 a large number of genes annotated as centrosomal from previous literature evidence, collecting a total of 465 likely centrosomal human genes that constitute the CentrosomeDB http://centrosome.dacya.ucm.es. We take advantage of this knowledge in order to address some general aspects of the structural organization of the centrosome. Workers in the field know that proteins in the centrosome tend to be large, disordered and coiled-coil, and that phosphorylation plays a very important role in the dynamic organization of the centrosome. Here we quantify these properties, comparing them with analogous properties of non-centrosomal (control) proteins, and we investigate their evolutionary origin.
Disordered regions are protein fragments that do not adopt a well-defined three-dimensional structure unless they form specific interactions. Experimental techniques to identify them include, among others, X-ray crystallography, where disordered regions are characterized as regions lacking electron density; nuclear magnetic resonance, using new methodologies that allow the assignment of resonances to unfolded and partially folded regions; circular dichroism, which detects the lack of rigid structure of regions containing aromatic residues; and small-angle X-ray scattering (SAXS) and other techniques that measure the hydrodynamic radius of a protein.16 Disordered regions are abundant in eukaryotic proteins,17 in particular in proteins that take part in cell regulation, such as transcription factors.16 This suitability of disordered proteins for regulatory functions is often attributed to the fact that disorder is thought to promote molecular interactions (either protein–protein or protein–DNA) with high specificity and low affinity, as needed in complex regulatory processes.16 Furthermore, disorder endows proteins with the structural plasticity necessary for multiple partner binding, so that it was proposed that it can provide the structural basis for the promiscuity of hubs in protein–protein interaction networks.18 Disorder is frequently found in interaction hubs, more in dynamic hubs forming transient interactions than in static hubs.19 Moreover, disordered proteins have a large propensity to interact among themselves.20 These observations suggest that disordered proteins are frequently encountered in complex molecular machines such as the Centrosome.
Disordered regions are frequently phosphorylated,21 and their intrinsic structural flexibility amplifies the structural effect of the negatively charged phosphate, causing large conformation changes that control the capacity of the protein to recognize different partners. Protein kinases exert a coordinate control of cell physiology, in particular during the different phases of mitosis, using the centrosome as a scaffold that allows them to coordinate their action.22 These observations indicate a deep relationship between protein disorder, phosphorylation, and the centrosome.
Coiled-coil structures consist of homopolymeric or heteropolymeric bundles of long α-helical stretches formed by repeats of a typical heptameric hydrophilic/hydrophobic motif that are stabilized through hydrophobic or electrostatic interactions with their interaction partners.23 Coiled-coil structures are very frequent in centrosomal proteins, and we found that they are frequently predicted to be disordered. This is consistent with previous observations that proteins with coiled-coil structure tend to be enriched in disordered regions.24 A large scale proteomic experiment on thermostable proteins expressed in mouse fibroblast cells found that more than 2/3 of these proteins are predicted to be substantially disordered, and that disordered domains and coiled-coil domains occur together in a large number of expressed proteins.25 Another study found that coiled-coils are often predicted to be unstructured, consistent with their obligate multimeric nature.26 These predictions suggest that many coiled-coil proteins are disordered prior to molecular interaction, consistent with the finding that their sequence complexity is typically lower than for globular proteins.24
Several recent experimental results are consistent with this view. For instance, the basic-helix-loop-helix-leucine-zipper domains of the c-Myc oncoprotein and its obligate partner Max are intrinsically disordered monomers that undergo coupled folding and binding upon heterodimerization forming a parallel coiled-coil.27 Chibby, a small and highly conserved protein that plays an antagonistic role in Wnt signaling, has an N-terminal portion that is predominantly unstructured in solution, while its C-terminal half adopts a coiled-coil structure through self-association,28 and the intrinsically disordered Thyroid cancer 1 protein interacts with Chibby via regions with high helical propensity, which strengthen their helical structure upon addition of Chibby.29 Dynein light chain (LC) 8 interacts with the natively disordered N-terminal domain of the dynein intermediate chain (IC), promoting self-association of two IC chains at a region predicted to form a coiled-coil.30 Prostate apoptosis response factor-4 (Par-4) is an intrinsically disordered protein that contains a highly conserved coiled-coil region that serves as the primary recognition domain for a large number of binding partners and self-associates via the C-terminal domain, forming a coiled-coil that is stabilized through an intramolecular association.31 The protein FIP2, which interacts with Rab11, a key regulator of plasma membrane recycling, has a C-terminal fragment that is disordered in the absence of Rab11, but acquires helical structure upon binding with it.32 The Huntingtin-interacting protein 1 (HIP1), obligate interaction partner of the protein that triggers Huntington's disease, was partly solved by X-rays and partly modeled as two coiled-coil domains linked by a disordered region that allows it to assume a U-shape upon interaction.33 Even bacterial proteins present similar phenomena: for instance, helical filaments of bacterial flagella are built up by a self-assembly process from thousands of flagellin subunits whose terminal regions are disordered. Removal of C-terminal segments or truncation at both ends results in the complete loss of binding ability, consistent with the coiled-coil model of filament formation, which assumes that the α-helical N- and C-terminal regions of axially adjacent subunits form an interlocking pattern of helical bundles upon polymerization.34 Furthermore, bacterial gene clusters encoding type III secretion system (T3SS) code for small hydrophilic proteins whose amino acid sequences display a propensity for intrinsic disorder and coiled-coil formation. These properties were confirmed experimentally for the HrpO protein from the T3SS of Pseudomonas syringae, which exhibits high α-helical content with coiled-coil characteristics, low melting temperature, structural properties that are typical for disordered proteins, and a pronounced self-association propensity, most likely via coiled-coil interactions, suggesting that the flexibility and propensity for coiled-coil interactions of these proteins might play an important role for establishing the protein–protein interaction networks required for T3SS function.35
We find here that regions that are both disordered and coiled-coil constitute the structural signature of centrosomal proteins.
Fig. 1 Phylogenetic tree of the six model organisms used for this study. The number in brackets indicates the estimated number of cell types according to ref. 1. The approximate divergence time estimated by ref. 43 is also shown. The tree is based on the Coelomata hypothesis, according to which Arthropoda is a sister group of vertebrates.
Centrosomal proteins for species other than humans were derived from the list of 465 human centrosomal proteins12 by gathering orthologous proteins from the Compara database36 of the Ensembl project,13 http://www.ensembl.org. The set of control proteins was constructed with the same procedure starting with a randomly drawn set of 465 human genes. In this way, the unavoidable bias inherent in using the experimental information for human proteins and extending it to other species is present both in the centrosomal and in the control set, so that their comparison should be free from this bias.
A known feature of centrosomal proteins is the high incidence of coiled-coil structure. These structures can be reliably predicted from the protein sequence based on their characteristic heptameric pattern of hydrophobic and hydrophilic residues.23 We used two algorithms, ncoil48 and Pcoils,49 to predict coiled-coils in the proteins of our data sets. These two algorithms yield very similar predictions: with a cut-off of49 equal to 0.75, 61% of the predictions of either algorithm coincide. Moreover, they provide exactly the same qualitative picture (see Fig. S3, ESI‡). In the following, we present results obtained with the ncoil algorithm.
We computed the propensity of coiled-coil predictions and disorder predictions to occur at the same site through the formula $p(x,y) = \ln\left[P(x,y)/\left(P(x)\,P(y)\right)\right]$, where x and y represent the events that a given site is predicted as disordered and as coiled-coil, respectively, P(x) and P(y) are the frequencies of each event, and P(x,y) is the frequency with which both occur at the same site. Propensity is related to mutual information, and it also reveals the sign of the correlation: positive propensity means that x and y tend to co-occur more often than expected at random (here this refers to co-occurrence of disorder and coiled-coil predictions).
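As a minimal sketch of this calculation, the propensity can be estimated from two per-site boolean prediction tracks, reading the formula as the logarithm of the joint frequency over the product of the marginal frequencies. The function name and the toy input are illustrative assumptions, not part of the original analysis pipeline.

```python
import math

def propensity(disordered, coiled_coil):
    """Return ln[P(x,y) / (P(x) P(y))] for two per-site boolean predictions.

    `disordered` and `coiled_coil` are equal-length sequences of truthy/falsy
    values, one entry per residue (hypothetical input format).
    """
    if len(disordered) != len(coiled_coil) or len(disordered) == 0:
        raise ValueError("prediction tracks must be non-empty and equally long")
    n = len(disordered)
    p_x = sum(1 for d in disordered if d) / n
    p_y = sum(1 for c in coiled_coil if c) / n
    p_xy = sum(1 for d, c in zip(disordered, coiled_coil) if d and c) / n
    if p_xy == 0.0:
        # No co-occurrence at all: the propensity diverges to minus infinity.
        return float("-inf")
    return math.log(p_xy / (p_x * p_y))

# Toy example: 10 sites, 4 disordered, 3 coiled-coil, all 3 inside disorder.
dis = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
cc = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
print(propensity(dis, cc))  # positive: the two predictions co-occur
```

A value of zero corresponds to statistically independent predictions, and a negative value to predictions that avoid each other.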
Consistent with previous theoretical and experimental work,24–35 we found that there is a positive propensity to predict a residue as coiled-coil if it is predicted to be disordered. This propensity is not a trivial consequence of overlapping training sets for the two predictors, since disordered regions lack any stable structure unless they interact with their binding partner, and they are characterized as regions that lack electron density in X-ray crystallography experiments, whereas coiled-coil regions are characterized as long α helices in the same experiments. Regions predicted both as disordered and coiled-coil may represent disordered regions that take coiled-coil structures upon binding with their proper binding partner. The view that coiled-coil proteins are often disordered prior to molecular interaction is consistent with previous theoretical and experimental work.24–35 We found that propensities are significantly positive for all data-sets and all pairs of disordered and coiled-coil predictors, except for a few data-sets using the DisEMBL predictor. Using ncoil or Pcoils for coiled-coil predictions yields the same propensities within the statistical error. Therefore, the correlation between disorder and coiled-coil does not depend on the predictors used.
Interestingly, propensities are slightly but systematically larger for control than for centrosomal proteins and, for the latter, they tend to decrease with organism complexity (see Fig. S1, ESI‡). This is consistent with the fact that in centrosomal proteins of more complex organisms there is a larger fraction of residues predicted to be disordered but not coiled-coil (see below).
We distinguish in Fig. 2 all four combinations of coiled-coil and disorder predictions, with the following results. (1) The fraction of residues predicted to be both disordered and coiled-coil is significantly larger in centrosomal than in control proteins for all organisms and it increases with the complexity (number of cell types) of the organism. The ratio of this fraction between centrosomal and control proteins also increases with the number of cell types (Fig. 2, top right). (2) The fraction of residues predicted to be disordered and not coiled-coil is significantly larger in centrosomal than in control proteins in vertebrates, but the difference is not significant in other organisms (Fig. 2, top left). Strikingly, whereas for control proteins the maximum amount of disorder is reached for D. melanogaster, the amount of disorder in centrosomal proteins tends to increase with the complexity (number of cell types) of the organism. (3) Finally, the fraction of residues predicted to be coiled-coil but not disordered is one order of magnitude smaller than those predicted to be both disordered and coiled-coil, it does not show significant differences between centrosomal and control proteins, and it does not vary significantly for different species (Fig. 2, bottom left). (4) As a consequence of these results, the fraction of globular residues (neither disordered nor coiled-coil) is significantly smaller in centrosomal than in control proteins and it decreases with the complexity of the organism, as shown in Fig. S4 (ESI‡). One can see that the fraction of disordered residues is larger for centrosomal than for control proteins for all model organisms, but the difference is only significant for vertebrates, and that disorder in centrosomal proteins tends to increase with the complexity of the organism. This behavior is robust with respect to the disorder predictor (Fig. S2, ESI‡). 
We obtained the same trend counting the fraction of proteins containing stretches with at least 40 consecutive disordered residues, which are likely to have functional relevance.
Fig. 2 Fraction of centrosomal and control residues predicted as disordered and not coiled-coil (top left), disordered and coiled-coil (top right, note the different scale), coiled-coil and not disordered (bottom left) and phosphorylated (bottom right, in this case the fraction is with respect to the number of serine, threonine and tyrosine residues). In each figure, solid lines refer to centrosomal proteins and dashed lines refer to control proteins. The insets show the ratio between centrosomal and control proteins.
We conclude that centrosomal proteins are enriched in disordered and coiled-coil regions in all organisms, with the enrichment correlated with the organism complexity, and they are enriched in disordered and not coiled-coil regions in vertebrates, whereas coiled-coil but not disordered regions are scarce and not significantly different from those in control proteins. As a consequence of these results, there is significant correlation between the amount of disorder and coiled-coil present in the same protein. Interestingly, these correlations are significantly stronger for centrosomal proteins than for control proteins, see Fig. 4.
We verified that the difference between centrosome and control proteins is not influenced by the fact that the control data-set contains extracellular proteins, whereas centrosomal proteins are intracellular. When we eliminated extracellular proteins from the control data-set using the Blast2GO suite,50 we found that the differences between control and centrosome did not change at all concerning the coiled-coil fraction, and even increased concerning the disordered fraction. We also verified that the results were not a consequence of the fact that centrosomal proteins tend to be longer than control proteins, by taking a control data-set with the same length distribution as the centrosomal data-set, see Fig. S5 (ESI‡).
We then predicted phosphorylated residues using the GPS51 and NetPhos52 algorithms (see Methods). The fraction of serine, threonine and tyrosine (S, T, Y) residues predicted as phosphorylated is shown in Fig. 2, bottom right. We found that the fraction of phosphorylated residues is significantly larger in centrosomal than in control proteins for all vertebrates but not for invertebrates. There are two known factors that can contribute to enhanced predicted phosphorylation in the centrosome. First, centrosomal proteins tend to contain a larger number of S, T, Y residues, and therefore they tend to have a larger number of predicted phosphorylation sites. This possible artifact is eliminated by the normalization that we adopt. Secondly, disordered regions are enriched in proline and basic residues, which are frequently found in motifs recognized by kinases and used by phosphorylation predictors. This bias is very difficult to correct, and it may reflect a genuine phenomenon. In fact, disordered regions are more accessible to kinases and more plastic, and they tend to be phosphorylated more often than other regions. This fact is used in a phosphorylation prediction algorithm,21 but not in the algorithms that we adopted; thus we believe that this correlation is not an artifact of the predictors but a genuine effect. Interestingly, the correlation between the predicted phosphorylation fraction of a protein and its fraction of predicted disordered residues is usually stronger in centrosomal than in control proteins, although the difference is small, see Fig. 4. We also compared phosphorylation predictions for kinases associated with the Centrosome, such as the families Polo, Aurora, Cdk and Nek2, with those for other kinases. Centrosomal kinases are more enriched than other kinases in the centrosomal set for all species except D. melanogaster; however, the difference is only a few percent and it is hardly significant, since the same motif is very often predicted as being recognized by several kinases.
It is known that centrosomal proteins tend to be rather long. We found that they are on average from 5 to almost 30% longer than control proteins for all of our model organisms, see Fig. 3 left inset. Neither the mean length of centrosomal proteins nor the mean length of control proteins is correlated with the complexity of the organism, but the ratio between them is significantly correlated with the number of cell types (correlation coefficient r = 0.76, Student's t = 2.6, P < 0.05, not shown). This increased length of centrosomal proteins with respect to control proteins is achieved by different means in different organisms. We plot in Fig. 3 the mean number of exons per protein (left plot) and the mean exon length (right plot). Whereas for yeast the number of exons is essentially the same in centrosomal and control proteins but exons are substantially longer in the former, for worm and fly exons are both more numerous and longer for centrosomal proteins than for control proteins, and for vertebrates the number of exons is much larger in centrosomal than in control genes while exon length is slightly smaller. As a consequence, the mean number of exons per gene is significantly correlated with the number of cell types both for control proteins (r = 0.87, P < 0.01, not shown) and, more strongly, for centrosomal proteins (r = 0.94, P < 0.001, not shown), and the ratio between them is also significantly correlated (r = 0.81, P < 0.01, not shown). Exon length is negatively but not significantly correlated with the number of cell types both for control (r = −0.64, not shown) and for centrosomal proteins (r = −0.66, not shown). The ratio between them is strongly negatively correlated with the number of cell types (r = −0.99, P < $10^{-5}$, Fig. 3 right inset), i.e. centrosomal exons are shorter than control exons by a factor that is strongly correlated with the number of cell types.
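The link between a correlation coefficient and its Student's t value can be sketched as follows. The sample size n = 7 is an assumption taken from the number of species quoted for the pairwise comparisons later in the text; the 5% two-tailed critical value for 5 degrees of freedom (about 2.57) is a standard table value, not a number reported by the authors.

```python
import math

def t_from_r(r, n):
    """Student's t statistic for a Pearson correlation r over n data points."""
    if n <= 2 or abs(r) >= 1.0:
        raise ValueError("need n > 2 and |r| < 1")
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

N_SPECIES = 7        # assumption: seven species, as in the pairwise comparisons
T_CRIT_5DF = 2.571   # standard two-tailed 5% critical value for 5 d.o.f.

t = t_from_r(0.76, N_SPECIES)
print(round(t, 2), t > T_CRIT_5DF)
```

Under these assumptions the statistic lands just above the 5% threshold, in line with the quoted values t = 2.6 and P < 0.05.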
Summarizing, genes of more complex organisms tend to contain more modules and these modules tend to be shorter. Both trends are enhanced in the centrosome in a way that is quantitatively correlated with the number of cell types.
Fig. 3 Relationship between disorder content and gene length. Left: the number of exons per gene tends to be larger in centrosomal genes than in control genes, in particular for more complex organisms, and this number tends to increase with organism complexity. Left inset: Centrosomal proteins tend to be longer than control proteins. Right: exons tend to be larger in centrosomal than in control genes, in particular for simpler organisms. Right inset: the ratio between the length of centrosomal exons and the length of control exons tends to decrease with organism complexity.
For each organism, we measured the correlation between the length of a protein and its fraction of disordered and coiled-coil residues. Both correlations are almost always positive, see Fig. 4 and they are typically larger for centrosomal than for control proteins (except disorder-length correlation in fly and coil-length correlations in yeast). Our data sets contain from 85 to 465 proteins, so that correlation coefficients larger than 0.2 can be regarded as significant and this figure goes down to 0.10 for human proteins. For the set of centrosomal proteins the correlations between protein size on one hand and disorder or coiled-coil fraction on the other hand are significant for all species except C. elegans, whereas they are almost never significant for control proteins. This means that long centrosomal proteins tend to contain a larger fraction of disordered and coiled-coil residues. As we will see in the next section, this can be explained by the fact that highly disordered proteins evolve through the addition of long disordered stretches.
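The significance thresholds quoted above can be approximated from the sample size alone. The sketch below uses the Fisher z-transformation at the two-sided 5% level, which is our choice of approximation; the text does not state how the thresholds were derived.

```python
import math

def critical_r(n, z=1.96):
    """Approximate smallest |r| significant at the two-sided 5% level.

    Uses the Fisher z-transformation: atanh(r) is roughly normal with
    standard deviation 1/sqrt(n - 3) under the null hypothesis r = 0.
    """
    if n <= 3:
        raise ValueError("need more than 3 data points")
    return math.tanh(z / math.sqrt(n - 3))

# Smallest and largest data-set sizes mentioned in the text.
for n in (85, 465):
    print(n, round(critical_r(n), 3))
```

For 85 proteins the threshold comes out close to 0.2, and for 465 proteins close to 0.1, matching the figures given in the text.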
Fig. 4 Correlation coefficients between different properties of centrosomal and control proteins in model organisms are reported in colour code. The fraction of predicted disordered residues tends to be more correlated with the fraction of predicted coiled-coil residues in centrosomal than in control proteins, and both fractions tend to be more correlated with chain length (except for D. melanogaster and S. cerevisiae), so that longer centrosomal proteins tend to contain a larger fraction of disordered and coiled-coil residues. Disorder is also correlated with the predicted fraction of phosphorylation sites per S, T and Y residue. Both phosphorylation predictors GPS and NetPhos yield similar correlations, but systematically stronger for GPS. Here we use the intersection of the two predictions.
The way in which we constructed the data sets only allows us to examine the Np mechanism when species a is H. sapiens, because the human data-set always contains a protein in each family by construction, therefore we present this case in Fig. 5, where each point refers to the comparison of H. sapiens with another species. One can see that most of the disorder gain in H. sapiens arises either because of the Np or because of the Li mechanism. The sum of these mechanisms is at least 85% for all comparisons, but for species closely related to H. sapiens the percentage of disordered residues arising from new proteins (Np) decreases, as expected, and the percentage arising from long insertions (Li) increases. This trend is qualitatively similar for control proteins, but the contribution of large indels is much larger for centrosomal than for control proteins, in particular in the comparison between closely related species. A similar trend is also observed for coiled-coil residues. In this case, large indels contribute to coiled-coil gain much more in centrosomal than in control proteins. An important difference between disorder and coiled-coil is that the relative contribution of substitutions and residue conservation to coiled-coil gain is much larger than their contribution to disorder gain. Summarizing, new (loosely speaking) disordered and coiled-coil regions have a strong tendency to evolve modularly from large indels both in control proteins as well as in centrosomal proteins, but this tendency is much stronger in centrosomal proteins, which evolve much more modularly. The contribution due to substitutions is very weak for disorder gain, but it is relevant for coiled-coil gain.
Fig. 5 Origin of disordered residues in human centrosomal proteins. For all disordered residues that are disordered in human centrosomal proteins but not in proteins of other species we count how many of them are found in human proteins that do not have orthologs in the other species (black), how many of them correspond to gaps with more than 20 a.a. in the other species (red), to short gaps with fewer than 20 a.a. (green), to substituted amino acids (blue) and to conserved amino acids that change their nature from ordered to disordered (pink).
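The five categories of this figure can be sketched as a simple decision rule applied to each disordered human residue. The function name, its arguments and the category labels are hypothetical; treating a gap of exactly 20 residues as short is also an assumption, since the caption leaves that case open.

```python
from collections import Counter

LONG_GAP = 20  # gaps longer than this count as long insertions (assumed strict)

def gain_mechanism(has_ortholog, gap_length, same_residue):
    """Classify one human disordered residue that is not disordered elsewhere.

    has_ortholog: the human protein has an ortholog in the other species
    gap_length:   length of the alignment gap covering the site (0 if aligned)
    same_residue: the aligned residue is identical in the other species
    """
    if not has_ortholog:
        return "new_protein"
    if gap_length > LONG_GAP:
        return "long_insertion"
    if gap_length > 0:
        return "short_insertion"
    return "conserved" if same_residue else "substitution"

# Toy residues: (has_ortholog, gap_length, same_residue)
sites = [(False, 0, False), (True, 35, False), (True, 35, False),
         (True, 4, False), (True, 0, False), (True, 0, True)]
print(Counter(gain_mechanism(*s) for s in sites))
```

Dividing each count by the total gives the percentages of the kind plotted in the figure.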
The above analysis of disorder gain was complemented by the analysis of the disorder flux from species a to species b, defined as the number of changes from residues that are ordered in species a and disordered in species b (gain) minus the number of changes from residues that are ordered in species b and disordered in species a (loss) for each kind of mechanism and each pair of species. These pairwise comparisons are presented in Fig. S6 (ESI‡). As expected from the fact that the proteins of organisms other than human are collected gathering orthologs of human proteins, we found that the disorder flux due to the New proteins mechanism always goes towards the more complex species or it is zero. The disorder flux due to large insertions is more interesting. This flux mostly goes towards the more complex species, but sometimes it goes towards the less complex species, notably Drosophila proteins gain disorder due to large insertions compared with vertebrate proteins. Substitutions contribute very little to the disorder flux: Typically, the net gain of disordered residues per protein in the human-worm and fly-worm comparison is 70 disordered residues through large insertions and only 5 residues through substitutions. This indicates that large indels and new proteins are quantitatively much more important than substitutions as a mechanism for the evolution of disorder. In contrast, the net gain of coiled-coil residues due to substitutions is large and positive in the comparisons from invertebrates to vertebrates, and it is much stronger for centrosomal than for control proteins, see Fig. S7 (ESI‡).
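A minimal sketch of the flux definition, for one aligned pair of proteins, is given below. The encoding of sites (True for disordered, False for ordered, None for an alignment gap) and the choice to count a gap as "not disordered" are assumptions made for illustration.

```python
def disorder_flux(states_a, states_b):
    """Disorder flux from species a to species b over aligned sites.

    Each state is True (disordered), False (ordered) or None (gap).
    Gain: site not disordered in a but disordered in b.
    Loss: site disordered in a but not disordered in b.
    """
    if len(states_a) != len(states_b):
        raise ValueError("aligned sequences must have the same length")
    gain = sum(1 for a, b in zip(states_a, states_b) if not a and b)
    loss = sum(1 for a, b in zip(states_a, states_b) if a and not b)
    return gain - loss

# Toy alignment of four sites.
a = [False, True, None, True]
b = [True, True, True, False]
print(disorder_flux(a, b))  # two gains, one loss
```

By construction the flux is antisymmetric: swapping the two species changes its sign.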
Nevertheless, pairwise comparisons do not give a very clear picture since they are not independent: for n = 7 species there are n(n − 1)/2 = 21 pairs of species, whereas the phylogenetic tree only contains 2n − 3 = 11 independent branches. We then tried to reconstruct the history of insertions and deletions leading to the current distribution of indels in multiple protein alignments.
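The counting argument can be checked directly. The sketch below enumerates species pairs and applies the branch count of an unrooted binary tree; the placeholder species labels are hypothetical.

```python
from itertools import combinations

def n_pairs(n):
    """Number of unordered species pairs, n(n - 1)/2."""
    return n * (n - 1) // 2

def n_branches(n):
    """Number of branches of an unrooted binary tree with n leaves, 2n - 3."""
    return 2 * n - 3

species = ["sp%d" % i for i in range(1, 8)]  # seven placeholder labels
pairs = list(combinations(species, 2))
print(len(pairs), n_pairs(7), n_branches(7))
```

With seven species there are almost twice as many pairwise comparisons as branches, so the comparisons necessarily share information.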
Fig. 6 Divergence time estimated from the fossil record (abscissa) and from multiple sequence alignments (ordinate) for all model species pairs. Black symbols refer to centrosomal proteins and red symbols refer to control proteins. The small panels represent control protein divergences versus divergence times estimated under two competing hypotheses: Coelomata (the divergence between C. elegans and the clade constituted by D. melanogaster plus vertebrates happened ≈800 My ago, and the divergence between D. melanogaster and vertebrates happened ≈700 My ago) and Ecdysozoa (the divergence between vertebrates and the clade constituted by D. melanogaster plus C. elegans happened ≈800 My ago, and the divergence between D. melanogaster and C. elegans happened ≈700 My ago). One can see that data are consistent with the Coelomata hypothesis, on which the figure is based.
We note that our data strongly support the Coelomata hypothesis (grouping Arthropods and Vertebrates) with respect to the Ecdysozoa hypothesis (grouping Arthropods with Nematodes). In fact, the divergence between C. elegans and D. melanogaster, 1.029 ± 0.015, coincides within the statistical error with the divergences between C. elegans and the vertebrates (1.014 ± 0.018, 1.038 ± 0.016, 1.054 ± 0.016, 1.029 ± 0.017) and it is significantly larger than the divergence between D. melanogaster and the vertebrates (0.936 ± 0.013, 0.951 ± 0.014, 0.923 ± 0.016, 0.921 ± 0.016), see inset in Fig. 6. These data were obtained with control proteins, but the same qualitative results hold for centrosomal proteins. Note that these divergence estimates do not require any choice of an out-group and therefore they do not suffer from the artifact of long branch attraction that according to Philippe et al. produced the impression of the Coelomata clade.37 Moreover, they are consistent with, but not dependent on the divergence time estimated by Feng et al.43 using a different set of proteins. However, since the Coelomata versus Ecdysozoa hypothesis is heavily debated, we tested that our results still hold without relying on it for our evolutionary reconstruction.
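The comparison of divergences can be illustrated with a simple z-score on the quoted values. This sketch assumes that the ± figures are standard errors and that the estimates are independent, which is only approximately true since they share proteins; it is not the test used by the authors.

```python
import math

def z_score(mean_1, err_1, mean_2, err_2):
    """Difference between two estimates in units of its combined error."""
    return (mean_1 - mean_2) / math.sqrt(err_1 ** 2 + err_2 ** 2)

worm_fly = (1.029, 0.015)         # C. elegans vs D. melanogaster
worm_vertebrate = (1.014, 0.018)  # C. elegans vs one vertebrate
fly_vertebrate = (0.936, 0.013)   # D. melanogaster vs one vertebrate

print(round(z_score(*worm_fly, *worm_vertebrate), 2))  # below 1: compatible
print(round(z_score(*worm_fly, *fly_vertebrate), 2))   # above 4: clearly larger
```

Under these assumptions the worm-fly divergence is indistinguishable from the worm-vertebrate one, while it exceeds the fly-vertebrate divergence by more than four combined standard errors.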
Fig. 7 Example of the reconstruction of indel histories in the multiple alignment of extra spindle pole bodies homolog 1 (ESPL1) proteins. Gaps larger than 20 residues are clustered together (boxes). Insertions in the same gap region are separated if they do not satisfy a cut-off in sequence identity. Thus the first gap region contains two insertion clusters, one for vertebrates and the other for S. cerevisiae. The origin of these insertions is attributed by parsimony to the branch leading to vertebrates (cross in the figure) and the branch leading to S. cerevisiae. The disorder/order state of each site is represented by colour code (red = disordered).
Fig. 8 Disorder fraction of new proteins (left) and large insertions (right) appearing t million years ago. The leftmost points represent proteins and insertions present in the out-group S. cerevisiae. Solid line: centrosome. Dashed line: control. The histograms represent the number of proteins and insertions per million years appearing along the branch of the phylogenetic tree that goes to the named species. Black bars: centrosome. Dotted bars: control.
Using this parsimonious reconstruction of insertion and deletion events, we then calculated the flux of disordered residues (number of disordered residues created by an insertion minus those eliminated by a deletion) along all branches of the phylogenetic trees. Since branches do not have the same length, we transformed these fluxes into rates by dividing them by the branch lengths in million years estimated in ref. 43, which are in good agreement with the Poisson distance computed from multiple alignments of centrosomal and control proteins (see Fig. 6), except for the branch that goes to yeast, for which we did not compute any flux, since we used yeast just as an out-group. The resulting rates are presented in Fig. 9. Each point represents a branch in the phylogenetic tree, labeled by the time of divergence in million years, so that the branches corresponding to 750 million years refer to the divergence between the fly and the vertebrates. For each pair of branches arising from the same node, we distinguish between the high complexity growth branch (HCG) where a larger growth in number of cell types took place and the low complexity growth (LCG) branch. For instance, of the two branches arising from the fly-vertebrate node, the one leading to the vertebrates is the HCG branch and the one leading to the fly is the LCG branch. Fig. 9 is based on the Coelomata hypothesis. To verify that our results do not depend on this hypothesis, we repeated our calculations eliminating either D. melanogaster or C. elegans, obtaining plots that look almost the same as Fig. 9 with the points at 750 and at 815 million years removed, respectively. These figures are presented in Fig. S8 (ESI‡).
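The two quantities used here can be sketched as follows. The Poisson distance is written in its standard form, d = −ln(1 − p), with p the fraction of aligned sites that differ; the function names and the numbers in the example are hypothetical and do not come from the data.

```python
import math

def poisson_distance(p_diff):
    """Poisson-corrected distance from the fraction of differing aligned sites."""
    if not 0.0 <= p_diff < 1.0:
        raise ValueError("fraction of differing sites must be in [0, 1)")
    return -math.log(1.0 - p_diff)

def flux_rate(gained, lost, branch_length_my):
    """Net disordered residues gained per million years along one branch."""
    if branch_length_my <= 0:
        raise ValueError("branch length must be positive")
    return (gained - lost) / branch_length_my

# Hypothetical branch: 90 residues gained by insertions, 20 lost by deletions,
# over a branch of 700 million years.
print(flux_rate(90, 20, 700.0))
print(round(poisson_distance(0.63), 3))  # a distance near 1, as in Fig. 6
```

Dividing by the branch length makes short and long branches comparable, which is what allows rates along HCG and LCG branches of different ages to be plotted on the same axis.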
Fig. 9 Rate of flux of disordered residues due to indels (top) and rate of flux of residues both coiled-coil and disordered due to substitutions (bottom) along branches of the phylogenetic tree. The abscissa shows the divergence time t in million years; for instance, t = 730 My represents the splitting between vertebrates and fly. The tree is based on the Coelomata hypothesis. Not using this hypothesis gives similar results, reported in ESI.‡ For each node, we distinguish the HCG branch with larger increase in cell types (black) and the LCG branch with smaller increase in cell types (red). Centrosomal proteins are represented as solid lines, control proteins as dashed lines. Disorder flux is normalized by the number of aligned proteins at each internal node. The panel below the tree represents the flux due to substitutions of residues that are both coiled-coil and disordered.
The HCG branch going from yeast to human is dissected into independent partial branches connecting bifurcation events, for instance the branch between the common ancestor of arthropods and vertebrates and the common ancestor of vertebrates, using 14 species (see Methods). All these species have been used to reconstruct insertion and deletion events, but we combined branches, obtaining 4 HCG branches with sufficient length to give good statistics. Fig. 9 represents the rate per unit time of gain minus loss of disordered residues due to indels along HCG branches (black) and LCG branches (red), for centrosomal (circles) and control proteins (crosses). One can see that indels tend to increase the disorder along all branches, both for control and for centrosomal proteins, except along the shortest LCG branch, for which the disorder flux is almost zero. The rate is always larger for centrosomal than for control proteins, except for the most ancient splitting between Nematodes and other animals, where control proteins have a larger rate both in the HCG and in the LCG branch, although the difference is not significant. An asterisk in the plot means that the difference between centrosomal and control proteins is significant.
Strikingly, the disorder rate of centrosomal proteins is always larger along HCG branches than along the corresponding LCG branches, and the difference is significant for the two most recent splittings. This also holds for control proteins, but the difference between corresponding HCG and LCG branches tends to be larger for centrosomal than for control proteins, and the comparison is reversed at the splitting of Arthropods, where the LCG rate reaches its maximum. This is consistent with the pairwise comparisons, which show that the D. melanogaster proteome is enriched in disordered residues both for centrosomal and for control proteins. Moreover, the difference between centrosome and control along a given branch is always larger along HCG branches than along LCG branches, and it is significant in 3 out of 4 cases for HCG branches and in 1 out of 4 cases for LCG branches. These results hold for the Coelomata hypothesis. In order not to rely on this hypothesis, we eliminated either C. elegans or D. melanogaster from the tree, finding the same qualitative results (see Fig. S8, ESI‡).
The evolution of coiled-coil through indels is qualitatively the same as the evolution of disorder both for centrosomal and for control proteins, and it is presented in Fig. S9 (ESI‡). For coiled-coils, the resulting picture is even clearer, since there are no exceptions: the coiled-coil rate is always larger for centrosomal than for control proteins, and it is always larger for the HCG branch than for the corresponding LCG branch. For centrosomal proteins, the evolutionary rate due to insertions is faster for disordered residues (the maximum rate is 1.1 residues per protein per million years) than for coiled-coil residues (the maximum rate is 0.27 residues per protein per million years), and it is much smaller for control proteins.
We finally examined the evolution of disordered and coiled-coil regions through substitutions, still adopting a parsimonious reconstruction of evolutionary events. The flux of disordered residues due to substitutions (gain minus loss, normalized by time) is zero within the error in most of the examined cases, and when it is significant it is much smaller than the flux due to indels. It is positive in five cases and negative only for centrosomal proteins along the two most ancient HCG branches (see Fig. S9, ESI‡). The picture is more interesting for the flux of residues that are both disordered and coiled-coil (i.e. most of the coiled-coil residues). This is presented in the bottom panel in Fig. 9. The flux of coiled-coils due to substitutions is smaller by approximately a factor of ten than the one due to large indels, but it is larger by a factor of 3 than the corresponding flux of disordered residues. The coiled-coil flux due to substitutions is never negative, and it is significantly larger for centrosomal than for control proteins in two cases (the HCG branch at the splitting between fishes and terrestrial vertebrates, where the rate is maximum, and the LCG branch at the splitting between mammals and birds), whereas the opposite happens, but at a much smaller scale, for the LCG splitting of Nematodes. For centrosomal proteins, the rate is larger along HCG than along LCG branches in all cases except along the last HCG branch leading to mammals.
Interestingly, centrosomal proteins also tend to be more phosphorylated than control proteins, and their predicted phosphorylated fraction also tends to be correlated with the number of cell types, although this is in part expected, since disordered regions and phosphorylation sites tend to have similar sequence features and kinases tend to exploit the exposure and the structural malleability of disordered regions.21
The main result of this work is the evolutionary analysis of disorder. We have shown that disordered regions are mainly gained in evolution through new proteins and through large insertions. Since the analysis of new proteins may be biased by the fact that it is more difficult to identify orthologs of disordered proteins, which tend to evolve faster than globular proteins, we focused our evolutionary analysis on insertions, and on the comparison between centrosome and control, which may suffer from the same bias. The disorder content of new proteins and large insertions is correlated with the time at which they appear, so that proteins and insertions that arose more recently contain a larger fraction of disordered residues. This holds true both for centrosomal and for control proteins, but the effect is much stronger for centrosomal proteins. Substituted residues contribute very little to the evolution of disordered regions: their contribution sometimes increases disorder and sometimes decreases it, but most often it is neutral. In contrast, we found that the net effect of substitutions almost always tends to increase the size of coiled-coil regions, more strongly in centrosomal than in control proteins. This suggests that positive natural selection is involved in the growth of coiled-coil regions.
We then reconstructed the flux (gain minus loss) of disordered and coiled-coil residues due to long insertions along different branches of the phylogenetic tree. As a side result, we found that the simple evolutionary distance that we computed allows us to reconstruct the tree unambiguously, and strongly supports the grouping of Arthropods and Vertebrates (Coelomata hypothesis) with respect to the grouping of Arthropods and Nematodes (Ecdysozoa hypothesis). Strikingly, we observed that disordered and coiled-coil regions evolved through insertion and deletion events at a much faster rate along branches leading to a large growth in the number of cell types (HCG branches) than along branches leading to a small growth in cell type number (LCG branches). When it is significant, this difference is much larger for centrosomal than for control proteins, which means that, whatever the evolutionary force (mutation or selection) producing the bias between insertions and deletions and between HCG and LCG branches, this force acts more strongly on centrosomal than on control proteins.
Thus, the acceleration in the evolution of disorder and coiled-coil content along HCG branches is stronger for the centrosome, and it establishes a novel relationship between the molecular complexity of a proteome and the cellular complexity of the corresponding organism. Although complexity is a controversial concept with many possible definitions, complexity measured as the number of cell types is likely to be relevant for the evolution of the centrosome. The centrosome controls cell cycle and cell division, and it shows a remarkable plasticity in space and time. Unfortunately, order-of-magnitude estimates of cell-type numbers are only available for a few model organisms, which limited our analysis. From the molecular side, the relation between protein disorder and complexity naturally arises from the fact that disordered proteins can have a larger number of possible conformations and interaction partners; from this point of view they are more complex than globular proteins.
Interestingly, we found that indels tend to increase the disorder and coiled-coil content along all branches of the tree that we examined, both for centrosomal and for control proteins. This is consistent with the known fact that disordered proteins tend to evolve by repeat expansion,55 and it can be attributed either to a mutational pattern or to positive selection. However, the fact that the rate is significantly accelerated for centrosomal proteins along the branches of high complexity growth with respect to control proteins suggests that positive selection is responsible for this acceleration. A relevant fraction (up to 1/3) of the predicted disordered residues in centrosomal proteins are also predicted to be coiled-coil. This fraction is much smaller in control proteins. Coiled-coil regions grow mainly through indels, but we also observed a significant bias to increase the content of coiled-coil residues through substitutions. When it is significant, this bias is much stronger for centrosomal proteins than for control proteins, which seems to be more consistent with positive selection than with a mutational pattern.
The fraction of predicted phosphorylated residues is much larger in centrosomal than in control proteins, even after taking into account the biased composition of centrosomal proteins, which contain many more serine, threonine and tyrosine residues. The correlation between phosphorylation and disorder content is stronger in centrosomal than in control proteins, which suggests that the enhanced phosphorylation of centrosomal proteins cannot simply be explained by the known tendency of phosphorylation to take place in disordered regions. We speculate that phosphorylation and disorder are enhanced in the centrosome by an evolutionary force that favours the regulatory plasticity of centrosomal proteins. Experimental work will be needed to investigate the relationship between the increase of disorder and coiled-coil content and the biophysical properties of the centrosome. However, some hypotheses arise in a natural way. Disordered regions are frequently involved in molecular interactions, probably because they provide high specificity but low affinity interactions such as those necessary for dynamically controlled processes.16 Thus, the plasticity conferred to the centrosomal proteome by disordered regions together with phosphorylation may be necessary for avoiding nonspecific interactions and may allow centrosomal proteins to cope with the stringent requirements imposed by the very large number of interactions in the centrosome and their strict regulation in space and time.
Concerning the abundance of coiled-coil regions in centrosomal proteins, we would like to suggest two hypotheses. It is possible that the prevalence of coiled-coil regions is due to a principle of evolutionary economy, since coiled-coils are made of low complexity sequences24 and combining coiled-coil interaction modules might be the simplest way to create a large super-molecular assembly,56 which seems to be used even to assemble bacterial flagella34 and secretion systems.35 An alternative explanation involves natural selection. Coiled-coil residues seem to be favoured by natural selection, as suggested by the bias of substitutions to increase coiled-coil content, which is stronger in centrosomal than in control proteins. If this is the case, a possible explanation may lie in the mechanical properties of disordered coiled-coil residues that, upon folding, can change their shape from a flexible polymer with size scaling similar to a self-avoiding walk ($L^{0.6}$, where L is chain length) to a much longer stiff, rod-like molecule. This can have important consequences for the mechanical behavior of the centrosome. More precisely, we speculate that the prevalence of disordered residues in the centrosome might be due to their peculiar mechanical properties as entropic springs.57 It has been recently found that charge interactions modulate the size of disordered proteins,60 and that this modulation can be controlled through phosphorylation.58 Interestingly, it has also been recently observed that the size of the centrosome varies with the pH.59 These observations suggest that modulation of charge interactions in disordered centrosomal proteins through phosphorylation can have a physiological role in controlling the size and the mechanical properties of the centrosome as a whole, a possibility that is worth experimental evaluation.
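To give a feeling for the size change invoked above, the following sketch compares the two scaling laws, assuming unit prefactors (an illustrative simplification; real per-residue lengths differ between a disordered coil and a folded coiled-coil). The function names are hypothetical. Under this assumption the ratio between rod-like and coil-like size grows as the chain length to the power 0.4.

```python
def coil_size(chain_length: float, exponent: float = 0.6) -> float:
    """Size of a flexible chain scaling like a self-avoiding walk (unit prefactor assumed)."""
    return chain_length ** exponent


def rod_size(chain_length: float) -> float:
    """Size of a stiff, rod-like chain, linear in chain length (unit prefactor assumed)."""
    return float(chain_length)


def extension_ratio(chain_length: float, exponent: float = 0.6) -> float:
    """Ratio of rod-like to coil-like size; equals chain_length ** (1 - exponent)."""
    return rod_size(chain_length) / coil_size(chain_length, exponent)


# For a 100-residue segment the ratio is 100 ** 0.4, i.e. roughly a six-fold extension.
ratio_100 = extension_ratio(100)
```

The ratio increases with chain length, so under these assumptions longer coiled-coil segments would undergo a larger relative change in size upon folding.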
We collected genes for 14 species with completely sequenced genomes, chosen in such a way as to divide the evolutionary distance between yeast and human into 13 independent branches. They are, in order of relatedness with respect to Human: Homo sapiens, Pan troglodytes (chimp), Macaca mulatta (macaque), Tarsius syrichta (primate), Rattus norvegicus (rat), Monodelphis domestica (opossum), Ornithorhynchus anatinus (platypus), Gallus gallus (chicken), Xenopus tropicalis (frog), Danio rerio (zebrafish), Ciona intestinalis (urochordate), Drosophila melanogaster (fruitfly), Caenorhabditis elegans (nematode worm), Saccharomyces cerevisiae (yeast). Out of these species, 7 model species for which the number of cell types is approximately known were chosen for more detailed analysis, namely H. sapiens together with G. gallus (370 proteins), X. tropicalis (370 proteins), D. rerio (392 proteins), D. melanogaster (241 proteins), C. elegans (206 proteins) and S. cerevisiae (104 proteins). The control set was constructed in the same way starting from 465 randomly drawn human genes, resulting in 297 proteins for G. gallus, 288 for X. tropicalis, 312 for D. rerio, 212 for D. melanogaster, 181 for C. elegans and 85 for S. cerevisiae.
Coiled-coil structures have been predicted using the implementation by Rob Russell of the algorithm ncoil described by Lupas et al.,48 and the more recent Pcoils algorithm.49 Both algorithms yield the same fraction of coiled-coil residues and the same correlations between coiled-coil and disorder within the statistical error. Results presented in the paper are obtained with the ncoil algorithm.
Phosphorylation was predicted using NetPhos (portable version 3.1)52 and GPS 2.1.51 For NetPhos, the significance of a prediction is given by a single score, and we selected the default threshold (0.5). GPS provides a specific threshold for each kinase family and allows different levels of stringency to be selected; we selected the most stringent level. Both predictors provide a similar number of significant phosphorylation sites, although GPS systematically predicts a larger number of different kinases per site. In order to obtain a more reliable prediction, we retained only residues with a significant phosphorylation prediction from both algorithms. Since only serine, threonine and tyrosine residues can be phosphorylated and the fraction of such residues is different in the centrosomal and in the control set, we computed the fraction of predicted phosphorylated residues with respect to the total number of S, T or Y residues.
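The consensus step and the normalization by S, T and Y content can be sketched as follows. The sketch assumes that the significant sites of each predictor have already been parsed into sets of 0-based sequence positions; the function name and the toy sequence are hypothetical, and the parsing of actual NetPhos or GPS output files is not shown.

```python
from typing import Iterable


def phosphorylated_fraction(sequence: str,
                            sites_a: Iterable[int],
                            sites_b: Iterable[int]) -> float:
    """Fraction of S/T/Y residues predicted as phosphorylated by both predictors.

    sites_a and sites_b hold 0-based positions of significant predictions
    from two independent predictors (assumed to be parsed beforehand).
    """
    # Positions that can be phosphorylated in this analysis: S, T or Y.
    candidates = {i for i, aa in enumerate(sequence.upper()) if aa in "STY"}
    if not candidates:
        return 0.0
    # Keep only positions supported by both predictors.
    consensus = set(sites_a) & set(sites_b) & candidates
    return len(consensus) / len(candidates)


# Toy example: S/T/Y at positions 1, 2, 4 and 6; both predictors agree on 2 and 4.
toy_fraction = phosphorylated_fraction("MSTAYKS", {1, 2, 4}, {2, 4, 6})
```

Dividing by the number of S, T and Y residues rather than by the sequence length keeps the comparison between centrosomal and control proteins from being driven by their different amino acid composition.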
We estimated the statistical error as twice the standard deviation of the mean, $2\sqrt{p(1-p)/n}$, where p is the observed frequency of disordered (coiled-coil or phosphorylated) residues and n is the number of independent samples, estimated as n = L/30.
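As an illustration, the error estimate can be sketched as below. Two assumptions are made here that are not stated explicitly in the text: the standard deviation of the mean is taken to be the binomial expression sqrt(p(1 - p)/n), and L is interpreted as the total number of residues in the set, so that one independent sample corresponds to 30 residues. The function name is hypothetical.

```python
import math


def frequency_error(p: float, total_residues: float, window: int = 30) -> float:
    """Twice the standard deviation of the mean of an observed frequency p.

    Assumes a binomial standard deviation sqrt(p * (1 - p) / n), with the
    number of independent samples estimated as n = total_residues / window.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a frequency between 0 and 1")
    n = total_residues / window
    if n <= 0:
        raise ValueError("the number of independent samples must be positive")
    return 2.0 * math.sqrt(p * (1.0 - p) / n)


# Example: p = 0.5 over 3000 residues gives n = 100 independent samples.
example_error = frequency_error(0.5, 3000)
```

Treating blocks of 30 residues as one sample makes the error bar wider than it would be if every residue were counted as independent, which accounts for the fact that neighbouring residues tend to share the same predicted state.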
Footnotes
† Published as part of a Molecular BioSystems themed issue on Intrinsically Disordered Proteins: Guest Editor M. Madan Babu.
‡ Electronic supplementary information (ESI) available. See DOI: 10.1039/c1mb05199g
This journal is © The Royal Society of Chemistry 2012