Hedi Hegyi*a and Peter Tompaab
aInstitute of Enzymology, Biological Research Center, Hungarian Academy of Sciences, Budapest, Hungary. E-mail: hegyi@enzim.hu
bVIB Department of Structural Biology, Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussels, Belgium
First published on 22nd November 2011
Intrinsic protein disorder has been studied with respect to the chromosomal location of each protein in the human proteome and also in other fully sequenced organisms. We found that in all studied mammalian species the sex chromosome-coded proteins were significantly more disordered than the autosome-coded ones, the strongest discrepancy being observed in humans. In explaining this phenomenon we analyzed local chromosomal features and found that (1) the autosomes have a stronger correlation between the GC content of the transcripts and the structural disorder of the coded proteins than the sex chromosomes; (2) the disorder of neighboring proteins correlates most strongly on the sex chromosomes; (3) the GO functions on chromosome X are somewhat biased towards functions with higher disorder but do not account for the entire phenomenon; (4) the protein–protein interactions show a non-random chromosomal distribution, the Y chromosome-coded proteins having the lowest overall frequency of interactions but the largest bias towards intra-chromosomal interactions. Tissue-specific distributions showed the most protein disorder for sex chromosome-coded proteins expressed in the testis and the ovary. We raise the possibility that the high disorder of X- and Y-encoded proteins facilitates the fast evolution of testis- and cancer-specific antigenic protein clusters on these chromosomes, in relation to their immunogenic properties and likely contribution to speciation.
However, protein disorder has so far been studied only from a proteomic perspective, without regard to the chromosomal location of the genes coding for these disordered proteins. To our knowledge, this paper is the first to examine this phenomenon from a genomic/chromosomal perspective. We found that human proteins coded on the X and Y sex chromosomes have a significantly higher average disorder than the ones on the 22 autosomes. To see if there is any systematic difference between the two types of chromosomes that could account for this difference, we studied several aspects of the chromosomes and the local environment, such as the GC content, the average gene density and the disorder of neighboring genes on all chromosomes. In accordance with others,5 we found a correlation between the GC content of the transcripts and the intrinsic disorder of the corresponding proteins, including on the sex chromosomes.
We also analyzed the relationship between gene density and protein disorder. Intriguingly, we found that except for the two most gene-packed chromosomes, 1 and 19, there is little correlation between gene density and protein disorder. However, we found that for most chromosomes both the GC content and the disorder of neighboring genes correlate, and for the latter we could actually see different tendencies for the sex chromosomes setting them apart from the autosomes.
We also studied the functional aspects of the sex chromosomes and the autosomes to see if these can account for the differences in disorder between the two types of chromosomes. We used the GO classification system and confirmed the high occurrence of transcription-related genes on the X chromosome. We also analyzed the number of interacting protein partners for the sex chromosome- and autosome-coded proteins to see if the higher disorder of X- and Y-coded proteins also entails a greater number of interacting partners. We found that the opposite is actually true for the sex chromosomes, and that in general the number of interacting proteins correlates with the size of the chromosome or, more precisely, with the number of genes coded on a chromosome. This is particularly true of the Y chromosome, the genes of which tend to interact with one another much more often than would be expected based on a random chromosomal distribution of interacting partners.
We also looked at the tissue-specific expression of the sex chromosome-coded genes and compared it with that of the autosome-coded genes to see whether there is a significant bias in the tissue types in which each gene is expressed (e.g. whether, as expected, sex chromosome-coded genes are more often expressed in sex organs and related tissues) and whether such biases can account for the higher disorder of the proteins coded on the sex chromosomes. We did find significant differences between the different tissues and also between the sex chromosomes and the autosomes. Disorder was not uniformly high in all reproductive organs; it was highest in the testis and the ovary.
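The comparison against a random chromosomal distribution of interacting partners can be illustrated with a minimal sketch. The null model used here is an assumption for illustration only (partners are drawn in proportion to each chromosome's share of all interaction ends); the text does not specify how the expectation was computed, and the function and argument names are hypothetical.

```python
from collections import Counter


def intra_chromosomal_bias(chrom_of, interactions):
    """Return {chromosome: (observed, expected)} fractions of same-chromosome partners.

    chrom_of:      dict mapping protein id -> chromosome label
    interactions:  iterable of (protein_a, protein_b) pairs, each pair listed once
    """
    ends = Counter()        # interaction ends falling on each chromosome
    intra_ends = Counter()  # ends whose partner lies on the same chromosome
    for a, b in interactions:
        ca, cb = chrom_of[a], chrom_of[b]
        ends[ca] += 1
        ends[cb] += 1
        if ca == cb:
            intra_ends[ca] += 2
    total_ends = sum(ends.values())
    result = {}
    for chrom, n in ends.items():
        observed = intra_ends[chrom] / n
        # Assumed null model: a partner lands on a chromosome in proportion
        # to that chromosome's share of all interaction ends.
        expected = n / total_ends
        result[chrom] = (observed, expected)
    return result
```

An observed fraction well above the expected one for a chromosome would indicate a bias towards intra-chromosomal interactions under this simplified null model.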
Fig. 1 Median and mean %disorder (indicated as %IU) of proteins on the 24 human chromosomes, as determined by IUPred.32
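As a rough illustration of how a per-protein %IU value and the per-chromosome median and mean could be derived from per-residue disorder scores, consider the sketch below. The 0.5 cutoff is the conventional IUPred threshold and is an assumption here, since the text does not state the cutoff used; the function names and the input layout are hypothetical.

```python
from statistics import mean, median

DISORDER_CUTOFF = 0.5  # assumption: conventional IUPred cutoff, not stated in the text


def percent_disorder(scores, cutoff=DISORDER_CUTOFF):
    """Percentage of residues whose per-residue score exceeds the cutoff."""
    if not scores:
        raise ValueError("empty score list")
    return 100.0 * sum(s > cutoff for s in scores) / len(scores)


def summarize_by_chromosome(proteins):
    """proteins: iterable of (chromosome, per-residue score list).

    Returns {chromosome: (median %IU, mean %IU)}.
    """
    by_chrom = {}
    for chrom, scores in proteins:
        by_chrom.setdefault(chrom, []).append(percent_disorder(scores))
    return {c: (median(v), mean(v)) for c, v in by_chrom.items()}
```

Reporting both the median and the mean is useful because %IU distributions are typically skewed, so a few highly disordered proteins can pull the mean well above the median.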
To that effect, we first calculated the GC content and the percentage disorder for each gene and each chromosome (using cDNAs and proteins of Ensembl)11 as it has been shown previously, albeit for bacterial genomes, that the two features correlate.5 Next, we calculated the Pearson correlation coefficient (r) for each chromosome separately, and also globally, for the whole human genome/proteome. (Table S1 (ESI‡) shows the average GC content of the transcripts of human genes in the Ensembl database and the average %disorder of the coded proteins for each chromosome.) Fig. 2 shows the correlations between the GC content of the transcripts and the percentage disorder (%IU) of the corresponding proteins, as well as the correlation of each quantity between neighbors, for each of the 24 human chromosomes. For most chromosomes the correlation between %GC content and %IU is a small but statistically significant positive value, with the exception of chromosomes 21 and Y, where no statistical significance could be shown either between the two actual values or between the moving averages (Fig. 2A). The highest correlation was observed for chromosome 19 (Pearson correlation = 0.348), which also has the 2nd highest GC content (GC content = 0.570) of all the chromosomes (for p-values of the correlations in Fig. 2A and B see Table S1, ESI‡). After calculating moving averages over 10 consecutive values for both the %IU values and the GC content, we found that the correlations between the two further increase for most chromosomes, except for the sex chromosomes and chromosome 22, where the correlations between the moving averages are smaller than those between the original values.
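The quantities discussed above (GC content of a transcript, the Pearson correlation between two per-gene series, and moving averages over 10 consecutive genes) can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names are hypothetical, and the exact windowing used for the moving averages (here, every run of `window` consecutive values in chromosomal order) is an assumption.

```python
from math import sqrt


def gc_content(seq):
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def pearson(x, y):
    """Pearson correlation coefficient r of two equal-length series.

    Raises ZeroDivisionError if either series is constant.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def moving_average(values, window=10):
    """Mean of each run of `window` consecutive values."""
    if window > len(values):
        return []
    total = sum(values[:window])
    out = [total / window]
    for i in range(window, len(values)):
        total += values[i] - values[i - window]  # slide the window by one gene
        out.append(total / window)
    return out
```

With per-gene lists `gc` and `iu` ordered along a chromosome, `pearson(gc, iu)` would give the raw correlation and `pearson(moving_average(gc), moving_average(iu))` the smoothed one.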
Fig. 2 Various correlations. (A) Pearson correlation values between %disorder (indicated as %IU) of proteins and GC content of their transcripts and the moving averages of 10 consecutive values for the 24 human chromosomes. (B) Pearson correlations of neighboring proteins and Pearson correlations of GC content of neighboring transcripts for the human chromosomes.
We also calculated the Pearson correlation (r) between the percentage disorder (%IU) of neighboring genes for each chromosome (Fig. 2B) and found that the values vary between 0.01 (for chromosome 22, the smallest autosome) and 0.368 (for the Y chromosome), with the X chromosome having the 2nd largest correlation value (0.279). The GC content correlation of neighboring genes is also shown for each chromosome in Fig. 2B. It is clear that while the neighbors' %IU values correlate the most on the sex chromosomes, the sex chromosomes do not have particularly high values for the GC content correlation. The latter is again highest on chromosome 19, which has the highest gene density as discussed above.
The results are shown in Fig. 3. Only the 30 most abundant GO categories are shown. The categories with the highest occurrence on chromosome X (when compared to the four autosomes) are boxed, whereas the categories with the lowest occurrence on chromosome X among the five chromosomes studied here are underlined. The GO categories that were most enriched on the X chromosome among the 30 most abundant categories were: nucleus, intracellular, regulation of transcription, nucleic acid binding, metabolic process and transport. On the other hand, ATP binding, DNA binding, extracellular region and binding occurred the least frequently on chromosome X of the five chromosomes (Fig. 3A). The median disorder of human proteins on the selected autosomes (5, 9, 10 and 16) and the X and combined X and Y sex chromosomes associated with each of the 30 categories is shown in Fig. 3B. The total median disorder of all proteins in the 30 categories for the selected and all autosomes was 13.08% and 12.57%, respectively, whereas for chromosomes X and Y the values were 21.58% and 24.40%, respectively. To see if the higher disorder of proteins on the sex chromosomes is due to the enrichment of transcription-related proteins on X and Y, we grouped the proteins into two categories (transcription-related or not) and calculated the median disorder for each group, shown in Table 2. The sex chromosome-coded proteins are always more disordered, whether we take into consideration only the 30 most abundant GO categories or all of them (data not shown), and both in the transcription-related and -unrelated groups.
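One simple reading of the neighbor correlation described above is the Pearson correlation between each gene's value and that of the next gene along the chromosome. The sketch below implements that reading; treating "neighbors" as immediately adjacent genes in chromosomal order is an assumption, and the function name is hypothetical.

```python
from math import sqrt


def neighbor_correlation(values):
    """Pearson r between each value and its successor in chromosomal order.

    values: per-gene numbers (e.g. %IU or GC content) sorted by position.
    Raises ZeroDivisionError if either shifted series is constant.
    """
    x, y = values[:-1], values[1:]  # (gene i, gene i+1) pairs
    n = len(x)
    if n < 2:
        raise ValueError("need at least three genes")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)
```

A value near zero means a gene's disorder says little about its neighbor's, whereas clusters of similarly disordered genes push the value up.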
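The grouping into transcription-related and unrelated proteins on the two chromosome types can be sketched as a small aggregation. The record layout and names below are hypothetical, and how a protein is flagged as transcription-related from its GO terms is left to the caller.

```python
from statistics import median

SEX_CHROMOSOMES = {"X", "Y"}


def median_disorder_table(proteins):
    """proteins: iterable of (chromosome, transcription_related, percent_disorder).

    Returns {(chromosome type, transcription_related): (median %disorder, count)}.
    """
    groups = {}
    for chrom, is_tx, pct in proteins:
        kind = "Sex chroms" if chrom in SEX_CHROMOSOMES else "Autosomes"
        groups.setdefault((kind, bool(is_tx)), []).append(pct)
    return {key: (median(vals), len(vals)) for key, vals in groups.items()}
```

Comparing the medians within each transcription class separates the effect of functional enrichment from the effect of chromosomal location.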
Fig. 3 The 30 most abundant Gene Ontology (GO) categories in proteins on the X chromosome and chromosomes 5, 9, 10 and 16, with the closest number of genes to that of X. (A) Categories with highest occurrence (out of these 5 chromosomes) on X are boxed, those with the smallest numbers are underlined. (B) The combined average %disorder values with proteins in those categories on the selected autosomes (open circle, chromosomes 5, 9, 10 and 16), on chromosome X (open triangle), combined values for X and Y (filled triangle).
We also calculated the median disorder for all proteins with a GO annotation, also shown in Table 2. The median disorder of all autosome-coded proteins was 11.7% whereas for X and Y the combined value was 21.1%. When we took into account only those proteins with no GO function associated yet (altogether 2890 proteins), the median disorder for all such autosome-coded proteins increased to 25.2% whereas for the X and Y chromosomes this value was 49.5%. These values are significantly higher both for the autosomes and the sex chromosomes than for proteins with existing GO categories.
Fig. 4 Average number of protein–protein interactions (PPI) per protein for the 24 human chromosomes. Only proteins with at least one recorded PPI in STRING were taken into consideration.
Fig. 5 Tissue-specific human proteins expressed in various reproductive and other tissues, on the sex chromosomes and the autosomes. Standard deviations are shown.
As expected, the most disordered sex chromosome-coded proteins were in reproductive tissues, more so in male than female reproductive organs. It also became clear that sex chromosome-coded proteins expressed in tumor tissues were much more disordered than the autosome-coded ones. Additionally, we found that not all reproductive tissues are equal in this respect: there are significantly more disordered X-chromosome coded proteins expressed in the ovary than in other female reproductive organs. Similarly, significantly more disordered proteins were expressed in the testis coded on X or Y than in other male reproductive tissues (49% vs. 24%, respectively, data not shown). Remarkably, the autosome-coded tissue differences were similarly biased toward reproductive and tumor tissues (albeit to a lesser extent) and were also more disordered in the brain. As we remarked in a recent publication, this latter observation is most certainly due to the higher complexity of brain tissues4 also associated with more protein binding capacity based on binding site predictions in disordered proteins.
Species | Autosomes | Sex chroms |
---|---|---|
Chicken | 9.8 | 8.1 |
Horse | 10.7 | 15.7 |
Chimpanzee | 13.0 | 18.8 |
Platypus | 9.9 | 12.0 |
Mouse | 10.2 | 21.4 |
Human | 13.5 | 25.7 |
Fruit fly | 14.2 | 19.9 |
Fig. 6 Median disorder for each chromosome in the human (A) and mouse (B) proteome. Shared and unique proteins are also indicated.
In Fig. S2 (ESI‡) all the autosome- and sex chromosome-coded, shared and unique proteins are lumped together both in human and mouse. The disorder of the shared proteins on either the autosomes or sex chromosomes does not differ significantly between the human and mouse proteins but the unique human proteins are significantly more disordered than the unique mouse ones both on the autosomes (24.1% vs. 17.5%) and the sex chromosomes (43.7% vs. 32.7%, p-value < 0.0001). In addition, the unique mouse proteins on the autosomes are significantly less disordered than the shared ones (17.5% vs. 22.3%, respectively).
Fig. 7 Syntenic regions (color-coded) and species-specific proteins (grey) on the human and mouse X chromosomes. The disorder of each protein is marked with a dot under the colored schematics of each X chromosome. The continuous red lines indicate the moving averages of 10 consecutive proteins in each. The five most disordered clusters correspond to the five peaks in each moving average graph. The clusters are delineated with the representative gene names.
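Locating the highest peaks of a moving-average curve can be sketched with a simple greedy selection. This is an illustrative procedure only and not necessarily how the clusters in the figure were identified; the minimum separation between peaks is an assumed parameter, and the function name is hypothetical.

```python
def top_peaks(series, k=5, min_separation=10):
    """Indices of up to k highest values that lie at least
    `min_separation` positions apart, returned in positional order.

    series: smoothed values, e.g. a moving average of %IU along a chromosome.
    """
    # Visit positions from highest to lowest value.
    order = sorted(range(len(series)), key=lambda i: series[i], reverse=True)
    chosen = []
    for i in order:
        # Keep a position only if it is far enough from every peak kept so far.
        if all(abs(i - j) >= min_separation for j in chosen):
            chosen.append(i)
            if len(chosen) == k:
                break
    return sorted(chosen)
```

Requiring a minimum separation keeps several adjacent points of the same broad peak from being reported as separate clusters.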
The CT47 cancer/testis antigen cluster is another highly disordered and recently expanded family with at least 13 copies on Xq24.20 It is more widely distributed or more conserved than the aforementioned 3 clusters as it is present not only in primates but also in the cow, pig and rabbit, according to Ensembl (uniprot id: A6H6Y8_BOVIN). The SPANX (sperm protein associated with the nucleus on the X chromosome) proteins are another recently expanded cluster of human proteins expressed in sperm also found in other primates. Similarly to other male reproductive proteins, they are known to evolve very fast.
Upon collecting the GO functions for the X chromosome and the four most closely related chromosomes in terms of known and annotated genes, we saw some functional bias on chromosome X, such as the more frequent occurrence of transcription-related and nucleus-located genes and fewer extracellular ones, both of which point in the direction of more protein disorder. However, these functional biases do not fully explain the disparity in disorder, and the most disordered sex chromosome-coded proteins have no functional characterization assigned by GO (as shown in Table 2), so this kind of analysis can provide only a limited picture.
 | 30 Transcription | 30 Non-transcription | All transcription | All non-transcription | All GO categories | No GO categories |
---|---|---|---|---|---|---|
Autosomes | 34.6 (1331) | 9.6 (4949) | 32.5 (1617) | 12.1 (9649) | 11.7 (17221) | 25.2 (2713) |
Sex chroms | 48.3 (110) | 14.9 (401) | 45.3 (122) | 20.7 (741) | 21.1 (743) | 49.5 (177) |
Regarding protein–protein interactions, interestingly but not completely unexpectedly, we found that sex chromosome-coded proteins have fewer interactions on average than the autosome-coded ones (Fig. 4). This was especially true of chromosome Y, whose proteins have about 2.4 times fewer interactions than those of an average autosome (63 vs. 150). This finding also largely rules out the possibility that sex chromosome-coded proteins are more disordered because they have more interacting partners.
Comparing the human and mouse proteomes with respect to their chromosomal distribution, we reached the somewhat surprising conclusion that the mouse-specific proteins on the autosomes have less protein disorder on average than the ones shared with humans (17.4% and 22.3%, respectively). In comparison, the same values for the human-specific and shared autosome-coded proteins were 24.1% and 22.5%, respectively. This disparity was reflected in the values for the individual chromosomes too, as 10 out of the 22 human chromosomes had more disordered unique proteins but only chromosome 18 in the mouse had this feature (Fig. 6B). This might reflect the different directions the evolution of the two organisms has taken since the two lineages diverged roughly 75 million years ago.21 However, the sex chromosomes in the mouse also have significantly more disordered proteins compared to the autosomes, also appearing in clusters (Fig. 7). All mammalian species studied here showed the same kind of bias, including the platypus, the most ancient mammal. However, the chicken, the only bird so far with a fully sequenced genome, did not show a similar bias between the W and Z sex chromosomes and the autosomes. On the other hand, Drosophila again had a positive disorder bias on the X chromosome. It will be interesting to see how this feature plays out in other organisms with sex chromosomes.
While this is the first time an increase in structural disorder of proteins on the sex chromosomes has been detected, it has been known that reproduction-related proteins evolve at a much faster rate than other proteins.22 This phenomenon has also been observed in other species, e.g. a comparison of sperm proteins in closely related abalone species showed that exons evolve 20 times faster than introns.23 This raises the intriguing possibility that the two phenomena, i.e. the fast evolution and the high disorder of reproductive proteins, might be related.
Several pieces of evidence point in this direction:
1. The most disordered clusters of proteins on the human X chromosome tend to be unique, species-specific or confined to a few related primate species.
2. Disordered proteins are capable of evolving at a much faster rate than globular ones as there are no structural limitations trying to preserve an existing fold. This has been observed before in several instances.24,25
3. The most disordered proteins tend to occur in clusters of similar proteins in relative proximity on the X chromosomes, pointing toward recent evolutionary events: the members of these clusters are highly similar and are therefore the probable results of recent gene duplications. In fact, several members of these clusters are known to be the result of recent segmental duplications.26
It has been the subject of intense debate why reproduction-related proteins tend to evolve at such an exceptionally high rate. Various explanations have been proposed, such as competition between individual sperm or a role in speciation. However, speciation cannot be the driver of evolution in established, stable species (it can only be the consequence of selection processes driven by other forces), or else the existence of species would be a rather ephemeral phenomenon.
We think that the speed of this evolution, comparable to that of the immune system, might actually point to a more plausible explanation. As the sperm proteins are foreign to the female immune system, an elaborate mechanism operates during fertilization to ensure that sperm are not eliminated by it. First, to evade histocompatibility-based responses mediated by MHC class I-restricted cytotoxic T lymphocytes, mature human sperm cells become MHC class I negative.27 On the other hand, natural killer (NK) cells target and lyse those cells that lack such markers during the leukocyte reaction.28 To avoid NK cell-mediated cytotoxicity, human sperm express on their surface at least 3 classes of N-glycans terminated with Lewis x and Lewis y sequences, the latter localized to the sperm acrosome.29 It was also shown that these N-glycans bind DC-SIGN, a pathogen-recognition receptor, with high affinity.30
The above short description shows that sperm proteins and oligosaccharides interact in several direct and indirect ways with the female immune system to ensure a successful fertilization process; it is therefore rather plausible that the two systems also evolve in a concerted manner. While this study might not provide a full answer to the intriguing question of the fast evolution of reproductive proteins, it at least offers an insight from the side of facilitation: they evolve fast because they can. Being disordered means there is little structural constraint, while having relatively few interactions with other proteins (as we have seen for the X- and especially the Y-coded proteins) also means relatively few functional constraints on the fast evolution of these proteins.
Footnotes
† Published as part of a Molecular BioSystems themed issue on Intrinsically Disordered Proteins: Guest editor M. Madan Babu.
‡ Electronic supplementary information (ESI) available. See DOI: 10.1039/c1mb05285c
This journal is © The Royal Society of Chemistry 2012