Shelly DeForte^a and Vladimir N. Uversky*^abcde
aDepartment of Molecular Medicine, Morsani College of Medicine, University of South Florida, Tampa, FL 33612, USA. E-mail: vuversky@health.usf.edu; Tel: +1-813-974-5816
bUSF Health Byrd Alzheimer's Research Institute, Morsani College of Medicine, University of South Florida, Tampa, FL 33612, USA
cDepartment of Biological Science, Faculty of Science, King Abdulaziz University, Jeddah, PO Box 80203, Jeddah 21589, Saudi Arabia
dInstitute for Biological Instrumentation, Russian Academy of Sciences, 142290 Pushchino, Moscow Region, Russia
eLaboratory of Structural Dynamics, Stability and Folding of Proteins, Institute of Cytology, Russian Academy of Sciences, St. Petersburg, Russia
First published on 22nd January 2016
Intrinsically disordered proteins (IDPs) have a troubled history in the literature. Historically, a wide variety of terminology has been used to describe these strange proteins that do not adopt a stable three-dimensional structure. We provide here a survey of the current status of both IDPs and IDP terminology in PubMed. We have performed an extensive search of the literature from 1978 through 2014 and compiled a list of 1127 proteins and protein domains and the corresponding citations that refer to these proteins using IDP terminology. We show that papers that use IDP terminology are only the tip of the iceberg in terms of the larger body of literature referring to this group of proteins. Furthermore, our analysis suggests that this is likely due to a lack of perceived relevance rather than a lack of awareness. Finally, we have analyzed the language provided by author keywords, MeSH terms, and abstracts as well as the journals that are currently publishing IDP articles. Our results demonstrate a convergence on a common set of terminology and a rise in the number of papers using this terminology. However, our results also demonstrate that we have not reached the point where IDP terminology is fully accepted and embraced in the literature.
The situation changed at the turn of the century, when it was recognized that such biologically active proteins without unique structures are not merely a set of rare exceptions, but instead represent a new and very broad class of proteins.21–24 Since the publication of the first key studies and reviews describing this new concept, the literature on these proteins has virtually exploded (see Fig. 1). The number of articles returned by intrinsically disordered protein (IDP) search terms each year is increasing at a rate greater than that of PubMed as a whole. The cumulative distribution of PubMed articles closely follows a parabolic growth curve, while the growth of articles returned by IDP search terms appears to follow a more exponential curve. This creates the illusion that protein intrinsic disorder has become a well-accepted phenomenon. The goal of our study was to test this hypothesis and to answer an important question: “Are we there yet?”.
The study of IDPs and intrinsically disordered protein regions (IDPRs) is, in many ways, the flip side of structural biology, with applications as far-reaching and ubiquitous. Initially, there was no consensus as to what these oddly behaving proteins should be called. This led to a number of different terms being used to describe the same phenomenon, such as natively denatured, intrinsically unstructured, natively unfolded, inherently flexible, and many others.25 The term intrinsically disordered, however, has emerged as the predominant and agreed-upon term. Therefore, we shall use the terms IDP or IDPR to refer to these proteins, the phrase IDP terminology to refer to the set of terms used to refer to IDPs, and the terms IDP paper or IDP literature to refer to scientific papers that use IDP terminology.
The validation of protein intrinsic disorder in vitro or in vivo is challenging, and typically a consensus regarding the presence and nature of intrinsic disorder in a given protein will be developed over many studies. Therefore, it is not surprising that there is a great deal of inconsistency in terms of how and when the language of intrinsic disorder is used in the literature. This has contributed to a bottleneck in the curation of IDPs. While the fields of structural and un-structural biology are analogous in some ways, there are also key differences. A three-dimensional structure of a protein will typically be deposited in the Protein Data Bank26 before papers related to that structure are published. There is no such process for IDPs. The current databases for experimentally verified IDPs, namely DisProt27,28 and IDEAL,28 require the considerable efforts of IDP-focused researchers and can quickly lag behind the expanding literature. Furthermore, IDP-focused proteomics studies, evolutionary studies, disease-related studies, and functional studies require a synthesis of information over many different proteins and many different experiments. We have surpassed the point where it is possible to read all published papers for known, suspected, or recently discovered IDPs, and therefore researchers who specialize in IDPs are in many ways dependent on the presence of appropriate search terms to help them find what they are looking for.
There are many aspects of intrinsic disorder that are of interest to IDP-focused researchers, such as biophysical mechanisms, structural properties, disease-related properties, and structural and functional modifications under mutation, to name just a few. Therefore, the issue of clear language indicating IDPs and IDPRs in the literature is pressing and immediate.
PubMed is a citation aggregator catering primarily to the biomedical field. One advantage of PubMed is that it includes biomedical literature from MEDLINE, which is the U.S. National Library of Medicine's bibliographic database. A key feature of MEDLINE is that records are indexed with Medical Subject Headings (MeSH) that connect related terms in a hierarchical structure, allowing for more targeted searching, even when precise keywords are not used. The PubMed search engine has become increasingly sophisticated with the ongoing expansion of MeSH terms and the official addition of author keywords in 2013.29
The field of IDPs is slightly behind the curve, however, as the term “intrinsically disordered proteins” was not added to MEDLINE's MeSH terms until 2014. This represents a potential boon for the organization and connection of IDP literature going forward. However, as PubMed does not retroactively index entries, the bulk of the IDP literature must still be referenced using a variety of keywords that search the abstract and title, which will often result in an incomplete picture.
At this significant juncture, it is our intention to present a survey of the use of IDP terminology (the tip of the iceberg) and present a picture of IDPs in the literature outside of this identifying terminology (what lies below), in hopes that this will encourage the community of researchers working with IDPs and IDPRs to contribute to a better connected body of research going forward.
We used the regression-based, fast disorder predictor RAPID31 to evaluate the percent predicted disorder over the entire Swiss-Prot database. Table 1 provides the numbers of proteins for each organism in each predicted disorder interval. As expected, proteins from bacteria and archaea were predicted to be the most structured, while eukaryotic proteins were predicted to be the least structured. We found that 20% of eukaryotic proteins in Swiss-Prot were predicted to be more than 30% disordered. However, it should be noted that the UniProt consortium places a high priority on the annotation of enzymes,30 which is likely to skew the sequence space into one that is more highly structured, so this number should not be taken as representative for all eukaryotes.
The number of UniProt IDs associated with percent predicted disorder | ||||
---|---|---|---|---|
Disorder | Archaea | Bacteria | Eukaryote | Virus |
0–10% | 9997 (54.0%) | 172… | 69… | 7517 (47.0%) |
10–20% | 5086 (28.0%) | 97… | 45… | 4295 (27.0%) |
20–30% | 1673 (9.1%) | 30… | 24… | 2056 (13.0%) |
30–40% | 824 (4.5%) | 13… | 14… | 937 (5.8%) |
40–50% | 346 (1.9%) | 7391 (2.2%) | 7759 (4.5%) | 567 (3.5%) |
50–60% | 213 (1.2%) | 3710 (1.1%) | 4011 (2.3%) | 342 (2.1%) |
60–70% | 102 (0.56%) | 1795 (0.54%) | 2129 (1.2%) | 177 (1.1%) |
70–80% | 43 (0.23%) | 1063 (0.32%) | 1303 (0.75%) | 127 (0.79%) |
80–90% | 23 (0.13%) | 649 (0.2%) | 1085 (0.63%) | 40 (0.25%) |
90–100% | 41 (0.22%) | 2598 (0.78%) | 2338 (1.3%) | 72 (0.45%) |
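The binning behind Table 1 can be illustrated with a minimal Python sketch. It assumes that each protein has already been reduced to a single percent-disorder value and a kingdom label; the function names, the toy input, and the treatment of interval boundaries (a value of exactly 10% is placed in 10–20%, and 100% in 90–100%) are assumptions made for illustration and are not taken from the study's scripts.

```python
from collections import Counter


def disorder_bin(percent_disordered: float) -> str:
    """Return the 10%-wide interval label for a percent-disorder value in [0, 100]."""
    if not 0 <= percent_disordered <= 100:
        raise ValueError("percent disorder must be between 0 and 100")
    # Assumption: 100% is folded into the last bin (90–100%) instead of
    # opening an eleventh bin.
    lower = min(int(percent_disordered // 10) * 10, 90)
    return f"{lower}–{lower + 10}%"


def tabulate(proteins):
    """Count proteins per (kingdom, bin); `proteins` yields (kingdom, percent) pairs."""
    return Counter((kingdom, disorder_bin(pct)) for kingdom, pct in proteins)


# Hypothetical toy input, not data from the study.
example = [("Eukaryote", 45.2), ("Eukaryote", 3.0), ("Bacteria", 9.9), ("Archaea", 100.0)]
counts = tabulate(example)
```

Percentages such as those in Table 1 would then follow by dividing each count by the total for its kingdom.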
In Fig. 2, the relative number of literature citations is displayed for each predicted disorder interval. For all disorder intervals, the majority (60–64%) of bacterial proteins have a single citation, and a further 25–30% have no citations, with little variance between intervals. We expect that in most cases, the single citation is a proteome-level study that produced the sequence in question. Eukaryotic proteins, however, tell a different story. The fraction of proteins that have more than five citations peaks at 20% of the set within the disorder interval between 40 and 50%. The number of citations then falls sharply, to a majority with one citation (59–60%) between 80 and 100% predicted disorder.
This demonstrates that a eukaryotic protein with a medium to high amount of predicted disorder is more likely to receive a large number of citations in Swiss-Prot than a completely structured or completely disordered protein. Surprisingly, it is the highly disordered proteins in archaea and viruses that are more likely to receive a higher number of literature citations. In archaea, the fraction of proteins with between two and five citations jumps from 15% to 23% in the intervals between 70 and 90% predicted disorder. A similar jump occurs in viruses in the intervals between 60 and 90%, where the fraction of proteins receiving between two and five citations rises from 36% to 42%. Raw scores are available in the ESI.†
The 1127 proteins and protein domains in our set were obtained through an extensive manual literature search (see Methods for search terms and criteria, and ESI† for the list of terms and associated PubMed IDs). Of these proteins, 630 could be linked to a DisProt ID, but 497 could not, demonstrating a significant expansion in the literature since the last DisProt update in 2013. Furthermore, this represents only those proteins that are referred to using IDP terminology, and therefore should not be considered to represent the entire set of IDPs.
Our objective was to compare the total number of papers that used IDP terminology to describe specific proteins to the total body of literature for those proteins. For each search term, we recorded the number of articles returned from PubMed, and calculated, for each year, the number of papers using IDP terminology as a fraction of the total amount of related literature for that year. For the entire set of IDP search terms, there is slow growth in the usage of IDP terminology to about 0.003 of the total set. However, because our only criterion was that a protein be referred to as an IDP, there are likely to be some search terms in our set representing proteins that have been incorrectly assigned as IDPs. Therefore, we also created a set of “high confidence IDPs,” which we defined as those that had more than three associated articles using IDP terminology and a RAPID score greater than 30. These proteins show a somewhat more dramatic rise, peaking at just below 0.008. It should be noted that there is a certain amount of expected noise in the set due to irrelevant results from the search terms, or incorrectly assigned proteins. However, even given this expected noise, this fraction is still surprisingly low. We think it is unlikely that this fraction of papers that use IDP terminology represents the only papers that are relevant to IDP researchers. There are at least three possible reasons why a paper about an IDP might not use IDP terminology:
(1) The researchers are unaware the protein is intrinsically disordered.
(2) The researchers do not think intrinsic disorder is relevant, or do not believe the protein is intrinsically disordered.
(3) The researchers are using non-standard language to describe the structural properties of the protein.
In order to investigate the general awareness of intrinsic disorder, we examined the number of new authors publishing papers that use IDP terminology. Amongst our IDP papers there were 8425 unique authors. 6548 of those authors appeared on only 1 paper, 1151 appeared on 2 papers, 361 appeared on 3 papers, and the remaining 365 authors appeared on anywhere between 4 and 58 papers. While we expect some noise due to variations in spelling or the presence of identical names, these numbers seem to indicate that a wide variety of researchers are contributing to the IDP literature. Fig. 4 shows the number of new researchers contributing to IDP papers per year. This shows both a growth of the number of authors per IDP paper and also a steady increase in the number of new contributing authors. It is not clear whether these are researchers who will continue to contribute to the IDP literature in the future, however.
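The author tallies above can be reproduced in outline with a short Python sketch. It assumes that each paper is available as a publication year plus a list of author-name strings; the function names and the example names are hypothetical, and, as noted in the text, matching authors by exact name string is itself a source of noise.

```python
from collections import Counter


def papers_per_author(papers):
    """Map each author name to the number of papers it appears on.

    `papers` is an iterable of (year, [author names]) pairs.
    """
    counts = Counter()
    for _year, authors in papers:
        counts.update(set(authors))  # count each author once per paper
    return counts


def new_authors_per_year(papers):
    """Count authors whose first appearance in the set falls in each year."""
    seen = set()
    new_by_year = Counter()
    for year, authors in sorted(papers, key=lambda p: p[0]):
        fresh = set(authors) - seen
        new_by_year[year] += len(fresh)
        seen |= fresh
    return new_by_year


# Hypothetical toy input, not data from the study.
papers = [
    (2010, ["Smith J", "Lee K"]),
    (2011, ["Smith J", "Garcia M"]),
    (2011, ["Lee K"]),
]
per_author = papers_per_author(papers)
distribution = Counter(per_author.values())  # papers per author -> number of authors
new_per_year = new_authors_per_year(papers)
```

Here `distribution` corresponds to the breakdown of authors by number of papers, and `new_per_year` to the kind of series plotted in Fig. 4.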
We found it useful to conceptually separate authors into one of the following groups:
(1) Authors who primarily study IDPs and contribute to papers on a variety of proteins. We would expect these authors to have a low number of papers for a specific protein, but that a high fraction of those papers would use IDP terminology.
(2) Authors who primarily study one or more specific IDP proteins, but do not focus on the disordered properties of the protein. We would expect these authors to have a higher number of papers in the subject area but a low fraction of IDP papers.
(3) Authors who primarily study one or more specific proteins and also focus on the intrinsically disordered properties of those proteins. We would expect these authors to have a high number of papers in the field and that a high fraction of those papers would use IDP terminology.
Starting with this premise, we looked at two well-known IDPs: tau and alpha-synuclein. The first surprising result is that even with these well-known IDPs, the fraction of papers using IDP terminology is still very low. Alpha-synuclein peaks at a fraction of 0.05 IDP papers, while tau peaks at 0.008. Fig. 5 shows, for each author, the number of papers published in the field versus the fraction of those papers that use IDP terminology for alpha-synuclein (Fig. 5A) and tau (Fig. 5B). Generally, it appears that the authors publishing papers using IDP terminology are publishing few papers in the field (group 1), and those who publish a large quantity of papers are, generally speaking, not using IDP terminology (group 2). Furthermore, a large number of papers using IDP terminology are published by authors who have published a research study on a different protein that is also an IDP, thus increasing the likelihood that it is IDP researchers (group 1) who are using IDP terminology.
The group of researchers who focus on the IDP properties of specific proteins (group 3), who would appear in the center and upper right portion of the graph, is fairly small, at least for alpha-synuclein and tau. While it is possible that the authors in the field are unaware of the IDP properties of these proteins, we feel this is unlikely for well-known IDPs such as alpha-synuclein and tau. Instead, it seems more likely that many authors are not using IDP terminology because they either do not believe the protein is intrinsically disordered, or they do not think the intrinsically disordered nature of the protein is relevant to the study in question.
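The three conceptual author groups can be expressed as a simple rule on the two quantities plotted in Fig. 5: the number of papers an author has published in the field, and the fraction of those papers that use IDP terminology. The sketch below is only illustrative. The cut-offs `many_papers` and `high_fraction` are hypothetical values chosen for the example; the study does not define numerical thresholds.

```python
def classify_author(total_papers, idp_papers, many_papers=10, high_fraction=0.5):
    """Assign an author to one of the three conceptual groups.

    `many_papers` and `high_fraction` are hypothetical cut-offs, not values
    taken from the study.
    """
    if total_papers <= 0:
        raise ValueError("an author must have at least one paper in the field")
    fraction = idp_papers / total_papers
    if fraction >= high_fraction:
        # High use of IDP terminology: IDP generalist (few papers on this
        # protein) or protein specialist who focuses on disorder (many papers).
        return "group 3" if total_papers >= many_papers else "group 1"
    # Low use of IDP terminology: protein specialists fall in group 2; authors
    # with few papers and a low fraction match none of the three groups.
    return "group 2" if total_papers >= many_papers else "unclassified"


# Hypothetical authors, for illustration only.
labels = [classify_author(2, 2), classify_author(40, 1), classify_author(30, 20)]
```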
Fig. 7 The usage of specific IDP terminology in PubMed abstracts. For each year, the abstracts for the PubMed IDs in the IDP set were searched for each of the terms listed, and the number was counted. These are the top 4 terms over all abstracts out of 20 total search terms (ESI†).
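The per-year term counting described in the caption of Fig. 7 could be sketched as follows. Two points are assumptions: the four terms listed here are examples taken from the introduction rather than the actual top 4 terms (the full list of 20 is in the ESI), and an abstract is counted once per term regardless of how many times the term occurs in it.

```python
from collections import defaultdict

# Illustrative terms only; the full list of 20 search terms is in the ESI.
TERMS = [
    "intrinsically disordered",
    "natively unfolded",
    "intrinsically unstructured",
    "natively denatured",
]


def term_usage_by_year(abstracts, terms=TERMS):
    """Count, per year, the abstracts that contain each term (case-insensitive).

    `abstracts` is an iterable of (year, abstract text) pairs.
    """
    usage = defaultdict(lambda: dict.fromkeys(terms, 0))
    for year, text in abstracts:
        lowered = text.lower()
        for term in terms:
            if term in lowered:
                usage[year][term] += 1
    return usage


# Hypothetical toy abstracts, not data from the study.
abstracts = [
    (2012, "An intrinsically disordered region mediates binding."),
    (2012, "This natively unfolded protein is Intrinsically Disordered."),
    (2005, "A natively unfolded protein."),
]
usage = term_usage_by_year(abstracts)
```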
Surprisingly, 2110 of the 2278 IDP papers in the set did not have keywords available. However, this makes sense in light of the fact that PubMed did not officially add author keywords until 2013. Table 2 shows the occurrences of keywords for those entries with the author keyword field available. There were 67 appearances of either “intrinsically disordered protein” or “intrinsically disordered proteins”. Not surprisingly, method-related terms such as “nuclear magnetic resonance/nmr”, “molecular dynamics”, and “circular dichroism” were common as well. Fig. 8 shows a “Wordle” for the keywords in our set, with common words emphasized through an increase in size.
The top 10 keyword phrases | |
---|---|
Keyword phrase | Appearances |
Intrinsically disordered protein(s) | 67 |
Nuclear magnetic resonance/NMR | 14 |
Alpha-synuclein | 10 |
Protein folding | 9 |
Protein structure | 9 |
IDP | 9 |
Circular dichroism | 8 |
Protein–protein interactions | 7 |
Molecular dynamics | 6 |
Phosphorylation | 6 |
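Table 2 merges variant spellings such as the singular and plural forms of “intrinsically disordered protein”. One way such a tally might be produced is sketched below; the synonym map is a hypothetical, deliberately small example, and counting each keyword at most once per paper is an assumption.

```python
from collections import Counter

# Hypothetical synonym map; a real analysis would need a longer, curated list.
CANONICAL = {
    "intrinsically disordered protein": "intrinsically disordered protein(s)",
    "intrinsically disordered proteins": "intrinsically disordered protein(s)",
    "nmr": "nuclear magnetic resonance/nmr",
    "nuclear magnetic resonance": "nuclear magnetic resonance/nmr",
}


def normalize(keyword):
    """Lower-case, collapse whitespace, and map known variants to one label."""
    key = " ".join(keyword.lower().split())
    return CANONICAL.get(key, key)


def top_keywords(keyword_lists, n=10):
    """Return the `n` most common normalized keywords, counted once per paper."""
    counts = Counter()
    for keywords in keyword_lists:
        counts.update({normalize(k) for k in keywords})
    return counts.most_common(n)


# Hypothetical toy input, not data from the study.
keyword_lists = [
    ["Intrinsically disordered proteins", "NMR"],
    ["intrinsically disordered protein", "Phosphorylation"],
    ["Nuclear  magnetic resonance"],
]
top = dict(top_keywords(keyword_lists))
```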
Fig. 8 A Wordle for the keywords in IDP papers. The size of each word is increased in proportion to its number of occurrences.
MeSH terms appeared in 2198 of the articles; however, only 66 of the articles had the MeSH term “intrinsically disordered proteins”. This is explained by the fact that the MeSH terms are not back indexed, and the term “intrinsically disordered proteins” was not introduced until 2014. The top 10 MeSH terms can be seen in Table 3, and not surprisingly, emphasize protein structure, binding, and sequence. The full list of keywords and MeSH terms associated with IDP papers can be found in the ESI.†
The top 10 MeSH terms | |
---|---|
MeSH term | Appearances |
Humans | 1010 |
Amino acid sequence | 923 |
Molecular sequence data | 897 |
Protein structure, tertiary | 688 |
Models, molecular | 673 |
Animals | 668 |
Protein binding | 634 |
Protein conformation | 634 |
Protein structure, secondary | 541 |
Binding sites | 441 |
The dearth of author-supplied keywords and IDP-specific MeSH terms in the literature means that either the title or abstract must contain IDP terminology in order for the majority of papers to be retrieved in an IDP-specific search. It is very possible that this has significantly contributed to the low percentage of papers that are searchable by IDP terminology.
We also looked at which journals are publishing IDP papers. Table 4 shows the top 10 journals publishing IDP papers.
Overall, chemistry-focused journals were highly represented, with medical and biological journals following close behind, and a smaller number of structural biology and computationally-focused journals. It should be noted, however, that our set did not include reviews, proteomic studies, or protocol papers, and these are the top journals for primarily experimental studies on individual proteins. The full list of journals publishing IDP papers can be found in the ESI.†
The top 10 journals publishing IDP papers | |
---|---|
Journal | Papers |
J. Biol. Chem. | 256 |
Biochemistry | 239 |
J. Mol. Biol. | 171 |
Proc. Natl. Acad. Sci. U. S. A. | 108 |
PLoS One | 75 |
Protein Sci. | 65 |
J. Am. Chem. Soc. | 51 |
Biophys. J. | 51 |
Biochim. Biophys. Acta | 50 |
Proteins | 47 |
Disorder prediction for Swiss-Prot was obtained using the fast regression-based disorder predictor RAPID at http://biomine-ws.ece.ualberta.ca/RAPID/index.php.31 We felt this predictor was the best choice because we needed to process a large number of sequences (all Swiss-Prot sequences), and RAPID provides high speed with high-quality predictions. RAPID was compared with 21 disorder predictors31 and performed as well as or better than any publicly available predictor that performs at the speed we needed for such a large dataset. Furthermore, we did not need the detail provided by individual residue prediction and consensus methods, and instead only needed to place proteins within a disorder prediction bin (0–10%, 10–20%, etc.), and therefore we felt that a single predictor was sufficient. All parsing of the raw data files was done through custom Python scripts.32
From the initial 3343 results, we manually examined each paper to try to ascertain which proteins were referred to as an IDP or indicated to have an IDPR. We recorded these names using the same language used in the corresponding literature. We discarded review, theory, proteomic, and method papers, as well as irrelevant results. This filtering resulted in 2278 PubMed articles attached to 1127 search terms, each corresponding to a protein or protein domain.
Our emphasis was primarily on the language used in the literature, and therefore we did not evaluate the experimental evidence. Because curation was not the primary objective of this project and naming conventions vary, there may be some duplicates and incorrect assignments, but we attempted to minimize this as much as possible. For each of the 1127 identified proteins and protein domains, we created a search term and attempted to maximize relevant results by adding qualifiers as necessary. For instance, the search term we created for tau was “tau AND (protein OR Alzheimer's OR tauopathies OR neuronal)”, because a search for “tau” alone would return many irrelevant results. Similarly, the search term for p53 was “p53 AND (CTD or C-terminal or C-terminus)”, because we wanted to specifically target our search towards the region that had been identified as intrinsically disordered. We attached DisProt and UniProt IDs to each protein search term; however, in many cases, this required an educated guess due to variations in naming conventions. In some cases, more than one UniProt and/or DisProt ID was attached when multiple organisms were referred to in the paper(s). In cases where only a domain was mentioned, a UniProt ID was not assigned. There were 630 proteins in our set that could be attached to DisProt IDs. For each UniProt ID assigned to a protein, a disorder prediction was obtained by RAPID.
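The qualified search terms described above follow a simple pattern: a protein name combined with a parenthesized list of alternatives. A minimal Python helper for assembling such strings is sketched below. The function name is hypothetical, and in the study the qualifiers themselves were chosen manually for each protein; the helper emits uppercase Boolean operators throughout.

```python
def build_search_term(name, qualifiers=()):
    """Combine a protein name with optional OR-joined qualifiers."""
    if not qualifiers:
        return name
    return f"{name} AND ({' OR '.join(qualifiers)})"


# Reproduces the tau search term quoted in the text.
tau_query = build_search_term(
    "tau", ["protein", "Alzheimer's", "tauopathies", "neuronal"]
)
```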
In order to get the number of both IDP and non-IDP papers per year, PubMed was automatically queried for each PubMed ID using the Biopython suite of tools33 and custom Python scripts. The fraction of IDP papers is calculated as the number of IDP papers divided by the entire set of papers for that protein search query. The set of 2278 PubMed IDs formed the basis for the IDP-specific author, keyword, abstract, and journal data. This data was extracted from the corresponding MEDLINE entries using Biopython.
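A minimal sketch of this calculation is given below. It is not the authors' script: the function names are hypothetical, and the use of `Entrez.esearch` with a publication-date restriction is one plausible way to obtain per-year totals with Biopython. The query function needs Biopython and network access, so it is defined but not called here. The last helper encodes the “high confidence IDP” criterion given in the Results (more than three articles using IDP terminology and a RAPID score greater than 30).

```python
def count_pubmed_hits(term, year, email):
    """Return the number of PubMed records matching `term` published in `year`.

    Requires Biopython and network access; not called in this sketch.
    """
    from Bio import Entrez  # imported lazily so the rest runs without Biopython

    Entrez.email = email  # NCBI asks callers to supply a contact address
    handle = Entrez.esearch(
        db="pubmed", term=term, retmax=0,
        datetype="pdat", mindate=str(year), maxdate=str(year),
    )
    try:
        result = Entrez.read(handle)
    finally:
        handle.close()
    return int(result["Count"])


def idp_fraction_by_year(idp_counts, total_counts):
    """Divide IDP-paper counts by total paper counts, skipping years with no papers."""
    fractions = {}
    for year, total in total_counts.items():
        if total > 0:
            fractions[year] = idp_counts.get(year, 0) / total
    return fractions


def is_high_confidence(n_idp_articles, rapid_score):
    """Apply the criterion: more than three IDP articles and RAPID score above 30."""
    return n_idp_articles > 3 and rapid_score > 30


# Hypothetical counts, not data from the study.
idp_counts = {2013: 3, 2014: 4}
total_counts = {2012: 200, 2013: 600, 2014: 500, 2015: 0}
fractions = idp_fraction_by_year(idp_counts, total_counts)
```

Years with no papers at all are omitted rather than reported as zero, so that an empty year is not confused with a year in which no paper used IDP terminology.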
The curation of IDPs and the synthesis of literature that goes into building IDP theory require painstaking manual literature searches that are further hindered by an inconsistent use of IDP terminology. It appears that IDP terminology is used mostly by researchers who primarily study IDPs, and not by researchers who primarily study a specific protein or proteins. We would argue that consistent usage of IDP terminology will not increase significantly until more researchers see the value in using IDP terminology to describe the proteins they study. Therefore, the challenge is one of both increasing awareness and expanding the perceived relevance of intrinsic disorder. The introduction of the MeSH term “intrinsically disordered proteins” and the addition of author keywords to PubMed allow for better indexing of IDPs in PubMed, and we strongly recommend that authors of studies involving IDPs suggest this MeSH term upon submission. Furthermore, we recommend the inclusion of the term “intrinsically disordered protein(s)” in the author keyword list.
Finally, we have provided here a list of 1127 proteins and protein domains that have been referred to using IDP terminology, along with their associated PubMed IDs (ESI†), which includes 497 proteins not currently in DisProt. We hope this will provide a useful starting point for the further curation of recently recognized IDPs.
Footnote
† Electronic supplementary information (ESI) available. See DOI: 10.1039/c5ra24866c
This journal is © The Royal Society of Chemistry 2016