Mohamed F. Ghalwash ab, A. Keith Dunker c and Zoran Obradović *a
aCenter for Data Analytics and Biomedical Informatics, Computer and Information Sciences Department, College of Science and Technology, Temple University, Philadelphia, PA 19122, USA. E-mail: zoran.obradovic@temple.edu; mohamed.ghalwash@temple.edu; Fax: +1 215 204 5082; Tel: +1 215 204 6265
bMathematics Department, Faculty of Science, Ain Shams University, Cairo, Egypt
cCenter for Computational Biology and Bioinformatics, Department of Biochemistry and Molecular Biology, School of Medicine, Indiana University, Indianapolis, IN 46202, USA. E-mail: kedunker@iupui.edu
First published on 21st November 2011
A grand challenge in the proteomics and structural genomics era is the prediction of protein structure, including identification of those proteins that are partially or wholly unstructured. A number of predictors for identification of intrinsically disordered proteins (IDPs) have been developed over the last decade, but none can be taken as fully reliable on its own. Using a single model for prediction is typically inadequate because prediction based on only the most accurate model ignores model uncertainty. In this paper, we present an empirical method to specify and measure uncertainty associated with disorder predictions. In particular, we analyze the uncertainty in the reference model itself and the uncertainty in data. This is achieved by training a set of models and developing several meta-predictors on top of them. The best meta-predictor achieved comparable or better results than any single model, suggesting that incorporating different aspects of protein disorder prediction is important for the disorder prediction task. In addition, the best meta-predictor had more balanced sensitivity and specificity than any individual model. We also assessed the effects of changes in disorder prediction as a function of changes in the protein sequence. For collections of homologous sequences, we found that mutations caused many of the predicted disordered residues to flip to predicted ordered residues, while the reverse was observed much less frequently. These results suggest that disorder tendencies are more sensitive to allowed mutations than structure tendencies and the conservation of disorder is indeed less stable than conservation of structure. Availability: five meta-predictors and four single models developed for this study will be publicly and freely accessible for non-commercial use.
Traditionally, the 3-D structures of the proteins are determined by using costly experimental methods such as X-ray crystallography, Overhauser Effect Enhanced Nuclear Magnetic Resonance (NMR) spectroscopy and Circular Dichroism (CD) spectroscopy. The first IDP predictor11 was developed in 1997 in our laboratory, and this event was followed by development of numerous additional disorder predictors by our group and others. The current state of protein disorder predictors with their advantages and drawbacks has been summarized recently.12–15
Each disorder predictor uses different concepts, different physico-chemical properties, or even different machine learning algorithms. For example, some protein disorder predictors assume that the prediction for each residue is independent of the prediction for other residues, while others take into consideration that the predicted disorder tendency of neighboring positions could be beneficial to the disorder prediction task. On the other hand, some other predictors are specific to one type of disorder only, represented by missing residues from X-ray structures. Therefore, relying on a single disorder predictor is not necessarily the best strategy.
Two sources of uncertainty in disorder prediction are model uncertainty and data uncertainty. Although model uncertainty in disorder prediction depends on the selected model, the relationship between model uncertainty and model selection has not been systematically investigated. Usually, from a space of models, one is selected that is more accurate than several alternatives when tested on a given set of sequences. A limitation of this kind of selection process is that problems arising from model uncertainty are not considered in sufficient depth.
To study the above discussed effects, we developed several protein disorder predictors to analyze uncertainty in disorder prediction. Each of the implemented predictors uses a different machine learning algorithm and/or was trained on a different dataset. Based on the individual predictors, we built five meta-predictors.
In the first part of this study, the aim was to analyze the uncertainty in the disorder prediction. For that purpose, CASP 8 (122 protein sequences)16 and CASP 9 (117 protein sequences)17 sequences were used to evaluate the performance of multiple models with respect to accuracy of protein disorder prediction. Our results showed that more than one model deserves consideration in making inferences of disorder prediction and integrating multiple predictors with different flavors is better than using individual components and relying on one predictor.
In the second part of the paper, which to the best of our knowledge has not been done before, the uncertainty in the protein sequence data is analyzed. We analyzed disorder prediction as a function of the effect of change in the data provided to the predictors. We applied our method to 29,489 CASP 8 and 27,450 CASP 9 homologous sequences and studied the effect of change in the sequence similarity on the disorder predictions. Although protein structure is generally more conserved than sequence,18–20 we found that the regions that are predicted as disorder are sensitive to the changes (mutation) in the sequence unlike the regions that are predicted as structure, suggesting that the disorder predictions are more sensitive to changes in the sequences than structure predictions and the conservation of disorder is indeed less stable than conservation of structure.
DISOPRED2 23 is the first method using support vector machine (SVM) for the protein disorder prediction. DISOPRED2 was trained on a dataset of missing residues of solved structures, separately for N-, C- and middle regions. The input was constructed from PSI-BLAST generated profiles of position specific scoring matrices.24 The main reason for the low false positive rate of DISOPRED2 is one of the advantages of SVM, namely that it can incorporate greater cost of misclassification for one of the classes, therefore it can compensate for unbalanced datasets.
In the case of feed-forward neural networks and SVM the prediction for each residue is independent of the prediction for other residues. On the other hand, DISpro25 uses recurrent networks that can propagate data from later processing stages to earlier stages. DISpro involves the use of evolutionary information in the form of predicted secondary structure and solvent accessibility, and 1D-recursive neural network. In DISpro, the prediction at each position depends on the entire sequence through a recursive network of neighboring positions instead of using a fixed window size. The recently published method OnD-CRF26 uses conditional random fields (CRF) for accurately predicting the transition between structured and disordered regions in proteins. The input was constructed from the amino acid sequence and secondary structure prediction. Both methods, DISpro and OnD-CRF, have the ability to take into account the disorder prediction of neighboring residues.
All aforementioned methods are specific to one type of protein disorder only, represented either by missing residues of X-ray structures or DisProt database.27 Their performance tested on the other dataset resulted in significantly lower efficiencies. This problem was first addressed by the PONDR VSL2 method.28,29 It is composed of two specialized predictors optimized for short (≤30 residues) and long (>30 residues) disordered regions that are integrated by an independent linear SVM meta-predictor using the inner-product kernel. The inputs of all three methods are composed of various amino acid propensities, sequence complexity, and optionally sequence profiles and secondary-structure predictions, calculated within a sliding fixed local window. At the first level, the two methods predict short and long disordered segments, respectively. The meta-predictor then determines the optimal weight to combine the output of these two composite predictors. This architecture ensured that PONDR VSL2 has a more balanced performance on disordered segments of various lengths.
PONDR-FIT 30 is another meta-predictor based on a consensus artificial neural network (ANN). It was developed by combining the outputs of six individual disorder predictors. For each residue along the sequence, the prediction results of the six individual predictors on a sliding window of 21 residues centered at that residue were fed into a single layer artificial neural network (ANN) with 20 hidden units. The single output at the output layer is the disorder score of the meta-predictor for that centered residue. The meta-predictor (MD)31 is developed by combining a set of several orthogonal methods that capture many types of disorder without sacrificing the distinction of the type of disorder that is detected. A simple arithmetic average over different methods slightly improved over the best method. Many other meta-predictors32–35 have been developed to improve the accuracy of disorder prediction.
To the best of our knowledge, no work has been done before to study uncertainty in protein disorder prediction models and data.
The formation of protein complexes often involves segments that undergo disorder-to-order transitions upon formation of the complexes. To obtain a dataset reduced in such segments, a second dataset, which we will call the monomeric dataset, was assembled from single chain PDB structures. This set contains 102,636 ordered and 95,156 disordered residues. Also included in this set were all of the disordered protein segments from the DisProt database.27
AA frequency-based features are the numbers of the indicated amino acids in a given window of a prespecified length centered at the current residue. For example, FWC means the number of amino acid F plus the number of amino acid W plus the number of amino acid C over a window centered at the current residue. The list of all AA frequency-based features used in our model is shown in Table 1.
Features | Features | Features |
---|---|---|
WFYC | VILM | WCFIYVLHM |
WYFEDH | VIYFW | D |
E | H | I |
M | N | P |
R | S | T |
V | K | PEVK |
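As a minimal sketch of how one such AA frequency-based feature could be computed, the function below counts the residues belonging to a group (for example `"FWC"`) inside a window centered at a residue. The function name, the 0-based indexing, and the `window` argument are illustrative assumptions; the text only states that the window length is prespecified.

```python
def aa_group_count(sequence, i, group, window):
    """Count residues of `sequence` that belong to `group` within a
    window of length `window` centered at 0-based position i.

    The window is truncated at the sequence termini (an assumption).
    """
    half = (window - 1) // 2
    start = max(0, i - half)
    end = min(len(sequence) - 1, i + half)
    members = set(group)
    return sum(1 for aa in sequence[start:end + 1] if aa in members)


# Hypothetical toy sequence: the FWC feature at position 3 with a window of 5
# covers residues F, W, C, A, A.
toy = "AFWCAAA"
fwc = aa_group_count(toy, 3, "FWC", window=5)
```

Each entry of Table 1 would correspond to one call with a different `group` string.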
Property-based features37 are the sum of the residue's property-values. Following previous studies28,29 a window of length L = 41 centered at each residue in the sequence was used. Since the window has varying sizes at the N-, C-terminal regions,38 we use s and e to adjust the start and the end of the window as follows:
s = max{1, i − (L − 1)/2}
e = min{M, i + (L − 1)/2}
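The window adjustment above can be sketched in a few lines. This assumes 1-based residue positions and that M denotes the sequence length, which the text does not state explicitly; the function name is illustrative.

```python
def window_bounds(i, M, L=41):
    """Return the 1-based (s, e) bounds of a window of odd length L
    centered at residue i, clipped to the range 1..M.

    M is assumed to be the sequence length.
    """
    half = (L - 1) // 2
    s = max(1, i - half)
    e = min(M, i + half)
    return s, e


# Near the N-terminus the window is truncated on the left,
# near the C-terminus on the right.
n_term = window_bounds(1, 100)
middle = window_bounds(50, 100)
c_term = window_bounds(95, 100)
```

With L = 41, a residue in the middle of a sufficiently long sequence sees 20 neighbors on each side, while terminal residues see a shorter window.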
Four features were calculated from the properties of the amino acids. The first property, hydrophobicity, which is an important determinant of protein chain folding, was calculated as follows:
The second property-based feature that was used is the flexibility.40 It was calculated as follows:
Since disordered proteins characterized by different experiments exhibit similar complexity distributions,21 which are shifted to lower values compared to, but significantly overlapping with, the distributions for ordered proteins, the complexity of the sequence was used as the third property-based feature in our model. The sequence complexity as measured by Shannon's entropy was calculated as:
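For illustration, a windowed Shannon entropy in its standard form can be sketched as follows. The base-2 logarithm, the use of raw residue frequencies inside the window, and the function names are assumptions made for this example and are not confirmed by the text.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (in bits) of the residue frequencies in `window`."""
    n = len(window)
    if n == 0:
        return 0.0
    counts = Counter(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def windowed_entropy(sequence, i, L=41):
    """Entropy of the window of length L centered at 0-based position i,
    truncated at the sequence termini."""
    half = (L - 1) // 2
    start = max(0, i - half)
    end = min(len(sequence) - 1, i + half)
    return shannon_entropy(sequence[start:end + 1])
```

A window made of a single repeated residue has entropy 0, and a window in which all 20 amino acids are equally frequent has the maximum entropy of log2(20) bits, so low-complexity segments receive low values.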
Finally, we also used amino acid propensities,41 which is a scale to measure how likely an amino acid is to be unfolded, which was calculated as:
Sequence pattern-based features were used to capture the local sequence similarity associated with disordered and structured proteins.42 For each sequence, we slid a window of length 15 centered at each residue. Then, we measured the similarity between this window and each subsequence of length 15 in the dataset. The score of aligning two subsequences s1 and s2 of the same length was measured as:
Logit is a statistical linear model that makes no assumption about the distribution of the independent variables. NN is a non-linear representational classifier that has the ability to capture the non-linear relationship between the input and the output. SVM is a mathematical model that maps the original finite-dimensional space into a much higher-dimensional space, presumably making the separation easier in that space by finding the hyperplane that maximizes the margin between the two classes. Finally, CRF is a discriminative undirected probabilistic graphical model that captures the correlation between the neighboring residues.
A common approach for meta-predictors is to use a voting scheme. The prediction for each residue was computed based on the votes from all models on that particular residue. The question is what happens if there is disagreement among the models. One way to handle this issue is based on the weighted average of the votes. In this method, each model is assigned a weight for its vote. The weights are assigned in the training process. Unfortunately, this method requires that all models be trained on the same dataset. Therefore, this strategy cannot be used in our framework. Another simple and commonly used approach is majority voting. Normally, the majority voting is considered to be the best voting scheme in the meta-predictor framework. Here we show that, for disorder prediction using these data and these models, majority voting does not give the best results.
The voting scheme depends on how many votes from the individual models are considered. Five voting schemes for disorder prediction were tested in this paper. For each residue, if at least x of the five models (x ∈ {1,2,…,5}) predict disorder (positive), the overall prediction for that residue is that it is in a disordered region; the corresponding meta-predictor is called Positive Voter x (PVx). The results for the different voting schemes are explained in Section 4.3 and in Table 2.
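The PVx rule can be sketched as a simple vote count per residue. The data layout (one list of 0/1 predictions per model, with 1 meaning disorder) and the function name are assumptions made for this example.

```python
def positive_voter(votes, x):
    """Combine binary per-residue predictions from several models.

    votes: list of per-model lists, each holding 0 (order) or 1 (disorder)
           for every residue; all lists must have the same length.
    x:     minimum number of models that must predict disorder.
    Returns a list with the meta-prediction (0 or 1) for each residue.
    """
    n_residues = len(votes[0])
    if any(len(model) != n_residues for model in votes):
        raise ValueError("all models must predict the same number of residues")
    return [
        1 if sum(model[r] for model in votes) >= x else 0
        for r in range(n_residues)
    ]


# Toy predictions from five hypothetical models on three residues.
toy_votes = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [1, 0, 0],
    [1, 0, 1],
]
pv2 = positive_voter(toy_votes, 2)  # disorder if at least two models agree
pv3 = positive_voter(toy_votes, 3)  # majority voting with five models
```

Lowering x makes the meta-predictor more permissive about calling disorder, which trades specificity for sensitivity.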
Predictor | CASP 8 Sensitivity | CASP 8 Specificity | CASP 8 Accuracy | CASP 8 AUC | CASP 9 Sensitivity | CASP 9 Specificity | CASP 9 Accuracy | CASP 9 AUC |
---|---|---|---|---|---|---|---|---|
CRF | 0.758 ± 0.060 | 0.850 ± 0.010 | 0.804 ± 0.030 a,b | 0.879 ± 0.031 | 0.543 ± 0.041 | 0.847 ± 0.012 | 0.695 ± 0.021 a | 0.773 ± 0.023 |
SVM | 0.572 ± 0.087 | 0.862 ± 0.010 | 0.717 ± 0.044 | 0.760 ± 0.053 | 0.492 ± 0.035 | 0.848 ± 0.011 | 0.670 ± 0.017 | 0.721 ± 0.021 |
NN | 0.519 ± 0.107 | 0.877 ± 0.010 | 0.698 ± 0.054 | 0.755 ± 0.058 | 0.468 ± 0.037 | 0.865 ± 0.011 | 0.666 ± 0.018 | 0.729 ± 0.019 |
Logit | 0.599 ± 0.108 | 0.877 ± 0.013 | 0.738 ± 0.054 | 0.798 ± 0.054 | 0.338 ± 0.032 | 0.845 ± 0.015 | 0.591 ± 0.017 | 0.613 ± 0.026 |
VSL2B | 0.767 ± 0.070 | 0.828 ± 0.010 | 0.798 ± 0.035 | 0.867 ± 0.039 | 0.557 ± 0.036 | 0.797 ± 0.012 | 0.677 ± 0.018 | 0.709 ± 0.026 |
PV1 | 0.860 ± 0.045 | 0.664 ± 0.015 | 0.762 ± 0.024 | — | 0.776 ± 0.028 | 0.631 ± 0.017 | 0.704 ± 0.015 | — |
PV2 | 0.787 ± 0.064 c | 0.803 ± 0.012 c | 0.795 ± 0.033 c | — | 0.643 ± 0.035 c | 0.779 ± 0.014 c | 0.711 ± 0.018 b,c | — |
PV3 | 0.648 ± 0.091 | 0.901 ± 0.009 | 0.774 ± 0.046 | — | 0.474 ± 0.038 | 0.882 ± 0.010 | 0.678 ± 0.019 | — |
PV4 | 0.523 ± 0.106 | 0.949 ± 0.006 | 0.736 ± 0.053 | — | 0.315 ± 0.031 | 0.937 ± 0.008 | 0.626 ± 0.015 | — |
PV5 | 0.397 ± 0.124 | 0.978 ± 0.003 | 0.687 ± 0.062 | — | 0.188 ± 0.026 | 0.972 ± 0.005 | 0.580 ± 0.013 | — |
a Best accuracy across all individual models. b Best accuracy across all 10 predictors. c Best balance between sensitivity and specificity.
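The Accuracy values in the table appear to be balanced accuracies, that is, the mean of sensitivity and specificity; for CRF on CASP 8, (0.758 + 0.850)/2 = 0.804. The sketch below shows how these quantities can be derived from per-residue labels, with 1 denoting disorder. The function names are illustrative and the averaging rule is inferred from the tabulated numbers rather than stated in the text.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, 1 = disorder."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


def balanced_accuracy(sensitivity, specificity):
    """Mean of sensitivity and specificity."""
    return (sensitivity + specificity) / 2


# Reproduce two tabulated entries from their sensitivity and specificity.
crf_casp8 = round(balanced_accuracy(0.758, 0.850), 3)
pv2_casp9 = round(balanced_accuracy(0.643, 0.779), 3)
```

Because ordered residues far outnumber disordered ones in typical datasets, this average is less dominated by the majority class than plain per-residue accuracy.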
First, for each CASP 8 and CASP 9 sequence, all-against-all BLAST was performed to search in a non-redundant protein database for all homologous sequences. Default parameters of BLAST were used except that a high E-value was used to avoid bias in local similarity search. We removed all homologous sequences that are either more than 10% longer or shorter than the CASP sequence.
A follow-up step was performed to compute the global similarity between the homologous sequences derived from BLAST and the CASP sequences. The identity between a CASP sequence and its homologous sequence was computed as the percentage of perfectly matching positions. Hereafter, the terms sequence similarity and identity are used interchangeably.
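The length filter and the identity computation can be sketched as below. This assumes the two sequences have already been globally aligned (gaps written as `-`) and that identity is normalized by the alignment length; neither detail is specified in the text, and the function names are illustrative.

```python
def within_length_tolerance(casp_seq, hom_seq, tol=0.10):
    """Keep a homolog only if its length differs from the CASP sequence
    by no more than `tol` (10% by default) of the CASP sequence length."""
    return abs(len(hom_seq) - len(casp_seq)) <= tol * len(casp_seq)


def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of alignment columns with identical, non-gap residues.

    Both inputs must be rows of the same global alignment.
    Normalizing by alignment length is an assumption.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        return 0.0
    matches = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap
    )
    return 100.0 * matches / len(aligned_a)
```

Homologs that pass the filter can then be binned by their identity to the CASP sequence before comparing predictions.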
We applied all 10 disorder predictors, consisting of five individual predictors described in Section 3.3 and five meta-predictors described in Section 3.5, on both CASP 8 and CASP 9 sequences and their homologous sequences. For each predictor, we computed the percentage of flips from disorder to order and from order to disorder predictions. We then analyzed the conservation of the disordered predictions at multiple levels of similarity for all 10 predictors. The results are summarized in Section 4.4.
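A minimal sketch of the flip computation for one predictor and one pair of sequences follows. It assumes the two prediction vectors are aligned position by position (1 = disorder, 0 = order) and that each fraction is normalized by the number of residues in the corresponding class of the CASP sequence; the text does not state the normalization, so this choice is an assumption.

```python
def flip_fractions(casp_pred, hom_pred):
    """Fractions of predictions that flip between a CASP sequence and a
    homolog, for position-wise aligned binary predictions (1 = disorder).

    Returns disorder->order flips as a fraction of residues predicted
    disordered in the CASP sequence, and order->disorder flips as a
    fraction of residues predicted ordered in the CASP sequence.
    """
    if len(casp_pred) != len(hom_pred):
        raise ValueError("prediction vectors must have equal length")
    d_to_o = sum(1 for c, h in zip(casp_pred, hom_pred) if c == 1 and h == 0)
    o_to_d = sum(1 for c, h in zip(casp_pred, hom_pred) if c == 0 and h == 1)
    n_dis = sum(1 for c in casp_pred if c == 1)
    n_ord = len(casp_pred) - n_dis
    return {
        "disorder_to_order": d_to_o / n_dis if n_dis else 0.0,
        "order_to_disorder": o_to_d / n_ord if n_ord else 0.0,
    }


# Toy example: 2 of 3 disordered residues flip to order,
# 1 of 5 ordered residues flips to disorder.
flips = flip_fractions([1, 1, 1, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0, 1])
```

Averaging these fractions over all homologs within a similarity bin gives one value per predictor and per level of similarity.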
Fig. 1 Training dataset and feature set for each model.
Fig. 2 ROC for five disorder predictors on CASP 8.
Fig. 3 ROC for five disorder predictors on CASP 9.
Fig. 4 Evaluation of 10 predictors on CASP 8.
Fig. 5 Evaluation of 10 predictors on CASP 9.
Upon comparing our 5 predictors on CASP 8 and CASP 9 data, the CRF was the best model among five individual predictors in terms of accuracy and AUC. This provides evidence that exploiting the correlation between disorder predictions at neighboring residues is beneficial to the disorder prediction task as CRF is the only one of the five models that takes advantage of these correlations. VSL2B and CRF were quite similar on CASP 8. VSL2B had the best sensitivity rate on both CASP 8 and CASP 9 among the five individual models. However, although the VSL2B predictor is applicable to disordered regions of any length and can accurately identify the short disordered regions, it performed worse than CRF on CASP 9. SVM and NN were quite similar on CASP 9. However, SVM had higher sensitivity on CASP 8. All these results suggest that relying on only one predictor is not fully reliable and different datasets have different flavors of disordered proteins.43
It is worth mentioning that, according to the evaluation of the disorder predictions on CASP 816 and CASP 9,17 our CRF predictor also compared well with respect to those disorder predictors that participated in CASP 8 and CASP 9. The CRF predictor was one of the best 6 and 10 predictors with respect to accuracy on CASP 8 and CASP 9, respectively, and was one of the best 11 predictors with respect to AUC on CASP 8. Our CRF model achieved comparable balanced accuracy with OnD-CRF26 on CASP 9 but was superior with respect to AUC.
The power of each of the five single predictors is demonstrated by their application to one of the target proteins from CASP 9. The disorder predictions of the five models on the experimentally determined T0631 target sequence are shown in Fig. 6. The C-terminal region (residues 165–168) was predicted correctly by CRF and VSL2B and incorrectly by the rest of the models. The region in the middle (residues 56–64) was predicted incorrectly by CRF, while the rest of the models were in close agreement on that region. However, VSL2B predicted other regions, such as residues 102–107, incorrectly. In these cases, no predictor was fully reliable on its own, which demonstrates the need to integrate multiple predictors.44
Fig. 6 Prediction of the five models on the T0632 target sequence from CASP 9. The top part shows the prediction score of each individual model. The horizontal red line is the threshold used for each classifier. All thresholds are aligned together to simplify the plot. The bottom parts are the final predictions for each model. The top bar, labeled as Obs, represents the true classifications while the other five bars represent the predictions of the five models. The red points represent the disordered residues while the blue points represent the ordered residues.
First, we looked at the agreement among the five models described in Section 3.3 in both positive (disordered) and negative (structured) predictions. As illustrated in Fig. 7, the agreement among the models could result either in true or false prediction. The percentage of true agreement is high, especially in regions predicted to be structured. In contrast, the percentage of false prediction is fairly small. These results suggest that the meta-predictor is the most reliable especially for structured regions.
Fig. 7 Fraction of full agreement among five disorder prediction models CRF, SVM, NN, Logit and VSL2B on CASP 8 and CASP 9.
To further analyze the agreement, we tested five integration methods on CASP 8 and CASP 9. For each integration method, a certain number of models were used for the voting on the disorder (positive) prediction. For example, in the case of the PV2 meta-predictor, the overall prediction is disorder if at least two of the underlying models predict disorder. Otherwise, the prediction is that the residue is in a structured region. Note that PV3 is the commonly used majority voting algorithm. The five meta-predictors were applied to CASP 8 and CASP 9 sequences and the results are reported in Table 2, which shows that PV2 is the most accurate of the five meta-predictors. Consistent results were obtained on CASP 8 and CASP 9. On CASP 9, the meta-predictor PV2 outperformed CRF, while on CASP 8 CRF outperformed PV2. However, the meta-predictor PV2 is the only predictor that has a good balance between sensitivity and specificity, which is a very desirable property for any protein disorder predictor.
In addition, according to the evaluation17 of the disorder prediction on CASP 9, the meta-predictor PV2 is one of the best 8 disorder predictors with respect to the balanced accuracy.
Fig. 8 Breakdown of the fraction of flips between disorder and structure into multiple levels of similarity on CASP 8 sequences. X-axis is the sequence similarity. Y-axis is the percentage of flips. Number in parentheses is the number of homologous sequences that fall in the corresponding sequence similarity category. Red bar represents the fraction of disorder to disorder predictions. Blue bar represents the fraction of disorder to structure predictions. Yellow bar represents the fraction of structure to structure predictions. Green bar represents the fraction of structure to disorder predictions.
Fig. 9 Breakdown of the fraction of flips between disorder and structure into multiple levels of similarity on CASP 9 sequences. X-axis is the sequence similarity. Y-axis is the percentage of flips. Number in parentheses is the number of homologous sequences that fall in the corresponding sequence similarity category. Red bar represents the fraction of disorder to disorder predictions. Blue bar represents the fraction of disorder to structure predictions. Yellow bar represents the fraction of structure to structure predictions. Green bar represents the fraction of structure to disorder predictions.
The most interesting finding is that the fraction of flips from disordered to ordered was larger than that from ordered to disordered. This suggests that a small change in sequence in a disordered region could easily flip the prediction to a structured region, while regions predicted to be structured were more robust to mutation. As shown in Fig. 8 and 9, the fractions of flips varied considerably among predictors, although some of the models, such as SVM and NN, behaved quite similarly on both CASP 8 and CASP 9. This supports the view that disorder prediction is affected by changes in the sequence. However, the most accurate predictors, PV2, CRF and VSL2B, also appeared to be the most stable with respect to changes in sequence.
Two recent studies compared evolutionary models of structured protein regions versus disordered regions. The first study showed that disordered regions have a greater chance of changing and that the sequence changes are structurally nonconservative.19 In the second study, predictions of secondary structure and predictions of disorder were compared for three evolutionary models, which differed in their choice of amino acid substitution matrices for assignment of the mutations, with the finding that secondary structure predictions are conserved while disorder predictions are not.45 Since all three substitution matrices were biased towards the substitutions found in structured proteins, the observed result might result in part from the bias towards structure of the substitution matrices.46 Here we did not use evolutionary models but directly compared related sequences for the tendency of order and disorder predictions to be conserved. The greater conservation of predicted order as compared with predicted disorder supports the view that conservation of disorder is indeed less stable than conservation of structure, as suggested,45 and that, when conservation of disorder does occur, such an event is highly nontrivial.47
Assuming that order and disorder predictions indicate the relative sensitivity of these two types of structures to mutation, how can we understand the bias that mutations flip disorder to order more frequently than the reverse? There is a long history of reports indicating that residues on the surfaces of proteins are much more subject to mutational change as compared to buried residues48,49 including recent interesting developments showing that there is a quantitative relationship between the degree of conservation and the inverse of the local packing density.50 Thus, given that disordered regions are poorly packed if there is any packing at all, it is understandable that the residues in such regions are subject to rapid nonconservative changes just as those recently reported.19 On the other hand, the packing of structured regions leads to preferences for conserved amino acid changes in order to maintain both the structure and the function of the protein.48–50 Given the relative lack of constraint regarding mutations in disordered regions and given the strong constraints regarding mutations in structured regions, it is no wonder that mutations are more likely to change structural tendencies from disorder to order compared to the reverse.
Does this tendency for mutations to flip sequences from disorder to order have biological implications? As already suggested, when disorder tendencies are conserved for particular regions, this likely indicates that important functions are being carried out by such regions.47 Mutations that increase the tendency for order could also provide a mechanism for disordered regions to evolve into order while gaining structure-dependent functions along the way. In this regard, in two recent studies, the creation of new proteins was studied. In one case, new protein loops were created by mutations leading to intron-to-exon conversion.51 In the second case, new viral proteins arose by a process called overprinting, in which a coding region of RNA was translated into a new reading frame.52 In both cases, the newly created protein was highly polar and disordered. We surmise that if newly created sequences were hydrophobic, then such sequences would be very unlikely to fold into 3D structure and instead would be massively prone to non-specific aggregation and interaction, thus leading to cell death. However, as just mentioned, such newly created disordered sequences could evolve both structure and structure-dependent functions due to the tendency of random mutations to increase structure-forming propensity.
Another interesting related observation is that IDPs are often associated with complex diseases such as cancer, neurodegenerative diseases, cardiovascular diseases, and diabetes,5,53 likely because errors in signaling and regulation arising from IDPs are important for these disease associations. These associations of IDPs with disease led us to suggest the “disorder in disorders” or the D2 concept.54 Analysis of disease-inducing mutations in such IDPs reveals that mutations that cause disorder tendencies to flip to structure tendencies are the most likely mutations in disordered regions to be disease-causing.55 Another recent paper provides additional support for this idea.56
Another source of uncertainty is data uncertainty. We analyzed the effects of changes in the protein sequence on the disorder prediction. We showed that protein disorder predictions were affected by the changes in the sequence. Changes in the sequence result in different behavior for the disorder prediction. For example, SVM and NN were quite similar on CASP 8 and CASP 9. However, they behaved very differently on similar sequences as shown in Fig. 8 and 9. This provides evidence that the disorder predictions are quite sensitive to mutations and that mutations often cause regions with predictions indicating disorder to shift to predictions indicating structure.
Recent studies compared evolutionary models of structured protein regions versus disordered regions.19,45 The result observed in those studies might stem in part from the bias towards structure of the substitution matrices used in the evolutionary models. In this study, we directly compared related sequences for the tendency of order and disorder predictions to be conserved. The findings support the view that conservation of disorder is indeed less stable than conservation of structure.
Still, the most accurate predictors were found to be the most stable with respect to changes in the sequence suggesting that research should be aimed at developing accurate models that also have low uncertainty.
Footnote
† Published as part of a Molecular BioSystems themed issue on Intrinsically Disordered Proteins: Guest Editor M. Madan Babu.
This journal is © The Royal Society of Chemistry 2012