Open Access Article
Pengyue Gao
a,
Sheng Wanga,
Jian Lvb,
Yanchao Wang*a and
Yanming Ma*a
aState Key Laboratory of Superhard Materials, Jilin University, Changchun 130012, China. E-mail: wyc@calypso.cn; mym@jlu.edu.cn
bCollege of Materials Science and Engineering, Jilin University, Changchun 130012, China
First published on 16th August 2017
The complex and rugged potential energy landscape has made protein structure prediction a challenging task in computational biology. Here, we propose an efficient protein structure prediction method combining both template-based and template-free methods. Specifically, the initial protein conformations can be built by a non-redundant protein database and random sampling method with constraints of the secondary structure of the proteins. Three different structure evolution methods including improved particle swarm optimization (PSO) algorithm, random perturbation and fragment substitution are employed to update the protein structures while keeping the secondary structures the same. The present method is benchmarked on several known protein structures with distinct folding patterns, including α proteins, β proteins and αβ proteins. The high success rate and the accuracy of the results demonstrate the reliability of this method.
000 protein structures) have been determined experimentally at the atomic level in the Protein Data Bank (PDB).3 This reflects the large gap between identifying the sequence of a protein and determining its structure. Therefore, the development of low-cost and efficient computer-aided methods for predicting protein 3D structures is highly desirable.
Currently, protein structure prediction methods are classified into two categories based on the extent to which they exploit the known experimental structures in the PDB database: template-based method and template-free method. Template-based method, including comparative modeling4 and fold recognition modeling,5,6 builds protein models by aligning query sequences on solved protein structures to identify structure templates. Comparative modeling identifies templates by sequence or sequence profile comparison, since similar sequences adopt similar protein structures. Fold recognition identifies templates by matching the query sequence directly onto known protein structures. Template-based method has been proven more accurate than template-free methods. However, it's only successful when templates are available in the PDB library. On the contrary, template-free method, also called ab initio modelling,7–9 doesn't rely on any protein structure template and conducts a conformational search under guidance of the energy function to determine the protein structure. According to the thermodynamic hypothesis proposed by Anfinsen,10 the native conformation lies at the global free energy minimum. Several ab initio methods8,9,11–18 have been developed and used to determine various protein structures.8,11,14 However, with increasing protein sizes, the conformational phase space of sampling sharply increases, which makes the ab initio modeling extremely difficult. In practice, the current trends are pointing to approaches, which can extensively combine both methods. Template-based method always includes exploring template independent conformational space. Similarly, the ab initio modeling builds up models by using fragments of known structures.19–22
Here, we develop a general protein structure prediction method enabling the determination of the three-dimensional structure of a protein based on its sequence. The protein sequence is divided into two regions: template-dependent region and template-independent region. The former is constructed by template-based method and the latter is constructed by ab initio method (e.g., random sampling using the secondary structure information). It is also noteworthy that ab initio method is employed for structure evolution. Specifically, all the structures will be updated by improved PSO algorithm, fragment substitution method and random perturbation. Due to employment of combination of the template-based and ab initio methods, our method can achieve an adequate trade-off between two opposite terms: exploration and exploitation. The method is applied to several proteins with distinct folding patterns and system sizes. The high success rate and accuracy support the reliability of the method for protein structure prediction.
The secondary structure of proteins is the 3D form of local segments of proteins. Once the secondary structure is determined, there will be a considerable reduction of the computational cost of structure prediction. Here, one of the widely used secondary structure prediction method: PSIPRED23 is adopted in our method.
Threading, as a template-based protein structure prediction method, aims to select template proteins, which share the similar structural motifs with the target protein, from known protein structure databases. The process details are as follows:
First, the query sequence is matched against a non-redundant sequence database (downloaded from http://zhanglab.ccmb.med.umich.edu/library/) by position-specific iterated BLAST (PSI-BLAST),24 to identify sequence similarity. Note that a critical parameter (E-value) is used in the PSI-BLAST search to estimate the probability of sequence similarity and exclude the similar sequences or homologous proteins. In this manuscript, the templates with an E-value < 0.05 are usually excluded to prove the robustness of the method. For a given protein sequence, a large number of target-template alignments can be found by the PSI-BLAST search. In order to select the potential template to construct the initial conformation, all the target-template alignments are evaluated by a score function related to both sequence and structure, borrowed from a profile–profile aligning approach: PPA approach25 and ranked according to their scores. Only the structures with the high-score can be selected as the potential templates.
To validate the threading method used in our studies, the protein 1r69_ (61 residues) is employed as the benchmark. The sequence of protein 1r69_ as the query sequence is matched against the non-redundant sequence database by PSI-BLAST method and all the target-template alignments are evaluated by PPA approach. A protein alignment (1y7yA) with the highest score of 2.82 is shown in Fig. 2. The Cα-RMSD between the structures of the native protein 1r69_ (3rd–56th residue) and protein 1y7yA (11th–65th residue) is only 1.4 Å. Obviously, both of structures share the similar sequence and structural motif, validating the threading method.
It should be emphasized that two key criteria should be suggested for the potential templates to construct the initial protein model: structural similarity and alignment length. To evaluate the structural similarity, the Cα-RMSD of all target-template alignments should be calculated. It is accepted that the template with Cα-RMSD of template less than 3.0 Å and alignment length more than half of the target sequence usually leads to high success ratio to predict the native-like protein. Therefore, two conditions should be satisfied for a potential template: the Cα-RMSD in the alignment region from template structure to native protein is less than 3.0 Å and the alignment length is larger than half of the target sequence. The score is well established to quantify the alignment length and structural similarity of template, while one of the central challenges is deciding an appropriate and acceptable the cutoff of score to find the potential template protein. To obtain well-defined cutoff of the score, we use the protein 1r69_ as an example to calculate the success rate of finding a potential template with different cutoff. It can be noted that the success rate for the target-template alignments as potential templates dependence on the score is shown in Fig. 3a. The success rate is greater than 80%, when the cutoff is set as 1. While it drops to 62% with cutoff as 0.75. Therefore, the cutoff of score is suggested to be 1. In order to further test the cutoff setting, we calculate the success rate for other proteins. Just as shown in Fig. 3b, the success rates for all the protein are not lower than 50%. Particularly, seven proteins are greater than 70%. The high success rates reveal that it is reasonable to set 1.0 as the truncation of score for identifying a potential template among all the target-template alignments.
As mentioned before, there usually are a large number of target-template alignments, which share similar structures or structure motifs. To use these information, we try to build a fragment library on the basis of these templates. The fragment library can be used to update protein models to improve the success rate of structure prediction. The process is as followed: we list all amino acid sequence fragments (seen as segments) with length of 9 residues26 in the query sequence sequentially, until the last letter of the query sequence is included. All the templates can be divided into several protein fragments in the same way (no gap is permitted in any fragment). The fragments can be ranked by the alignment score25 and 25 fragments with high score for each segment are used to build fragment library. The protein fragment containing 9 residues associates with protein database, while the segment is the amino acid sequence fragment with length of 9 residues in the query sequence. In practice, a segment can be replaced by the protein fragment, which can be randomly selected from 25 fragments in the fragment library.
After the PSI-BLAST search, the query sequence is divided into two regions: template-dependent region and template-independent region. For the template-dependent region, the dihedral angles of the template are copied to construct the initial conformation. For the template-independent region, the initial conformation is constructed based on results of the secondary structure prediction. The two common secondary structures are α-helix and β-sheet, in which the dihedral angle ϕ/ψ values are fixed: −57° and −47° for α-helix and −139° and 135° for β-sheet, respectively.29 Other dihedral angles are randomly generated.
In the PSO scheme, a structure (an individual) in the search space is regarded as a particle. A set of individual particles is called a population or a generation. Each particle is initialized at a random position in the search space. During the evolution process, eqn (1) is used to update the positions of particles as
| xi,jt+1 = xi,jt + νi,jt+1. | (1) |
The new velocity (νi,jt+1) of the ith individual at the jth dimension is calculated based on its previous location xi,jt, previous velocity νi,jt, current location pbesti,jt with an achieved best fitness for this individual, and the population global location to date, gbestt, with the best fitness value for the entire population according to eqn (2)
| νi,jt+1 = K × (νi,jt + n1 × r1 × (pbesti,jt − xi,jt) + n2 × r2 × (gbestt − xi,jt)) | (2) |
![]() | (3) |
Fragment substitution method uses the structural fragments from the built fragment library to replace the existing part of protein conformations for structure evolution. The position for replacing the protein conformation is randomly selected and the fragment is randomly selected from the 25 fragments. Dihedral angles from the selected fragment replace the dihedral angles in the protein conformation. Fragment substitution can lead to rapid convergence on collapsed structures of plausible topology. Random perturbation method tries to perturb the current dihedral angles, which are randomly selected. In our method, the upper limit of the perturbation is 30°. Random perturbation method can effectively increase the structure diversity.
Note that the secondary structures and bond lengths are frozen, only dihedral angles of coil regions are updated during the structure evolution. Each operated structure is rejected or accepted on the basis of energy criteria. If the energy of the operated structure is lower than the energy of previous structure, the operation is accepted. Otherwise, it is rejected.
To evaluate the performance of different structure evolution strategies, we take protein 1r69_ as an illustrative example. The evolution histories with different structure evolution strategies are shown in Fig. 4a. First, we compare three structure evolution strategies: the traditional PSO algorithm (back line), random perturbation (blue line) and the improved PSO algorithm (red line). During the structure evolution by the traditional PSO algorithm or random perturbation, the system soon stagnates and one or more local minima dominate the search, while the improved PSO algorithm can achieve a fine exploration of potential energy surface and further predict the protein model with lower energy. Therefore, it is more efficient than traditional PSO algorithm or random perturbation. Furthermore, we also consider combination of different structure operations, including the improved PSO algorithm combined with random perturbation (purple line), random perturbation and fragment substitution (green line). The combination of two operations (purple line) can obtain the model with lower energy and it demonstrates that the combination is superior to single one. Especially, the energy decreases fast and the model with lowest-energy is successfully predicted at 225th generation, when the combination of three structure operations is used to perform structure evolution. The model with lowest-energy is proved to be a native-like structure and the Cα-RMSD of lowest energy model relative to native protein is only 1.9 Å. Furthermore, the superposition of lowest-energy model onto the native structure is shown in the Fig. 4b. A subtle difference between the lowest energy model and native structure is insignificant. Obviously, the structure evolution strategy of composite structure operations is the most powerful method to update protein structure.
The three categories of proteins with distinct folding patterns, including α proteins, β proteins and αβ proteins, are used to evaluate the performance of our approach. The modelling results of several known proteins, with length from 47 residues to 118 residues, are predicted by our approach and shown in Table 1. It is obvious that the Cα-RMSDs of most targets are less than 10 Å, which can be useful for biological application.50 Particularly, the Cα-RMSD of seven targets of 1shfA, 1r69_, 1di2A, 1ogwA, 1o2fB_, 2f3nA and T0716-D1 predicted by our method are less than 2.5 Å (the generally accepted criterion of high-resolution model50). Thus it can be seen that our method can successfully predict native-like models with different systems sizes and folding patterns, demonstrating the robustness and accuracy of our method.
| PDB code | N | Secondary structure | REmin | ROSETTA | I-TASSER |
|---|---|---|---|---|---|
| a Ref. 25.b Website: http://zhanglab.ccmb.med.umich.edu/decoys/decoy3.html.c Website: http://predictioncenter.org/casp10/index.cgi. | |||||
| 1shfAa | 59 | β | 2.1 (0.71) | 10.8 | 1.7 (0.75) |
| 1r69_a | 61 | α | 1.9 (0.71) | 1.2 | 1.9 (0.75) |
| 1csp_a | 67 | β | 5.3 (0.61) | 4.7 | 2.1 (0.76) |
| 1di2A_a | 69 | αβ | 1.6 (0.81) | 2.6 | 2.3 (0.78) |
| 1n0uA4a | 69 | αβ | 5.1 (0.42) | 10.2 | 4.4 (0.48) |
| 1mla_2a | 70 | αβ | 3.8 (0.59) | 8.7 | 2.8 (0.66) |
| 1ogwA_a | 72 | αβ | 1.5 (0.82) | 1.0 | 1.1 (0.88) |
| 1dcjA_a | 73 | αβ | 8.9 (0.37) | 2.5 | 10.5 (0.39) |
| 1dtjA_a | 74 | αβ | 4.8 (0.61) | 1.2 | 1.9 (0.80) |
| 1mkyA3a | 81 | αβ | 5.1 (0.51) | 6.3 | 5.2 (0.40) |
| 1tfi_a | 47 | β | 5.0 (0.46) | — | 4.6 (0.56) |
| 1ah9a | 63 | β | 4.1 (0.45) | — | 4.3 (0.56) |
| 2f3nAa | 65 | α | 1.7 (0.75) | — | 1.8 (0.74) |
| 1kviAa | 68 | αβ | 3.1 (0.56) | — | 2.0 (0.72) |
| 1itpAa | 68 | αβ | 4.5 (0.44) | — | 10.9 (0.33) |
| 1o2fB_a | 77 | αβ | 2.0 (0.76) | — | 7.1 (0.41) |
| 1sro_a | 71 | β | 6.3 (0.33) | — | 3.4 (0.66) |
| 1of9Aa | 77 | α | 4.1 (0.47) | — | 3.6 (0.53) |
| 1gixAa | 77 | β | 5.2 (0.56) | — | 6.9 (0.44) |
| 1ten_a | 87 | β | 2.7 (0.69) | — | 1.6 (0.85) |
| 1npsAa | 88 | αβ | 3.6 (0.63) | — | 2.1 (0.79) |
| 1no5Aa | 93 | αβ | 9.0 (0.35) | — | 10.6 (0.43) |
| 1thx_a | 108 | αβ | 5.9 (0.56) | — | 2.1 (0.83) |
| 1jnuAa | 104 | αβ | 5.1 (0.54) | — | 2.7 (0.75) |
| 1orgAa | 118 | α | 5.2 (0.53) | — | 2.4 (0.78) |
| 2czsAb | 70 | α | 10.9 (0.24) | — | — |
| 1i27Ab | 73 | αβ | 8.2 (0.43) | — | — |
| 1w53Ab | 84 | α | 11.0 (0.40) | — | — |
| 2ip6Ab | 87 | α | 11.5 (0.33) | — | — |
| T0716-D1c | 51 | α | 2.1 (0.73) | — | — |
| T0662-D1c | 79 | α | 2.9 (0.69) | — | — |
| T0726-D2c | 81 | αβ | 4.3 (0.57) | — | — |
The accuracy of our method is demonstrated through comparison with the all-atomic ROSETTA53 and I-TASSER25 methods. Here we take three typical proteins, including α protein (1r69_) and αβ protein (1di2A_ and 1thx_) as examples. The protein 1r69_ has a topology of four α-helices with 61 amino acid residues. The Cα-RMSD of the lowest energy model predicted by our method is 1.9 Å, which is comparable to the values determined by the ROSETTA (1.2 Å) and I-TASSER (1.9 Å) method. The superposition of the lowest model predicted by our method onto the corresponding native structure is shown in Fig. 5a. A subtle difference observed between the two models demonstrates the accuracy of our method. Furthermore, protein 1di2A_ contains three stands and two helices. The superposition of the lowest-energy model predicted by our method onto the corresponding native structure is shown in the Fig. 5b. The difference between the two models is negligible. Particularly, the model predicted by our method has the Cα-RMSD of 1.6 Å, which is superior to that obtained by the ROSETTA (2.6 Å) and I-TASSER (2.3 Å). We also test a complex protein 1thx_ with more than 100 amino acid residues. The Cα-RMSD of the lowest energy model predicted by our method is 5.9 Å. A native-like structure of target 1thx_ is successfully generated and the difference between the predicted structure and native structure is the relative position of the secondary structures (Fig. 5c). Although the model is not good as that obtained with the I-TASSER method (2.1 Å), the model has approximately correct topology and is still useful for biological application.50 Therefore, the present method is comparable or superior to previously described protein structure search method and can be used to efficiently predict protein structures.
It is well-known that the conformational phase space of sampling sharply increases with protein sizes increasing, which makes the ab initio modeling of proteins with large sizes extremely difficult. Our calculations indicate the accuracy of the models obtained by our method decreases with increasing protein sizes. Just as illustrated in Table 1, the Cα-RMSD of proteins with length > 100 residues (1thx_, 1jnuA and 1orgA) predicted by our method are more than 5 Å. It is obvious that only the coarse models can be obtained by our method for the proteins, whose lengths are more than 100 residues.
To explore the reasons for the success of our developed method, we perform the structure prediction of two typical proteins of 1shfA and 1n0uA4. The 1shfA is a typical β protein, containing 59 amino acid residues. The Cα-RMSD of the lowest-energy model predicted by our method is 2.1 Å, which is superior to that obtained by ROSETTA method (10.8 Å). In-depth study shows that a high-resolution initial model with the Cα-RMSD of 2.5 Å is obtained from the known protein structure databases plays a critical role in successfully predicting the native-like protein. It demonstrates the process of determining the template is very effective for our protein structure prediction. To illustrate the effectiveness of ab initio structure evolution method, we take the protein 1n0uA4 as an example. Protein 1n0uA4 has the length of 69 amino acid residues with four stands and two helices. The initial model is constructed by selected potential template with the Cα-RMSD of 6.8 Å. The lowest-energy model is predicted at 833th generation with the Cα-RMSD of 5.1 Å, which is superior to the initial model. The superposition of the native and the lowest-energy structures showed in the Fig. 6a indicates that a native-like structure of target 1n0uA4 is successfully generated. These results indicate that the structure evolution algorithm adopted in our method can effectively improve the modelling accuracy.
![]() | ||
| Fig. 6 (a) The superposition of the native and the lowest-energy structures of protein 1n0uA4. The purple represents the native structure and the green represents the predicted lowest-energy model. (b) The comparison of Cα-RMSD between initial models and predicted lowest-energy models of the studied proteins listed in Table 1. | ||
In order to further evaluate the performance of our method, we compare the Cα-RMSD of the initial models with the lowest-energy models for all proteins in Table 1. Just as shown in Fig. 6b, it can be clearly seen that most Cα-RMSDs of predicted models are reduced. In other words, the predicted protein conformations become closer to the native proteins than initial ones. These results further demonstrate that our structure evolution algorithm has a powerful ability to improve the modelling accuracy.
| This journal is © The Royal Society of Chemistry 2017 |