Anupaul Baruah and
Parbati Biswas*
Department of Chemistry, University of Delhi, Delhi-110007, India. E-mail: pbiswas@chemistry.du.ac.in
First published on 26th November 2013
Intrinsic conformational flexibility is a consequence of the rugged energy landscape of the protein which is encoded in its amino acid sequence. In this work, a self-consistent mean-field based theory is developed and applied on lattice and real proteins to design sequences with varying foldability and desired flexibility. Monte Carlo simulations are used to evaluate the conformational flexibility of the designed sequences in terms of the relative propensity of the sequence to choose the native state among the ensemble of near-native conformations. The mutation of specific residue pairs in the designed sequences affects their conformational flexibility. Site pairs which exhibit a greater deviation from the corresponding random values of the residue probabilities may be preferentially selected for pair mutation of residues to modulate the flexibility. The mutation of such residue pairs due to evolution may lead to significantly altered conformational flexibility and stability while retaining its sequence identity. This result is supported by experimental evidence [Lehmann et al., Protein Eng., 2000, 13, 49], which suggests that the conserved residues in a set of homologous protein sequences contribute more to the stability of the protein. Mutations of such residue pairs by decreasing the foldability lead to higher flexibility and may mimic protein evolution [Tomatis et al., Proc. Natl. Acad. Sci. U. S. A., 2008, 105, 20605]. The spatial distance between any two residues is minimally correlated to the generation of the mutations. However, long ranged correlation between residue pairs may also be important for specific residue pairs. A higher correlation between residue pairs is noted to be sufficient but not a necessary condition for accommodating higher generations of mutation.
In this work, a self-consistent mean field theory is developed to design sequences consistent with a particular foldability criterion with varying flexibility. The sequence-structure compatibility is characterized by the choice of a coarse-grained potential with one-body and two-body interactions between amino acid residues. Flexibility is measured in terms of the propensity of the protein to fold into its native conformation through repeated mutation of a specific pair of residues consistent with a similar foldability criterion until a foldable sequence can be designed. This pair mutation procedure is then repeated for all possible pairs of residues. The sequence flexibility may be controlled by varying the foldability criterion. Residue pairs that lead to higher or lower flexibility through pair mutations are identified. The calculated values of the site-specific probability and pairwise probability profile of amino acids are used to examine the mutability of particular residues. It may be observed that mutations of specific pairs of residues make the protein more flexible compared to others. Site pairs with higher deviation from the random values of the residue probabilities may be chosen for pair mutation to tune the flexibility of a sequence. If such residue pairs are mutated due to evolution, the resulting protein may have altered conformational flexibility and stability with high sequence similarity to that of the unmutated protein. Earlier experimental studies12,13 suggest that the conserved residues present in a set of homologous protein sequences have the maximum contribution to the stability of the sequence. Conserved residues indicate a higher preference for such residues at the corresponding sites. In the present article, such residue pairs may be identified by the higher deviation from the random values of the residue probabilities. Pair mutations induced by decreasing the foldability indicates that a protein compromises stability at the expense of increased conformational flexibility and therefore mimics protein evolution.14,15 Pair mutations induced by increasing the foldability may represent less flexible ancestor-type sequences. This theory is applied to cubic lattice proteins and verified on real protein conformations. The theory may be used to evaluate design strategies for a given target structure of a protein and provides ample scope to assay the designability of protein structures as a function of their conformational flexibility.
217
728. Exact enumeration by the first depth algorithm16 yields 103
346 unique, compact conformations on a 3 × 3 × 3 lattice, which are not related by rotational, reflectional or translational symmetry.16 The native (target) conformation is identified as the most designable non-degenerate conformation since it16–18 represents the lowest energy conformation for a maximum number of sequences. The binary hydrophobic-polar patterning is based on the assumption that the hydrophobic interaction is the dominant force behind protein folding.19,20 The reduction of a 20 letter amino acid alphabet to a 2 letter HP model makes it exactly solvable and computationally tractable. Such HP models are extensively used by us4,10,21–24 and others25,26 to successfully model protein folding and design. In this work, both one-body and two-body potentials are used simultaneously to construct a new method which incorporates pair mutations through self-consistent mean field theory to design sequences with controlled conformational flexibility.
The energy of a sequence in a given conformation may be expressed as the sum of the one-body and two-body terms25
![]() | (1) |
![]() | (2) |
In our earlier works, we have either used a pure one-body potential or a pure two-body potential. In this work, we have considered a potential including both one-body and two-body hydrophobic interactions to study the effects of pair mutations on both one-body and two-body interactions of proteins and their respective impact on the conformational flexibility. The two-body interaction parameter makes it possible to study the effect of long distance hydrophobic–hydrophobic interactions as non-bonded site pairs separated by a cut-off distance, irrespective of their separation along the sequence, are considered as interacting neighbors. The one-body energy parameter γ(1)k(αi) denotes the energy contribution of the amino acid type αi in the k-th structural context. Solvent accessibility is chosen as an appropriate structural context here. Solvent accessibility is measured in terms of the coordination number. The higher the coordination number, the lower the solvent accessibility. Sites with coordination numbers of 3 and 4 are considered to be in structural context, k = 2, while coordination numbers of 1 and 2 are considered to be in structural context, k = 1. The two-body term γ(2)(αi, αj) quantifies the inter-residue contact propensities,25 including excluded volume interactions of the monomer types αi and αj at the i-th and j-th sites, respectively.
| γ(1)1(H) = 0, γ(1)1(P) = 0, γ(1)2(H) = −1ε, γ(1)2(P) = 0, γ(2)(H, H) = −3ε′, γ(2)(H, P) = −1ε′, γ(2)(P, H) = −1ε′, γ(2)(P, P) = 0 | (3) |
By assuming small fluctuations in the native state energy due to sequence variation, the sequence averaged energy of the native conformation may be written as
![]() | (4) |
Similarly, the average energy of the unfolded ensemble (ensemble average) may be expressed as
![]() | (5) |
A suitable choice of the energy function may define a foldability criteria, which measures the sequence-structure compatibility.30 For an NVT ensemble, the change in the Helmholtz free energy of folding (ΔA) is related to the equilibrium constant (Keq) as
| Keq = exp(−βΔA) = Zf/Zu | (6) |
![]() | (7) |
The stability gap, Δ, represents the difference between the energy of the native (target) state and the average energy of the ensemble of unfolded states, while Γunf2 is the variance in the energy of the unfolded ensemble. Ωu is the number of possible unfolded states. Since, ln
Ωu is constant for all sequences, a foldability criteria may be defined as
![]() | (8) |
The one-body and two-body residue interactions are treated independently. However, the one-body and two-body probabilities are not independent, they are coupled to each other and among themselves through the constraints which specify the local/global features of the structure and sequence. Within this approximation the sequence entropy may be expressed as the sum of the contributions from the one-body and two-body residue probabilities
![]() | (9) |
The most probable set of one-body probabilities (ωi(αi)) and two-body probabilities (ωi,j(αi, αj)) may be determined by maximizing the entropy subject to the relevant constraints on the foldability criteria and normalization of the site-identity and pairwise monomer probabilities. These constraints are:
(i) the normalization of the one-body probabilities of each site,
![]() | (10) |
(ii) the normalization of the two-body probabilities of each pair of sites,
![]() | (11) |
(iii) the foldability criteria, ϕ, as expressed in eqn (8).
The variational functional, V, of the set of site identity and pairwise monomer probabilities may be expressed as
![]() | (12) |
![]() | (13) |
![]() | (14) |
This set of equations is solved numerically to determine the most probable sequence consistent with a particular value of the foldability criterion, ϕ.
In this work we propose a new method to self-consistently incorporate pair mutations in the designed sequences. Pair mutations are incorporated in the generated sequences by varying the foldability, the site-specific monomer and the pairwise monomer probability values of the selected pairs of sites, while all other monomer and two-body probabilities are kept constant. For a given value of foldability, the probability values for the k-th and p-th sites are determined self-consistently by constraining eqn (10) and (11) for the k-th and p-th sites and eqn (8). Thus, the variational functional may be expressed as
![]() | (15) |
This functional is maximized with respect to one and two-body probabilities of k-th and p-th sites only and the resulting equations are solved simultaneously to get the new probabilities. Successive pair mutations are introduced by either increasing or decreasing the ϕ value stepwise until a foldable sequence is generated self-consistently. This procedure is repeated for all 351 possible pairs of residues.
The probability (Pi) of a given sequence to populate any conformation, i, is dependent on the energy (Ei) of the sequence in that conformation, and can be expressed as
![]() | (16) |
![]() | (17) |
539 conformations are selected with an equal or lower number of buried sites and an equal or lower number of two-body contacts compared to the crystal structure in order to preserve the highest designability of the specified native structure.
A Metropolis MC simulation was performed to find the effect of introducing pair-mutation on the stability of designed sequences in the target conformation. A foldable sequence with a negative ϕ value was considered and any conformation was selected randomly among the ensemble of possible conformations. The energy of the sequence in the selected conformation was calculated by using eqn (4). Finally, the probability (Fi) of that sequence adopting the selected conformation was determined according to the Metropolis criterion.33 This yields a measure of the flexibility of the given sequence, as discussed in the theory section. The results of the MC simulations suggest that the site pairs have varied abilities to alter the flexibility of a given sequence through mutation. Thus, the identification of residue pairs which have the ability to alter the conformational flexibility of a sequence may help to design sequences with desired flexibility and function through pair mutations.
To study the change in the flexibility of a sequence as a function of the generations of mutations, all possible residue pairs are mutated successively by increasing and decreasing the respective foldability. After mutating each residue pair of the real protein sequence at ϕ = −7.0 by increasing the foldability, two pair-mutated sequences are randomly selected, one with a higher generation of mutations and the other with a lower generation of mutations. The selected residue pairs are 3, 7 (3rd and 7th residues) and 2, 4 with 52 and 16 generations of mutation, respectively. The MC simulation of the mutated sequence at 3, 7 selects the target (native) state in 3
053
117 MC steps out of a total of 5
000
000 steps. However, the MC simulation of the sequence mutated at 2, 4 selects the target state in 2
191
176 MC steps, while the initial unmutated sequence at ϕ = −7.0 selects the target state in 2
137
092 MC steps. The sequence mutated at residue pair 3, 7 mostly resides at the native conformation with less fluctuations since it selects the target state in a higher number of MC steps than the sequence mutated at 2, 4 and the original sequence. Hence, mutation at 3, 7 tolerates higher generations of mutation, which decreases the flexibility of the initial sequence to a larger extent compared to the mutation at 2, 4 accommodating lower generations of mutation. Similarly, after mutating the same sequence for all possible residue pairs by decreasing the foldability, the residue pairs 10, 13 and 2, 3 are randomly selected with 89 and 25 generations of mutation, respectively. The MC simulation of the sequence mutated at residue pair 10, 13 selects the target state in 1
019
744 MC steps and the sequence mutated at residue pair 2, 3 selects the target state in 1
581
059 MC steps. Both mutated sequences select the target state in a lower number of MC steps compared to the original sequence, which selects the target state in 2
137
092 MC steps. Thus, these mutated sequences frequently fluctuate to conformations other than the target, implying increased conformational flexibility compared to the initial unmutated sequence. Again, it may be noticed that the sequence mutated at 10, 13 with higher generations of mutation increases the flexibility more than the sequence mutated at residue pair 2, 3 with lower generations of mutation. Similarly, 4 residue pairs were appropriately selected for lattice proteins. MC simulations were run for the selected pair-mutated sequences. Each MC simulation of 5
000
000 steps was run for 10 trials and the MC results were averaged over these trial runs. Fig. 1 plots the number of MC steps in which the native state is selected by the sequence against the generation of mutations by increasing and decreasing the foldability, respectively. The generation of mutations by increasing the foldability is represented along the negative x-axis. The red line depicts the real protein sequences and the black line represents the lattice protein sequences. The corresponding residue pairs are indicated in parentheses. This plot clearly shows that the number of MC steps which select the native state changes with an increase in the generations of mutation, compared to that of the unmutated sequence. Thus, the residue pairs which accommodate higher generations of mutation can change the flexibility to a larger extent.
To investigate the impact of the correlation between two residues on the mutation pattern and generation of mutation, the residue correlation was calculated from the monomer and the pair probabilities of the residues for the specified site pair (i, j). If the residues at the i-th and j-th sites are assumed to be independent of each other then the pair probability of the occurrence of αi at i and αj at j may be predicted from
| Pi,j(αi, αj) = ωi(αi)ωj(αj) | (18) |
However, the calculated pair probability is given by ωi,j(αj, αj). Hence, the ratio of the calculated two-body probability to the predicted two-body probability measures the correlation between the residue pair, Ci,j(αi, αj) given by
| Ci,j(αi, αj) = ωi,j(αi, αj)/(ωi(αi)ωj(αj)) | (19) |
The higher the deviation of Ci,j(αi, αj) from unity, the higher the correlation between the residue pairs (αi, αj) at the i-th and j-th sites. In the HP model, there can be 4 Ci,j(αi, αj) values for each site pair. In Fig. 2(a), the generation of mutations is plotted as a function of the maximum correlation value, (Ci,j(αi, αj))max for each site pair of real proteins. Fig. 2(a) shows that for real proteins, a higher correlation implies a higher generation of mutations. However, a lower correlation does not necessarily mean a lower generation of mutations, as is evident from Fig. 2(a). Site pairs with lower residue correlations accommodate both lower and higher generations of mutation. Thus, highly correlated pairs of residues accommodate higher generations of mutation but correlation between given residue pairs is not a necessary condition for obtaining higher generations of mutation. For lattice proteins, a similar trend is observed, as depicted in Fig. 2(b).
![]() | ||
| Fig. 2 The generation of mutations is plotted against (Ci,j(αi, αj)max) for (a) a real protein sequence with ϕ = −10.0 and (b) a lattice protein sequence with ϕ = −10.0. | ||
In Fig. 3(a), the generation of mutation for real proteins is plotted vs. a measure of the average deviation from the random values of the residue probabilities, specifically at the selected pair of sites to be mutated. The average deviation from the random values of residue probabilities at the i-th and j-th sites may be calculated from
![]() | (20) |
![]() | ||
| Fig. 3 The generation of mutations is plotted against dr(i, j) for (a) a real protein sequence with ϕ = −10.0 and (b) a lattice protein sequence with ϕ = −10.0. | ||
The role of the spatial distance between a pair of residues as a function of the generation of mutations is depicted in Fig. 4. Although the generation of mutations decreases with increasing spatial distance between residue pairs, the correlation between them is weak. The correlation coefficient of the best fit line (red line) is just −0.62. Thus, the spatial distance between two residues is almost independent of the generation of mutations. Thus, a lower spatial distance between a residue pair does not necessarily indicate higher generations of mutation or vice versa. This is evident from Fig. 4 as the residue pair shown by the solid red arrow, separated by ∼6.9 Å, accommodates only 14 generations of mutation. However, the residue pairs shown by the dashed red arrow at a distance of ∼14.7 Å accommodate 71 generations of mutation. Even the residue pair accommodating the highest generations of mutation (shown by the blue arrow) is separated by a distance (∼9.3 Å), which is more than the cut-off distance (7.5 Å) between non-bonded interacting residue pairs.32 Thus, long distance correlations between residue pairs play an important role in evaluating the generations of mutation of a given sequence, which is an important determinant for protein stability and flexibility.
![]() | ||
| Fig. 4 The generation of mutations is plotted against the spatial distance between a pair of sites for real proteins. | ||
| This journal is © The Royal Society of Chemistry 2014 |