Designing sequences with varied flexibility and stability through pair mutations

Anupaul Baruah and Parbati Biswas*
Department of Chemistry, University of Delhi, Delhi-110007, India. E-mail: pbiswas@chemistry.du.ac.in

Received 30th October 2013 , Accepted 26th November 2013

First published on 26th November 2013


Abstract

Intrinsic conformational flexibility is a consequence of the rugged energy landscape of the protein which is encoded in its amino acid sequence. In this work, a self-consistent mean-field based theory is developed and applied on lattice and real proteins to design sequences with varying foldability and desired flexibility. Monte Carlo simulations are used to evaluate the conformational flexibility of the designed sequences in terms of the relative propensity of the sequence to choose the native state among the ensemble of near-native conformations. The mutation of specific residue pairs in the designed sequences affects their conformational flexibility. Site pairs which exhibit a greater deviation from the corresponding random values of the residue probabilities may be preferentially selected for pair mutation of residues to modulate the flexibility. The mutation of such residue pairs due to evolution may lead to significantly altered conformational flexibility and stability while retaining its sequence identity. This result is supported by experimental evidence [Lehmann et al., Protein Eng., 2000, 13, 49], which suggests that the conserved residues in a set of homologous protein sequences contribute more to the stability of the protein. Mutations of such residue pairs by decreasing the foldability lead to higher flexibility and may mimic protein evolution [Tomatis et al., Proc. Natl. Acad. Sci. U. S. A., 2008, 105, 20605]. The spatial distance between any two residues is minimally correlated to the generation of the mutations. However, long ranged correlation between residue pairs may also be important for specific residue pairs. A higher correlation between residue pairs is noted to be sufficient but not a necessary condition for accommodating higher generations of mutation.


Introduction

Under native conditions, the structure of proteins in solution may be represented as a dynamic equilibrium among a hierarchy of substates in the conformational ensemble.1 These equilibrium fluctuations among the conformational states characterizes the flexibility of proteins and may be described in terms of a Boltzmann distribution. The ensemble of the near-native conformations makes the folding energy landscape rugged, with the native state residing at the free energy minimum. The local minima in the funnel-shaped landscape act as folding traps leading to off-pathway protein aggregation and misfolding.2–4 A biologically relevant consequence of this “minimally frustrated” energy landscape is the structural robustness of the protein.5,6 Mutating a residue in a sequence changes the energy of the native state and thus alters both the local and global flexibility of proteins.7,8 Mutation induced flexibility may be frequent, large and long-ranged,9 where some mutations are responsible for altered functionality of proteins through altered flexibility, while others may change their long range dynamics without affecting the overall fold or the functionality to a significant extent. Modulating the conformational flexibility may be an important determinant for misfolding of proteins.10,11 Thus, designing sequences with desired flexibility may provide important insights to probe the relationship between protein stability and function, mediated by changes in protein flexibility.

In this work, a self-consistent mean field theory is developed to design sequences consistent with a particular foldability criterion with varying flexibility. The sequence-structure compatibility is characterized by the choice of a coarse-grained potential with one-body and two-body interactions between amino acid residues. Flexibility is measured in terms of the propensity of the protein to fold into its native conformation through repeated mutation of a specific pair of residues consistent with a similar foldability criterion until a foldable sequence can be designed. This pair mutation procedure is then repeated for all possible pairs of residues. The sequence flexibility may be controlled by varying the foldability criterion. Residue pairs that lead to higher or lower flexibility through pair mutations are identified. The calculated values of the site-specific probability and pairwise probability profile of amino acids are used to examine the mutability of particular residues. It may be observed that mutations of specific pairs of residues make the protein more flexible compared to others. Site pairs with higher deviation from the random values of the residue probabilities may be chosen for pair mutation to tune the flexibility of a sequence. If such residue pairs are mutated due to evolution, the resulting protein may have altered conformational flexibility and stability with high sequence similarity to that of the unmutated protein. Earlier experimental studies12,13 suggest that the conserved residues present in a set of homologous protein sequences have the maximum contribution to the stability of the sequence. Conserved residues indicate a higher preference for such residues at the corresponding sites. In the present article, such residue pairs may be identified by the higher deviation from the random values of the residue probabilities. Pair mutations induced by decreasing the foldability indicates that a protein compromises stability at the expense of increased conformational flexibility and therefore mimics protein evolution.14,15 Pair mutations induced by increasing the foldability may represent less flexible ancestor-type sequences. This theory is applied to cubic lattice proteins and verified on real protein conformations. The theory may be used to evaluate design strategies for a given target structure of a protein and provides ample scope to assay the designability of protein structures as a function of their conformational flexibility.

Theory

The 27-mer protein is modeled as a self-avoiding walk (SAW) on a three-dimensional cubic lattice where each lattice point represents a residue/structural unit. For a 27-mer protein with binary patterning of the amino acids as hydrophobic and polar, the total number of possible sequences is 227 = 134[thin space (1/6-em)]217[thin space (1/6-em)]728. Exact enumeration by the first depth algorithm16 yields 103[thin space (1/6-em)]346 unique, compact conformations on a 3 × 3 × 3 lattice, which are not related by rotational, reflectional or translational symmetry.16 The native (target) conformation is identified as the most designable non-degenerate conformation since it16–18 represents the lowest energy conformation for a maximum number of sequences. The binary hydrophobic-polar patterning is based on the assumption that the hydrophobic interaction is the dominant force behind protein folding.19,20 The reduction of a 20 letter amino acid alphabet to a 2 letter HP model makes it exactly solvable and computationally tractable. Such HP models are extensively used by us4,10,21–24 and others25,26 to successfully model protein folding and design. In this work, both one-body and two-body potentials are used simultaneously to construct a new method which incorporates pair mutations through self-consistent mean field theory to design sequences with controlled conformational flexibility.

The energy of a sequence in a given conformation may be expressed as the sum of the one-body and two-body terms25

 
image file: c3ra46247a-t1.tif(1)
where N is the total number of amino acid residues in the protein. The energy contribution due to the presence of the residue type αi at the i-th site is denoted by γi(αi), which quantifies the propensity of the monomer type αi to reside in a particular structural context.27,28 The two-body term γi,j(αi, αj) represents the pair interaction of the i-th and the j-th sites with the monomer types αi and αj respectively.25,28,29 Both the one-body and two-body interaction terms may be expressed as follows
 
image file: c3ra46247a-t2.tif(2)
where k is the number of structural contexts and σ(1)ik is the one-body structural parameter which indicates whether the i-th site is in the k-th structural context. Such structural contexts may denote whether site i is buried in the interior of the protein, accessible to solvent or the type of secondary structure associated with it.21 The two-body structural parameter σ(2)ij accounts for the interaction between the monomers at sites i and j. Non-bonded nearest neighbors are considered as being in contact.4

In our earlier works, we have either used a pure one-body potential or a pure two-body potential. In this work, we have considered a potential including both one-body and two-body hydrophobic interactions to study the effects of pair mutations on both one-body and two-body interactions of proteins and their respective impact on the conformational flexibility. The two-body interaction parameter makes it possible to study the effect of long distance hydrophobic–hydrophobic interactions as non-bonded site pairs separated by a cut-off distance, irrespective of their separation along the sequence, are considered as interacting neighbors. The one-body energy parameter γ(1)k(αi) denotes the energy contribution of the amino acid type αi in the k-th structural context. Solvent accessibility is chosen as an appropriate structural context here. Solvent accessibility is measured in terms of the coordination number. The higher the coordination number, the lower the solvent accessibility. Sites with coordination numbers of 3 and 4 are considered to be in structural context, k = 2, while coordination numbers of 1 and 2 are considered to be in structural context, k = 1. The two-body term γ(2)(αi, αj) quantifies the inter-residue contact propensities,25 including excluded volume interactions of the monomer types αi and αj at the i-th and j-th sites, respectively.

 
γ(1)1(H) = 0, γ(1)1(P) = 0, γ(1)2(H) = −1ε, γ(1)2(P) = 0, γ(2)(H, H) = −3ε′, γ(2)(H, P) = −1ε′, γ(2)(P, H) = −1ε′, γ(2)(P, P) = 0 (3)
ε and ε′ are the energy contributions for one-body and two-body interactions, respectively. The choice of this coarse-grained energy function favors hydrophobic residues at sites of high coordination which are buried in the interior of the protein and are less solvent accessible, thus implicitly incorporating the hydrophobic solvent interactions.

By assuming small fluctuations in the native state energy due to sequence variation, the sequence averaged energy of the native conformation may be written as

 
image file: c3ra46247a-t3.tif(4)

Similarly, the average energy of the unfolded ensemble (ensemble average) may be expressed as

 
image file: c3ra46247a-t4.tif(5)
where ωi(αi) is the probability of finding an αi type of amino acid at the i-th site, while ωi,j(αi, αj) is the joint probability of finding αi and αj type amino acids at the i-th and j-th sites, respectively.

A suitable choice of the energy function may define a foldability criteria, which measures the sequence-structure compatibility.30 For an NVT ensemble, the change in the Helmholtz free energy of folding (ΔA) is related to the equilibrium constant (Keq) as

 
Keq = exp(−βΔA) = Zf/Zu (6)
where, Zf and Zu are the respective partition functions for the folded and unfolded conformational states of the protein. Using the cumulant expansion of Zu and retaining terms up to second order, eqn (6) may be simplified as (for a somewhat detailed derivation see ref. 21)
 
image file: c3ra46247a-t5.tif(7)

The stability gap, Δ, represents the difference between the energy of the native (target) state and the average energy of the ensemble of unfolded states, while Γunf2 is the variance in the energy of the unfolded ensemble. Ωu is the number of possible unfolded states. Since, ln[thin space (1/6-em)]Ωu is constant for all sequences, a foldability criteria may be defined as

 
image file: c3ra46247a-t6.tif(8)
where ϕ is a dimensionless quantity appropriately scaled with respect to β. From eqn (7) it is clear that negative values of ϕ correspond to negative free energy changes of folding, making the folding transition spontaneous. Thus, negative ϕ values ensure that the target/native state is at the free energy minimum and the respective designed sequences are compatible with the target/native state.

The one-body and two-body residue interactions are treated independently. However, the one-body and two-body probabilities are not independent, they are coupled to each other and among themselves through the constraints which specify the local/global features of the structure and sequence. Within this approximation the sequence entropy may be expressed as the sum of the contributions from the one-body and two-body residue probabilities

 
image file: c3ra46247a-t7.tif(9)

The most probable set of one-body probabilities (ωi(αi)) and two-body probabilities (ωi,j(αi, αj)) may be determined by maximizing the entropy subject to the relevant constraints on the foldability criteria and normalization of the site-identity and pairwise monomer probabilities. These constraints are:

(i) the normalization of the one-body probabilities of each site,

 
image file: c3ra46247a-t8.tif(10)

(ii) the normalization of the two-body probabilities of each pair of sites,

 
image file: c3ra46247a-t9.tif(11)

(iii) the foldability criteria, ϕ, as expressed in eqn (8).

The variational functional, V, of the set of site identity and pairwise monomer probabilities may be expressed as

 
image file: c3ra46247a-t10.tif(12)
(βnorm1)i, (βnorm2)i,j and βϕ are the Lagrange multipliers for the constraint eqns (10), (11) and (8) respectively. This functional is maximized with respect to one-body and two-body probabilities to obtain a set of non-linear equations. These simultaneous equations may be solved to yield the following set of coupled transcendental equations.
 
image file: c3ra46247a-t11.tif(13)
where,
 
image file: c3ra46247a-t12.tif(14)

This set of equations is solved numerically to determine the most probable sequence consistent with a particular value of the foldability criterion, ϕ.

In this work we propose a new method to self-consistently incorporate pair mutations in the designed sequences. Pair mutations are incorporated in the generated sequences by varying the foldability, the site-specific monomer and the pairwise monomer probability values of the selected pairs of sites, while all other monomer and two-body probabilities are kept constant. For a given value of foldability, the probability values for the k-th and p-th sites are determined self-consistently by constraining eqn (10) and (11) for the k-th and p-th sites and eqn (8). Thus, the variational functional may be expressed as

 
image file: c3ra46247a-t13.tif(15)

This functional is maximized with respect to one and two-body probabilities of k-th and p-th sites only and the resulting equations are solved simultaneously to get the new probabilities. Successive pair mutations are introduced by either increasing or decreasing the ϕ value stepwise until a foldable sequence is generated self-consistently. This procedure is repeated for all 351 possible pairs of residues.

Conformational flexibility

Sequences with different degrees of foldability are obtained by following the above procedure. The generated sequences have different stabilities in the native (target) conformation. The conformational flexibility of a sequence, however, depends on its relative stability in the native state. This may be measured by calculating the frequency of the occurrence of a conformation as compared to the native conformation.

The probability (Pi) of a given sequence to populate any conformation, i, is dependent on the energy (Ei) of the sequence in that conformation, and can be expressed as

 
image file: c3ra46247a-t14.tif(16)
where, Z is the partition function, kB is the Boltzmann constant and T is the temperature. Here, kBT is assumed to be 1. The frequency of the occurrence (Fi) of the sequence in the i-th conformation relative to the native conformation may be expressed as a ratio of the probability of occurrences in the respective states
 
image file: c3ra46247a-t15.tif(17)
where, Ei and Enat are the energy of the given sequence in the i-th and the native conformation, respectively. Eqn (17) clearly shows that the relative population of any conformation compared to the native conformation is dependent on the energy difference (EiEnat). When Ei is comparable to Enat i.e. (EiEnat) is small, then the given sequence may fold to either the i-th or the native conformation. With an increase in the ruggedness of the energy landscape, the number of near-native conformations increases, and consequently, the sequence may preferentially choose any such conformation relative to the native one. Hence, Fi may be used as a measure of the conformational flexibility for a given sequence.

Real proteins

The mean-field theory is applied to the conformations of a 21-mer real protein with pdb id 1EDN (X-ray resolution = 2.18 Å, R-value = 0.19) (http://www.rcsb.org/pdb/).31 The crystal structure is chosen as the initial template and the coarse-grained C-α chain backbone is subjected to a Monte Carlo (MC) simulation by constraining the pseudo C-α–C-α bond length of 3.8 ± 0.15 Å to explore the conformational space.10 Non-bonded site pairs separated by a distance up to 7.5 Å are considered as interacting neighbors irrespective of their separation along the sequence.32 Finally, a set of 35[thin space (1/6-em)]539 conformations are selected with an equal or lower number of buried sites and an equal or lower number of two-body contacts compared to the crystal structure in order to preserve the highest designability of the specified native structure.

Results and discussion

Sequences are designed with different foldabilities, varying from ϕ = −34.0 to +12.0 for real proteins and ϕ = −23.0 to +50.0 for lattice proteins, compatible with the target (native) conformation. Sequences with a negative foldability represent foldable sequences while positive foldability corresponds to less foldable sequences. Selected sequences with higher foldability are then subjected to pair mutations. The foldability of the chosen designed sequences are varied continuously (increased/decreased) to generate new sequences self consistently by changing the monomer and two-body probabilities of a specific pair of sites only. All other monomer and two-body probabilities are kept constant. Hence, the resulting sequence is just a pair mutated sequence but with a lower/higher foldability than the original one. Such mutations are repeatedly executed for a specific pair of residues, until a foldable sequence can be designed i.e. the set of equations given by eqn (13) converges to a solution. The number of successive stepwise mutations accommodated by a pair of residues until a foldable sequence can be designed, yields the number of generations of mutations for that pair. This procedure of mutation is followed for all possible pairs of residues. Residue pairs have varied abilities to accommodate generations of mutation ranging from a minimum of 11 to a maximum of 90 generations of mutation in the real protein sequence with ϕ = −7.0.

A Metropolis MC simulation was performed to find the effect of introducing pair-mutation on the stability of designed sequences in the target conformation. A foldable sequence with a negative ϕ value was considered and any conformation was selected randomly among the ensemble of possible conformations. The energy of the sequence in the selected conformation was calculated by using eqn (4). Finally, the probability (Fi) of that sequence adopting the selected conformation was determined according to the Metropolis criterion.33 This yields a measure of the flexibility of the given sequence, as discussed in the theory section. The results of the MC simulations suggest that the site pairs have varied abilities to alter the flexibility of a given sequence through mutation. Thus, the identification of residue pairs which have the ability to alter the conformational flexibility of a sequence may help to design sequences with desired flexibility and function through pair mutations.

To study the change in the flexibility of a sequence as a function of the generations of mutations, all possible residue pairs are mutated successively by increasing and decreasing the respective foldability. After mutating each residue pair of the real protein sequence at ϕ = −7.0 by increasing the foldability, two pair-mutated sequences are randomly selected, one with a higher generation of mutations and the other with a lower generation of mutations. The selected residue pairs are 3, 7 (3rd and 7th residues) and 2, 4 with 52 and 16 generations of mutation, respectively. The MC simulation of the mutated sequence at 3, 7 selects the target (native) state in 3[thin space (1/6-em)]053[thin space (1/6-em)]117 MC steps out of a total of 5[thin space (1/6-em)]000[thin space (1/6-em)]000 steps. However, the MC simulation of the sequence mutated at 2, 4 selects the target state in 2[thin space (1/6-em)]191[thin space (1/6-em)]176 MC steps, while the initial unmutated sequence at ϕ = −7.0 selects the target state in 2[thin space (1/6-em)]137[thin space (1/6-em)]092 MC steps. The sequence mutated at residue pair 3, 7 mostly resides at the native conformation with less fluctuations since it selects the target state in a higher number of MC steps than the sequence mutated at 2, 4 and the original sequence. Hence, mutation at 3, 7 tolerates higher generations of mutation, which decreases the flexibility of the initial sequence to a larger extent compared to the mutation at 2, 4 accommodating lower generations of mutation. Similarly, after mutating the same sequence for all possible residue pairs by decreasing the foldability, the residue pairs 10, 13 and 2, 3 are randomly selected with 89 and 25 generations of mutation, respectively. The MC simulation of the sequence mutated at residue pair 10, 13 selects the target state in 1[thin space (1/6-em)]019[thin space (1/6-em)]744 MC steps and the sequence mutated at residue pair 2, 3 selects the target state in 1[thin space (1/6-em)]581[thin space (1/6-em)]059 MC steps. Both mutated sequences select the target state in a lower number of MC steps compared to the original sequence, which selects the target state in 2[thin space (1/6-em)]137[thin space (1/6-em)]092 MC steps. Thus, these mutated sequences frequently fluctuate to conformations other than the target, implying increased conformational flexibility compared to the initial unmutated sequence. Again, it may be noticed that the sequence mutated at 10, 13 with higher generations of mutation increases the flexibility more than the sequence mutated at residue pair 2, 3 with lower generations of mutation. Similarly, 4 residue pairs were appropriately selected for lattice proteins. MC simulations were run for the selected pair-mutated sequences. Each MC simulation of 5[thin space (1/6-em)]000[thin space (1/6-em)]000 steps was run for 10 trials and the MC results were averaged over these trial runs. Fig. 1 plots the number of MC steps in which the native state is selected by the sequence against the generation of mutations by increasing and decreasing the foldability, respectively. The generation of mutations by increasing the foldability is represented along the negative x-axis. The red line depicts the real protein sequences and the black line represents the lattice protein sequences. The corresponding residue pairs are indicated in parentheses. This plot clearly shows that the number of MC steps which select the native state changes with an increase in the generations of mutation, compared to that of the unmutated sequence. Thus, the residue pairs which accommodate higher generations of mutation can change the flexibility to a larger extent.


image file: c3ra46247a-f1.tif
Fig. 1 Number of MC steps that selects the native state is plotted against the generation of mutation for real (red) and lattice (black) sequences. The generation of mutations by increasing the foldability is represented along the negative x-axis. In the parentheses corresponding mutated residue pairs are mentioned.

To investigate the impact of the correlation between two residues on the mutation pattern and generation of mutation, the residue correlation was calculated from the monomer and the pair probabilities of the residues for the specified site pair (i, j). If the residues at the i-th and j-th sites are assumed to be independent of each other then the pair probability of the occurrence of αi at i and αj at j may be predicted from

 
Pi,j(αi, αj) = ωi(αi)ωj(αj) (18)

However, the calculated pair probability is given by ωi,j(αj, αj). Hence, the ratio of the calculated two-body probability to the predicted two-body probability measures the correlation between the residue pair, Ci,j(αi, αj) given by

 
Ci,j(αi, αj) = ωi,j(αi, αj)/(ωi(αi)ωj(αj)) (19)

The higher the deviation of Ci,j(αi, αj) from unity, the higher the correlation between the residue pairs (αi, αj) at the i-th and j-th sites. In the HP model, there can be 4 Ci,j(αi, αj) values for each site pair. In Fig. 2(a), the generation of mutations is plotted as a function of the maximum correlation value, (Ci,j(αi, αj))max for each site pair of real proteins. Fig. 2(a) shows that for real proteins, a higher correlation implies a higher generation of mutations. However, a lower correlation does not necessarily mean a lower generation of mutations, as is evident from Fig. 2(a). Site pairs with lower residue correlations accommodate both lower and higher generations of mutation. Thus, highly correlated pairs of residues accommodate higher generations of mutation but correlation between given residue pairs is not a necessary condition for obtaining higher generations of mutation. For lattice proteins, a similar trend is observed, as depicted in Fig. 2(b).


image file: c3ra46247a-f2.tif
Fig. 2 The generation of mutations is plotted against (Ci,j(αi, αj)max) for (a) a real protein sequence with ϕ = −10.0 and (b) a lattice protein sequence with ϕ = −10.0.

In Fig. 3(a), the generation of mutation for real proteins is plotted vs. a measure of the average deviation from the random values of the residue probabilities, specifically at the selected pair of sites to be mutated. The average deviation from the random values of residue probabilities at the i-th and j-th sites may be calculated from

 
image file: c3ra46247a-t16.tif(20)
where ωi,j(αi, αj) = 0.25 represents a completely random two-body probability for the αi and αj type of residues at the i-th and j-th sites, respectively. Hence, the first term in eqn (20) measures the average deviation from the random value of the two-body probabilities at the i-th and j-th sites. Similarly, the second term in eqn (20) measures the average deviation from the random value (0.5) of site-identity monomer probabilities at the i-th and j-th sites. Thus, dr(i, j) is a simple way to quantify the average deviation from random values of monomer and pair residue probabilities at the selected site pair. For real proteins, Fig. 3(a) indicates a direct correlation between the generation of mutations and the deviation from the random value for both two-body and site-identity monomer probabilities at the site pair. The correlation coefficient is found to be 0.91 for the best fit line (red line). Fig. 3(b) also shows a similar correlation in the lattice proteins with a correlation coefficient of 0.95 for the best fit line (red line). The higher the deviation from the random values of one-body and two-body probabilities, the higher the generation of mutation accommodated by the specific site pair. Thus, such residue pairs may effectively alter the stability and flexibility of a sequence in the native state. A higher deviation from the random values of residue probabilities at the selected site pairs implies a higher/lower preference for some specific residues at that site pair. Thus, a slight change in the monomer and residue pair probabilities for that site pair through mutation may strongly affect the stability and flexibility of a sequence in the target conformation. This result is also supported by the consensus approach,34 which states that conserved residues of homologous amino acid sequences contribute more to the stability of a protein. Highly stable sequences are experimentally designed by incorporating all consensus residues in one multiple mutant sequence.12,13 Conserved residues are highly preferred in the corresponding sites, and hence are represented by residues with a high deviation from the random probability values in the present study. Thus, our work supports the consensus approach, i.e. conserved residue pairs can affect the stability and flexibility of a protein sequence in the native conformation. The selection of the residue pairs with a higher deviation from the random values of residue probabilities for pair mutation will alter the flexibility without changing the overall sequence, while residue pairs with lower deviation from the random values of residue probabilities will permit an overall change in the sequence without an appreciable change in the conformational flexibility. Thus, site pairs with higher deviation from the random values of the residue probabilities may have a significant role in dictating the evolution of proteins. Such pair mutations of amino acid residues due to evolution may lead to new proteins with altered flexibility and stability. An earlier experimental study14,15 suggests that during evolution, the structural flexibility of proteins gets augmented. Thus, mutation of residue pairs with higher deviation from the random values of residue probabilities by decreasing foldability which leads to increased flexibility may mimic protein evolution. This implies that during evolution, proteins may assume higher structural flexibility at the expense of stability. Similarly, the mutated sequences with lower flexibility generated by increasing the foldability may represent corresponding ancestor sequences. As depicted in Fig. 1, higher conformational flexibility through evolution may be achieved at the cost of decreased stability/foldability, while higher rigidity may impart higher stability in the ancestor sequences.


image file: c3ra46247a-f3.tif
Fig. 3 The generation of mutations is plotted against dr(i, j) for (a) a real protein sequence with ϕ = −10.0 and (b) a lattice protein sequence with ϕ = −10.0.

The role of the spatial distance between a pair of residues as a function of the generation of mutations is depicted in Fig. 4. Although the generation of mutations decreases with increasing spatial distance between residue pairs, the correlation between them is weak. The correlation coefficient of the best fit line (red line) is just −0.62. Thus, the spatial distance between two residues is almost independent of the generation of mutations. Thus, a lower spatial distance between a residue pair does not necessarily indicate higher generations of mutation or vice versa. This is evident from Fig. 4 as the residue pair shown by the solid red arrow, separated by ∼6.9 Å, accommodates only 14 generations of mutation. However, the residue pairs shown by the dashed red arrow at a distance of ∼14.7 Å accommodate 71 generations of mutation. Even the residue pair accommodating the highest generations of mutation (shown by the blue arrow) is separated by a distance (∼9.3 Å), which is more than the cut-off distance (7.5 Å) between non-bonded interacting residue pairs.32 Thus, long distance correlations between residue pairs play an important role in evaluating the generations of mutation of a given sequence, which is an important determinant for protein stability and flexibility.


image file: c3ra46247a-f4.tif
Fig. 4 The generation of mutations is plotted against the spatial distance between a pair of sites for real proteins.

Conclusions

The flexibility of a sequence plays a crucial role in dictating the biological function of a protein. Designing optimal flexible sequences for a pre-determined target structure, while ensuring that the target structure is at the global energy minimum, is a challenging task. A self-consistent mean field based theory is developed and applied to design sequences for lattice and real proteins with specified values of the foldability. The flexibility of the designed sequences is measured in terms of the propensity of the sequence to fold into the target (native) state. The designed sequences are mutated pairwise by the stepwise decrease/increase of the foldability to observe the effect on the conformational flexibility of the sequence. Our study shows that all residue pairs cannot accommodate equal generations of mutation. Some pairs are highly sensitive to mutation while some can accommodate higher generations of mutation. Those pairs of residues accommodating higher generations of mutation may alter the flexibility of a sequence. So, it is important to identify those pairs of residues which are able to tolerate higher generations of mutations. A higher correlation between a residue pair is a sufficient but not necessary condition for accommodating higher generations of mutation. The spatial distance between any two residues is almost independent of the generations of mutation. However, large distance correlations between residue pairs are also important. The deviation of the one-body and two-body residue probabilities from their corresponding random values is directly correlated to the generations of mutation. The site pairs with higher deviation from the random values of the residue probabilities may be chosen for residue pair mutation to tune the flexibility of a sequence. This conclusion is supported by experimental studies12,13 which show that multiple mutated sequences which retain all conserved residues of homologous sequences lead to higher stability. This implies that mutation of highly conserved residues may affect the stability and flexibility of a sequence in the target conformation. Thus, pair mutations of such residues due to evolution may result in new proteins with high sequence identity but significantly altered conformational flexibility compared to the unmutated protein. Pair mutations due to evolution impart higher conformational flexibility at the expense of stability.14,15 Pair mutations achieved through an increase in the foldability may represent less flexible ancestor sequences. Similar approaches with more detailed potentials incorporating electrostatic interactions, hydrogen bond formation etc. may be important for designing de novo folds with controlled flexibility.

Acknowledgements

A. Baruah acknowledges the University of Delhi for providing a University Teaching Assistantship. The authors acknowledge the University of Delhi for financial assistance through a University Research Grant.

References

  1. H. Frauenfelder, F. Parak and R. D. Young, Annu. Rev. Biophys. Biophys. Chem., 1988, 17, 451–479 CrossRef CAS PubMed.
  2. R. Wetzel, Trends Biotechnol., 1994, 12, 193–198 CrossRef CAS.
  3. E. T. Powers and D. L. Powers, Biophys. J., 2008, 94, 379–391 CrossRef CAS PubMed.
  4. A. Bhattacherjee and P. Biswas, J. Phys. Chem. B, 2010, 115, 113–119 CrossRef PubMed.
  5. P. G. Wolynes, Philos. Trans. R. Soc., A, 2005, 363, 453–467 CrossRef CAS PubMed.
  6. E. D. Nelson and J. N. Onuchic, Proc. Natl. Acad. Sci. U. S. A., 1998, 95, 10682–10686 CrossRef CAS.
  7. I. N. Smirnova and H. R. Kaback, Biochemistry, 2003, 42, 3025–3031 CrossRef CAS PubMed.
  8. J. L. Battiste, R. Li and C. Woodward, Biochemistry, 2002, 41, 2237–2245 CrossRef CAS PubMed.
  9. D. Verma, D. J. Jacobs and D. R. Livesay, PLoS Comput. Biol., 2012, 8, e1002409 CAS.
  10. A. Baruah, A. Bhattacherjee and P. Biswas, Soft Matter, 2012, 8, 4432–4440 RSC.
  11. K. Teilum, J. G. Olsen and B. B. Kragelund, Cell. Mol. Life Sci., 2009, 66, 2231–2247 CrossRef CAS PubMed.
  12. M. Lehmann, D. Kostrewa, M. Wyss, R. Brugger, A. D'Arcy, L. Pasamontes and A. P. van Loon, Protein Eng., 2000, 13, 49–57 CrossRef CAS PubMed.
  13. J. Miyazaki, S. Nakaya, T. Suzuki, M. Tamakoshi, T. Oshima and A. Yamagishi, J. Biochem., 2001, 129, 777–782 CrossRef CAS.
  14. P. E. Tomatis, S. M. Fabiane, F. Simona, P. Carloni, B. J. Sutton and A. J. Vila, Proc. Natl. Acad. Sci. U. S. A., 2008, 105, 20605–20610 CrossRef CAS PubMed.
  15. J. D. Bloom, S. T. Labthavikul, C. R. Otey and F. H. Arnold, Proc. Natl. Acad. Sci. U. S. A., 2006, 103, 5869–5874 CrossRef CAS PubMed.
  16. E. Shakhnovich and A. Gutin, J. Chem. Phys., 1990, 93, 5967–5971 CrossRef CAS PubMed.
  17. H. Li, R. Helling, C. Tang and N. Wingreen, Science, 1996, 273, 666–669 CrossRef CAS.
  18. H. S. Chan and K. A. Dill, J. Chem. Phys., 1991, 95, 3775–3787 CrossRef CAS PubMed.
  19. K. A. Dill, Biochemistry, 1990, 29, 7133–7155 CrossRef CAS.
  20. W. Kauzmann, Adv. Protein Chem., 1959, 14, 1–63 CrossRef CAS.
  21. P. Biswas, J. Zou and J. G. Saven, J. Chem. Phys., 2005, 123, 154908 CrossRef PubMed.
  22. A. Bhattacherjee and P. Biswas, J. Chem. Phys., 2009, 131, 125101 CrossRef PubMed.
  23. A. Bhattacherjee and P. Biswas, J. Phys. Chem. B, 2009, 113, 5520–5527 CrossRef CAS PubMed.
  24. A. Bhattacherjee and P. Biswas, Phys. Rev. E: Stat., Nonlinear, Soft Matter Phys., 2010, 82, 011906 CrossRef.
  25. J. G. Saven and P. G. Wolynes, J. Phys. Chem. B, 1997, 101, 8375–8389 CrossRef CAS.
  26. J. Zou and J. G. Saven, J. Mol. Biol., 2000, 296, 281–294 CrossRef CAS PubMed.
  27. J. U. Bowie, R. Luthy and D. Eisenberg, Science, 1991, 253, 164–170 CAS.
  28. R. A. Goldstein, Z. A. Luthey-Schulten and P. G. Wolynes, Proc. Natl. Acad. Sci. U. S. A., 1992, 89, 9029–9033 CrossRef CAS.
  29. S. Miyazawa and R. L. Jernigan, Macromolecules, 1985, 18, 534–552 CrossRef CAS.
  30. J. G. Saven, Chem. Rev., 2001, 101, 3113–3130 CrossRef CAS PubMed.
  31. H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. Bhat, H. Weissig, I. N. Shindyalov and P. E. Bourne, Nucleic Acids Res., 2000, 28, 235–242 CrossRef CAS PubMed.
  32. L. A. Mirny, A. V. Finkelstein and E. I. Shakhnovich, Proc. Natl. Acad. Sci. U. S. A., 2000, 97, 9978–9983 CrossRef CAS PubMed.
  33. N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller and E. Teller, J. Chem. Phys., 1953, 21, 1087–1092 CrossRef CAS PubMed.
  34. B. van den Burg and V. G. Eijsink, Curr. Opin. Biotechnol., 2002, 13, 333–337 CrossRef CAS.

This journal is © The Royal Society of Chemistry 2014
Click here to see how this site uses Cookies. View our privacy policy here.