The histone H 3 N-terminal tail : a computational analysis of the free energy landscape and kinetics †

Histone tails are the short peptide protrusions outside of the nucleosome core particle and they play a critical role in regulating chromatin dynamics and gene activity. A histone H3 N-terminal tail, like other histone tails, can be covalently modified on different residues to activate or repress gene expression. Previous studies have indicated that, despite its intrinsically disordered nature, the histone H3 N-terminal tail has regions of notable secondary structural propensities. To further understand the structure– dynamics–function relationship in this system, we have carried out 75.6 ms long implicit solvent simulations and 29.3 ms long explicit solvent simulations. The extensive samplings allow us to better characterize not only the underlying free energy landscape but also kinetic properties through Markov state models (MSM). Dihedral principal component analysis (dPCA) and locally scaled diffusion map (LSDMap) analysis yield consistent results that indicate an overall flat free energy surface with several shallow basins that correspond to conformations with a high a-helical propensity in two regions of the peptide. Kinetic information extracted from Markov state models reveals rapid transitions between different metastable states with mean first passage times spanning from several hundreds of nanoseconds to hundreds of microseconds. These findings shed light on how the dynamical nature of the histone H3 N-terminal tail is related to its function. The complementary nature of dPCA, LSDMap and MSM for the analysis of biomolecules is also discussed.


Introduction
In eukaryotic cells, highly conserved histone proteins provide the basic scaffold to package the genome inside the nucleus, while playing a critical role in regulating gene activity at the same time.A DNA segment of approximately 146 base-pairs wraps tightly around a histone hetero-octomer consisting of two subunits of each type of histone protein (H2A, H2B, H3 and H4) to form a nucleosome, which can further fold into higher order chromatin fibers. 1,24][5][6] The histone terminal tails extending out of the NCP and adjacent regions are critical for chromatin remodeling and thus regulation of gene transcription. 1 The histone tails can be covalently modified upon receiving upstream signals.Post-translational modifications, such as methylation, acetylation and phosphorylation, can alter the charges on specific residues, leading to changes in the binding affinity with other functional consequences. 7,80][11][12][13] However, the dynamic properties of histone tails, such as the kinetics of transitions among different metastable conformations, remain to be elucidated.
A histone H3 N-terminal tail, in particular, can be covalently modified with various patterns and is of great interest to both experimental and computational studies.][16] Methylation of K9 is a specific target for heterochromatin protein 1 (HP1). 15,17,18In the NMR structure of its complex with HP1, the region from K4 to S10 showed an extended structure with K4 and K9 both in the dimethylated form. 19In another study, an 8-residue variant of the peptide with trimethylated K4 showed an extended b-strand conformation from A1 to T6 in complex with an ING2 (inhibitor of growth family, member 2) plant homeodomain (PHD) finger. 20An experimental study using circular dichroism (CD), however, showed that some a-helix content is present in histone H3/H4 tails, even though it was difficult to selectively isolate H3 or H4 tails. 9Previous computational studies 11,13 showed that the histone H3 tail populates a-helical conformations, and that post-translational modifications only have a marginal influence on the most populated conformations. 11The degree of sampling in these pioneering computational studies was somewhat limited, despite the use of replica exchange techniques, 13 leading to some regions of the free energy surface only sparsely sampled.Moreover, the kinetics of transitions among the various conformations were not explored.In this study, we use the Markov state model (MSM), dihedral angle principal component analysis (dPCA) and locally scaled diffusion map (LSDMap) to gain a further understanding of the free energy landscape and dynamics of this peptide.
][23][24][25][26][27] Here we run extensive explicit solvent simulations on the H3 N-terminal tail using graphic processing units, with seeds from intensively sampled implicit solvent conformational ensembles.We extract kinetic information from the MSM and characterize the kinetically metastable states.No metastable state has a dominant population and the peptide makes transitions between different states at the microsecond time scale.dPCA and LSDMap also reveal a generally flat free energy surface with low barriers between the basins.The low free energy barriers and rapid transitions between different metastable conformations are consistent with the functional requirement of the H3 tail to respond to different covalent modifications and binding partners.

Simulation protocol
Due to the lack of a defined experimental structure, we build a stretched initial structure of the histone H3 N-terminal tail according to its amino acid sequence, 28 with the C-terminal capped with an N-methyl group.A 38 residue construct is used (with the sequence: ARTKQ TARKS TGGKA PRKQL ATKAA RKSAP ATGGV KKP), which is slightly longer than the biochemically isolated one by trypsination.This sequence preserves the structural geometry of the tail and is of the same length as the construct used in another computational study, 13 making it straightforward to compare the current and previous results.All simulations are carried out using the Amber14 MD package on graphic processing units. 29,30The ff99SBnmr1 force field 31 is used because recent benchmark calculations suggested that this force field provides reliable structural propensities with both explicit and implicit solvent simulations. 32We carry out both implicit and explicit solvent simulations on the system in this work.
The initial structure is minimized and equilibrated using the GB7 generalized Born implicit solvent model. 33An initial 2 ms long simulation is then carried out.The conformations in the trajectory are clustered using the K-means clustering algorithm implemented in the MMTSB Tool Set, 34 and seed conformations are randomly chosen from different clusters for parallel production runs.In total, 75.6 ms long simulations are collected.In these simulations, no cutoff is used for the non-bonded interactions.0.15 M of NaCl is applied using a modified generalized Born model based on the Debye-Hu ¨ckel limiting law for ionic screening of electrostatic interactions. 35he SHAKE algorithm 36 is used to constrain all bonds involving hydrogen atoms.An integration time step of 2 fs is used.Langevin dynamics with a collision frequency of 20 ps À1 is used to keep the temperature at 300 K.
For explicit solvent simulations, the seed conformations are taken from the entire implicit solvent simulation ensemble following the same clustering and selection procedure.This seeding procedure enhances sampling by starting simulations from as diverse a set of conformations as possible; as shown in the ESI, † the propensities of secondary structures from the implicit and explicit solvent simulations are generally similar after extensive sampling, supporting the use of implicit solvent models to seed explicit solvent simulations.The seed conformations are solvated with TIP3P water, 37 with a box size of around 70 Â 70 Â 70 Å 3 using periodic boundary conditions.The system is neutralized with counterions and an additional 0.15 M NaCl is added to mimic the physiological conditions.The electrostatic interactions are calculated using the particle mesh Ewald method with a grid spacing of about 1.0 Å.The cutoff for van der Waals interactions is set at 10 Å.The system is first minimized with the fixed protein and then for the entire system. 1 ns NPT simulation is carried out to equilibrate the volume of the system.Production simulations are run in the NVT ensemble at 300 K using the weakcoupling algorithm 38 with a time constant of 5.0 ps.A total of 29.3 ms of trajectories are collected for analysis.

Dihedral principal component analysis (dPCA)
The explicit solvent simulations are analyzed using principal component analysis based on backbone dihedral angles. 39,40ince the histone H3 tail frequently undergoes folding/unfolding in the simulations, the backbone dihedral angles are better coordinates for the analysis compared to Cartesian coordinates.If we denote each backbone dihedral angle as j i , we have 75 [2 Â number of residues (38) À 1] backbone dihedral angles in total, lacking the f angle for the first residue.Due to the periodicity of dihedral angles, we transform each angle into its cosine and sine values, where n = 1,. .., 75.The correlated internal motions can then be represented by the covariance matrix where q 1 ,. .., q 150 are the transformed cosine and sine values for the dihedral angles.By diagonalizing the covariance matrix, we obtain 150 pairs of eigenvalues and eigenvectors.The conformations in the ensemble are then projected onto the two eigenvectors with the largest eigenvalues.The free energy surface of the system is then calculated as where r is an estimate of the probability density function obtained after binning the data.r max is the maximum density, which has the free energy minimum with DG set to 0.

Locally scaled diffusion map
The conformation ensemble of the histone H3 N-terminal tail is further analyzed using LSDMap.LSDMap is a multiscale method to determine the collective reaction coordinates of macromolecular dynamics. 41For computational efficiency, the ensemble is subsampled and a total of 11 731 conformations are used for the calculation.The transition probability between two conformations is described by the kernel K, where Jx i À x j J is the root mean square deviation (RMSD) between the two conformations x i and x j based on backbone atoms (CA, C, N, and O) and CB. e i and e j are the respective local scales.The local scales for every point in the data set are estimated according to the previously proposed procedure. 41The elements in the kernel matrix characterize the ease of a conformation to diffuse into another.The kernel can be normalized and converted to a Markov matrix, from which we compute the first few largest eigenvalues and the corresponding eigenvectors.The eigenvectors represent the diffusion coordinates.For technical details, we refer the interested reader to ref. 41.

MSM construction and validation
For the construction of MSM, conformations in the 29.3 ms long explicit solvent simulation trajectory are saved every 50 ps, resulting in B586 thousand conformations.These conformations are clustered into 8000 microstates using the hybrid k-centers/k-medoids clustering method 24 based on the RMSD of backbone atoms (CA, C, N, and O) and CB.The number of transitions between the microstates at an interval of a certain lag time is counted.The count matrix is then symmetrized and normalized to obtain the transition probability matrix (T).The Markov time, the lag time at which the model is Markovian, is determined by examining the implied time scales at different lag times.At a specific lag time, the implied time scales can be calculated as: where k is a relaxation time scale, t is the lag time, and m(t) is an eigenvalue for the transition matrix T(t).If the model is Markovian at a certain lag time, the relaxation time scales should remain constant when using longer lag times, satisfying the Chapman-Kolmogorov equation.Here we use the smallest lag time at which the implied time scales level off to build the microstate MSM for further analysis.The microstate MSM is used to calculate the kinetic properties.To facilitate interpretation, the 8000 state microstate model is coarse grained to 150 macrostates using the Bayesian agglomerative clustering engine (BACE), 42 which has been shown to be a robust method in a recent study that compared different coarse graining protocols. 43The population of each macrostate is the sum of the equilibrium populations of the microstates belonging to this macrostate.The MSMBuilder software is used to construct the model. 21,24,44an first passage time (MFPT) calculations MFPT is defined as the average transition time between a pair of states in an MSM.The MFPT from initial state i to final state j can be determined by solving the following set of equations: where t is the lag time and T(t) is the corresponding transition probability matrix. 45To compute the MFPT from the microstate MSM, we first set the MFPTs of all the microstates within the destination macrostate to be zero.

Results and discussion
In the following, we first comment on the degree of sampling and validation of the Markov model.Then we characterize the free energy landscape by analyzing the conformational features of macro states from MSM and comparing the observations to the free energy surface computed by dPCA and LSDMap.Finally, we discuss the kinetics of conformational transitions by computed MFPTs.To better elucidate the free energy landscape, we coarse grain the 8000 microstate model to a 150 state macrostate MSM using the BACE coarse graining method.The implied time scales for the 150 state model, as shown in Fig. 1(b), also level off at about 6 ns.The relaxation time scales are slightly shorter than those of the microstate MSM.As noted in a previous study, 27 this discrepancy results from lumping very small microstates with the large ones and may be corrected by keeping those small microstates as independent states.Here, we will only use the macrostate MSM for the purpose of elucidating key features of the free energy landscape.

Analysis of the free energy landscape
The equilibrium populations of the kinetically metastable states obtained from the macrostate MSM (Fig. 2) show that the free energy surface is rugged without a single dominant well.The most populated macro state has a population of about 3.4%.This is physically reasonable considering the high structural flexibility of the histone tail.
To gain a better understanding of the characteristics of the major free energy basins, we further investigate the structural features of the top 9 most populated macro states (there is a small gap in population between the 9th and 10th states, see Fig. 2).As shown in Fig. 3, most of the structures in these basins have two a-helix regions, one from residue 2 to residue 11 and another from residue 17 to residue 28; the second a-helix segment has a higher propensity to be folded.The other regions mainly assume coil and turn structures.The most populated state, as shown in Fig. 3(a), has an N-terminal a-helix region with an average probability of 0.41 and a second a-helix region with an average probability of 0.77.It has an average R g of 18.04 Å.The 4th most populated state has a very low probability for the first a-helix (Fig. 3(d)).The R g 's among the top nine states range from 15.09 Å to 19.49 Å, with the standard deviation of B1.5 Å.An outlier is the 9th state, which only has a single short a-helix between residues 20 and 26; it also features a much more collapsed structure with an average R g of 10.00 AE 0.14 Å (Fig. 3(i)).
In Fig. 4, the R g values for the 150 macro states are plotted in the order of their populations.There is not a strong correlation between the level of compactness and population.Most of the macrostates have R g values between 12 and 18 Å, with substantial standard deviations.The only exception is the already noted 9th state, which features only very compact conformers with a small (0.14) standard deviation in R g .Several other macro states also have an average R g of around 10-11 Å but with substantially higher standard deviations.The fact that the standard deviations in R g for most macro states are fairly large (B1-2 Å) suggests that many conformations of different R g 's interconvert rapidly (e.g., when the isomerization only involves simple bond rotations etc., see discussions below).Many compact structures are assigned to different macro states, which explains that the macro states with low average R g 's are not highly populated.
To better understand the overall structural propensities of the peptide, we investigate the secondary structure distribution  for the entire conformational ensemble (Fig. 5).On average, about 21 residues are coil and about 10 residues assume an ahelix structure (Fig. 5(a)).The most a-helical regions are observed from residues 5 to 9 and residues 17 to 26, with the rest mainly being coils and turns (Fig. 5(b)).The most populated macro states have a higher content of a-helix structures than the average macro states.The analysis of the implicit solvent simulations shows a similar pattern (see Fig. S3, ESI †).
In the previous computational study with a 16-residue H3 N-terminal tail variant, the region from residues 4 to 11 showed a high propensity to assume an a-helix structure, 11 corresponding to the first a-helical fragment in this study.In another computational study using the same construct as this work, 2-3 regions strongly favoring the a-helix structure were observed, 13 in qualitative agreement with our findings.

Free energy landscape analyzed using dPCA
Dihedral PCA decomposes the free energy surface into three major basins.As shown in Fig. 6, the free energy surface is well connected and the barriers between the major basins are around 0.5-1.5 kcal mol À1 .The largest basin, labeled 'ii' in the figure, has the characteristic structural feature of two a-helical segments as observed for the top-populated macro states from MSM (Fig. 3).The second largest basin, labeled 'i' in the figure, has very compact conformations.It lacks the N-terminal helix but maintains the second helical segment, similar to the 9th Fig. 3 Representative structures and structural features for the nine most populated macrostates.The population for each state is labeled on the top-left.The representative structures are colored in rainbow with blue denoting the N-terminus and red for the C-terminus.To the right is the probability for each residue to be a-helical in each state.The mean and standard deviation of the radius of gyration (R g ) for the conformations in each state are also listed.
This journal is © the Owner Societies 2015 macro state from MSM.The conformations in basin 'iii' are quite open in nature and have a lower helical content than those in basin 'ii'.Even though it is difficult to make a one to one mapping between the free energy basins from dPCA and the macro states from MSM, the dPCA results confirm the overall feature of the free energy landscape implied by the MSM analysis.In the study by Papoian and co-workers, 13 the free energy landscape was also analyzed using dPCA, which showed dispersed free energy basins with depths of around 1-3 kcal mol À1 .The free energy surface in the current work is better connected due to the more extensive sampling, although the gross features are fairly similar.The ensemble average R g of 15.2 Å is, however, higher than that reported in the previous study, which was 11.6 Å.

Free energy surface analyzed using LSDMap
The LSDMap shows two major basins on the free energy surface (Fig. 7).The barrier between these two basins is around 2-2.5 kcal mol À1 .The small basin (basin 'i') has very compact structures, with an average R g of 10.0 Å.However, the structures have a low a-helix content (B18.2%) and are largely in random coils.The large basin (basin 'ii') has an average R g of 16.9 Å and an a-helix content of 33.5%, showing that the structures are more open with two main a-helix segments.The time scale separation, as shown in Fig. 7a, indicates that a single time scale appears to dominate the dynamics.The projection of the free energy landscape onto the first diffusion coordinate and R g (Fig. 8) shows that the first diffusion coordinate corresponds to the collapse/expansion of the peptide.
Comparing the free energy landscape analyzed using LSDMap and dPCA, based on the average R g and the a-helix content, basin 'i' on the LSDMap corresponds to basin 'i' on the dPCA free energy surface, and basin 'ii' on the LSDMap corresponds to basins 'ii' and 'iii' on the dPCA free energy surface.The basin 'i' on both free energy surfaces feature very compact conformations, similar to the 9th-populated macro state from MSM (Fig. 3).The other basins have higher R g values, similar to other top-populated MSM macro states.

MFPTs between different metastable states
Kinetic information is extracted from the MSM by calculating the MFPTs between different metastable states.As shown in Fig. 9, the MFPTs between these 150 metastable states range from hundreds of nanoseconds to hundreds of microseconds.Most of the transitions are on the order of several microseconds.The fastest transition occurs at B178 ns and the slowest at B242 ms.The fastest transition corresponds to the conversion from the 14th most populated state to the most populated state.Both of these two states have quite an open structure (with R g 's of 16.46 Å and 18.04 Å, respectively) and two helical regions.The slowest transition is from the 122nd most populated state to the 9th most populated state.Both of these two states have very compact structures, with R g 's of 11.31 Å and 10.00 Å, respectively.These observations are plausible considering that the transition between the open structures might simply involve the rotation of several single bonds while that between the compact structures may have to involve the breaking of electrostatic interactions and hydrogen bonds.For example, the representative structure in the 122nd most populated state is a compact random coil structure (Fig. S4(a), ESI †), while that for the 9th most populated state has an a-helical region (Fig. S4(b), ESI †).This conformational transition involves breaking of the original contacts in the compact random coil structure and the formation of an a-helix fragment.
The generally rapid transitions observed here are consistent with the relatively low barriers between free energy basins obtained using dPCA and LSDMap.This feature might be due to the abundant positive charges on the histone H3 N-terminal tail (it has four Arg residues and eight Lys residues, and no Asp/ Glu).The generally repulsive electrostatic interactions make it harder to form many distinct compact structures.This is also supported by the relatively open conformations in the most populated macro states.The transitions likely mainly involve   the partial folding and unfolding of the a-helix segments, movements around the linkage between these two segments, and also the flexible C-terminus.

Concluding remarks
In this study, we carry out extensive explicit solvent simulations on the histone H3 N-terminal tail and construct the MSM to describe its dynamics.The MSM shows that the peptide has a rather flat free energy surface with shallow free energy basins separated by low barriers.The conformations in the main metastable states show two a-helix prone segments.The overall secondary structural propensity is consistent with observations from previous experimental and computational studies. 9,11,13he free energy surfaces obtained using dPCA and LSDMap show similar features, with conformations in the free energy basins exhibiting partial helical contents.The MFPTs computed from MSM indicate that the transitions between different metastable states range from hundreds of nanoseconds to hundreds of microseconds.
The rapid transition between different conformational states of the histone H3 N-terminal tail may well indicate how this peptide functions.The low free energy barrier between different conformational states can be easily perturbed by various modifications on different residues.The binding between the peptide and its counterpart may also involve an induced fitting process.For instance, in the NMR structure of a K4 and K9 dimethylated form in complex with HP1 and the crystal structure of K4 trimethylated form in complex with the ING2 PHD finger, the histone H3 N-terminal tail in general showed extended structures. 19,20Since chemical modifications were not found to have a major impact on the structure of this peptide, 11 these structural variations observed in different experiments are expected to be due largely to binding to other proteins, which led to the conformational transition from helical to extended structures.The resolved region in the experimental structures largely corresponds to the first a-helical fragment in this study.It is worth noting that the N-terminal helical region is more labile than the second helical region.In the 9th most populated macrostate in the MSM and the basins labeled 'i' on the free energy surfaces analyzed with dPCA and LSDMap, the structures all have an extended N-terminus very similar to that found in the experimental structures.The side chains of K4 and K9 orient away from the helical region (Fig. S4(b-d), ESI †), which might facilitate the modifications and the binding process.The conformational transitions of the peptide in the induced-fit process can be facilitated by the intrinsically fast kinetics of the peptide.A similar discussion has been made for other intrinsically disordered proteins, whose conformations respond actively to solution conditions, self aggregation and binding to other proteins. 27,46,47These studies and the current work highlight how the internal dynamics of this important class of peptides contribute to the function.
From a technical perspective, the present work illustrates the advantages of integrating MSM, dPCA and LSDMap to study the free energy landscape and kinetics of a flexible system.dPCA is computationally efficient since it relies only on structural information and provides a compact description of the free energy landscape.LSDMap goes beyond dPCA by considering the ''kinetic connectivity'' among different conformations, thus the dominant eigenvectors better describe the collective dynamics of the system.MSM explicitly considers microscopic kinetic information for the interconversion among metastable states and therefore leads to a more complete description of the free energy landscape, although at the cost of requiring a more thorough sampling.In the current work, the results from the three methods are largely consistent in terms of dominant structural features, although MSM clearly contains richer information regarding the kinetic stability and accessibility of different populations.The integration of the three methods provides us with a means of cross validation and a more complete characterization of the free energy landscape and kinetics of the histone H3 N-terminal tail.
An adequate sampling is critical for extracting valid equilibrium and kinetic information from molecular simulations.The various analyses in this work are based on the fairly long (B30 ms) explicit solvent simulations, which are seeded by different conformations from a longer set of (B80 ms) implicit solvent simulations.To evaluate the convergence of the simulations, we monitor the radius of gyration (R g ) distribution and the dPCA free energy surface when only part of the simulation data is used.As shown in Fig.S1 and S2 (ESI †), when only 15 ms, 20 ms or 25 ms of the explicit solvent data is used, the R g distributions remain nearly the same; the free energy landscapes obtained using dPCA are also similar in terms of the main free energy basins and their connectivity.Thus we conclude that the degree of sampling carried out here is adequate to draw meaningful conclusions.In the MSM analysis, for the model to be Markovian, it should satisfy the Chapman-Kolmogorov equation and the relaxation time scales should remain constant at different lag times.Instead of spot-checking specific states, we conduct a global check on the MSM by observing the implied time scale plot at This journal is © the Owner Societies 2015 different lag times.As shown in Fig.1(a), the implied time scales for the 8000 microstate model reach a plateau at about 6 ns, which is used to construct the microstate MSM.The quantitative properties, including the MFPTs, are calculated from this model.

Fig. 1
Fig. 1 Analysis of the Markov state model (MSM) with different lag times.(a) Implied time scale plot for the 8000-state microstate MSM.(b) Implied time scale plot for the 150-state macrostate MSM.In each case, the slowest eigenmodes of the corresponding transition matrix are analyzed.

Fig. 4
Fig.4The average R g 's of the 150 metastable states in the MSM; the states are numbered based on their populations, with the first state being most populated and the last being least populated.The error bars are twice the standard deviation of the R g 's among conformations in each state.

Fig. 5
Fig. 5 Secondary structure of the histone H3 tail in explicit solvent simulations.(a) The average number of residues having a specific type of secondary structure.The error bars show the standard deviation.(b) The probability for each residue to have each type of secondary structure.

Fig. 6
Fig. 6 Free energy profile of the histone H3 tail dynamics projected onto two main dPCA principal components.Representative structures from the three main free energy basins are shown in rainbow colors, with blue denoting the N-terminus and red for the C-terminus.The average R g and the a-helix content are also shown for each basin.

Fig. 7
Fig. 7 Results from the locally scaled diffusion map (LSDMap) analysis.(a) The exponential of the negative of the Fokker-Planck operator eigenvalues as a function of the eigenvalue number.The zero-th eigenvalue corresponds to the Boltzmann distribution.The major spectral gap between the first and second eigenvalues suggests that there is a single time scale that tends to dominate the dynamics.(b) Free energy landscape of the histone H3 tail projected onto the first and second diffusion coordinates.Representative structures for the two main free energy basins are shown in rainbow colors, with blue denoting the N-terminus and red for the C-terminus.The average R g and the a-helix content are also shown for each basin.

Fig. 8
Fig. 8 Free energy landscape of the histone H3 tail projected onto the first diffusion coordinate and R g .
MFPTs starting from each microstate in the starting macrostate are then calculated.A weighted average is then obtained as MFPT if : MFPT if ¼ P l2ip l MFPT lf , where p l is the normalized population of microstate l within macrostate i.