Robert Schneider, Jie-rong Huang, Mingxi Yao, Guillaume Communie, Valéry Ozenne, Luca Mollica, Loïc Salmon, Malene Ringkjøbing Jensen and Martin Blackledge*
Protein Dynamics and Flexibility, Institut de Biologie Structurale Jean-Pierre Ebel, CEA, CNRS, UJF UMR 5075, 41 Rue Jules Horowitz, Grenoble 38027, France. E-mail: martin.blackledge@ibs.fr; Tel: +33 4 38789554
First published on 26th August 2011
In order to understand the conformational behaviour of Intrinsically Disordered Proteins (IDPs), it is essential to develop a molecular representation of the partially folded state. Due to the very large number of degrees of conformational freedom available to such a disordered system, this problem is highly underdetermined. Characterisation therefore requires extensive experimental data, and novel analytical tools are required to exploit the specific conformational sensitivity of different experimental parameters. In this review we concentrate on the use of nuclear magnetic resonance (NMR) spectroscopy for the study of conformational behaviour of IDPs at atomic resolution. Each experimental NMR parameter is sensitive to different aspects of the structural and dynamic behaviour of the disordered state and requires specific consideration of the relevant averaging properties of the physical interaction. In this review we present recent advances in the description of disordered proteins and the selection of representative ensembles on the basis of experimental data using statistical coil sampling from flexible-meccano and ensemble selection using ASTEROIDS. Using these tools we aim to develop a unified molecular representation of the disordered state, combining complementary data sets to extract a meaningful description of the conformational behaviour of the protein.
One obvious aim of a structural description of IDPs is to determine rules that define the behaviour of the flexible protein in terms of probability to populate a defined region of conformational space. This is often achieved by evoking an explicit ensemble description of interconverting structures, whose populations are interpreted in terms of a population-weighted distribution that represents the true conformational equilibrium. However the definition of this distribution is no easy task. IDPs populate a vast conformational space, and the mapping of this potential energy landscape represents a classical ill-posed problem, in which the number and complexity of the available degrees of conformational freedom far outweigh the accessible experimental data that can be measured for a particular system. Some caution therefore needs to be exercised when treating such under-determined systems, where the development of an ensemble description that is in agreement with the experimental data may not ensure that the associated conformational sampling is correct. The development of robust procedures that address this issue is of paramount importance.
Most importantly, NMR provides access to ensemble- and time-averaged conformationally dependent parameters at atomic resolution. The measurement of structurally dependent parameters inherently provides a basic tool to study local conformational propensities that may be important for folding upon binding,14 and transient or persistent long-range contacts or tertiary structure that may also play a role in molecular interactions.14–16 In this article we describe recent advances in some NMR-based techniques for the description of the conformational behaviour of IDPs.17–19
The chemical shift of a specific nucleus reports on the local physico-chemical environment of the nucleus, and in the presence of conformational flexibility, depends on a population-weighted average over local conformations sampled by all molecules in the ensemble that are exchanging on timescales faster than the millisecond. This timescale therefore dictates our interpretation of all NMR parameters that are measured from this chemical shift averaging process. The chemical shift can also provide information about the local structural propensity20 that can be detected in intrinsically disordered proteins by analyzing the deviation of measured parameters from the expected value that would be measured in the absence of any local structure (the so-called ‘random coil’ value).21,22 The absolute definition of a random coil remains open to argument; in most cases, amino-acid specific values are measured experimentally from small peptides with no apparent local structure.23–25 The chemical shift provides a sensitive probe of local structural sampling, in particular 13C shifts, whose values depend, in order of importance, on the covalent structure (13Cα, 13Cβ or 13C′), the type of amino acid, and finally on the local structural propensity, which is the parameter of interest. The difference between the measured shift and the amino-acid specific random coil shift, known as the ‘secondary’ chemical shift, is commonly used to identify the presence of transient structure in flexible chains.26–28 Scalar couplings between nuclei on the backbone of the protein also depend on backbone dihedral angles and average in a similar way to chemical shifts.29–31 Again, random coil values have been measured in small peptides, and these values can be compared to experimental values to determine the level of transient local structure.
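The two operations described above, the population-weighted average in fast exchange and the subtraction of a random coil value, can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the random coil numbers in `RANDOM_COIL_CA` are placeholder values chosen for the example, not a curated reference table.

```python
# Illustrative 13C-alpha random coil shifts in ppm. These are placeholder
# values for the sketch (an assumption), not a recommended reference set.
RANDOM_COIL_CA = {"A": 52.5, "G": 45.1, "L": 55.1, "S": 58.3}


def population_weighted_shift(shifts, populations):
    """Fast-exchange average: sum of p_k * delta_k, normalised by sum of p_k."""
    total = sum(populations)
    return sum(p * s for p, s in zip(populations, shifts)) / total


def secondary_shifts(sequence, measured, random_coil=RANDOM_COIL_CA):
    """Measured shift minus the residue-specific random coil shift."""
    if len(sequence) != len(measured):
        raise ValueError("sequence and shift list must have equal length")
    return [obs - random_coil[aa] for aa, obs in zip(sequence, measured)]


# A residue that spends 25% of its time in a conformation with a higher
# 13C-alpha shift shows a proportionally shifted average.
averaged = population_weighted_shift([50.0, 54.0], [0.75, 0.25])
deviations = secondary_shifts("AG", [53.5, 45.1])
```

A positive entry in `deviations` would then be read as a departure from random coil behaviour at that residue, subject to the quality of the reference values used.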
Residual dipolar couplings (RDCs), measured between pairs of nuclei, are also extremely promising tools for studying the conformational behaviour of disordered proteins.32–36 RDCs become measurable when the protein of interest is dissolved in a dilute liquid crystalline medium, such that the average dipolar coupling, normally averaged to zero in free solution, has a residual, non-zero value.37–39 Under these conditions RDCs depend on the average over the ensemble of orientations of the vector connecting the two spins in the following way:
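$$D_{ij} \propto -\frac{\gamma_i \gamma_j}{r_{ij}^{3}} \left\langle \frac{3\cos^{2}\theta - 1}{2} \right\rangle \qquad (1)$$

where θ is the angle between the internuclear vector and the magnetic field, r_ij is the internuclear distance, γ_i and γ_j are the gyromagnetic ratios of the two spins, and the angular brackets denote the average over the ensemble.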
Fig. 1 Illustration of the sensitivity of RDCs to the presence of local structure. The orientational dependence shown in eqn (1) results in positive 1DNH RDCs for the central helical element, where the NH bond vectors tend to be aligned with the field, while in the disordered regions the RDCs are negative, because the average orientation is perpendicular to the direction of the chain.
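The sign behaviour described in the caption of Fig. 1 can be checked numerically. The sketch below assumes that the orientational dependence of eqn (1) is the standard second-order Legendre term, ⟨(3cos²θ − 1)/2⟩, and evaluates only this angular average; the physical prefactor is left out, and the angle lists are arbitrary illustrative values.

```python
import math


def p2(cos_theta):
    """Second-order Legendre polynomial, (3 cos^2(theta) - 1) / 2."""
    return 0.5 * (3.0 * cos_theta ** 2 - 1.0)


def mean_p2(angles_deg):
    """Ensemble average of P2(cos theta) over a list of angles in degrees."""
    values = [p2(math.cos(math.radians(a))) for a in angles_deg]
    return sum(values) / len(values)


def isotropic_p2(n_points=1000):
    """Midpoint-rule average of P2 for orientations uniform on the sphere."""
    step = 2.0 / n_points
    xs = [-1.0 + (k + 0.5) * step for k in range(n_points)]
    return sum(p2(x) for x in xs) / n_points


# Vectors close to the field direction give a positive average,
# vectors close to perpendicular give a negative one.
aligned = mean_p2([0.0, 10.0, 20.0])
perpendicular = mean_p2([80.0, 90.0, 100.0])
isotropic = isotropic_p2()
```

The isotropic case illustrates why the coupling averages to zero in free solution and only becomes measurable once the alignment medium biases the orientational distribution.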
Disordered proteins often exhibit evidence of fluctuating long-range tertiary structure that may be important for physiological interactions, for example via so-called fly-casting interactions,16 in the control of early folding events, or to provide protection from aggregation or proteolysis. While it is difficult to detect such transient contacts via standard approaches to the measurement of internuclear distances using 1H–1H cross relaxation, the detection of such long-range information is possible by exploiting the strength of the dipolar relaxation between the nuclear spin and an unpaired electron that can be introduced into the protein by attaching a nitroxide group to a cysteine mutant.48–50 Because the gyromagnetic ratio of the electron spin is over 600 times higher than that of the proton spin, the observed line broadening due to the paramagnetic relaxation enhancement (PRE) provides long-range probes of intra- and intermolecular distances and distance distribution functions that can be detected even if only weakly populated.51–57
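Why weakly populated contacts remain detectable can be illustrated with a short calculation. The sketch assumes the usual r⁻⁶ distance dependence of dipolar relaxation, so that the observable scales with the ensemble average ⟨r⁻⁶⟩; the function names and the two-state populations and distances are hypothetical values chosen for illustration.

```python
def mean_inverse_sixth(distances, populations):
    """Population-weighted average of r^-6 over conformers."""
    total = sum(populations)
    return sum(p * d ** -6 for p, d in zip(populations, distances)) / total


def effective_distance(distances, populations):
    """Distance that a single conformer would need to give the same <r^-6>."""
    return mean_inverse_sixth(distances, populations) ** (-1.0 / 6.0)


# Hypothetical two-state example: 5% of conformers bring the spin label
# within 10 Angstrom of the observed nucleus, 95% keep it 40 Angstrom away.
distances = [10.0, 40.0]
populations = [0.05, 0.95]
r_eff = effective_distance(distances, populations)
r_mean = sum(p * d for p, d in zip(populations, distances))
```

In this example `r_eff` is far shorter than the arithmetic mean distance `r_mean`, because the minor compact state dominates the r⁻⁶ average.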
A number of additional NMR parameters can be used to characterize the unfolded state: the most common are pulsed-field gradient spin echo experiments,58 which report on the population-weighted average translational diffusion properties of the chain, and heteronuclear spin relaxation, which reports on local order on picosecond to nanosecond timescales.59–61 The complementary information available from small angle X-ray scattering, which reports on the average mass distribution in three-dimensional space, and therefore on the dimensions of the ensemble of structures, is also often exploited in combination with NMR data to provide a more complete picture of the disordered state.62–68
An entirely different approach does not use the experimental data to drive the individual members of the ensemble into a conformation in agreement with the experimental data, but instead samples conformational space as broadly as possible, and then exploits the experimental data to define the region of conformational space that is appropriate for the system under investigation. Enhanced molecular dynamics approaches such as accelerated molecular dynamics have been used in this way to study intrinsic dynamics in folded proteins,44,75,76 although the potential extent of conformational space available to IDPs complicates the successful application of such approaches to these highly flexible systems. An alternative strategy is to attempt to flood conformational space by creating a statistical coil model of the protein based on the intrinsic conformational behaviour of each amino acid, derived for example from backbone dihedral angle distributions found in loop regions of protein structures.77–79
An explicit ensemble description of IDPs, called flexible-meccano, builds multiple copies of the protein that are designed to represent all possible states that are relevant for the NMR observable.35 Flexible-meccano randomly samples amino-acid-specific backbone dihedral angle {ϕ/ψ} propensities derived from non-secondary structural elements of high-resolution X-ray crystallographic structures,80 and thereby assembles a conformational ensemble from which experimental values can be calculated. Amino-acid specific hard-sphere steric clashes are used to provide a physically reasonable model of repulsive interatomic forces, and no attractive forces are explicitly used. The simplicity of the model allows for highly efficient structure ensemble assembly (100 000 structures of a 100 amino acid protein can be created in 30 minutes on a single processor). The ensembles are randomly sampled from population-weighted distributions that are taken to represent the potential energy surface of each amino acid. Although this does not guarantee a Boltzmann distribution, the absence of additional constraints in this sampling phase avoids distortions due to additional potential energy terms such as those used in restrained MD calculations.
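The sampling step can be pictured with a deliberately reduced sketch. It is not the flexible-meccano code: the `PHI_PSI_LIBRARY` entries are hypothetical placeholder angles rather than values from a structural database, the function names are invented for the example, and the hard-sphere clash check and the construction of atomic coordinates are omitted.

```python
import random

# Hypothetical mini-library of (phi, psi) pairs in degrees for each residue
# type. A real library would hold many pairs drawn from coil regions of
# high-resolution structures; these entries are placeholders.
PHI_PSI_LIBRARY = {
    "A": [(-63.0, -43.0), (-70.0, 145.0), (-120.0, 130.0)],
    "G": [(80.0, 10.0), (-80.0, 170.0), (90.0, -170.0)],
    "P": [(-65.0, 145.0), (-70.0, -35.0)],
}


def sample_conformer(sequence, library, rng):
    """Draw one (phi, psi) pair per residue from its residue-specific list."""
    return [rng.choice(library[aa]) for aa in sequence]


def build_ensemble(sequence, n_conformers, library=PHI_PSI_LIBRARY, seed=0):
    """Build independent conformers; steric rejection is not modelled here."""
    rng = random.Random(seed)
    return [sample_conformer(sequence, library, rng) for _ in range(n_conformers)]


ensemble = build_ensemble("AGPA", n_conformers=5)
```

Because each pair is drawn with a probability proportional to its frequency in the library, the residue-specific populations of the library carry over directly to the ensemble.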
The presence of a single set of signals detected in NMR spectra of denatured and intrinsically disordered proteins imposes the implicit assumption that all conformers used to predict an experimental value are in rapid exchange on timescales faster than the millisecond. The ensemble of structures can then be used to predict experimental values that would be measured if the statistical coil model were relevant. For the prediction of chemical shifts and scalar couplings, local structural information is sufficient to predict the expected value, while for RDCs the calculation of the expected alignment of each conformer is necessary before averaging over the ensemble. In the most common case of steric alignment this calculation is performed on the basis of the three dimensional shape of the protein.81
RDCs simulated using this very simple approach predict values in reasonable agreement with experimental couplings measured in both intrinsically disordered and chemically denatured proteins. Initial studies already indicated that the orientational space sampled by inter-nuclear bond-vectors from RDCs is sensitive enough to pick up differences in amino-acid specific backbone dihedral angle distributions, even in the absence of secondary structural propensity.10,35 Flexible-meccano has also been used in combination with molecular dynamics based simulations, to quantify the level of β-turn propensity in the K18 domain of the protein Tau82 and α-helical propensity in the transactivation domain of the protein p53.83
While N–HN RDCs alone have been shown to provide evidence for local structural propensity, the measurement of multiple RDCs from each peptide unit provides the necessary information to make quantitative estimates of the detail and population of the structural elements. Thus, the combination of RDCs from different bond-vectors (N–HN, Hα–Cα, C′–HN, Cα–C′) was also shown to be crucial to the description of the length and population of different helical structures that form the rapidly exchanging conformational equilibrium of the molecular recognition element of the disordered C-terminal domain of the nucleoproteins from Sendai and measles viruses.84,85 In this case entire ensembles of all possible helical elements were calculated, and the minimum combination that could reproduce the experimental data was determined, along with their associated populations. Remarkably, in both cases, the helical elements present in the molecular recognition elements that were significantly populated in solution were found to follow amino acids with known propensity to stabilize helices in free solution.85 An extensive set of RDCs, including a large number of long-range 1H–1H couplings, was measured in the protein Ubiquitin in its denatured state,87 and used in combination with flexible-meccano to identify modifications of the statistical coil model that are appropriate to account for conformational sampling of the unfolded chain in the presence of the denaturant.88,89
The statistical coil description of the disordered state thus provides a relatively straightforward approach for calculating RDC profiles that would be expected if the protein behaved as a random coil. The establishment of such approaches is essential in order to develop a clear understanding of the origin of experimentally observed fluctuations in the absence and in the presence of specific or persistent local or long-range structure. However the next step, requiring the quantitative interpretation of departures from expected random coil values in terms of specific local or long-range conformational behaviour, is of equal importance and fundamentally more challenging.10,90,91
In order to address this issue, the ensemble selection algorithm, ASTEROIDS (A Selection Tool for Ensemble Representation Of Intrinsically Disordered States) has been developed to determine appropriate regions of conformational space populated by the IDP by selection of conformers from the flexible-meccano ensemble using experimental NMR data.92–94 The ASTEROIDS algorithm is based on an efficient genetic algorithm that is used to propose conformational ensemble descriptions selected from a large pool of possible conformers that are in agreement with the experimental data. In order to identify conditions under which an approach that evokes a sub-ensemble of structures can be accurately applied to describe a pseudo-continuum of conformers, we systematically adopt the following simple procedure that clearly quantifies the conformational accuracy of such approaches: (1) Data are simulated under specific conditions of conformational sampling and appropriately averaged over an ensemble of a very large number of conformers (between 50 000 and 100 000). (2) Sub-ensembles of tractable size are generated using ASTEROIDS to be in agreement with these data, and the conformational sampling represented in these ensembles is compared to the target sampling used in step (1) to generate the data.
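The two-step procedure above can be mimicked with a generic genetic algorithm. The sketch below is not the ASTEROIDS implementation, whose operators and parameters are not given here: the crossover, mutation and elitist survival rules, the function names, and all default values are assumptions made for illustration. Each conformer in `pool` is represented only by its list of predicted observables.

```python
import random


def ensemble_average(pool, members):
    """Average each observable over the selected conformer indices."""
    n_obs = len(pool[0])
    return [sum(pool[m][k] for m in members) / len(members) for k in range(n_obs)]


def chi2(pool, members, target):
    """Sum of squared deviations between ensemble average and target data."""
    avg = ensemble_average(pool, members)
    return sum((a - t) ** 2 for a, t in zip(avg, target))


def crossover(a, b, rng):
    """Join the head of one parent to the tail of the other."""
    if len(a) < 2:
        return list(a)
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def mutate(members, pool_size, rng):
    """Replace one randomly chosen member by a random conformer of the pool."""
    child = list(members)
    child[rng.randrange(len(child))] = rng.randrange(pool_size)
    return child


def select_ensemble(pool, target, size, generations=200, population=30, seed=0):
    """Evolve index lists of length `size`; a conformer may appear twice."""
    rng = random.Random(seed)
    pop = [[rng.randrange(len(pool)) for _ in range(size)] for _ in range(population)]
    for _ in range(generations):
        children = []
        for _ in range(population):
            a, b = rng.sample(pop, 2)
            children.append(mutate(crossover(a, b, rng), len(pool), rng))
        # Elitist survival: keep the best `population` of parents and children.
        pop = sorted(pop + children, key=lambda m: chi2(pool, m, target))[:population]
    return pop[0]
```

In the spirit of step (1), a target can be simulated by averaging over the whole pool, for example `target = ensemble_average(pool, range(len(pool)))`, and step (2) then amounts to calling `select_ensemble(pool, target, size=20)` and comparing the conformational sampling of the returned sub-ensemble with that of the pool. Agreement of the averages alone says nothing about whether that sampling is reproduced.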
One of the most important problems encountered in the treatment of RDCs derives from the large number of structures required before a simple arithmetic average reaches convergence. The reason for this is that, in addition to the obvious dependence on local conformational sampling, the RDCs for each individual conformer depend on conformational degrees of freedom throughout the molecule, which together define the shape of the protein, and therefore the size and distribution of the RDCs. Indeed, convergence of RDCs from a 76 amino acid chain is not yet achieved in 10 000 structures. More rapid convergence of RDCs can be achieved using a smaller number of conformers if the protein is divided into short, uncoupled segments (Local Alignment Windows, LAWs) and the RDCs are calculated using the alignment tensor of these segments.95 This is an important result: the ability to describe the conformational properties using fewer structures renders ensemble selection more tractable.
However there are important aspects that need to be addressed before such approaches can be used to explain experimental data. Adopting the procedures described above, RDCs were calculated using specific conformational sampling regimes averaged over a large ensemble.92 The average RDCs were then used, in combination with a 15 amino acid window, to select different sized ensembles of conformers from a large pool in agreement with the data. The results demonstrated that ensembles that evoked only 20 structures reproduced the experimental data, but critically did not reproduce the backbone dihedral angle distributions that were at the origin of the average. Only when at least 200 structures were used in the average was the conformational behaviour sufficiently well reproduced. The reason for this is the instability of adding additional RDCs to an ensemble where the average is not yet converged.
The revelation that experimental data can be reproduced by an ensemble of structures that does not represent the correct conformational sampling was initially surprising to us, although this appears to be a predictable manifestation of the potential pitfalls of deriving ensembles under such under-determined conditions. The result has particular importance, and highlights the risks of reducing the number of members of a conformational average until the data are reproduced. Such a procedure can clearly produce ensembles whose local conformational sampling is quantitatively incorrect, while reproducing experimental data.
Secondly, and possibly more critically, approaches that only use a LAW to analyze RDCs patently ignore the fact that RDCs are affected both by the local conformational sampling and long-range order. This is important even in the absence of specific long-range contacts, because the chain-like nature of the unfolded protein induces an effective baseline reflecting the increasing degrees of freedom available towards the ends of the chain (Fig. 2a and h). Long-range information is necessarily absent from an approach that only employs LAWs to predict the RDCs. If this approach is employed the simulated data need to be corrected for the effects of the unfolded chain. This can be achieved when LAW-predicted RDCs are multiplied by the expected baseline of an unfolded chain, whose bell-shaped dependence can be parameterised by fitting to numerical simulation.
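The correction described above, multiplying LAW-predicted RDCs by a chain baseline, can be sketched as follows. The windowing helper uses the 15 amino acid window mentioned in the text. The baseline function is purely hypothetical: its cosh form and the parameters `a`, `b` and `c` are assumptions standing in for a curve that would in practice be fitted to numerical simulation.

```python
import math


def local_alignment_windows(n_residues, window=15):
    """Return (start, end) residue index pairs, end exclusive, for each window."""
    if window > n_residues:
        raise ValueError("window cannot be longer than the chain")
    return [(start, start + window) for start in range(n_residues - window + 1)]


def hypothetical_baseline(i, n_residues, a=0.05, b=0.1, c=1.0):
    """Assumed bell-shaped scaling: maximal mid-chain, falling towards the ends."""
    mid = (n_residues - 1) / 2.0
    return c - 2.0 * b * (math.cosh(a * (i - mid)) - 1.0)


def apply_baseline(law_rdcs, baseline=hypothetical_baseline):
    """Multiply each LAW-predicted RDC by the baseline value at its position."""
    n = len(law_rdcs)
    return [rdc * baseline(i, n) for i, rdc in enumerate(law_rdcs)]


windows = local_alignment_windows(100, window=15)
corrected = apply_baseline([1.0] * 100)
```

With a flat input of 1.0 the output simply traces the assumed baseline, which makes it easy to inspect the shape before applying it to real LAW predictions.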
Fig. 2 The effects of long-range contacts on expected RDC profiles. Top: (a) 1DNH and 1DCaHa RDCs calculated for a 100 amino acid sequence in the absence of specific contacts. The program PALES was used to calculate RDCs from each conformer. 100
The effects of ignoring long-range contacts when analyzing RDCs from disordered chains can however be much more severe when preferential long-range contacts exist in the protein, as demonstrated by the following simulations: RDCs were predicted from ensembles of 100 000 conformers generated by flexible-meccano simulations of a 100 amino acid model sequence in the presence of weakly defined long-range contacts, defined as a contact between any residues of two 20 amino acid strands (Fig. 2). In comparison to the expected values for a chain with no specific long-range contacts, the effect is significant, even for such diffuse long-range contacts. Simulation predicts significant quenching of RDC values in regions between the two contact regions. Importantly, although the local conformational sampling is not measurably affected by the contacts, the resulting RDCs are very different because of the transient long-range order that is also present. This again demonstrates that extreme caution needs to be exercised when interpreting RDCs uniquely in terms of the local structure. Comparison with identical simulations for a poly-valine chain indicates that the actual effect of diffuse long-range contacts is to convolute a more complex ‘baseline’ with the local structure of the expected RDCs. Fortunately, a generic mathematical expression can be derived that accurately reproduces the numerically predicted baselines shown in Fig. 2 and depends only on the position of the contacts and the length of the chain. As a consequence, long-range information, for example derived from paramagnetic relaxation enhancement (vide infra), can be combined with the efficient sliding window approach, to simultaneously account for both aspects within the same ensemble average (Fig. 3).93
Fig. 3 Combination of effects of long-range order derived from PREs with local conformational sampling using local alignment windows for the interpretation of RDCs. (a) Blue: Data averaged over the target ensemble where each conformer has a contact between 11–20 and 61–70. Red: Average PREs over an ensemble of 80 structures selected using ASTEROIDS. The four boxes show the PRE data for simulated spin labels at residues 20 (top left), 40 (top right), 60 (bottom left) and 80 (bottom right). Lines show the PREs calculated from a control ensemble with no specific contacts. The distance matrix shows the chain proximity in the ensembles selected using ASTEROIDS (above the diagonal), compared to target ensembles (below the diagonal). Average distances between sites are shown in terms of: Δij = log(〈d0ij〉/〈dij〉) where dij is the distance in any given structure of the ASTEROIDS ensemble between sites i and j, and d0ij is the distance in any given structure of the reference ensemble between sites i and j. Values above the diagonal have been multiplied by 2 for ease of identification of the contact. Top: Black: RDCs calculated using the local alignment window (LAW). Blue: Predicted effect of the long-range contact detected using the ASTEROIDS interpretation of the PREs. Bottom: Combination (purple) of the two curves shown in the top panel and RDCs averaged over 100
The ability to combine long-range information from PREs and RDCs in this way represents a major step forward in our ability to describe highly disordered systems. As an example, we applied these methods to experimental PRE and RDC data from α-Synuclein.57,96 Experimentally measured RDCs agree significantly better when a long-range contact between the N and C terminal domains, derived from PREs, is included in the RDC analysis (rmsd of 0.51 compared to 0.75). This not only validates the predicted effects on RDC profiles due to long-range transient contacts in disordered systems, but also demonstrates that PREs and RDCs can be meaningfully combined to understand experimental data (Fig. 4).
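The proximity measure used in the contact maps, Δij = log(〈d0ij〉/〈dij〉) as given in the caption of Fig. 3, is straightforward to evaluate once both ensembles are available as coordinates. In the sketch below each conformer is assumed to be a list of (x, y, z) site positions, the function names are hypothetical, and base 10 is an assumption, since the base of the logarithm is not stated.

```python
import math


def mean_distance(ensemble, i, j):
    """Average Euclidean distance between sites i and j over all conformers."""
    return sum(math.dist(conf[i], conf[j]) for conf in ensemble) / len(ensemble)


def delta_ij(ensemble, reference, i, j):
    """log10 of the reference mean distance over the selected-ensemble mean."""
    return math.log10(mean_distance(reference, i, j) / mean_distance(ensemble, i, j))


# Toy example with two sites per conformer: the selected ensemble brings
# the sites to half the separation seen in the reference ensemble.
reference = [[(0.0, 0.0, 0.0), (20.0, 0.0, 0.0)]]
selected = [[(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]]
contact = delta_ij(selected, reference, 0, 1)
no_change = delta_ij(reference, reference, 0, 1)
```

With this ratio, positive values mark pairs of sites that are on average closer in the selected ensemble than in the reference ensemble, whatever base is chosen for the logarithm.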
Fig. 4 Combined analysis of PREs and RDCs in the context of experimental data from α-synuclein. (a) Comparison of experimental 1DNH RDCs with couplings calculated using a standard flexible-meccano prediction (red). The rmsd between the two distributions is 0.78 Hz. (b) Contact map showing the relative proximity of different parts of the chain in α-synuclein, derived from experimental PRE data. Average distances between sites are shown in terms of: Δij = log(〈d0ij〉/〈dij〉) where dij is the distance in any given structure of the ASTEROIDS ensemble between sites i and j, and d0ij is the distance in any given structure of the reference ensemble between sites i and j. (c) LAW-predicted RDCs (red) and effective baseline derived from the distance matrix shown in (b). (d) Combination of the curves shown in (c) (red) in comparison to the experimental 1DNH RDCs (rmsd = 0.52 Hz). Reprinted with permission from the Journal of the American Chemical Society.93
Fig. 5 Flowchart showing the iterative construction of a conformational ensemble using ASTEROIDS on the basis of heteronuclear chemical shifts.
Fig. 6 Application of ASTEROIDS to ensemble representation on the basis of chemical shifts. Secondary chemical shifts from an ensemble of 200 structures determined using the ASTEROIDS algorithm compared to experimental secondary chemical shifts (blue). Red: secondary chemical shifts averaged over the final ensemble. (A) α carbon, (B) β carbon, (C) carbonyl, (D) amide nitrogen. (E, F) Reproduction of independent parameters by the ensemble based on chemical shift selection. (E) 15N–1H residual dipolar couplings (RDCs) measured in sterically aligned NTAIL compared to averages over 50
Footnote
† Published as part of a Molecular BioSystems themed issue on Intrinsically Disordered Proteins: Guest Editor M. Madan Babu.
This journal is © The Royal Society of Chemistry 2012