M.
Sanchez-Martinez
and
R.
Crehuet
*
Institute of Advanced Chemistry of Catalunya (IQAC), CSIC, Spain. E-mail: ramon.crehuet@iqac.csic.es
First published on 13th October 2014
We present a method based on the maximum entropy principle that can re-weight an ensemble of protein structures based on data from residual dipolar couplings (RDCs). The RDCs of intrinsically disordered proteins (IDPs) provide information on the secondary structure elements present in an ensemble; however even two sets of RDCs are not enough to fully determine the distribution of conformations, and the force field used to generate the structures has a pervasive influence on the refined ensemble. Two physics-based coarse-grained force fields, Profasi and Campari, are able to predict the secondary structure elements present in an IDP, but even after including the RDC data, the re-weighted ensembles differ between both force fields. Thus the spread of IDP ensembles highlights the need for better force fields. We distribute our algorithm in an open-source Python code.
A very suitable technique to characterize the secondary structure at a residue level is the NMR residual dipolar couplings (RDCs),9 a technique that has been thoroughly developed by Blackledge10–13 and Forman-Kay14–17 groups, among others. In an isotropic medium, such as liquid water, dipolar couplings average out to zero. But if the medium has some preferential directions, then there is a partial alignment of the molecules and a residual coupling can be measured.
Contrary to what is the case for folded proteins, in IDPs the alignment tensor is essentially determined by the local (secondary) structure.16 When the main mechanism of alignment is steric, repulsion between the protein and the alignment medium tends to align secondary structure elements parallel to the medium. For this reason N–H couplings convey important information on the secondary structure. When the alignment medium is parallel to the field they are positive in α-helices – as all N–H are parallel to the helix – negative in β-sheets – as N–H are perpendicular to the sheet – and are very low for regions without any secondary structure, where residue orientations are random. A qualitative interpretation of RDCs can be based on these principles, but a quantitative explanation can be achieved if one is able to generate an ensemble of configurations that reproduce the measured RDCs.11,12,15,16
The generation of the ensemble that fit the RDCs is the crux of several approximations used in this field.18 A common approach is to sample random coil regions of the Ramachandran plot with codes such as Flexible Meccano,13,19 TraDES,16,20 or BEGR21 and then introduce secondary structure regions and weight them with a statistical analysis11,17 or a genetic algorithm.13 This is because the physics behind these force fields is very simple and cannot predict secondary or tertiary structure. These methods have proved extremely successful in interpreting several IDP studies, but lack predictive value in terms of secondary structure elements.
The problem of optimizing an ensemble is a case of inferential structure determination,22 albeit with a much broader probability distribution. If this distribution comes from a simulation, we would like to modify it so that it agrees with the experimental data. Ideally, the inclusion of the experimental data should create ensembles that agree among themselves, even if coming from different simulation methods. Here we explore to which extent this is true.
We present a method based on the maximum entropy principle (MaxEnt) to fit RDC data to simulated ensembles. Maximum entropy is a logically consistent way to fit a distribution to previously known values introducing the minimum possible modifications.23,24 It has been advocated very recently as a powerful technique to solve structural problems25 and it has already been applied to SAXS ensemble determination.26
We generated our ensembles from two coarse-grained force fields, which have more accurate physical terms than TRaDES or Flexible Meccano while remaining computationally affordable. Coarse-grained methods allow sampling of the large conformational space essential to describe IDPs and converge RDC data. However the simulation force-field does not influence the validity of the presented selection procedure, which can be applied to all types of ensembles.
Our aim of this work is three-fold. First, we develop a fitting algorithm to adjust experimental RDCs to an ensemble of conformations. We implement our method in a publicly available code so that it can be compared to others, and can be used by any research group.27 Second, we explore the information content of RDC data and the influence of our force field; in other words, how much do the RDCs constrain the initial ensemble. Considerable efforts have been made to determine how much different experimental data determine the properties of the ensembles.17 Here we want to highlight the relevance of the underlying model, which is often overlooked. And third, we test whether some coarse grained methods can produce more accurate ensembles than random-coil-based force fields and thus increase the prediction of RDCs.
We decided to implement an a posteriori re-weighting so that our method could be applied to ensembles generated with any software or force field. A second reason is that when applying the constraints on-the-fly, one usually averages by the number of replicas running in parallel32,33 but the number of replicas needed to converge the RDC values for IDPs is of the order of thousands (see Results section), which means that constraint molecular dynamics could only be run in supercomputers.
In our a posteriori re-weighting we assume we have a set of N structures {Xj=1,N} that we have previously calculated with a Monte Carlo or molecular dynamics simulation. As such, they have already been generated with a probability proportional to their Boltzmann factor, which depends on each specific force field. For a set of M observables q = {qi=1,M}, Pitera and Chodera showed that the application of the MaxEnt principle resulted in a reweighting of the probability of each structure j by a term:28
(1) |
(2) |
(3) |
(4) |
Because the scaling adds one degree of freedom, the set of λ = {λi} that minimize f1 lies on a 1-dimensional curve. Based on the MaxEnt principle, we seek λ that minimally modifies the ensemble. By eqn (1) these are the λ as close as possible to 0. Therefore we add a penalty term:
(5) |
Our implementation converges in less than 10 seconds for the ensembles used in this work in a 1 processor Xeon machine. This is to be compared with the Bayesian method developed by Stultz,36,37 which being their most efficient method takes about 30 minutes in an 8 processor Xeon machine with an ensemble of 299 structures. At the time of writing this paper, Das et al.38 published an interesting paper with a full Bayesian approach (called FitEnsemble) based on Monte Carlo sampling and implemented in pyMC.39 In the results section we compare our method with theirs and we show that the full Bayesian approach does not convey any essentially new information. At present their method cannot deal with scale-invariant quantities such as RDCs, but we do not see any fundamental reason why it could not be extended to treat them and we plan to explore this possibility. That would allow a cleaner way to introduce the uncertainty of RDCs' prediction and the experimental error, which are cumbersome to include in a maximum entropy formalism40 in an ad hoc manner. As the comparison with FitEnsemble38 will show, both of these terms are small for RDCs and the MaxEnt principle results in a fast algorithm. The extension of generative probabilistic models40,41 or maximum likelihood approaches42 to IDPs is also an attractive alternative, but it is beyond the scope of this work to evaluate them. The MaxEnt principle gives results in agreement with the Sparse Ensemble Selection algorithm,43 but the latter is computationally more expensive and needs some further development to be applicable to IDPs.43
We take T1 = 325.6 K as our reference or “experimental” ensemble. We calculated the RDCs for 8000 uncorrelated structures with PALES48 using steric alignment, because of the NMR setup used (see ESI† for the PALES options used). Then, we have used the ensembles of structures at T0 = 317.0 K to fit the RDC data at T1.
Because we have simulated both ensembles, we know that the weight of a given structure j with energy Ej from the T1-ensemble at temperature T0 is given by the Boltzmann factor, namely
(6) |
The most interesting part corresponds to residues 18 to 34, because of their tendency to form partial α-helices, also known as MoRFs.7,8
These data have been simulated with two different coarse-grained force fields: Profasi45–47 and Campari.51 Profasi was chosen for its focus on reproducing the folding behaviour of proteins based on physical terms. We think that using a physics-based force field is important to work with IDPs as knowledge-based force fields are biased towards folded proteins. Profasi has also been applied to IDPs.52,53 The choice of Campari is justified because it was specifically designed to work with IDPs and has been applied in several studies.51,54 The Campari system contains 9 sodium ions to neutralize the charge.
The RDCs were calculated from the PDBs with the PALES software.48 As the alignment media, poly(ethylene glycol), is dominated by steric interactions, we used the steric alignment in PALES (see the ESI† for further details).
Fig. 1 shows the error in the test set when using different number of structures for the training set to fit N–H RDCs with the Campari ensemble. We can see that for training sets smaller than several thousands, the errors in the test set remain very large, and increase as we improve the fit in the training set. In other words, the optimized {λi} are not transferable. This shows us that we need training sets at least of 7000 structures to determine parameters that do not overfit the experimental results until an RMS error of approximately 1 Hz. Because this number is close to the experimental error, we consider ensemble sizes of 7000–10000 as adequate.
Alternatively, we can estimate the error when calculating the mean value for an RDC: the standard error of the mean. There is certain ambiguity in this value as RDCs can be scaled, but we take here a fixed scale factor obtained from the fit of the 10000 structures (α = 2.08). Fig. S2 (ESI†) agrees with our conclusion that several thousands of structures are needed to get a mean RDC value of the same order of the experimental error. This result is independent of the residue we are measuring: the convergence of all RDCs is the same. Other studies have also found that the underlying ensembles are more heterogeneous than what the measured mean value may suggest.56–58
Several previous studies used a smaller ensemble size31,32 to successfully simulate IDPs. The size of the ensemble in these MD restrained simulations depends not only on the dispersion of the measured property but also on the other parameters used for the restrain, namely its force constant.30,31 These studies run simulations in parallel and were limited by computational resources, but formally their results are exact only when the number of replicas tends to infinity. Other computational methods are expensive, thus limiting the size of the ensembles.4,14,15,17,37 Our method is efficient for thousands of structures so that we prefer to use the full simulated ensemble.
A second important reason to limit the size of the ensembles is to reduce the overfitting. This is an issue when the weights of the structures are the parameters to be optimized, because new structures introduce new parameters, with the obvious risk of overfitting. With the MaxEnt algorithm, the number of parameters is fixed by the number of experimental data and not by the number of structures in the ensemble, which again does not prevent the use of large ensembles.
To analyse the secondary structure (SS) content of the ensemble, we use SS-map.59 SS-map is a software that plots the SS fraction of a given residue on the y axis and the length of the SS element on the x axis, thus providing a picture of the SS distribution of an ensemble with the information of the cooperativity of different SS of individual residues. By plotting both the fraction of SS and its length, it allows to distinguish, for example, a fully formed helix of 10 residues present 50% of the time from 2 fragments of 5 residues spanning the same range.
The ensemble at T1 represents what in a real situation would be the unknown ensemble, from which we only know the measured RDCs. T0 is a calculated ensemble that presumably will be similar, but does not have to reproduce the data exactly. MaxEnt should be able to reweight the T0-ensemble so that it fits the “measured” RDCs. Will the T0 re-weighted ensemble be more similar to the T1 ensemble?
Fig. 2 shows the SS-map of the synthetic ensembles at temperatures T0 and T1 and the re-weighted T0-ensemble to fit T1 N–H RDCs. Because T0 is a lower temperature, this ensemble presents longer helices. Fig. 3 shows that the application of the MaxEnt principle returns a set of weights that can reproduce the final RDCs.
The re-weighting needed to fit the data gives a set of weights that are closer to 1 than the exact Boltzmann reweighting (see Fig. 3). In other words, although the exact Boltzmann weights can reproduce the RDCs of the objective T1-ensemble (see Fig. S3, ESI†), the MaxEnt principle tells us that, based on the data, we do not need to change the weights that much, and that a lower modification of the ensemble is enough and consistent with the data.
As Fig. 3 suggests, the energy distribution of the reweighted T0-ensemble is still closer to that of the T0-ensemble than to that of the objective T1. On average the energy increases but remains lower than the T1-energy distribution (see Fig. S4, ESI†). Fig. S5 (ESI†) shows that most of the structures do not get re-weighted, and only a few do. For those that get re-weighted there is a certain correlation between the Boltzmann re-weighting and the re-weighting given by the N–H RDCs. Of course, if more data are used, for example Cα–Hα RDCs, the reweighting will increase, but even when doubling or tripling the number of experimental data, the degrees of freedom of the ensemble are much higher. We explore this in the following section.
The N–H RDCs do not give information on the energy but on the SS content of the structures; thus we expect the re-weighting to change the SS distribution. Fig. 2 and Fig. S6 (ESI†) reveal that the re-weighting of the data produced goes in the expected directions: the T0 ensemble gets depleted from the long helices that give too large RDCs. But these figures also show that the SS-map of the resulting ensemble remains different from that of the objective T1-ensemble. There are still regions of long helices much less populated in the T1-ensemble. In the following section we will give a reason why the reweighting is not complete and only affects some of the structures.
The results from this section suggest that the RDCs give some information on the SS content of an ensemble, but this information is limited and cannot fully determine the helical propensity nor the helical lengths of an ensemble.
The Profasi ensemble fits the N–H RDCs reasonably well, but shows a region, around residue 35, of too much alpha helices. The MaxEnt algorithm produces a small reweighting of this ensemble, with most of the structures retaining a weight close to one. Therefore the SS-map of the ensemble is visually indistinguishable from the one shown in Fig. 2.
We can use the Cα–Hα RDCs to cross-validate this refined ensemble. The Cα–Hα RDCs are very similar to the original ones, showing that we did not incur overfitting, but differ significantly from the experimental RDCs (Fig. S7, ESI†). This shows that Cα–Hα and N–H RDCs are not correlated, and depend on different structural properties of the ensemble. The lack of agreement with Cα–Hα indicates that the Profasi ensemble does not correctly represent the real structural ensemble.
As fitting one set of RDCs does not affect the other, we can use MaxEnt to also fit Cα–Hα RDCs The resulting ensemble is reweighted to a stronger extent and correctly fits the 56 RDCs (Fig. S8, ESI†). However Fig. 5 shows that despite the use of the additional 25 Cα–Hα RDCs, the fitted Profasi ensemble has only changed its composition slightly (compare with Fig. 2). This change went in the expected direction, increasing the long helices in the region of residues 20–27 and depleting the ensemble from helices in the region 31–39 (Fig. S9, ESI†). However this change was minor compared to the overall composition of the ensemble. Thus, even the use of 56 RDC data does not qualitatively change the Profasi ensemble and hints that it is still far from the real ensemble. We believe that this information can be used by developers to improve the quality of this force field. The spread of IDPs' energy landscape makes them a good target to find the balance between secondary structure populations and lengths versus random coils.
Fig. 5 SS-map of the MaxEnt re-weighted Profasi (left) and Campari (right) ensembles using 31 N–H RDCs and 25 Cα–Hα RDCs. Both ensembles fit the experimental RDCs to the same accuracy. |
The Profasi ensemble differs from the ensemble deduced by Blackledge and co-workers,11,44 which was mainly composed of random coil regions and three long helices. Their helices add up to 75% of the ensemble, and the longest helix has a population of 11% and ranges from residue 20 to 35. The robustness of their choice was checked by statistically significant improvement compared to other helical combinations. Despite Profasi being able to reproduce the folding of peptides and small proteins ab initio,45,60 it does not predict the long helical elements suggested by Blackledge and co-workers.
The introduction of the experimental data does not reweight all the structures equally, because the weight of a structure depends on its RDC values.
The set of RDCs forms a 31-component vector that is difficult to compare to weight of the structures. We can compress the information of this vector in its root-mean-square (RMS) value. If we plot the optimized weights vs. the RMS of the RDC vector for each structure, a clear trend appears (Fig. 6): the higher the RMS(RDC) the more reweighted the structure is. This makes sense, as reweighting a structure with small RDCs does not improve the fit. In other words, MaxEnt (or any other fitting procedure) is blind to structures that have low RDCs. Because RDCs can be scaled, “low” or “high” RDC refers to the value with respect to the other structures. As is well known, large RDCs correspond to long helices, and these structures are the ones MaxEnt finally re-weights to a larger extent.
Only 208 structures out of 8000 have a weight lower than 0.75 (see Fig. 7) when fitting N–H RDCs. Just by removing these structures from the ensembles and letting the others unchanged, the fit is almost as good as the optimized one in Fig. 4 (RMSD = 1.96 Hz compared to the optimized 1.00 Hz, Fig. S10, ESI†). The SS-map of these structures (Fig. S11, ESI†) reveals that these 208 structures are mainly long helices in the region of residues 32–40, just where the original Profasi ensemble gives RDCs that are too large. Thus the MaxEnt re-weighting agrees with our biophysical intuition.
Fig. 7 Left: ME fit (red) of the Campari ensemble to the experimental RDCs (blue). The unweighted ensemble is shown in green. Right: distribution of the weights after the optimization. |
We now turn to the comparison with the Campari ensemble. This comparison is illustrative because it allows disentangling the fitting procedure with prior distribution of the ensemble. Indeed, the comparison we did with Blackledge and co-workers was comparing a different ensemble and a different fitting procedure. This is a common practice in this field: different groups have developed sampling force fields and fitting procedures and the results contain information of both. For example, Forman-Kay group results are based on their ENSEMBLE selection procedure15,17 from a TRaDES force field16,20 generated structures. The present comparison will shed light on the information RDCs provide giving two different ensembles and the same fitting procedure.
The temperature of the Campari force field is better defined than that of Profasi, because the best fitting temperature corresponds to the experimental temperature. However, the initial ensemble has a worse agreement with the experimental N–H RDCs and therefore it needs a larger re-weighting (Fig. 7).
The secondary structure of this ensemble is considerably different from that of Profasi. It lacks the very abundant short helices of the Profasi ensemble and contains mainly helical fragments in the region of residues 22–32. This is, indeed, the region that the RDCs suggest should have helical fragments, and the region where Blackledge and co-workers deduced the helices were. There is a quantitative difference because the amount of helices in the Campari ensemble is lower than that obtained by Blackledge11 (see also Fig. 4 in ref. 59). However, it is true that both convey a similar ensemble, whereas the Profasi one is qualitatively different. Despite the differences, the Campari and the Profasi ensemble to fit N–H RDCs have similar scaling factors (α = 3.97 and 3.67, respectively).
As before, the initial ensemble is similar to the optimized one, so that because the original Profasi and the Campari ensemble differ, the optimized ensembles still differ, even qualitatively. Even using the same fitting procedure, the starting ensemble has a pervasive influence in the optimized one. This is because the MaxEnt principle minimizes the modifications to the original ensemble, but this is a positive quality because it avoids overfitting or biasing the optimization procedure.
Again, we can introduce the Cα–Hα RDCs to increase the number of experimental data. As with Profasi, the reweighting increases, but the final ensemble is qualitatively very similar to the original. The cross-validation with Cα–Hα RDCs shows that the Campari predicted values are closer to the experimental ones. In spite of being closer, the N–H RDC reweighted ensemble does not improve the Cα–Hα (Fig. S7, ESI†) in agreement with the results of Profasi, and suggesting that the Cα–Hα are independent of the N–H RDCs.
Despite the difference between the Campari and Profasi ensemble, it is worth emphasizing that both are able to reproduce the positive N–H RDCs in the central region, and that the MaxEnt re-weighted ensemble does not differ significantly from the original ones. This may seem disappointing – if we expected them to collapse to the same final ensemble – but it also shows that the initially generated ensembles are physically reasonable. Based on the relation between energy and probability ΔEi = −RTlog(wi/w0i), where w0i = 1/N, the energy difference for a reweighting of 0.5 is only 0.4 kcal mol−1. Unfortunately, if we want to predict secondary structure elements we need these force fields to do better, and the RDC data can be used to improve them. The weight distribution of IDP structures is not peaked as with folded proteins, and thus can be easily reweighted to fit experimental data. Therefore agreement with experimental data does not guarantee a real structural ensemble. If we expect insights from the simulated ensembles we need force fields to have more predictive power. Campari seems to be more successful in this respect.
The Campari ensemble is “simpler” to interpret, but this does not seem to us a valid reason to favour it. In contrast, the Profasi ensemble needs less re-weighting and thus has more predictive power. It is true, however, that the use of an artificially high temperature in the Profasi ensemble is introducing a parameter that the Campari force field predicts to a good accuracy and this can also be the cause for the higher errors of the Cα–Hα in the Profasi ensemble. The Profasi temperature was originally defined as the correct scaling parameter of the energy to reproduce the melting temperature of the Trp cage peptide.45 For IDPs maybe this parameter can be slightly scaled and it is then transferable to other sequences or maybe rescaling some of the energy terms results in a shifted temperature. Further systems need to be tested but our preliminary results suggest that the higher temperature is transferable among IDPs.
If, as before, we remove the structures that have w < 0.75 and leave the remaining unweighted, the fit of the Campari ensemble is very good (Fig. S12, ESI†). In this case, the number of structures removed is larger, 2074 out of 8000 (Fig. S13, ESI†). As with the Profasi ensemble, the structures that get a larger re-weighting are the ones that have larger RDC norm. The consistency of the re-weighting starting from different ensembles with different RDCs strengthens our confidence in the validity of the MaxEnt algorithm that we present.
Ideally, one wishes to start with a large pool of structures and let the data select the ones that agree with the ensemble. Different initial distributions should swamp to the same re-weighted distribution. Unfortunately, this is not the case: not even for folded proteins!41 RDCs do not convey enough information to make the initial distribution irrelevant. Our perspective is that the biophysical community has made heroic efforts in developing experimental techniques to probe IDPs, and then has hoped the data to speak by themselves, overlooking the influence of the prior distribution that the force fields produce.
Profasi and Campari can predict secondary structure elements in IDP ensembles based only on first principles, i.e. they can go beyond random coil force fields. But the ensembles they generate are different, and the RDC fitting cannot make them equal, not even similar. They do have an influence on the final ensemble that can fit the RDC data. This is not to say that the RDCs are not informative, but that the ensembles that fit the data combine the information from the RDCs with that of the force fields. Efforts should be made to improve both experimental methods and force fields. Indeed, we believe that the efforts in the latter field lag behind the experimental developments attained in the IDP world.
The agreement with both methods is very high (Fig. 8). We also see that the uncertainty in the weights is low compared to its dispersion. That confirms our assumption that this is not a key parameter. We found that the resulting FitEnsemble fit has much lower errors than the introduced experimental uncertainty. In particular, for an uncertainty of 1 Hz, the fit has a root-mean-square error of 0.2. Therefore we optimized our MaxEnt to a threshold of 0.1. For the FitEnsemble, we used a regularization strength of 3, as suggested by the authors but we checked that values of 0.3 and 30 essentially produced the same average results and the same dispersion.
Fig. 8 Comparison of the fitting of MaxEnt and FitEnsemble.38 FitEnsemble results include the estimated error from the Bayesian procedure, but it is of the order of the point size. |
The extension of FitEnsemble to include a scale parameter seems to be an interesting approach. Still, questions about the convergence of MCMC for RDC ensembles need to be addressed, as well as ensuring that it remains a computationally affordable method.
It has been claimed that RDCs are one of the best probes of IDPs' residual secondary structure,12 but other studies have questioned the relevance of RDCs in IDP modelling.17 Our results, with both a synthetic and an experimental data set, suggest that RDCs can shift the ensembles' secondary structure composition, but only to a limited extent. Different sets of RDCs – N–H and Cα–Hα – give complementary information and improve the reweighting; however the vast conformational space that IDPs can sample makes it a complex case of inferential structure determination,22 so that even with the large number of RDC experimental data, the amount of data is sparse compared to the size of the ensemble.40
Neither all-atom nor coarse-grained force fields have the precision to describe an IDP ensemble,61 as errors of 1 or 2 kcal mol−1 can significantly shift the populations of helices or other secondary or tertiary structure elements. Therefore the need to use experimental data to improve these ensembles is mandatory. But the experimental data is insufficient to fully determine this ensemble, and the pervasive influence of the force field cannot be overlooked, if we wish to have consistent representations of IDP ensembles.
Even though both Campari and Profasi predict certain secondary structure elements, their ensembles are qualitatively different. That determines the composition of the MaxEnt reweighted ensembles. The combination of Cα–Hα and N–H RDCs suggests that Campari is more suitable to describe IDPs than Profasi. We still need further work to test other force fields, improve them, and check other complementary sources of data that help up further select the ensembles. One of our future goals is to include SAXS and chemical shifts in our maximum entropy code.
For the sake of simplicity we will derive the gradient of f1 piecewise. We only consider when the argument in eqn (3) is larger than the threshold; otherwise the gradient is the null vector. The gradient of the average RDC is
Footnote |
† Electronic supplementary information (ESI) available: Details of the PALES calculation and Fig. S1 to S9. See DOI: 10.1039/c4cp03114h |
This journal is © the Owner Societies 2014 |