Adolfo
Bastida
*a,
José
Zúñiga
a,
Federico
Fogolari
b and
Miguel A.
Soler
*b
aDepartamento de Química Física, Universidad de Murcia, 30100 Murcia, Spain. E-mail: bastida@um.es
bDipartimento di Scienze Matematiche, Informatiche e Fisiche, Università di Udine, 33100 Udine, Italy. E-mail: miguelangel.solerbastida@uniud.it
First published on 19th August 2024
The characterization of the statistical ensemble of conformations of intrinsically disordered regions (IDRs) is a great challenge both from experimental and computational points of view. In this respect, a number of protocols have been developed using molecular dynamics (MD) simulations to sample the huge conformational space of the molecule. In this work, we consider one of the best methods available, replica exchange solute tempering (REST), as a reference to compare the results obtained using this method with the results obtained using other methods, in terms of experimentally measurable quantities. Along with the methods assessed, we propose here a novel protocol called probabilistic MD chain growth (PMD-CG), which combines the flexible-meccano and hierarchical chain growth methods with the statistical data obtained from tripeptide MD trajectories as the starting point. The system chosen for testing is a 20-residue region from the C-terminal domain of the p53 tumor suppressor protein (p53-CTD). Our results show that PMD-CG provides an ensemble of conformations extremely quickly, after suitable computation of the conformational pool for all peptide triplets of the IDR sequence. The measurable quantities computed on the ensemble of conformations agree well with those based on the REST conformational ensemble.
On the experimental side, nuclear magnetic resonance (NMR) spectroscopy is the most widely used tool for probing the properties of proteins with IDRs at atomic resolution.9,10 Chemical shifts (CSs), scalar couplings (SCs), and residual dipolar couplings (RDCs) provide averaged information about the backbone dihedral angle distributions and/or the long-range contacts between distant parts of the molecule. NMR data can be complemented by small angle X-ray scattering (SAXS) or small angle neutron scattering (SANS),13 which probe the apparent size of proteins in solution. Other techniques10 such as Förster resonance energy transfer (FRET) and nuclear paramagnetic relaxation enhancements (PREs) have been used much less extensively. In practice, the number of experimental observables is always many orders of magnitude smaller than the size of the conformational ensemble, so the IDR ensemble reconstruction from the measured data14,15 is a task full of uncertainties complicated by the fact that different ensembles may provide the same experimental results within the error resulting after averaging.10,14,16–19
On the theoretical side, molecular dynamics (MD) simulations are the standard tool to simulate the dynamics of a protein with IDRs. Their limitations are mainly concentrated on two independent factors,20–22 which are the inaccuracies of the force field and the difficulties in achieving statistical convergence of the structural sampling of the IDR. The recent optimizations of force fields specifically for IDRs have proved their reliability to reproduce accurately experimental structural data when the statistical sampling of the simulations is adequate.18 Therefore, the most important challenge currently facing MD is to achieve the statistical convergence of the generated conformational ensembles.
In this work, we compare the reliability and computational efficiency of different MD-based methods to build conformational ensembles, using as a quality criterion their ability to converge the NMR and SAXS variables. We selected these particular descriptors since they represent the true linkage with the experimental information of IDRs.18,22–26 As other collective variables, NMR and SAXS variables reduce the dimension of the conformational ensemble while keeping some representative structural information of the IDRs (see the Methods section 2.1). Extensive comparative analysis of these descriptors with other structural variables allows us to determine in this work the degeneracy of the conformational states and the loss of structural information associated with their employment.
In order to suppress the force field accuracy factor from our comparative study, we use as references the computational results for NMR and SAXS data obtained from replica exchange solute tempering (REST)27 simulations, since they provide at present the most accurate statistical sampling in IDRs.18,28,29 The other MD-based methods evaluated in this work are standard MD simulations, the Markov state model approach (MSM),30,31 and a novel method, called probabilistic MD chain growth (PMD-CG), both based on the flexible-meccano23,32–34 and the hierarchical chain growth24,26,35 approaches. Likewise, we have discarded in our comparative study coil libraries-based approaches since the force field accuracy as well as the statistical robustness of the coil libraries for certain amino acid triplets would hinder the direct comparison between the method performances and accuracy.
The IDR selected in this work is a 20-aa region (364–383) from the C-terminal domain of p53 tumor suppressor protein (p53-CTD). This region has been extensively studied36–39 due to its essential role in the functionality of p53 as a transcription factor. Nevertheless, the regulation mechanisms of p53-CTD are still a work-in-progress because of its structural versatility in adopting different secondary motifs when bound to receptors, while its structure remains disordered in solution.37,38
Overall, our work aims to be a practical guide for researchers who need to interpret experimental NMR and SAXS data of proteins with IDRs, assess the quality of the force fields used, analyze the effect of mutations in the protein chain or predict the presence of particular motifs in conformational ensembles, among other possible applications.
Our simulations also show that the averaged time required for at least one residue to change its conformation, is of the order of tens of picoseconds. Accordingly, the hypothetical simulation time required to visit at least once each molecular conformation is of the order of milliseconds. Considering that each conformation would have to be visited many times to obtain reasonable statistics, since the molecular conformations are not equiprobables, we conclude that generating a complete conformational ensemble of IDRs is computationally prohibitive.
The alternative is to perform a dimensionality reduction, i.e. to generate reduced conformational ensembles representative of the whole conformational ensemble and able to reproduce a given set of properties. The choice of the experimental observables described in the introduction as the average values to be reproduced seems a practical option. In fact this strategy has been used to evaluate the quality of conformational ensembles generated using different approximations.18,22–26
The second MD-based method is simply to perform standard MD simulations of the IDR molecule. We are aware that the accumulated simulation time (2 μs) employed for every approach for the sake of comparison in this work, may not be long enough to obtain an accurate statistical sampling18,29 in certain physical variables. However, this circumstance will allow us to verify the influence of two important factors on the sampling of the conformational space of the IDRs which are the choice of the initial configurations of the molecule24 and the time length of the individual trajectories.22,42 From a computational point of view, performing many short simulations instead of fewer longer ones is always advantageous since the calculations can be straightforwardly distributed without any lack of computational performance.
The third MD-based method considered is a Markov state model (MSM).30,31 MSMs have been used to describe conformational ensembles in earlier studies29,43–48 following a similar strategy. First, a long MD trajectory is carried out using an enhanced sampling MD simulation, and then the structures generated are clustered to create a MSM using some collective variables (CVs) from which the weights of each cluster of structures is evaluated. Of course, the success of this methodology strongly depends on the ability of the initial MD simulation to extensively explore the molecular conformational space, which is always questionable in proteins with IDRs. In this work we prefer to use the initial formulations of the MSMs based on the exploration of the conformational space using short simulations,49–51 where the collective variables are chosen a priori and an adaptive sampling method52 is used to force the system to visit conformational regions with low probabilities, that would be rarely visited in an unrestricted MD simulation.
The fourth MD-based method consists of using an enhanced sampling MD approach. In this case we focus on replica exchange solute tempering (REST) simulations, which have already been used to successfully generate conformational ensembles of IDRs.18,28,29,44 The REST method has been used27,53 as an efficient alternative to the more traditional replica exchange (or parallel tempering) MD (REMD) approach,54 in which the simulation of several replicas of the same system although at different temperatures is performed in parallel. After a certain period, the exchange between the coordinates of neighboring replicas is attempted with a Monte Carlo procedure. A known issue of the REMD method27 is that the number of replicas necessary to cover a certain range of temperatures grows with the system size (including the solvent), since the whole system is being accelerated by increasing the temperature. In the REST approach, only the protein–protein and protein–water interactions are scaled, so that the mimicked temperature is increased only on the protein while the temperature of the solvent is maintained. Accordingly, the replica with the lowest temperature (the unbiased replica in which protein and solvent have the same temperature) is the only one that provides a correct statistical distribution.
MD simulations were carried out with the molecules dissolved in water using the GROMACS package v2021.2.55,56 Each solute molecule was surrounded by a number of water molecules ranging from 900 to 12000 (depending on the length of the peptide) and placed in a cubic box of a size chosen to reproduce the experimental density of the liquid at room temperature. All the molecules were described using the CHARMM36m57 force field and the flexible TIP3P model was used for the solvent water molecules. This force field has been shown58,59 to provide a good representation of proteins with IDRs. Periodic boundary conditions were imposed in the simulations using the Particle–Mesh Ewald method to treat the long-range electrostatic interactions. All H–X bonds were kept fixed using the LINCS algorithm.60 The equations of motion were integrated using a time step of 2 fs. All simulations were carried out in a NVT ensemble by coupling to a thermal bath using the stochastic velocity rescaling method by Bussi et al.61
• An extended initial conformation was built using open-source Pymol v2.3.0.62
• Values of the ϕ, ψ and ω dihedral angles and the C–N, Cα–C and N–Cα distances were randomly selected from the corresponding histograms of the tripeptides taking into account the identity of the nearest neighbour residues. Angles and distances were modified using Pymol62 and Chimera v1.16,63 respectively.
• Non-backbone atoms were erased and the side chains were placed again using SCWRL4.64 Some tests were also performed using FASPR,65 showing that the results were largely unaffected by the choice of either of the two packages.
• The resulting molecular structures were tested for the presence of steric clashes using Chimera with default parameters. If a clash was detected, the structure was discarded. Approximately one third of the structures was found to pass the clash test.
These steps were repeated until generating 40 000 molecular conformations. A summary work-flow is shown in Fig. 2 to help understand how the PMD-CG method works in practice.
In principle, it is expected that the conformational ensembles from the three methods are equivalent at infinite simulation times but they may reach convergence at completely different time scales. The average values of J-couplings, chemical shifts, RDC and SAXS intensities obtained from each method are shown in Fig. 3. For the sake of clarity, the standard deviation bars are hidden for each average value in Fig. 3, while the average standard deviations of each experimental-based variable are shown in Fig. 4a for all three methods.
![]() | ||
Fig. 4 (a) Average standard deviations, and (b) root mean square errors (RMSE) of the NMR and SAXS variables obtained from REST, MD and Rg–γ MSM using the S-MD structures and PMD-CG approaches. RMSE values have been calculated with respect to the REST results. The standard deviation and RMSE units are the same as each variable unit as given in Fig. 3. The corresponding values for RDC have been divided by 10 and those for SAXS have been multiplied by 500. |
The variables studied plotted in Fig. 3 show, in general, similar profiles for the three different approaches. However, significant differences are found in some variables, such as the RDC and the SAXS intensities (Fig. 3f and j), in which the REST data clearly differ from those obtained from MD and MSM trajectories. The standard deviation calculated by block analysis is an indicator of the convergence of the associated average value.20 Comparison of the standard deviations associated with these (and indeed for all) variables in Fig. 4a shows that the REST results have the lowest values. This agrees with previous computational works in IDRs that demonstrate the high efficiency of the REST method to provide accurate conformational ensembles in affordable computational times.18,29,44 Also, the poor convergence obtained in other variables extracted from MD trajectories is expected, in agreement with the same previous works already showing that longer simulation times (from 5–30 μs) are needed to obtain convergent results from standard MD trajectories for IDRs of similar length. MSM methods have the highest average standard deviations in most of the variables.
We therefore conclude that the conformational ensembles of p53-CDT obtained from REST trajectories can be considered as the reference for the comparison with the other MD-based approaches. Accordingly, the results obtained from REST trajectories were used as a reference to calculate the root mean squared error (RMSE) for all the NMR and SAXS variables obtained from MD and MSM trajectories (see Fig. 4b).
3 J-Couplings measure the spin–spin interaction between atoms at a distance of 3 bonds, and they can offer local structural information via Karplus equations. Four backbone and one side-chain 3J-couplings have been evaluated from our trajectories. Certain J-coupling variables show higher differences than others. In particular, the 3JCi−1Hα, which involves the peptide bond between two residues, are those who show the highest RMSE among all J-couplings. The differences between the RMSE values of the MD and MSM results are small and not significant as accounted for by their standard deviations.
Chemical shifts from 13C NMR spectra have been used extensively in the structural studies of proteins. Their dependence on the protein structure is very sensitive and complex. Many structural factors, such as the backbone and the side-chain angles, the identity of neighboring residues, the interaction with aromatic groups, the hydrogen bonding, the solvent exposure or the geometric distortions, affect the chemical shift values. Among the 13C chemical shifts analyzed, the highest differences are found in Cβ where the values extracted from MD and MSM trajectories are consistently lower than those obtained with REST trajectories. Also, in this case the differences between the RMSE values extracted from the MD and MSM results are not significant.
The dipolar couplings between nuclear spins are typically averaged to zero by molecular tumbling in isotropic media. In weakly orientating media or as an effect of the molecular magnetic susceptibility anisotropy, residual dipolar couplings (RDC) can be present and measured in specific NMR experiments. Also, software based on the molecular geometry is available to predict the RDC measured for the weakly aligned molecule. The interesting feature of the RDCs, similar to that of the SAXS intensities, is that they depend on non-local conformational features, in contrast with the chemical shifts and the J-couplings, which depend mostly on the local conformation. The RMSE values are similar for both methods and both underestimate the RDC values, although the shifts are more erratic for those from the MSM values.
The scattering intensity obtained from the small angle X-ray scattering (SAXS) method provides information about the overall shape and size of the proteins. The SAXS profiles obtained from MD and MSM are very similar, although both differ from the reference REST results at low values of the scattering vector. The curve profile in this region is related to the globular, partially or completely unfolded shape of the protein. The Kratki representation of the data in Fig. S1 (ESI†) is generally used to better understand the folding state and flexibility of the proteins.75 The profiles obtained from MD and MSM correspond to pure unfolded protein profiles. On the other hand, the shallow maximum before the plateau in the REST profile may indicate that in these trajectories the peptide has explored certain folded conformations with higher probability.
In order to assess the significance of the errors affecting the predicted NMR observables, we considered the median of the experimental errors of the same observables as stated by the authors of entries deposited in the biological magnetic resonance database (BMRD).76 All entries were downloaded and the statistics of the stated errors in the relevant quantities were assessed. Apparently wrong – very large or non-numeric – values were removed from the list of stated errors and their mean and median were computed. Although the median and the mean were very close in all cases, the average is reported here as it is the most representative of the majority of frequently declared errors. These results are given in Table 1.
Exptl. quantity | Error |
---|---|
3 J Ci−1Hα | 0.35 Hz |
3 J HNC | 0.35 Hz |
3 J HαNi+1 | 0.25 Hz |
3 J HNHα | 0.5 Hz |
3 J HNCβ | 0.35 Hz |
δ Cα | 0.2 ppm |
δ Cβ | 0.2 ppm |
δ C | 0.2 ppm |
RDC (HN-N) | 1 Hz |
It is apparent that for the J-couplings, the RMSEs behave well with respect to the REST results, within the mean experimental errors, which means that all the approaches reproduce experimentally the distribution of the intervening torsion angles. Similarly, the RMSE of the selected carbon chemical shifts are within the median error (0.2 ppm). For the RDCs the RMSE of MSM and MD methods are circa three times larger than the median error value of BMRB entities, showing that the differences are significant. As for the SAXS measurements, the estimation of the experimental error is complex and many mathematical models have been proposed for this task. A recent model77 that takes into account the measurement process and the setup geometry, estimates the experimental relative error under different measurement conditions to always be less than 1% for small scattering angles, i.e. where our results show the largest differences. The SAXS RMSE values of MD and MSM methods show significantly highest values (4 × 10−4 a.u.) than the 1% of the intensity values obtained at q < 0.2 Å.
To better understand the relationship between the NMR and SAXS variables and the structural ensembles provided by the MD-based trajectories, three structural collective variables, the radius of gyration (Rg), the end-to-end distance and the asphericity, are depicted in Fig. 5.
The results obtained for the Rg and the asphericity contrast with the general trend observed for the NMR and SAXS variables, which show similar RMSE results for the MD and MSM trajectories. The curves extracted from the MD trajectories show a shoulder at low Rg values while achieving their maxima at higher values than the curves provided by MSM trajectories. This profile is similar to that obtained from the reference REST simulations. A similar behaviour is found for the asphericity curves. Both variables provide slightly different structural information. Rg focuses on the compactness of the molecule, with 0 corresponding to the maximum of compactness, while the asphericity indicates how much the protein deviates from the spherical shape, with 0 corresponding to a perfect sphere and 1 corresponding to a thin rod. Therefore, the population of extended structures seems to be significantly higher in the conformation ensembles obtained from MD and REST trajectories than the one obtained using the MSM method. Moreover, the maxima of the end-to-end distance curves show a shift towards higher distances for the REST curve, as compared with those from the other methods, indicating that there are certain differences in the conformations explored via the different methods.
In order to complete our analysis, three additional structural descriptors, i.e. the solvent-accessible surface area (SASA), the number of H-bonds and the secondary structure content (helix and strand), were evaluated along the trajectories for each method (see Fig. S2, ESI†). A comparison of the distribution of the solvent-exposed surface values among the three protocols is complementary to the end-to-end distance results, in which the maxima of MD and MSM curves are shifted with respect to the reference one (REST). The distribution of the number of H-bonds and the secondary structure content give specific, meaningful information about the probability of the formation of compact, structured conformations inside the conformational ensembles generated with each MD-based protocol. Both descriptors show the higher probability of secondary structures containing a high number of H-bonds (4–6) in the REST conformational ensemble with respect to those obtained from MD and MSM.
Overall, the MD and MSM results provide similar descriptions of the variables considered, with reasonable estimates of the J-couplings and the chemical shifts which are related to local structural information, whereas the RDCs and SAXS profiles, which are related more to the global shape and orientation of the disordered peptide, show significant deviations from the reference REST results.
As observed in Fig. 4, the backbone J-coupling RMSE values are similar to those from the other methods. However, for two J-couplings, the RMSE values are higher than those obtained from the MD and MSM trajectories. As for the chemical shifts, the RMSE values are greater for the 3 C atoms. Moreover, Fig. 3 shows a general shift towards higher values in δCβ, while the shift occurs towards lower values in δC. There are also shifts in δCα with respect to the reference values but without any clear trend. The overall differences in chemical shifts and J-couplings are consistent with lesser sampling of compact conformations, such as alpha-helix structures. For the RDC and the SAXS variables, the PMD-CG conformational ensemble is significantly more accurate than the ensembles obtained from MD and MSM.
Likewise, we have calculated the three structural variables from the PMD-CG conformational ensemble, with their curves included in Fig. 5. For the Rg and end-to-end curves, their maxima are located at similar values compared with the reference maxima in the REST method. However, significant differences are found in the shape of the curves, with the narrower PMD-CG curves showing a significantly lower population for the compacted conformations, in consonance with the differences observed in the chemical shifts and the J-couplings. The same behavior appears in the SASA distribution curves in Fig. S2 (ESI†), in which the lowest values of SASA correspond to the compacted conformations. The analysis of the distribution of H-bonds and secondary structure content probability in the PMD-CG ensembles (Fig. S2b and c, ESI†) just indicates the very low H-bond formation probability. This result is expected due to the stochastic nature during the construction of the IDR conformation in the PMD-CG protocol, since the creation of specific H-bonds between residues within the distance and angle cut-offs would require the support of post hoc, refinement protocols, such as energetic minimizations, or short MD simulations. In contrast, the asphericity data show that the shape distribution is similar to those from the REST and MD trajectories.
It should be highlighted that the differences between the PMD-CG and REST results are not due to convergence issues (as seen in Fig. 4a), as partially occurs with MD and MSM results, but rather to the accuracy of the PMD-CG approach. Two different factors can mainly influence the quality of the results, the prediction of the backbone conformational distribution from the tripeptide library, and the prediction of the side-chain distribution by a side-chain predictor, i.e. Scwrl4. To evaluate separately both factors, we have analyzed the distribution of the dihedral angles ϕ, ψ, and χ1 of the peptide in the different conformational ensembles. In Fig. 6 we depict the Pearson correlation of the Ramachandran histograms and the χ1 histograms from MD, MSM, REST and PMD-CG ensembles with respect to the tripeptide histograms. These results show that the backbone dihedral distribution of the tripeptides and the p53-CTD obtained from the 4 methods are similar (Fig. 6a), in agreement with our previous work.40 Therefore, the lower representation of more compacted conformations in the conformational ensemble obtained from the PMD-CG seems to be uncorrelated with possible inaccuracies to predict the backbone conformational ensemble of the peptide.
The correlations of the χ1 distributions obtained from MD, MSM and REST with respect to the tripeptide distributions are also very high (Fig. 6b). However, the distribution obtained from the PMD-CG method shows very low correlations for all the p53-CTD residues, indicating that the side-chain dihedral distributions obtained from the Scwrl4 are significantly different compared with the distributions obtained from the other methods and also of those calculated with the tripeptides. Direct comparison between the χ1 distributions from the REST and PMD-CG conformational ensembles in Fig. S3 (ESI†) shows that for both distributions the maxima of the curves are located in similar χ1 values, but the curve widths of the PMD-CG ensemble are significantly narrower than those obtained from the REST method, as a consequence of building side chains from a library of rotamers. A similar conclusion holds for independent molecular dynamics simulations when side-chains are reconstructed on a set of 55 diverse proteins78 and the predicted side-chain dihedral angles are compared with the experimental values. The distributions for the χ1 angles of the Scwrl4 side-chain peaks at about the same values as the original distributions, although with a significantly lower dispersion (see Fig. S4, ESI†).
The differences in the conformational distribution of the amino acid side-chains may certainly affect the accuracy of the J-coupling and chemical shift results, which take into account the local structural environment. However, the lower population for the compacted conformations observed in the end-to-end descriptor only depends on the backbone of the conformational distribution. This behavior could be related to the rejection of locally clashing compact conformations in the peptide building phase of the PMD-CG method (see the Methods section 2.3.2). To confirm this, we have performed further analysis of the distances between the backbone Cα atoms of residues separated by a gap of 2 amino acids (i.e. between ith and ith + 3 residues). The probability distribution of interatomic distances in Fig. S5 (ESI†) indeed shows high similarities between the REST and the PMD-CG ensembles except for the lowest values, associated with compacted peptide structures, for which the REST ensemble shows probabilities slightly higher than those from the PMD-CG curves. These results confirm the under-population of the most compacted peptide structures in the PMD-CG method, due to their rejection during the clashing test phase. When assembling the fragments, we note that very small variations – of a few units of degrees – in the local geometry can lead to significant displacements along the peptide chain, and therefore to clashes. This issue is enhanced when constructing compact conformations, such as the α-helix structure, in which the interatomic distance windows are significantly narrow.
First, we observe that the standard deviations of the MD and MSM results obtained using the PMD-CG structures are smaller than those coming from the S-MD ones, that is, there is an improvement in the convergence. And more importantly, the RMSE values for the RDC and SAXS magnitudes are substantially reduced when employing the PMD-CG conformational ensemble as a pool of MD starting structures and are now similar to the mean experimental errors. Indeed, the curves of the three structural variables obtained from MD and MSM trajectories starting from the PMD-CG conformations (see Fig. S6, ESI†) are more similar to those from the REST trajectories than the values obtained from the original MD and MSM trajectories starting from the S-MD structures. On the other hand, J-couplings and chemical shifts, which account for local structural information, show, in general, minor variations in their performance. Nevertheless, the RMSE values of those variables previously showing the highest values, such as 3JCi−1Hα and δCβ, now decrease. We have also observed (results not shown) that the use of PMD-CG starting structures, instead of the S-MD ones, in the REST method barely affects the results as expected from the ability of the method to explore large regions of the conformational space independently of the initial structures of the replicas.
Other parameters that can be explored to optimize the performance of the MD simulations in IDRs are the number of trajectories and the trajectory lengths, while keeping constant the accumulative simulation time (2 μs). The use of many independent short MD simulations rather than few long-time trajectories in biomolecular systems has already been recommended.22 Nevertheless, the employment of very long MD time lengths in IDRs is still common practice.18,29 Accordingly, two additional sets of MD simulations, i.e. 20 trajectories at 100 ns and 2000 trajectories at 1 ns, have been performed. Likewise, both MD sets have used the PMD-CG conformational ensemble as a selection pool of starting structures. The statistical sampling evaluation of the NMR and SAXS variables have been added to Fig. 7c and d. These results show that the different time length of the trajectories indeed influences the sampling quality of NMR and SAXS variables. The standard deviations obtained from the 100 ns trajectory set are clearly the highest. On the other hand, the 1 ns trajectory set has the lowest standard deviation values in all calculated variables. This confirms that the use of a larger number of different initial structures favors the exploration of the conformational space and therefore accelerates the convergence of the results. The comparison with the reference average values (Fig. 7d) shows that the accuracy of certain variables to reproduce the reference values are more sensitive than others from the different simulation time lengths. Thus, RDC and SAXS intensities show again the greatest differences among all the different conformational ensembles, following the same trend of the standard deviations, i.e. the shortest the time length and the highest the number of independent trajectories the lowest the RMSE values with respect to the REST values. Regarding the NMR variables offering local structural information, the influence of the time length is just minor, since similar RMSE values are obtained in all J-couplings and chemical shifts for the 10 and 1 ns trajectories. Nevertheless, δCβ shows again certain differences, since the RMSE obtained from 100 ns trajectories is higher than the other two.
As far as the MSM approach is concerned, the employment of different pairs of collective variables to map the conformational space has also been explored to optimize the statistical sampling of the MSM conformational ensembles. Thus, two additional sets of MSM trajectories have been performed by using the pairs of CVs Rg–dee and dee–γ, respectively. The resulting convergence study of the NMR and SAXS variables depicted in Fig. S7 (ESI†) shows that the employment of different CVs in the MSM approach barely affects the accuracy of the analyzed variables.
We note that the computational effort of the PMD-CG method increases linearly with the number of residues in the molecule, while the computational effort required in MD-based simulations increases much faster,79 so the 1.4 factor increases with the length of the peptide. We also recall that running many short-time trajectories instead of fewer long-time trajectories allows us a more effective distribution of the computational effort in any multicore system.
The great potential of the PMD-CG method performance is apparent when considered under the scenario of mutagenesis studies: a single point mutation in p53-CTD implies, in MD, MSM and REST approaches, the simulation re-run of the p53-CTD mutant (the same computational times for each mutation). However, a single-point mutation involved in the PMD-CG method requires the re-run of only three new tripeptides, i.e. 0.17 × tPMD-CG. Moreover, if a MD conformational ensemble database of all the 800 possible tripeptides was available for the community, a complete mutagenesis study of an IDR could be done in a number of hours. IDRs can tolerate, in general, a high number of mutations without substantial loss of flexibility and function. However, there are certain molecular recognition features within IDR sequences that are highly conserved and seem essential for the correct function of the protein.37 The complexity of the problem could be addressed using an exhaustive computational mutagenesis approach enabling fast identification of pathogenic mutations.
Using the 20-residue region from the C-terminal domain of p53 tumor suppressor protein as a reference system, the results obtained from standard MD and MSM protocols provide reasonable values for the J-couplings and chemical shifts, although fail to describe the RDC and SAXS profiles. The PMD-CG method provides a good representation of the calculated RDC and SAXS observables but lower quality values for two 3J-couplings and especially for the chemical shifts. The origin of this failure is the limited representativity of the distributions of the χ1 dihedral angles provided by the libraries used during the side-chain construction, and the generation of clashes during the chain building procedure that decrease the relative presence of more compact structures. In future work we intend to address the above limitations of the approach by employing structural minimization and/or short MD simulations that can correct the possible inaccuracies of the method for building the chains leading to misrepresentations of more compact conformations. Moreover, additional non-homologous IDR sequences with different lengths should be tested in future to prove that our conclusions have validity for IDRs with different structural features and functionalities.
The choice of the PMD-CG structures as the starting point in the MD and MSM calculations is shown to be a good strategy that greatly improves the results, especially for the RDC and SAXS profiles, as previously argued by Hummer and collaborators.24 Moreover, the use of many short-time trajectories in MD simulations is advantageous with respect to the use of fewer long-time ones, since they provide a wider exploration of the conformational space. Under these circumstances, the employment of a representative conformational ensemble as a pool of starting structures for many short MD simulations should become the “best practice”, rather than the run of few, very long MD simulations. Being aware of the computational cost required to construct the conformational ensemble pool, the use of the PMD-CG method, as proposed in this work, would greatly enhance the study of IDRs, due to its efficient performance.
Footnote |
† Electronic supplementary information (ESI) available: Kratky plot obtained from REST, MD, MSM and PMD-CG conformational ensembles (Fig. S1), distribution of SASA, number of H-bonds and secondary structural elements obtained from REST, MD, MSM and PMD-CG conformational ensembles (Fig. S2), histograms of χ1 dihedral angles obtained from REST and PMD-CG conformational ensembles (Fig. S3), histogram of χ1 angles of all amino acids obtained from the experimental 55 protein structures and after side-chain reconstruction using Scwrl4 (Fig. S4), distribution of Cα–Cα interatomic distances obtained from REST and PMD-CG conformational ensembles (Fig. S5), structural variables obtained from REST, MD and MSM trajectories starting from PMD-CG structures (Fig. S6), and comparison of the average standard deviations and RMSE of the NMR and SAXS variables obtained from MSM trajectories that employ different pairs of collective variables (Fig. S7). See DOI: https://doi.org/10.1039/d4cp02564d |
This journal is © the Owner Societies 2024 |