Dániel Kovács (a, b) and Andrea Bodor* (a)
(a) ELTE, Eötvös Loránd University, Institute of Chemistry, Analytical and BioNMR Laboratory, Pázmány Péter sétány 1/A, Budapest 1117, Hungary. E-mail: andrea.bodor@ttk.elte.hu
(b) Eötvös Loránd University, Hevesy György PhD School of Chemistry, Pázmány Péter sétány 1/A, Budapest 1117, Hungary
First published on 31st March 2023
In studying secondary structural propensities of proteins by nuclear magnetic resonance (NMR) spectroscopy, secondary chemical shifts (SCSs) serve as the primary atomic scale observables. For SCS calculation, the selection of an appropriate random coil chemical shift (RCCS) dataset is a crucial step, especially when investigating intrinsically disordered proteins (IDPs). The scientific literature abounds in such datasets; however, the effect of choosing one over the others in a concrete application has not yet been studied thoroughly and systematically. Here, we review the available RCCS prediction methods and, to compare them, conduct statistical inference by means of the nonparametric sum of ranking differences and comparison of ranks to random numbers (SRD-CRRN) method. We seek the RCCS predictors that best represent the general consensus regarding secondary structural propensities. The existence and magnitude of the resulting differences in secondary structure determination under varying sample conditions (temperature, pH) are demonstrated and discussed for globular proteins and especially IDPs.
$$\mathrm{SCS}_{Ai} = \delta_{Ai,\mathrm{measured}} - \mathrm{RCCS}_{Ai} \tag{1}$$
Many IDPs/IDRs have been shown to play important biological roles,11–14 which created the need to study and assess their physical, chemical and biological behavior. An important goal is to make IDPs pharmaceutical targets.15–24 In this respect, and despite its inefficiency, the earlier sequence–structure–function paradigm is still used as the starting point of the characterization.25 This means that the investigation focuses on finding remnants of structural features. Even though IDPs are generally disordered, they can exhibit inherent structural preferences26 that are referred to as secondary structural propensity, residual structure or transient structure. Apart from an inclination toward the well-known, another reason for identifying regions with structural propensities is that such regions are usually the ones involved in protein/protein or protein/membrane interactions. Typical sequential regions with detectable structural propensities are the so-called preformed interaction prone fragments, preformed structural motifs and short linear motifs, which are generally crucial for the function of IDPs/IDRs.27–31 Thus, the regions of modest structural propensity are also expected to be key to understanding interactions and achieving the druggability of IDPs. For this, an efficient experimental characterization of these regions is necessary, which most tools of protein research are unable to accomplish. Due to the high flexibility of IDPs,32 classical methods, which presuppose a rather rigid three-dimensional structure, are inadequate.33 The necessity of describing IDPs in terms of structural ensembles instead of single structures especially calls for multiple sources of experimental data,34 further increasing the importance of NMR. Consequently, the correct interpretation of the SCS is necessary.
A typical evaluation of the SCS values calculated from eqn (1) is their graphical representation as a function of the amino acid sequence. The positive or negative sign indicates the type of secondary structural propensity, while the amplitude shows the strength of this propensity. Data can also be interpreted indirectly by calculating some function of the SCSs, such as (i) the probability of the presence of different secondary structural elements,35 (ii) the so-called CheZOD Z-score,36 (iii) the secondary structure propensity score (SSP),37 (iv) the neighbor-corrected structural propensity score,38 (v) the chemical shift index (CSI),39–42 (vi) the random coil index (RCI)43–45 and (vii) probability-based secondary structure identification (PSSI).46 Lately, attempts have been made to base disorder prediction exclusively on Z-score values, and thus exclusively on SCSs.47,48 Moreover, SCSs are used in software such as SHIFTX,49 NMRView,50 PESCADOR51 and DIPEND.52 The existence of such advanced methods indicates the importance of calculating SCS values appropriately. However, SCSs are ambiguous quantities, and the ambiguity arises already from their definition (see eqn (1)) through the involvement of RCCSs. The real value of the RCCS of a given atom, for a given amino acid, under given experimental conditions is not well-defined. This is reflected in the number of RCCS calculation methods that have been proposed over the last few decades by several authors.42,46,53–68 This lack of consensus on RCCS values causes the aforementioned ambiguity of SCSs. In addition, experimental aspects such as CS referencing and signal assignment also contribute to the uncertainty of SCS values.69,70 Based on all this, the arising questions are: how much does the RCCS-related ambiguity influence secondary structure determination, and what can be done to eliminate or at least reduce this effect?
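As a minimal sketch of eqn (1) and of the sign convention described above, the snippet below computes Cα SCS values for a few residues. The shift values and the function name `secondary_chemical_shifts` are illustrative assumptions and do not come from any of the cited datasets. For Cα, positive values are conventionally read as helical and negative values as extended (β) propensity.

```python
def secondary_chemical_shifts(measured, random_coil):
    """Return SCS = measured - RCCS for every residue present in both inputs."""
    return {res: round(measured[res] - random_coil[res], 3)
            for res in measured if res in random_coil}

# Hypothetical C-alpha shifts in ppm; placeholders for illustration only.
measured = {"Ala18": 53.1, "Gly36": 45.0, "Val40": 61.2}
random_coil = {"Ala18": 52.5, "Gly36": 45.1, "Val40": 62.3}

scs = secondary_chemical_shifts(measured, random_coil)
for residue, value in scs.items():
    if value > 0:
        tendency = "helix-like"
    elif value < 0:
        tendency = "extended-like"
    else:
        tendency = "none"
    print(residue, value, tendency)
```

Swapping only the `random_coil` table while keeping `measured` fixed is the simplest way to see how the choice of RCCS dataset propagates into the SCS values.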
To address this problem, a comparative study of different RCCS datasets and calculation methods is necessary. Only very few works focus explicitly on the comparison of different RCCS prediction methods.71–75 Usually, such issues constitute marginal parts of the papers introducing new predictors.58,60,63–66,76
We intend to fill this gap, and we propose to discuss RCCS predictor development as a calibration problem. We give an overview of the theoretical and experimental background of the presently available RCCS predictors, focusing on their differences. We then provide case studies demonstrating how the different RCCS prediction methods influence the secondary structure or structural propensity assessment of a protein. As examples, we chose well-known and extensively studied proteins: the folded ubiquitin, and the IDPs α-synuclein and the transactivation domain of p53. At the same time, we highlight the effect of experimental conditions by considering various pH values and temperatures. By means of statistical inference, we try to determine which RCCS predictor, if any, best represents the consensus of multiple predictors for a given experimental dataset. In light of these results, one can choose and apply several predictors simultaneously, a practice that is so far uncommon but useful.
Historically, two main conceptual approaches were considered for the calibration of RCCSs. One approach is based on designing small peptides whose behavior is assumed to best represent the most disordered state any polypeptide might adopt.53–56,58,64,68,76 The other approach involves compiling a protein chemical shift database followed by a statistical analysis of the data.42,46,57,59–63,65–67 This approach has become increasingly popular with the growing number of IDP-related datasets in the Biological Magnetic Resonance Databank (BMRB).
Following the choice of suitable model systems, another issue is how to take experimental conditions into account. So far, according to the literature, temperature and pH have been considered directly and ionic strength indirectly.62,65,66,76,83 Besides these experimental parameters, the local amino acid sequence has an impact on the CSs of an individual residue in the polypeptide chain and has been accounted for in some methods.
The small peptide approach has the advantage that the tabulation of RCCSs is very straightforward and requires little or no computation. Also, one has extensive control over experimental conditions, as the respective values of pH, temperature and ionic strength may all be precisely adjusted. On the other hand, it is much more difficult to cover the local compositional space of proteins. For example, if 20 amino acids and only the nearest neighbors are considered, 20³ = 8000 combinations must be examined. Accounting for the neighboring ±2 amino acids, this number jumps to 20⁵ = 3.2 million. As it would be very time-consuming and costly to produce so many different peptides, in the studies done so far authors designed a given polypeptide frame (for example Gly–Gly–Xxx–Ala–Gly–Gly)56 and varied a single amino acid in a central position. The sequential effect of the different amino acids on their neighboring partners is evaluated by their effect on the amino acids of the polypeptide frame, in the abovementioned case on glycines and alanine. However, the effect of a given amino acid on its neighbors also depends on the identity of the neighbors. Such pairwise and n-wise relationships are impossible to account for by the small peptide approaches utilized so far.
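The size of the local compositional space follows from simple counting: a central residue with k neighbors on each side gives 20^(2k+1) possible sequences. The short sketch below reproduces this arithmetic; the function name is a hypothetical helper introduced only for illustration.

```python
AMINO_ACIDS = 20

def local_sequence_count(flanking):
    """Distinct sequences for a central residue with `flanking` neighbours per side."""
    return AMINO_ACIDS ** (2 * flanking + 1)

nearest_neighbours = local_sequence_count(1)   # tripeptide window
next_neighbours = local_sequence_count(2)      # pentapeptide window
print(nearest_neighbours, next_neighbours)
```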
In contrast, database-related statistical approaches have the opposite strengths and weaknesses. With large numbers of CSs available for extensive numbers of proteins, the compositional space is much better covered than in peptide-based studies. The same local sequence may appear numerous times, therefore chemical shift values for all the involved amino acids are observed numerous times as well. A large enough database even enables the determination of pairwise or n-wise correction terms for the effect of the local sequence. The drawback is that the effect of experimental conditions is generally difficult to account for, as these parameters usually vary from entry to entry. Also, since database approaches directly use chemical shifts of proteins for calibration, it could be argued that the resulting RCCS values are more appropriate for studying proteins than RCCSs originating from small peptide studies. However, even if chemical shift data of IDPs are used,63,65,66 it is not guaranteed that all CSs are RCCSs, because of residual structural motifs in IDPs.84 This requires authors to filter the data in some manner that decomposes the measured CSs into RCCSs and the different contributions of the experimental conditions, the local sequence and, most importantly, residual structure.63,66,85,86 Loop regions, denatured proteins and even some peptides have been shown not to be completely disordered.87–102
In light of the above, Table 1 summarizes the works that have been carried out with the aim of calibrating RCCSs. One can observe that database-derived, statistical approaches have recently been gaining popularity. Interestingly, except for the work of Bundi et al.,54 systematic pH and temperature corrections only became available in the 2000s, while the effect of the local sequence was already considered in the work of Braun et al.55 in 1994. Sequence corrections became more elaborate in later RCCS predictors.
Method | Year | Ref. | Type of system | Sequence correction | Temperature correction | pH correction | HN | Hα | Cα | Cβ | C′ | N |
---|---|---|---|---|---|---|---|---|---|---|---|---|
McDonald | 1969 | 53 | Free amino acids, different small peptides | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ |
Howarth | 1978 | 67 | Peptides and denatured proteins in D2O | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✗ |
Richarz | 1978 | 68 | Small peptide (GG-X-A) | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✗ |
Bundi | 1979 | 83 | Small peptide (GG-X-A) | ✗ | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ |
CSI | 1992, 1994 | 41 and 42 | Globular proteins | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Braun | 1994 | 55 | Small peptide (GG-X-A) | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
Wishart | 1995 | 56 | Small peptide (GG-X-A/P-GG) | ✓ | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Lukin | 1997 | 57 | Database (BMRB and literature) | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Schwarzinger | 2000 | 58 | Small peptides, in acidic 8 M urea | ✓ | ✗ | ✗ | ✓ | ✓ | ✓ | ✗ | ✓ | ✓ |
PSSI | 2002 | 59 | Database (selected BMRB) | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Wang | 2002 | 60 | Database (selected BMRB) | ✓ | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
RefDB | 2003 | 61 | Selected BMRB data (database) | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Wang L. | 2006 | 46 | Proteins from refDB | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ |
Camcoil | 2009 | 62 | Loop regions of globular proteins (selected BMRB) | ✓ | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
ncIDP | 2010 | 63 | IDPs (mostly BMRB) | ✓ | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Kjaergaard | 2011 | 64 and 76 | Small peptides | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Prosecco | 2017 | 65 | IDPs (selected BMRB) | ✓ | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Potenci | 2018 | 66 | IDPs (ncIDP extended) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
In this work, we focus on free proteins in aqueous solution under the typical conditions of NMR studies. We are aware of works concerning RCCSs under high pressure,103–107 for phosphorylated108 and posttranslationally modified109 amino acids, and in the presence of organic solvents;110,111 however, we do not discuss these special cases here.
Below, we provide a brief overview of the approaches regarding RCCS corrections for the three aforementioned factors: local sequence, pH and temperature.
The first sequence corrections of RCCSs were suggested by Braun et al.55 Considering the Gly–Gly–Xxx–Ala and Gly–Gly–Gly–Ala sequences, corrections were defined as:
$$\Delta\delta_{X} = \delta^{\mathrm{GGXA}}_{\mathrm{A,N}} - \delta^{\mathrm{GGGA}}_{\mathrm{A,N}} \tag{2}$$
Although the calculation is very straightforward, this approach assumes that the local sequence effect experienced by alanine because of residue X is equal to what the remaining 19 amino acids would experience, which is not necessarily the case. Also, taking only the preceding residue into account might be plausible for the amide nitrogen, but not for the other atom-types of the peptide backbone.66
A similar approach was used in the work of Wishart et al.56 considering Gly–Gly–Xxx–Ala–Gly–Gly and Gly–Gly–Xxx–Pro–Gly–Gly hexapeptides. Sequence correction was given as:
$$\Delta_{X} = \delta_{X\mathrm{A}} - \delta_{X\mathrm{P}} \tag{3}$$
In 2000, Schwarzinger et al. used Gly–Gly–Xxx–Gly–Gly constructs and provided sequence correction terms for all four of the i − 2, i − 1, i + 1 and i + 2 positions.58 These correction terms were calculated as the CS differences of Gly1, Gly2, Gly4 and Gly5, respectively, between the Gly–Gly–Xxx–Gly–Gly peptide and the reference Gly–Gly–Gly–Gly–Gly peptide, according to the set of equations below:
$$A = \delta(\mathrm{G1}) - \delta(\mathrm{G1_{ref}}) \tag{4}$$
$$B = \delta(\mathrm{G2}) - \delta(\mathrm{G2_{ref}}) \tag{5}$$
$$C = \delta(\mathrm{G4}) - \delta(\mathrm{G4_{ref}}) \tag{6}$$
$$D = \delta(\mathrm{G5}) - \delta(\mathrm{G5_{ref}}) \tag{7}$$
$$\delta_{R}(\mathrm{corrected}) = \delta_{\mathrm{random}}(R) + A + B + C + D \tag{8}$$
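The additive scheme of eqn (8) can be sketched in a few lines. In this illustration, `neighbour_terms[residue][offset]` is a hypothetical lookup for the correction that a residue located at the given relative offset contributes to the central residue; the table layout, the function name and all numbers are assumptions made for the example and are not the published Schwarzinger values.

```python
def corrected_rccs(sequence, index, base_rccs, neighbour_terms):
    """Intrinsic RCCS of sequence[index] plus corrections from i-2, i-1, i+1, i+2."""
    shift = base_rccs[sequence[index]]
    for offset in (-2, -1, 1, 2):
        j = index + offset
        if 0 <= j < len(sequence):  # residues near the termini have fewer neighbours
            shift += neighbour_terms[sequence[j]][offset]
    return shift

# Hypothetical C-alpha values in ppm, for illustration only.
base_rccs = {"G": 45.1, "A": 52.5, "V": 62.2}
neighbour_terms = {
    "G": {-2: 0.0, -1: 0.0, 1: 0.0, 2: 0.0},      # glycine as the reference residue
    "A": {-2: -0.02, -1: -0.05, 1: 0.03, 2: 0.01},
    "V": {-2: 0.01, -1: -0.10, 1: 0.04, 2: 0.00},
}

# Val at index 2 of G-A-V-G: only the Ala at i-1 contributes a non-zero term.
val_shift = corrected_rccs("GAVG", 2, base_rccs, neighbour_terms)
print(round(val_shift, 2))
```

Because every term is taken from a single-residue table, this kind of scheme cannot express how the effect of one neighbor depends on the identity of another.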
The first statistical approach to sequence correction was given by Wang et al.60 They used more than 200 000 chemical shifts from BMRB to calibrate average random coil chemical shifts, pairwise sequence correction terms, average secondary structural chemical shift terms and terms for pairwise sequence effects in the random coil, β-strand and α-helical states. We note that in this case the definition of RCCS depends heavily on the identification of residues in the random coil state by means of VADAR, DSSP and PSSI.59,116,117 Using this definition of disorder, eqn (9) and (10) could be parametrized:
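$$\Delta(XY)_{n,s} = \langle \delta_{n,s}(X) \rangle - \langle \delta_{n,s}(\text{w/o } X) \rangle \tag{9}$$
$$\Delta(YZ)_{n,s} = \langle \delta_{n,s}(Z) \rangle - \langle \delta_{n,s}(\text{w/o } Z) \rangle \tag{10}$$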
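One possible reading of eqn (9) is that the correction equals the mean shift of a central residue Y when it is preceded by X minus its mean shift when it is not. The sketch below implements this reading for a toy list of observations; the tuple layout, the function name and the numbers are assumptions, and the filtering by nucleus n and structural state s is taken to have been done beforehand.

```python
from statistics import mean

def preceding_residue_correction(observations, central, neighbour):
    """Mean shift of `central` preceded by `neighbour` minus the mean without it."""
    with_x = [s for prev, res, s in observations
              if res == central and prev == neighbour]
    without_x = [s for prev, res, s in observations
                 if res == central and prev != neighbour]
    if not with_x or not without_x:
        raise ValueError("both groups need at least one observation")
    return mean(with_x) - mean(without_x)

# (preceding residue, central residue, shift in ppm); hypothetical values.
observations = [
    ("P", "A", 52.0),
    ("P", "A", 52.2),
    ("G", "A", 52.6),
    ("L", "A", 52.8),
]
delta = preceding_residue_correction(observations, "A", "P")
print(round(delta, 2))
```

The same function applied to the succeeding residue instead of the preceding one would give the analogue of eqn (10).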
The next method including sequence correction terms in RCCS calculation was Camcoil,62 published by De Simone et al. in 2009. Therein, a database of 1772 BMRB entries belonging to proteins possessing PDB entries was used. Residue specific RCCSs were calculated as the average CSs of the given residue found in the coil regions of the considered proteins. Pairwise correction contributions for the preceding and succeeding residue were calculated by averaging for CSs of the appropriate residue pairs in the database. Moreover, a set of atom-type dependent weight factors for these correction terms was proposed for the predicted RCCS of atom i of residue A in a BAC peptide triplet:
$$\delta^{\mathrm{RC}}_{iA} = \delta^{0}_{iA} + \alpha^{-}_{i}\,\delta^{1}_{iBA} + \alpha^{+}_{i}\,\delta^{1}_{iAC} \tag{11}$$
The ncIDP method by Tamiola et al. also includes sequence corrections.63 These are obtained by solving a set of equations of the following form:
$$\delta_{n}(x,a,y,i) = \delta^{\mathrm{RC}}_{n}(a) + \Delta^{-1}_{n}(x) + \Delta^{+1}_{n}(y) + \varepsilon_{n}(i) \tag{12}$$
Here, the tripeptide xay is considered, and δn(x,a,y,i) is the CS of atom-type n of the i-th residue, δRCn(a) is the uncorrected RCCS of a, Δ−1n(x) and Δ+1n(y) are sequence correction terms for the preceding and succeeding residue, while the εn(i) term accounts for any deviation caused by pH, temperature or CS referencing of individual datasets. Thus, the resulting sequence correction terms – despite originating from a database approach – are not pairwise and depend only on the identity of the preceding and succeeding residue. In practice, ncIDP is available online.113
In 2011, Kjaergaard et al. provided an RCCS dataset including sequence correction terms derived from the investigation of Gln–Gln–Xxx–Gln–Gln peptides.64 The procedure of Schwarzinger et al. was adopted; however, Gln–Gln–Gln–Gln–Gln was defined as the reference peptide. Correction terms for sequence positions i − 2, i − 1, i + 1 and i + 2 and for the HN, Hα, N, Cα, C′ and Cβ atom types were determined. RCCS calculation follows eqn (4)–(8) with the adjustment of the reference peptide. The problems mentioned earlier regarding the validity of correction terms for the two terminal positions are present here as well. This RCCS calculation method is also available as a web application.119
In the first attempt to apply an advanced machine learning approach for RCCS calibration, Sanz-Hernández and De Simone built the Prosecco neural network model, which uses sequence correction terms.65 A sufficiently large dataset of more than 20000 CSs of IDPs and IDRs from BMRB was used, and determination of pairwise correction terms for the i − 2, i − 1, i + 1 and i + 2 positions was achieved. The calculation involved the use of smoothed empirical probability density functions of CSs derived by applying Gaussian kernels according to eqn (13).
(13) |
$$\Delta\delta_{Ai,j} = \delta_{Ai,j} - \delta_{Ai} \tag{14}$$
The correction terms of Prosecco are not used directly but are multiplied by weight factors wAi,k−2, wAi,j−1, wAi,l+1, wAi,m+2 defined as the corresponding empirical negative overlap integral of the corresponding primary and pairwise probability density functions of the CSs of the central amino acid in a quintuple of residues:
(15) |
Presently, the latest RCCS prediction method with sequence correction terms is Potenci by Nielsen and Mulder, published in 2018.66 The local sequence effect is accounted for from position i − 2 to position i + 2. The formulation of sequence correction terms combines an earlier idea with a novel one. As seen in both small peptide studies and the practical applications of the Wang method, general correction terms could theoretically be assigned to all 20 amino acids for all atom-types. Nielsen and Mulder determined a set of such general correction terms for all 6 canonical atom-types and Hβ. Their novel idea was that the meticulous determination of pairwise correction terms can be omitted; instead, so-called correlated amino acid or second order contributions are determined. To achieve this, the 20 amino acids were divided into 7 groups according to the properties of the side chain: G (Gly), P (Pro), r (aromatics: Phe, Tyr, Trp), a (aliphatics: Leu, Ile, Val, Met, Cys and Ala), “+” (positive: Lys, Arg), “-” (negative: Asp, Glu) and p (polar: Asn, Gln, Ser, Thr, His). The direct neighbor correction and next-neighbor correction terms were defined using a principal component representation of amino acids suggested by Georgiev121 and corresponding tunable weights wkj:
(16) |
(17) |
In determining correlated contribution terms, a similar approach was followed. For each atom-type, ωkl,m was defined as the contribution of residue type m to the CS of residue type l in position i when m is in relative position k with respect to i:
(18) |
(19) |
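The seven side-chain groups that Potenci uses for the correlated contributions can be written as a simple lookup table. The sketch below encodes the grouping listed above, reading alanine as a member of the aliphatic group; the dictionary and function names are hypothetical and are not part of the Potenci code.

```python
# Grouping as listed in the text; Ala is read here as belonging to the aliphatics.
POTENCI_GROUPS = {
    "G": ["Gly"],
    "P": ["Pro"],
    "r": ["Phe", "Tyr", "Trp"],
    "a": ["Leu", "Ile", "Val", "Met", "Cys", "Ala"],
    "+": ["Lys", "Arg"],
    "-": ["Asp", "Glu"],
    "p": ["Asn", "Gln", "Ser", "Thr", "His"],
}

# Invert the table so that each residue maps to its group symbol.
RESIDUE_TO_GROUP = {
    residue: group
    for group, members in POTENCI_GROUPS.items()
    for residue in members
}

def group_pattern(residues):
    """Translate a list of three-letter residue codes into group symbols."""
    return "".join(RESIDUE_TO_GROUP[res] for res in residues)

print(len(RESIDUE_TO_GROUP))                      # number of residues covered
print(group_pattern(["Lys", "Gly", "Phe", "Asp"]))
```

Working with 7 groups instead of 20 residue types reduces the number of neighbor combinations that have to be parametrized, which is what makes second order terms tractable with a limited database.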
In three early works, Richarz and Wüthrich,68 Bundi and Wüthrich54 and Braun et al.55 published the 13C, 1H and 15N chemical shifts, respectively, of Gly–Gly–Xxx–Ala tetrapeptides. These three works established the first so-called binary pH-correction scheme for RCCSs. That is, pairs of RCCSs were proposed for Glu, Asp and His residues: one for pH values at least 1.5 units below the pKa of the given residue, and one for pH values at least 1.5 units above the corresponding pKa. This is equivalent to providing an RCCS value for the completely protonated and the completely deprotonated state of the residue, respectively. For Asp and Glu residues, this might be a smaller issue, as peptide and protein studies are carried out either under very acidic conditions (pH < 3), where even the acidic side chains of these two amino acids are almost completely protonated, or at neutral pH, where both are close to completely deprotonated.122–125 In contrast, histidine has typical pKa values between 6 and 7 and is therefore partially protonated under close to physiological conditions, which places it in a “blind range” of binary pH-correction schemes.
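The binary scheme and its blind range can be illustrated with a short sketch. The function below returns the protonated or deprotonated RCCS when the pH is at least 1.5 units away from the pKa and returns `None` otherwise; the function name, the pKa values and the shifts are hypothetical placeholders chosen for the example.

```python
def binary_ph_rccs(ph, pka, rccs_protonated, rccs_deprotonated, margin=1.5):
    """Pick the protonated or deprotonated RCCS, or None inside the blind range."""
    if ph <= pka - margin:
        return rccs_protonated
    if ph >= pka + margin:
        return rccs_deprotonated
    return None  # partially protonated: the binary scheme has no value here

# Hypothetical C-alpha shifts (ppm) and pKa values, for illustration only.
asp = binary_ph_rccs(ph=6.5, pka=3.9, rccs_protonated=53.0, rccs_deprotonated=54.2)
his = binary_ph_rccs(ph=6.5, pka=6.5, rccs_protonated=55.1, rccs_deprotonated=56.3)
print(asp)  # deprotonated value is selected
print(his)  # None, since pH 6.5 lies within 1.5 units of the assumed pKa
```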
Camcoil, published in 2009 (ref. 62), also has a binary pH-correction scheme. That is, spectra recorded under either acidic (pH < 2) or close to neutral (average pH = 6.1) conditions were used. The correction terms for acidic conditions, nominally pH = 2, were optimized by fitting the chemical shifts of aspartic acid, glutamic acid and histidine residues of two BMRB entries acquired under denaturing conditions at acidic pH values.
In 2011, Kjaergaard et al. took a more sophisticated approach to the pH-correction of RCCSs76 using Gly–Gly–Xxx–Gly–Gly pentapeptides, similarly to Schwarzinger et al.58 In contrast to the original work, Kjaergaard et al. determined RCCS values at pH = 6.5 instead of pH = 2.3 in 8 M urea. Moreover, they performed pH-titration of the peptides with Xxx = Asp, Glu and His and determined the side chain pKa values of these residues by non-linear fitting of the titration curves. The pH-corrected RCCS, δ(pH), is given by the linear combination shown in eqn (20).
(20) |
Eqn (20), which originates from the classical equation for chemical equilibria, enables a continuous pH-correction.
Such an approach depends on knowledge of the side chain pKa values of the individual residues in the protein.
Another limitation is that, in practice, the pH-dependence of the chemical shifts of titratable residues in proteins often follows a Hill model,122,124,126 which includes an extra parameter called the Hill coefficient. The advantage of the continuous scheme is the lack of a “blind range”, in contrast to the earlier binary corrections.
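A continuous correction of this kind can be sketched with the standard Henderson–Hasselbalch relation, in which the protonated fraction weights the shifts of the two limiting states. This is a generic illustration under that assumption and does not reproduce the exact notation of eqn (20); the function names, pKa and shift values are hypothetical.

```python
def protonated_fraction(ph, pka):
    """Henderson-Hasselbalch fraction of the protonated form at a given pH."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))

def ph_corrected_rccs(ph, pka, shift_protonated, shift_deprotonated):
    """Population-weighted average of the protonated and deprotonated shifts."""
    f_ha = protonated_fraction(ph, pka)
    return f_ha * shift_protonated + (1.0 - f_ha) * shift_deprotonated

# Hypothetical histidine-like example: pKa 6.5, limiting shifts 54.0 and 56.0 ppm.
for ph in (4.0, 6.5, 9.0):
    print(ph, round(ph_corrected_rccs(ph, 6.5, 54.0, 56.0), 3))
```

At pH = pKa the result is the midpoint of the two limiting shifts, and far from the pKa it converges to the corresponding binary value, so no blind range remains.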
The Prosecco server provides a binary pH-correction with the same parametrization of pH correction terms as Camcoil.62,65 In Prosecco, 6 BMRB entries were used to optimize the correction terms, with chemical shifts corresponding to nominal pH values of 2.8 and 6.4. The optimization minimized the absolute difference between the calculated and the experimentally determined chemical shifts of the 6 analyzed IDP entries.
The pH-correction of Potenci synthesizes earlier ideas but also introduces some novelty to the field.66 Potenci's uncorrected RCCSs correspond to pH = 7.0. Therefore, pH-correction is only needed if the pH differs from this value. The correction uses a linear combination approach similar to eqn (20). The novelty is that the Hill model is utilized, and the relative concentration fHA of the protonated side chain is calculated as in eqn (21).
(21) |
Further on, using the determined relative concentrations, the final pH-correction is acquired according to eqn (22).
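$$\varepsilon_{k}\left(a_{i+k}, \mathrm{p}K_{\mathrm{a}}^{i+k}, \mathrm{pH}\right) = \Delta\delta^{k}_{\mathrm{HA–A}}(a_{i+k})\,\bigl(f_{\mathrm{HA}}(\mathrm{pH}) - f_{\mathrm{HA}}(\mathrm{pH} = 7)\bigr) \tag{22}$$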
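The structure of eqn (22) can be sketched as follows. The protonated fraction is written here in the commonly used Hill form, 1/(1 + 10^(n(pH − pKa))), which is an assumption about the functional form and not a transcription of eqn (21); the function names and all parameter values are hypothetical.

```python
def hill_protonated_fraction(ph, pka, hill=1.0):
    """Protonated fraction from a Hill-type titration curve (assumed form)."""
    return 1.0 / (1.0 + 10.0 ** (hill * (ph - pka)))

def ph_correction(ph, pka, delta_ha_minus_a, hill=1.0, ph_ref=7.0):
    """Shift correction relative to the reference pH, following eqn (22)."""
    f_now = hill_protonated_fraction(ph, pka, hill)
    f_ref = hill_protonated_fraction(ph_ref, pka, hill)
    return delta_ha_minus_a * (f_now - f_ref)

# Hypothetical titratable residue: pKa 6.0, protonation changes the shift by -1.0 ppm.
at_reference = ph_correction(7.0, 6.0, -1.0)
at_pka = ph_correction(6.0, 6.0, -1.0)
print(at_reference, round(at_pka, 4))
```

By construction the correction vanishes at the reference pH of 7.0, so the uncorrected RCCSs are recovered there.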
The first continuous temperature-correction terms in RCCS calculation were introduced by Kjaergaard et al.76 The chemical shifts of Gly–Gly–Xxx–Gly–Gly peptides were determined at 5, 15, 25, 35 and 45 °C. As a linear dependence was observed in all cases, the slopes of the corresponding curves were obtained by linear fitting. Thus, temperature coefficients for the HN, Hα, N, Cα, C′ and Cβ atom types of the 20 amino acids are available, and the temperature correction is performed as:
$$dT_{\mathrm{corr}} = a \times (T - T_{\mathrm{ref}}) \tag{23}$$
Once the reference temperature of any RCCS dataset is available, it is theoretically possible to transfer temperature coefficients from this RCCS prediction method to others – in stark contrast to sequence correction terms. Nielsen and Mulder performed such a transfer of temperature coefficients published by Kjaergaard et al. when developing Potenci. Accordingly, Potenci uses eqn (23) with Tref = 298 K for temperature correction.
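The linear correction of eqn (23) is straightforward to apply. The sketch below uses Tref = 298 K, as stated for Potenci, while the coefficient and the reference shift are hypothetical numbers chosen only to show the arithmetic.

```python
def temperature_corrected_rccs(rccs_ref, coefficient, temperature, t_ref=298.0):
    """Apply a linear temperature correction: rccs_ref + a * (T - T_ref)."""
    return rccs_ref + coefficient * (temperature - t_ref)

# Hypothetical amide proton: 8.30 ppm at 298 K, coefficient -0.007 ppm/K.
shift_283 = temperature_corrected_rccs(8.30, -0.007, temperature=283.0)
print(round(shift_283, 3))
```

Because the coefficient is attached to the atom type and residue and not to a particular dataset, the same function can in principle be applied to any RCCS table whose reference temperature is known.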
Comparison of RCCS predictors was performed in two different ways. First, by visual analysis of the calculated Cα SCS plots where one has to focus on regions showing meaningful differences depending on the RCCS predictor, and therefore leading to the controversial identification of secondary structural propensities. Second, via comparison of the predictors by a versatile non-parametric statistical tool: the sum of ranking differences and comparison of ranks to random numbers (SRD-CRRN) method augmented by one-way analysis of variance (ANOVA) with post hoc Bonferroni pair-wise tests.141–144
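The core of the sum of ranking differences idea can be sketched as follows: the values of each predictor are ranked, compared with the ranking of a reference column, and the absolute rank differences are summed; the result is then compared with SRD values obtained from random rankings. This is a simplified illustration. The use of the row-wise median as the reference, the tie handling and the toy data are assumptions, and the ANOVA step of the full procedure is not shown.

```python
import random
from statistics import median

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result

def srd(method_values, reference_values):
    """Sum of absolute differences between method ranks and reference ranks."""
    return sum(abs(a - b)
               for a, b in zip(ranks(method_values), ranks(reference_values)))

def srd_table(scs_by_predictor):
    """SRD of every predictor against the residue-wise median (assumed reference)."""
    names = list(scs_by_predictor)
    n = len(scs_by_predictor[names[0]])
    reference = [median(scs_by_predictor[m][i] for m in names) for i in range(n)]
    return {m: srd(scs_by_predictor[m], reference) for m in names}

def random_srd_distribution(n_objects, n_trials=1000, seed=0):
    """SRD values of random permutations, used to judge significance."""
    rng = random.Random(seed)
    ideal = list(range(1, n_objects + 1))
    distribution = []
    for _ in range(n_trials):
        permutation = ideal[:]
        rng.shuffle(permutation)
        distribution.append(sum(abs(a - b) for a, b in zip(permutation, ideal)))
    return distribution

# Hypothetical SCS values (ppm) of three predictors for four residues.
toy = {
    "P1": [0.5, -0.2, 0.1, 0.9],
    "P2": [0.4, -0.3, 0.2, 0.8],
    "P3": [-0.1, 0.6, 0.3, 0.0],
}
table = srd_table(toy)
print(table)
```

A predictor whose SRD is small compared with the bulk of `random_srd_distribution(...)` ranks the residues much like the consensus, whereas an SRD inside the random distribution indicates a ranking that is indistinguishable from chance.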
Fig. 1 The amino acid sequence of human ubiquitin; structures determined by different methods, where secondary structural elements are represented by red wave (helix), green arrow (sheet) and black line (flexible loop) below the sequence; as well as the aligned 3D structures (PDB: 1D3Z, cyan, PDB: 1UBQ, dark blue). Regions Gln40–Glu51 and Leu56–Thr66 are highlighted by dotted grey boxes.
Using the chemical shift information, the secondary structural elements can be assessed, and for this purpose we prepared the Cα SCS plots of ubiquitin (BMRB: 4769) with the selected 8 predictors (see Fig. 2). A simple visual inspection shows that there is little discrepancy between the different methods, as is also apparent from the plot showing the median values. Regions with a given secondary structure have large-amplitude (around ±2 ppm) Cα SCS values, while the mobile loop regions show values within ±0.5 ppm, which are characteristic of the behavior of IDPs. We use the median instead of the mean to represent the general trend of the 8 selected RCCS predictors because the former is less sensitive to outliers. As will be shown later, outliers do appear when SCS values are calculated with different RCCS predictors, especially in the case of IDPs.
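The robustness argument for the median can be made concrete with a small example. The values below are hypothetical SCSs of a single residue as returned by four predictors, one of which is an outlier; the variable names are introduced only for illustration.

```python
from statistics import mean, median

# Hypothetical C-alpha SCS values (ppm) for one residue from four predictors.
scs_values = {
    "predictor_1": -0.05,
    "predictor_2": 0.10,
    "predictor_3": 0.02,
    "predictor_4": -1.20,   # outlier, e.g. caused by an uncorrected pH effect
}

consensus_median = median(scs_values.values())
consensus_mean = mean(scs_values.values())
print(round(consensus_median, 4), round(consensus_mean, 4))
```

The single outlier pulls the mean well away from the three agreeing predictors, while the median stays close to them.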
Still, despite the general similarity of the plots in Fig. 2, some differences can be noticed. The Gln40–Glu51 region, which is highlighted on the sequence in Fig. 1, is suggested by structure elucidation to host two β-structures separated by a short loop. On the SCS plots, this structural motif is supported by most methods, appearing as an “inverse valley”, but the pattern is visually less well-defined for the Camcoil predictor. The discrepancy originates from the SCS value of Gly47, which is either positive or negative with a very small amplitude for all other predictors, whereas Camcoil yields −0.40 ppm, leading to the interpretation that a single, unbroken β-structure is present. Moreover, differences arise in how pronounced the second β-motif is. The ncIDP and Wishart methods give very small amplitude SCSs for the Gln49 residue, which makes the motif more difficult to notice than with other predictors. The Wang and Schwarzinger methods give relatively larger negative SCSs for Arg42 of the first β-motif, suggesting an inclusion of Arg42 in the structure that is not as obvious with the other predictors. We note that the two 3D structures of Fig. 1 agree on the presence of two β-structures separated by Ala46 and Gly47 but differ in their exact length.
Another region with striking differences is the Leu56–Thr66 part. Both 3D structures of Fig. 1 report a helix followed by a flexible loop. In the Cα SCS plots of Fig. 2, one can see relatively large, positive values for Leu56–Asp58, followed by positive values with differing amplitudes between Lys63 and Thr66. In between, the SCSs are characteristic of a loop. Gln62 gives a relatively large negative value according to all methods, and such spikes are known to occur at the beginning and the end of well-defined structural elements. An interesting feature is that with the Schwarzinger method, SCS(Asp58) > SCS(Leu56). This is because the Schwarzinger method overestimates the SCS of aspartic acid at close to neutral pH, due to its acidic calibration. The Camcoil method also shows a peculiarity in this region. Similarly to the Wang method, Camcoil indicates a relatively high SCS for Thr66, suggesting a helix between Lys63 and Thr66. With Camcoil, all SCS values between Leu56 and Thr66 are positive, with only the negative spike of Gln62 breaking the tendency. Therefore, one might assume that there is a single helical structure between Leu56 and Thr66 and that the experimentally determined CS of Gln62 was assigned incorrectly. Such a conclusion would call for a reinspection of the spectral assignment on the part of a researcher using Camcoil, whereas none of the other methods indicates any need for such a procedure. One must be aware that the choice of an RCCS predictor might result in such ambiguities even in the case of globular proteins with well-defined secondary structural elements along the sequence.
From the available chemical shift information (BMRB 27348), the SCS plots were constructed for each of the eight predictors and for the median. The trends correspond to white noise with some added effects of residual structure. This is clearly shown by the fact that all but one value in the median plot in Fig. 4 are within the ±0.5 ppm range. Therefore, both the choice of an adequate RCCS predictor and the differences between RCCS prediction methods gain in importance. As can be seen in Fig. 4, different structural tendencies may be proposed just by visual assessment of the Cα SCS plots. Obviously, all trends and related conclusions are much less sound than for structured proteins. However, in IDP research, this is the information that is available and on which all advanced structural propensity calculation methods, irrespective of their degree of sophistication, rely. While all the plots in Fig. 4 differ from each other to some extent, the Cα SCS plots yielded by the Schwarzinger and Camcoil methods are particularly distinctive. In the case of the Schwarzinger method, the spikes appearing at Glu and Asp residues result from the corresponding RCCS values having been recorded at pH = 2.3. The titratable side chains of these residues were completely or almost completely protonated under the conditions of RCCS calibration, whereas the same side chains are completely deprotonated at pH = 6.5, where the data of BMRB entry 27348 were recorded. In the case of Camcoil, the surprising feature is that the average amplitude of the SCSs is generally larger than for the other methods, irrespective of residue type. We see no direct connection between this and the theoretical background of Camcoil. However, one must be aware of this feature when using SCSs calculated by Camcoil either directly or by deriving structural probabilities from them in the δ2D35,148 method.
For the other 6 Cα SCS plots of Fig. 4, a few general tendencies can be noticed. A set of predominantly negative values at the C-terminus of the protein is seen in the plot corresponding to the Kjaergaard method, with a similar pattern being present in the Wang plot, the Prosecco plot and, to a smaller degree, in the Potenci plot. The Cα SCS plot made using the Wishart data set indicates some sets of consecutive negative values in this sequential region, but these are separated by short sets of consecutive positive values. The positive spike appearing in the Wishart plot belongs to His50 and can be attributed to the effect of pH. The Wishart data set was collected at pH = 5.0, where the titratable side chain of histidine was close to completely protonated, while at pH = 6.5 the same side chain is already partially deprotonated, resulting in an inherent and uncorrected error of the SCSs.
One part of the sequence where a structural propensity could be assumed is the Ala18–Gly36 region according to the Wishart method, or the Ala19–Glu28 region according to the Kjaergaard method. Data obtained from the Potenci method could also be argued to mildly reinforce this tendency.
The Cα SCS plot based on the Wang method is peculiar even in itself. Up until residue Gln99, positive SCS values, many of which exceed 0.4 ppm, dominate the plot, with very few negative values of smaller than 0.2 ppm amplitude breaking the trend. From Gln99 onward, negative values, indicating β-sheet propensity, dominate. Because of regular breaks in the trends and considerable variation in the amplitude of consecutive SCSs of the same sign, the Wang plot of α-synuclein is a typical example of a Cα SCS plot that is very difficult to interpret, even qualitatively.
On the contrary, the Cα SCS plot of α-synuclein by ncIDP has amplitudes below 0.3 ppm almost exclusively, and no longer series of consecutive SCSs with the same sign are present. The plot is very similar to small amplitude white noise, as one would expect for a completely unstructured protein.
In the median Cα SCS plot of Fig. 6, the Gln16–Lys24 region of p53TAD1-60 has a set of consecutive positive values, suggesting a helical propensity. However, both the strength and exact localization of this propensity vary between the individual predictors. The strongest suggestion for residual helicity originates from Camcoil and the Wishart method, but even these two plots differ considerably in their patterns. With the Schwarzinger method there are only two positive spikes at Asp21 and Leu22, and no propensity can be supposed. The double spikes appearing at Asp41–Asp42 and Asp48–Asp49 because of the acidic calibration of the Schwarzinger method should also be noticed. In the Val31–Asp40 region neither glutamate nor aspartate residues are found, yet the Schwarzinger plot suggests a relatively convincing β-propensity. None of the remaining methods reinforces this tendency in comparable strength and length. Some β-motif could also be present close to the C-terminus between Trp53 and Asp57, according to the median plot; however, the patterns of Camcoil and the Schwarzinger method differ considerably from those of their peers in this respect.
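The visual criterion used above, namely sets of consecutive SCSs of the same sign, can be mimicked by a simple run detector. This is a minimal sketch and not a method used in this work: the minimum run length and the amplitude cutoff are arbitrary assumptions, and the input values are hypothetical.

```python
def same_sign_runs(scs, min_length=4, min_abs=0.0):
    """Return (start, end, sign) tuples for runs of consecutive same-sign SCS values.

    start and end are 0-based inclusive indices; sign is +1 (positive Calpha SCS,
    suggesting helical propensity) or -1 (negative, suggesting beta propensity).
    Values with |SCS| <= min_abs are treated as breaks.
    """
    runs = []
    start, sign = None, 0
    for i, value in enumerate(list(scs) + [0.0]):  # trailing sentinel closes the last run
        current = 0 if abs(value) <= min_abs else (1 if value > 0 else -1)
        if current != sign:
            if sign != 0 and i - start >= min_length:
                runs.append((start, i - 1, sign))
            start, sign = i, current
    return runs

# Hypothetical Calpha SCS values (ppm), for illustration only
example = [0.1, 0.3, 0.2, 0.4, 0.2, -0.1, 0.1, -0.2, -0.3, -0.1, -0.4]
print(same_sign_runs(example))
```

Running the same detector on the SCS vectors of different RCCS predictors would show directly how the position and length of such runs depend on the predictor.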
Summarizing the visual inspection for all the above-mentioned biomolecules, in the case of folded ubiquitin differences between RCCS predictors are not crucial. Also, assessing α-synuclein and p53TAD1-60 to be fundamentally unstructured is possible using each of the eight RCCS predictors. However, assessment of the presence or absence of regions with secondary structural propensities – the primary aim of SCS analysis – is not at all clear. The eight selected methods differ in both the amplitudes of SCS values, and in the corresponding trends of the signs thereof. Thus, evaluation of secondary structural propensities according to this set of RCCS predictors is ambiguous, raising various questions. Which RCCS calculation method(s) is one supposed to use for a given experimental dataset? If each method has its own limitations, is there a way to use a set of different RCCS predictors to get a realistic idea about these mild propensities? Is there a single RCCS prediction method which generally best represents the consensus of all 8?
We illustrate the use of our approach on a model example, where five datasets were generated with thirty observations each. Note here, that for the SCS data, the only data pretreatment needed is the removal of missing SCS values for the missing assignments. The input data matrix contains the vectors corresponding to the methods to be compared. In the next step the SRD algorithm performs the comparison and validation by two built-in approaches. The first one is a comparison of SRD values to the SRD distribution of random vectors; and the second one is a 5-fold cross-validation. Running such an SRD-CRRN implementation results in a graph similar to the one in Fig. 7.
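The core of the SRD calculation can be sketched in a few lines. This is a simplified illustration and not the implementation used in this work: it assumes that the row-wise median serves as the reference (consensus) vector, that ties receive average ranks, and that SRD values are scaled to the 0–100 range by the maximum possible SRD. The validation steps (comparison to random vectors and cross-validation) are not included, and the toy data are hypothetical.

```python
from statistics import median

def drop_incomplete_rows(columns):
    """Remove rows (residues) for which any method has a missing value (None)."""
    keep = [i for i, row in enumerate(zip(*columns.values())) if None not in row]
    return {name: [col[i] for i in keep] for name, col in columns.items()}

def fractional_ranks(values):
    """Rank values from 1 to n; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = average
        pos = end + 1
    return ranks

def srd_scores(columns, reference=None):
    """Normalized SRD (0-100) of each column against the reference ranking."""
    n = len(next(iter(columns.values())))
    if reference is None:
        # Assumption: the row-wise median represents the consensus
        reference = [median(row) for row in zip(*columns.values())]
    ref_ranks = fractional_ranks(reference)
    srd_max = (n * n) // 2  # n^2/2 for even n, (n^2 - 1)/2 for odd n
    return {
        name: 100.0 * sum(abs(a - b) for a, b in zip(fractional_ranks(col), ref_ranks)) / srd_max
        for name, col in columns.items()
    }

# Hypothetical input matrix: one column per method, one row per residue
toy = {
    "method_a": [0.1, 0.2, None, 0.3, 0.4],
    "method_b": [0.1, 0.2, 0.0, 0.3, 0.4],
    "method_c": [0.4, 0.3, 0.1, 0.2, 0.1],
}
scores = srd_scores(drop_incomplete_rows(toy))
print(scores)
```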
The theoretical cumulative distribution function of random vectors is represented by the black curve, and its numerical values are shown on the y-axis on the right. The curve is sigmoidal, as the SRD values of random vectors may be approximated by a normal distribution already for relatively small sample sizes (in our case thirty), as proved by Héberger and Kollár-Hunek.142 Lines A, B and C represent the 5th percentile, median and 95th percentile of this distribution, meaning the interval between A and C is the region of insignificance at the α = 5% level of significance. The five compared methods from our example are represented by colored sticks. The height of a given stick is equal to its x-coordinate, that is the SRD score of the corresponding method (e.g. 4.24 for method 1). This is the reason for the SRD score being shown on both the x and the left y-axis. In the chosen model example, the reference method also appears in the SRD-CRRN plot. Obviously, its SRD value is 0 by definition, as its ranking is identical to that of the reference vector, meaning itself. Generally, in SRD-CRRN applications this is not shown. Furthermore, Fig. 7 shows that methods 1, 2 and 3 reproduce the reference ranking much better than random numbers, as the corresponding sticks are close to zero and far away from the region of insignificance A–C. In contrast, method 4 falls in the insignificant region, meaning it is not linked to the reference vector by any deterministic relationship. Method 5 is related to the reference vector, but produces a ranking inversely correlated to the reference, and as a result it is situated closer to the value of 100 than to the median of the distribution of random numbers. In conclusion, Fig. 7 provides an ordering of the compared methods according to their performance.
Still, one can observe that methods 1, 2 and 3 are close to each other, but there is no indication whether one is significantly better than the other. This can be decided by applying ANOVA to the data provided by the built-in 5-fold cross-validation of the SRD algorithm. As p(ANOVA) in Table 2 is <10⁻⁷, far below the 5% threshold, the test is significant, meaning that not all methods perform equally well. This is highlighted by the mean SRD scores of the methods shown in Table 2. These mean SRD scores correspond to the positions of the sticks in Fig. 7. Which of the five methods are significantly better or worse can be checked by a Bonferroni post hoc test.144 The Bonferroni test provides a grouping pattern of the methods as shown in Table 2. In this example, the chosen five methods form four homogeneous groups named G1, G2, G3 and G4. The grouping pattern highlights that methods 1 and 2 are not distinguishable at a 5% level of significance by the post hoc Bonferroni t-test. Similarly, methods 2 and 3 can be treated as equivalent; however, method 1 performs clearly better than method 3 based on the Bonferroni test. Methods 4 and 5 are significantly different from all other methods, therefore they form the independent groups G3 and G4. This model example was deliberately constructed to show that a method might belong to multiple homogeneous groups, as method 2 belongs to both groups G1 and G2, but it is also possible to have groups containing a single method, like G3 and G4.
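The comparison of ranks to random numbers can be approximated numerically. The sketch below replaces the theoretical distribution by a Monte Carlo simulation over random permutations of thirty ranks, which is an assumption made purely for illustration; the function names, the number of trials and the seed are arbitrary. The 5th and 95th percentiles play the role of lines A and C.

```python
import random

def random_srd_distribution(n, trials=5000, seed=0):
    """Sorted normalized SRD values (0-100) of random permutations against ranks 1..n."""
    rng = random.Random(seed)
    reference = list(range(1, n + 1))
    srd_max = (n * n) // 2
    scores = []
    for _ in range(trials):
        shuffled = reference[:]
        rng.shuffle(shuffled)
        srd = sum(abs(a - b) for a, b in zip(shuffled, reference))
        scores.append(100.0 * srd / srd_max)
    scores.sort()
    return scores

def percentile(sorted_scores, q):
    """Nearest-rank style percentile of an already sorted list (q between 0 and 100)."""
    index = round(q / 100 * (len(sorted_scores) - 1))
    return sorted_scores[index]

def classify(score, low, high):
    """Place an SRD score relative to the region of insignificance [low, high]."""
    if score < low:
        return "better than random"
    if score > high:
        return "inversely related"
    return "insignificant"

distribution = random_srd_distribution(30)
line_a = percentile(distribution, 5)
line_b = percentile(distribution, 50)
line_c = percentile(distribution, 95)
# Mean SRD scores of the model example (methods 1, 4 and 5)
for score in (4.24, 70.21, 92.78):
    print(score, classify(score, line_a, line_c))
```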
p (ANOVA) | Method | Mean SRD score | G1 | G2 | G3 | G4 |
---|---|---|---|---|---|---|
<10⁻⁷ | Method 1 | 4.24 | **** | | | |
| | Method 2 | 5.56 | **** | **** | | |
| | Method 3 | 6.14 | | **** | | |
| | Method 4 | 70.21 | | | **** | |
| | Method 5 | 92.78 | | | | **** |
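How a grouping pattern such as the one in Table 2 follows from pairwise post hoc results can be sketched as below. This is an assumed, simplified construction and not the procedure of any specific statistical software: methods are ordered by mean SRD score, the pairs found indistinguishable by the post hoc test are supplied as input, and each group is a maximal run of consecutive methods that are all pairwise indistinguishable.

```python
def homogeneous_groups(ordered_methods, equivalent_pairs):
    """Build groups G1, G2, ... from methods ordered by mean SRD score.

    equivalent_pairs is a set of frozensets, one per pair of methods that the
    post hoc test could not distinguish.
    """
    def indistinguishable(a, b):
        return frozenset((a, b)) in equivalent_pairs

    groups = []
    furthest_end = -1
    n = len(ordered_methods)
    for start in range(n):
        end = start
        # Extend the run while the next method is equivalent to every member
        while end + 1 < n and all(
            indistinguishable(ordered_methods[k], ordered_methods[end + 1])
            for k in range(start, end + 1)
        ):
            end += 1
        if end > furthest_end:  # skip runs fully contained in an earlier group
            groups.append(ordered_methods[start:end + 1])
            furthest_end = end
    return groups

# Model example: methods 1/2 and 2/3 are indistinguishable, as stated in the text
methods = ["Method 1", "Method 2", "Method 3", "Method 4", "Method 5"]
pairs = {frozenset(("Method 1", "Method 2")), frozenset(("Method 2", "Method 3"))}
for label, group in enumerate(homogeneous_groups(methods, pairs), start=1):
    print(f"G{label}: {group}")
```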
Still, the RCCS predictors align in a certain order and according to ANOVA they are, even if seemingly not very different, not all equivalent. The Bonferroni post hoc test indicates that the two best methods, Potenci and that of Kjaergaard are equivalent at a 5% level of significance (Table 3). All other methods are significantly different from each other. The Schwarzinger method falling visibly behind all the other predictors is explained by its calibration under acidic circumstances. This makes Schwarzinger SCS values for residues with titratable side chains like Asp21, Asp24 and Asp32 much larger than those given by the other predictors. This effect might go unnoticed in visual assessment of Fig. 2 but is highlighted by SRD-CRRN.
p (ANOVA) | Method | Mean SRD score | G1 | G2 | G3 | G4 | G5 | G6 | G7 |
---|---|---|---|---|---|---|---|---|---|
<10⁻⁷ | Potenci | 2.71 | **** | | | | | | |
| | Kjaergaard | 2.75 | **** | | | | | | |
| | Prosecco | 3.14 | | **** | | | | | |
| | ncIDP | 3.53 | | | **** | | | | |
| | Wishart | 4.61 | | | | **** | | | |
| | Wang | 5.31 | | | | | **** | | |
| | Camcoil | 6.50 | | | | | | **** | |
| | Schwarzinger | 8.71 | | | | | | | **** |
Another interesting observation is that four more recent methods, three database-derived (Potenci, Prosecco, ncIDP) and one small peptide-based (Kjaergaard), perform best, and they are followed by the small peptide-based method of Wishart. Thus, the Wishart method, which is the oldest one here, ends up ahead of three methods developed later (Camcoil, Wang, Schwarzinger) that could naively be considered improvements in the field.
A strongly phrased interpretation is that in the case of α-synuclein, although it is possible to find RCCS predictors which better reproduce the consensus of the eight considered methods, no single prediction is even close to being equivalent to this consensus SCS vector.
Differences in median SRD are larger, indicating a more pronounced ordering and grouping of the predictors. This is reinforced by the results of ANOVA and the post hoc analysis (Table 4). Potenci performs best, finishing in first place, followed by the Wishart method, ncIDP and Prosecco. As for these three, at a 5% level of significance ncIDP, located in between, is indistinguishable from both the Wishart and Prosecco methods. In turn, these latter two are not equivalent according to the Bonferroni test. The remaining four methods form no homogeneous groups but are all pairwise distinguishable from each other. This result highlights the usefulness of conducting ANOVA on the SRD data, as by a simple visual inspection of Fig. 9 one would consider the Wang and Schwarzinger methods equivalent. The ordering of the methods differs from that found for ubiquitin; however, one has to observe that the first five and last three methods are the same for both proteins. Similarly, in the case of α-synuclein no clear trend can be seen based on either time of introduction or the underlying principles of the methods. Potenci, the newest database-derived RCCS predictor, finishes first, but the oldest small peptide-based method is the next in line.
p (ANOVA) | Method | Mean SRD score | G1 | G2 | G3 | G4 | G5 | G6 | G7 |
---|---|---|---|---|---|---|---|---|---|
<10⁻⁷ | Potenci | 25.66 | **** | | | | | | |
| | Wishart | 30.49 | | **** | | | | | |
| | ncIDP | 30.83 | | **** | **** | | | | |
| | Prosecco | 31.65 | | | **** | | | | |
| | Kjaergaard | 34.19 | | | | **** | | | |
| | Schwarzinger | 40.81 | | | | | **** | | |
| | Wang | 41.87 | | | | | | **** | |
| | Camcoil | 45.32 | | | | | | | **** |
The importance of pH effects was highlighted in the comparison of RCCS predictors by the SRD-CRRN method even for folded ubiquitin. As some sort of pH correction is available in four of the eight studied predictors, we intended to investigate the effect of pH via SRD calculations. For this purpose, we used chemical shift data for α-synuclein, acquired at various pH and temperature values (BMRB: 18857). We classified the amino acid residues as titratable (including aspartic and glutamic acid) and all others as non-titratable. This way, according to the BMRB dataset, out of the possible 140 residues 133 are assigned, yielding 22 titratable and 111 non-titratable residues in the 2.16–7.51 pH range, at 283 K. For the analysis we selected representative pH values of 2.16, 4.21 and 7.51. This enables us to see how the different pH-correction schemes deal with a pH value at which most of the titratable side chains are partially protonated, and what happens if the pH is close to the Asp and Glu side chain pKa values. In order to avoid unnecessary outliers in SRD-CRRN calculations, at each pH we use only the suitable RCCS predictors. This means, for example, that the Schwarzinger method is excluded at pH = 7.51, as its RCCSs have clearly been calibrated under acidic conditions. Similarly, ncIDP and the methods of Wishart and Wang are excluded at pH = 2.16, as the Wishart dataset corresponds to pH = 5.0, while data recorded between pH values of 4.0 and 7.5 were used for the development of ncIDP. The Wang method was also developed for very mildly acidic and close to neutral conditions. The obtained pH-dependent SRD plots are shown in Fig. 10.
Considering the titratable residues of α-synuclein at pH = 2.16, the SRD results suggest that the Kjaergaard and Potenci methods – the two RCCS predictors with continuous pH correction schemes – perform the best. Quite surprisingly, the Schwarzinger method ends up in the insignificant range and is the worst at reproducing the consensus of the five methods even under these conditions, although the pH in this case is rather close to the calibration pH of the Schwarzinger dataset. It is also interesting that the two small peptide methods (Kjaergaard and Schwarzinger) finish first and last, respectively. This confirms that it is generally not recommended to choose an RCCS prediction method solely based on its origin, as datasets with very similar backgrounds can produce very different results under conditions at which both should be equally valid. The lack of consensuality between the Schwarzinger method and its peers might be explained by the small peptides used for calibration. The simple glycine frame harboring a glutamic or aspartic acid residue is extreme with respect to molecular mobility and intramolecular electronic interactions.
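The data preparation described above can be sketched as follows. The sequence and the assignment list are hypothetical toy inputs, not the α-synuclein data of BMRB entry 18857. The exclusion table encodes only the two pH values for which exclusions are explicitly listed in the text; pH = 4.21 is deliberately left out.

```python
TITRATABLE = {"D", "E"}  # Asp and Glu, following the classification in the text

ALL_PREDICTORS = {
    "Wishart", "Schwarzinger", "Wang", "Kjaergaard",
    "ncIDP", "Camcoil", "Prosecco", "Potenci",
}

# Exclusions stated in the text (no entry for pH 4.21, which is not listed there)
STATED_EXCLUSIONS = {
    2.16: {"ncIDP", "Wishart", "Wang"},
    7.51: {"Schwarzinger"},
}

def split_by_titratability(sequence, assigned_positions):
    """Split assigned 1-based residue positions into titratable and non-titratable."""
    titratable, other = [], []
    for position in sorted(assigned_positions):
        bucket = titratable if sequence[position - 1] in TITRATABLE else other
        bucket.append(position)
    return titratable, other

def predictors_for(ph):
    """Predictors kept for SRD-CRRN at a pH for which the text lists exclusions."""
    return sorted(ALL_PREDICTORS - STATED_EXCLUSIONS[ph])

# Hypothetical ten-residue sequence with residue 5 unassigned
toy_sequence = "MDAEKLGDVE"
toy_assigned = {1, 2, 3, 4, 6, 7, 8, 9, 10}
titratable, non_titratable = split_by_titratability(toy_sequence, toy_assigned)
print(titratable, non_titratable)
print(predictors_for(2.16))
```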
Regarding the non-titratable residues of α-synuclein at pH = 2.16, the order of the same five RCCS predictors is shuffled, and the grouping pattern is also different. In this case, Potenci is the most consensual of the five, while Camcoil finishes last. The Schwarzinger method is at the fourth position but is better than random numbers, indicating that the Schwarzinger method predicts more consensual RCCSs for non-titratable residues than for titratable ones under acidic conditions. The Kjaergaard method, which was the most consensual for the titratable residues, now finishes second with Prosecco closely following it. Interestingly, at pH = 2.16, all the above differences in SRD values are significant (Tables S1 and S2†).
At the intermediate pH of 4.21 and for titratable residues, the neutral methods generally perform well, while the Schwarzinger method and the acidic version of Prosecco end up in the insignificant region (Fig. 10). The most interesting feature at this pH is the pattern of the SRD-CRRN plots for titratable and non-titratable residues. In the latter case, all methods are grouped close to one another. This highlights that differences between RCCS predictors are generally much more pronounced for glutamic acid and aspartic acid residues than for the non-titratable amino acids. Note that the consensus of the predictors is generally worse at this pH, as indicated by higher SRD values. Also, at pH = 4.21, more predictors perform equivalently well based on the post hoc Bonferroni test results shown in Tables S3 and S4.† Here, there are multiple homogeneous groups containing more than one predictor.
For titratable residues at pH = 7.51, the Wishart and Potenci methods perform equivalently and are followed rather closely by the equivalently performing Prosecco, ncIDP and the Kjaergaard method (Table S5†). The Wang method is reasonably close to this cluster, while Camcoil is at the seventh position. Once again, Potenci, the newest database-derived method, shows similarity to the Wishart dataset, which is the oldest of those used here and has been calibrated on a set of small peptides. It is also interesting that Potenci and ncIDP, where the former is an enhanced and more complete version of the latter, are not the closest to each other. Regarding the non-titratable residues at pH = 7.51 (Table S6†), ncIDP seems to be the most consensual, closely followed by Potenci and the Wishart method. Then, Prosecco and the Kjaergaard method perform rather similarly to each other, with the Wang method and Camcoil loosely following them.
Generally, the consensus of RCCS predictors is weaker for Glu and Asp residues of α-synuclein than for non-titratable ones. The identity of the most consensual predictors depends on both the pH and the amino acid composition of the protein.
Besides pH, the other dominant factor affecting CSs and corrected for by some RCCS prediction methods is the temperature. To study its effect, we used the five datasets of BMRB entry 18857 containing the experimental CSs of α-synuclein at 278, 288, 293, 298 and 303 K, at pH = 5.87. As the Schwarzinger method has been shown to be inappropriate at this pH, especially for glutamate and aspartate residues, which make up more than 10% of the amino acids in α-synuclein, we excluded this predictor from the analysis. The results of the SRD calculations for each temperature are shown in Fig. 11. The general pattern shows that ncIDP is the most consensual, followed by a group comprising Potenci, Prosecco, and the methods of Wishart, Kjaergaard and Wang in varying order. Camcoil ends up at the seventh position in all cases. Most methods are usually significantly different from each other according to the Bonferroni post hoc test; however, there is usually a single pair of methods which are indistinguishable (Tables S7–S11†). At 278 K it is Potenci and Prosecco, at 288 K the Kjaergaard and Wang methods, at 293 and 298 K it is Prosecco and the Wishart method, while finally, at 303 K ncIDP and Potenci end up being indistinguishable at a 5% level of significance. We note that this set of methods shows considerable stability in SRD, namely, at 288 K and above the last three positions are occupied by the methods of Kjaergaard and Wang and by Camcoil, in this exact order. Prosecco and the method of Wishart are close in all cases, switching places between 293 K and 298 K, and then their earlier order is restored again at 303 K. Potenci, which shows an average performance at 278 K, becomes very nearly the most consensual method at 303 K.
Since the changes in the experimental CSs of α-synuclein are apparently not related to any change in structural propensities in this temperature range, the data further highlight the fact that the degree of similarity of RCCS datasets provided by the different predictors is strongly dependent on the experimental conditions. This is true even despite some general tendencies in the SRD plots displaying stability. The individual changes in the consensuality of the seven methods indicate that in order to formulate a sound conclusion from SCS data, the simultaneous use of a carefully selected set of RCCS predictors is desirable.
p (ANOVA) | Method | Mean SRD score | G1 | G2 | G3 | G4 | G5 | G6 | G7 |
---|---|---|---|---|---|---|---|---|---|
<10⁻⁷ | ncIDP | 15.35 | **** | | | | | | |
Potenci | 19.92 | **** | |||||||
Wishart | 20.20 | **** | |||||||
Prosecco | 24.49 | **** | |||||||
Kjaergaard | 26.01 | **** | |||||||
Wang | 30.05 | **** | |||||||
Schwarzinger | 34.21 | **** | |||||||
Camcoil | 36.51 | **** |
In this work we intended to give an overview of RCCS prediction with the corresponding theoretical background, and special attention was paid to neighbor-, pH- and temperature-correction schemes. We attempted to summarize earlier work in a way that highlights the fundamentals of the different approaches and the temporal evolution of RCCS predictors in the last decades. We believe that such a review might help experimental protein scientists to maximally exploit the acquired data for identifying secondary structural propensities of IDPs. Our approach of treating RCCS prediction as a somewhat ill-defined calibration problem is expected to give hints to computational and theoretical researchers for further developments in RCCS prediction.
Our selected set of eight RCCS predictors (most of them are also the ones most abundantly referenced in publications) can, in our opinion, be plausibly used. Performing a visual and statistical comparison of the selected RCCS predictors demonstrates how the choice of any single RCCS predictor might affect the structural conclusions. Even though we focus only on the Cα environment, and the conclusions reflect the behavior of this atom type, the chosen examples highlight general tendencies and are not compromised in validity. We introduce in this field and suggest the use of the very sensitive SRD-CRRN analysis coupled with post hoc complemented ANOVA. This approach can detect statistically significant differences even in the case of a folded protein with well-characterized structure. However, these differences only slightly affect secondary structural conclusions, as we show on the example of ubiquitin. Much more importantly, using α-synuclein and p53TAD1-60 as examples, we demonstrate that the slight differences – occurring exclusively as a consequence of choosing different RCCS predictors – can be crucial in the study of IDPs. The non-equivalence of RCCS predictors could clearly highlight or mask certain potential secondary structural tendencies. Relying on the pure review part of the present work, we discuss how these differences of RCCS predictors are related to the different theoretical backgrounds of the predictors, especially correction schemes and the molecular systems used in the calibration processes. These examples and explanations were aimed at highlighting the most general pitfalls to avoid during SCS analysis. We have demonstrated that amino acid sequence, pH and temperature mutually determine which RCCS predictors should be used. However, a trustworthy selection of RCCS predictors should be based on statistical analysis, rather than intuition or habit alone. 
Especially, applying SRD-CRRN – one of the up-and-coming tools of chemometrics – for SCS analysis and selection of RCCS predictors is an important methodological advance.
In case of IDPs the use of multiple RCCS predictors is beneficial. Noting their individual drawbacks and peculiarities, we recommend all eight predictors used in this study, if experimental conditions permit. However, we believe that there is a need for the development of at least one composite statistical method which is capable of incorporating the information content of multiple RCCS predictors. Until the underlying problem of ill-defined RCCS calibration is satisfactorily solved, such an approach to SCS analysis would be the best and most user-friendly option in the field.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d3ra00977g
This journal is © The Royal Society of Chemistry 2023