Simon E. F. Spencer *a and Alison Rodger b
aDepartment of Statistics, University of Warwick, Coventry, UK. E-mail: s.e.f.spencer@warwick.ac.uk
bDepartment of Molecular Sciences, Macquarie University, NSW 2109, Australia. E-mail: alison.rodger@mq.edu.au
First published on 7th December 2020
Circular dichroism spectroscopy is an important tool for determining the structural characteristics of biomolecules, particularly the secondary structure of proteins. In this paper we propose a Bayesian model that estimates the covariance structure within a measured spectrum and quantifies the uncertainty in the inferred secondary structure fractions and in the characteristic spectra associated with each secondary structure type. Furthermore, we use tools from Bayesian model selection to determine the best secondary structure classification scheme and illustrate a technique for assessing whether two or more measured protein spectra share the same secondary structure. Our findings suggest that it is not possible to identify more than 3 distinct secondary structure classes from CD spectra above 175 nm. The inclusion of data from wavelengths between 175 and 200 nm did not substantially affect the ability to determine secondary structure fractions.
In recent years large datasets of CD spectra have been produced8–10 which enable the relationships between secondary structure and CD data to be explored through mathematical modeling and statistical analysis. In CD spectroscopy the main approach to finding the secondary structure has been to deconvolute a spectrum into a weighted sum of so-called characteristic spectra using a variety of different algorithms.11–14 For a given protein the relative weight of each characteristic spectrum enables calculation of the abundance of the respective structure element from known data about the structure of the proteins making up the reference set. From a statistical perspective these approaches share the same model structure, a linear model, and use standard regression techniques to fit it. There are a few exceptions, such as neural network approaches,15–17 though the self-organising map neural network approaches also involve finding a best-match combination of spectra and assigning secondary structures from them. The implicit assumption behind a linear model is that the errors at each wavelength are uncorrelated and have equal variance, even though evidence for a more complicated error structure, with variance depending on the wavelength, has recently been reported.18 There are also significant amounts of correlation within some parts of the spectrum. In this work we use a Bayesian approach in which we keep the linear structure but let the covariance matrix of the CD spectra be as general as possible, in order to identify crucial dependencies within the CD spectra and weight the information in the most coherent way. Firstly, our approach can be used as a secondary structure estimation method or to enhance existing algorithms. Secondly, we can use tools from Bayesian model selection to investigate the secondary structure classification schemes that can be determined successfully from CD data. In particular, we consider which secondary structures can be assigned from CD data between 175 and 260 nm. We also explore possible uses of the posterior uncertainty.
$$c_x = f_x B + w_x \qquad (1)$$

$$0 \le f_i \le 1, \quad \text{for } i = 1, \ldots, n_s \qquad (2)$$

$$\sum_{i=1}^{n_s} f_i = 1 \qquad (3)$$
Each existing approach has a different method for satisfying such constraints, sometimes based on ad hoc criteria and leading to an approximate solution, e.g. the sum can be in the region 1 ± 0.05.11
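As a point of reference for how the simplex constraints can be imposed directly rather than approximately, the sketch below fits the fractions for a single spectrum by constrained least squares, with (2) and (3) enforced exactly. The arrays `B_mat` and `c_x` are placeholders generated at random, not values from the paper.

```python
import numpy as np
from scipy.optimize import minimize

def fit_fractions(c_x, B_mat):
    """Least-squares fit of secondary structure fractions f for one spectrum,
    subject to 0 <= f_i <= 1 and sum(f) = 1 (constraints (2) and (3))."""
    n_s = B_mat.shape[0]

    def residual(f):
        return np.sum((c_x - f @ B_mat) ** 2)

    result = minimize(
        residual,
        x0=np.full(n_s, 1.0 / n_s),               # start from equal fractions
        method="SLSQP",
        bounds=[(0.0, 1.0)] * n_s,                # constraint (2)
        constraints=[{"type": "eq", "fun": lambda f: np.sum(f) - 1.0}],  # constraint (3)
    )
    return result.x

# Toy usage with placeholder data (3 classes, 86 wavelengths).
rng = np.random.default_rng(0)
B_mat = rng.normal(size=(3, 86))
c_x = np.array([0.6, 0.1, 0.3]) @ B_mat + rng.normal(scale=0.1, size=86)
print(fit_fractions(c_x, B_mat))
```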
A third challenge relates to the common assumption that the errors wx are normally distributed and independent across wavelengths. If Σ is the covariance matrix within a spectrum, then it is usually chosen to be diagonal, ignoring the correlations that a spectrum is known to display.
In the following sections we discuss a Bayesian approach to inference where the fraction vector fx and characteristic matrix B are estimated jointly, making full use of the information in the reference set, i.e. without variable selection. We introduce a Dirichlet prior for the fractions fx to capture the constraints (2) and (3) and most importantly we estimate a general covariance matrix Σ for the errors, allowing the model to learn the correct covariance structure from the reference set.
$$C = FB + W \qquad (4)$$
We suppose that the CD spectra are normally distributed about FB with the same general covariance matrix Σ, which will be inferred from the data. Furthermore, we assume no dependence between the spectra of different proteins. This leads to the error W taking the form of a matrix normal distribution:
$$W \sim \mathcal{MN}_{n \times n_\lambda}\!\left(0,\; I_n,\; \Sigma\right) \qquad (5)$$

and hence

$$C \mid F, B, \Sigma \sim \mathcal{MN}_{n \times n_\lambda}\!\left(FB,\; I_n,\; \Sigma\right) \qquad (6)$$

where $n = n_r + n_t$ is the total number of spectra, $I_n$ encodes the independence between proteins and $\Sigma$ captures the covariance across wavelengths within a spectrum.
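The matrix normal assumption in (5) and (6) simply means that each row of W is an independent draw from a multivariate normal with covariance Σ. A minimal simulation sketch, with all dimensions and values chosen arbitrarily for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)

n, n_s, n_lambda = 10, 3, 86          # proteins, structure classes, wavelengths
F = rng.dirichlet(alpha=np.full(n_s, 0.5), size=n)   # fraction matrix, rows sum to 1
B = rng.normal(size=(n_s, n_lambda))                  # characteristic ("basis") spectra

# A toy within-spectrum covariance: positive correlation between nearby
# wavelengths (an AR(1)-style structure, purely for illustration).
idx = np.arange(n_lambda)
Sigma = 0.2 * 0.9 ** np.abs(idx[:, None] - idx[None, :])

# W ~ MN(0, I_n, Sigma): rows are i.i.d. N(0, Sigma) draws.
L = np.linalg.cholesky(Sigma)
W = rng.normal(size=(n, n_lambda)) @ L.T

C = F @ B + W                          # simulated CD spectra, one per row
print(C.shape)                         # (10, 86)
```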
The joint prior distribution is specified through the factorisation

$$\pi(F_t, B, \Sigma) = \pi(F_t)\, \pi(\Sigma)\, \pi(B \mid \Sigma, F_t)$$
The conditional dependence structure of the model is shown in the ESI (Fig. S1†). For every protein in the test set we chose independent Dirichlet-distributed priors:
$$f_x \sim \mathrm{Dir}(\alpha), \quad \text{for } x = n_r + 1, \ldots, n_r + n_t, \qquad \text{with } \alpha = (1/2, \ldots, 1/2) \text{ of length } n_s.$$
The Dirichlet distribution is a natural choice because the parameter space for fx is the ns-dimensional simplex defined by the constraints (2) and (3).
For B and Σ we follow the common choice for Bayesian linear models and choose conjugate priors.20 The prior for B is the matrix normal distribution with mean M and covariance matrices U and Σ. In applications the matrix M was chosen to be the zero matrix, which is symmetrical with respect to positive and negative signals – reflecting left and right-handed chirality – and has a shrinkage effect. The matrix U represents the covariance between the secondary structure classes and a g-prior21 is chosen to account for possible relationships within the secondary structures:
$$U = g\,(F^{\mathsf{T}} F)^{-1}.$$
Following George and Foster,22 we set the hyper-parameter g = nr, the size of the reference set; this choice is referred to as the unit information prior.
The prior for the nλ × nλ covariance matrix Σ, representing the covariance structure within a CD spectrum, is the inverse-Wishart distribution. The inverse-Wishart distribution $W_n^{-1}(\delta, S)$, with degrees of freedom δ and scale matrix S, is the n-dimensional generalization of the inverse-gamma distribution.
In applications we chose δ = nλ and S = $I_{n_\lambda}$, the identity matrix, corresponding to no correlation and unit variances. In summary, the priors are:
$$f_x \sim \mathrm{Dir}(\alpha), \quad x = n_r + 1, \ldots, n_r + n_t \qquad (7)$$

$$B \mid \Sigma, F \sim \mathcal{MN}_{n_s \times n_\lambda}\!\left(M,\; g\,(F^{\mathsf{T}}F)^{-1},\; \Sigma\right) \qquad (8)$$

$$\Sigma \sim W^{-1}_{n_\lambda}(\delta,\; S) \qquad (9)$$
The full conditional distribution for the covariance matrix Σ is given by,
$$\Sigma \mid C, F_t \sim W^{-1}(\delta^{*},\; S^{*}) \qquad (10)$$
The conjugate priors for B and Σ allow us to integrate out these parameters and obtain a closed-form expression for the likelihood:
$$\pi(C \mid F_t) \;\propto\; \bigl|I_n + F U F^{\mathsf{T}}\bigr|^{-n_\lambda/2}\, \bigl|S^{*}\bigr|^{-\delta^{*}/2} \qquad (11)$$

where $\delta^{*} = \delta + n$ and $S^{*} = S + (C - FM)^{\mathsf{T}}\bigl(I_n + F U F^{\mathsf{T}}\bigr)^{-1}(C - FM)$; this is a matrix-variate t density whose normalising constant does not depend on F.
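Since the terms omitted from the proportionality in (11) do not depend on F, the F-dependent part of log π(C | Ft) is all that is needed to compare candidate secondary structure design matrices. The sketch below computes it under the standard matrix-normal/inverse-Wishart conventions assumed in the reconstruction above; the defaults mirror M = 0, U = g(F^T F)^{-1} with g = n_r, δ = n_λ and S = I, and all inputs are placeholder arrays.

```python
import numpy as np

def log_marginal_unnormalised(C, F, delta=None, S=None, M=None, g=None):
    """F-dependent part of log pi(C | F) after integrating out B and Sigma,
    assuming C | B,Sigma ~ MN(FB, I, Sigma), B | Sigma ~ MN(M, g*(F'F)^-1, Sigma)
    and Sigma ~ inverse-Wishart(delta, S)."""
    n, n_lambda = C.shape
    n_s = F.shape[1]
    g = n if g is None else g                        # unit information prior g = n_r
    delta = n_lambda if delta is None else delta
    S = np.eye(n_lambda) if S is None else S
    M = np.zeros((n_s, n_lambda)) if M is None else M

    V = np.eye(n) + g * F @ np.linalg.solve(F.T @ F, F.T)   # I_n + F U F'
    R = C - F @ M
    S_star = S + R.T @ np.linalg.solve(V, R)
    delta_star = delta + n

    _, logdet_v = np.linalg.slogdet(V)
    _, logdet_s = np.linalg.slogdet(S_star)
    return -0.5 * n_lambda * logdet_v - 0.5 * delta_star * logdet_s

# Comparing two candidate design matrices F1, F2 for the same spectra C only
# requires the difference of these values: the omitted constants cancel.
```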
The prior for Ft is not conjugate and so these parameters are updated using a Metropolis–Hastings step, for each protein individually. Proposals are generated using an adaptive Dirichlet random walk algorithm. If fx is the row within Ft for protein x, then the proposal is
$$f_x^{*} \sim \mathrm{Dir}(\lambda_x f_x) \qquad (12)$$

where $\lambda_x > 0$ is a concentration parameter that is tuned adaptively to achieve a suitable acceptance rate.
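A sketch of a single Metropolis–Hastings update of this form is given below. The parameter name `lam` and the exact proposal Dir(λ·f) are assumptions consistent with (12), not taken from the paper; `log_target` stands for the log of π(C | Ft) evaluated with the candidate row in place.

```python
import numpy as np
from scipy.stats import dirichlet

def dirichlet_rw_step(f_x, log_target, lam, rng, alpha=0.5):
    """One Metropolis-Hastings update of a fraction vector f_x using a
    Dirichlet random-walk proposal f* ~ Dir(lam * f_x)."""
    eps = 1e-8                                    # guard against zero concentrations
    f_prop = rng.dirichlet(lam * f_x + eps)

    # log posterior (up to a constant): log-likelihood + Dir(alpha,...,alpha) prior
    def log_post(f):
        return log_target(f) + dirichlet.logpdf(f, np.full(len(f), alpha))

    # the proposal is asymmetric, so include the Hastings correction
    log_q_forward = dirichlet.logpdf(f_prop, lam * f_x + eps)
    log_q_backward = dirichlet.logpdf(f_x, lam * f_prop + eps)

    log_accept = (log_post(f_prop) - log_post(f_x)
                  + log_q_backward - log_q_forward)
    if np.log(rng.uniform()) < log_accept:
        return f_prop, True
    return f_x, False
```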
The performance of an algorithm, i.e. how closely a secondary structure estimate matches the X-ray value taken as “truth”, is measured with two quantities commonly accepted in the CD literature: the root-mean-square deviation (RMSD) δ and the Pearson correlation coefficient r. For a protein x with estimated fractions $\hat f_x$ and true fractions $f_x$, they are defined as

$$\delta = \sqrt{\frac{1}{n_s}\sum_{i=1}^{n_s}\bigl(\hat f_{x,i} - f_{x,i}\bigr)^{2}}, \qquad r = \frac{\sum_{i=1}^{n_s}\bigl(\hat f_{x,i} - \bar{\hat f}_x\bigr)\bigl(f_{x,i} - \bar f_x\bigr)}{\sqrt{\sum_{i=1}^{n_s}\bigl(\hat f_{x,i} - \bar{\hat f}_x\bigr)^{2}\sum_{i=1}^{n_s}\bigl(f_{x,i} - \bar f_x\bigr)^{2}}} \qquad (13)$$

where the bars denote averages over the $n_s$ secondary structure classes.
As well as measuring global performance, it is interesting to know how the algorithm behaves for each secondary structure class. The RMSD δi and correlation coefficient ri, for class i, are calculated as in eqn (13), but with the sums taken over the proteins x in the cross-validation set. These measures are known in the literature as performance indices of the analysis and are used for comparisons between methods. A variety of more sophisticated performance measures also exist, which attempt to normalise by the amount of variation inherent within each class. One such measure is ζi = σi/δi,28 where σi is the standard deviation of the secondary structure fractions for class i in the reference set. If ζi is greater than 1 then the average error is smaller than that of a random choice from the reference set.
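A small sketch of these per-class measures, assuming `F_true` and `F_est` are arrays of true and estimated fractions for the cross-validation proteins (rows) and secondary structure classes (columns); the array names are placeholders.

```python
import numpy as np

def performance_by_class(F_true, F_est):
    """Per-class RMSD (delta_i), Pearson correlation (r_i) and the
    normalised measure zeta_i = sigma_i / delta_i over a cross-validation set."""
    delta = np.sqrt(np.mean((F_est - F_true) ** 2, axis=0))
    r = np.array([np.corrcoef(F_est[:, i], F_true[:, i])[0, 1]
                  for i in range(F_true.shape[1])])
    sigma = np.std(F_true, axis=0, ddof=1)   # spread of each class in the reference set
    zeta = sigma / delta                      # > 1 means better than a random reference choice
    return delta, r, zeta
```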
Our Bayesian approach produces a posterior distribution over secondary structures rather than a single estimated secondary structure vector. Unless otherwise stated we have used the posterior mean secondary structure as the estimated structure.
Each candidate secondary structure classification scheme defines a model, and once the marginal likelihood π(C|Mk) of each model Mk has been estimated, the posterior probability of model Mk is

$$P(M_k \mid C) = \frac{\pi(C \mid M_k)\, P(M_k)}{\sum_{j} \pi(C \mid M_j)\, P(M_j)} \qquad (14)$$
There is no closed form available for the marginal likelihoods for these models. However, due to the conjugate priors for B and Σ, we can integrate out these two parameters to obtain an expression for π(C|Ft), see eqn (11).
To obtain an estimate of the full marginal likelihood

$$\pi(C) = \int \pi(C \mid F_t)\, \pi(F_t)\, \mathrm{d}F_t,$$

we use an importance sampling estimator based on a proposal density q(Ft) fitted to posterior samples of Ft:
$$\hat{\pi}(C) = \frac{1}{N}\sum_{j=1}^{N} \frac{\pi\bigl(C \mid F_t^{(j)}\bigr)\, \pi\bigl(F_t^{(j)}\bigr)}{q\bigl(F_t^{(j)}\bigr)}, \qquad F_t^{(j)} \sim q(\cdot),\; j = 1, \ldots, N. \qquad (15)$$
It is desirable for q(Ft) to be over-dispersed relative to the true posterior, so that the variance of the importance sampling estimator is as small as possible. This can easily be achieved by replacing q(.) with a multivariate t-distribution or a mixture of the multivariate normal and the prior π(Ft). For full details of this methodology see Touloupou et al.29
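A generic sketch of the estimator (15) with a multivariate t proposal fitted to posterior samples is shown below. For simplicity it treats the parameter as an unconstrained vector; in the paper's setting Ft lives on a product of simplices, so draws falling outside the support simply receive zero prior weight. Function names such as `log_lik` and `log_prior` are placeholders.

```python
import numpy as np
from scipy.stats import multivariate_t
from scipy.special import logsumexp

def log_marginal_likelihood(posterior_samples, log_lik, log_prior,
                            n_draws=100_000, df=3, seed=0):
    """Importance-sampling estimate of log pi(C) using a multivariate t proposal
    (over-dispersed relative to a normal) fitted to posterior samples."""
    mean = posterior_samples.mean(axis=0)
    cov = np.cov(posterior_samples, rowvar=False)
    proposal = multivariate_t(loc=mean, shape=cov, df=df, seed=seed)

    draws = proposal.rvs(size=n_draws)
    log_w = np.array([log_lik(x) + log_prior(x) for x in draws])  # -inf off the support
    log_w -= proposal.logpdf(draws)

    return logsumexp(log_w) - np.log(n_draws)
```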
Table 1 Performance measures (RMSD δ and Pearson correlation r) for each method and secondary structure class.

| Method | α-helix δ | α-helix r | β-sheet δ | β-sheet r | Other δ | Other r |
|---|---|---|---|---|---|---|
| Bayesian | 0.061 | 0.96 | 0.127 | 0.77 | 0.137 | 0.65 |
| SELMAT3 | 0.063 | 0.96 | 0.083 | 0.86 | 0.078 | 0.70 |
| PCR | 0.057 | 0.97 | 0.069 | 0.91 | 0.066 | 0.80 |
| PLS | 0.053 | 0.97 | 0.073 | 0.90 | 0.068 | 0.78 |
| NN | 0.055 | 0.97 | 0.067 | 0.91 | 0.062 | 0.82 |
| SVM | 0.057 | 0.97 | 0.069 | 0.91 | 0.066 | 0.79 |
Table 2 (which is rotated with respect to Table 1) shows the leave-one-out cross-validation results using the SELCON secondary structure scheme, over the SP175 reference set. This time the Bayesian approach does not perform well, particularly for the classes turn and other. Results for the normalised measure ζ are given in the ESI,† where a value above one indicates an improvement over choosing at random from the reference set. For both SELMAT3 and PLS the value of ζ is just 1.04 for turn. The results of Table 2 and the normalised measure ζ in the ESI† indicate that there is not enough information, even in spectra down to 175 nm, to differentiate the 6 SELCON secondary structure motifs.
Table 2 Leave-one-out cross-validation results (RMSD δ and Pearson correlation r) for the SELCON secondary structure scheme over the SP175 reference set.

| Structure | Bayesian δ | Bayesian r | SELMAT3 δ | SELMAT3 r | PLS δ | PLS r |
|---|---|---|---|---|---|---|
| Regular helix | 0.090 | 0.836 | 0.048 | 0.956 | 0.040 | 0.971 |
| Distorted helix | 0.129 | 0.043 | 0.035 | 0.809 | 0.036 | 0.791 |
| Regular β-strand | 0.090 | 0.695 | 0.073 | 0.792 | 0.063 | 0.853 |
| Distorted β-strand | 0.281 | −0.081 | 0.020 | 0.913 | 0.023 | 0.889 |
| Turn | 0.201 | 0.098 | 0.052 | 0.325 | 0.052 | 0.332 |
| Other | 0.169 | 0.278 | 0.050 | 0.717 | 0.050 | 0.720 |
We hypothesise that the underlying cause of this is that in our Bayesian approach we do not perform variable selection to identify a subset of the reference set for each cross-validation step. Variable selection has been found to avoid inconsistencies between CD spectra and secondary structure schemes and to improve the quality of the analysis.13,27 We have chosen not to select a subset of proteins from the reference set that are similar to the test protein because we wanted to use all the information in the reference set. If an α-helix, for example, has a characteristic signal then it should be consistently present in the spectra of all proteins containing α-helices. We do believe that other protein features, such as distortions at the ends/joins, side chains and higher order structures, can obscure or modify this signal. These features are not necessarily represented in any of the secondary structure characterisation schemes. Bayesian analyses usually consider information from the whole data set and use the parameter uncertainty to weight the information content, rather than discarding information that does not fit the pattern.
Another consideration is that early, landmark papers that found a need for variable selection32,33 were using techniques such as partial least squares (PLS) to identify the basis vectors from a comparatively small number of reference proteins. Whilst PLS will always succeed in producing n basis vectors from n (linearly independent) protein measurements, it is not clear from this kind of analysis how many of the resulting vectors contain only contributions from the underlying signal. The variability inherent in the dataset will certainly dominate in the last basis vectors and (hopefully) the signal will dominate in the first few vectors, but it may be the case, especially when the number of proteins in the reference set is small, that the basis vectors in the middle are actually representing the variability in the dataset as much as clearly identified secondary structures. The immediate conclusion that could be made is that variable selection is important for making predictions using existing classification schemes. However, the apparent success of the variable selection approaches depends on having ‘like’ proteins in the reference set which is simply not always possible with unknown proteins.
Each potential secondary structure scheme is represented by a different design matrix of secondary structure fractions F. Given F, we can evaluate the marginal likelihood for the model analytically from eqn (11) since the test set is empty here. These marginal likelihoods can be used to produce Bayes factors comparing any pair of models and, once prior probabilities for each model have been specified, the posterior probability in favour of each secondary structure scheme can be identified. Following Scott et al. and Spencer et al.34,35 we adjusted for multiplicity caused by different numbers of structures in the different models by first assigning a prior distribution over the number of classes, and then dividing the mass equally amongst the models that have the same number of classes. In applications we chose a uniform prior over the number of classes between 3 and the maximum as summarised in Table 3.
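The sketch below illustrates this multiplicity adjustment: prior mass is spread uniformly over the number of classes and then divided equally among the models with that number of classes, before being combined with the log marginal likelihoods. The inputs are placeholder values, not the numbers from Table 3.

```python
import numpy as np
from scipy.special import logsumexp

def posterior_model_probs(log_marg_lik, n_classes):
    """Posterior model probabilities with the prior mass first spread uniformly
    over the number of classes and then split equally within each class count."""
    log_marg_lik = np.asarray(log_marg_lik, dtype=float)
    n_classes = np.asarray(n_classes)
    counts = {k: np.sum(n_classes == k) for k in np.unique(n_classes)}
    n_levels = len(counts)

    log_prior = np.array([-np.log(n_levels) - np.log(counts[k]) for k in n_classes])
    log_post = log_marg_lik + log_prior
    return np.exp(log_post - logsumexp(log_post))

# Placeholder example: four candidate models with 3, 3, 4 and 5 classes.
print(posterior_model_probs([-10.2, -11.5, -13.0, -16.7], [3, 3, 4, 5]))
```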
Table 3 Classification scheme models with posterior probability greater than 0.001. Posterior probabilities are computed within each scheme, and (final row) across all three schemes combined.

| Classification scheme | Posterior probability | Log marginal likelihood | Class 1 | Class 2 | Class 3 |
|---|---|---|---|---|---|
| DSSP | 0.510 | −7296.235 | α-helix | 3₁₀-helix + β-strand + bend | β-bridge + turn + other |
| DSSP | 0.448 | −7296.366 | α-helix | β-strand + bend | 3₁₀-helix + β-bridge + turn + other |
| DSSP | 0.028 | −7299.131 | α-helix + β-bridge | β-strand + bend | 3₁₀-helix + turn + other |
| DSSP | 0.013 | −7299.892 | α-helix + β-bridge | 3₁₀-helix + β-strand + bend | Turn + other |
| SELCON | 0.999 | −7319.677 | Regular α-helix | Regular β-strand | Distorted α-helix + distorted β-strand + turn + unordered |
| BeStSel | 1.000 | −7205.936 | Regular α-helix | Left β-strand + relaxed β-strand + parallel | Distorted α-helix + right β-strand + turn |
| DSSP, SELCON, BeStSel | 1.000 | −7205.936 | Regular α-helix | Left β-strand + relaxed β-strand + parallel | Distorted α-helix + right β-strand + turn |
We performed model comparison to find the most appropriate model within each of the three secondary structure schemes individually and also amongst all three schemes. Again, to avoid bias towards schemes with larger numbers of models, we first assigned a prior probability of one third to each scheme and then divided this prior mass amongst the models stemming from each scheme as before. Table 3 gives all the schemes with posterior probability greater than 0.001. For DSSP the posterior probability is largely split between two very similar models that differ only in where the 3₁₀-helix is included. For the SELCON scheme the best model included regular helix and regular β-strand as separate components and combined the remaining classes together. The best model for the BeStSel scheme included 3 basis spectra: regular α-helix; the sum of left β-strand, relaxed β-strand and parallel β-strand; and the remaining components. This model had by far the largest marginal likelihood and therefore it also dominates the comparison between the 3 classification schemes. Interestingly, under all three classification schemes just 3 basis spectra were needed to explain the data. All three included a single α-helix class in the best model, with a combined class including β-strand as the second basis spectrum. It may be concluded that no more than three distinct structures can be assigned from the data between 175 and 260 nm.
To investigate this further we performed a cross-validation using, for each of 3, 4 and 5 secondary structure classes, the BeStSel scheme with the highest marginal likelihood. Results for the normalised measure ζ are given in the ESI.† We repeated each analysis using data down to 175, 180, 185, 190, 195 and 200 nm. The results show that for all these ranges our approach can successfully identify 3 secondary structure classes, as the ζ values are all substantially above 1. This provides at least some evidence that our model selection approach, which favours the BeStSel secondary structure scheme, does so with good reason. The ζ values decay slightly, but not substantially, as the lowest 5 nm of the spectra are sequentially removed. However, when we try to infer more than 3 secondary structure classes there is always at least one class with a ζ value below one.
These results are an interesting contrast with the accepted literature consensus (e.g. ref. 36), which is that data to 200 nm contain two independent pieces of information (which, with the requirement that the components sum to 1, makes three pieces of information), data to 190 nm contain 3–4, and data to 178 nm provide 5. The origin of this consensus is a singular value decomposition approach based on 16 reference spectra performed in 1985 by Manavalan and Johnson.33 Our work indicates that although CD spectra to lower wavelengths do contain more information about protein structure, it cannot be translated into increasing numbers of traditional well-defined structures.
Fig. 1 Plots of fraction (out of a total of 1) of secondary structure motifs versus α-helical content for the proteins in reference set SP175 using the DSSP structure annotation.
Fig. 2(a) shows the three characteristic spectra that were estimated from the SP175 reference set: the thick lines are the posterior median of the three columns of B and the shaded area captures a 95% credible interval. In Fig. 2(b) we show an image representation of the posterior mode for Σ, given by S*/(nr + nλ + 1). Thus, we conclude that despite the tendency for α-helix and β-sheet to act like a see-saw (Fig. 1), they have distinct typical shapes that combine to give an observed spectrum.
Fig. 2(a) and (b) show that there is more uncertainty/variability at lower wavelengths and relatively little variability above 250 nm, where the spectra approach zero. The wide diagonal green band in Fig. 2(b) indicates a strong correlation between errors at similar wavelengths. Conversely, the red patches indicate a negative correlation between very low wavelengths and the middle of the spectrum: if the observed spectrum is lower than predicted at around 190 nm, for example, then it will tend to be higher than expected in the 210–230 nm region.
A key feature of our modelling approach is that we estimate covariance structure of a CD spectrum directly from the data, allowing the measurement uncertainty to change with wavelength and errors at similar wavelengths to be correlated. Most existing algorithms implicitly assume that the errors are uncorrelated so that Σ is forced to be a diagonal matrix.11–14,27 This forces errors at similar wavelengths to be uncorrelated, when in reality we expect them to be similar. Fig. 2(b) shows the estimated error structure of a spectrum and strong correlation structure is clearly present. Furthermore, the diagonal elements, representing variances, have a strong dependence on the wavelength, with larger variances at lower wavelengths. This feature is expected due to how the CD instrument works. At shorter wavelengths the high-tension voltage of the photomultiplier tube, which transforms the light signal into an electrical signal, is increased to compensate for the lower power of the light source. Thus, an increased variability in the low UV region data (λ < 200 nm) is a known feature of this kind of spectrum, but gives a worrying indication that the reference spectra are not as perfect as one might hope in the low wavelength region. Above 200 nm the covariance gradually fades to zero along the diagonal indicating higher quality data in this region.
Fig. 2(b) also shows that close wavelengths are positively correlated but more distant wavelengths are negatively correlated. Two potential mechanisms for generating a negative correlation structure are shown in Fig. 3. Negative correlation could be due to differences between the spectra of proteins in the reference set with similar overall secondary structure content. For example, if two proteins with close secondary structure display two similar spectra with a small scaling factor (concentration error) or a slight translation on the wavelength axis (poor wavelength calibration),37 these differences would lead to a fitted spectrum somewhere in between, and the residuals would be positively correlated over short wavelength separations but negatively correlated over larger separations as the spectra change sign or gradient. Another source of disagreement between the observed and fitted spectra could be a lack of fit of the model, which might stem, for example, from the difference between n helical residues being in 5 small helices or in 1 large one.27 Nevertheless, our methodology is able to identify these features of the CD data covariance structure, allowing it to properly weight the information when performing secondary structure estimation or spectral comparison.
Fig. 3 Schematic representation of two kinds of error that lead to the correlation structure observed in Fig. 2(b). In (a) a translation to the left by 2 nm and in (b) a rescaling of the spectrum by a factor of 0.7 (dashed line) was applied to the true spectrum (solid line). The true spectrum was bovine gamma-E protein (PDB entry 1m8u). The bottom-row plots show the difference between the true spectrum and the spectrum with error.
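A small simulation sketch of the two mechanisms in Fig. 3, using a synthetic two-band spectrum rather than the real 1m8u data: a 2 nm shift or a 0.7 rescaling produces residuals that change sign across the spectrum, which is exactly the kind of long-range negative correlation seen in Fig. 2(b).

```python
import numpy as np

wavelengths = np.arange(175, 261)                       # nm
# Synthetic CD-like spectrum: a positive band near 190 nm, negative near 220 nm.
def spectrum(wl):
    return 6.0 * np.exp(-((wl - 190) / 8.0) ** 2) - 4.0 * np.exp(-((wl - 220) / 10.0) ** 2)

true = spectrum(wavelengths)
shifted = spectrum(wavelengths + 2)                     # 2 nm calibration error
scaled = 0.7 * true                                     # 30% concentration error

# Residuals change sign across the spectrum, so errors at well-separated
# wavelengths move in opposite directions (negative correlation).
for name, err in [("shift", shifted - true), ("scale", scaled - true)]:
    print(name, np.sign(err[wavelengths == 190]), np.sign(err[wavelengths == 220]))
```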
Using eqn (14) and (15) we evaluate the marginal likelihood and the posterior model probability for the two models. As a model prior we chose P(Mi) = 1/2 for i ∈ {0,1}.
Under model M0 the spectra cx and cy are assumed to have the same secondary structure f, i.e. fx = fy. Here Ft has two identical rows, both equal to f. Under model M1 we obtain posterior samples for fx and fy, which form the two rows of Ft. For both models the marginal likelihood can be estimated with the importance sampling estimator (15). However, for model M0 the importance proposal must be fitted to posterior samples of just one row of Ft, and the second row is set equal to the first; whilst in M1 the two rows are sampled independently.
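Given the two estimated log marginal likelihoods, the posterior probability that the two spectra differ in secondary structure follows directly from (14) with equal prior weights; a one-line sketch with placeholder values:

```python
import numpy as np

def prob_structures_differ(log_ml_same, log_ml_diff):
    """Posterior probability of M1 (different structures) under P(M0) = P(M1) = 1/2."""
    return 1.0 / (1.0 + np.exp(log_ml_same - log_ml_diff))

# Placeholder log marginal likelihoods for M0 (same) and M1 (different).
print(prob_structures_differ(-1234.5, -1230.1))   # ~0.99: evidence for a difference
```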
We tested the method for comparing two CD spectra with two simulated examples. For our first spectrum cx, we took the spectrum of the sucrose porin protein (scrY, PDB ID: 1a0s) from the MP180 dataset.38 To obtain our second spectrum cy, we added noise to cx. First, we added white noise (Fig. 4, left column) with standard deviation equal to 0.12Δε. Second (Fig. 4, right column), we added multivariate Gaussian noise with zero mean and covariance matrix given by the posterior mode for Σ, representing the usual experimental variability. For both comparisons we fitted the competing models using 1000 iterations of the MCMC and then used 100 000 importance samples drawn from a t-distribution with 3 degrees of freedom. The model comparison with white noise suggests that the secondary structures are different (posterior probability of a difference 0.987). For the experimental noise, the model comparison favours the simpler model, that the two secondary structures are the same (posterior probability of a difference 0.033). Although the magnitude of the white noise is much smaller than the experimental variability (see Fig. 4, lower plots), the model comparison exercise has correctly identified that the white noise does not conform to experimental variability.
We have shown the importance of capturing the correct covariance structure within a spectrum. Our Bayesian approach accounts carefully for the uncertainty in the measurements as well as in the unknown basis spectra. Three basis spectra and their uncertainty envelopes were generated. The experimental uncertainty could be removed by replacing the smoothed, averaged reference and test spectra in our calculations with multiple individual repeats, preferably from different experiments and instruments. The result would be the spectral/structural variation for a revised Fig. 1(a). With replicate spectra the model could be further developed by adding a multivariate random effect associated with each protein, so that it would become possible to quantify the variation due to measurement error and poor model fit separately.
Whether our Bayesian method identifies two spectra as different depends strongly on the kind of noise in the two spectra, and we were able to distinguish between a typical CD error, inferred from the reference set, and another kind of noise, i.e. independent (white) Gaussian error. This comparison method has potentially wide applications, as a suitable tool to monitor structural changes in protein screening assays, production processes or drug discovery.
A second direction for future development would be to combine data from different techniques such as linear dichroism, infrared absorbance spectroscopy, Raman, NMR etc.3,39–46 that might provide orthogonal information about protein structure. By fully characterising the measurement uncertainty with each technique, our Bayesian approach provides a natural way to combine and to correctly weight information across techniques, unlike existing approaches28,47 in which the influence of each technique depends only on the relative numbers of points observed in each spectrum.
Footnote
† Electronic supplementary information (ESI) available. See DOI: 10.1039/d0ay01645d |