Jie
Li
ab,
Kochise C.
Bennett
ab,
Yuchen
Liu
ab,
Michael V.
Martin
c and
Teresa
Head-Gordon
*abcd
aPitzer Center for Theoretical Chemistry, University of California, Berkeley, CA 94720, USA. E-mail: thg@berkeley.edu
bDepartment of Chemistry, University of California, Berkeley, CA 94720, USA
cDepartment of Bioengineering, University of California, Berkeley, CA 94720, USA
dDepartment of Chemical and Biomolecular Engineering, University of California, Berkeley, CA 94720, USA
First published on 3rd March 2020
Here we report a new machine learning algorithm for protein chemical shift prediction that outperforms existing chemical shift calculators on realistic data that is not heavily curated, nor eliminates test predictions ad hoc. Our UCBShift predictor implements two modules: a transfer prediction module that employs both sequence and structural alignment to select reference candidates for experimental chemical shift replication, and a redesigned machine learning module based on random forest regression which utilizes more, and more carefully curated, feature extracted data. When combined together, this new predictor achieves state-of-the-art accuracy for predicting chemical shifts on a randomly selected dataset without careful curation, with root-mean-square errors of 0.31 ppm for amide hydrogens, 0.19 ppm for Hα, 0.84 ppm for C′, 0.81 ppm for Cα, 1.00 ppm for Cβ, and 1.81 ppm for N. When similar sequences or structurally related proteins are available, UCBShift shows superior native state selection from misfolded decoy sets compared to SPARTA+ and SHIFTX2, and even without homology we exceed current prediction accuracy of all other popular chemical shift predictors.
The heuristic chemical shift back-calculators are formulated as approximate analytical models such as shAIC,8 PPM,9 and PPM_One,10 empirical alignment-based methods such as SHIFTY11 and SPARTA,12,13 3D representations for machine learning of chemical shifts in solid state NMR for small molecules,14,15 and feature-based methods including SHIFTCALC,16 SHIFTX,17 PROSHIFT,18 CamShift,19 and SPARTA+;20 in the case of SHIFTX221 both alignment and features are utilized. Some of the most successful alignment-based methods rely on the fact that proteins with similar sequences will also share similar structures which lead to their exhibiting similar chemical shifts. This idea was first exploited by the program SHIFTY11 which “transferred” the chemical shifts from known sequences in the database to the query sequence based on a global sequence alignment. Higher accuracy was achieved in the formulation of SHIFTY+ by replacing the global sequence alignment with a local sequence alignment, and is included in the most recent chemical shift prediction program SHIFTX2.21 Alignment-based methods in general yield predictions with higher accuracy when a good sequence homologue is found in the database, and the constant increase in the number of sequences and associated chemical shifts deposited into the Biomolecular Magnetic Resonance Bank (BMRB),22 suggests that a similar sequence to the query sequence will continue to increase steadily.
On the other hand, methods that are based on sequence alignments will by definition fail if sequence similarity between the query sequence with any sequence in the database is too low, as well as the possibility that similar sequences can adopt very different structural folds.23 For query sequences with low sequence identity, the analytical or feature extraction methods predict secondary chemical shifts (i.e. from a random coil reference24) by providing data input formulated as hypersurfaces of structural data attributes such as backbone ϕ, ψ angles and hydrogen bonding derived from X-ray structures or calculated from quantum mechanics, or physical data observables such as ring currents,25,26 electric field effects,27 or Lipari–Szabo order parameters,28 that can be generated from easily parsed computational models. These feature extracted data are then used to establish empirical hypersurfaces such as used in SHIFTS7,29 and CamShift,19 or to train a machine learning algorithm in the cases of PROSHIFT,18 SPARTA+,20 and the SHIFTX+ component of SHIFTX2.21
In the evaluation of chemical shifts calculators like SPARTA+ and SHIFTX2, extensive effort has been put into making the test dataset as clean as possible. However, for chemical shifts of real-world proteins, large deviations from predicted values are typically considered as “outliers” and removed from the test dataset post-prediction.20 The definition of outliers can be arbitrary and results in higher test performance than operating on a real-world data set that may not necessarily be plagued by experimental errors. In this work we examine the current performance of feature extracted methods represented by SPARTA+, as well as the combination of sequence alignment and feature extracted method as implemented in SHIFTX2 on a randomly selected test dataset with minimal data filtering. First we assess the performance in terms of root mean square error (RMSE) with respect to experimental chemical shifts for SPARTA+ and SHIFTX2 when evaluated on a fully independent set of test proteins with high-resolution (<2.4 Å) X-ray structures. We use chemical shift data deposited in the BMRB, in which protein chemical shifts have been re-referenced with respect to high-resolution X-ray structures using RefDB developed by Wishart and co-workers.30 We also assess SPARTA+ and SHIFTX2 performance when eliminating putative outliers as determined by test data set filtering using PANAV,13 or removing test proteins with >30% sequence identity to the training set, which are dataset preparation steps that provide a more fair comparison to the two standard chemical shift calculators.
Furthermore, we show that higher accuracy for chemical shifts can be achieved with an enhanced hybrid algorithm, UCBShift, that makes predictions using machine learning on a more extended set of extracted features and transferring experimental chemical shifts from a database by utilizing both sequence and structural alignments. Although we find that we can realize better RMSEs if we also filter the different aspects of the test data, the resulting UCBShift chemical shift prediction method on all of the data including outliers yields RMSEs that will be at the level of 0.31, 0.19, 0.84, 0.81, 1.00 and 1.81 ppm for H, Hα, C′, Cα, Cβ and N respectively when evaluated on any independently generated test sequence.
The improved chemical shift performance of UCBShift can be utilized in several predictive contexts such as detection of erroneous chemical shift assignments and errors in reference shifts, to refine single folded structures31 or refinement of ensembles such as we have done in our Experimental Inferential Structure Determination Method (EISD)32,33 for folded and intrinsically disordered proteins (IDPs). To illustrate the usefulness of the UCBShift method, we consider the discrimination among alternative folds or selection of native structures among structural “decoys”. We determine that UCBShift is able to identify the native structure of two different proteins that comprise two different decoy data sets with certainty by examining the correlation between predicted and experimental chemical shifts, with significant improvement over SPARTA+ and SHIFTX2 prediction methods when sequence or structural homology is available.
Unlike previous incarnations of these data sets, which stripped out all presence of crystal waters and ligands, we generated a data set that retained the small molecules in the crystal structures. Our hypotheses is that for crystal waters especially, they often are highly conserved and functional, and are likely highly populated even in solution NMR experiments.34,35 Any reported hydrogen atoms in the Research Collaboratory for Structural Bioinformatics protein databank (RCSB or PDB) structures36 were removed and a systematic approach for adding a complete set of hydrogens used the program REDUCE60 to keep consistency in the structural data used across all approaches. To ensure more robust training, for each atom type, residues with chemical shifts deviating from the average by 5 standard deviations and residues that DSSP37 failed to generate secondary structure predictions were removed, which accounted for the removal of 40, 5, 18, 147 and 1 training examples for H, Hα, Cα, Cβ and N shifts, respectively. When stereochemically inequivalent Hα were present, their shifts were averaged. In the creation of data for each of the individual atom types, any residues that do not have recorded chemical shifts in the database were eliminated.
Before excluding redundant chains from the database, there were altogether 852 proteins in the training dataset. Duplicate chains were identified and excluded from our dataset: two chains are regarded as duplicates if the sequences and their structures are exactly the same, or eliminating the shorter sequence if it is a sub-sequence of a longer sequence (which is kept). However, 32 chains in the SPARTA+ dataset were retained because although they had identical sequences, they were found to have different structures and thus different chemical shifts. After excluding the duplicate chains by this prescription, the number of protein chains in the training dataset decreased to 647. The filtering of the training dataset based on RCI-S238 in principle excludes flexible residues whose chemical shifts are harder to predict. We did not exclude training data based on RCI-S2 as was done in some other chemical shift predictors, because a complete training set that covers different prediction difficulties is crucial for obtaining reliable performance for real-world applications. Table 1 reports the total number of training data examples for each of the different atom types. In addition, the RefDB database,30 which is a database for re-referenced protein chemical shifts assignments extracted and corrected from BMRB, was also compiled for the alignment-based chemical shifts prediction.
# of PDBs | H | Hα | C′ | Cα | Cβ | N | |
---|---|---|---|---|---|---|---|
Train | 647 | 72894 | 56149 | 58228 | 79611 | 70621 | 74896 |
Test | 200 | 19120 | 11727 | 8231 | 13140 | 10139 | 15374 |
Test (curated) | 200 | 18494 | 11240 | 7861 | 12533 | 9883 | 14610 |
LH-Test | 100 | 8634 | 4979 | 3332 | 5685 | 4278 | 6576 |
LH-Test (curated) | 100 | 8606 | 4950 | 3331 | 5251 | 4201 | 6480 |
Since the training dataset in Table 1 covers all of the data from SPARTA+ and SHIFTX2, a separate test dataset needs to be prepared for a fair comparison of all of the chemical shift programs. Therefore, 200 proteins with high-resolution (<2.4 Å) X-ray structures and with chemical shifts available in the RefDB were selected at random to form a separate test set that do not share the exact same sequence as training structures. These structures were downloaded from RCSB and again hydrogens were added with the REDUCE algorithm.60 Erroneous chemical shifts assignments were removed from this test dataset, which include 9 (H, Cβ, and N) chemical shifts that were significantly offset from the random coil average; 8 Cβ chemical shifts from cysteines that show strong disagreement with their expected chemical shifts under their oxidation state in the crystal structures, and all C′ chemical shifts from 3 proteins that are anti-correlated with predictions from SPARTA+, SHIFTX2 and UCBShift (see ESI† for details). It is essential to remove these chemical shifts from the test set because of evidence for the existence of experimental or recording errors in these data; but no further processing was done on the test set so that it is a good representation of a “real-world” application.
A more carefully “curated” test dataset based on these 200 proteins was also prepared, which additionally exclude paramagmetic proteins, some Hα chemical shifts that have calculated ring current effect exceeding 1.5 ppm, “outliers” detected by the PANAV program,13 and chemical shifts corrected by PANAV that are different from their original values by more than 0.3 ppm for hydrogens, 1.0 ppm for carbons and 1.5 ppm for nitrogen. These additional test filters are similar to the procedures used by SPARTA+ and SHIFTX2 in preparing their test datasets.20,21 A complete list of the 200 testing proteins are given in the ESI (Table S3†). As is inevitable, some of these 200 proteins share high sequence identity with some of the training data, so we also generate test datasets after filtering out proteins with >30% sequence identity to yield a low-homology test set (LH-Test) with 100 test proteins (Table 1).
For the UCBShift-Y module, a query sequence is first aligned with all sequences in the RefDB database using the local BLAST algorithm,39 and the PDB files for all sequences generating significant matches are further aligned with the query PDB structure using the mTM-align algorithm.40 The alignments were further filtered to only keep those alignments that have TM score greater than 0.8 and an RMSD with the query protein structure that is smaller than 1.75 Å. For each of the aligned PDB sequences, its best alignment with the RefDB sequence is determined using the Needleman–Wunsch alignment.41 If the residues are exactly the same, the shifts from RefDB are directly transferred to the target; otherwise, the secondary chemical shifts from RefDB are transferred to account for the different chemical shift reference states for different amino acids. To be more specific, the target shift for atom A and residue I is calculated from the matching residue J:
δI,A = δrc,I,A + (δJ,A − δrc,J,A) | (1) |
When multiple significant structural alignments exist for a given residue, the secondary shifts from these references are averaged with an exponential weighting wI,
wI = e5(SNA × STM) + BIJ × 1(BIJ ≥ 0) | (2) |
• 20 numbers representing the score for substituting the residue to any other amino acid, and taken from the BLOSUM62 substitution matrix42
• Sine and cosine values of the ϕ and ψ dihedral angles at the residue. Taking the sine and cosine values of the dihedral angles prevents the discontinuity when the dihedral angle goes from +180° to −180°. For the undefined dihedral angles, for example the ϕ angle of a residue at the N-terminus and the ψ angle of a residue at the C-terminus, both the sine and cosine values were set to zero.
• A binary number indicating whether χ1 or χ2 dihedral angles for the side chain exists (existence indicator), and the sine and cosine values of these angles when they are defined for the same reasons described for backbone dihedral angles.
• Existence indicators and geometric descriptors for the hydrogen bond between the amide hydrogen and carboxyl oxygen, and between the Cα hydrogen and a carboxyl group (so called α-hydrogen bonds). For each position in the query residue that hydrogen bonds can form, a group of five numbers describe the properties of the hydrogen bond: a boolean number indicating its existence, the distance between the closest hydrogen bond donor–acceptor pair, the cosine values for the angles at the donor hydrogen atom and at the acceptor atom, and the energy of the hydrogen bond calculated with the DSSP model.37 For the query residue, all hydrogen descriptors for amide hydrogen, carboxyl oxygen and α hydrogen are included, but only the carboxyl oxygen features are included for the previous residue, and the amide hydrogen features for the next residue. These add up to 25 hydrogen bond descriptors for any given residue.
• S2 order parameters calculated by the contact model28
• Absolute and relative accessible surface area produced by the DSSP program.
• Hydrophobicity of the residue by the Wimley–White whole residue hydrophobicity scales.43
• Ring current effect calculated by the Haigh–Mallion model.25,26 For each training model for a specific atom type, the ring current for that atom type is included, while the ring currents for other atom types are excluded from the feature set.
• The one-hot representation of the secondary structure of the residue produced by DSSP program (composed of eight categories)
• Average B factor of the residue extracted from the PDB file.
• Half-sphere exposure of the residue44
• Polynomial transformations of some of the residue-specific features, such as the hydrogen bond distances (dHB), by including dHB2, dHB−1, dHB−2, dHB−3, and the squares of the cosine values of the dihedral angles are also included as additional features. These polynomial quantities have been found to be correlated with secondary chemical shifts, and have occurred in several empirical formulas for calculating chemical shifts.3,45
Unlike SPARTA+ and SHIFTX+, we have developed a pipeline with an extra tree regressor46 followed by random forest regressor47 as the machine learning based predictor shown in Fig. 1. Both the extra tree regressor and random forest regressor are ensembles of tree regressors that split the data using a subset of the features, and make ensemble-based predictions via a majority vote. However, extra tree regressors split the nodes in each tree randomly by selecting an optimal cut-point from uniformly distributed cut-points in the range of the feature, while the random forest regressors calculate the locally optimal cut in a feature by comparing the information entropy difference before and after the split. The random forest regressor learns based on the predictions from the first tree regressor and all the other input features, which can be regarded as a variant of the boosting algorithm,48 since it learns from the mistakes the first predictor makes. The pipeline was first optimized using the TPOT tool49 with 3-fold cross validation on the training set, and all the parameters were fine-tuned using a temporal validation dataset with 50 structures randomly selected from the training set. Because tree-based ensemble models are robust to the inclusion of irrelevant features,50 feature selection was not performed. A more detailed analysis of the feature importance will be given in the Results.
Algorithmically, two separate random forest (RF) regressors are trained. The first RF regressor (R1) only accepts features extracted from the structure and the prediction from the extra tree regressor, and the second RF regressor (R2) additionally takes the secondary shift output from UCBShift-Y, together with additional scores and coverage indicating the quality of the alignments, and is trained using only a subset of the training data for which UCBShift-Y is able to make a prediction. Based on the availability of UCBShift-Y predictions, the final prediction of the whole algorithm is generated either by R1 (when no UCBShift-Y predictions are available) or R2 (when UCBShift-Y is able to make predictions). Finally, the random coil reference values are added back to the prediction to complete the total chemical shift prediction, i.e. the predictions are calculated with
(3) |
Dataset | Test | LH-Test | ||||
---|---|---|---|---|---|---|
Atom type | SPARTA+ | SHIFTX2 | UCBShift | SPARTA+ | SHIFTX2 | UCBShift |
a Reported SPARTA+ values from ref. 20: 0.49 ppm for H, 0.25 ppm for Hα, 1.09 ppm for C′, 0.94 ppm for Cα, 1.14 ppm for Cβ, and 2.45 ppm for N. b Reported SHIFTX2 values from ref. 51: 0.17 ppm for H, 0.12 ppm for Hα, 0.53 ppm for C′, 0.44 ppm for Cα, 0.52 ppm for Cβ, and 1.12 ppm for N. | ||||||
H | 0.51 ± 0.003 | 0.44 ± 0.003 | 0.31 ± 0.003 | 0.49 ± 0.004 | 0.49 ± 0.003 | 0.45 ± 0.004 |
Hα | 0.27 ± 0.002 | 0.23 ± 0.003 | 0.19 ± 0.002 | 0.27 ± 0.003 | 0.26 ± 0.003 | 0.26 ± 0.003 |
C′ | 1.25 ± 0.01 | 1.16 ± 0.01 | 0.84 ± 0.01 | 1.16 ± 0.01 | 1.20 ± 0.01 | 1.14 ± 0.01 |
Cα | 1.16 ± 0.01 | 1.05 ± 0.01 | 0.81 ± 0.01 | 1.13 ± 0.02 | 1.15 ± 0.01 | 1.09 ± 0.01 |
Cβ | 1.35 ± 0.02 | 1.27 ± 0.03 | 1.00 ± 0.03 | 1.36 ± 0.05 | 1.37 ± 0.06 | 1.34 ± 0.05 |
N | 2.72 ± 0.02 | 2.40 ± 0.02 | 1.81 ± 0.02 | 2.73 ± 0.02 | 2.73 ± 0.03 | 2.61 ± 0.02 |
Dataset | Test (curated) | LH-Test (curated) | ||||
---|---|---|---|---|---|---|
Atom type | SPARTA+ | SHIFTX2 | UCBShift | SPARTA+ | SHIFTX2 | UCBShift |
H | 0.49 ± 0.002 | 0.42 ± 0.002 | 0.30 ± 0.002 | 0.48 ± 0.003 | 0.47 ± 0.003 | 0.43 ± 0.003 |
Hα | 0.26 ± 0.002 | 0.22 ± 0.002 | 0.18 ± 0.002 | 0.26 ± 0.003 | 0.25 ± 0.002 | 0.24 ± 0.003 |
C′ | 1.15 ± 0.009 | 1.06 ± 0.009 | 0.77 ± 0.008 | 1.16 ± 0.01 | 1.19 ± 0.01 | 1.13 ± 0.01 |
Cα | 1.09 ± 0.008 | 0.98 ± 0.009 | 0.76 ± 0.009 | 1.08 ± 0.01 | 1.10 ± 0.01 | 1.04 ± 0.01 |
Cβ | 1.17 ± 0.009 | 1.09 ± 0.02 | 0.82 ± 0.01 | 1.15 ± 0.01 | 1.15 ± 0.01 | 1.12 ± 0.02 |
N | 2.59 ± 0.02 | 2.25 ± 0.02 | 1.71 ± 0.01 | 2.67 ± 0.02 | 2.66 ± 0.02 | 2.55 ± 0.02 |
SHIFTX2 is seen to outperform SPARTA+ for chemical shift RMSE for all atom types on the curated dataset when there is high sequence homology for which it was designed, and it performs comparably to SPARTA+ on the curated data for target sequences with low sequence similarity to the training data. However, we find that the actual performance on curated data set is less accurate than the reported performance of the SHIFTX2 method.21 One possible explanation is that a sequence similarity analysis revealed that out of the original 61 testing proteins of SHIFTX2, 4 proteins had 100% sequence alignment with a protein in the training dataset, sometimes under different identification numbers (Table S5†). This problem of training data leakage into the testing data of the original SHIFTX2 method could be a non-trivial source of the better performance of SHIFTX2 reported in the literature. The protein-specific average RMSE error and the scatter plots for the SHIFTX2 predicted and experimental shifts are also given in Fig. S2 and S3† on the uncurated test dataset.
By comparison we find that filtering of the test set for outliers that disagree with the predictions, the elimination of paramagnetic proteins, and removing test shifts for hydrogen due to potentially inaccurate and large ring currents effects has more limited effect on prediction performance. To illustrate, the distributions of absolute errors from SPARTA+ for paramagnetic proteins and diamagnetic proteins in the Test dataset are shown in Fig. S4.† The error distributions are not that different for H, Hα, Cβ, and N, and while paramagnetic proteins show higher prediction errors than diamagnetic proteins for the C′ and Cα data types, they are not egregious errors.
We find that UCBShift outperforms SPARTA+ and the SHIFTX2 algorithm for chemical shift prediction RMSE when tested on the uncurated test data set, the more carefully curated test data, and regardless of the level of sequence homology. The protein-specific average RMSE error and the scatter plots for the UCBShift predicted and experimental shifts are also given in Fig. S2 and S3† on the uncurated Test dataset. Therefore UCBShift is more accurate for real-world applications, where the types of proteins may be more diverse than the test sets for SPARTA+ and SHIFTX2. In order to understand the improved performance of UCBShift in particular, we analyze the components of the algorithm including the UCBShift-X and UCBShift-Y modules, as well as the importance of the extracted features, utilizing the full test set of 200 proteins in more detail below.
We hypothesized that a structure-based alignment followed by sequence alignment would be more reliable since it would (1) allow for transferring shifts from structurally homologous proteins with low sequence identity, while also (2) ensuring that the transferred chemical shift values are not from a protein that has high sequence similarity but low structural homology with the query protein. This is confirmed in Fig. 2 which plots the difference of the RMSE on amide hydrogen chemical shift prediction between our UCBShift-Y and SHIFTY+ as a function of sequence identity. Plots for other atom types are given in Fig. S5.† Here the sequence identity is defined as the ratio of the number of matched residues to either the length of the query sequence or the length of the matched sequence, whichever is longer. Furthermore, the UCBShift results are reported with a specifically designed “test mode” which will not utilize sequences with more than 99% identity with the query sequence for making the prediction; this practice ensures the testing performance is a more realistic reflection of the actual performance when operating on input data which is not included in the search database. It is evident that on average predictions on query sequences with low sequence similarity but high structural homology are greatly improved with UCBShift-Y.
A particularly interesting example is the prediction for adenylate kinase (PDB ID: 4AKE)52 and its mutant (PDB ID: 1E4V),53 both of which are identical but for a single substitution of a valine for a glycine residue at position 10 (Fig. 3). Even with such high sequence identity, these two proteins adopt quite different tertiary structures with a backbone RMSD of 7.08 Å as can be seen from the overlay of their two structures in Fig. 3a. Hence while the experimental chemical shifts for these two proteins have a root-mean-square difference (RMSD) of 0.38 ppm for amide hydrogen shifts overall, the maximum H chemical shift difference is much larger at 1.34 ppm and is reflected in the surprisingly lower correlation (R-value = 0.86) between the amide hydrogen shifts for two proteins given the high sequence similarity (Fig. 3b). Therefore when using SHIFTY or SHIFTY+ for the 1E4V query sequence, the best sequence match will be 4AKE, thus increasing the chemical shift prediction error due to the huge structural deviation between the two proteins. Instead when using our UCBShift-Y module it selects two alternative proteins, 1AKE and 2CDN, which share an average sequence similarity of only 67% with the query protein. The correlation between the predicted 1E4V amide hydrogen shifts with UCBShift-Y which chooses references based on structural alignment and the experimental values are given in Fig. 3c, raising the R-value to 0.94 and lowering the RMSE to 0.25 ppm.
Fig. 3 Analysis of the transfer prediction module for UCBShift which uses sequence and structural alignment. (a) Structural alignment of adenylate kinase (4AKE, orange) and the mutant G10V of adenylate kinase (1E4V, purple). (b) Correlation between experimental chemical shifts of the amide hydrogen for 4AKE and 1E4V. (c) Correlation between predicted amide hydrogen chemical shifts using UCBShift with experimental values. In this case structural alignments instead of sequence alignments were used for selecting references for the transfer prediction. |
Fig. S6† summarizes the results of UCBShift-Y vs. SHIFTY+ for chemical shift prediction for all atom types, validating that the structural alignments successfully found better reference proteins for the query protein which improved the overall prediction quality. In comparison with SHIFTY+, all atom types other than carboxyl carbon are improved in accuracy; although predictions for the carboxyl carbons are at the same level of accuracy as SHIFTY+, the failure to improve this atom type with UCBShift-Y is likely due to the lower number of chemical shifts available for transfer prediction for this atom type. Finally we note that our UCBShift-Y can be used as a standalone chemical shift predictor when sequence and structural alignments exist and have available experimental chemical shifts.
The RMSE of the pipeline with extra tree regressor and random forest regressor but without inputs from UCBShift-Y (R1) between the predicted chemical shifts and the observed shifts is summarized in Table 3 and named UCBShift-X. It is found to be statistically better for all the atom types when compared with SPARTA+, or the SHIFTX+ component of SHIFTX2, which also use no sequence and/or structural alignments. The overall performance of the UCBShift-X machine learning module is promising, and it also can be used as a reliable standalone predictor for chemical shifts, especially when no faithful alignment is found using UCBShift-Y.
UCBShift components | H | Hα | C′ | Cα | Cβ | N |
---|---|---|---|---|---|---|
UCBShift-X (R1) | 0.44 | 0.25 | 1.17 | 1.08 | 1.28 | 2.49 |
UCBShift-Y | 0.21 | 0.17 | 0.64 | 0.57 | 0.67 | 1.25 |
ML with UCBShift-Y input (R2) | 0.19 | 0.15 | 0.66 | 0.57 | 0.70 | 1.23 |
UCBShift (utilizing both R1 and R2) | 0.31 | 0.19 | 0.84 | 0.81 | 1.00 | 1.81 |
If we consider using the R2 module (which is trained using only a subset of the training data for which UCBShift-Y is able to make a prediction), the errors of some atom types further decrease (Table 3). Interestingly, the averaged RMSE from R2 for H, Hα and N is even smaller than the average RMSE of UCBShift-Y, indicating that the second ML module is doing better than just combining the results from UCBShift-Y and from the first level machine learning module R0 for these atom types. But given the uncertainties in sequence and structural alignments or the lack of chemical shift data for UCBShift-Y, both the R1 and R2 machine learning modules are utilized to yield the final UCBShift algorithm and results for chemical shifts as given in Table 3 for all the six atom types.
Feature categories | R 0 | R 1 | R 2 |
---|---|---|---|
Backbone dihedral angles | 0.22 | 0.11 | 0.04 |
Transformed features | 0.23 | 0.11 | 0.04 |
Secondary structure | 0.17 | 0.005 | 0.001 |
BLOSUM numbers | 0.11 | 0.02 | 0.005 |
Hydrogen bond | 0.11 | 0.06 | 0.02 |
Half surface exposure | 0.05 | 0.06 | 0.008 |
Ring current | 0.04 | 0.03 | 0.005 |
Sidechain dihedral angles | 0.03 | 0.03 | 0.005 |
Atomic surface area | 0.02 | 0.01 | 0.002 |
B Factor | 0.008 | 0.01 | 0.002 |
S 2 order parameters | 0.006 | 0.01 | 0.002 |
Hydrophobicity | 0.002 | 0.001 | 0.0002 |
pH values | 0.001 | 0.002 | 0.001 |
Prediction from R0 | N/A | 0.53 | 0.19 |
UCBShift-Y prediction | N/A | N/A | 0.67 |
UCBShift-Y metrics | N/A | N/A | 0.01 |
The UCBShift-Y prediction as well as the prediction from R0 are the dominant factors for the R2 model; this result indicates the network is indeed trying to differentiate situations when UCBShift-X predictions are more reliable and when they are not so accurate in comparison to R0 predictions, as well as based on the other structure-derived features. Therefore, using machine learning to combine the predictions from feature-based prediction and alignment-based prediction is a better strategy than doing a weighted average of the two predictions. Finally features such as hydrophobicity and pH values, and to some extent B-factors and S2 order parameters, seem to play a minor role in predictive capacity of the ML module.
Fig. 4 Average correlation coefficients between predicted chemical shifts of decoyed structures and observed chemical shifts versus Cα RMSD between decoyed structures and native structure for PDB 1CTF (a and b) and 1HFZ (c and d). Results are visualized as UCBShift-X compared to SPARTA+ (a and c) and UCBShift compared to SHIFTX2 (b and d). |
The decoy dataset for PDB structure 1CTF was obtained from the Decoy ‘R’ Us database56 which contains the native structure and 630 structures with a range of 1.3–9.1 Å in the α-carbon RMSD, and the decoy dataset for PDB structure 1HFZ generated using 3DRobot57 which contains the native structure and 298 structures with a range of 0.4–4.2 Å in the α-carbon RMSD. Using UCBShift-X alone has similar discriminative power for the native state as SPARTA+, and predicted chemical shifts for lower RMSD structures also tend to have better correlation with experimental values using UCBShift-X (Fig. 4a and c). The complete UCBShift method shows greater discriminative power for the native state than found with either SPARTA+ or SHIFTX2, in which we can differentiate between structures within experimental resolutions (<2 Å) against unlikely structures more easily, while still retaining the highest correlation for the (experimental) native structure (Fig. 4b and d).
Although the performance of these algorithms are much better when applied to carefully curated test data, the filtering out of test data risks the inability to distinguish between a poor prediction from a poor experimental chemical shift value. Large outliers would certainly result from the wrong random coil reference for Cβ shifts due to ambiguous cysteine oxidation states, or single whole proteins which exhibit many chemical shift outliers for particular atom types, and should not be considered a failure in algorithmic performance, but a problem of the experimental data. However, further test filtering can start to become arbitrary as we move from deviant to suspicious to acceptable experimental agreement with the prediction; one can't have it both ways. Thus in this paper we have provided a realistic range of test performance since scientists use these chemical shift predictors on real-world data that may differ from the original training datasets such that the algorithms do not generalize well-i.e. some measure of disagreement with experiment may just simply be prediction error. As such Table 2 provides a more realistic range of test reliability for all three methods.
Although we have realized noticeable improvement over other protein chemical shifts predictors, we believe we are reaching the limit of accuracy by using extracted feature from structures or transfer predictions through alignments. Deep learning may be helpful in the next step since it can operate directly on 3D data representations without the potential bias introduced by features extracted by human experts,58,59 as we and others have shown recently for chemical prediction in the solid state,14,15 in which we greatly improved prediction for all atom types and approached chemical accuracy on par with ab initio calculations for hydrogen in particular. The ability to move to 3D representations will be important for QM chemical shift predictions for intrinsically disordered proteins, since feature extracted data will be less available and likely less representative for this class of protein, and the results can be ensemble averaged to provide a prediction that can be compared against solution NMR experimental data for structural ensemble refinement as we have shown for IDPs.32,33
Footnote |
† Electronic supplementary information (ESI) available: The exclusion of erroneous chemical shifts assignments in the test dataset and complete compilation of the test dataset. Evaluation of SPARTA+, SHIFTX2 and UCBShift performance on test datasets. Performance analysis of UCBShift-Y in comparison with SHIFTY+ and learning curves for the random forest models. See DOI: 10.1039/c9sc06561j |
This journal is © The Royal Society of Chemistry 2020 |