A neural network learning approach for improving the prediction of residue depth based on sequence-derived features

Renxiang Yan*ab, Xiaofeng Wangc, Weiming Xua, Weiwen Caia, Juan Linab, Jian Lid and Jiangning Song*def
aSchool of Biological Sciences and Engineering, Fuzhou University, Fuzhou 350108, China. E-mail: yanrenxiang@fzu.edu.cn
bFujian Key Laboratory of Marine Enzyme Engineering, Fuzhou 350002, China
cCollege of Mathematics and Computer Science, Shanxi Normal University, Linfen 041004, China
dInfection and Immunity Program, Biomedicine Discovery Institute, Monash University, Melbourne, VIC 3800, Australia. E-mail: jiangning.song@monash.edu
eMonash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
fNational Engineering Laboratory for Industrial Enzymes, Key Laboratory of Systems Microbial Biotechnology, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China

Received 11th May 2016 , Accepted 12th July 2016

First published on 12th July 2016


Residue depth is a solvent exposure measure that quantitatively describes the depth of a residue from the protein surface. It is an important parameter in protein structural biology. Residue depth can be used in protein ab initio folding, protein function annotation, and protein evolution simulation. Accordingly, accurate prediction of residue depth is an essential step towards the characterization of the protein function and development of novel protein structure prediction methods with optimized sensitivity and specificity. In this work, we propose an effective method termed as NNdepth for improved residue depth prediction. It uses sequence-derived features, including four types of sequence profiles, solvent accessibility, secondary structure and sequence length. Two sequence-to-depth neural networks were first constructed by incorporating various sources of information. Subsequently, a simple depth-to-depth equation was used to combine the two NN models and was shown to achieve an improved performance. We have designed and performed several experiments to systematically examine the performance of NNdepth. Our results demonstrate that NNdepth provides a more competitive performance when compared with our previous method evaluated using the Student t-test with a p-value < 0.001. Furthermore, we performed an in-depth analysis of the effect and importance of various features used by the models and also presented a case study to illustrate the utility and predictive power of NNdepth. To facilitate the wider research community, the NNdepth web server has been implemented and seamlessly incorporated as one of the components of our previously developed outer membrane prediction systems (available at http://genomics.fzu.edu.cn/OMP). In addition, a stand-alone software program is also publicly accessible and downloadable at the website. We envision that NNdepth should be a powerful tool for high-throughput structural genomics and protein functional annotations.


1. Introduction

Protein structures contain essential clues to decipher their biological functions at the molecular level.1 During the past decades, considerable efforts have been made to develop efficient computational methods for accurate prediction and modeling of protein structures.2–6 Protein structural information is essential for functional annotation and characterization.7–9 In general, there exists three levels of protein structure prediction, i.e. 1D, 2D and 3D.10,11 The detailed definitions of these three levels of structural parameters can be found in the work of Rost and his co-workers.12 The 1D structure prediction is to predict the structural features of proteins along the one-dimensional sequences.13 The 2D structure prediction aims to predict the relationship between amino acid residues of proteins, such as residue–residue contacts (including short- and long-range residue contacts).14 More recently, an orientation-dependent residue contact number termed as half-sphere exposure (HSE) was proposed by Hamelryck,15 which has proved useful in protein structure prediction.16–19 The 3D structure prediction is to predict the three-dimensional structure of proteins.20 The 1D properties of each residue are features along the one-dimensional protein sequence. For example, protein secondary structure, backbone torsion angles, residue depth and solvent accessibility are representative 1D features (see ESI file 1 for representative 1D and 2D properties). Accordingly, accurate prediction of these 1D properties is crucial for protein folding and functional studies, and represents an important step towards the ultimate 3D structure modeling. Encouraging progress has been made in sequence-based 1D structure prediction over the last few decades.16,21,22

Among these, protein secondary structure is one of the most important 1D properties. PSIPRED is a highly accurate method for protein secondary structure prediction, which was developed based on two feed-forward neural networks.21 It has achieved an average Q3 accuracy of >80% when assessed on large datasets in recent years. In reality, PSIPRED has been widely used by a number of CASP-winning23 3D fold recognition techniques, which include MUSTER,24 HHSearch,25 RAPTOR,26 FFAS-3D,27 and SPARK3,28 in order to improve the accuracy of sequence-to-structure alignments. In addition, the two backbone torsion angles (BTAs) along a protein chain, also known as the Ramachandran plot,29 are also important 1D properties, which describe the rotations of the polypeptide backbone around the bonds between N–Cα (i.e. phi angle) and Cα–C (i.e. psi angle), respectively. The BTAs provide useful information for protein 3D structure modeling.30 Thus, extensive efforts have been devoted to develop BTA predictors. For instance, ANGLOR,31 TANGLE32 and SPINE-X33 are three competitive BTA predictors. The predicted torsion angle values by BTA predictors have been proved useful for ab initio protein structure assembly.34

The solvent accessible surface area, also known as accessible surface area (ASA), is a solvent exposure measure that describes the surface area of a biomolecule accessible to a solvent.35 As most buried residues are usually evolutionarily conserved, ASA can be used as a useful parameter for protein function prediction.36 Accurate prediction of ASA has proved useful for fold recognition, conformational change, and protein stability prediction.37 Nevertheless, ASA cannot be used to distinguish residues that are located slightly below the surface of a protein from those that are located in the core of the protein. For both cases, the ASA values of those residues that are completely buried inside the protein structure and located below the protein surface are equal to zeros, however, their actual depth values may be considerably different. To address this problem, Chakravarty and Varadarajan proposed a novel parameter called residue depth (RD) for a quantitative measurement of the depth value of a residue in a protein structure.38

RD is also regarded as a solvent exposure measure. In contrast to ASA, RD can accurately measure the degree of inaccessibility of a given residue buried inside a protein,38 which complements the information provided by ASA. It has been proven that ASA is negatively correlated with RD.38 In addition to providing insights into the organization of 3D structure, RD has also been utilized in the identification of ligand binding pockets39 and prediction of pKa of ionizable groups.40 The RD values are zeros for solvent-accessible residues, and greater than zeros for buried residues in the protein interior. Furthermore, it has been reported that deeply buried residues are conserved, since they are usually located in the core of proteins and are expected to contribute more to the thermodynamic stability of proteins.41

Considering the importance of RD, it is necessary to develop reliable prediction tools for RD. To date, several elegant algorithms for RD prediction have been proposed. In 2008, Yuan and Wang developed a novel RD prediction method (Y–W method) using PSI-BLAST42 profiles and protein length.43 Also in the year, Zhang et al. proposed an effective method called RDpred44 to predict RD values by combining secondary structure, residue position, protein size and sequence profiles in a support vector machine (SVM) model. Later in 2009, Song et al. developed a more complicated method termed as Prodepth45 by incorporating new structural descriptors, such as predicted solvent accessibility, disordered regions, and molecular weight. Despite the availability of these methods, the prediction performance of RD is not satisfactory and new algorithms are still urgently desirable, especially in the current post-genomic era. Apparently, more efforts are needed to develop novel and specialized methods to address this challenging task. Nevertheless, the progress for developing new RD prediction tools has been slow at least in the last three years.

We have recently developed a program called PSSM-2-Features46 for predicting protein secondary structure, solvent accessibility and backbone angles at the residue level. This program can also be used in RD prediction. However, it only used the PSSM and PSFM sequence profiles as the features and fed into a single neural network model for RD prediction. Previously, this program has been benchmarked and achieved an MAE score of 0.062 for RD prediction. Thus, its performance of RD prediction is not satisfactory and there remains a possibility to further its performance.

In the present work, we develop a neural network learning-based approach to significantly improve the prediction performance of RD by incorporating additional informative features. This new method compares favorably with several previously developed methods. Furthermore, a stand-alone software program is implemented and made available, which can be used as a useful tool for high-throughput prediction and annotation of a large number of sequences in local computers.

2. Materials and methods

2.1. Datasets

In this study, two previously collected datasets46 were used for model training and testing, which contained 6675 and 1073 proteins, respectively. These two datasets were named as PDB_TRAIN6675 and SCOPe_TEST1073 (http://genomics.fzu.edu.cn/OMP/benchmarks), respectively. Proteins in the PDB_TRAIN6675 dataset were dissimilar to all proteins in the SCOPe_TEST1073 dataset (with the BLAST47 e-value > 0.001). In addition, we also performed the independent test using the PDB_CS6001 dataset, which was a none-redundant subset of the PDB_TRAIN6675 dataset with the sequence identity <30% between any protein pairs. PDB_CS6001 was also generated in our previous work by removing the sequence redundancy with the BLASTCluster program. The advantage of using these previously compiled datasets is that it offers an easy and straightforward way to directly compare the predictive performance between our method and previous methods.

2.2. Residue depth calculation

All proteins in the training and test datasets have available 3D structural information. The EDTSurf48 program was used to calculate the RD values for residues in these proteins. The raw RD values output by EDTSurf falled within the range of [2.8, 9.8], where a higher value indicates that the residue is more deeply buried in the protein structure. Similar to the scoring term in FFAS-3D,27 the raw RD values were normalized to the range of [0, 1] using the following equation
 
image file: c6ra12275b-t1.tif(1)
where RD(i) is the normalized value and dv(i) is the original absolute RD value of the residue i calculated by the EDTSurf48 program. The neural network models will be subsequently trained to predict the normalized RD values of all residues in the target proteins.

2.3. Neural network learning and input features used for model training

The neural network (NN) algorithm was implemented using the Encog package (https://code.google.com/p/encog-java/downloads/list) in this work. To train neural network models, input features should generally be extracted from the protein sequence, such as the amino acid compositions, and PSSM profiles obtained from a PSI-BLAST search. In principle, the prediction performance of neural network methods depends on the amount and quality of useful information contained in the extracted features used for model training. Meanwhile, the corresponding parameters of neural networks also need to be fine tuned and optimized to achieve the best performance. In the present study, the following features are used to develop effective neural network models for improved RD prediction.
2.3.1. Sequence profiles.
(a) PSSM and PSFM profiles. The NCBI non-redundant (NR) database was downloaded from the website ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz. The obtained sequences were further clustered at a cutoff of 90% sequence identity (under the ‘global alignment’ mode) using the CD-HIT program.49 To generate the sequence profiles, each query sequence was threaded against the filtered NCBI NR database using PSI-BLAST42 by three iterations with an e-value threshold of 0.001. The profile generated by the option ‘-Q’ is usually called PSSM (Position Specific Scoring Matrix). In addition, it should be noted that there exists another matrix generated in the same PSSM output file, which contains the frequency of twenty amino acids at each residue position. This frequency matrix was termed as PSFM (Position Specific Frequency Matrix). Both PSSM and PSFM profiles were used as the input features in this work.
(b) MTX profile. In addition to the PSSM and PSFM profiles, a check point profile can also be generated by the PSI-BLAST search with the option of ‘-C’. This check point profile is a binary version of the PSSM profile and has been used in several programs such as Rps-BLAST (Reverse PSI-BLAST) and Impala.50 To generate readable data from the binary file, we used a similar way to FORTE51 through exploiting the makemat program of the BLAST package to convert PSI-BLAST checkpoints into score matrices. Here, this score matrix was named as the MTX profile. Although the PSSM and its binary check point profile are generated from the same source, they are slightly different. The major difference between the PSSM and MTX profiles is that MTX contains two floating point values, thereby providing more accurate descriptions than the PSSM. Considering that certain elements of both PSSM and MTX profiles are negatively valued, we directly scaled all values of both PSSM and MTX profiles to the range of [0, 1] using the following standard logistic function
 
image file: c6ra12275b-t2.tif(2)
where x is the raw value (i.e. the input value), while y is the normalized value (i.e. the output value). The normalized values of both PSSM and MTX profiles were consistently used in this study.

(c) SP profile. The PSI-BLAST program uses the Henikoff and Henikoff weight52 scheme to calculate sequence profiles from multiple sequence alignments (MSAs). The Henikoff and Henikoff weight is a position-based algorithm, where the weights are calculated based on the diversity observed at each position in the MSA. Therefore, the PSSM, PSFM and MTX profiles generated by PSI-BLAST can be regarded as position-based profiles. They can be classified as the same type of sequence profiles although they are slightly different. Here, we also employed a sequence-based weight, which differs from the position-based weight, to generate a complementary sequence profile. We used a simple way to calculate the weight as
 
image file: c6ra12275b-t3.tif(3)
where wi is the weight value of sequence i, idi is the sequence identity between the target and the ith identified protein. idi was calculated based on the alignment using the PSI-BLAST program. Then, we used the following equation to calculate 20 amino acid frequencies of each residue using the generated MSA as
 
image file: c6ra12275b-t4.tif(4)
where fu,r is the amino acid frequency of residue r in the column u, N is the number of sequences in the MSA, wi is the weight value of the sequence i, δu,ri is set to 1 if the sequence i contains the residue r in the column u, and 0 otherwise. For unaligned regions, only the target sequence itself was used to calculate the amino acid frequencies. This generated profile was named as the SP profile (Sequence-based weight scheme Profile). It is noteworthy that the details of the construction and calculation procedures of these profiles can be found in the Perl codes of our stand-alone program which has been made publicly available at http://genomics.fzu.edu.cn/OMP/public_NNdepth.tar.bz2.
2.3.2. Secondary structure profiles. Three types of protein secondary structure predictions were used and examined. The first type is the secondary structure prediction by PSIPRED,21 while the other two types were generated by PSSM-2-Features trained using the secondary structure data defined by DSSP35 and STRIDE,53 respectively. Although these three methods examined here are all NN-based predictors, they have different architectures which might have different implications on the RD prediction performance.
2.3.3. Relative solvent accessibility profiles. The solvent accessibility of an amino acid residue in a protein describes the extent of such residue accessible to the solvent (often water) surrounding the protein.35 In our previous work, the relative solvent accessibility (RSA) of each residue was calculated by dividing its accessible surface area by the maximum surface area of that residue type. The PSSM-2-Features program was trained to predict the RSA values using the neural network learning framework. In this study, the RSA results predicted by the PSSM-2-Features program were used as the input features to train the NNdepth models for RD prediction.
2.3.4. Protein sequence lengths. The protein sequence length has been shown to be correlated with the average residue depth of a protein38 to some extent and also used as a useful feature in previous RD prediction.45 In the present study, two different encoding schemes in terms of the protein sequence length were investigated: (1) the raw protein length was directly divided by 2000 and then the value was used as the input score, and (2) the raw protein length was transformed into four binary units in the same way as suggested by Rost and Sander54 as follows
 
image file: c6ra12275b-t5.tif(5)
where s is the raw protein sequence length. The first and second encodings of the sequence length were named L1 and L2, respectively.
2.3.5. Residue position. Zhang et al. have previously observed that residues at the termini are often located at or close to the surface44 of the protein structure. Accordingly, they proposed the following parameter to predict RD
 
image file: c6ra12275b-t6.tif(6)
where i is the position index of the residue and L is the sequence length. With this parameter, the position information of each residue can also be used as an input feature. This residue position encoding was thus named rp in this work.
2.3.6. Residue conservation. We also took into account the residue conservation by adopting a simple strategy. First, we calculated the entropy value of each residue as
 
image file: c6ra12275b-t7.tif(7)
where fi,r is the frequency of the rth residue at the position i in the PSFM profile. We then calculated the conservation score for each residue as
 
image file: c6ra12275b-t8.tif(8)
where entropy(i) is the entropy value of residue i calculated using eqn (7) and CS(i) is the conservation score of the residue i. If a position is highly conserved (for example, there exists only one type of conserved amino acids occurring at this position), the entropy should be equal to 0. The entropy is close to 2.996 if the residue is highly variable (i.e. the frequencies of twenty amino acids at that position are distributed equally). The CS values fall within the range of [0, 1], where a higher score indicates a more conserved status of that residue.

2.4. Training of RD predictors

The training procedure of a RD predictor can be regarded as a regression problem. We used different combinations of input features to train neural network models, and examined their performance difference in predicting the real RD values. These input features were combined and assessed to develop a hybrid RD predictor with an improved performance.

2.5. Performance assessment measure

In order to assess the prediction performance of RD predictors, we use two primary measures, namely the Mean Absolute Error (MAE) and Pearson's Correlation Coefficient (PCC). MAE and PCC are two quantities commonly used to measure how close the predicted values generated by the models are to the observed original values. MAE is calculated as follows
 
image file: c6ra12275b-t9.tif(9)
where fi is the prediction score, yi is the true value and n is the number of residues. PCC is given by
 
image file: c6ra12275b-t10.tif(10)
where n is the number of samples (i.e. the number of residues in all query sequences). Similar to eqn (9), fi and Yi denote the predicted and original values of residue depth for residue i, respectively.

3. Results and discussions

3.1. The performance of the NNdepth models based on different feature combinations

The prediction models of NNdepth were intensively trained using the PDB_TRAIN6675 dataset and strictly tested on the SCOPe_TEST1073 dataset. It should be noted that the parameters of models were optimized on the PDB_TRAIN6675 dataset (i.e. the optimal parameters of neural networks were determined using the PDB_TRAIN6675 dataset), while the SCOPe_TEST1073 dataset was used as an independent test dataset to assess the performance of models trained on the PDB_TRAIN6675 dataset. We found that after incorporation of evolutionary information (in terms of the sequence profiles), the NNdepth model demonstrated to be very effective for RD prediction. For example, the model achieved an MAE value of 0.061 and a PCC value of 0.611 when using only the PSSM and PSFM features as the input. Thus, we grouped different models according to the combinations of the profiles, e.g. PSSM + PSFM and MTX + SP profiles. The corresponding performance of the models trained using different profile combinations is listed in Table 1. As can be seen, the Net1 model constructed by combining PSSM, PSFM, sequence length, secondary structure, residue position and solvent accessibility, resulted in an MAE of 0.055 and a PCC value of 0.638, respectively. The Net2 model constructed by combining MTX, SP, sequence length, secondary structure, residue position and solvent accessibility, resulted in an MAE of 0.054 and a PCC value of 0.643, respectively. In addition, different models may reflect different aspects of RD prediction and can be complementary with each other to a certain extent. As such, combination of these trained models might help to further improve the prediction performance. Indeed, by combining the Net1 and Net2 models using a simple linear equation (i.e. image file: c6ra12275b-t11.tif) within the final RD prediction system, the overall performance of NNdepth was considerably improved. The performance improvement is clearly manifested in Table 1 (MAE = 0.053 and PCC = 0.646). The Net1 and Net2 models can be regarded as sequence-to-depth regression predictors, and the linear equation is a depth-to-depth predictor. Compared with the Net2 model, the performance improvement is slight after combining Net1 and Net2. However, when the combined model was constructed, it would take less than one second to generate the prediction score. This is because the neural network-based models have been already trained and it would only take little time to calculate the final scores using the trained models. Therefore, we think that the computational complexity is not a concern here.
Table 1 Prediction performance of different NNdepth models evaluated on the SCOPe_TEST1073 dataset by combining different feature combinations
# Model MAE PCC
a PSSM, PSFM, L1, SS, RSA and CS stand for position-specific scoring matrix (PSSM), position-specific frequency matrix (PSFM), the sequence length (L1), three probabilities of secondary structures (SS), relative solvent accessibility (RSA) and conservation score (CS), respectively. The Net1 and Net2 models were further integrated to construct a hybrid predictor with slightly improved performance. The prediction results of PSSM-2-Features are directly cited from our previous work.46
PSSM + PSFM profiles
1 PSSM + PSFM 0.061 0.611
2 PSSM + PSFM + L1 0.056 0.633
3 PSSM + PSFM + L1 + SS 0.055 0.636
4 (PSSM + PSFM + L1 + SS + RSA + rp + CS)Net1,a 0.055 0.638
[thin space (1/6-em)]
MTX + SP profiles
1 MTX + SP 0.060 0.606
2 MTX + SP + L1 0.060 0.606
3 MTX + SP + L1 + SS 0.055 0.632
4 (MTX + SP + L1 + SS + RSA + rp + CS)Net2,a 0.054 0.643
[thin space (1/6-em)]
Combined Model (Net1 + Net2)
1 Model 1 + Model 2 0.053 0.649
[thin space (1/6-em)]
Our previous work
1 PSSM-2-Featuresa 0.062 0.597


In addition, we also tested a single NN model by incorporating all used features (e.g. PSSM + PSFM + MTX + SP), but found that the performance could not be improved when compared with the combined model (Net1 + Net2). This is probably because the information contained in PSSM, PSFM, MTX and SP is redundant and the dimension of the input features is very high when combining all features in a single NN. The Net1 and Net2 models can be regarded as two sub-optimal models and complement with each other. Therefore, the performance can be further improved when combining the two models. In general, a method with a higher correlation is more useful and machine learning-based models are likely to achieve relatively lower MAEs. For example, both Net1 and Net2 generated higher PCC than PSSM-2-Features and they also had relatively lower MAEs than PSSM-2-Features. However, there also exist cases where a method may generate a higher PCC value, but still have worse MAE values. The reason might be that different methods and feature descriptors have their own advantages in balancing the two performance measures. Such cases will be discussed in the following Section 3.2.

3.2. Comparison with well-established RD predictors

In this section, we first compared NNdepth with our previous work46 (i.e. the PSSM-2-Features method). As shown in Table 2, when evaluated using the 30-fold cross-validation benchmark on the PDB_CS6001 dataset (a non-redundant subset of PDB_TRAIN6675 with the sequence identity of less than 30% between any protein pair), NNdepth attained a PCC value of 0.618 and an MAE value of 0.077, which was better than the PSSM-2-Features method (PCC = 0.533, and MAE = 0.083). When benchmarked on the SCOPe_TEST1073 dataset, NNdepth still achieved a better performance (PCC = 0.646, and MAE = 0.053 by NNdepth vs. PCC = 0.597, and MAE = 0.062 by PSSM-2-Features). The improvement might be attributed to the use of effective and complementary features. For example, the Q3 accuracy of PSIPRED for the secondary structure prediction is approximately 80% on both training and test datasets. Meanwhile, we performed Student t-test to examine the statistical significance of the NNdepth method. The null hypothesis of the t-test here is that there is no statistical difference in the distribution of MAEs per protein for the compared two methods. As a result, the p-value associated with this null hypothesis is less than 0.001, which means that the prediction error generated by the NNdepth method was significantly different from that of the compared PSSM-2-Features method. As a comparison, we tested the SABLE55 program for RSA prediction on both PDB_CS6001 and SCOPe_TEST1073. The SABLE program achieved PCC = 0.643, and MAE = 0.170, and PCC = 0.607 and MAE = 0.186 on the two datasets, respectively. These results suggest that different predictors show different performance when tested on different datasets.
Table 2 The mean absolute error (MAE) and Pearson's correlation coefficient (PCC) of RD
Property MAE PCC
a The results were tested on the SCOPe_TEST1073 dataset.b The results were tested based on the PDB_CS6001 dataset.
NNdeptha 0.053 0.646
NNdepthb 0.077 0.618


Furthermore, it is necessary to compare our method with other state-of-the-art methods. In fact, all the three methods, i.e. Y–W, RDpred and Prodepth, used the PSSM and other features to train their respective models based on the support vector regression (SVR) algorithm. Song et al.45 previously compiled a dataset of 489 protein chains and tested the performance of these three methods on such dataset. In this study, we used the same 489 protein chains in order to directly compare our method with others. To avoid over-fitting, we removed the similar proteins to these 489 proteins from our training dataset and re-trained NNdepth models. As a result, NNdepth achieved a PCC of 0.722 and an MAE of 3.35 (raw depth value) on these 489 protein chains. This was a comparable performance compared with all other three methods (Y–W achieved PCC = 0.64 and MAE = 1.41, RDpred achieved PCC = 0.68 and MAE = 1.36, while Prodepth achieved PCC = 0.71 and MAE = 1.28 (ref. 45)). In particular, the MAE measure of NNdepth was worse than other methods while the PCC value of NNdepth was better than them. Some methods have their own advantages on lowering MAE, while others may have advantages on improving PCC. Note that the performance of the three methods was directly cited from the reference of Song et al.45 benchmarked on the same dataset. In fact, it was difficult to train the models using the SVM program (e.g. libsvm and svmlight) based on our dataset (i.e. PDB_TRAIN6675 dataset) considering that the input feature (e.g. PSSM, PSFM) files exceeded 10G. The model trained using the SVM algorithm is different from the model trained using neural networks. During neural network training, we dynamically read some samples from the hard disk and update the weights using these samples each time. This procedure would proceed until all samples had been used. The details of dynamical use of the training samples can be found at the website of https://en.wikipedia.org/wiki/Backpropagation. In the training procedure of our NN models, the size of the training dataset was much bigger than the previous methods. In addition, the residue depth values in the PDB structures calculated by different programs were also different. This is probably because different solvent surface defined by different programs. Therefore, benchmark performance in this section may only serve as a coarse-grained assessment rather than as a precise and fair comparison.

On one hand, the performance improvement of NNdepth can be attributed to the incorporation of heterogeneous and informative profile- and structural-based features. On the other hand, as the amount of training data rapidly increases, the performance of machine learning-based algorithms would typically increase. The training data set was much larger than that used to train other methods (see the ref. 43–45 for details), which might be an important factor for the performance improvement of our method. Meanwhile, the neural network models were intensively trained, and accordingly, an optimized performance can be achieved. In addition, it is also worth mentioning that the NNdepth is a solely sequence-based method. Other structural template-based features, such as structural fragments that can be extracted from the Protein Data Bank56 an used to further enhance the performance of the NN models, were not explored in this study. This is a limitation of the current NNdepth method.

3.3. Further analysis of residue depth prediction

We calculated the RD values for all residues in the SCOPe_TEST1073 dataset and displayed their distribution in Fig. 1. As can be seen the figure, the majority of amino acid residues had RD values of less than 0.6. The average depth and standard deviation values for the 20 amino acid residue types are listed in Table 3 and their respective prediction errors are given in Table 4. Based on the results in Table 3 and Fig. 1, we developed two baseline predictors. The first predictor returned a prediction score for each residue by using equation mean(r) + rand × std(r), in which mean(r) and std(r) denote the mean and standard deviation scores of the r type of the residue, while rand is a random number in the range of [−1, 1]. The second predictor returned a random number between 0 and 0.3. However, both predictors generated low PCC values of less than 0.35. The results by the two predictors suggest that depth values could not be directly predicted by single amino acid measures.
image file: c6ra12275b-f1.tif
Fig. 1 Distribution of residue depth values on the SCOPe_TEST1073 dataset.
Table 3 The average depth (mean) and standard deviation (std) values for 20 amino acid residue types
Residue Mean Std
A 0.144 0.149
C 0.162 0.141
D 0.076 0.059
E 0.075 0.051
F 0.194 0.153
G 0.098 0.121
H 0.109 0.088
I 0.201 0.168
K 0.085 0.046
L 0.186 0.155
M 0.168 0.158
N 0.087 0.074
P 0.100 0.082
Q 0.087 0.066
R 0.087 0.054
S 0.095 0.099
T 0.111 0.104
V 0.189 0.166
W 0.173 0.126
Y 0.155 0.118


Table 4 The average mean absolute error (MAE) and Pearson's correlation coefficient (PCC) between the predicted and actual RD values for 20 amino acid types
Residue MAE PCC
A 0.069 0.626
C 0.079 0.573
D 0.028 0.537
E 0.023 0.509
F 0.087 0.543
G 0.053 0.555
H 0.042 0.522
I 0.091 0.592
K 0.022 0.476
L 0.084 0.585
M 0.076 0.627
N 0.032 0.534
P 0.034 0.506
Q 0.029 0.540
R 0.028 0.489
S 0.043 0.573
T 0.047 0.579
V 0.087 0.602
W 0.075 0.484
Y 0.063 0.549


The MAE measure was further analyzed according to three-state protein secondary structure (Table 5). As shown, the error of residues located within coiled regions was the smallest (MAE = 0.035), followed by α-helix residues (MAE = 0.055) and β-strand residues (MAE = 0.084). This observation is consistent with the previous results reported by Song et al.45

Table 5 Prediction performance according to the three-type secondary structures
SSa MAE PCC
a ‘H’ represents the α-helix, ‘E’ denotes the β-strand, and ‘C’ stands for coil.
H 0.055 0.635
E 0.084 0.626
C 0.035 0.502


Furthermore, we searched the depth values of all residues included in the SCOPe_TEST1073 dataset against the AAindex57 database which contained 544 physicochemical and biochemical properties of 20 amino acid types. As a result, we found that RD values manifested a good correlation with contact and hydrophobicity indices, e.g. AAindex entries BASU050103, BASU050101 and CIDH920104.

3.4. Selection of more effective features and parameter optimization

A critical question that should be addressed in this study is how to select the optimal parameters for training NN models, including the optimal window size of local sequence profiles, the numbers of hidden layers, the number of nodes in the hidden layers, iteration times, learning rate and momentum.

To address this, we used the grid search algorithm to identify the optimal values of these parameters. During the grid search, we simply searched each parameter through a manually specified range of parameter values. The values of hidden nodes were set to the range of [50, 900]. The window sizes of the local sequence profiles were examined in the range of [7, 25]. Certain residues at the N- or C-terminus cannot be extracted with the full window. To overcome this problem, an extra unit per amino acid was used to indicate whether the residue spans either the N- or C-terminus of the protein chain. For a region that spans the N- or C-terminus, its feature values, including conservation parameters, would be set to zeros and the value of the additional bit would be set to 1, otherwise the value of the bit would be set to 0. The range of learning rate of NN models was set as [0.001, 0.01] and the momentum was set as [0.80, 0.90], respectively. As a result, we identified the optimal performance with the corresponding interval range for each parameter. For example, Net1 and Net2 obtained their optimal performance when 300 and 400 nodes were respectively used in the hidden layers.

The detailed procedures of how to optimize the parameters of the models can be found in the ESI file 2. Chakravarty and Varadarajan showed the RSA and RD measures fit well to the exponential function.38 Therefore, we used a simple exponential function to map RSA to RD as RD = exp(−RSA), where exp is the exponential function. Interestingly, the PCC value of RSA and RD based on this equation is 0.623. Combining PSSM and PSFM profiles resulted in a PCC value of 0.611. Similar results can be obtained when using the MTX and SP profiles. This suggests that the RSA measure is better correlated with RD than sequence profiles. When the sequence length feature was incorporated, both MAE and PCC of prediction were further improved (MAE was improved from 0.061 to 0.056 and PCC improved from 0.611 to 0.633, respectively). As discussed above, two different encodings of sequence lengths were examined. The prediction performance based on the L1 encoding of sequence length was slightly better than that based on L2 (data not shown). Combining the encodings of both L1 and L2 could not further improve the performance. In addition, we found that features such as protein secondary structure, solvent accessibility, residue conservation values and residue positions could be used to further improve the performance. When all these effective features were integrated, the prediction performance was significantly improved by approximately 10%.

In addition, the secondary structure profiles trained by PSSM-2-Features and predicted by PSIPRED were assessed. The PSIPRED prediction results were found to be more effective than that of the PSSM-2-Features. To seamlessly incorporate the PSIPRED method in our program, we implemented the neural network framework of PSIPRED using the Java programming language. Then, we used the neural network weights of PSIPRED. All source codes of this implementation are publicly available at http://genomics.fzu.edu.cn/OMP/pub/.

3.5. The web server and stand-alone program of NNdepth

To facilitate the wider research community, the NNdepth web server has been implemented and seamlessly incorporated as a component of our previously developed outer membrane prediction system (available at http://genomics.fzu.edu.cn/OMP). The web server was developed using the programming languages of Perl, Java and HTML. The stand-alone program was developed using Java. Perl scripts were used to execute the Blast program and parse the text. A user can submit multiple sequences (up to 50) to the web server at the same time. Upon the submission, a traceable numbers will be provided to track and visualize the prediction results. The flow chart of our algorithm is illustrated in Fig. 2. The query sequence will be first threaded through the NCBI NR database for three iterations with an e-value threshold of 0.001. Second, the generated sequence profiles will be fed into the trained NNdepth models. Finally, the output values by the models will be combined to refine the final prediction results. The output of the web server includes the normalized values and the converted raw RD values (i.e. the distance of residue from the protein surface). In addition, we have also implemented a stand-alone program of NNdepth, which can be readily executed in local computers to facilitate high-throughput prediction of RD values for a large number of sequences. The most time-consuming part of the RD prediction by NNdepth is generating sequence profiles by PSI-BLAST. This process usually takes 3–5 minutes for a typical protein sequence with 500 amino acids. After that, the prediction procedure by NNdepth can be accomplished within 10–20 seconds.
image file: c6ra12275b-f2.tif
Fig. 2 Flowchart of the NNdepth method. First, a query protein sequence is iteratively threaded through the local NCBI NR database for three repeats with an e-value threshold of 0.001 to generate sequence profiles and multiple sequence alignments. The PSSM and PSFM profiles are directly generated with the option of ‘-Q’. A check point file is generated with the option of ‘-C’. Then, the readable text MTX profile is obtained from the check point file by using the makemat program of BLAST package. Meanwhile, a multiple sequence alignment is generated with the option of ‘-o’. The SP profile is generated from the multiple sequence alignments. The PSSM and PSFM profiles are combined to train a neural network model. Similarly, the MTX and SP profiles are also combined to train a neural network model. Finally, the average of two predicted RD values (the equation image file: c6ra12275b-t12.tif) is generated as the output of the two neural network models (i.e. Net1 and Net2).

3.6. Case study of RD prediction

To illustrate the predictive capability of NNdepth, we performed a case study by applying NNdepth to predict the RD value of the formylmethanofuran dehydrogenase subunit e-like protein with the sequence length of 147 residues. The prediction result is shown in Fig. 3. The name of the protein was denoted by the SCOP entry (i.e. d2glza1). The predicted and observed RD values were colored by blue and red, respectively. As can be seen, there exists a similar trend between the observed and predicted RD profiles. This protein was relatively well predicted with a PCC value of 0.651 and an MAE of 0.052. It is worth mentioning that this case study is provided here to showcase a realistic application of NNdepth, rather than providing a performance comparison between different methods.
image file: c6ra12275b-f3.tif
Fig. 3 The observed and predicted residue depth profiles for the formylmethanofuran dehydrogenase subunit e-like protein (SCOP entry: d2glza1). The predicted and observed residue profiles are colored as blue and red, respectively. The regions predicted with absolute errors were colored. The color scale is from blue to red, where blue corresponds to the prediction with a low error value and red corresponds to the prediction with a high error value. The 3D structural images and color bar are visualized using the PyMOL program.

4. Conclusions

In this work, we have developed an improved RD prediction approach termed as NNdepth by combining multiple sources of information and taking into consideration different and complementary sequence profiles and physicochemical features. Benchmarking experiments indicate that NNdepth has significantly outperformed our previously developed PSSM-2-Features method with a p-value < 0.001 by the t-test and compared favorably with three other existing methods. The results suggest that NNdepth has a great potential to serve as an effective tool in real applications. A user-friendly web server and stand-alone program of NNdepth have been made freely accessible. In addition, we also anticipate that this method will not only facilitate the development of more accurate RD prediction methods with better performance, but also help to further our understanding of the protein sequence–structure–function paradigm.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

RY and JS defined the research problem. RY wrote the codes, developed the web server, and drafted the manuscript. XW, JL and WX participated in the method assessment. JL contributed to the data analysis and critically revised the manuscript. JS directed the research and critically revised the manuscript. All authors read and approved the final manuscript.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (31500673, 61202167, 61303169), Education and Science Foundation for Young teachers of Fujian (JA14049), Start-Up Fund of Fuzhou University (XRC-1336), and Science Development Foundation of Fuzhou University (2013-XY-17 and 2014-XY-15) and Hundred Talents Program of the Chinese Academy of Sciences (CAS). JL is an NHMRC Senior Research Fellow. JS is a recipient of the Hundred Talents Program of CAS.

References

  1. R. Yan, X. Wang, L. Huang, J. Lin, W. Cai and Z. Zhang, Mol. BioSyst., 2014, 10, 2495–2504 RSC.
  2. S. Bhattacharya, S. Lee, R. Grisshammer, C. G. Tate and N. Vaidehi, J. Chem. Theory Comput., 2014, 10, 5149–5160 CrossRef CAS PubMed.
  3. N. Poorinmohammad and H. Mohabatkar, Journal of arthropod-borne diseases, 2015, 9, 116–124 Search PubMed.
  4. H. K. Ho, L. Zhang, K. Ramamohanarao and S. Martin, Methods Mol. Biol., 2013, 932, 87–106 CAS.
  5. S. Srihari and H. W. Leong, J. Bioinf. Comput. Biol., 2013, 11, 1230002 CrossRef PubMed.
  6. N. Mansour, F. Kanj and H. Khachfe, Interdiscip. Sci.: Comput. Life Sci., 2012, 4, 190–200 CrossRef CAS PubMed.
  7. A. Pascual-Garcia, D. Abia, R. Mendez, G. S. Nido and U. Bastolla, Proteins, 2010, 78, 181–196 CrossRef CAS PubMed.
  8. P. V. Escriba and G. L. Nicolson, Biochim. Biophys. Acta, 2014, 1838, 1449–1450 CrossRef CAS PubMed.
  9. C. L. Mills, P. J. Beuning and M. J. Ondrechen, Comput. Struct. Biotechnol. J., 2015, 13, 182–191 CrossRef PubMed.
  10. B. Rost, Encyclopaedia of Computational Chemistry, 1998, vol. 3, pp. 2242–2255 Search PubMed.
  11. B. Rost, Proteins, 1997, 1, 192–197 CrossRef.
  12. B. Rost and S. O'Donoghue, Comput. Appl. Biosci., 1997, 13, 345–356 CrossRef CAS.
  13. A. R. Kinjo and K. Nishikawa, BMC Bioinf., 2006, 7, 401 CrossRef PubMed.
  14. F. Morcos, T. Hwa, J. N. Onuchic and M. Weigt, Methods Mol. Biol., 2014, 1137, 55–70 CAS.
  15. T. Hamelryck, Proteins, 2005, 59, 38–48 CrossRef CAS PubMed.
  16. J. Song, H. Tan, K. Takemoto and T. Akutsu, Bioinformatics, 2008, 24, 1489–1497 CrossRef CAS PubMed.
  17. R. Heffernan, A. Dehzangi, J. Lyons, K. Paliwal, A. Sharma, J. Wang, A. Sattar, Y. Zhou and Y. Yang, Bioinformatics, 2016, 32, 843–849 CrossRef PubMed.
  18. E. Khaji, M. Karami and Z. Garkani-Nejad, J. Theor. Biol., 2016, 391, 81–87 CrossRef CAS PubMed.
  19. P. Li, G. Pok, K. S. Jung, H. S. Shon and K. H. Ryu, Proteomics, 2011, 11, 3793–3801 CrossRef CAS PubMed.
  20. D. T. Jones, W. R. Taylor and J. M. Thornton, Nature, 1992, 358, 86–89 CrossRef CAS PubMed.
  21. D. T. Jones, J. Mol. Biol., 1999, 292, 195–202 CrossRef CAS PubMed.
  22. R. Heffernan, A. Dehzangi, J. Lyons, K. Paliwal, A. Sharma, J. Wang, A. Sattar, Y. Zhou and Y. Yang, Bioinformatics, 2016 DOI:10.1093/bioinformatics/btv665.
  23. J. Moult, Curr. Opin. Struct. Biol., 2005, 15, 285–289 CrossRef CAS PubMed.
  24. S. Wu and Y. Zhang, Proteins, 2008, 72, 547–556 CrossRef CAS PubMed.
  25. J. Soding, Bioinformatics, 2005, 21, 951–960 CrossRef PubMed.
  26. J. Xu, M. Li, D. Kim and Y. Xu, J. Bioinf. Comput. Biol., 2003, 1, 95–117 CrossRef CAS.
  27. D. Xu, L. Jaroszewski, Z. Li and A. Godzik, Bioinformatics, 2014, 30, 660–667 CrossRef CAS PubMed.
  28. H. Zhou and Y. Zhou, Proteins, 2005, 58, 321–328 CrossRef CAS PubMed.
  29. G. N. Ramachandran, C. Ramakrishnan and V. Sasisekharan, J. Mol. Biol., 1963, 7, 95–99 CrossRef CAS PubMed.
  30. Y. Yang, E. Faraggi, H. Zhao and Y. Zhou, Bioinformatics, 2011, 27, 2076–2082 CrossRef CAS PubMed.
  31. S. Wu and Y. Zhang, PLoS One, 2008, 3, e3400 Search PubMed.
  32. J. Song, H. Tan, M. Wang, G. I. Webb and T. Akutsu, PLoS One, 2012, 7, e30361 CAS.
  33. E. Faraggi, T. Zhang, Y. Yang, L. Kurgan and Y. Zhou, J. Comput. Chem., 2012, 33, 259–267 CrossRef CAS PubMed.
  34. H. Zhou and J. Skolnick, Biophys. J., 2009, 96, 2119–2127 CrossRef CAS PubMed.
  35. W. Kabsch and C. Sander, Biopolymers, 1983, 22, 2577–2637 CrossRef CAS PubMed.
  36. J. E. Donald, I. A. Hubner, V. M. Rotemberg, E. I. Shakhnovich and L. A. Mirny, Bioinformatics, 2005, 21, 2539–2540 CrossRef CAS PubMed.
  37. R. Adamczak, A. Porollo and J. Meller, Proteins, 2005, 59, 467–475 CrossRef PubMed.
  38. S. Chakravarty and R. Varadarajan, Structure, 1999, 7, 723–732 CrossRef CAS PubMed.
  39. Y. Kalidas and N. Chandra, J. Struct. Biol., 2008, 161, 31–42 CrossRef CAS PubMed.
  40. K. P. Tan, T. B. Nguyen, S. Patel, R. Varadarajan and M. S. Madhusudhan, Nucleic Acids Res., 2013, 41, W314–W321 CrossRef PubMed.
  41. L. A. Mirny and E. I. Shakhnovich, J. Mol. Biol., 1999, 291, 177–196 CrossRef CAS PubMed.
  42. S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller and D. J. Lipman, Nucleic Acids Res., 1997, 25, 3389–3402 CrossRef CAS PubMed.
  43. Z. Yuan and Z. X. Wang, Proteins, 2008, 70, 509–516 CrossRef CAS PubMed.
  44. H. Zhang, T. Zhang, K. Chen, S. Shen, J. Ruan and L. Kurgan, BMC Bioinf., 2008, 9, 388 CrossRef PubMed.
  45. J. Song, H. Tan, K. Mahmood, R. H. Law, A. M. Buckle, G. I. Webb, T. Akutsu and J. C. Whisstock, PLoS One, 2009, 4, e7072 Search PubMed.
  46. R. Yan, X. Wang, L. Huang, F. Yan, X. Xue and W. Cai, Sci. Rep., 2015, 5, 11586 CrossRef CAS PubMed.
  47. S. F. Altschul, W. Gish, W. Miller, E. W. Myers and D. J. Lipman, J. Mol. Biol., 1990, 215, 403–410 CrossRef CAS PubMed.
  48. D. Xu, H. Li and Y. Zhang, J. Comput. Biol., 2013, 20, 805–816 CrossRef CAS PubMed.
  49. W. Li and A. Godzik, Bioinformatics, 2006, 22, 1658–1659 CrossRef CAS PubMed.
  50. A. A. Schaffer, Y. I. Wolf, C. P. Ponting, E. V. Koonin, L. Aravind and S. F. Altschul, Bioinformatics, 1999, 15, 1000–1011 CrossRef CAS PubMed.
  51. K. Tomii and Y. Akiyama, Bioinformatics, 2004, 20, 594–595 CrossRef CAS PubMed.
  52. S. Henikoff and J. G. Henikoff, J. Mol. Biol., 1994, 243, 574–578 CrossRef CAS PubMed.
  53. M. Heinig and D. Frishman, Nucleic Acids Res., 2004, 32, W500–W502 CrossRef CAS PubMed.
  54. B. Rost and C. Sander, J. Mol. Biol., 1993, 232, 584–599 CrossRef CAS PubMed.
  55. R. Adamczak, A. Porollo and J. Meller, Proteins, 2004, 56, 753–767 CrossRef CAS PubMed.
  56. H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov and P. E. Bourne, Nucleic Acids Res., 2000, 28, 235–242 CrossRef CAS PubMed.
  57. S. Kawashima, P. Pokarowski, M. Pokarowska, A. Kolinski, T. Katayama and M. Kanehisa, Nucleic Acids Res., 2008, 36, D202–D205 CrossRef CAS PubMed.

Footnote

Electronic supplementary information (ESI) available: File 1: representative 1D and 2D properties. File 2: the optimal parameters for Net1 and Net2 for residue depth prediction. See DOI: 10.1039/c6ra12275b

This journal is © The Royal Society of Chemistry 2016