Computational identification of protein S-sulfenylation sites by incorporating the multiple sequence features information
Abstract
Cysteine S-sulfenylation is a major type of post-translational modification that contributes to the regulation of protein structure and function in many cellular processes. Experimental identification of S-sulfenylation sites is challenging because of the low abundance of proteins and the inefficiency of experimental methods. Computational identification of S-sulfenylation sites is an alternative strategy for annotating the S-sulfenylated proteome. In this study, a novel computational predictor, SulCysSite, was developed for accurate prediction of S-sulfenylation sites based on multiple sequence features, including amino acid index properties, binary amino acid codes, position-specific scoring matrix, and compositions of profile-based amino acids. A random forest classifier was used to train the SulCysSite prediction model. The final SulCysSite model achieved an AUC value of 0.819 in a 10-fold cross-validation test and outperformed other existing computational predictors. In addition, interpretable rules (i.e., feature combinations) were extracted from the predictive model of SulCysSite to investigate the hidden and complex mechanisms underlying S-sulfenylation sites. SulCysSite is a useful computational resource for the prediction of S-sulfenylation sites. The online interface and datasets are publicly available at http://kurata14.bio.kyutech.ac.jp/SulCysSite/.
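To make the binary amino acid code feature concrete, the following minimal Python sketch extracts a sequence window centered on each cysteine and encodes every residue as a 20-bit one-hot vector. It is an illustration only and not the SulCysSite implementation: the function names, the flank size, the padding symbol `X`, and the choice to encode padding as all zeros are assumptions made for this example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAD = "X"  # hypothetical padding symbol for positions beyond the termini


def cysteine_windows(sequence, flank=10):
    """Yield (1-based position, window) for each cysteine in the sequence.

    Each window has length 2 * flank + 1 with the cysteine at its center.
    The flank size is an assumed parameter, not a value from the paper.
    """
    seq = sequence.upper()
    padded = PAD * flank + seq + PAD * flank
    for i, residue in enumerate(seq):
        if residue == "C":
            yield i + 1, padded[i:i + 2 * flank + 1]


def binary_encode(window):
    """One-hot encode each residue as 20 bits.

    Padding and non-standard residues become all-zero blocks (an assumption).
    """
    vector = []
    for residue in window:
        vector.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return vector


if __name__ == "__main__":
    # Toy sequence for illustration only, not taken from the datasets.
    for pos, win in cysteine_windows("MKCAVLGC", flank=3):
        print(pos, win, len(binary_encode(win)))
```

Vectors of this kind could be concatenated with the other feature types named above before being passed to a classifier such as a random forest.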