The prediction of methylation states in human DNA sequences based on hexanucleotide composition and feature selection†
Abstract
DNA methylation is an important epigenetic modification that plays a crucial role in the regulation of gene expression and the occurrence of cancer. Although various experimental methods have been used to detect DNA methylation states, they are time-consuming and laborious. With the rapid accumulation of DNA sequence data, the gap between the number of known sequences and the number of known methylation annotations is widening rapidly. It is therefore necessary to develop a computational method for predicting methylation states. In this study, hexanucleotide composition is used to characterize DNA sequences. Maximum relevance minimum redundancy is adopted to preselect a feature subset that carries discriminative information, and an improved genetic algorithm is then employed to select the optimal feature subset from the preselected features and to optimize the parameters of the support vector machine. Finally, a model based on the optimal feature subset and parameters is constructed and used to predict methylation states. In 5-fold cross-validation, the proposed method achieves an accuracy of 92.42%, a Matthews correlation coefficient of 0.8484 and an area under the receiver operating characteristic curve of 0.9326. The predictive performance of the hexanucleotide composition is evaluated by comparing it with that of the trinucleotide composition and the nonanucleotide composition. The results indicate that the current method has a high potential to become a useful tool in research on predicting DNA methylation states. The Matlab source code is freely available on request from the authors.
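To make the feature representation and one of the reported metrics concrete, the following is a minimal Python sketch and not the authors' Matlab code. It assumes that hexanucleotide composition means the normalized frequency of each of the 4^6 = 4096 possible 6-mers counted over overlapping windows, and that windows containing ambiguous bases such as N are skipped; the abstract does not state either detail. The function names `kmer_composition` and `matthews_cc` are hypothetical, and the Matthews correlation coefficient is computed from its standard confusion-matrix definition.

```python
from itertools import product
from math import sqrt


def kmer_composition(seq, k=6):
    """Return the frequencies of all 4**k k-mers in lexicographic order.

    Assumptions (not taken from the paper): overlapping windows,
    normalization by the number of valid windows, and windows that
    contain characters other than A, C, G, T are ignored.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    seq = seq.upper()
    total = 0
    for start in range(len(seq) - k + 1):
        position = index.get(seq[start:start + k])
        if position is not None:
            counts[position] += 1
            total += 1
    if total == 0:
        # Sequence shorter than k, or no valid window at all.
        return [0.0] * len(kmers)
    return [count / total for count in counts]


def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Common convention when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Usage: a toy sequence yields a 4096-dimensional feature vector.
features = kmer_composition("ACGTACGTAC")
print(len(features))
```

With k = 3 or k = 9 the same function gives the trinucleotide (64 features) or nonanucleotide (262,144 features) composition, which shows why feature selection becomes important as k grows.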