Optimisation of cancer classification by machine learning generates an enriched list of candidate drug targets and biomarkers

Sterling Ramroach; Ajay Joshi; Melford John

doi:10.1039/C9MO00198K

Sterling Ramroach,*^a Ajay Joshi^a and Melford John^b

Author affiliations

* Corresponding authors

^a Department of Electrical and Computer Engineering, University of the West Indies, Saint Augustine, Trinidad and Tobago
E-mail: sramroach@gmail.com

^b Department of Pre-Clinical Sciences, University of the West Indies, Saint Augustine, Trinidad and Tobago

Abstract

The Cancer Genome Atlas has provided expression values of 18 015 genes for different cancer types. Studies on the classification of cancers by machine learning algorithms have used different data and methods, which makes it difficult to compare their performance. It is unclear which algorithm performs best and whether maximum levels of accuracy have been obtained. In this study, we aimed to optimise the diagnosis of cancer by comparing the performance of five algorithms on the same data, and by identifying the smallest possible number of differentiator genes. The accuracies of the five algorithms in classifying cancer type and primary site were determined using gene expression datasets of 5629 samples and 9144 samples, respectively. When trained with training sets ranging from 16 718 down to 40 genes, Random Forest (RF), Gradient Boosting Machine (GBM), and Neural Network (NN) consistently achieved 100% or near 100% accuracy in the classification of both cancer type and primary site. Reduction of training sets to the 40 highest-ranked genes resulted in 78-fold and 45-fold faster processing times for RF and GBM, respectively. Genes of the olfactory receptor family, keratin-associated proteins, and the defensin beta family were among the highest ranked. The ensemble (RF and GBM) and NN algorithms were the most accurate at distinguishing between cancer types and primary sites, whereas k-nearest neighbours (KNN) was the fastest. Training sets can be reduced to the 40 highest-ranked differentiator genes without any significant loss of accuracy, and these genes include potential drug targets and biomarkers.
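
The gene-reduction idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the data are synthetic, the ranking score (between-class over within-class variance) is an assumed stand-in for whatever importance measure the study used, and all function names are hypothetical. KNN is used here only because it is simple to write with the Python standard library.

```python
import random
from collections import Counter


def make_synthetic(n_per_class=30, n_genes=200, n_informative=5, seed=0):
    """Synthetic 'expression' matrix: only the first n_informative genes
    differ between the three classes (an assumption for illustration)."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1, 2):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            for g in range(n_informative):
                row[g] += 3.0 * label  # class-dependent shift
            X.append(row)
            y.append(label)
    return X, y


def rank_genes(X, y):
    """Return gene indices sorted from most to least discriminative,
    scored by between-class variance over within-class variance."""
    classes = sorted(set(y))
    scores = []
    for g in range(len(X[0])):
        col = [row[g] for row in X]
        grand_mean = sum(col) / len(col)
        between = within = 0.0
        for c in classes:
            vals = [v for v, lab in zip(col, y) if lab == c]
            mean_c = sum(vals) / len(vals)
            between += len(vals) * (mean_c - grand_mean) ** 2
            within += sum((v - mean_c) ** 2 for v in vals)
        scores.append(between / (within + 1e-12))
    return sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)


def knn_predict(X_train, y_train, x, k=3):
    """Majority vote among the k nearest training samples (squared Euclidean)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), lab)
        for row, lab in zip(X_train, y_train)
    )
    votes = Counter(lab for _, lab in dists[:k])
    return votes.most_common(1)[0][0]


def accuracy(X_train, y_train, X_test, y_test, genes, k=3):
    """Test accuracy of KNN when only the listed gene columns are used."""
    def subset(X):
        return [[row[g] for g in genes] for row in X]
    Xtr, Xte = subset(X_train), subset(X_test)
    hits = sum(knn_predict(Xtr, y_train, x, k) == t for x, t in zip(Xte, y_test))
    return hits / len(y_test)


# Split the synthetic samples 70/30 into training and test sets.
X, y = make_synthetic()
idx = list(range(len(X)))
random.Random(1).shuffle(idx)
cut = int(0.7 * len(idx))
X_train, y_train = [X[i] for i in idx[:cut]], [y[i] for i in idx[:cut]]
X_test, y_test = [X[i] for i in idx[cut:]], [y[i] for i in idx[cut:]]

# Rank genes on the training set only, then compare all genes with the top 5.
top = rank_genes(X_train, y_train)[:5]
acc_all = accuracy(X_train, y_train, X_test, y_test, list(range(len(X[0]))))
acc_top = accuracy(X_train, y_train, X_test, y_test, top)
print("all genes:", acc_all, "top genes:", acc_top)
```

Ranking on the training split alone keeps the test samples out of gene selection, which is the usual precaution when reporting accuracy on a reduced gene set.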

Molecular Omics
