

Optimisation of cancer classification by machine learning generates enriched list of candidate drug targets and biomarkers

Abstract

Purpose: The Cancer Genome Atlas has provided expression values of 18,015 genes for different cancer types. Studies on the classification of cancers by machine learning algorithms have used different data and methods, which makes it difficult to compare their performance. It is unclear which algorithm performs best and whether maximum levels of accuracy have been reached. In this study, we aimed to optimise the diagnosis of cancer by comparing the performance of five algorithms on the same data, and by identifying the smallest possible number of differentiator genes.

Methods: The accuracies of five algorithms in classifying cancer type and primary site were determined using a gene expression dataset of 5,629 samples and a dataset of 9,144 samples, respectively.

Results: When trained with gene sets ranging in size from 16,718 down to 40 genes, Random Forest (RF), Gradient Boosting Machine (GBM), and Neural Network (NN) consistently achieved 100% or near 100% accuracy in the classification of both cancer type and primary site. Reducing the training sets to the 40 highest-ranked genes resulted in 78-fold and 45-fold faster processing times for RF and GBM, respectively. Members of the olfactory receptor family, keratin-associated proteins, and the defensin beta family were among the highest-ranked genes.

Conclusion: The ensemble algorithms (RF and GBM) and NN were the most accurate at distinguishing between cancer types and primary sites, whereas k-nearest neighbours (KNN) was the fastest. Training sets can be reduced to the 40 highest-ranked differentiator genes without any significant loss of accuracy, and these genes include potential drug targets and biomarkers.
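The general idea of ranking genes and then training on only the highest-ranked ones can be illustrated with a small sketch. It is not the authors' pipeline: the data are synthetic, the ranking score (between-class variance divided by within-class variance) is an assumed stand-in for whatever ranking the study used, and the function names `make_synthetic`, `rank_genes`, `knn_predict` and `evaluate` are hypothetical. A plain k-nearest-neighbours classifier is used only because it fits in a few lines of standard-library Python.

```python
import random
from collections import Counter
from statistics import mean, pvariance


def make_synthetic(n_per_class=30, n_genes=200, n_informative=5, seed=0):
    """Toy expression matrix: 3 classes; only the first n_informative genes differ."""
    rng = random.Random(seed)
    X, y = [], []
    for label in range(3):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            for g in range(n_informative):
                row[g] += 5.0 * label  # class-specific shift (assumed effect size)
            X.append(row)
            y.append(label)
    return X, y


def rank_genes(X, y):
    """Return gene indices sorted by between-class / within-class variance, highest first."""
    classes = sorted(set(y))
    scores = []
    for g in range(len(X[0])):
        groups = [[row[g] for row, lab in zip(X, y) if lab == c] for c in classes]
        between = pvariance([mean(v) for v in groups])
        within = mean(pvariance(v) for v in groups)
        scores.append(between / (within + 1e-12))  # epsilon avoids division by zero
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training samples closest to x (squared Euclidean distance)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), lab)
        for row, lab in zip(train_X, train_y)
    )
    votes = Counter(lab for _, lab in dists[:k])
    return votes.most_common(1)[0][0]


def evaluate(n_top=5, seed=0):
    """Rank genes on the training split only, keep the top n_top, and score KNN on held-out samples."""
    X, y = make_synthetic(seed=seed)
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(0.7 * len(idx))
    train, test = idx[:cut], idx[cut:]
    train_X = [X[i] for i in train]
    train_y = [y[i] for i in train]
    top = rank_genes(train_X, train_y)[:n_top]
    reduced_train = [[row[g] for g in top] for row in train_X]
    correct = sum(
        knn_predict(reduced_train, train_y, [X[i][g] for g in top]) == y[i]
        for i in test
    )
    return top, correct / len(test)


if __name__ == "__main__":
    top_genes, accuracy = evaluate()
    print("top-ranked gene indices:", top_genes)
    print("held-out accuracy on reduced gene set:", accuracy)
```

Ranking on the training split alone keeps the held-out samples from influencing which genes are selected. On real expression data the gene count, the ranking method and the classifier would all need to be chosen and validated separately.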


Article information


Submitted
31 Dec 2019
Accepted
13 Feb 2020
First published
13 Feb 2020

Mol. Omics, 2020, Accepted Manuscript
Article type
Research Article

S. Ramroach, A. Joshi and M. John, Mol. Omics, 2020, Accepted Manuscript, DOI: 10.1039/C9MO00198K
