Importance of proximity measures in clustering of cancer and miRNA datasets: proposal of an automated framework
Abstract
Distance plays a central role in clustering, since it governs how data points are allocated to clusters. Many distance (proximity) measures have been reported in the literature for quantifying the dissimilarity between two points. The appropriate choice depends on the application domain and can differ even between data sets from the same domain, so it is important to determine automatically which distance measure performs best for a particular data set. In this study we develop an automatic clustering technique that uses the search capability of multiobjective optimization to determine both the relevant distance measure and the corresponding partitioning of a given data set. The proposed framework is generic: any number of distance measures can be incorporated into it. In this work, four widely used distance measures are explored for each data set, namely Euclidean, line symmetry, point symmetry and city block distance. The quality of a partitioning obtained with a particular distance is measured using four cluster validity indices: the Silhouette index, the Davies-Bouldin (DB) index, the adjusted Rand index and classification accuracy. A new encoding strategy, which encodes both the set of cluster centers and the chosen distance function, is used to represent the problem; the multiobjective optimization based search then determines the appropriate distance function and the corresponding partitioning. The efficiency of the proposed technique is demonstrated by clustering three microRNA and three microarray gene expression data sets of varying complexity. The results show the usefulness of the proposed automated approach.
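To make the encoding idea concrete, the following is a minimal sketch, not the implementation used in this study. It assumes a candidate solution is stored as a flat list holding the cluster-center coordinates followed by one extra gene that indexes the distance function. Only the Euclidean and city block distances are included; the point symmetry and line symmetry distances are omitted for brevity. All names (`encode`, `decode`, `assign`, `silhouette`, `DISTANCES`) are hypothetical, and the Silhouette computation follows the standard textbook definition.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def city_block(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

# Assumption: the last gene of a solution is an index into this list.
# The symmetry-based distances named in the abstract are left out here.
DISTANCES = [euclidean, city_block]

def encode(centers, distance_id):
    """Flatten the centers and append the distance index as the last gene."""
    return [x for c in centers for x in c] + [distance_id]

def decode(chromosome, dim):
    """Recover the list of centers and the selected distance function."""
    genes, distance_id = chromosome[:-1], int(chromosome[-1])
    centers = [tuple(genes[i:i + dim]) for i in range(0, len(genes), dim)]
    return centers, DISTANCES[distance_id]

def assign(points, chromosome, dim):
    """Label each point with the index of its nearest encoded center."""
    centers, dist = decode(chromosome, dim)
    return [min(range(len(centers)), key=lambda k: dist(p, centers[k]))
            for p in points]

def silhouette(points, labels, dist):
    """Mean Silhouette width of a labelling under the given distance."""
    clusters = set(labels)
    if len(clusters) < 2:
        return 0.0  # undefined for a single cluster; 0.0 is a convention here
    scores = []
    for i, p in enumerate(points):
        same = [dist(p, q) for j, q in enumerate(points)
                if j != i and labels[j] == labels[i]]
        if not same:
            scores.append(0.0)  # singleton cluster
            continue
        a = sum(same) / len(same)
        b = min(
            sum(dist(p, q) for j, q in enumerate(points) if labels[j] == c)
            / labels.count(c)
            for c in clusters if c != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Toy data: two well-separated groups in two dimensions.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
chromosome = encode([(0.0, 0.0), (10.0, 10.0)], distance_id=1)  # city block
labels = assign(points, chromosome, dim=2)
_, dist = decode(chromosome, dim=2)
print(labels, round(silhouette(points, labels, dist), 3))
```

In a multiobjective search, a score such as this Silhouette value could serve as one objective among several, and mutating the last gene would let the search switch between distance functions. Which objectives the framework actually optimizes is not stated in the abstract, so that pairing is an assumption of this sketch.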