Title: On the performance of distance-based approaches for clustering mixed-type data
Authors: Angelos Markos - Democritus University of Thrace (Greece) [presenting]
Odysseas Moschidis - University of Macedonia (Greece)
George Menexes - Aristotle University of Thessaloniki (Greece)
Theodoros Chatzipantelis - Research Committee, Aristotle University of Thessaloniki (Greece)
Abstract: Clustering a set of objects described by a mixture of continuous and categorical variables is a challenging task. Popular distance-based approaches for clustering mixed-type data include dissimilarity measures for variables with different measurement scales, standardization of variables to a common scale, extensions of K-means for mixed data, and sequential or simultaneous dimension reduction and clustering. A major concern in clustering mixed-type data is how to achieve a favorable balance between continuous and categorical variables. A number of existing methods require user-specified weights to determine the relative contribution of continuous versus categorical variables. Other approaches adjust the weights adaptively by considering the importance of each type of variable. We examine the similarity of the clustering solutions obtained by different strategies on a number of real mixed-type data sets and evaluate their performance on simulated data sets with varying levels of continuous and categorical overlap. Dimension reduction and clustering methods tend to outperform alternative approaches when the categorical variables are more informative than the continuous ones for clustering purposes, i.e., for data sets in which the continuous variables overlap substantially more than the categorical ones. Recommended practices are provided within the context of this framework.
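
To make the idea of a user-specified weight concrete, the following minimal sketch computes a k-prototypes-style dissimilarity between two objects: squared Euclidean distance on the continuous variables plus a weight times the number of mismatching categories. It is an illustration only and is not necessarily one of the measures compared in this work. The function name `mixed_dissimilarity`, the parameter name `gamma`, and the toy data are assumptions introduced here, and the continuous values are assumed to be already standardized.

```python
from math import fsum


def mixed_dissimilarity(x_num, y_num, x_cat, y_cat, gamma=1.0):
    """Hypothetical helper: k-prototypes-style dissimilarity for mixed data.

    x_num, y_num: continuous values (assumed already standardized).
    x_cat, y_cat: categorical values, compared by simple matching.
    gamma: user-specified weight on the categorical part.
    """
    if len(x_num) != len(y_num) or len(x_cat) != len(y_cat):
        raise ValueError("objects must have the same number of variables")
    # Continuous contribution: squared Euclidean distance.
    numeric_part = fsum((a - b) ** 2 for a, b in zip(x_num, y_num))
    # Categorical contribution: number of variables whose categories differ.
    categorical_part = sum(a != b for a, b in zip(x_cat, y_cat))
    return numeric_part + gamma * categorical_part


# Two toy objects, each with two continuous and two categorical variables.
a_num, a_cat = [0.5, -1.0], ["red", "urban"]
b_num, b_cat = [1.5, -1.0], ["blue", "urban"]

# The same pair of objects under two different categorical weights.
d_low = mixed_dissimilarity(a_num, b_num, a_cat, b_cat, gamma=0.5)
d_high = mixed_dissimilarity(a_num, b_num, a_cat, b_cat, gamma=2.0)
print(d_low, d_high)
```

Raising `gamma` increases the influence of categorical mismatches relative to continuous differences, which is the balance the abstract identifies as a major concern.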