CMStatistics 2022
Title: Clustering mixed-type data via the KAMILA algorithm
Authors: Marianthi Markatou - University at Buffalo (United States) [presenting]
Abstract: Despite the existence of a large number of clustering algorithms, clustering mixed measurement scale data, that is, interval (continuous) and categorical (nominal and/or ordinal) scale data, remains a challenging problem. We first review the literature on this topic and show that most current clustering methods for mixed-scale data suffer from at least one of two central challenges: (1) they are unable to equitably balance the contribution of continuous and categorical scale variables without strong parametric assumptions; (2) they are unable to properly handle data sets in which only a subset of the variables is related to the underlying cluster structure of interest. We then develop KAMILA (KAy-means for MIxed LArge data), a clustering method that addresses (1) and, in many situations, (2) without requiring strong assumptions. We next discuss MEDEA (Multivariate Eigenvalue Decomposition Error Adjustment), a weighting scheme that addresses (2) even in the face of a large number of uninformative variables. We study the theoretical aspects of our methods and demonstrate their performance using Monte Carlo simulations and real data sets.