CMStatistics 2022
B1297
Title: Mixed-type data spectral clustering with variable specific distances
Authors: Cristina Tortora - San Jose State University (United States)
Francesco Palumbo - University of Naples Federico II (Italy)
Alfonso Iodice D'Enza - Universita di Napoli Federico II (Italy) [presenting]
Abstract: At the core of the spectral clustering approach is the decomposition of the graph Laplacian matrix, a weighted kernel transformation of the pairwise distances/dissimilarities between the observations at hand. The definition of the distance/dissimilarity matrix is therefore crucial and, for non-continuous or mixed datasets, neither obvious nor trivial. A straightforward solution is to compute pairwise Euclidean distances for the continuous variables and Hamming distances for the non-continuous variables, and then to define a general distance matrix as a convex combination of the two resulting matrices. The weight of the convex combination dictates the relative influence of the continuous and categorical variables on the clustering solution. Using Euclidean distances on standardised continuous variables is a reasonable choice; relying on simple matching for the categorical variables, by contrast, is simplistic. We consider a set of association-based, variable-specific distances and dissimilarities to define a custom Laplacian matrix suitable for the spectral clustering of mixed data. In particular, we propose a data-driven approach to select the most appropriate distance/dissimilarity for each variable: the chosen combination of distances/dissimilarities is the one that provides the best spectral clustering solution according to a suitable metric.
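The "straightforward solution" described in the abstract can be sketched in a few lines of NumPy. This is only an illustration of that baseline (Euclidean plus Hamming distances, convex combination, kernel, Laplacian decomposition), not the authors' association-based, data-driven method. The function names, the Gaussian kernel, the bandwidth `sigma`, the rescaling of the Euclidean distances to [0, 1], the choice of the symmetric normalised Laplacian, and the toy data are all assumptions made for the example.

```python
import numpy as np

def mixed_distance(X_num, X_cat, alpha=0.5):
    """Convex combination of Euclidean and Hamming distances (hypothetical helper)."""
    # Standardise the continuous variables (assumes no constant column).
    Z = (X_num - X_num.mean(axis=0)) / X_num.std(axis=0)
    # Pairwise Euclidean distances, rescaled to [0, 1] (an assumption of this
    # sketch) so that both parts live on a comparable scale.
    d_num = np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2))
    d_num = d_num / d_num.max()
    # Hamming (simple matching): share of categorical variables that differ.
    d_cat = (X_cat[:, None, :] != X_cat[None, :, :]).mean(axis=2)
    # alpha weights the continuous part, 1 - alpha the categorical part.
    return alpha * d_num + (1.0 - alpha) * d_cat

def spectral_embedding(D, k, sigma=0.3):
    """Embed observations using the k smallest eigenvectors of the Laplacian."""
    # Gaussian kernel turns distances into affinities (kernel choice is assumed).
    W = np.exp(-(D ** 2) / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    # Symmetric normalised Laplacian: L = I - D^{-1/2} W D^{-1/2}.
    L = np.eye(len(D)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order.
    _, vecs = np.linalg.eigh(L)
    U = vecs[:, :k]
    # Row-normalise the embedding before clustering.
    return U / np.linalg.norm(U, axis=1, keepdims=True)

def kmeans(U, k, n_iter=50):
    """Plain k-means with deterministic farthest-point initialisation."""
    centers = [U[0]]
    for _ in range(1, k):
        dist = np.min([((U - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(U[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        sq = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(sq, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = U[labels == j].mean(axis=0)
    return labels

# Toy mixed dataset: two groups that differ on both variable types.
rng = np.random.default_rng(0)
X_num = np.vstack([rng.normal(0.0, 0.3, (10, 2)), rng.normal(5.0, 0.3, (10, 2))])
X_cat = np.array([["a", "x"]] * 10 + [["b", "y"]] * 10)

D = mixed_distance(X_num, X_cat, alpha=0.5)
labels = kmeans(spectral_embedding(D, k=2), k=2)
```

Moving `alpha` towards 1 lets the continuous variables dominate the partition, and moving it towards 0 lets the categorical ones dominate. In this sketch the simple-matching term `d_cat` is the piece that variable-specific dissimilarities would replace.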