Title: Clustering and feature screening via L1 fusion penalization
Authors: Gourab Mukherjee - University of Southern California (United States)
Trambak Banerjee - University of Southern California (United States)
Peter Radchenko - University of Sydney (Australia) [presenting]
Abstract: The aim is to study the large-sample behavior of a convex clustering framework, which minimizes the sample within-cluster sum of squares under an L1 fusion constraint on the cluster centroids. Our analysis is based on a novel representation of the sample clustering procedure as a sequence of cluster splits, each determined by a maximization problem. We use this representation to provide a simple and intuitive formulation of the population clustering procedure. We then demonstrate that the sample procedure consistently estimates its population analogue, and we derive the corresponding rates of convergence. Building on the insights gained from the asymptotic investigation, we propose a key post-processing modification of the original clustering framework. We show, both theoretically and empirically, that the resulting approach can be successfully used to estimate the number of clusters in the population. We also propose an approach for feature screening in the clustering of massive datasets, in which both the number of features and the number of observations can be very large.
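
For orientation, a criterion of the kind described in the abstract is commonly written as follows. This is an illustrative sketch of the standard convex clustering form and not necessarily the exact formulation used by the authors. The symbols are assumptions that do not appear in the abstract: x_i denotes the i-th of n observations in R^p, alpha_i its fitted centroid, t the fusion budget, and lambda the tuning parameter of the equivalent penalized form.

```latex
% Constrained form: within-cluster sum of squares under an L1 fusion constraint
% (notation assumed for illustration)
\min_{\alpha_1,\dots,\alpha_n \in \mathbb{R}^p}
  \sum_{i=1}^{n} \lVert x_i - \alpha_i \rVert_2^2
\quad \text{subject to} \quad
  \sum_{i<j} \lVert \alpha_i - \alpha_j \rVert_1 \le t

% Equivalent penalized (Lagrangian) form
\min_{\alpha_1,\dots,\alpha_n \in \mathbb{R}^p}
  \sum_{i=1}^{n} \lVert x_i - \alpha_i \rVert_2^2
  + \lambda \sum_{i<j} \lVert \alpha_i - \alpha_j \rVert_1
```

Under this notation, observations whose fitted centroids coincide are placed in the same cluster. Both the squared Euclidean loss and the L1 penalty are sums over coordinates, so the penalized form splits into p separate univariate problems, one per feature.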