B0779
Title: Impact of missing data on mixtures and clustering
Authors: Christophe Biernacki - Inria (France) [presenting]
Abstract: Missing data become more frequent as modern datasets grow in size, which makes the topic a priority on the research agenda of statisticians. First, we introduce the MCAR (missing completely at random) mechanism in mixture models for mixed (quantitative and categorical) data and illustrate it on a biological data set. Second, as a more theoretical but important step, we discuss the impact of missing values on the EM algorithm for Gaussian mixtures in the MAR (missing at random) setting. We show that the well-known degeneracy problem is aggravated during EM runs, leading to degeneracy events that are dangerously slow and also more frequent than with complete data. Finally, we discuss the impact of values missing not at random (the MNAR mechanism) on the partition estimated from mixtures (Gaussian or not). In particular, we argue for embedding the missingness mechanism directly within the clustering modeling step. A new MNAR model is introduced, discussed, and tested on a medical data set.
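
The three missingness mechanisms named in the abstract can be illustrated with a minimal sketch on toy data. This is not the model presented in the talk: the two-cluster simulation, the logistic link between values and missingness, and the helper names (`simulate`, `make_masks`, `observed_mean`) are all assumptions made for illustration only.

```python
import math
import random


def simulate(n=5000, seed=0):
    """Draw (x1, x2) pairs from a two-cluster Gaussian mixture (toy data)."""
    rng = random.Random(seed)
    x1, x2 = [], []
    for _ in range(n):
        mu = -2.0 if rng.random() < 0.5 else 2.0  # cluster centre
        x1.append(rng.gauss(mu, 1.0))
        x2.append(rng.gauss(mu, 1.0))
    return x1, x2


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def make_masks(x1, x2, rate=0.5, seed=1):
    """Return MCAR, MAR and MNAR masks for x2 (True means missing)."""
    rng = random.Random(seed)
    # MCAR: the missingness probability is a constant.
    mcar = [rng.random() < rate for _ in x2]
    # MAR: it depends only on x1, which is always observed.
    mar = [rng.random() < sigmoid(a) for a in x1]
    # MNAR: it depends on the very value of x2 that goes missing
    # (assumed logistic link, chosen only for this illustration).
    mnar = [rng.random() < sigmoid(b) for b in x2]
    return mcar, mar, mnar


def observed_mean(x, mask):
    """Mean of the entries of x that are not masked out."""
    kept = [v for v, m in zip(x, mask) if not m]
    return sum(kept) / len(kept)


x1, x2 = simulate()
mcar, mar, mnar = make_masks(x1, x2)
full_mean = sum(x2) / len(x2)
# Print how far the observed mean of x2 drifts from the complete-data mean.
for name, mask in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    print(name, round(observed_mean(x2, mask) - full_mean, 2))
```

In this toy setting the MCAR mask leaves the observed mean of x2 close to the complete-data mean, while the MNAR mask removes large values preferentially and pulls the observed mean toward the lower cluster. This gives some intuition for why the abstract argues that the missingness mechanism may need to be modeled together with the clustering.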