Title: On clustering and outlier detection with missing data
Authors: Cristina Tortora - San Jose State University (United States) [presenting]
Hung Tong - San Jose State University (United States)
Louis Tran - San Jose State University (United States)
Abstract: Cluster analysis is a data analysis technique that aims to partition a data set into smaller groups of similar observations. In model-based clustering, the population is assumed to be a convex combination of sub-populations, each of which is modeled by a probability distribution. When the data contain outliers, the multivariate Student-$t$ ($T$) and contaminated normal (CN) distributions provide robust parameter estimates and are therefore more suitable choices than Gaussian mixture models. Recently, the $T$ and CN distributions have been extended to accommodate different tail behaviors across principal components; the resulting models are referred to as multiple scaled distributions, i.e., MST and MSCN, respectively. The mixture of CN distributions has the advantage of automatically detecting outliers, while the MSCN distribution has the advantage of directional robust parameter estimation and outlier detection. The term ``directional'' implies that the parameter estimation and outlier detection procedures work separately for each principal component. Two practical limitations of these models are that they require the number of clusters to be known and the data set to be complete. Both limitations are addressed by studying indices for selecting the number of clusters and by presenting recent extensions of the CN and MSCN mixtures for clustering data that contain values missing at random.