Title: Multiple latent components clustering
Authors: Stanislaw Wilczynski - University of Wroclaw (Poland) [presenting]
Piotr Sobczyk - Politechnika Wroclawska (Poland)
Malgorzata Bogdan - University of Wroclaw (Poland)
Julie Josse - INRIA (France)
Abstract: In many scientific problems, such as the identification of genetic pathways from gene expression data, one of the tasks is to find a lower-dimensional subspace that represents a collection of points from a high-dimensional space. One of the simplest ways to achieve this is PCA. However, PCA is useful only if we assume that all points in the data come from the same lower-dimensional subspace. In many cases a more general model is needed, one which assumes that the variables come from a mixture model and that the data lie in a union of a few low-dimensional subspaces. We propose a new method for finding these subspaces, called Multiple Latent Components Clustering (MLCC). It is based on the $k$-means algorithm, where clusters represent subspaces and the center of a cluster is a set of principal components. To estimate the number of clusters, a modified version of the Bayesian Information Criterion (BIC) is used, which takes into account a prior distribution on the number of clusters. In each iteration of the algorithm, the number of principal components in a single cluster is estimated using the Penalized Semi-integrated Likelihood (PESEL) method, and the similarity between a data point and a cluster is measured by BIC. The algorithm is implemented in the R package `Varclust'. We will present a comparison of our algorithm with other variable clustering methods, as well as the results of a real data analysis, and we will point out the main differences and advantages of MLCC.
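The alternating scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the Varclust implementation and makes several simplifying assumptions: the number of clusters is given rather than selected by the modified BIC, every cluster uses the same fixed subspace dimension `dim` rather than a PESEL estimate, and the similarity between a variable and a cluster is the squared length of the variable's projection onto the cluster's principal components rather than BIC. The function name `subspace_kmeans` and the synthetic data are hypothetical and serve only as an illustration.

```python
import numpy as np


def subspace_kmeans(X, n_clusters, dim, n_iter=20, seed=0):
    """Cluster the columns (variables) of X around low-dimensional subspaces.

    Simplified stand-in for the scheme in the abstract: fixed `dim`
    (assumption, replaces PESEL) and a projection-based similarity
    (assumption, replaces BIC).
    """
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    # Centre each variable and scale it to unit norm, so that the squared
    # projection onto an orthonormal basis lies between 0 and 1.
    X = X - X.mean(axis=0)
    X = X / np.linalg.norm(X, axis=0)
    labels = rng.integers(n_clusters, size=p)  # random initial partition
    for _ in range(n_iter):
        fit = np.zeros((n_clusters, p))
        for k in range(n_clusters):
            members = X[:, labels == k]
            if members.shape[1] == 0:
                # Reseed an empty cluster with randomly chosen variables.
                members = X[:, rng.integers(p, size=dim)]
            # "Centre" of the cluster: its leading principal components.
            U, _, _ = np.linalg.svd(members, full_matrices=False)
            basis = U[:, :dim]
            # Similarity of every variable to this cluster's subspace.
            fit[k] = np.sum((basis.T @ X) ** 2, axis=0)
        new_labels = fit.argmax(axis=0)  # reassign to best-fitting subspace
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels


# Synthetic example: 20 variables generated from two 2-dimensional subspaces.
rng = np.random.default_rng(1)
n = 100
blocks = []
for _ in range(2):
    factors = rng.normal(size=(n, 2))    # latent components of one subspace
    loadings = rng.normal(size=(2, 10))  # 10 variables per subspace
    blocks.append(factors @ loadings + 0.1 * rng.normal(size=(n, 10)))
X = np.hstack(blocks)

labels = subspace_kmeans(X, n_clusters=2, dim=2)
print(labels)
```

As with ordinary $k$-means, the result of such a scheme depends on the random initial partition, so in practice one would typically run it from several starting points and keep the best-fitting segmentation.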