CMStatistics 2017
B0251
Title: Cross-entropy clustering
Authors: Przemyslaw Spurek - Jagiellonian University (Poland) [presenting]
Abstract: The Gaussian Mixture Model (GMM) is one of the most popular clustering models and is implemented in various R packages, such as mclust, Rmixmod, pdfCluster, and mixtools. The model seeks a mixture of Gaussians $f=p_1 f_1+\ldots+p_k f_k$, where $p_1,\ldots,p_k > 0$ and $\sum_i p_i = 1$, that gives an optimal estimate of the data set $X \subset \mathbb{R}^N$ as measured by the cost function $\mathrm{EM}(f,X) = -\frac{1}{|X|}\sum_{x \in X} \log \left( p_1 f_1(x) + \ldots + p_k f_k(x) \right)$. This cost is minimized iteratively with the EM (Expectation Maximization) algorithm. While the expectation step is relatively simple, the maximization step usually requires complicated numerical optimization. We present the R package CEC, the first open-source implementation of the Cross-Entropy Clustering method, which is a fast hybrid between k-means and GMM. Similarly to GMM, CEC searches for Gaussian densities that minimize the cost function $\mathrm{CEC}(f,X) = -\frac{1}{|X|}\sum_{x \in X} \log \left( \max ( p_1 f_1(x), \ldots, p_k f_k(x) ) \right)$. Although the difference between the two functions is slight, amounting to replacing the sum with a maximum, it turns out that the optimization can be carried out in time comparable to k-means. CEC also allows unnecessary clusters to be removed on-line.
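The two cost functions above can be compared directly on a toy example. The following is a minimal sketch, assuming one-dimensional data and fixed, hand-picked mixture parameters; it only evaluates the costs and does not perform any optimization. All function names (`gaussian_pdf`, `em_cost`, `cec_cost`, `cec_assign`) and the sample data are hypothetical and are not part of the CEC package or of any of the R packages mentioned.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def weighted_densities(x, weights, means, stds):
    """List of p_i * f_i(x) for every mixture component i."""
    return [p * gaussian_pdf(x, m, s) for p, m, s in zip(weights, means, stds)]


def em_cost(data, weights, means, stds):
    """EM(f, X): negative mean log of the SUM of weighted densities."""
    total = sum(math.log(sum(weighted_densities(x, weights, means, stds)))
                for x in data)
    return -total / len(data)


def cec_cost(data, weights, means, stds):
    """CEC(f, X): negative mean log of the MAX of weighted densities."""
    total = sum(math.log(max(weighted_densities(x, weights, means, stds)))
                for x in data)
    return -total / len(data)


def cec_assign(data, weights, means, stds):
    """Hard assignment: index of the component attaining the max for each point."""
    labels = []
    for x in data:
        dens = weighted_densities(x, weights, means, stds)
        labels.append(dens.index(max(dens)))
    return labels


# Hypothetical toy data: two groups around -2 and +2.
data = [-2.1, -1.9, -2.0, 1.8, 2.2, 2.0]
weights = [0.5, 0.5]

# Well-separated components: the two costs nearly coincide.
separated = (weights, [-2.0, 2.0], [0.5, 0.5])
# Overlapping components: the CEC cost is visibly larger than the EM cost.
overlapping = (weights, [-0.5, 0.5], [1.0, 1.0])

for name, params in (("separated", separated), ("overlapping", overlapping)):
    print(name, "EM:", em_cost(data, *params), "CEC:", cec_cost(data, *params))

print("labels:", cec_assign(data, *separated))
```

Because each term $p_i f_i(x)$ is non-negative, the maximum never exceeds the sum, so the CEC cost is never smaller than the EM cost for the same parameters. The maximum also assigns every point to exactly one component, which is the k-means-like aspect of the method.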