CMStatistics 2018
B1686
Title: Probabilistic clustering methods in data analysis of macro-datasets
Authors: Aurea Sandra Toledo de Sousa - Universidade dos Açores (Portugal) [presenting]
Helena Bacelar-Nicolau - Universidade de Lisboa (Portugal)
Osvaldo Dias Lopes da Silva - Universidade dos Açores (Portugal)
Leonor Bacelar-Nicolau - Universidade de Lisboa (Portugal)
Abstract: The extraction of useful knowledge from huge data sets stored in large databases is an important task. One possible solution for analysing high-dimensional micro-data sets is the prior identification of classes (usually of individuals) in such databases, which are then described through macro-data matrices (also called symbolic data matrices). Starting from a proximity matrix containing similarities or dissimilarities between the pairs of elements to be classified, either classic or probabilistic aggregation criteria can then be applied. We use hierarchical clustering methods based on the weighted generalized affinity coefficient and on probabilistic aggregation criteria, which transform the similarities by the probability distribution function of appropriate sample statistics. The most relevant clustering structures, obtained from the hierarchical cluster analysis of two data sets taken from the literature on complex data analysis, are described. These results are compared with those obtained with other clustering algorithms. The results show that the probabilistic clustering approach performs well on both data sets, which is consistent with similar previous studies on either simulated or real data sets.
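
The pipeline described in the abstract (macro-data matrix, then proximity matrix, then hierarchical aggregation) can be illustrated with a minimal sketch. The abstract gives neither the formula nor the data, so everything specific here is an assumption: the coefficient is written in one commonly cited form, a weighted sum over variables of the affinity sum_l sqrt(p_kjl * p_k'jl) between the category profiles of two classes; the toy counts, the function name weighted_affinity, and the equal weights are hypothetical. SciPy does not provide probabilistic aggregation criteria, so classic average linkage is used as a stand-in.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform


def weighted_affinity(tables, weights=None):
    """Affinity between classes described by frequency tables.

    tables: array of shape (n_classes, n_variables, n_categories) holding
    non-negative counts; every (class, variable) row must have a positive total.
    weights: one weight per variable, summing to 1 (equal weights by default).
    This formula is an assumed form of the coefficient, not taken from the abstract.
    """
    x = np.asarray(tables, dtype=float)
    n_vars = x.shape[1]
    if weights is None:
        w = np.full(n_vars, 1.0 / n_vars)
    else:
        w = np.asarray(weights, dtype=float)
    # Turn counts into category profiles (relative frequencies) per variable.
    profiles = x / x.sum(axis=2, keepdims=True)
    root = np.sqrt(profiles)
    # per_var[a, b, j] = sum over categories l of sqrt(p[a, j, l] * p[b, j, l])
    per_var = np.einsum("ajl,bjl->abj", root, root)
    # Weighted sum over variables gives an (n_classes, n_classes) similarity matrix.
    return per_var @ w


# Hypothetical macro-data: 4 classes, 2 variables, 3 categories each.
counts = np.array([
    [[8, 1, 1], [7, 2, 1]],
    [[7, 2, 1], [8, 1, 1]],
    [[1, 1, 8], [1, 2, 7]],
    [[1, 2, 7], [1, 1, 8]],
])

a = weighted_affinity(counts)

# Convert the similarity to a dissimilarity and clean up rounding noise
# so that SciPy accepts it as a valid distance matrix.
d = np.clip(1.0 - a, 0.0, None)
d = (d + d.T) / 2.0
np.fill_diagonal(d, 0.0)

# Classic average linkage (stand-in for a probabilistic aggregation criterion).
tree = linkage(squareform(d), method="average")
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)
```

With these toy counts the first two classes share one profile shape and the last two share another, so cutting the tree at two clusters should separate them. Replacing the average-linkage step with a probabilistic criterion would require a custom aggregation routine.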