View Submission - COMPSTAT2018
Title: A clustering method proposition for mixed type data
Authors: Odysseas Moschidis - University of Macedonia (Greece) [presenting]
Theodoros Chatzipantelis - Research Committee, Aristotle University Thessaloniki (Greece)
Abstract: The typical encoding of a continuous variable as an ordinal categorical variable has two major drawbacks: a) distinctly different values are assigned to the same class, with a major loss of information; b) values close to one another that lie on either side of the boundary between two classes are assigned to different classes, which distorts the information. Few algorithms cluster mixed-type datasets with both numerical and categorical attributes. We propose an algorithm that enables hierarchical clustering of data with numerical and categorical attributes, based on the Ward criterion and the chi-squared metric. Each categorical variable is replaced by a set of 0-1 variables, one for each category of the variable, taking the value 1 if the corresponding category has been observed and 0 otherwise, and each numerical variable is replaced by a set of n graded possibilities. With the proposed encoding, which is an evolution of ordinal data encoding, each value of the continuous variable can be classified in all n classes of the categorical variable, using as values the probabilities of a corresponding probability distribution function, different for each value of the numerical variable. This eliminates drawbacks a) and b) of the typical encoding because, as we are going to suggest, we achieve the reconstruction of the values of the numerical variable. The proposed methodology gives results similar to those of multiple correspondence analysis (MCA).
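
The encoding described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which probability distribution is used, so the sketch assumes triangular (piecewise-linear) memberships between hypothetical hinge points, chosen because they let the original value be recovered exactly as a weighted sum. The function names, the hinge values, and the toy data are all illustrative assumptions.

```python
from bisect import bisect_right


def indicator_coding(value, categories):
    """0-1 coding: 1 for the observed category, 0 for every other one."""
    return [1.0 if value == c else 0.0 for c in categories]


def fuzzy_coding(x, hinges):
    """Spread x over len(hinges) classes with weights that sum to 1.

    Assumption: triangular memberships between sorted hinge points;
    the abstract leaves the actual distribution unspecified.
    """
    n = len(hinges)
    x = min(max(x, hinges[0]), hinges[-1])  # clamp to the covered range
    weights = [0.0] * n
    j = min(bisect_right(hinges, x) - 1, n - 2)  # left hinge of x's interval
    t = (x - hinges[j]) / (hinges[j + 1] - hinges[j])
    weights[j] = 1.0 - t
    weights[j + 1] = t
    return weights


def reconstruct(weights, hinges):
    """Recover the numerical value as the weighted sum of the hinge points."""
    return sum(w * h for w, h in zip(weights, hinges))


def chi2_distance_sq(table, i, k):
    """Squared chi-squared distance between the row profiles of rows i and k."""
    total = sum(sum(row) for row in table)
    n_cols = len(table[0])
    col_mass = [sum(row[j] for row in table) / total for j in range(n_cols)]
    ri, rk = sum(table[i]), sum(table[k])
    return sum(
        (table[i][j] / ri - table[k][j] / rk) ** 2 / col_mass[j]
        for j in range(n_cols)
        if col_mass[j] > 0
    )


# Hypothetical toy data: one numerical and one categorical attribute.
hinges = [20, 40, 60]        # assumed hinge points for n = 3 classes
colours = ["red", "blue"]
rows = [(39, "red"), (41, "red"), (21, "red"), (59, "blue")]

# Each observation becomes fuzzy-coded columns followed by indicator columns.
table = [fuzzy_coding(x, hinges) + indicator_coding(c, colours) for x, c in rows]

d_close = chi2_distance_sq(table, 0, 1)  # 39 vs 41: neighbours across a cut at 40
d_far = chi2_distance_sq(table, 0, 2)    # 39 vs 21: same crisp class, far apart
```

In this toy table, 39 and 41 receive nearly identical codes even though a crisp cut at 40 would separate them, while 39 and 21 receive clearly different codes even though a crisp coding would merge them. The resulting chi-squared distances could then be passed to a Ward-type agglomerative procedure, which is not shown here.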