Title: Distance estimation for mixed continuous and categorical data with missing values
Authors: Eduardo Mendes - Fundacao Getulio Vargas (Brazil) [presenting]
Glauco Azevedo - Fundacao Getulio Vargas (Brazil)
Abstract: A methodology is proposed for estimating pairwise distances between observations with mixed continuous and categorical variables and missing values. Distance estimation is the basis for many regression and classification methods, such as nearest neighbors and discriminant analysis, and for clustering techniques such as k-means and k-medoids. Classical methods for handling missing data rely on mean imputation, which can underestimate the variance, or on regression-based imputation. Unfortunately, when the goal is to estimate the distance between observations, data imputation may perform poorly and bias the results toward the imputation model. We estimate the pairwise distances directly, treating the missing data as random. The joint distribution of the data is approximated using a multivariate mixture model for mixed continuous and categorical data. We present an EM-type algorithm for estimating the mixture and a general methodology for estimating the distance between observations. Simulations show improved performance of our method compared with traditional imputation methods.
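
To illustrate the idea of estimating a distance directly rather than imputing first, the following is a minimal sketch and not the authors' algorithm. It assumes continuous variables only, a Gaussian mixture with diagonal covariances whose parameters are already fitted, missing entries coded as NaN, and two observations that are independent given their observed entries. The function names (`responsibilities`, `moments`, `expected_sq_distance`) and the parameter layout are hypothetical. Under these assumptions, the expected squared Euclidean distance follows from E[(X - Y)^2] = E[X^2] - 2 E[X] E[Y] + E[Y^2] applied coordinate by coordinate.

```python
import numpy as np

def responsibilities(x, weights, means, variances):
    """Posterior component probabilities given the observed entries of x."""
    obs = ~np.isnan(x)
    diff = x[obs] - means[:, obs]
    # Log density of the observed coordinates under each diagonal Gaussian.
    log_p = -0.5 * np.sum(
        diff**2 / variances[:, obs] + np.log(2 * np.pi * variances[:, obs]),
        axis=1,
    )
    log_p = log_p + np.log(weights)
    log_p -= log_p.max()  # stabilise before exponentiating
    r = np.exp(log_p)
    return r / r.sum()

def moments(x, weights, means, variances):
    """First and second moments of each coordinate given the observed part of x."""
    r = responsibilities(x, weights, means, variances)
    m1 = r @ means                    # mixture mean per coordinate
    m2 = r @ (means**2 + variances)   # mixture second moment per coordinate
    obs = ~np.isnan(x)
    m1[obs] = x[obs]                  # observed values are known exactly
    m2[obs] = x[obs] ** 2
    return m1, m2

def expected_sq_distance(x, y, weights, means, variances):
    """Expected squared Euclidean distance, treating missing entries as random."""
    m1x, m2x = moments(x, weights, means, variances)
    m1y, m2y = moments(y, weights, means, variances)
    return float(np.sum(m2x - 2.0 * m1x * m1y + m2y))

# Hypothetical single-component model, used only to exercise the functions.
weights = np.array([1.0])
means = np.array([[0.0, 3.0]])
variances = np.array([[1.0, 2.0]])
x = np.array([0.0, np.nan])
y = np.array([0.0, 0.0])
d = expected_sq_distance(x, y, weights, means, variances)
```

Unlike plugging in an imputed mean, the expected distance keeps the variance of the missing coordinate, so the imputation does not make the two observations appear artificially close. Handling categorical variables, as the abstract describes, would need a mixture over mixed data types, which this sketch does not attempt.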