Title: Missing data imputation problems may occur when dealing with categorical data
Authors: Francesco Palumbo - University of Naples Federico II (Italy) [presenting]
Alfonso Iodice D'Enza - Università di Cassino e del Lazio Meridionale (Italy)
Angelos Markos - Democritus University of Thrace (Greece)
Abstract: Missing data imputation issues may occur when dealing with categorical data: for example, in surveys, respondents are often reluctant to answer questions about sensitive information (e.g. income, sexual orientation, religion). When missingness depends on some of the observed data and occurs only in a subset of variables, it is referred to in the literature as missing at random (MAR). Under these conditions, effective imputation techniques need to incorporate the variables that are related to the missingness. For MAR imputation, several approaches work satisfactorily with continuous variables; however, they cannot be easily generalized to multinomial categorical data. In this contribution, a procedure is proposed that combines principal component methods and multiple imputation via chained equations (MICE) to impute missing entries in high-dimensional categorical data. Given a set of p fully observed categorical variables and one variable with MAR values, the procedure imputes the missing entries according to the association structure between the incomplete variable and the p observed variables. In particular, a reduced number of linear combinations (principal components) is defined by means of correspondence analysis-based methods; these components are the input of a suitable MICE procedure. The procedure can be iterated when more than one of the categorical variables presents missing entries. The contribution ends with a comparative study of the proposed procedure against other categorical data imputation approaches.
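Illustrative sketch: the two-stage idea described in the abstract can be outlined in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation. The abstract does not specify the correspondence analysis variant or the MICE conditional model, so the sketch assumes a plain multiple correspondence analysis of the indicator matrix (computed with NumPy's SVD) and a softmax regression fitted by gradient descent on bootstrap resamples of the complete cases. All function names, the -1 missing-value code, the number of components and the simulated data are hypothetical.

```python
import numpy as np

def indicator_matrix(X):
    """One-hot encode an (n, p) array of integer-coded categorical variables."""
    blocks = []
    for j in range(X.shape[1]):
        levels = np.unique(X[:, j])
        blocks.append((X[:, [j]] == levels[None, :]).astype(float))
    return np.hstack(blocks)

def mca_components(X, n_components=2):
    """Row principal coordinates from correspondence analysis of the indicator matrix."""
    Z = indicator_matrix(X)
    P = Z / Z.sum()
    r = P.sum(axis=1)  # row masses
    c = P.sum(axis=0)  # column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardized residuals
    U, s, _ = np.linalg.svd(S, full_matrices=False)
    return (U[:, :n_components] * s[:n_components]) / np.sqrt(r)[:, None]

def softmax(scores):
    scores = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def fit_softmax(F, y, n_classes, n_iter=500, lr=0.5):
    """Multinomial logistic regression fitted by plain gradient descent (assumed model)."""
    A = np.hstack([np.ones((F.shape[0], 1)), F])
    W = np.zeros((A.shape[1], n_classes))
    Y = np.eye(n_classes)[y]
    for _ in range(n_iter):
        W -= lr * A.T @ (softmax(A @ W) - Y) / len(y)
    return W

def impute_categorical(X, y, n_components=2, n_imputations=5, seed=0):
    """Return several completed copies of y, where -1 marks a missing entry."""
    rng = np.random.default_rng(seed)
    F = mca_components(X, n_components)
    miss = y < 0
    obs_idx = np.flatnonzero(~miss)
    n_classes = int(y[obs_idx].max()) + 1
    A_miss = np.hstack([np.ones((int(miss.sum()), 1)), F[miss]])
    completed = []
    for _ in range(n_imputations):
        # Bootstrap the complete cases so that model uncertainty is
        # reflected across the imputed data sets.
        boot = rng.choice(obs_idx, size=len(obs_idx))
        W = fit_softmax(F[boot], y[boot], n_classes)
        probs = softmax(A_miss @ W)
        # Draw each missing category from its predictive distribution.
        draws = np.array([rng.choice(n_classes, p=p) for p in probs], dtype=y.dtype)
        y_imp = y.copy()
        y_imp[miss] = draws
        completed.append(y_imp)
    return completed

# Hypothetical simulated data: 4 observed variables with 3 levels each.
rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(200, 4))
y_full = X[:, 0].copy()                       # incomplete variable tied to X[:, 0]
flip = rng.random(200) < 0.2
y_full[flip] = rng.integers(0, 3, size=int(flip.sum()))
# MAR mechanism: the chance of a missing value depends on the observed X[:, 1].
miss = rng.random(200) < np.where(X[:, 1] == 0, 0.5, 0.1)
y_obs = y_full.copy()
y_obs[miss] = -1

F = mca_components(X, n_components=2)
completed = impute_categorical(X, y_obs, n_components=2, n_imputations=5)
```

The sketch covers a single incomplete variable. With several incomplete variables, the same step would be cycled over each variable in turn, as the abstract indicates. Drawing from the predictive distribution rather than taking the most probable category keeps the imputations variable across the completed data sets, which is what multiple imputation relies on.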