Title: How to deal with missing values in the high dimensional supervised context
Authors: Hadrien Lorenzo - INRIA (France) [presenting]
Olivier Cloarec - Corporate Research Advanced Data AnalyticS (France)
Jerome Saracco - INRIA Bordeaux Sud Ouest - University of Bordeaux (France)
Abstract: Exploratory analysis searches for data structure hidden from the user. In the supervised case, this structure may represent only a very small portion of the total data structure, and it is all the more volatile when the sample is small or the number of descriptors is large. Missing data are classically handled without taking into account the supervised nature of the research question. This assumes that the structure associated with the response is accessible to an unsupervised method. This assumption is questionable, especially in high dimensions. Indeed, in high dimensions, unsupervised imputation produces variable estimates that are of little use to the prediction model. In practice, only the initialization values of the imputation (often iterative algorithms of EM type) are kept for the variables of interest. A solution to the problem of supervised imputation in high dimensions is presented. The context is the linear model solved with the PLS (Partial Least Squares) method. The construction of subspaces in the linear context makes it possible to ``skip'' the missing data, as the NIPALS method does. On the other hand, the regularized PCA approach, through multiple imputation implemented in the missMDA R package, allows the estimation of missing data for unsupervised problems. The objective is to compare these different approaches in contexts more or less favorable to each of them.
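To illustrate how a subspace method can ``skip'' missing data rather than impute them, the sketch below implements a single-response NIPALS-style PLS in Python/NumPy. It uses the standard NIPALS missing-data convention: every inner product (weights, scores, loadings) is computed over the observed cells only, with the score for each row rescaled by the observed part of the weight vector. This is a minimal illustration of the general idea, not the authors' implementation, and the function name and defaults are our own.

```python
import numpy as np

def nipals_pls1(X, y, n_comp=2):
    """Sketch of PLS1 (single response) in NIPALS style that skips NaN
    entries in X: all sums run over observed cells only, so no value
    is ever imputed. Returns scores T, weights W, loadings P."""
    y = np.asarray(y, dtype=float).copy()
    Mf = (~np.isnan(X)).astype(float)     # 1.0 where X is observed
    Xf = np.where(Mf > 0, X, 0.0)         # zeros stand in for NaN in sums
    T, W, P = [], [], []
    for _ in range(n_comp):
        # weight: covariance of each column with y, observed cells only
        w = (Xf * y[:, None]).sum(axis=0)
        w /= np.linalg.norm(w)
        # score per row: projection restricted to that row's observed cells,
        # rescaled by the observed part of w (the NIPALS "skip" trick)
        t = (Xf @ w) / np.clip(Mf @ (w * w), 1e-12, None)
        # loading per column: regression of X on t, observed cells only
        p = (Xf * t[:, None]).sum(axis=0) \
            / np.clip((Mf * (t * t)[:, None]).sum(axis=0), 1e-12, None)
        # deflate X (observed cells) and y before the next component
        Xf = np.where(Mf > 0, Xf - np.outer(t, p), 0.0)
        y = y - t * (t @ y) / (t @ t)
        T.append(t); W.append(w); P.append(p)
    return np.array(T).T, np.array(W).T, np.array(P).T
```

By contrast, the regularized-PCA route of missMDA (e.g. its `imputePCA`/`MIPCA` functions in R) fills in the missing cells first, without reference to the response, and only then would a PLS model be fitted, which is precisely the unsupervised-versus-supervised contrast the abstract proposes to study.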