Title: Imputation for supervised learning problems in high dimension
Authors: Hadrien Lorenzo - INRIA (France) [presenting]
Jerome Saracco - ENSC - Bordeaux INP - Inria (France)
Olivier Cloarec - Corporate Research Advanced Data AnalyticS (France)
Abstract: Missing data are a common problem in data analysis. We consider values that are missing at random (MAR), meaning that the probability that a value is missing depends only on one or several observed variables. Most modern algorithms focus on this type of missingness, and the most widely used implementations include MICE, missForest, missMDA, and k-nearest neighbors imputation. To account for sampling variability, it is better to propose $M$ values for each missing value instead of a single one. This so-called ``multiple imputation'' procedure provides proper imputation, in contrast to improper imputation. In practice, $M=5$ is often sufficient. Most existing methods are not well suited to the high-dimensional context, where the sample size $n$ is much smaller than the number of variables $p$, often written $n \ll p$. In supervised analysis, the variable $y$ must be explained by the variable $x$. In high dimension, the part of $x$ associated with $y$ can be hard to identify, and classical imputation methodologies suffer in this setting. We present a new methodology, called Koh-Lanta, which handles missing values in a supervised context, uses multiple imputation, and addresses the high-dimensional issues. For the sake of simplicity, missing values are considered only in the $x$ part.
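To make the notions of MAR missingness and multiple imputation with $M=5$ concrete, here is a minimal Python sketch. It is not the Koh-Lanta method and it is low-dimensional: it swaps in a simple bootstrap regression imputer on two variables. The function names `make_mar_data` and `multiple_impute`, the logistic missingness model, and the linear relation between `x1` and `x2` are all assumptions made for illustration only.

```python
import numpy as np

def make_mar_data(n=200, seed=0):
    """Simulate two variables where x2 is missing at random given x1 (hypothetical setup)."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    x2 = 2.0 * x1 + rng.normal(scale=0.5, size=n)
    # MAR: the probability that x2 is missing depends only on the observed x1.
    p_miss = 1.0 / (1.0 + np.exp(-x1))
    mask = rng.random(n) < p_miss
    x2_obs = x2.copy()
    x2_obs[mask] = np.nan
    return x1, x2_obs, mask

def multiple_impute(x1, x2_obs, M=5, seed=1):
    """Return M completed copies of x2 using bootstrap regression imputation."""
    rng = np.random.default_rng(seed)
    miss = np.isnan(x2_obs)
    obs_idx = np.flatnonzero(~miss)
    completed = []
    for _ in range(M):
        # Resample the complete cases so that each imputation reflects
        # uncertainty in the fitted regression, not only residual noise.
        boot = rng.choice(obs_idx, size=obs_idx.size, replace=True)
        A = np.column_stack([np.ones(boot.size), x1[boot]])
        coef, *_ = np.linalg.lstsq(A, x2_obs[boot], rcond=None)
        sigma = (x2_obs[boot] - A @ coef).std(ddof=2)
        filled = x2_obs.copy()
        # Draw each missing value from the fitted model plus residual noise.
        filled[miss] = (coef[0] + coef[1] * x1[miss]
                        + rng.normal(scale=sigma, size=miss.sum()))
        completed.append(filled)
    return np.stack(completed)  # shape (M, n)

x1, x2_obs, mask = make_mar_data()
imputations = multiple_impute(x1, x2_obs, M=5)
```

Each row of `imputations` is one completed data set; observed entries are identical across rows, while imputed entries vary from one row to the next, and that variation is what a downstream analysis would pool over.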