Title: Determining the number of components of a PLS regression on incomplete data
Authors: Titin Agustin Nengsih - University of Strasbourg (France) [presenting]
Frederic Bertrand - Universite de Strasbourg (France)
Myriam Maumy-Bertrand - Universite de Strasbourg (France)
Nicolas Meyer - Universite de Strasbourg (France)
Abstract: Missing data is known to be a concern for the applied researcher. Several methods have been developed for handling incomplete data. Imputation is the process of replacing missing values with substituted ones before estimating the relevant model parameters. PLS regression is a multivariate model estimated by either the SIMPLS or the NIPALS algorithm. The goal is to analyze, by simulation, the impact of the proportion of missing data on the estimation of the number of components of a PLS regression. We compare criteria for selecting the number of components of a PLS regression on incomplete data and of a PLS regression on data sets imputed by three methods: multiple imputation by chained equations (MICE), k-nearest neighbors imputation (KNNimpute) and singular value decomposition imputation (SVDimpute). The compared criteria are Q2-LOO, Q2-10-fold, AIC, AIC-DoF, BIC and BIC-DoF, for different proportions of missing data (from 1% to 50%) and under both an MCAR assumption and a MAR assumption. The results show that MICE yielded the number of components closest to the correct one at each proportion of missing data, although its execution time is long. NIPALS-PLSR ranked second, followed by KNNimpute and SVDimpute. For every criterion except Q2-LOO, the selected number of components is far from the true one. Tolerance to incomplete data sets depends on the sample size, the proportion of missing data and the chosen component selection method.