CMStatistics 2020
Title: BIG-SIR: a Sliced Inverse Regression approach for massive data
Authors: Benoit Liquet - Macquarie University (Australia) [presenting]
Jerome Saracco - INRIA Bordeaux Sud Ouest - University of Bordeaux (France)
Abstract: In a massive data setting, the focus is on a semiparametric regression model involving a real dependent variable $Y$ and a $p$-dimensional covariable $X$. This model includes a dimension reduction of $X$ via an index $X'\beta$. Because of the large volume of the data, the Effective Dimension Reduction (EDR) direction $\beta$ cannot be estimated directly by the Sliced Inverse Regression (SIR) method. To deal with the main challenges of analysing massive datasets, namely storage and computational efficiency, we propose a new SIR estimator of the EDR direction that follows the ``divide and conquer'' strategy. The data are divided into subsets. EDR directions are estimated in each subset, which is a small dataset. The recombination step is based on the optimisation of a criterion which assesses the proximity between the EDR directions of each subset. The computations on the subsets are run in parallel, with no communication among them. A simulation study using our \texttt{edrGraphicalTools} R package shows that our approach reduces the computation time and conquers the memory constraint problem posed by massive datasets. A combination of the \texttt{foreach} and \texttt{bigmemory} R packages is exploited to offer efficiency of execution in both speed and memory. Results are visualised using the bin-summarise-smooth approach through the \texttt{bigvis} R package. Finally, we illustrate our proposed approach on a massive airline data set.
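
The divide, estimate and recombine steps described in the abstract can be illustrated with a minimal sketch. It is written in Python with NumPy rather than R, and it is not the \texttt{edrGraphicalTools} implementation. The function names (`sir_direction`, `combine_directions`, `big_sir`), the equal-size blocks, and the simulated data are all hypothetical. The recombination criterion is assumed here to be a weighted sum of squared cosines between a candidate direction and the per-block directions, which is maximised by the principal eigenvector of a small $p \times p$ matrix; the abstract only says that the criterion assesses proximity between the block directions. The blocks are processed sequentially for simplicity, although each block is independent and could be run in parallel.

```python
import numpy as np


def sir_direction(X, y, n_slices=10):
    """Leading SIR direction (single-index case) estimated from one block."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    sigma = Xc.T @ Xc / n                      # covariance of X in this block
    order = np.argsort(y)                      # slice the block on the order of y
    M = np.zeros((p, p))
    for idx in np.array_split(order, n_slices):
        m = Xc[idx].mean(axis=0)               # centred slice mean of X
        M += (len(idx) / n) * np.outer(m, m)   # covariance of the slice means
    # Generalised eigenproblem M b = lambda * sigma b, solved via sigma^{-1} M.
    vals, vecs = np.linalg.eig(np.linalg.solve(sigma, M))
    b = np.real(vecs[:, np.argmax(np.real(vals))])
    return b / np.linalg.norm(b)


def combine_directions(directions, weights):
    """Assumed criterion: maximise sum_g w_g * cos^2(b, b_g) over unit vectors b.

    The maximiser is the principal eigenvector of sum_g w_g * b_g b_g'.
    """
    B = np.column_stack(directions)            # p x G matrix of unit directions
    S = (B * np.asarray(weights)) @ B.T        # weighted sum of outer products
    vals, vecs = np.linalg.eigh(S)             # eigenvalues in ascending order
    return vecs[:, -1]


def big_sir(X, y, n_blocks=10, n_slices=10):
    """Split the rows into blocks, estimate per block, then recombine."""
    blocks = np.array_split(np.arange(len(y)), n_blocks)
    directions = [sir_direction(X[i], y[i], n_slices) for i in blocks]
    weights = [len(i) / len(y) for i in blocks]  # block-size weights (assumption)
    return combine_directions(directions, weights)


# Hypothetical simulated data from a single-index model y = f(X'beta) + noise.
rng = np.random.default_rng(0)
n, p = 20000, 5
beta = np.array([1.0, -1.0, 0.0, 0.0, 0.0]) / np.sqrt(2.0)
X = rng.standard_normal((n, p))
index = X @ beta
y = index + index**3 + 0.5 * rng.standard_normal(n)

beta_hat = big_sir(X, y, n_blocks=10, n_slices=10)
# The direction is identified only up to sign, so compare absolute cosines.
cosine = abs(float(beta_hat @ beta))
print("absolute cosine between estimate and true direction:", cosine)
```

Because each block only needs its own rows, the per-block step never requires the full dataset in memory, and only the $p$-dimensional directions have to be gathered for the recombination step.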