Title: Random forests for big data
Authors: Robin Genuer - Bordeaux University INSERM Vaccine Research Institute (France)
Nathalie Villa-Vialaneix - MIAT-INRA (France)
Jean-Michel Poggi - University Paris-Sud Orsay (France) [presenting]
Christine Tuleau-Malot - University Nice (France)
Abstract: Big Data is one of the major challenges of statistical science and has numerous consequences from both algorithmic and theoretical viewpoints. Big Data always involves massive data, but it often also includes online data and data heterogeneity. Recently, some statistical methods have been adapted to process Big Data, such as linear regression models, clustering methods and bootstrapping schemes. Random forests, introduced in 2001, are based on decision trees combined with aggregation and bootstrap ideas. They are a powerful nonparametric statistical method that handles regression problems as well as two-class and multi-class classification problems in a single, versatile framework. Focusing on classification problems, we selectively review available proposals for scaling random forests to Big Data problems. These proposals rely on parallel environments or on online adaptations of random forests. We also describe how the out-of-bag error is addressed in these methods. Then, we formulate various remarks on random forests in the Big Data context. Finally, we experiment with five variants on two massive datasets, one simulated and one real-world. These numerical experiments highlight the relative performance of the different variants, as well as some of their limitations.
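
As a rough illustration of the bootstrap, aggregation and out-of-bag error ideas mentioned in the abstract, the following minimal Python sketch may help. It is not one of the variants studied by the authors: it bags one-feature decision stumps, a deliberately simplified stand-in for a random forest (no random selection of variables at each split), on a small simulated dataset. All names (`fit_stump`, `predict_stump`, `bagging_oob_error`) and the toy data are assumptions made for this example.

```python
import random


def fit_stump(xs, ys):
    """Fit a threshold classifier on one feature by minimizing training errors."""
    best = None
    for t in sorted(set(xs)):
        for left_label in (0, 1):
            # Predict left_label when x <= t, the other label otherwise.
            errors = sum(
                (left_label if x <= t else 1 - left_label) != y
                for x, y in zip(xs, ys)
            )
            if best is None or errors < best[0]:
                best = (errors, t, left_label)
    _, t, left_label = best
    return t, left_label


def predict_stump(stump, x):
    t, left_label = stump
    return left_label if x <= t else 1 - left_label


def bagging_oob_error(xs, ys, n_trees=25, seed=0):
    """Bag stumps on bootstrap samples and return the out-of-bag error rate."""
    rng = random.Random(seed)
    n = len(xs)
    votes = [[0, 0] for _ in range(n)]  # per-observation OOB votes for class 0 / 1
    for _ in range(n_trees):
        # Bootstrap: draw n indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        in_bag = set(idx)
        stump = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        # Each observation is predicted only by stumps that did not see it.
        for i in range(n):
            if i not in in_bag:
                votes[i][predict_stump(stump, xs[i])] += 1
    # Aggregate by majority vote over observations with at least one OOB vote.
    scored = [(i, v) for i, v in enumerate(votes) if sum(v) > 0]
    wrong = sum((0 if v[0] >= v[1] else 1) != ys[i] for i, v in scored)
    return wrong / len(scored)


# Toy simulated data (an assumption of this sketch): class 1 when x > 0.5.
data_rng = random.Random(42)
xs = [data_rng.random() for _ in range(200)]
ys = [1 if x > 0.5 else 0 for x in xs]
oob_error = bagging_oob_error(xs, ys)
print(f"OOB error estimate: {oob_error:.3f}")
```

Because every observation is left out of roughly one third of the bootstrap samples, the out-of-bag error needs no separate test set. The sketch keeps the full index set in memory for each bootstrap sample, which is precisely the kind of step that becomes costly on massive data.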