Title: Random forests variable importances: Towards a better understanding and large-scale feature selection
Authors: Antonio Sutera - University of Liege (Belgium) [presenting]
Gilles Louppe - New York University (Switzerland)
Celia Chatel - Aix-Marseille University (France)
Louis Wehenkel - University of Liege (Belgium)
Pierre Geurts - University of Liege (Belgium)
Abstract: One of the most practically useful features of random forests is the ability to derive, from the ensemble of trees, an importance score for each input variable that assesses its relevance for predicting the output. These importance scores have been successfully applied to many problems, but they are still not well understood theoretically. Recent work towards a better understanding, and a better exploitation, of the mean decrease impurity (MDI) measure will be discussed. First, a theoretical analysis of this measure in asymptotic sample and ensemble size conditions will be presented. The main results include a characterization of the conditions under which this measure is consistent with respect to a common definition of variable relevance. Then, motivated by very high-dimensional problems, MDI importances derived from finite tree ensembles will be analysed under the constraint that each tree can be built only from a fixed-size subset of the variables. In this setting, a sequential variable sampling mechanism is proposed and compared with uniform sampling. When used to identify all relevant variables, importance scores obtained with this sampling mechanism are shown, theoretically and empirically, to converge significantly faster than those obtained with uniform sampling in several conditions.
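The constrained setting described in the abstract can be illustrated with a minimal sketch of the uniform-sampling baseline only: each tree sees q variables drawn uniformly at random, and MDI accumulates the weighted impurity decrease of every split. The sketch assumes binary inputs, Shannon entropy as the impurity measure, and totally randomized trees; none of these choices is stated in the abstract. The function names (`entropy`, `accumulate_mdi`, `subspace_mdi`) and the toy problem are hypothetical, and the proposed sequential sampling mechanism is not reproduced because the abstract does not specify it.

```python
import math
import random
from itertools import product


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def accumulate_mdi(rows, labels, features, n_total, rng, importances):
    """Grow one totally randomized tree on binary inputs, adding each
    split's weighted impurity decrease to importances[split variable]."""
    if not features or len(set(labels)) <= 1:
        return  # leaf: no candidate variable left, or node already pure
    j = rng.choice(features)  # split variable drawn at random (assumption)
    rest = [f for f in features if f != j]
    left = [i for i, r in enumerate(rows) if r[j] == 0]
    right = [i for i, r in enumerate(rows) if r[j] == 1]
    if not left or not right:
        # Variable j is constant in this node: try the remaining ones.
        accumulate_mdi(rows, labels, rest, n_total, rng, importances)
        return
    n = len(rows)
    decrease = entropy(labels)
    for side in (left, right):
        decrease -= len(side) / n * entropy([labels[i] for i in side])
    importances[j] += (n / n_total) * decrease  # node weight * impurity decrease
    for side in (left, right):
        accumulate_mdi([rows[i] for i in side], [labels[i] for i in side],
                       rest, n_total, rng, importances)


def subspace_mdi(rows, labels, q, n_trees, seed=0):
    """Average MDI over n_trees trees, each built from q variables
    sampled uniformly without replacement (the baseline sampling scheme)."""
    rng = random.Random(seed)
    p = len(rows[0])
    importances = [0.0] * p
    for _ in range(n_trees):
        subset = rng.sample(range(p), q)
        accumulate_mdi(rows, labels, subset, len(rows), rng, importances)
    return [v / n_trees for v in importances]


# Toy problem (hypothetical): 4 binary inputs, y = x0 AND x1,
# so x2 and x3 are irrelevant by construction.
rows = list(product([0, 1], repeat=4))
labels = [r[0] & r[1] for r in rows]
scores = subspace_mdi(rows, labels, q=2, n_trees=200)
print([round(s, 3) for s in scores])
```

On this toy truth table the two relevant variables should receive positive scores, while the irrelevant ones receive none, since splitting on them never changes the output distribution. With q equal to the total number of variables, the scores of each fully developed tree sum to the entropy of the output.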