Title: Variable importance for random forests: MDA and Shapley effects
Authors: Clement Benard - Safran Tech (France) [presenting]
Abstract: Variable importance measures are the main tools to analyze the black-box mechanisms of random forests. Although the mean decrease accuracy (MDA) is widely accepted as the most efficient variable importance measure for random forests, little is known about its statistical properties. In fact, the exact MDA definition varies across the main random forest software packages. The objective is to rigorously analyze the behavior of the main MDA implementations. To this end, we establish their limits as the sample size increases. In particular, we break down these limits into three components: the first two terms are related to Sobol indices, which are well-defined measures of a covariate's contribution to the response variance, as opposed to the third term, whose value increases with dependence among covariates. Thus, we theoretically demonstrate that the MDA does not target the right quantity when covariates are dependent, a fact that has already been noticed experimentally. To address this issue, we define new importance measures for random forests: the Sobol-MDA and SHAFF. The Sobol-MDA fixes the flaws of the original MDA and is appropriate for variable selection. SHAFF, on the other hand, is a fast and accurate estimate of Shapley effects, even when input variables are dependent, and is appropriate for ranking all variables for interpretation purposes. We prove the consistency of both the Sobol-MDA and SHAFF, and show that they empirically outperform their competitors.
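
As a rough illustration of the permutation mechanism behind the MDA, the following minimal sketch computes the increase in mean squared error after shuffling one covariate. It is not the Sobol-MDA or SHAFF, and it does not reproduce any specific software's MDA definition. The toy data, the function names, and the fixed `predict` function (a stand-in for a fitted random forest, used so the sketch needs only the standard library) are all assumptions made for illustration.

```python
import random


def mse(predict, X, y):
    """Mean squared error of `predict` on rows X with targets y."""
    return sum((predict(row) - t) ** 2 for row, t in zip(X, y)) / len(y)


def permutation_mda(predict, X, y, j, rng):
    """Increase in MSE after shuffling column j of X, other columns untouched."""
    shuffled = [row[j] for row in X]
    rng.shuffle(shuffled)
    X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, shuffled)]
    return mse(predict, X_perm, y) - mse(predict, X, y)


# Hypothetical toy data: y depends on x0; x1 is a noisy copy of x0; x2 is irrelevant.
rng = random.Random(0)
X, y = [], []
for _ in range(2000):
    x0 = rng.gauss(0, 1)
    x1 = x0 + 0.1 * rng.gauss(0, 1)
    x2 = rng.gauss(0, 1)
    X.append([x0, x1, x2])
    y.append(2 * x0 + 0.1 * rng.gauss(0, 1))


def predict(row):
    """Stand-in predictor (assumption): splits its weight between x0 and x1."""
    return row[0] + row[1]


mda = [permutation_mda(predict, X, y, j, rng) for j in range(3)]
for j, value in enumerate(mda):
    print(f"x{j}: MDA = {value:.3f}")
```

In this toy setting, x1 receives a large score even though it carries almost no information beyond x0: shuffling it breaks its dependence with x0 and evaluates the predictor on points far from the data distribution. The irrelevant x2, which the predictor ignores, scores zero.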