Title: Inference and consistent variable selection for random forests and other tree-based ensembles
Authors: Lucas Mentch - University of Pittsburgh (United States) [presenting]
Giles Hooker - Cornell University (United States)
Abstract: Despite the success of tree-based learning algorithms (bagging, boosting, random forests), these methods are often seen as prediction-only tools whereby the interpretability and intuition of traditional statistical models are sacrificed for predictive accuracy. We present an overview of recent work suggesting that this black-box perspective need not be the case. We begin by developing formal statistical inference procedures for predictions generated by supervised learning ensembles. Ensemble methods based on bootstrapping often improve accuracy and stability, but fail to provide a framework in which distributional results are available. Instead of aggregating full bootstrap samples, we consider a general resampling scheme in which predictions are averaged over trees built on subsamples, and demonstrate that the resulting estimator belongs to an extended class of U-statistics. As such, a corresponding central limit theorem is developed, allowing confidence intervals to accompany predictions, as well as formal hypothesis tests for variable significance and additivity. The proposed test statistics can also be extended to produce consistent measures of variable importance that are robust to correlation structures between predictors. Finally, we discuss efficient variance estimation methods for the above procedures and provide demonstrations on eBird citizen science data.
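The subsampling-and-averaging scheme described in the abstract can be illustrated with a short sketch. This is not the authors' implementation: the subsample size, tree depth, Monte Carlo settings, and simulated data below are all illustrative assumptions. The ensemble prediction at a query point is an incomplete U-statistic over subsamples; a plug-in variance estimate of the form (k^2/n)*zeta_1 + zeta_k/m (the leading terms of the U-statistic variance under the central limit theorem referenced above) yields an approximate confidence interval, with zeta_1 estimated by a Monte Carlo scheme that fixes one shared observation across trees.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Simulated training data (illustrative): y depends on the first predictor only.
n = 500
X = rng.uniform(-1, 1, size=(n, 2))
y = X[:, 0] + 0.1 * rng.normal(size=n)

k = 50                          # subsample size per tree (assumption)
x0 = np.array([[0.3, -0.2]])    # query point at which we predict

def tree_pred(idx):
    """Fit one tree on the rows in idx and predict at x0."""
    tree = DecisionTreeRegressor(max_depth=4, random_state=0)
    tree.fit(X[idx], y[idx])
    return tree.predict(x0)[0]

# Ensemble prediction: average trees built on subsamples drawn
# without replacement (the resampling scheme from the abstract).
m = 200
preds = np.array([tree_pred(rng.choice(n, size=k, replace=False))
                  for _ in range(m)])
theta_hat = preds.mean()

# zeta_1 via Monte Carlo: fix one observation z, build several trees on
# subsamples all containing z, and record the conditional mean prediction;
# the variance of these means across fixed points estimates zeta_1.
n_fixed, n_mc = 25, 40
cond_means = []
others = np.arange(n)
for _ in range(n_fixed):
    z = rng.integers(n)
    sub = [tree_pred(np.append(
               rng.choice(np.delete(others, z), size=k - 1, replace=False), z))
           for _ in range(n_mc)]
    cond_means.append(np.mean(sub))
zeta1 = np.var(cond_means, ddof=1)
zetak = np.var(preds, ddof=1)

# Approximate variance of the ensemble prediction and a 95% interval.
var_hat = (k ** 2 / n) * zeta1 + zetak / m
half = 1.96 * np.sqrt(var_hat)
ci = (theta_hat - half, theta_hat + half)
```

Since y is approximately the first coordinate, the ensemble prediction at x0 should land near 0.3, and the interval width shrinks as the number of trees m and the sample size n grow, as the central limit theorem suggests.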