B1412
Title: The projected covariance measure for model-free variable significance testing
Authors: Rajen D Shah - University of Cambridge (United Kingdom) [presenting]
Ilmun Kim - Yonsei University (Korea, South)
Anton Rask Lundborg - University of Cambridge (United Kingdom)
Richard Samworth - University of Cambridge (United Kingdom)
Abstract: Testing the significance of a variable $X$ for predicting a response $Y$ given additional covariates $Z$ is a ubiquitous task in statistics. One approach is to specify a generalised linear model and then test whether the regression coefficient for $X$ is non-zero. However, when the model is misspecified, as will invariably be the case, the test may have poor power, for example when $X$ is involved in complex interactions, or may lead to many false rejections. We study the problem of testing the model-free null that the conditional mean of $Y$ given $X$ and $Z$ does not depend on $X$. We propose a simple and general framework that can leverage flexible machine learning methods such as random forests or neural nets to yield both robust error control and high power. The procedure involves performing four regressions: two to construct a particular projection of $Y$ on $X$ and $Z$ using one half of the data, and two to estimate the expected conditional covariance between this projection and $Y$ on the remaining half of the data. By using appropriate regression methods, we show that settings where $Z$ or $X$ is high-dimensional can be tackled when there is an underlying sparse model. In the case where $X$ and $Z$ are of moderate dimension, we show that a version of our procedure using spline regression achieves (up to a log factor) what we prove is the minimax optimal rate for nonparametric testing.
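The four-regression, sample-splitting structure described in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: it is a simplified reading of the abstract only, it omits any refinements of the full procedure, and it substitutes a quadratic least-squares fit for the flexible learners (random forests, neural nets, splines) mentioned above. The names `poly_features`, `fit_predict`, `pcm_sketch` and `simulate` are hypothetical, and the one-sided normal calibration of the statistic is an assumption made for illustration.

```python
import math
import numpy as np


def poly_features(*cols):
    """Quadratic basis: intercept, linear terms, squares and pairwise products.
    A stand-in (assumption) for a flexible regression method."""
    cols = [np.asarray(c, dtype=float) for c in cols]
    feats = [np.ones_like(cols[0])] + cols
    for i in range(len(cols)):
        for j in range(i, len(cols)):
            feats.append(cols[i] * cols[j])
    return np.column_stack(feats)


def fit_predict(f_train, target, f_test):
    """Least-squares fit on (f_train, target), then predict at f_test."""
    coef, *_ = np.linalg.lstsq(f_train, target, rcond=None)
    return f_test @ coef


def pcm_sketch(x, y, z, seed=0):
    """Simplified two-half test; returns (statistic, one-sided p-value)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    a, b = idx[: len(y) // 2], idx[len(y) // 2:]

    xz_a, xz_b = poly_features(x[a], z[a]), poly_features(x[b], z[b])
    z_a, z_b = poly_features(z[a]), poly_features(z[b])

    # Half A, regression 1: f approximates E[Y | X, Z].
    f_a = fit_predict(xz_a, y[a], xz_a)
    f_b = fit_predict(xz_a, y[a], xz_b)
    # Half A, regression 2: m_f approximates E[f(X, Z) | Z].
    m_f_b = fit_predict(z_a, f_a, z_b)
    # Projection h = f - m_f, evaluated on half B.
    h_b = f_b - m_f_b

    # Half B, regression 3: residual of Y after regressing on Z.
    res_y = y[b] - fit_predict(z_b, y[b], z_b)
    # Half B, regression 4: residual of h(X, Z) after regressing on Z.
    res_h = h_b - fit_predict(z_b, h_b, z_b)

    # Studentised mean of residual products estimates the expected
    # conditional covariance between the projection and Y.
    prod = res_y * res_h
    stat = math.sqrt(len(b)) * prod.mean() / prod.std(ddof=1)
    # Assumed one-sided standard normal calibration: reject for large stat.
    p_value = 0.5 * math.erfc(stat / math.sqrt(2))
    return stat, p_value


def simulate(n, effect, seed):
    """Toy data: Y = Z + effect * X^2 + noise, with X correlated with Z."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)
    x = z + rng.normal(size=n)
    y = z + effect * x**2 + rng.normal(size=n)
    return x, y, z
```

In the toy alternative produced by `simulate(n, 1.0, seed)`, $X$ enters only through $X^2$, the kind of nonlinear dependence that a test on a linear coefficient can struggle to detect; setting `effect=0.0` gives data satisfying the null. The quadratic basis is deliberately matched to this toy model, so the sketch says nothing about performance with other learners or data.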