Title: Bandwidth selection in nonparametric regression with very large sample size
Authors: Mario Francisco-Fernandez - Universidade da Coruna (Spain)
Ricardo Cao - Universidade da Coruna (Spain)
Daniel Barreiro Ures - Universidade da Coruna (Spain) [presenting]
Abstract: In nonparametric regression estimation, the behaviour of kernel methods such as the Nadaraya-Watson and local linear estimators depends heavily on the bandwidth parameter, which governs the trade-off between bias and variance. Selecting an optimal bandwidth, in the sense of minimizing some risk function (MSE, MISE, etc.), is therefore a crucial issue. In a big data context, however, estimating an optimal bandwidth from the whole sample can be very expensive in computing time, because some of the most widely used bandwidth selection algorithms scale poorly with the sample size $n$ (leave-one-out cross-validation, for example, has $O(n^2)$ complexity). To overcome this problem, we propose two methods that estimate the optimal bandwidth on several subsamples of the large dataset and then extrapolate the result to the original sample size using the asymptotic expression of the MISE-optimal bandwidth. Preliminary simulation studies show that the proposed methods drastically reduce computing time while only slightly decreasing statistical precision.
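The subsample-and-extrapolate idea can be illustrated with a minimal sketch. It is not the authors' procedure: the abstract does not specify the two proposed methods, so everything here is an assumption made for illustration. The sketch assumes a Nadaraya-Watson estimator with a Gaussian kernel, a grid search over candidate bandwidths, and the standard asymptotic rate $h_{MISE} \approx c\,n^{-1/5}$ for second-order kernels, which gives the rescaling $h_n = h_m (m/n)^{1/5}$ from subsample size $m$ to full size $n$. The function names `loo_cv_score`, `cv_bandwidth` and `extrapolated_bandwidth` are hypothetical.

```python
import numpy as np


def loo_cv_score(x, y, h):
    """Leave-one-out CV error of a Nadaraya-Watson fit (Gaussian kernel).

    Builds the full pairwise weight matrix, so cost is O(len(x)^2).
    """
    d = (x[:, None] - x[None, :]) / h
    w = np.exp(-0.5 * d**2)
    np.fill_diagonal(w, 0.0)  # drop each point's own observation
    denom = w.sum(axis=1)
    if np.any(denom <= 0.0):
        # Bandwidth too small: some point has no effective neighbours.
        return np.inf
    pred = (w @ y) / denom
    return float(np.mean((y - pred) ** 2))


def cv_bandwidth(x, y, grid):
    """Grid value that minimises the leave-one-out CV error."""
    scores = [loo_cv_score(x, y, h) for h in grid]
    return float(grid[int(np.argmin(scores))])


def extrapolated_bandwidth(x, y, m, n_subsamples, grid, rng):
    """Average CV bandwidths over subsamples of size m, then rescale to n.

    Assumption (illustrative): h is proportional to n^(-1/5), so
    h_n = h_m * (m / n)^(1/5).
    """
    n = len(x)
    h_sub = []
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=m, replace=False)  # subsample without replacement
        h_sub.append(cv_bandwidth(x[idx], y[idx], grid))
    h_m = float(np.mean(h_sub))
    return h_m * (m / n) ** 0.2
```

Under these assumptions each cross-validation pass costs $O(m^2)$ per candidate bandwidth rather than $O(n^2)$, so the total work is governed by the subsample size and the number of subsamples instead of by $n$. Averaging over several subsamples is one simple way to reduce the variability of the subsample bandwidths before rescaling.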