CMStatistics 2021: Start Registration
View Submission - CMStatistics
Title: Probit regression for large data sets via coresets Authors:  Christian Peters - TU Dortmund (Germany) [presenting]
Abstract: The purpose is to show how probit regression, an important generalized linear model for binary responses, can be solved efficiently, even when the data sets are large. To this end, we develop an algorithm that reduces a $d$-dimensional data set from $n$ points to a coreset of $\operatorname{poly}(\mu d \log n)$ weighted points, where $\mu$ is a usually small complexity parameter for compressing the data. The coreset provably allows approximating the probit loss function on the original data set up to a factor of $(1\pm\varepsilon)$ for all data sets with bounded mu-complexity, in which case the data is inseparable and the maximum likelihood estimator exists. We show how the coreset can be computed by an online algorithm that requires only one pass over the data set and $O(d^2)$ update time. The experiments on real-world data sets demonstrate that the algorithm outperforms both uniform sampling and stochastic gradient descent and that it can be applied successfully in the context of Bayesian probit regression. We also briefly discuss extensions of the standard probit model that also admit efficient sublinear approximations in the coreset framework.