CMStatistics 2022
B1828
Title: Unsupervised tree boosting for learning probability distributions
Authors: Naoki Awaya - Stanford University (United States)
Li Ma - Duke University (United States) [presenting]
Abstract: An unsupervised tree boosting algorithm is proposed for inferring the underlying sampling distribution of an i.i.d. sample by fitting additive tree ensembles in a fashion analogous to supervised tree boosting. Integral to the algorithm is a new notion of ``addition'' on probability distributions, which leads to a coherent notion of ``residualization'': subtracting a probability distribution from an observation so as to remove that distribution's structure from the observation's sampling distribution. We show that these notions arise naturally for univariate distributions through cumulative distribution function (CDF) transforms and compositions, owing to several ``group-like'' properties of univariate CDFs. While the traditional multivariate CDF does not preserve these properties, a new definition of the multivariate CDF restores them, allowing the notions of ``addition'' and ``residualization'' to be formulated in multivariate settings as well. This gives rise to an unsupervised boosting algorithm based on forward-stagewise fitting of an additive tree ensemble, which sequentially reduces the Kullback-Leibler divergence from the truth. The algorithm permits analytic evaluation of the fitted density and outputs a generative model that can readily be sampled from. We enhance the algorithm with scale-dependent shrinkage and a two-stage strategy that fits the marginals and the copula separately.
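As a concrete, informal illustration of the univariate ``residualization'' idea (not part of the submitted work): applying a distribution's CDF to an observation drawn from that same distribution yields a Uniform(0,1) residual, by the probability integral transform. The sketch below demonstrates this; the choice of a normal distribution and the Kolmogorov-Smirnov check are our own illustrative assumptions, and the multivariate algorithm of the abstract would replace the known CDF with tree-ensemble fits.

```python
import numpy as np
from scipy import stats

# Illustrative sketch only: univariate "residualization" via the
# probability integral transform (our example, not the authors' code).
rng = np.random.default_rng(0)

# i.i.d. sample from G = N(2, 3^2), standing in for the unknown sampling distribution.
x = rng.normal(loc=2.0, scale=3.0, size=5000)

# Residualize: r = G(x). Since x ~ G, the residuals r are ~ Uniform(0, 1),
# i.e. G's distributional structure has been "subtracted" from the data.
r = stats.norm.cdf(x, loc=2.0, scale=3.0)

# Sanity check: the residuals should be statistically indistinguishable
# from Uniform(0, 1) (expect a large Kolmogorov-Smirnov p-value).
print(stats.kstest(r, "uniform"))
```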