Title: Streaming statistical models via merge \& reduce
Authors: Katja Ickstadt - TU Dortmund University (Germany) [presenting]
Abstract: Merge \& Reduce is a general algorithmic scheme in the theory of data structures. Its main purpose is to transform static data structures into dynamic data structures with as little overhead as possible. This can be used to turn classic offline algorithms for summarizing and analyzing data into streaming algorithms. We transfer these ideas to the setting of statistical data analysis in streaming environments. Our approach is conceptually different from previous settings where Merge \& Reduce has been employed. Instead of summarizing the data, we combine the Merge \& Reduce framework directly with statistical models. This enables performing computationally demanding data analysis tasks on massive data sets. The computations are divided into small, tractable batches independent of the total number of observations $n$, and the results are combined in a structured way at the cost of a bounded $O(\log n)$ factor in their memory requirements. It is only necessary (though non-trivial) to choose an appropriate statistical model and implement merge and reduce operations for the specific type of model. We illustrate our Merge \& Reduce schemes on simulated and real-world data employing Bayesian linear regression models, Gaussian mixture models, and generalized linear models.
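The scheme described above can be sketched in code. The following is a minimal illustration, not the authors' implementation: it assumes a linear model whose batch summaries are the sufficient statistics $(X^\top X, X^\top y)$, for which the merge operation is simply addition. The classic Merge \& Reduce bookkeeping keeps at most one summary per level, so memory grows only logarithmically in the number of batches.

```python
import numpy as np

def batch_summary(X, y):
    # Reduce step (illustrative): summarize one batch of observations
    # by the sufficient statistics of a linear model.
    return X.T @ X, X.T @ y

def merge(s1, s2):
    # Merge step: for these sufficient statistics, summaries simply add.
    return s1[0] + s2[0], s1[1] + s2[1]

class MergeReduce:
    """Merge & Reduce bookkeeping: at most one summary per level,
    so after b batches only O(log b) summaries are held in memory."""

    def __init__(self):
        self.levels = {}  # level -> summary

    def insert(self, summary):
        # Carry upward like binary addition: whenever a level is
        # occupied, merge the two summaries and promote the result.
        level = 0
        while level in self.levels:
            summary = merge(self.levels.pop(level), summary)
            level += 1
        self.levels[level] = summary

    def result(self):
        # Combine the remaining per-level summaries into one.
        out = None
        for s in self.levels.values():
            out = s if out is None else merge(out, s)
        return out

# Hypothetical usage: stream 10 batches, then solve for the coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
beta = np.array([1.0, -2.0, 0.5])
y = X @ beta
mr = MergeReduce()
for i in range(0, 100, 10):
    mr.insert(batch_summary(X[i:i+10], y[i:i+10]))
XtX, Xty = mr.result()
coef = np.linalg.solve(XtX, Xty)
```

For other model classes named in the abstract (e.g. Gaussian mixtures or generalized linear models) the merge operation is not exact addition and must be designed per model, which is the non-trivial part the abstract points to.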