Title: Narrow big data in streams and Kolmogorov complexity
Authors: Michal Cerny - University of Economics Prague (Czech Republic) [presenting]
Abstract: The following computational model for Narrow Big Data is considered. The dataset is formalized as an $(n\times p)$-matrix $A$, where $n$ is the number of observations, $p$ is the dimension, and $n$ is assumed to be superpolynomial in $p$. The memory available to an algorithm is restricted by a polynomial in $p$; thus, $A$ cannot be stored in memory in full. Rows of $A$ (data points) are accessible on-line, one by one; once a data point has been processed, it cannot be accessed again. This is a natural model for the representation of Narrow Big Data which, however, imposes serious restrictions on statistical algorithms: certain quantities cannot be computed in the model at all. As an example, we prove that the sample median cannot be computed in this model. The negative proofs are based on Kolmogorov complexity arguments and essentially show that such quantities contain ``too much information'' which cannot be stored within the memory bounds imposed by the computational model. Another example is linear regression: while the Ordinary Least Squares (OLS) estimator can be computed efficiently in the model by a sequence of rank-one updates of the information matrix, Kolmogorov complexity arguments imply that $L_1$-norm estimators cannot be computed at all.
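The one-pass OLS computation mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name `streaming_ols` and the NumPy-based setup are assumptions. The point is that only the $p\times p$ information matrix $X^\top X$ and the $p$-vector $X^\top y$ need to be kept in memory, each updated once per incoming row, so the memory footprint is polynomial in $p$ regardless of $n$.

```python
import numpy as np

def streaming_ols(rows, p):
    """One-pass OLS from a stream of (x, y) pairs, x of length p.

    Hypothetical sketch: maintains the information matrix X^T X via
    rank-one updates and X^T y via vector additions, using O(p^2)
    memory independent of the number of observations n.
    """
    XtX = np.zeros((p, p))   # information matrix, built by rank-one updates
    Xty = np.zeros(p)        # accumulated X^T y
    for x, y in rows:        # each data point is read exactly once, then dropped
        XtX += np.outer(x, x)    # rank-one update
        Xty += y * x
    # Solve the normal equations (X^T X) beta = X^T y
    return np.linalg.solve(XtX, Xty)
```

Because the stream is consumed exactly once and nothing but the two accumulators survives between rows, this fits the model of the abstract; no analogous finite summary suffices for $L_1$-norm estimation, which is the content of the impossibility result.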