Title: Learning-augmented count-min sketches via Bayesian nonparametrics
Authors: Emanuele Dolera - University of Pavia (Italy)
Stefano Favaro - University of Torino and Collegio Carlo Alberto (Italy) [presenting]
Stefano Peluchetti - Cogent Labs (Japan)
Abstract: The count-min sketch (CMS) is a randomized data structure that provides estimates of tokens frequencies in a large data stream using a compressed representation of the data by random hashing. We present a Bayesian nonparametric (BNP) approach to CMS, and then develop a novel learning-augmented CMS under power-law data streams. We assume that tokens in the stream are drawn from an unknown discrete distribution, which is endowed with a Pitman-Yor process (PYP) prior. By means of distributional properties of the PYP, we compute the posterior distribution of a tokens frequency in the stream, given the hashed data, and in turn, corresponding BNP estimates. Applications to synthetic and real data show that our approach achieves a remarkable performance in the estimation of low-frequency tokens. This is known to be a desirable feature in the context of natural language processing, where it is indeed common in the context of the power-law behaviour of the data.