A Bayesian nonparametric approach to count-min sketch under power-law data streams
Emanuele Dolera, Stefano Favaro, Stefano Peluchetti

TL;DR
This paper introduces a Bayesian nonparametric method for the count-min sketch data structure, improving frequency estimates for low-frequency tokens in power-law data streams, with applications in natural language processing.
Contribution
The paper develops a novel learning-augmented CMS based on Bayesian nonparametrics (BNP), placing a normalized inverse Gaussian process (NIGP) prior on the unknown token distribution to improve frequency estimation under power-law data streams.
Findings
Enhanced estimation of low-frequency tokens.
Effective in natural language processing scenarios.
Outperforms traditional CMS in power-law data streams.
Abstract
The count-min sketch (CMS) is a randomized data structure that provides estimates of tokens' frequencies in a large data stream using a compressed representation of the data by random hashing. In this paper, we rely on a recent Bayesian nonparametric (BNP) view on the CMS to develop a novel learning-augmented CMS under power-law data streams. We assume that tokens in the stream are drawn from an unknown discrete distribution, which is endowed with a normalized inverse Gaussian process (NIGP) prior. Then, using distributional properties of the NIGP, we compute the posterior distribution of a token's frequency in the stream, given the hashed data, and in turn corresponding BNP estimates. Applications to synthetic and real data show that our approach achieves a remarkable performance in the estimation of low-frequency tokens. This is known to be a desirable feature in the context of…
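For orientation, the classical CMS that the abstract starts from can be sketched as below. This is a minimal illustration of the standard data structure only, not the paper's BNP estimator or its NIGP posterior. The class name `CountMinSketch`, the `width` and `depth` parameters, and the use of salted BLAKE2b as a stand-in for a family of random hash functions are all assumptions made for this example.

```python
import hashlib


class CountMinSketch:
    """Classical count-min sketch: `depth` rows of `width` counters each."""

    def __init__(self, width: int, depth: int) -> None:
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, token: str, row: int) -> int:
        # Assumption: salting BLAKE2b with the row index stands in for
        # drawing one independent random hash function per row.
        digest = hashlib.blake2b(
            token.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def update(self, token: str, count: int = 1) -> None:
        # Each token increments exactly one counter in every row.
        for row in range(self.depth):
            self.table[row][self._bucket(token, row)] += count

    def estimate(self, token: str) -> int:
        # Collisions can only inflate a counter, so the minimum over rows
        # is the tightest available upper bound on the true frequency.
        return min(
            self.table[row][self._bucket(token, row)]
            for row in range(self.depth)
        )


# Toy stream with one frequent, one moderate, and one rare token.
stream = ["the"] * 50 + ["sketch"] * 5 + ["rare"]
cms = CountMinSketch(width=64, depth=4)
for token in stream:
    cms.update(token)

true_counts = {"the": 50, "sketch": 5, "rare": 1}
estimates = {token: cms.estimate(token) for token in true_counts}
```

The classical estimate never falls below the true count, and any overestimation comes from hash collisions. In relative terms that error weighs most heavily on rare tokens, which is the regime the abstract singles out for the BNP approach.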
Taxonomy
Topics: Bayesian Methods and Mixture Models · Algorithms and Data Compression · Music and Audio Processing
Methods: Gaussian Process
