Learning-augmented count-min sketches via Bayesian nonparametrics
Emanuele Dolera, Stefano Favaro, Stefano Peluchetti

TL;DR
This paper introduces a Bayesian nonparametric framework for count-min sketches, specifically developing the CMS-PYP using Pitman-Yor process priors, which improves low-frequency token estimation in data streams, especially textual data.
Contribution
It provides a Bayesian proof for the CMS-DP and extends it to the CMS-PYP, enabling broader nonparametric modeling and better estimation of low-frequency tokens in data streams.
Findings
CMS-PYP outperforms CMS and CMS-DP in low-frequency token estimation
The Bayesian proof approach is adaptable to various nonparametric priors
Applications show improved accuracy on synthetic and real textual data
Abstract
The count-min sketch (CMS) is a time and memory efficient randomized data structure that provides estimates of tokens' frequencies in a data stream of tokens, i.e., point queries, based on random hashed data. A learning-augmented version of the CMS, referred to as CMS-DP, has been proposed by Cai, Mitzenmacher and Adams (NeurIPS 2018), and it relies on Bayesian nonparametric (BNP) modeling of the data stream of tokens via a Dirichlet process (DP) prior, with estimates of a point query being obtained as suitable mean functionals of the posterior distribution of the point query, given the hashed data. While the CMS-DP has proved to improve on some aspects of CMS, it has the major drawback of arising from a "constructive" proof that builds upon arguments tailored to the DP prior, namely arguments that are not usable for other nonparametric priors. In this paper, we present a…
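For readers unfamiliar with the baseline data structure, the following is a minimal sketch of the classic count-min sketch and its point query (the minimum over the hashed counters), not of the CMS-DP or CMS-PYP estimators discussed in the paper. The class name `CountMinSketch`, its parameters `width` and `depth`, and the use of salted BLAKE2 digests as the row hash functions are assumptions made purely for illustration.

```python
import hashlib


class CountMinSketch:
    """Illustrative classic count-min sketch (hypothetical names throughout)."""

    def __init__(self, width: int, depth: int) -> None:
        self.width = width    # number of counters per row
        self.depth = depth    # number of rows, one hash function per row
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, token: str, row: int) -> int:
        # Assumption: derive one hash function per row by salting BLAKE2b
        # with the row index; any family of independent hashes would do.
        digest = hashlib.blake2b(
            token.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def update(self, token: str, count: int = 1) -> None:
        # Each token increments exactly one counter in every row.
        for row in range(self.depth):
            self.table[row][self._bucket(token, row)] += count

    def query(self, token: str) -> int:
        # Point query: the minimum over rows. Collisions can only inflate
        # counters, so this never underestimates the true frequency.
        return min(
            self.table[row][self._bucket(token, row)]
            for row in range(self.depth)
        )


# Small usage example on a toy stream of tokens.
stream = ["a", "b", "a", "c", "a", "b"]
cms = CountMinSketch(width=16, depth=4)
for token in stream:
    cms.update(token)

estimate_a = cms.query("a")  # at least 3, the true count of "a"
```

Because the classic estimate is an upper bound on the true count, it tends to be least accurate in relative terms for rare tokens, which is the low-frequency regime the summary above highlights.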
Taxonomy
Topics: Bayesian Methods and Mixture Models · Data Management and Algorithms · Music and Audio Processing
