A smoothed-Bayesian approach to frequency recovery from sketched data
Mario Beraha, Stefano Favaro, Matteo Sesia

TL;DR
This paper introduces a smoothed-Bayesian method for accurately recovering symbol frequencies from compressed sketch data, offering computational efficiency and strong theoretical guarantees, especially for large-scale and complex distributions.
Contribution
It presents a novel frequentist framework for Bayesian-inspired frequency estimation from sketches, addressing computational challenges and extending applicability to large, realistic datasets.
Findings
For single-hash sketches, the estimator is unbiased and optimal under squared error loss.
Supports efficient multi-hash sketch frequency estimation.
Validated on synthetic and real datasets, outperforming existing methods.
Abstract
We provide a novel statistical perspective on a classical problem at the intersection of computer science and information theory: recovering the empirical frequency of a symbol in a large discrete dataset using only a compressed representation, or sketch, obtained via random hashing. Departing from traditional algorithmic approaches, recent works have proposed Bayesian nonparametric (BNP) methods that can provide more informative frequency estimates by leveraging modeling assumptions about the distribution of the sketched data. In this paper, we propose a smoothed-Bayesian method, inspired by existing BNP approaches but designed in a frequentist framework to overcome the computational limitations of the BNP approaches when dealing with large-scale data from realistic distributions, including those with power-law tail behaviors. For sketches obtained with a single hash function, our…
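To make the problem setup concrete, here is a minimal Python illustration of a single-hash sketch and the classical bucket-count estimate. It is not the authors' smoothed-Bayesian estimator; the function names (`bucket_of`, `build_sketch`, `upper_bound_estimate`), the SHA-256-based hash, and the toy data are all assumptions made for illustration.

```python
import hashlib
from collections import Counter


def bucket_of(symbol: str, num_buckets: int, seed: int = 0) -> int:
    """Map a symbol to a bucket with a seeded hash (hypothetical choice of hash)."""
    digest = hashlib.sha256(f"{seed}:{symbol}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def build_sketch(data, num_buckets: int, seed: int = 0):
    """Compress the data into bucket counts; the original symbols are discarded."""
    sketch = [0] * num_buckets
    for symbol in data:
        sketch[bucket_of(symbol, num_buckets, seed)] += 1
    return sketch


def upper_bound_estimate(sketch, symbol: str, seed: int = 0) -> int:
    """Classical estimate: the count of the queried symbol's bucket.

    Hash collisions can only add to a bucket, so this value never
    underestimates the true empirical frequency.
    """
    return sketch[bucket_of(symbol, len(sketch), seed)]


# Toy dataset with a skewed frequency profile (assumed for illustration).
data = ["a"] * 50 + ["b"] * 30 + ["c"] * 15 + ["d"] * 5
sketch = build_sketch(data, num_buckets=3)
true_counts = Counter(data)

for symbol in sorted(true_counts):
    print(symbol, true_counts[symbol], upper_bound_estimate(sketch, symbol))
```

With fewer buckets than distinct symbols, some bucket counts exceed the true frequencies. That gap is what model-based estimators, such as the BNP and smoothed-Bayesian approaches discussed in the abstract, aim to correct by using assumptions about the data distribution.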
Taxonomy
Topics: Bayesian Methods and Mixture Models · Algorithms and Data Compression · Data Management and Algorithms
