Bayesian nonparametric estimation of coverage probabilities and distinct counts from sketched data
Stefano Favaro, Matteo Sesia

TL;DR
This paper introduces a Bayesian nonparametric approach to estimate coverage probabilities and distinct counts from sketched data, enabling analysis when only compressed summaries are available.
Contribution
It proposes a novel Bayesian methodology for estimating coverage and distinct counts from sketched data, applicable to large-scale and imperfect data summaries.
Findings
Estimation is demonstrated to be effective on real datasets
The method is applicable with a Dirichlet process prior, though some computational challenges remain
The approach shows promise for large-scale data analysis in various fields
Abstract
The estimation of coverage probabilities, and in particular of the missing mass, is a classical statistical problem with applications in numerous scientific fields. In this paper, we study this problem in relation to randomized data compression, or sketching. This is a novel but practically relevant perspective, and it refers to situations in which coverage probabilities must be estimated based on a compressed and imperfect summary, or sketch, of the true data, because neither the full data nor the empirical frequencies of distinct symbols can be observed directly. Our contribution is a Bayesian nonparametric methodology to estimate coverage probabilities from data sketched through random hashing, which also solves the challenging problems of recovering the numbers of distinct counts in the true data and of distinct counts with a specified empirical frequency of interest. The proposed…
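To make the setting concrete, here is a minimal Python sketch of the two ingredients the abstract mentions: sketching data through random hashing, and estimating the missing mass. It is an illustration under stated assumptions, not the paper's Bayesian nonparametric method. The function names, the use of SHA-256 as the hash, the bucket count, and the toy data are all hypothetical. The missing-mass estimate shown is the classical Good-Turing estimator, which needs the full data and so is exactly what is unavailable once only the sketch is kept.

```python
import hashlib
from collections import Counter


def sketch(data, num_buckets, seed=0):
    """Compress a sequence of symbols into bucket counts via a seeded hash.

    Hypothetical helper: each symbol is hashed to one of `num_buckets`
    buckets, and only the per-bucket counts are retained.
    """
    counts = [0] * num_buckets
    for symbol in data:
        digest = hashlib.sha256(f"{seed}:{symbol}".encode()).digest()
        bucket = int.from_bytes(digest[:8], "big") % num_buckets
        counts[bucket] += 1
    return counts


def good_turing_missing_mass(data):
    """Classical Good-Turing estimate of the missing mass.

    Returns the fraction of observations whose symbol appears exactly once.
    This requires the empirical frequencies of the full data.
    """
    freq = Counter(data)
    singletons = sum(1 for c in freq.values() if c == 1)
    return singletons / len(data)


# Toy data (assumed for illustration): 8 observations, 5 distinct symbols.
data = ["a", "b", "a", "c", "d", "a", "b", "e"]

buckets = sketch(data, num_buckets=4)       # all that a sketch retains
full_data_estimate = good_turing_missing_mass(data)
distinct_in_full_data = len(set(data))
nonempty_buckets = sum(1 for c in buckets if c > 0)
```

With four buckets and five distinct symbols, at least two symbols must collide, so the number of nonempty buckets understates the distinct count. Recovering quantities such as the missing mass or the distinct count from `buckets` alone is the problem the paper addresses.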
Taxonomy
Topics: Bayesian Methods and Mixture Models · Statistical Methods and Inference · Gene expression and cancer classification
