Sharp Frequency Bounds for Sample-Based Queries

Eric Bax; John Donald

arXiv:2208.06753 · cs.LG · August 16, 2022

Open Access

TL;DR

This paper shows how to efficiently compute probably approximately correct (PAC) bounds on item frequencies in a large data set from a fixed-size random sample. The bounds are either sharp or off by only one, which the abstract describes as the best possible result without exact computation.

Contribution

The paper presents a method for deriving PAC bounds on frequencies from sample-based data sketches. The bounds are either sharp or off by only one, which improves the precision of inference on large data sets.

Findings

01 Achieves bounds that are either sharp or off by only one

02 Provides an efficient method for PAC frequency inference

03 Enhances accuracy of data sketches for large data sets

Abstract

A data sketch algorithm scans a big data set, collecting a small amount of data -- the sketch, which can be used to statistically infer properties of the big data set. Some data sketch algorithms take a fixed-size random sample of a big data set, and use that sample to infer frequencies of items that meet various criteria in the big data set. This paper shows how to statistically infer probably approximately correct (PAC) bounds for those frequencies, efficiently, and precisely enough that the frequency bounds are either sharp or off by only one, which is the best possible result without exact computation.
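
To make the idea of a PAC frequency bound concrete, here is a minimal sketch in Python. It is not the paper's efficient method. It is a brute-force illustration under a stated assumption: the sample of size n is drawn uniformly without replacement from N items, so the number of sampled items meeting a criterion follows a hypergeometric distribution. The function names (tail_prob, upper_bound, lower_bound) and the linear scan over candidate counts are hypothetical choices made for this illustration.

```python
from math import comb


def tail_prob(lo, hi, N, M, n):
    """P(lo <= X <= hi), where X is the number of matching items in a
    size-n sample drawn without replacement from N items, M of which match.
    Assumes the hypergeometric model described in the lead-in."""
    lo, hi = max(lo, 0), min(hi, n)
    ways = sum(comb(M, i) * comb(N - M, n - i) for i in range(lo, hi + 1))
    return ways / comb(N, n)


def upper_bound(k, N, n, delta):
    """Largest population count M for which seeing k or fewer matches
    still has probability greater than delta (brute-force scan)."""
    M = N - (n - k)  # the n - k non-matching sample items cap M
    while M > k and tail_prob(0, k, N, M, n) <= delta:
        M -= 1
    return M


def lower_bound(k, N, n, delta):
    """Smallest population count M for which seeing k or more matches
    still has probability greater than delta (brute-force scan)."""
    M = k  # the k matching sample items are a floor for M
    while M < N - (n - k) and tail_prob(k, n, N, M, n) <= delta:
        M += 1
    return M


# Example: 10 items, sample of 5, no matches seen, delta = 0.05.
print(lower_bound(0, 10, 5, 0.05), upper_bound(0, 10, 5, 0.05))  # 0 3
```

Under this model, each one-sided bound holds with probability at least 1 - delta over the draw of the sample, and using both together holds with probability at least 1 - 2 * delta. The scan evaluates a full tail sum for every candidate count, which is impractical for large N. Avoiding that cost while staying sharp or off by only one is the problem the abstract describes.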

Taxonomy

Topics: Data Mining Algorithms and Applications · Machine Learning and Data Classification · Rough Sets and Fuzzy Logic