Mining Top-K Frequent Itemsets Through Progressive Sampling
Andrea Pietracaprina, Matteo Riondato, Eli Upfal, Fabio Vandin

TL;DR
This paper introduces a progressive sampling method, with stopping conditions and a Bloom filter enhancement, that approximates the top-K frequent itemsets from a random sample instead of the full dataset while maintaining high accuracy.
Contribution
The paper proposes a progressive sampling algorithm with stopping conditions for mining the top-K frequent itemsets, together with a practical improvement based on Bloom filters.
Findings
Random samples smaller than the full dataset can accurately approximate the top-K frequent itemsets.
The upper bound on the sample size is asymptotically tight when the maximum itemset cardinality w is constant.
Experiments show high accuracy with reduced sample sizes.
Abstract
We study the use of sampling for efficiently mining the top-K frequent itemsets of cardinality at most w. To this purpose, we define an approximation to the top-K frequent itemsets to be a family of itemsets which includes (resp., excludes) all very frequent (resp., very infrequent) itemsets, together with an estimate of these itemsets' frequencies with a bounded error. Our first result is an upper bound on the sample size which guarantees that the top-K frequent itemsets mined from a random sample of that size approximate the actual top-K frequent itemsets, with probability larger than a specified value. We show that the upper bound is asymptotically tight when w is constant. Our main algorithmic contribution is a progressive sampling approach, combined with suitable stopping conditions, which on appropriate inputs is able to extract approximate top-K frequent itemsets from samples…
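The progressive sampling idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the function names, the doubling schedule, and the stopping rule (stop once the top-K family and its estimated frequencies change by at most `tol` between two consecutive samples) are all assumptions made for illustration, and this heuristic carries none of the probabilistic guarantees described in the paper. No Bloom filter is used; itemset counts are kept in a plain dictionary.

```python
import random
from collections import Counter
from itertools import combinations


def itemset_frequencies(transactions, w):
    """Return the fraction of transactions containing each itemset of size 1..w."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for size in range(1, w + 1):
            counts.update(combinations(items, size))
    n = len(transactions)
    return {itemset: c / n for itemset, c in counts.items()}


def top_k(freqs, k):
    """Return the k most frequent itemsets as (itemset, frequency) pairs."""
    return sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


def progressive_top_k(dataset, k, w, start=100, growth=2, tol=0.02, seed=0):
    """Grow the sample until a (hypothetical) stability check passes.

    Assumed stopping rule, NOT the paper's: two consecutive samples agree
    on the top-k family and on each frequency to within `tol`.
    """
    rng = random.Random(seed)
    size, previous = start, None
    while size < len(dataset):
        # Draw transactions uniformly at random, with replacement.
        sample = rng.choices(dataset, k=size)
        current = top_k(itemset_frequencies(sample, w), k)
        if previous is not None:
            prev = dict(previous)
            stable = all(
                s in prev and abs(f - prev[s]) <= tol for s, f in current
            )
            if stable:
                return current, size
        previous = current
        size *= growth
    # The sample would be as large as the dataset: mine the dataset exactly.
    return top_k(itemset_frequencies(dataset, w), k), len(dataset)
```

A call such as `progressive_top_k(transactions, k=10, w=2)` returns the estimated top-K list together with the number of transactions examined in the final round, so the latter can be compared with the size of the full dataset.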
