Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees
Matteo Riondato, Eli Upfal

TL;DR
This paper introduces a sampling-based method for efficiently discovering approximate frequent itemsets and association rules with strong theoretical guarantees, leveraging VC-dimension bounds related to dataset characteristics.
Contribution
It presents a novel technique that uses VC-dimension to determine sample sizes for guaranteed approximation quality in data mining tasks.
Findings
Sample size depends linearly on VC-dimension.
VC-dimension is bounded above by the dataset's d-index.
The method provides tight approximation guarantees.
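The findings above can be made concrete with a minimal sketch. It assumes the d-index is the largest integer d such that the dataset contains at least d transactions of length at least d, and it uses the general form of the VC-dimension sample bound, m = (c / eps^2) * (d + ln(1 / delta)), which is linear in d. The function names, the default constant c = 0.5, and the toy dataset are illustrative assumptions, not taken from the paper; the paper's exact constants for each approximation type may differ.

```python
import math


def d_index(transactions):
    """Largest d such that at least d transactions have length >= d.

    Assumed simplified definition; any extra conditions on the
    transactions (e.g. distinctness) are ignored, so this can only
    overestimate a stricter d-index.
    """
    # Sort transaction lengths in decreasing order (duplicate items ignored).
    lengths = sorted((len(set(t)) for t in transactions), reverse=True)
    d = 0
    for rank, length in enumerate(lengths, start=1):
        if length >= rank:
            d = rank  # the top `rank` transactions all have length >= rank
        else:
            break
    return d


def sample_size(d, eps, delta, c=0.5):
    """Sample size of the form (c / eps^2) * (d + ln(1 / delta)).

    d     : upper bound on the VC-dimension (here, the d-index)
    eps   : accuracy parameter, 0 < eps < 1
    delta : failure probability, 0 < delta < 1
    c     : universal constant; 0.5 is an assumed illustrative value
    """
    return math.ceil((c / eps ** 2) * (d + math.log(1.0 / delta)))


# Hypothetical toy dataset: each transaction is a list of item ids.
toy = [[1, 2, 3], [1, 2], [2, 3, 4], [5]]
d = d_index(toy)
m = sample_size(d, eps=0.1, delta=0.1)
```

Note that the d-index can be computed in a single pass over the transaction lengths, and the resulting sample size does not depend on the number of transactions in the dataset.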
Abstract
The tasks of extracting (top-K) Frequent Itemsets (FI's) and Association Rules (AR's) are fundamental primitives in data mining and database applications. Exact algorithms for these problems exist and are widely used, but their running time is hindered by the need of scanning the entire dataset, possibly multiple times. High quality approximations of FI's and AR's are sufficient for most practical uses, and a number of recent works explored the application of sampling for fast discovery of approximate solutions to the problems. However, these works do not provide satisfactory performance guarantees on the quality of the approximation, due to the difficulty of bounding the probability of under- or over-sampling any one of an unknown number of frequent itemsets. In this work we circumvent this issue by applying the statistical concept of Vapnik-Chervonenkis (VC) dimension to…
Taxonomy
Topics: Data Mining Algorithms and Applications · Rough Sets and Fuzzy Logic · Imbalanced Data Classification Techniques
