Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees

Matteo Riondato; Eli Upfal

arXiv:1111.6937 · cs.DS · March 19, 2015

TL;DR

This paper introduces a sampling-based method for efficiently discovering approximate frequent itemsets and association rules with provable guarantees on approximation quality. The required sample size follows from a VC-dimension bound that depends on a characteristic of the dataset, its d-index.

Contribution

The paper presents a technique that uses the VC-dimension to determine the sample size needed to guarantee a given approximation quality when mining frequent itemsets and association rules.

Findings

01. The sample size depends linearly on the VC-dimension.

02. The VC-dimension is bounded from above by the dataset's d-index.

03. The method provides tight guarantees on the quality of the approximation.
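
The first two findings can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions: the d-index is taken to be the largest integer d such that the dataset contains at least d transactions of length at least d, none of which is a subset of another; the hypothetical helper `d_bound` computes only a simpler upper bound on that quantity by counting distinct transactions and ignoring the subset condition; and `sample_size` uses the generic ε-approximation form (c / ε²)(d + ln(1/δ)), where the constant `c = 0.5` is an assumed placeholder rather than a value stated on this page.

```python
import math


def d_bound(transactions):
    """Largest q such that at least q distinct transactions have length >= q.

    Assumption: this is an upper bound on the d-index, because it skips the
    check that no selected transaction is a subset of another.
    """
    distinct = {frozenset(t) for t in transactions}
    lengths = sorted((len(t) for t in distinct), reverse=True)
    q = 0
    for rank, length in enumerate(lengths, start=1):
        if length >= rank:
            q = rank
        else:
            break
    return q


def sample_size(d, eps, delta, c=0.5):
    """Sample size of the generic form (c / eps^2) * (d + ln(1 / delta)).

    d     : upper bound on the VC-dimension (here, the value from d_bound)
    eps   : maximum allowed error on itemset frequencies
    delta : allowed failure probability
    c     : assumed constant, a placeholder for illustration only
    """
    return math.ceil((c / eps ** 2) * (d + math.log(1.0 / delta)))


# Hypothetical toy dataset: each transaction is a set of item identifiers.
dataset = [{1, 2, 3}, {1, 2}, {2, 3, 4}, {1}, {1, 2, 3}]
d = d_bound(dataset)
n = sample_size(d, eps=0.1, delta=0.1)
print(d, n)
```

Under these assumptions the sample size grows linearly with the bound d and does not depend on the number of transactions in the dataset, which is why the bound can be computed once in a single scan and then reused for any choice of `eps` and `delta`.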

Abstract

The tasks of extracting (top-$K$) Frequent Itemsets (FI's) and Association Rules (AR's) are fundamental primitives in data mining and database applications. Exact algorithms for these problems exist and are widely used, but their running time is hindered by the need to scan the entire dataset, possibly multiple times. High-quality approximations of FI's and AR's are sufficient for most practical uses, and a number of recent works have explored the application of sampling for fast discovery of approximate solutions to these problems. However, these works do not provide satisfactory performance guarantees on the quality of the approximation, due to the difficulty of bounding the probability of under- or over-sampling any one of an unknown number of frequent itemsets. In this work we circumvent this issue by applying the statistical concept of Vapnik-Chervonenkis (VC) dimension to…

Taxonomy

Topics: Data Mining Algorithms and Applications · Rough Sets and Fuzzy Logic · Imbalanced Data Classification Techniques