Finding the True Frequent Itemsets

Matteo Riondato; Fabio Vandin

arXiv:1301.1218·cs.LG·January 23, 2014

TL;DR

This paper takes a statistical learning approach to finding the True Frequent Itemsets (TFIs) behind a transactional dataset. It uses VC-dimension theory to choose a frequency threshold so that, with high probability, the reported itemsets are true frequent itemsets rather than false positives.

Contribution

It proposes an algorithm that uses VC-dimension bounds to return, with high probability, only true frequent itemsets, and it compares favorably with methods based on standard bounds such as the Chernoff bound.

Findings

01. The method reduces false positives in frequent itemset mining.

02. In experimental comparisons, it outperforms methods based on standard bounds such as the Chernoff bound.

03. With high probability, every itemset it reports is a true frequent itemset.
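For context, the kind of standard-bound baseline mentioned in the findings can be sketched with Hoeffding's inequality (a Chernoff-type bound) combined with a union bound over all candidate itemsets. This is an illustrative assumption about the general shape of such a baseline, not the paper's exact procedure or its VC-dimension method. The function name `union_bound_threshold` and its parameters are hypothetical.

```python
import math

def union_bound_threshold(theta, n, num_candidates, delta):
    """Raised frequency threshold from Hoeffding's inequality plus a union bound.

    Assumption (not taken from the paper): for one itemset, the one-sided
    Hoeffding bound gives P(f_D - t_pi >= eps) <= exp(-2 * n * eps**2).
    A union bound over `num_candidates` itemsets then yields
    eps = sqrt(ln(num_candidates / delta) / (2 * n)).
    With probability at least 1 - delta, every candidate itemset whose
    frequency in the n samples is >= theta + eps has true frequency > theta.
    """
    eps = math.sqrt(math.log(num_candidates / delta) / (2 * n))
    return theta + eps

# Hypothetical setting: 20 items, so 2**20 - 1 non-empty candidate itemsets.
theta_hat = union_bound_threshold(theta=0.1, n=10_000,
                                  num_candidates=2**20 - 1, delta=0.1)
```

Because the deviation term grows with the logarithm of the number of candidates, a threshold of this kind becomes conservative when the item universe is large, which is the weakness a tighter bound aims to address.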

Abstract

Frequent Itemsets (FIs) mining is a fundamental primitive in data mining. It requires identifying all itemsets that appear in at least a fraction $θ$ of a transactional dataset $D$. Often, though, the ultimate goal of mining $D$ is not an analysis of the dataset per se, but an understanding of the underlying process that generated it. Specifically, in many applications $D$ is a collection of samples obtained from an unknown probability distribution $π$ on transactions, and by extracting the FIs in $D$ one attempts to infer itemsets that are frequently (i.e., with probability at least $θ$) generated by $π$, which we call the True Frequent Itemsets (TFIs). Due to the inherently stochastic nature of the generative process, the set of FIs is only a rough approximation of the set of TFIs, as it often contains a huge number of…
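The definition of a frequent itemset in the abstract can be made concrete with a minimal brute-force sketch. The helper names `frequency` and `frequent_itemsets` and the toy dataset are hypothetical and chosen only for illustration. Enumerating every candidate itemset is exponential in the number of items, so this shows the definition, not a practical miner.

```python
from itertools import combinations

def frequency(itemset, dataset):
    """Fraction of transactions in `dataset` that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in dataset if itemset <= t) / len(dataset)

def frequent_itemsets(dataset, theta):
    """Brute-force FI mining: all non-empty itemsets with frequency >= theta."""
    items = sorted(set().union(*dataset))
    result = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            f = frequency(candidate, dataset)
            if f >= theta:
                result[frozenset(candidate)] = f
    return result

# Hypothetical toy dataset D of four transactions.
D = [frozenset(t) for t in (
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b"},
)]

fis = frequent_itemsets(D, theta=0.5)
for itemset, f in sorted(fis.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), f)
```

Here `fis` describes only the sample `D`. Whether each reported itemset is also generated by the underlying distribution $π$ with probability at least $θ$ is the separate question the abstract raises.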
