Finding the True Frequent Itemsets
Matteo Riondato, Fabio Vandin

TL;DR
This paper introduces a statistical learning approach to accurately identify true frequent itemsets in transactional data, minimizing false positives by using VC-dimension theory to set an optimal frequency threshold.
Contribution
It proposes a novel algorithm that leverages VC-dimension bounds to reliably find true frequent itemsets with high probability, improving over traditional methods.
Findings
The method effectively reduces false positives in frequent itemset mining.
In experimental comparisons, it outperforms approaches based on standard bounds such as the Chernoff bound.
The approach guarantees high-probability correctness of identified itemsets.
Abstract
Frequent Itemsets (FIs) mining is a fundamental primitive in data mining. It requires identifying all itemsets appearing in at least a fraction $\theta$ of a transactional dataset $\mathcal{D}$. Often though, the ultimate goal of mining $\mathcal{D}$ is not an analysis of the dataset \emph{per se}, but the understanding of the underlying process that generated it. Specifically, in many applications $\mathcal{D}$ is a collection of samples obtained from an unknown probability distribution $\pi$ on transactions, and by extracting the FIs in $\mathcal{D}$ one attempts to infer itemsets that are frequently (i.e., with probability at least $\theta$) generated by $\pi$, which we call the True Frequent Itemsets (TFIs). Due to the inherently stochastic nature of the generative process, the set of FIs is only a rough approximation of the set of TFIs, as it often contains a huge number of…
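To make the definition of an FI concrete, here is a minimal Python sketch that computes the frequency of an itemset in a dataset and enumerates every itemset whose frequency reaches a given threshold. It is a brute-force illustration only, not the algorithm proposed in the paper: the function names `frequency` and `frequent_itemsets`, the toy dataset, and the threshold value are all assumptions made for this example, and the enumeration is exponential in the number of items.

```python
from itertools import combinations


def frequency(itemset, dataset):
    """Fraction of transactions in `dataset` that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in dataset) / len(dataset)


def frequent_itemsets(dataset, theta):
    """Brute force: map each non-empty itemset with frequency >= theta to its frequency.

    Assumes a non-empty dataset; only practical for a handful of distinct items.
    """
    dataset = [frozenset(t) for t in dataset]
    items = sorted(set().union(*dataset))
    result = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            f = frequency(candidate, dataset)
            if f >= theta:
                result[frozenset(candidate)] = f
    return result


# Hypothetical toy dataset of four transactions.
toy_dataset = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

fis = frequent_itemsets(toy_dataset, theta=0.5)
for itemset, f in sorted(fis.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(sorted(itemset), f)
```

The frequencies computed here are empirical quantities of the observed dataset. An itemset whose true generation probability lies just below the threshold can still clear it in a finite sample, which is the kind of false positive the abstract describes.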
