Consistent Subset Sampling

Konstantin Kutzkov; Rasmus Pagh

arXiv:1404.4693 · cs.DS · April 21, 2014 · 1 cites

TL;DR

This paper introduces an efficient method for consistently sampling size-k subsets from a collection of sets of bounded size b, improving on the Θ(b^k) time of the straightforward approach, and demonstrates its application to data mining tasks such as estimating frequent itemsets and bipartite cliques.

Contribution

It generalizes consistent sampling to size-k subsets with improved time and space complexity using a novel hash-based approach.

Findings

1. Achieves expected running time Θ(b^{⌈k/2⌉} log log b + pb^k), where b bounds the set size and p is the sampling probability.

2. Uses Θ(b^{⌈k/4⌉}) space.

3. Applies the sample to estimate frequent itemsets and bipartite cliques in data streams.

Abstract

Consistent sampling is a technique for specifying, in small space, a subset $S$ of a potentially large universe $U$ such that the elements in $S$ satisfy a suitably chosen sampling condition. Given a subset $I \subseteq U$ it should be possible to quickly compute $I \cap S$, i.e., the elements in $I$ satisfying the sampling condition. Consistent sampling has important applications in similarity estimation, and estimation of the number of distinct items in a data stream. In this paper we generalize consistent sampling to the setting where we are interested in sampling size-$k$ subsets occurring in some set in a collection of sets of bounded size $b$, where $k$ is a small integer. This can be done by applying standard consistent sampling to the $k$-subsets of each set, but that approach requires time $\Theta(b^{k})$. Using a carefully designed hash function,…
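To make the baseline in the abstract concrete, here is a minimal Python sketch of the straightforward approach only: it enumerates every $k$-subset of a set and keeps those whose hash falls below a threshold, which costs $\Theta(b^{k})$ time per set. It is not the paper's improved hash construction. The names `hash64` and `sample_k_subsets`, the `seed` parameter, and the choice of SHA-256 as the hash are all assumptions made for illustration.

```python
import hashlib
from itertools import combinations


def hash64(items, seed=0):
    """Deterministically map a tuple of items to a 64-bit integer.

    SHA-256 is an illustrative stand-in for a random hash function.
    """
    key = repr((seed, tuple(items))).encode("utf-8")
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")


def sample_k_subsets(s, k, p, seed=0):
    """Naive consistent sampling of the k-subsets of s.

    A k-subset is kept when its hash is below p * 2**64, so the
    decision depends only on the subset itself, not on the set it
    was found in. Enumerating all k-subsets takes Theta(b^k) time
    for a set of size b.
    """
    threshold = p * 2**64
    return {
        c
        for c in combinations(sorted(s), k)  # sorted -> canonical tuples
        if hash64(c, seed) < threshold
    }


# Two overlapping sets: any 2-subset of {3, 4, 5} occurs in both.
a = {1, 2, 3, 4, 5}
b = {3, 4, 5, 6, 7}
sample_a = sample_k_subsets(a, 2, 0.5)
sample_b = sample_k_subsets(b, 2, 0.5)
```

Because the sampling decision is a function of the subset alone, a pair such as `(3, 4)` is either in both `sample_a` and `sample_b` or in neither. That is the consistency property the abstract describes.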

Taxonomy

Topics: Data Mining Algorithms and Applications · Data Management and Algorithms · Algorithms and Data Compression