Consistent Subset Sampling
Konstantin Kutzkov, Rasmus Pagh

TL;DR
This paper introduces an efficient method for consistently sampling size-k subsets from large collections of bounded-size sets, with lower time and space cost than enumerating every k-subset, and demonstrates its application to data mining tasks such as frequent itemset and bipartite clique estimation.
Contribution
It generalizes consistent sampling to size-k subsets with improved time and space complexity using a novel hash-based approach.
Findings
Achieves expected time Θ(b^{⌈k/2⌉} log log b + pb^k) per set of size at most b, where p is the sampling probability
Uses Θ(b^{⌈k/4⌉}) space
Effectively estimates frequent itemsets and bipartite cliques in data streams
Abstract
Consistent sampling is a technique for specifying, in small space, a subset S of a potentially large universe U such that the elements in S satisfy a suitably chosen sampling condition. Given a subset I ⊆ U it should be possible to quickly compute I ∩ S, i.e., the elements in I satisfying the sampling condition. Consistent sampling has important applications in similarity estimation, and estimation of the number of distinct items in a data stream. In this paper we generalize consistent sampling to the setting where we are interested in sampling size-k subsets occurring in some set in a collection of sets of bounded size b, where k is a small integer. This can be done by applying standard consistent sampling to the k-subsets of each set, but that approach requires time Θ(b^k). Using a carefully designed hash function,…
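To make the baseline mentioned in the abstract concrete, the following is a minimal Python sketch of standard consistent sampling applied directly to the k-subsets of one set. It is not the paper's improved algorithm: it enumerates all k-subsets and therefore takes time proportional to b^k. The function names `unit_hash` and `sample_k_subsets`, the `seed` parameter, and the use of SHA-256 as the hash function are assumptions made for illustration only.

```python
import hashlib
from itertools import combinations


def unit_hash(subset, seed=0):
    """Map a k-subset to a value in [0, 1) that is identical on every call.

    SHA-256 is an illustrative stand-in, not the hash function from the paper.
    """
    key = repr((seed, tuple(sorted(subset)))).encode()
    digest = hashlib.sha256(key).digest()
    # Keep 53 bits so the quotient is an exact float strictly below 1.0.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def sample_k_subsets(items, k, p, seed=0):
    """Baseline: hash every k-subset of `items` and keep those with hash < p.

    Enumerates all k-subsets, so the cost grows as b^k for a set of size b.
    """
    universe = sorted(set(items))
    return {c for c in combinations(universe, k) if unit_hash(c, seed) < p}
```

Because the decision for a k-subset depends only on its hash value, a subset that occurs in two different sets is either sampled in both or in neither, which is the consistency property the abstract describes.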
Taxonomy
Topics: Data Mining Algorithms and Applications · Data Management and Algorithms · Algorithms and Data Compression
