C-MinHash: Practically Reducing Two Permutations to Just One
Xiaoyun Li, Ping Li

TL;DR
This paper simplifies C-MinHash by demonstrating that one permutation suffices instead of two, while keeping bias low and Jaccard similarity estimates accurate, as supported by extensive experiments.
Contribution
The paper shows that a single permutation suffices for C-MinHash, reducing the number of permutations from two to one while preserving estimation accuracy.
Findings
A single permutation achieves accuracy similar to that of two permutations.
The bias of the new estimator is negligibly small.
Experimental results confirm the effectiveness of using one permutation.
Abstract
Traditional minwise hashing (MinHash) requires applying K independent permutations to estimate the Jaccard similarity in massive binary (0/1) data, where K can be, e.g., 1024 or even larger, depending on the application. The recent work on C-MinHash (Li and Li, 2021) has shown, with rigorous proofs, that only two permutations are needed. An initial permutation is applied to break whatever structure might exist in the data, and a second permutation is re-used K times to produce K hashes, in a circulant shifting fashion. Li and Li (2021) proved that, perhaps surprisingly, even though the K hashes are correlated, the estimation variance is strictly smaller than the variance of traditional MinHash. Li and Li (2021) also demonstrated that the initial permutation in C-MinHash is indeed necessary. For the ease of theoretical analysis, they have used two…
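The circulant construction described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `c_minhash` and `estimate_jaccard`, the direction of the shift, and the 0-based indexing are all choices made here for the example. Passing the same permutation as both `sigma` and `pi` gives one plausible reading of the single-permutation variant that the paper studies.

```python
import numpy as np

def c_minhash(nonzero_idx, sigma, pi, K):
    """Return K hashes for a binary vector given by its nonzero indices.

    sigma: initial permutation (assumed convention: index i moves to sigma[i]).
    pi:    second permutation, re-used K times via circular shifts.
    """
    D = len(pi)
    # Step 1: apply the initial permutation to the nonzero positions.
    pos = sigma[np.asarray(nonzero_idx)]
    hashes = np.empty(K, dtype=np.int64)
    for k in range(1, K + 1):
        # Step 2: shift pi by k positions (assumed direction) and take
        # the minimum permuted value over the nonzero positions.
        hashes[k - 1] = pi[(pos - k) % D].min()
    return hashes

def estimate_jaccard(h1, h2):
    """Fraction of hash positions on which the two signatures collide."""
    return float(np.mean(h1 == h2))

# Hypothetical toy data: two overlapping random sets in a universe of size D.
rng = np.random.default_rng(0)
D, K = 1024, 256
pi = rng.permutation(D)
a = rng.choice(D, size=200, replace=False)
b = np.union1d(a[:120], rng.choice(D, size=80, replace=False))

true_j = len(np.intersect1d(a, b)) / len(np.union1d(a, b))

# Single-permutation variant: the same pi serves as initial and circulant permutation.
h_a = c_minhash(a, pi, pi, K)
h_b = c_minhash(b, pi, pi, K)
est = estimate_jaccard(h_a, h_b)
print(f"true Jaccard = {true_j:.3f}, estimate = {est:.3f}")
```

Storing one length-D permutation instead of K of them is where the memory saving comes from; the loop still performs K passes over the nonzero positions, so the sketch does not by itself reduce hashing time.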
Taxonomy
Topics: Algorithms and Data Compression · Bayesian Methods and Mixture Models · Machine Learning and Algorithms
