C-MinHash: Practically Reducing Two Permutations to Just One

Xiaoyun Li; Ping Li

arXiv:2109.04595 · cs.DS · September 13, 2021 · 1 citation

Open Access

TL;DR

This paper simplifies C-MinHash by showing that one permutation, rather than two, is enough. The resulting Jaccard similarity estimator keeps a very small bias and high accuracy, as validated by extensive experiments.

Contribution

The paper shows that a single permutation suffices for C-MinHash, removing the need for a second permutation while preserving estimation accuracy.

Findings

01 A single permutation achieves accuracy similar to that of two permutations.

02 The bias of the new estimator is negligibly small.

03 Experimental results confirm the effectiveness of using one permutation.

Abstract

Traditional minwise hashing (MinHash) requires applying $K$ independent permutations to estimate the Jaccard similarity in massive binary (0/1) data, where $K$ can be, for example, 1024 or even larger, depending on the application. The recent work on C-MinHash (Li and Li, 2021) has shown, with rigorous proofs, that only two permutations are needed. An initial permutation is applied to break whatever structure might exist in the data, and a second permutation is re-used $K$ times to produce $K$ hashes via circulant shifting. Li and Li (2021) proved that, perhaps surprisingly, even though the $K$ hashes are correlated, the estimation variance is strictly smaller than the variance of the traditional MinHash. They also demonstrated that the initial permutation in C-MinHash is indeed necessary. For the ease of theoretical analysis, they have used two…
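The hashing scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `c_minhash` and `estimate_jaccard`, the index convention for the initial permutation, and the shift direction are all assumptions made for illustration. Passing the same array as both `sigma` and `pi` gives a one-permutation variant in the spirit of the TL;DR above.

```python
import numpy as np

def c_minhash(nonzero_idx, sigma, pi, K):
    """Return K hashes for a binary vector given by its nonzero indices.

    sigma: initial permutation of range(D), used to break data structure.
    pi:    permutation of range(D) that is re-used K times via circulant shifts.
    (Hypothetical helper; conventions here are illustrative assumptions.)
    """
    D = len(pi)
    # Initial permutation: move each nonzero coordinate i to position sigma[i].
    permuted = sigma[np.asarray(nonzero_idx)]
    hashes = np.empty(K, dtype=np.int64)
    for k in range(1, K + 1):
        # Circulant shift by k units: pi_k[i] = pi[(i - k) mod D].
        # The k-th hash is the minimum of pi_k over the nonzero positions.
        hashes[k - 1] = pi[(permuted - k) % D].min()
    return hashes

def estimate_jaccard(h1, h2):
    """Fraction of hash collisions, the usual MinHash-style estimator."""
    return float(np.mean(h1 == h2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    D, K = 1024, 256
    sigma = rng.permutation(D)
    pi = rng.permutation(D)

    # Two deliberately structured sets with true Jaccard 200 / 400 = 0.5.
    s1 = np.arange(0, 300)
    s2 = np.arange(100, 400)

    two_perm = estimate_jaccard(c_minhash(s1, sigma, pi, K),
                                c_minhash(s2, sigma, pi, K))
    one_perm = estimate_jaccard(c_minhash(s1, pi, pi, K),
                                c_minhash(s2, pi, pi, K))
    print("two permutations:", two_perm)
    print("one permutation: ", one_perm)
```

Both estimates should land near the true value of 0.5, although the exact numbers depend on the random seed. Only the arrays `sigma` and `pi` need to be stored, instead of $K$ independent permutations as in traditional MinHash.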

Taxonomy

Topics: Algorithms and Data Compression · Bayesian Methods and Mixture Models · Machine Learning and Algorithms