C-OPH: Improving the Accuracy of One Permutation Hashing (OPH) with Circulant Permutations
Xiaoyun Li, Ping Li

TL;DR
This paper introduces C-OPH, a novel densification scheme for One Permutation Hashing that leverages circulant permutations to significantly improve estimation accuracy of Jaccard similarity in binary data.
Contribution
The paper proposes C-OPH, a new densification method for OPH based on circulant permutations, which reduces estimation variance and requires only a shorter permutation.
Findings
C-OPH achieves the smallest estimation variance among OPH methods.
C-OPH uses a shorter permutation length, improving efficiency.
The variance of Jaccard similarity estimation is strictly smaller with C-OPH.
Abstract
Minwise hashing (MinHash) is a classical method for efficiently estimating the Jaccard similarity in massive binary (0/1) data. To generate hash values for each data vector, the standard theory of MinHash requires independent permutations. Interestingly, the recent work on "circulant MinHash" (C-MinHash) has shown that merely two permutations are needed. The first permutation breaks the structure of the data and the second permutation is re-used repeatedly in a circulant manner. Surprisingly, the estimation variance of C-MinHash is proved to be strictly smaller than that of the original MinHash. More recent work further demonstrates that, in practice, only one permutation is needed. Note that C-MinHash is different from the well-known work on "One Permutation Hashing (OPH)" published in NIPS'12. OPH and its variants using different "densification" schemes are popular…
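To make the contrast in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It compares classical MinHash, which draws one independent permutation per hash value, with a circulant variant in which an initial permutation shuffles the data once and a second permutation is re-used with a different shift for each hash value. The function names, the shift direction, and the parameter choices (`D`, `K`, the random test vectors) are illustrative assumptions. The sketch does not cover OPH or its densification schemes, and it assumes non-empty input vectors.

```python
import numpy as np

def jaccard(v, w):
    """Exact Jaccard similarity of two binary (0/1) vectors."""
    v, w = v.astype(bool), w.astype(bool)
    union = np.logical_or(v, w).sum()
    return np.logical_and(v, w).sum() / union if union else 0.0

def minhash(v, perms):
    """Classical MinHash: one independent permutation per hash value.
    perms has shape (K, D); perms[k, i] is the permuted rank of index i."""
    nz = np.flatnonzero(v)                 # indices of the non-zero entries
    return perms[:, nz].min(axis=1)        # K hash values

def circulant_minhash(v, sigma, pi, K):
    """Circulant sketch (assumed shift convention): sigma shuffles the
    data once, then the single permutation pi is re-used K times,
    circularly shifted by k positions for the k-th hash value."""
    D = len(v)
    nz = sigma[np.flatnonzero(v)]          # non-zero positions after the initial shuffle
    shifts = np.arange(1, K + 1)[:, None]  # one shift per hash value
    return pi[(nz[None, :] + shifts) % D].min(axis=1)

def estimate(h_v, h_w):
    """Jaccard estimate: fraction of hash positions where the values collide."""
    return float(np.mean(h_v == h_w))

# Hypothetical test data: two sparse, overlapping binary vectors.
rng = np.random.default_rng(0)
D, K = 1024, 256
v = (rng.random(D) < 0.1).astype(np.uint8)
w = v.copy()
w[rng.choice(D, size=40, replace=False)] ^= 1   # flip 40 coordinates

perms = np.array([rng.permutation(D) for _ in range(K)])  # K permutations
sigma, pi = rng.permutation(D), rng.permutation(D)        # only two permutations

true_j = jaccard(v, w)
est_mh = estimate(minhash(v, perms), minhash(w, perms))
est_c = estimate(circulant_minhash(v, sigma, pi, K),
                 circulant_minhash(w, sigma, pi, K))
print(f"exact={true_j:.3f}  MinHash={est_mh:.3f}  circulant={est_c:.3f}")
```

The classical version stores K permutations of length D, while the circulant version stores only two, which is the storage saving the abstract alludes to. A single run like this cannot show the variance claim; that would require repeating the experiment over many random permutations.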
Taxonomy
Topics: Algorithms and Data Compression · Genomic variations and chromosomal abnormalities · Advanced Image and Video Retrieval Techniques
