In Defense of MinHash Over SimHash
Anshumali Shrivastava, Ping Li

TL;DR
This paper provides a theoretical and experimental comparison showing that MinHash generally outperforms SimHash for binary data, especially in high similarity regions, offering practical guidance for large-scale search applications.
Contribution
The paper offers a rigorous theoretical analysis demonstrating MinHash's superiority over SimHash for binary data, validated by extensive experiments, and clarifies when each method should be used.
Findings
MinHash outperforms SimHash in high-similarity regions.
MinHash is also better in low-similarity regions under practical data assumptions.
Theoretical bounds relate resemblance and cosine similarities, guiding LSH choice.
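The relationship between the two similarities can be checked numerically for binary data represented as sets. The sketch below assumes the bounds take the form S^2 <= R <= S / (2 - S), which is how the published paper states them as far as this summary can tell; the function names (`resemblance`, `cosine`, `bounds_hold`) are illustrative and do not come from the paper.

```python
import math
import random


def resemblance(a: set, b: set) -> float:
    """Resemblance (Jaccard) similarity R = |A ∩ B| / |A ∪ B|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def cosine(a: set, b: set) -> float:
    """Cosine similarity of binary vectors, S = |A ∩ B| / sqrt(|A| * |B|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def bounds_hold(a: set, b: set, eps: float = 1e-12) -> bool:
    """Check the assumed inequality S^2 <= R <= S / (2 - S).

    eps absorbs floating-point rounding when a bound is tight.
    """
    r, s = resemblance(a, b), cosine(a, b)
    return s * s - eps <= r <= s / (2.0 - s) + eps


if __name__ == "__main__":
    rng = random.Random(0)
    universe = range(200)
    # Random non-empty sets of varying sizes (an arbitrary test setup).
    pairs = [
        (set(rng.sample(universe, rng.randint(1, 80))),
         set(rng.sample(universe, rng.randint(1, 80))))
        for _ in range(1000)
    ]
    print(all(bounds_hold(a, b) for a, b in pairs))
```

The upper bound is tight when both sets have the same size, and the lower bound is tight when one set contains the other, so a small tolerance is used in the comparison.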
Abstract
MinHash and SimHash are the two widely adopted Locality Sensitive Hashing (LSH) algorithms for large-scale data processing applications. Deciding which LSH to use for a particular problem at hand is an important question, which has no clear answer in the existing literature. In this study, we provide a theoretical answer (validated by experiments) that MinHash virtually always outperforms SimHash when the data are binary, as common in practice such as search. The collision probability of MinHash is a function of resemblance similarity ($\mathcal{R}$), while the collision probability of SimHash is a function of cosine similarity ($\mathcal{S}$). To provide a common basis for comparison, we evaluate retrieval results in terms of $\mathcal{S}$ for both MinHash and SimHash. This evaluation is valid as we can prove that MinHash is a valid LSH with respect to $\mathcal{S}$, by using a…
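The two collision probabilities named in the abstract can be estimated empirically. This is a minimal sketch, not the authors' experimental setup: it assumes the standard forms Pr[MinHash collision] = R and, for SimHash implemented as sign random projections, Pr[collision] = 1 - arccos(S) / pi. The function names and the set sizes used here are illustrative choices.

```python
import math
import random


def minhash_collision_rate(a, b, universe_size, trials, rng):
    """Fraction of random permutations under which both sets share the same minimum."""
    ranks = list(range(universe_size))
    hits = 0
    for _ in range(trials):
        rng.shuffle(ranks)  # ranks[i] is the image of element i under a random permutation
        if min(ranks[i] for i in a) == min(ranks[i] for i in b):
            hits += 1
    return hits / trials


def simhash_collision_rate(a, b, universe_size, trials, rng):
    """Fraction of Gaussian random projections that give both binary vectors the same sign bit."""
    hits = 0
    for _ in range(trials):
        w = [rng.gauss(0.0, 1.0) for _ in range(universe_size)]
        # For a binary vector, the projection is the sum of weights at its nonzero positions.
        bit_a = sum(w[i] for i in a) >= 0
        bit_b = sum(w[i] for i in b) >= 0
        hits += bit_a == bit_b
    return hits / trials


if __name__ == "__main__":
    rng = random.Random(42)
    a, b = set(range(0, 40)), set(range(20, 60))

    r = len(a & b) / len(a | b)                     # resemblance R
    s = len(a & b) / math.sqrt(len(a) * len(b))     # cosine S

    print("MinHash: empirical", minhash_collision_rate(a, b, 100, 2000, rng),
          "theory", r)
    print("SimHash: empirical", simhash_collision_rate(a, b, 100, 2000, rng),
          "theory", 1 - math.acos(s) / math.pi)
```

Explicit permutations and dense Gaussian vectors keep the sketch easy to follow. Large-scale implementations typically replace them with hash functions, which this example does not attempt to model.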
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Face and Expression Recognition
