When Similarity Digest Meets Vector Management System: A Survey on Similarity Hash Function
Zhushou Tang, Lingyi Tang, Keying Tang, Ruoying Tang

TL;DR
This survey reviews well-known similarity hash functions as front ends for vector management systems. It finds MinHash and Nilsimsa directly usable in such pipelines, and highlights MinHash, a SimHash variant, and feature hashing as the best fit for large-scale similarity analysis.
Contribution
It systematically evaluates existing similarity hash functions and identifies the most suitable ones for vector management systems in large-scale applications.
Findings
MinHash and Nilsimsa are directly applicable in vector management pipelines.
MinHash, SimHash variants, and feature hashing perform best for large-scale similarity analysis.
The paper discusses performance and drawbacks of these functions.
Abstract
The booming field of vector management systems calls for a feasible similarity hash function as a front end for similarity analysis. In this paper, we systematically survey the well-known existing similarity hash functions to identify the suitable ones. We conclude that the similarity hash functions MinHash and Nilsimsa can be directly integrated into a similarity analysis pipeline built on a vector management system. We then give a brief empirical discussion of the performance and drawbacks of these functions, and highlight MinHash, the variant of SimHash, and feature hashing as the best choices for large-scale similarity analysis with a vector management system.
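To make the idea of a similarity hash as a front end concrete, here is a minimal MinHash sketch in Python. It is an illustration under stated assumptions, not the construction evaluated in the paper: the character 3-gram shingling, the seeded BLAKE2b hashing, the 128-slot signature, and all function names are choices made here for the example. The resulting fixed-length integer signature is the kind of vector a vector management system could store and compare.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of character k-grams of text (assumed tokenization)."""
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}


def seeded_hash(token, seed):
    """64-bit hash of token under a given seed (stands in for one permutation)."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_perm=128):
    """MinHash signature: for each seed, keep the minimum hash over the set."""
    if not items:
        raise ValueError("cannot sketch an empty set")
    return [min(seeded_hash(x, seed) for x in items) for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching slots, an estimate of the Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity of two sets, for comparison."""
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    doc_a = shingles("the quick brown fox jumps over the lazy dog")
    doc_b = shingles("the quick brown fox jumped over a lazy dog")
    sig_a, sig_b = minhash_signature(doc_a), minhash_signature(doc_b)
    print("estimated:", estimate_jaccard(sig_a, sig_b))
    print("exact:    ", exact_jaccard(doc_a, doc_b))
```

With more signature slots the estimate tightens around the exact Jaccard value, at the cost of a longer vector per item; 128 is only a common starting point, not a recommendation from the paper.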
Taxonomy
Topics: Advanced Malware Detection Techniques · Scientific Computing and Data Management · Machine Learning and Data Classification
