When Similarity Digest Meets Vector Management System: A Survey on Similarity Hash Function

Zhushou Tang; Lingyi Tang; Keying Tang; Ruoying Tang

arXiv:2109.08789·cs.DB·October 12, 2021

TL;DR

This survey reviews well-known similarity hash functions as front ends for vector management systems. It finds that MinHash and Nilsimsa can be used directly in such a pipeline, and it highlights MinHash, a SimHash variant, and feature hashing as the best fits for large-scale similarity analysis.

Contribution

It systematically evaluates existing similarity hash functions and identifies the most suitable ones for vector management systems in large-scale applications.

Findings

1. MinHash and Nilsimsa are directly applicable in vector management pipelines.

2. MinHash, SimHash variants, and feature hashing perform best for large-scale similarity analysis.

3. The paper discusses performance and drawbacks of these functions.

Abstract

The boom in vector management systems calls for a feasible similarity hash function as a front end to perform similarity analysis. In this paper, we make a systematic survey of the existing well-known similarity hash functions to tease out the satisfactory ones. We conclude that the similarity hash functions MinHash and Nilsimsa can be directly marshaled into the pipeline of similarity analysis using a vector management system. After that, we make a brief empirical discussion of the performance and drawbacks of these functions and highlight that MinHash, the variant of SimHash, and feature hashing are the best for vector management systems for large-scale similarity analysis.
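To make the "front end" idea concrete, here is a minimal MinHash sketch in Python. It is an illustration under stated assumptions, not the construction evaluated in the paper: the character 3-gram shingling, the seeded BLAKE2b hashing, the signature length of 64, and all function names are hypothetical choices made for this example. Each input is reduced to a fixed-length integer signature, which is the kind of object a vector store can hold, and the fraction of matching positions between two signatures estimates the Jaccard similarity of the underlying shingle sets.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of character k-grams of text (assumed tokenization)."""
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}


def seeded_hash(seed, token):
    """64-bit hash of a token; the seed selects one member of the hash family."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over the set."""
    return [min(seeded_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Example usage with two near-duplicate strings.
sig_a = minhash_signature(shingles("similarity hash function survey"))
sig_b = minhash_signature(shingles("similarity hash functions survey"))
print(estimated_jaccard(sig_a, sig_b))
```

Because the estimate is one minus the normalized Hamming distance between the two signatures, a system that supports Hamming-style comparison over fixed-length vectors could rank signatures without any MinHash-specific logic. A longer signature lowers the variance of the estimate at the cost of more storage per item.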

Taxonomy

Topics: Advanced Malware Detection Techniques · Scientific Computing and Data Management · Machine Learning and Data Classification