# Fast Similarity Sketching

**Authors:** Søren Dahlgaard, Mathias Bæk Tejs Langhede, Jakob Bæk Tejs Houen, Mikkel Thorup

arXiv: 1704.04370 · 2024-05-07

## TL;DR

This paper introduces a similarity sketch that preserves the Jaccard similarity between sets with strong, Chernoff-style concentration guarantees, while being faster to compute than the classic $t\times$MinHash, whose running time is $O(t\cdot |A|)$.

## Contribution

The paper proposes a similarity sketch that is faster to compute than MinHash while retaining strong concentration bounds for Jaccard similarity estimation.

## Key findings

- Achieves accuracy similar to MinHash at a lower sketching cost than MinHash's $O(t\cdot |A|)$ running time
- Provides theoretical concentration guarantees (Chernoff-style bounds) for the similarity estimate
- Demonstrates improved performance in large-scale similarity search

## Abstract

We consider the $\textit{Similarity Sketching}$ problem: Given a universe $[u] = \{0,\ldots, u-1\}$ we want a random function $S$ mapping subsets $A\subseteq [u]$ into vectors $S(A)$ of size $t$, such that the Jaccard similarity $J(A,B) = |A\cap B|/|A\cup B|$ between sets $A$ and $B$ is preserved. More precisely, define $X_i = [S(A)[i] = S(B)[i]]$ and $X = \sum_{i\in [t]} X_i$. We want $E[X_i]=J(A,B)$, and we want $X$ to be strongly concentrated around $E[X] = t \cdot J(A,B)$ (i.e. Chernoff-style bounds). This is a fundamental problem which has found numerous applications in data mining, large-scale classification, computer vision, similarity search, etc. via the classic MinHash algorithm. The vectors $S(A)$ are also called $\textit{sketches}$. Strong concentration is critical, for often we want to sketch many sets $B_1,\ldots,B_n$ so that later, for a query set $A$, we can find (one of) the most similar $B_i$. It is then critical that no $B_i$ looks much more similar to $A$ due to errors in the sketch. The seminal $t\times\textit{MinHash}$ algorithm uses $t$ random hash functions $h_1,\ldots, h_t$, and stores $\left ( \min_{a\in A} h_1(a),\ldots, \min_{a\in A} h_t(a) \right )$ as the sketch of $A$. The main drawback of MinHash is, however, its $O(t\cdot |A|)$ running time, and finding a sketch with similar properties and faster running time has been the subject of several papers. (continued...)
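To make the definitions above concrete, here is a minimal Python sketch of the $t\times$MinHash baseline and the estimator $X/t$. It illustrates only the classic baseline, not the paper's new sketch. The names `make_hashes`, `minhash_sketch`, `estimate_jaccard`, and `PRIME` are hypothetical, and the linear hash family $h_i(x) = (a_i x + b_i) \bmod p$ is an assumed stand-in for truly random hash functions, so $E[X_i] = J(A,B)$ holds only approximately here.

```python
import random

# Mersenne prime used as the modulus of the illustrative hash family.
# This choice is an assumption for the example, not taken from the paper.
PRIME = (1 << 61) - 1


def make_hashes(t, seed=0):
    """Draw t pairs (a_i, b_i) defining h_i(x) = (a_i * x + b_i) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(t)]


def minhash_sketch(A, hashes):
    """t x MinHash: S(A)[i] = min over a in A of h_i(a).

    Every one of the t hash functions is evaluated on every element,
    which is the O(t * |A|) running time mentioned in the abstract.
    """
    if not A:
        raise ValueError("cannot sketch an empty set")
    return [min((a * x + b) % PRIME for x in A) for (a, b) in hashes]


def estimate_jaccard(sketch_a, sketch_b):
    """Return X / t, where X counts the coordinates on which the sketches agree."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same length t")
    matches = sum(x == y for x, y in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


def jaccard(A, B):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B|."""
    return len(A & B) / len(A | B)


if __name__ == "__main__":
    rng = random.Random(42)
    A = set(rng.sample(range(100_000), 1000))
    B = set(rng.sample(sorted(A), 500)) | set(rng.sample(range(100_000, 200_000), 500))
    hashes = make_hashes(t=256, seed=1)
    estimate = estimate_jaccard(minhash_sketch(A, hashes), minhash_sketch(B, hashes))
    print("exact:", jaccard(A, B), "estimate:", estimate)
```

Both sets must be sketched with the same `hashes`, since coordinate $i$ of $S(A)$ and $S(B)$ is only comparable when it comes from the same $h_i$. Increasing `t` tightens the estimate but increases the sketching time linearly, which is the trade-off the paper targets.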

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1704.04370/full.md

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/1704.04370/full.md

## References

23 references — full list in the complete paper: https://tomesphere.com/paper/1704.04370/full.md

---
Source: https://tomesphere.com/paper/1704.04370