Algorithms for Similarity Search and Pseudorandomness

Tobias Christiani

arXiv:1906.09430·cs.DS·June 25, 2019

TL;DR

This paper advances algorithms for approximate near neighbor search and pseudorandom number generation, providing new frameworks, bounds, and practical algorithms with improved efficiency and theoretical guarantees.

Contribution

It introduces new frameworks and bounds for ANN search using locality-sensitive hashing and develops high-quality pseudorandom number generators with optimal or near-optimal resource usage.

Findings

01

Reduced evaluations and complexity in ANN algorithms.

02

Established tight bounds for space-time tradeoffs in ANN.

03

Developed high-quality pseudorandom number generators with constant time.

Abstract

We study the problem of approximate near neighbor (ANN) search and show the following results: - An improved framework for solving the ANN problem using locality-sensitive hashing, reducing the number of evaluations of locality-sensitive hash functions and the word-RAM complexity compared to the standard framework. - A framework for solving the ANN problem with space-time tradeoffs as well as tight upper and lower bounds for the space-time tradeoff of framework solutions to the ANN problem under cosine similarity. - A novel approach to solving the ANN problem on sets along with a matching lower bound, improving the state of the art. - A self-tuning version of the algorithm is shown through experiments to outperform existing similarity join algorithms. - Tight lower bounds for asymmetric locality-sensitive hashing which has applications to the approximate furthest neighbor…

Tables

Table 1: Overview of data-independent locality-sensitive hashing (LSH) and filtering (LSF) results

Reference	Setting	$ρ_{q}$	$ρ_{u}$
LSH [91, 86], LSF [24]	$(X, dist)$ , $(X, sim)$	$\frac{\log (1 / p)}{\log (1 / q)}$
Theorem 3.1	$(X, dist)$ , $(X, sim)$	$\frac{\log (p_{q} / p_{1})}{\log (p_{q} / p_{2})}$	$\frac{\log (p_{u} / p_{1})}{\log (p_{q} / p_{2})}$
Cross-poly. LSH [13]	$(α, β)$ -sim., $(𝕊^{d}, ⟨ \cdot, \cdot ⟩)$	$\frac{1 - α}{1 + α} / \frac{1 - β}{1 + β}$
Spherical cap LSF [100]	$(α, o_{d} (1))$ -sim., $(𝕊^{d}, ⟨ \cdot, \cdot ⟩)$	$\frac{{(1 - α^{1 + λ})}^{2}}{1 - α^{2}}$	$\frac{{(α^{λ} - α)}^{2}}{1 - α^{2}}$
Theorem 3.2	$(α, β)$ -sim., $(𝕊^{d}, ⟨ \cdot, \cdot ⟩)$	$\frac{{(1 - α^{1 + λ})}^{2}}{1 - α^{2}} / \frac{{(1 - α^{λ} β)}^{2}}{1 - β^{2}}$	$\frac{{(α^{λ} - α)}^{2}}{1 - α^{2}} / \frac{{(1 - α^{λ} β)}^{2}}{1 - β^{2}}$
Ball-carving LSH [11]	$(r, c r)$ -nn. in $ℓ_{2}^{d}$	$1 / c^{2}$
Ball-search LSH* [95]		$\frac{c^{2} {(1 + λ)}^{2}}{{(c^{2} + λ)}^{2} - c^{2} (1 + λ^{2}) / 2 - λ^{2}}$	$\frac{c^{2} {(1 - λ)}^{2}}{{(c^{2} + λ)}^{2} - c^{2} (1 + λ^{2}) / 2 - λ^{2}}$
Theorem 3.3		$\frac{c^{2} {(1 + λ)}^{2}}{{(c^{2} + λ)}^{2}}$	$\frac{c^{2} {(1 - λ)}^{2}}{{(c^{2} + λ)}^{2}}$
Lower bound [121]	LSH in $ℓ_{2}^{d}$	$\geq 1 / c^{2}$
Theorem 3.4	LSF in $ℓ_{2}^{d}$	$\geq 1 / c^{2}$
Lower bound [114, 19]	LSH in $(𝕊^{d}, ⟨ \cdot, \cdot ⟩)$	$\geq \frac{1 - α}{1 + α}$
Theorem 3.5, [16]	LSF in $(𝕊^{d}, ⟨ \cdot, \cdot ⟩)$	$\geq \frac{{(1 - α^{1 + λ})}^{2}}{1 - α^{2}}$	$\geq \frac{{(α^{λ} - α)}^{2}}{1 - α^{2}}$

Table 2: Overview of $\rho$-values for similarity search with Hamming vectors of equal weight $t$.

	Hamming $r_{1} < r_{2}$	Braun-Blanquet $b_{1} > b_{2}$	Jaccard $j_{1} > j_{2}$
Bit-sampling [91]	$r_{1} / r_{2}$	$\frac{1 - b_{1}}{1 - b_{2}}$	$\frac{1 - j_{1}}{1 + j_{1}} / \frac{1 - j_{2}}{1 + j_{2}}$
MinHash [30]	$\log \frac{1 - r_{1}}{1 + r_{1}} / \log \frac{1 - r_{2}}{1 + r_{2}}$	$\log \frac{b_{1}}{2 - b_{1}} / \log \frac{b_{2}}{2 - b_{2}}$	$\log (j_{1}) / \log (j_{2})$
Cross-poly. [13]	$\frac{r_{1}}{r_{2}} \frac{1 - r_{2} / 2}{1 - r_{1} / 2}$	$\frac{1 - b_{1}}{1 + b_{1}} / \frac{1 - b_{2}}{1 + b_{2}}$	$\frac{1 - j_{1}}{1 + 3 j_{1}} / \frac{1 - j_{2}}{1 + 3 j_{2}}$
Data-dep. [17]	$\frac{r_{1}}{r_{2}} \frac{1}{2 - r_{1} / r_{2}}$	$\frac{1 - b_{1}}{1 + b_{1} - 2 b_{2}}$	$\frac{(1 - j_{1}) (1 + j_{2})}{1 - j_{1} j_{2} + 3 (j_{1} - j_{2})}$
Theorem 4.1	$\log (1 - r_{1}) / \log (1 - r_{2})$	$\log (b_{1}) / \log (b_{2})$	$\log \frac{2 j_{1}}{1 + j_{1}} / \log \frac{2 j_{2}}{1 + j_{2}}$

Table 3: Dataset size, average set size, and average number of sets that a token is contained in.

Dataset	# sets / $10^{6}$	avg. set size	sets / tokens
AOL	$7.35$	$3.8$	$18.9$
BMS-POS	$0.32$	$9.3$	$1797.9$
DBLP	$0.10$	$82.7$	$1204.4$
ENRON	$0.25$	$135.3$	$29.8$
FLICKR	$1.14$	$10.8$	$16.3$
LIVEJ	$0.30$	$37.5$	$15.0$
KOSARAK	$0.59$	$12.2$	$176.3$
NETFLIX	$0.48$	$209.8$	$5654.4$
ORKUT	$2.68$	$122.2$	$37.5$
SPOTIFY	$0.36$	$15.3$	$7.4$
UNIFORM	$0.10$	$10.0$	$4783.7$
TOKENS10K	$0.03$	$339.4$	$10000.0$
TOKENS15K	$0.04$	$337.5$	$15000.0$
TOKENS20K	$0.06$	$335.7$	$20000.0$

Table 4: Join time in seconds for AllPairs (ALL) and CPSJoin (CP) with recall $\geq 90\%$.

Table 5: Parameters of the CPSJoin algorithm, their setting during parameter experiments, and their setting for the final join time experiments

Parameter	Description	Test	Final
limit	Brute force limit	$100$	$250$
$ℓ$	Sketch word length	$4$	$8$
$t$	Size of MinHash set	$128$	$128$
$ε$	Brute force aggressiveness	$0.0$	$0.1$
$δ$	Sketch false negative prob.	$0.1$	$0.05$

Table 6: Number of pre-candidates, candidates and results for ALL and CP with at least $90\%$ recall.

Dataset	Quantity	Threshold $0.5$, ALL	Threshold $0.5$, CP	Threshold $0.7$, ALL	Threshold $0.7$, CP
AOL	Pre-candidates	8.5E+09	7.4E+09	6.2E+08	2.9E+09
AOL	Candidates	8.5E+09	1.4E+09	6.2E+08	3.1E+07
AOL	Results	1.3E+08	1.2E+08	1.6E+06	1.5E+06
BMS-POS	Pre-candidates	2.0E+09	9.2E+08	2.7E+08	3.3E+08
BMS-POS	Candidates	1.8E+09	1.7E+08	2.6E+08	4.9E+06
BMS-POS	Results	1.1E+07	1.0E+07	2.0E+05	1.8E+05
DBLP	Pre-candidates	6.6E+09	4.6E+08	1.2E+09	1.3E+08
DBLP	Candidates	1.9E+09	4.6E+07	7.2E+08	4.3E+05
DBLP	Results	1.7E+06	1.6E+06	9.1E+03	8.5E+03
ENRON	Pre-candidates	2.8E+09	3.7E+08	2.0E+08	1.5E+08
ENRON	Candidates	1.8E+09	6.7E+07	1.3E+08	2.1E+07
ENRON	Results	3.1E+06	2.9E+06	1.2E+06	1.2E+06
FLICKR	Pre-candidates	5.7E+08	2.1E+09	9.3E+07	9.0E+08
FLICKR	Candidates	4.1E+08	1.1E+09	6.3E+07	3.8E+08
FLICKR	Results	6.6E+07	6.1E+07	2.5E+07	2.3E+07
KOSARAK	Pre-candidates	2.6E+09	4.7E+09	7.4E+07	4.2E+08
KOSARAK	Candidates	2.5E+09	2.1E+09	6.8E+07	2.1E+07
KOSARAK	Results	2.3E+08	2.1E+08	4.4E+05	4.1E+05
LIVEJ	Pre-candidates	9.0E+09	2.8E+09	5.8E+08	1.2E+09
LIVEJ	Candidates	8.3E+09	3.6E+08	5.6E+08	1.8E+07
LIVEJ	Results	2.4E+07	2.2E+07	8.1E+05	7.6E+05
NETFLIX	Pre-candidates	8.6E+10	1.3E+09	1.0E+10	4.3E+08
NETFLIX	Candidates	1.3E+10	3.1E+07	3.4E+09	6.4E+05
NETFLIX	Results	1.0E+06	9.5E+05	2.4E+04	2.2E+04
ORKUT	Pre-candidates	5.1E+09	1.1E+09	3.0E+08	7.2E+08
ORKUT	Candidates	3.9E+09	1.3E+06	2.6E+08	8.1E+04
ORKUT	Results	9.0E+04	8.4E+04	5.6E+03	5.3E+03
SPOTIFY	Pre-candidates	5.0E+06	1.2E+08	4.7E+05	8.5E+07
SPOTIFY	Candidates	4.8E+06	3.1E+05	4.6E+05	2.7E+03
SPOTIFY	Results	2.0E+04	1.8E+04	2.0E+02	1.9E+02
TOKENS10K	Pre-candidates	1.5E+10	1.7E+08	8.1E+09	4.9E+07
TOKENS10K	Candidates	4.1E+08	5.7E+06	4.1E+08	1.9E+06
TOKENS10K	Results	1.3E+05	1.3E+05	7.4E+04	6.9E+04
TOKENS15K	Pre-candidates	3.6E+10	3.0E+08	1.9E+10	8.1E+07
TOKENS15K	Candidates	9.6E+08	7.2E+06	9.6E+08	1.9E+06
TOKENS15K	Results	1.4E+05	1.3E+05	7.5E+04	6.9E+04
TOKENS20K	Pre-candidates	6.4E+10	4.4E+08	3.4E+10	1.0E+08
TOKENS20K	Candidates	1.7E+09	8.8E+06	1.7E+09	1.9E+06
TOKENS20K	Results	1.4E+05	1.4E+05	7.9E+04	7.4E+04
UNIFORM005	Pre-candidates	2.5E+09	3.7E+08	6.5E+08	1.3E+08
UNIFORM005	Candidates	2.0E+09	9.5E+06	6.1E+08	3.9E+04
UNIFORM005	Results	2.6E+05	2.4E+05	1.4E+03	1.3E+03

Table 7: Overview of generators that produce a $k$-independent sequence over a finite field $\mathbb{F}$. We use $\varepsilon$ to denote an arbitrary positive constant and $\omega$ and $\omega_{k}$ to denote, respectively, a primitive element and a $k$-th root of unity of $\mathbb{F}$. The unit for space is the number of elements of $\mathbb{F}$ that need to be stored, i.e., a factor $\log_{2}|\mathbb{F}|$ from the number of bits. Probabilistic constructions rely on random generation of objects for which no explicit construction is known, and may fail with some probability.

Construction	Time	Space	Comment
Polynomials [92, 42]	$O (k)$	$O (k)$
Multipoint [164]	$O (\log^{2} k \log \log k)$	$O (k \log k)$
Multipoint [27]	$O (\log k \log \log k)$	$O (k)$	Requires $ω$ .
Siegel [151]	$O (1)$	$O ({| 𝔽 |}^{ε})$	Probabilistic.
Theorem 8.1	$O (1)$	$k poly \log k$	Explicit.
Theorem 8.2	$O (1)$	$O (k \log^{2 + ε} k)$	Probabilistic.
Theorem 8.3	$O (\log k)$	$O (k)$	Requires $ω_{k}$ , FFT.

Table 8: Generation time in nanoseconds per 64-bit value using Horner’s scheme, Gao-Mateer’s FFT and an implementation of our constant-time generator

$k$	Horner (ns)	FFT (ns)	$c$	$m$	$d$	$δ$	FFT + Expander (ns)
$2^{5}$	177	243	64	$2^{13}$	8	$10^{- 7}$	15
$2^{6}$	361	294	64	$2^{14}$	8	$10^{- 8}$	16
$2^{7}$	730	338	64	$2^{15}$	8	$10^{- 9}$	19
$2^{8}$	1470	375	64	$2^{16}$	8	$10^{- 10}$	23
$2^{9}$	2950	412	64	$2^{17}$	8	$10^{- 11}$	24
$2^{10}$	5902	449	64	$2^{18}$	8	$10^{- 12}$	25
$2^{11}$	11808	487	32	$2^{18}$	8	$10^{- 12}$	35
$2^{12}$	23627	523	64	$2^{18}$	16	$10^{- 29}$	43
$2^{13}$	47183	561	32	$2^{18}$	16	$10^{- 29}$	54
$2^{14}$	94429	599	64	$2^{22}$	8	$10^{- 15}$	68
$2^{15}$	188258	638	64	$2^{23}$	8	$10^{- 16}$	69
$2^{16}$	376143	678	64	$2^{24}$	8	$10^{- 17}$	77
$2^{17}$	751781	719	64	$2^{25}$	8	$10^{- 18}$	85
$2^{18}$	1505016	765	64	$2^{26}$	8	$10^{- 19}$	93
$2^{19}$	3015969	808	32	$2^{26}$	8	$10^{- 19}$	110
$2^{20}$	6082313	864	64	$2^{26}$	16	$10^{- 46}$	175

Table 9: Space-time tradeoffs for $k$-independent hash functions

Reference	Space	Time
Polynomials [92, 42]	$k$	$O (k)$
Preprocessed polynomials [96]	$k^{1 + ε} {(\log u)}^{1 + o (1)}$	$(poly \log k) {(\log u)}^{1 + o (1)}$
Expanders [84] + [151]	$k^{1 + ε} d^{2}$	$d = O {(\log (u) \log (k))}^{1 + 1 / ε}$
Expander powering [151]	$k^{(1 - ε) t} u^{ε} + u^{1 / t}$	$O {(1 / ε)}^{t}$
Double tabulation [158]	$k^{5 t} + u^{1 / t}$	$O (t)$
Recursive tabulation [158]	$poly k + u^{1 / t}$	$O (t^{\log t})$
Corollary 9.1	$k u^{1 / t} t^{3}$	$O (t^{2} + t^{3} \log (k) / \log (u))$
Corollary 9.2	$k^{2} u^{1 / t} t^{2}$	$O (t \log t + t^{2} \log (k) / \log (u))$
Corollary 9.3	$k u^{1 / t} t$	$O (t \log t)$
Cell probe lower bound [151]	$k {(u / k)}^{1 / t}$	$t < k$ probes
Cell probe upper bound [151]	$k {(u / k)}^{1 / t} t$	$O (t)$ probes

Equations

$$\Pr_{h\sim\mathcal{H}_{H}}[h(x)=h(y)] = 1-\Pr_{h\sim\mathcal{H}_{H}}[h(x)\neq h(y)] = 1-\operatorname{dist}_{H}(x,y)/d.$$

$$\rho=\frac{\log(1/(1-r/d))}{\log(1/(1-cr/d))}\leq 1/c$$

$$h(x)=\operatorname*{arg\,min}_{i\in x} f(i).$$

$$\Pr_{h\sim\mathcal{H}_{J}}[h(x)=h(y)] = \frac{|x\cap y|}{|x\cup y|} = \operatorname{sim}_{J}(x,y).$$

$$h(x)=\operatorname{sign}(\langle x,z\rangle).$$

$$\Pr_{h\sim\mathcal{H}_{C}}[h(x)=h(y)] = 1-\theta(x,y)/\pi.$$

$$y_{i}=\begin{cases}x_{i} & \text{with probability } \frac{1+\alpha}{2},\\ -x_{i} & \text{with probability } \frac{1-\alpha}{2}.\end{cases}$$

$$\rho=\frac{\log(1/p_{1})}{\log(1/p_{2})}\geq\max\left(\frac{\log(1/\alpha)}{\log(1/\beta)},\frac{1-\alpha}{1+\alpha-2\beta}\right)-o_{d}(1).$$

$$\rho=\frac{1-\alpha}{1+\alpha}\Big/\frac{1-\beta}{1+\beta}+o_{d}(1).$$

$$c^{2}\rho_{q}+(c^{2}-1)\rho_{u}=2c^{2}-1.$$

$$\Pr_{f\sim\mathcal{F}}[f(x_{1})=y_{1}\land f(x_{2})=y_{2}\land\dots\land f(x_{k})=y_{k}] = |R|^{-k}.$$

$$h^{\prime}(x)=(h_{1}(x),\dots,h_{k}(x))$$

$$g_{l}(x)=(h_{1,f_{1}(l)}(x),\dots,h_{k,f_{k}(l)}(x)).$$

$$\begin{aligned}E[Z^{2}] &= (L^{2}-L)\,E[Z_{l}Z_{l^{\prime}}]+\mu\\ &\leq L^{2}\,E\Big[\prod_{i=1}^{k}Y_{l,i}Y_{l^{\prime},i}\Big]+\mu\\ &= L^{2}\,(E[Y_{l,i}Y_{l^{\prime},i}])^{k}+\mu.\end{aligned}$$

$$\Pr[\exists l\in[L]:g_{l}(x)=g_{l}(y)]\geq\frac{1+\varepsilon\mu}{1+(1+\varepsilon)\mu}.$$

$$\Pr[s(x)_{i}=s(y)_{i}] = (1+\Pr[h_{i}(x)=h_{i}(y)])/2.$$

$$\Pr[Z\leq 0]=\Pr[-(Z-\mu)+s\geq\mu+s]\leq\Pr[(-(Z-\mu)+s)^{2}\geq(\mu+s)^{2}].$$

$$\Pr[(-(Z-\mu)+s)^{2}\geq(\mu+s)^{2}] \leq (\sigma^{2}+s^{2})/(\mu+s)^{2}$$

$$(\sigma^{2}+s^{2})/(\mu+s)^{2}=(s\mu+s^{2})/(\mu+s)^{2}=\sigma^{2}/(\mu^{2}+\sigma^{2}).$$

$$\varphi=(1-(1-p_{1}^{k_{1}})^{m_{1}})^{t}\,(1-(1-p_{1}^{k_{2}})^{m_{2}}).$$

$k$, $k_{1}$, $k_{2}$, $m_{1}$, $m_{2}$, $\eta$

$$\begin{aligned}1-(1-p_{1}^{k_{1}})^{m_{1}} &\geq 1-(1-p_{1}^{k_{1}}m_{1}+(p_{1}^{k_{1}}m_{1})^{2}/2)\\ &\geq p_{1}^{k_{1}}m_{1}\,(1-p_{1}^{k_{1}}(1/(tp_{1}^{k_{1}})+1)/2)\\ &\geq p_{1}^{k_{1}}m_{1}\,(1-1/t).\end{aligned}$$

$$\varphi\geq(p_{1}^{k_{1}}m_{1})^{t}/(4e)\geq(1/t)^{t}/(4e).$$

$$L=\eta\,m_{1}^{t}m_{2}\leq(4e/(p_{1}^{k_{1}}m_{1})^{t}+1)\,m_{1}^{t}\,(1/p_{1}^{k_{2}}+1)\leq 16e\,(1/p_{1}^{k}).$$

Full text

Algorithms for Similarity Search and Pseudorandomness

Tobias Christiani

Supervisor: Rasmus Pagh

May 2018

Abstract

We study the problem of approximate near neighbor (ANN) search and show the following results:

•

An improved framework for solving the ANN problem using locality-sensitive hashing, reducing the number of evaluations of locality-sensitive hash functions and the word-RAM complexity compared to the standard framework.

•

A framework for solving the ANN problem with space-time tradeoffs as well as tight upper and lower bounds for the space-time tradeoff of framework solutions to the ANN problem under cosine similarity.

•

A novel approach to solving the ANN problem on sets along with a matching lower bound, improving the state of the art. A self-tuning version of the algorithm is shown through experiments to outperform existing similarity join algorithms.

•

Tight lower bounds for asymmetric locality-sensitive hashing which has applications to the approximate furthest neighbor problem, orthogonal vector search, and annulus queries.

•

A proof of the optimality of a well-known Boolean locality-sensitive hashing scheme.

We study the problem of efficient algorithms for producing high-quality pseudorandom numbers and obtain the following results:

•

A deterministic algorithm for generating pseudorandom numbers of arbitrarily high quality in constant time using near-optimal space.

•

A randomized construction of a family of hash functions that outputs pseudorandom numbers of arbitrarily high quality with space usage and running time nearly matching known cell-probe lower bounds.

Resumé

We study a fundamental problem in approximate search, the approximate near neighbor (ANN) problem, and show the following results:

•

An improved general solution to the ANN problem that reduces the number of evaluations of locality-sensitive hash functions.

•

A general solution to the ANN problem that allows for space-time tradeoffs, as well as tight upper and lower bounds for the ANN problem with space-time tradeoffs under cosine similarity.

•

A new approach to solving the ANN problem on sets along with a matching lower bound. An adaptive version of the algorithm for approximate similarity join is shown through experiments to be competitive.

•

Tight lower bounds for asymmetric locality-sensitive hashing, which has applications to approximate furthest neighbor search, orthogonal vector search, and annulus queries.

•

A proof of optimality for a well-known family of Boolean locality-sensitive hash functions.

We study the problem of finding efficient algorithms for producing high-quality pseudorandomness and obtain the following results:

•

A deterministic algorithm for generating pseudorandom numbers of arbitrarily high quality in constant time and with near-optimal space usage.

•

A randomized construction of a family of hash functions that map to pseudorandom numbers of arbitrarily high quality, with evaluation time and space usage close to the lower bound.

Acknowledgements.

I am very grateful to Rasmus Pagh for advising me for the past almost five years. Most of my favorite results have come through my collaboration with Rasmus where his overview and technical strength complement my intuition. I always feel that Rasmus is ready to listen to my ideas, and to encourage me and guide my research in the right direction. I cannot imagine a better advisor. I would like to thank my colleagues in the 4B corridor at ITU for creating a friendly academic environment where I enjoy spending my time. In particular I would like to thank my current and former office mates Johan Sivertsen, Matteo Dusefante, Thomas Ahle, Martin Aumüller, Morten Stöckel, and Ninh Pham. Thore Husfeldt also deserves special thanks for his stimulating lunch discussions on issues ranging from superintelligence to immigration policy. I would like to thank Greg Valiant for hosting my stay at Stanford in the fall of 2015 and Michael Mitzenmacher for hosting my stay at Harvard in the fall of 2017. Specifically I would like to thank Josh Alman, Michael Kim, Zhao Song, and Aviad Rubinstein for making my time spent abroad a pleasant and social experience. Finally I want to thank my parents Tom and Kirsten for supporting me, my brother Anders for tolerating living with a somewhat disorganized PhD student, and my girlfriend Elisabeth for listening to me and helping me through difficult times.

All that is gold does not glitter,

Not all those who wander are lost;

The old that is strong does not wither,

Deep roots are not reached by the frost.

From the ashes, a fire shall be woken,

A light from the shadows shall spring;

Renewed shall be blade that was broken,

The crownless again shall be king.

J. R. R. Tolkien (1892–1973)

1 Introduction
1 Part I: Similarity search
2 Part II: Pseudorandom hashing and generators
3 Overview and contributions
4 Conclusion and open problems
I Similarity search
2 Fast locality-sensitive hashing frameworks
5 Introduction
6 Preliminaries
7 Frameworks
8 Reducing the word-RAM complexity
9 The number of hash functions in corner cases
10 Conclusion and open problems
11 Appendix: Inequalities
12 Appendix: Analysis of the Andoni-Indyk framework
3 Space-time tradeoffs for similarity search
13 Introduction
14 A framework with space-time tradeoffs
15 Gaussian filters on the unit sphere
16 Space-time tradeoffs under kernel similarity
17 Lower bounds
18 Open problems
19 Appendix: Framework
20 Appendix: Gaussian filters
21 Appendix: Approximate feature maps, characteristic functions, and Bochner’s Theorem
22 Appendix: Proof of tradeoff lower bound
23 Appendix: Comparison to Kapralov
24 Appendix: Details about dynamization and the model of computation
25 Addendum: An improved framework
4 Set similarity search beyond MinHash
26 Introduction
27 Preliminaries
28 Upper Bound
29 Lower bound
30 Equivalent set similarity problems
31 Conclusion and open problems
32 Appendix: Details behind the lower bound
33 Appendix: Comparisons
5 Adaptive similarity join
34 Introduction
35 Preliminaries
36 Overview of approach
37 Chosen Path Similarity Join
38 Implementation
39 Experiments
40 Conclusion
6 Lower bounds for asymmetric locality-sensitive hashing
41 Preliminaries
42 Lower bounding the collision probability
43 Upper bounding the collision probability
44 Extension to negative correlation
45 Conclusion and open problems
7 Optimal Boolean locality-sensitive hashing
46 Introduction
47 Related work
48 Preliminaries
49 Bit-sampling is optimal
50 Open problems
II Pseudorandomness
8 Generating $k$ -independent random variables in constant time
51 Introduction
52 Preliminaries
53 Explicit constant time generators
54 Constant time generators with optimal seed length
55 Faster multipoint evaluation for $k$ -generators
56 Finite field arithmetic on the word RAM
57 A load balancing application
58 Experiments
9 Near-optimal $k$ -independent hashing
59 Introduction
60 Background and overview
61 Our constructions
62 Conclusion
63 Appendix: Details behind the prefix technique

Chapter 1 Introduction

1 Part I: Similarity search

Similarity search in large collections of high-dimensional objects is a problem that is well-motivated by numerous applications. Consider for example the representation of an image by a $d$ -dimensional feature vector $x$ , where each entry $x_{i}$ denotes the fraction of pixels of color $i$ in the image. Given a collection of images $P$ and a query image $q$ , we could for example be interested in finding the nearest neighbor of $q$ : the image $x\in P$ such that the distance $\operatorname{dist}(q,x)$ is minimized, for some appropriate choice of distance function. Applications of near neighbor search include:

•

Classification: Given a collection $P$ of labelled objects and an unlabelled object $q$ , classify $q$ according to the label of its nearest neighbor in $P$ .

•

Recommender systems: Find similar users, movies, songs, books etc. to be used for recommendation.

•

Duplicate detection: Remove near-identical objects from a collection, for example duplicate web pages from the index of a search engine.

The trivial solution to the near neighbor problem would be to iterate through every $x\in P$ and compute $\operatorname{dist}(q,x)$ while keeping track of the nearest neighbor found so far. If we let $n=|P|$ denote the size of our collection and assume that it takes time $O(d)$ to compute the distance between a pair of objects, then the trivial solution uses time $O(dn)$ .
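As a point of reference, the linear scan described above can be sketched in a few lines. This is only an illustration: the function names `nearest_neighbor` and `euclidean` are hypothetical, and Euclidean distance is chosen here as an assumed example of a distance function.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length vectors; costs O(d)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def nearest_neighbor(P, q, dist):
    """Scan all n points of P, tracking the nearest one seen so far: O(dn) time."""
    best, best_dist = None, math.inf
    for x in P:
        dx = dist(q, x)
        if dx < best_dist:
            best, best_dist = x, dx
    return best, best_dist

P = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
q = (1.0, 2.0)
result = nearest_neighbor(P, q, euclidean)
```

Every query touches every point, which is the cost that the data structures discussed next aim to avoid.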

Suppose we are interested in preprocessing the collection $P$ into a data structure that supports answering queries faster than the trivial solution. In two-dimensional Euclidean space there exists a solution based on the Voronoi diagram of $P$ with space usage $O(n)$ and query time $O(\log n)$ [68]. In higher dimensions, the best known solutions to the nearest neighbor problem either suffer from space usage or query time that is exponential in $d$ [86]. This phenomenon is known as the “curse of dimensionality” and has recently been substantiated by conditional hardness results [6, 67, 170, 139], showing for example that the problem of finding all nearest neighbors in a collection of $n$ points in $d$ -dimensional Euclidean space cannot be solved in time subquadratic in $n$ when $d=\omega(\log n)$ unless the Strong Exponential Time Hypothesis (SETH) is false [169].

In order to efficiently solve similarity search problems in high-dimensional spaces, researchers and practitioners have turned to approximate solutions. Instead of finding the exact nearest neighbor of a query point, we settle for finding a point that is at most some approximation factor $c>1$ times further away than the nearest neighbor. The algorithms and data structures for similarity search in this thesis are primarily aimed at providing efficient solutions to the approximate near neighbor problem defined as follows:

Definition 1.1.

Let $P\subseteq X$ be a collection of $|P|=n$ points in a distance space $(X,\operatorname{dist})$ . A solution to the $(r,cr)$ -near neighbor problem is a data structure that supports the following query operation: Given a query $q\in X$ , if there exists $x\in P$ with $\operatorname{dist}(q,x)\leq r$ return $x^{\prime}\in P$ with $\operatorname{dist}(q,x^{\prime})<cr$ .

The $(r,cr)$ -near neighbor problem differs from the nearest neighbor problem by searching for any point within a fixed radius $r$ of the query point and allowing us to return points at distance up to $cr$ even though better candidates exist. It will be convenient to also define the $(s_{1},s_{2})$ -similarity problem as the natural equivalent of the $(r,cr)$ -near neighbor problem where we measure similarities rather than distances, i.e., we wish to find a point with similarity $\operatorname{sim}(q,x)\geq s_{1}$ and we are willing to accept points with similarity greater than $s_{2}$ , where $s_{2}<s_{1}$ .

By allowing an approximation factor $c>1$ it is possible to solve the $(r,cr)$ -near neighbor problem in Euclidean space (and many other spaces) with query time that is sublinear in $n$ and polynomial in $d$ using space polynomial in $d$ and $n$ [91, 66]. However, even approximation has its limits when it comes to alleviating the curse of dimensionality. Rubinstein [139] has recently shown that unless SETH is false, for every choice of constants $\gamma,\delta>0$ there exists $\varepsilon>0$ such that a solution to the $(1+\varepsilon)$ -approximate near neighbor problem with $O(n^{\gamma})$ preprocessing time must use query time $\Omega(n^{\delta})$ .

1.1 Locality-sensitive hashing

One of the most successful approaches for finding solutions to the approximate near neighbor problem in various spaces is known as locality-sensitive hashing, commonly abbreviated as LSH (see [12, 165] for more information). The idea behind locality-sensitive hashing is to construct a distribution $\mathcal{H}$ over functions $h\colon X\to R$ that are used to partition the space $X$ . This randomized partitioning scheme is locality-sensitive in the sense that close points $x,y\in X$ are more likely to hash to the same part of a randomly sampled partition. When discussing locality-sensitive hashing, we will sometimes refer to a distribution of locality-sensitive hash functions as a family.

Definition 1.2 (Locality-sensitive hashing [91]).

Let $(X,\operatorname{dist})$ be a distance space and let $\mathcal{H}$ be a distribution over functions $h\colon X\to R$ . We say that $\mathcal{H}$ is $(r,cr,p_{1},p_{2})$ -sensitive if for $x,y\in X$ and $h\sim\mathcal{H}$ we have that:

•

If $\operatorname{dist}(x,y)\leq r$ then $\Pr[h(x)=h(y)]\geq p_{1}$ .

•

If $\operatorname{dist}(x,y)\geq cr$ then $\Pr[h(x)=h(y)]\leq p_{2}$ .

We can speed up approximate near neighbor searches at the cost of some additional preprocessing by partitioning the set of points $P$ according to $L$ randomly sampled locality-sensitive hash functions $h_{1},\dots,h_{L}$ . A query for a point $q$ proceeds by considering the points of $P$ that collide with $q$ under $h_{1},\dots,h_{L}$ . Intuitively we want to sample enough hash functions such that the ball of radius $r$ around every potential query point $q\in X$ is covered by the union of the parts $h_{1}^{-1}(h_{1}(q)),\dots,h_{L}^{-1}(h_{L}(q))$ . This approach yields the following general LSH framework for solving the approximate near neighbor problem (for more details see Chapter 2).

Theorem 1.1 (Indyk-Motwani [91, 86], simplified).

Let $\mathcal{H}$ be $(r,cr,p_{1},p_{2})$ -sensitive and let $\rho=\frac{\log(1/p_{1})}{\log(1/p_{2})}$ , then there exists a solution to the $(r,cr)$ -near neighbor problem using $O(n^{1+\rho})$ words of space and with query time dominated by $O(n^{\rho}\log n)$ evaluations of functions from $\mathcal{H}$ .
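The following is a minimal sketch of the kind of data structure behind Theorem 1.1, under stated assumptions: each of the $L$ tables is keyed by a concatenation of $k$ independently sampled hash functions, and a query inspects the points that collide with it in some table. The class name `LSHIndex`, the helper `sample_bit`, and the parameter values are hypothetical; the sketch also omits the usual early termination after a bounded number of far candidates and makes no attempt to choose $k$ and $L$ as functions of $n$, $p_{1}$, and $p_{2}$. The demonstration uses the bit-sampling family from Section 1.2.

```python
import random
from collections import defaultdict

class LSHIndex:
    """Sketch of an LSH index: L hash tables, each keyed by k concatenated hashes."""

    def __init__(self, sample_hash, k, L, seed=0):
        rng = random.Random(seed)
        # Every table gets its own tuple of k independently sampled functions.
        self.functions = [[sample_hash(rng) for _ in range(k)] for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]

    def _key(self, l, x):
        return tuple(h(x) for h in self.functions[l])

    def insert(self, x):
        for l, table in enumerate(self.tables):
            table[self._key(l, x)].append(x)

    def query(self, q, dist, cr):
        # Inspect the points colliding with q; report the first at distance < cr.
        for l, table in enumerate(self.tables):
            for x in table.get(self._key(l, q), ()):
                if dist(q, x) < cr:
                    return x
        return None

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def sample_bit(d):
    """Sampler for bit-sampling hash functions h(x) = x[i] on {0,1}^d."""
    def sampler(rng):
        i = rng.randrange(d)
        return lambda x: x[i]
    return sampler

d = 16
index = LSHIndex(sample_bit(d), k=4, L=8, seed=1)
zeros = tuple([0] * d)
ones = tuple([1] * d)
index.insert(zeros)
```

Increasing `k` makes far points less likely to share a bucket with the query, while increasing `L` makes it more likely that a near point collides in at least one table; the theorem balances the two.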

1.2 Examples

To further introduce locality-sensitive hashing and the approach to solving the approximate near neighbor problem used in this thesis, we will present three simple and powerful families of locality-sensitive hash functions: Bit-sampling by Indyk and Motwani [91], MinHash by Broder [35], and SimHash by Charikar [47]. Indyk, Broder, and Charikar received the 2012 ACM Paris Kanellakis Theory and Practice Award “for their groundbreaking work on Locality-Sensitive Hashing that has had great impact in many fields of computer science including computer vision, databases, information retrieval, machine learning, and signal processing” [1]. We proceed by describing each of these families in turn, introducing relevant notation as we go along.

Bit-sampling.

Indyk and Motwani introduced a simple family of locality-sensitive hash functions $\mathcal{H}_{H}$ for the $d$ -dimensional Boolean hypercube $\{0,1\}^{d}$ under Hamming distance $\operatorname{dist}_{H}(x,y)=|\{i\in[d]\mid x_{i}\neq y_{i}\}|$ where $[d]$ denotes the set $\{1,2,\dots,d\}$ . We sample a function $h\sim\mathcal{H}_{H}$ by sampling $i$ uniformly at random in $[d]$ and setting $h(x)=x_{i}$ . It is easy to see that a pair of points $x,y$ fails to collide under a random hash function $h(x)=x_{i}$ if and only if $i$ is sampled from the set of coordinates where $x$ and $y$ differ.

$$\Pr_{h\sim\mathcal{H}_{H}}[h(x)=h(y)] = 1-\Pr_{h\sim\mathcal{H}_{H}}[h(x)\neq h(y)] = 1-\operatorname{dist}_{H}(x,y)/d.$$

Suppose we want to use this family to solve the $(r,cr)$ -near neighbor problem in Hamming space $(\{0,1\}^{d},\operatorname{dist}_{H})$ . Then, from Theorem 1.1 we obtain a query exponent of

$$\rho=\frac{\log(1/(1-r/d))}{\log(1/(1-cr/d))}\leq 1/c,$$

where the details behind the last inequality can be found in [86]. In conclusion, bit-sampling gives a solution to the $(r,cr)$ -near neighbor problem in Hamming space with query time roughly $n^{1/c}$ and space usage and preprocessing time roughly $n^{1+1/c}$ .
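The collision probability of bit-sampling, $1-\operatorname{dist}_{H}(x,y)/d$ , and the resulting exponent can be checked numerically. The sketch below is illustrative only: the function names, the 0-indexed coordinates, and the chosen values of $d$ , $r$ , and $c$ are assumptions made for the example.

```python
import math
import random

def sample_bit_hash(d, rng):
    """Sample h(x) = x[i] with i uniform in {0, ..., d-1} (0-indexed coordinates)."""
    i = rng.randrange(d)
    return lambda x: x[i]

def collision_rate(x, y, trials, rng):
    """Fraction of freshly sampled bit-sampling functions under which x and y collide."""
    hits = 0
    for _ in range(trials):
        h = sample_bit_hash(len(x), rng)
        hits += h(x) == h(y)
    return hits / trials

def rho_bit_sampling(r, c, d):
    """rho = log(1/p1) / log(1/p2) with p1 = 1 - r/d and p2 = 1 - cr/d (needs cr < d)."""
    return math.log(1 - r / d) / math.log(1 - c * r / d)

rng = random.Random(0)
d = 64
x = [0] * d
y = [1] * 16 + [0] * 48              # Hamming distance 16 from x
estimate = collision_rate(x, y, 20000, rng)
exact = 1 - 16 / d                   # collision probability 1 - dist/d
rho = rho_bit_sampling(r=4, c=2, d=d)
```

The estimate should concentrate around `exact`, and `rho` stays below $1/c$ , in line with the inequality stated above.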

MinHash.

MinHash is a family of locality-sensitive hash functions with applications to similarity search and similarity estimation on sets under Jaccard similarity. Given sets $x,y\subseteq[d]$ their Jaccard similarity is defined by $\operatorname{sim}_{J}(x,y)=|x\cap y|/|x\cup y|$ .

A random hash function $h$ from the MinHash distribution $\mathcal{H}_{J}$ is specified by a random permutation of $[d]$ and hashes a set $x$ to the first element of $x$ in this permutation. The permutation can be specified by a uniformly random hash function $f\colon[d]\to[0,1]$ where $[0,1]$ denotes the closed interval from $0$ to $1$ . Specifically, we sample a random $h\sim\mathcal{H}_{J}$ by sampling a uniformly random hash function $f\colon[d]\to[0,1]$ and setting

$$h(x)=\operatorname*{arg\,min}_{i\in x} f(i).$$

Two sets $x$ and $y$ collide under a random hash function $h\sim\mathcal{H}_{J}$ if and only if the smallest element of $x\cup y$ (with respect to $f$ ) is contained in $x\cap y$ . Otherwise, the smallest element of $x$ is in $x{\setminus}y$ or the smallest element of $y$ is in $y{\setminus}x$ and there is no way the sets hash to the same element. Since the smallest element of $x\cup y$ is uniformly distributed we get that

$$\Pr_{h\sim\mathcal{H}_{J}}[h(x)=h(y)] = \frac{|x\cap y|}{|x\cup y|} = \operatorname{sim}_{J}(x,y).$$

MinHash gives a solution to the $(s_{1},s_{2})$ -similarity problem with exponent $\rho=\log(1/s_{1})/\log(1/s_{2})$ .
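A small simulation can illustrate that the MinHash collision probability matches the Jaccard similarity. The sketch assumes a universe $\{0,\dots,d-1\}$ (0-indexed for convenience) and draws the values of $f$ with Python's `random` module; the names `sample_minhash` and `jaccard` are hypothetical.

```python
import random

def sample_minhash(d, rng):
    """Sample f: {0..d-1} -> [0,1) at random and return h(x) = argmin over i in x of f(i)."""
    f = [rng.random() for _ in range(d)]
    return lambda x: min(x, key=lambda i: f[i])

def jaccard(x, y):
    return len(x & y) / len(x | y)

rng = random.Random(0)
d = 100
x = set(range(0, 60))
y = set(range(30, 90))               # |x & y| = 30 and |x | y| = 90
trials = 5000
hits = 0
for _ in range(trials):
    h = sample_minhash(d, rng)
    hits += h(x) == h(y)
estimate = hits / trials             # should be close to jaccard(x, y) = 1/3
```

In practice the random function $f$ is replaced by a hash function with a short description; the simulation above stores $f$ explicitly only to keep the example self-contained.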

SimHash.

SimHash is a family of Boolean-valued locality-sensitive hash functions for $\mathbb{R}^{d}$ under cosine similarity $\operatorname{sim}_{C}(x,y)=\cos(\theta(x,y))$ where $\theta(x,y)$ denotes the angle between $x$ and $y$ . We sample a function $h\sim\mathcal{H}_{C}$ by sampling a $d$ -dimensional standard normal random variable $z\sim\mathcal{N}^{d}(0,1)$ and setting

$$h(x)=\operatorname{sign}(\langle x,z\rangle).$$

Intuitively, we sample a random hyperplane that goes through the origin and hash points depending on which side of the hyperplane they are on (the sign of the inner product $\langle{x},{z}\rangle=\sum_{i}x_{i}z_{i}$ ). Due to the rotational invariance of the standard normal distribution the properties of this scheme can be analyzed in two dimensions. The probability that two points on the unit circle are separated by a random line through the origin is exactly $\theta(x,y)/\pi$ , so that

$$\Pr_{h\sim\mathcal{H}_{C}}[h(x)=h(y)] = 1-\theta(x,y)/\pi.$$

This scheme yields a solution to the $(s_{1},s_{2})$ -similarity problem under cosine similarity with $\rho=\log(1-\arccos(s_{1})/\pi)/\log(1-\arccos(s_{2})/\pi)$ .
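The random-hyperplane scheme and its collision probability $1-\theta(x,y)/\pi$ can be illustrated with a short simulation. The names `sample_simhash`, `angle`, and `rho_simhash` are hypothetical, and the test vectors are an arbitrary choice with angle $\pi/4$ .

```python
import math
import random

def sample_simhash(d, rng):
    """Sample z with i.i.d. standard normal entries and return h(x) = sign(<x, z>)."""
    z = [rng.gauss(0.0, 1.0) for _ in range(d)]
    return lambda x: 1 if sum(xi * zi for xi, zi in zip(x, z)) >= 0 else -1

def angle(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    # Clamp to [-1, 1] to guard against rounding before taking arccos.
    return math.acos(max(-1.0, min(1.0, dot / (norm_x * norm_y))))

def rho_simhash(s1, s2):
    """Exponent log(1 - arccos(s1)/pi) / log(1 - arccos(s2)/pi)."""
    return math.log(1 - math.acos(s1) / math.pi) / math.log(1 - math.acos(s2) / math.pi)

rng = random.Random(0)
x = [1.0, 0.0, 0.0]
y = [1.0, 1.0, 0.0]                  # angle pi/4 between x and y
trials = 20000
hits = 0
for _ in range(trials):
    h = sample_simhash(3, rng)
    hits += h(x) == h(y)
estimate = hits / trials
exact = 1 - angle(x, y) / math.pi    # 1 - (pi/4)/pi = 0.75
```

Note that the hash value depends only on the direction of $x$ , so the vectors do not need to be normalized before hashing.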

1.3 Lower bounds

Given a space $(X,\operatorname{dist})$ and distance thresholds $r$ , $cr$ we are interested in finding a $(r,cr,p_{1},p_{2})$ -sensitive family with a value of $\rho=\log(1/p_{1})/\log(1/p_{2})$ that is as small as possible. The primary technique for deriving locality-sensitive hashing lower bounds has been Fourier analysis of Boolean functions under noisy inputs (see the excellent book by O’Donnell for a comprehensive introduction [120]). Lower bounds for locality-sensitive hashing schemes (distributions over functions) often follow from lower bounds on the behaviour of a single function $f\colon\{-1,1\}^{d}\to R$ under randomly $\alpha$ -correlated inputs, defined as follows:

Definition 1.3.

For $-1\leq\alpha\leq 1$ and $x,y\in\{-1,1\}^{d}$ we say that $(x,y)$ is randomly $\alpha$ -correlated if the pairs $(x_{i},y_{i})$ are i.i.d. with $x_{i}$ uniform in $\{-1,1\}$ and

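$$y_{i}=\begin{cases}x_{i}&\text{with probability }\frac{1+\alpha}{2},\\-x_{i}&\text{with probability }\frac{1-\alpha}{2}.\end{cases}$$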

If two vectors $(x,y)$ are randomly $\alpha$ -correlated their expected cosine similarity is $\alpha$ , and their expected Hamming distance is given by $(1-\alpha)d/2$ . As the dimensionality increases, the empirical correlation between $x$ and $y$ will be tightly concentrated around $\alpha$ .
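The concentration claim is easy to observe numerically. The sketch below assumes the usual reading of the definition, namely that $y_{i}=x_{i}$ with probability $(1+\alpha)/2$ and $y_{i}=-x_{i}$ otherwise, which is consistent with the expected Hamming distance $(1-\alpha)d/2$ stated above. All function names are hypothetical.

```python
import random


def sample_correlated_pair(d, alpha, rng):
    """Sample a randomly alpha-correlated pair (x, y) in {-1, 1}^d.

    Assumption: y_i = x_i with probability (1 + alpha) / 2 and y_i = -x_i otherwise.
    """
    x = [rng.choice((-1, 1)) for _ in range(d)]
    y = [xi if rng.random() < (1 + alpha) / 2 else -xi for xi in x]
    return x, y


def hamming_distance(x, y):
    """Number of coordinates in which x and y differ."""
    return sum(a != b for a, b in zip(x, y))


def empirical_correlation(x, y):
    """Cosine similarity <x, y> / d of two vectors in {-1, 1}^d."""
    return sum(a * b for a, b in zip(x, y)) / len(x)
```

For vectors in $\{-1,1\}^{d}$ the empirical correlation equals $1-2\operatorname{dist}_{H}(x,y)/d$ exactly, so concentration of one quantity implies concentration of the other.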

Let $0\leq\beta<\alpha<1$ and consider a $((1-\alpha)d/2,(1-\beta)d/2,p_{1},p_{2})$ -sensitive family $\mathcal{H}$ for Hamming space $(\{-1,1\}^{d},\operatorname{dist}_{H})$ . Combining lower bounds by O’Donnell et al. [121] and Andoni and Razenshteyn [19] (building on work by Motwani et al. [114]), we have that

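$$\rho\geq\max\left\{\frac{\log(1/\alpha)}{\log(1/\beta)},\frac{1-\alpha}{1+\alpha-2\beta}\right\}-o_{d}(1).\qquad(1)$$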

The lower bounds require that $p_{2}$ is not too small as a function of $d$ . In particular, only the trivial lower bound of $\rho\geq 0$ holds if $p_{2}$ can be exponentially small in $d$ , but such families are typically not of interest for high-dimensional similarity search where we want $p_{2}\approx 1/n$ . For a more comprehensive discussion of this issue see [121].

Compared against different constructions of locality-sensitive hash families, the two lower bounds comprising equation (1) reveal interesting properties of the Boolean hypercube. As $\alpha,\beta$ approach $1$ the lower bound of $\log(1/\alpha)/\log(1/\beta)$ is the larger of the two bounds. If we convert the lower bound to Hamming distance we get that $\rho\geq\log(1/(1-2r/d))/\log(1/(1-2cr/d))\approx 1/c$ for an $(r,cr,p_{1},p_{2})$ -sensitive family when $r,cr\ll d$ . This lower bound is tight against the bit-sampling LSH of Indyk and Motwani. The bit-sampling family can be described as randomly partitioning the Boolean hypercube into subcubes, so in a sense subcubes are an optimal “shape” for distinguishing between very short random walks and slightly longer random walks in the Boolean hypercube. The lower bound of $1/c$ in Hamming space gives a lower bound of $1/c^{p}$ for $\ell_{p}^{d}$ -spaces (vectors in $\mathbb{R}^{d}$ under the $\ell_{p}$ -norm $\left\lVert x-y\right\rVert_{p}=(\sum_{i}|x_{i}-y_{i}|^{p})^{1/p}$ ). This follows from a direct embedding of the Boolean hypercube in $\ell_{p}$ -space.
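The approximation $\rho\approx 1/c$ for $r,cr\ll d$ can be illustrated numerically. The sketch below evaluates the Hamming form of the lower bound next to the exponent of bit-sampling, where we assume the standard collision probabilities $p_{1}=1-r/d$ and $p_{2}=1-cr/d$ of sampling a single uniformly random coordinate. The function names are hypothetical.

```python
import math


def rho_lower_bound_hamming(r, c, d):
    """log(1/(1 - 2r/d)) / log(1/(1 - 2cr/d)); requires cr < d/2."""
    return math.log(1 / (1 - 2 * r / d)) / math.log(1 / (1 - 2 * c * r / d))


def bit_sampling_rho(r, c, d):
    """Exponent of bit-sampling, assuming p1 = 1 - r/d and p2 = 1 - cr/d; requires cr < d."""
    return math.log(1 / (1 - r / d)) / math.log(1 / (1 - c * r / d))
```

With $c=2$ both quantities approach $1/2$ as $r/d$ tends to zero, while for larger $r/d$ they separate and the lower bound drops below the bit-sampling exponent.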

As $\beta$ approaches $0$ the lower bound of $(1-\alpha)/(1+\alpha-2\beta)$ dominates. Converted to Hamming distance this bound becomes $\rho\geq 1/(2c-1)$ . For $\beta=0$ this is tight against existing constructions that use balls to partition the hypercube [73, 14, 13]. Loosely speaking, in this regime we see that balls in Hamming space are optimal for simultaneously minimizing volume (capturing $0$ -correlated points) while maximizing the probability of capturing positively correlated points.

In Hamming space, the family of locality-sensitive hash functions that give the best known upper bound on the $\rho$ -value can essentially be described as follows: We sample a function $h\sim\mathcal{H}$ by sampling a sequence of $d$ balls of radius slightly below $d/2$ with the center of each ball being sampled uniformly at random from $\{-1,1\}^{d}$ . A point $x\in\{-1,1\}^{d}$ is then hashed to the index of the first ball in the sequence that contains $x$ . As we increase $d$ and decrease the radius of the balls, this scheme has a $\rho$ -value for the $((1-\alpha)d/2,(1-\beta)d/2)$ -near neighbor problem of

$$\rho=\frac{1-\alpha}{1+\alpha}\cdot\frac{1+\beta}{1-\beta}.\qquad(2)$$

This scheme also works on the unit sphere if we replace the balls by spherical caps [156, 13]. The size of the gap between the lower bound in equation (1) and the upper bound (2) is shown in Figure 1. Since the gap is less than $0.06$ it is difficult to argue that closing the gap would have huge practical implications, especially since the lower order terms in existing constructions exceed this for most realistic applications [13]. Nevertheless, considering the tools that have gone into proving the existing lower bounds, we believe that it is of fundamental mathematical interest to understand how to best separate $\beta$ -correlated points from $\alpha$ -correlated points.

1.4 Beyond locality-sensitive hashing

A common theme among recent advances in the area of theoretical approximate similarity search has been to move beyond standard locality-sensitive hashing [14, 100, 149, 24, 16, 54, 56, 22]. The results in this direction usually modify part of the framework, for example by constructing the locality-sensitive family by looking at the data, but the underlying approach of using locality-sensitive mappings from points to buckets remains the same. This thesis explores several variations of standard locality-sensitive hashing and we therefore briefly introduce some of this work here.

Data-dependent locality-sensitive hashing.

A sequence of papers [14, 17, 19, 18, 16] has explored the idea of data-dependent locality-sensitive hashing: If we allow the construction of $\mathcal{H}$ to depend on the set of data points $P$ , how fast can we then solve the approximate near neighbor problem? Andoni and Razenshteyn were able to show matching upper and lower bounds of $\rho=1/(2c^{2}-1)+o_{d}(1)$ in Euclidean space [17, 19]. This matches standard LSH upper and lower bounds in the case of random instances on the unit sphere, and indeed the construction by Andoni and Razenshteyn is based on a reduction to this case. Unfortunately the construction and its analysis are complicated and suffer from large lower order terms [16], although recent work has found some success in striking a balance between algorithmic simplicity and theoretical optimality using data-dependence in Hamming space [18].

Asymmetric locality-sensitive hashing.

Asymmetric locality-sensitive hashing extends the concept of standard locality-sensitive hashing to cover distributions over pairs of functions $(h,g)\sim\mathcal{A}$ and studies how the probability of collision between pairs of points can be made to depend on the distance/similarity between the points [149, 22]. This modification to standard locality-sensitive hashing opens up new applications such as approximate search for furthest neighbors, orthogonal vectors [163], and annulus queries (see [22] for an overview). In Chapter 6 we show lower bounds for asymmetric locality-sensitive hashing.

Space-time tradeoffs.

The standard locality-sensitive hashing framework offers a balanced space-time tradeoff that is the result of a symmetric query and update procedure: Every data point is stored in $O(n^{\rho})$ buckets and during queries we probe $O(n^{\rho})$ buckets. A line of work has investigated how the query and update parts of the algorithm can be modified to yield different tradeoffs between space usage and query time [129, 106, 9, 95, 100, 54, 16]. Typically the performance of such solutions is expressed by two exponents: $\rho_{u}$ and $\rho_{q}$ . During updates we store points in $O(n^{\rho_{u}})$ buckets and during queries we probe $O(n^{\rho_{q}})$ buckets.

Early work in this area focused on how to modify the standard locality-sensitive hashing query and update algorithms using an idea known as multi-probing [106]. Regular locality-sensitive hashing uses $L=O(n^{\rho})$ hash functions $h_{1},\dots,h_{L}$ . Suppose $h_{l}(q)$ denotes the $l$ th bucket to be probed during the standard LSH query algorithm. By inspecting buckets in the neighborhood of $h_{l}(q)$ , for example by adding some noise $z$ to $q$ and probing $h_{l}(q+z)$ , we can increase the probability of finding a near neighbor of $q$ , which in turn allows us to reduce $L$ while maintaining correctness.

Recent breakthroughs in this area have come by abandoning the locality-sensitive hashing framework in favor of a more direct approach based on locality-sensitive filtering [100, 54]. Finally, Andoni et al. [16] have combined their data-dependent approach to locality-sensitive hashing with the best known space-time tradeoff solutions for random data to obtain optimal space-time tradeoffs, as shown by matching lower bounds. The optimal trade-off between $\rho_{q},\rho_{u}\geq 0$ for the $(r,cr)$ -near neighbor problem in Euclidean space can be described by the equation

$$c^{2}\sqrt{\rho_{q}}+(c^{2}-1)\sqrt{\rho_{u}}=\sqrt{2c^{2}-1}.$$

For a balanced tradeoff this collapses to $1/(2c^{2}-1)$ which is tight for data-dependent locality-sensitive hashing, but the bound has been shown to be tight for every choice of $\rho_{q},\rho_{u}$ that satisfies the equation.
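To make the tradeoff concrete, the following sketch solves for $\rho_{u}$ given $\rho_{q}$. It assumes that the tradeoff of Andoni et al. [16] takes the form $c^{2}\sqrt{\rho_{q}}+(c^{2}-1)\sqrt{\rho_{u}}=\sqrt{2c^{2}-1}$, which is consistent with the balanced value $1/(2c^{2}-1)$ stated above. The function names are hypothetical.

```python
import math


def update_exponent(rho_q, c):
    """Solve c^2 sqrt(rho_q) + (c^2 - 1) sqrt(rho_u) = sqrt(2c^2 - 1) for rho_u (c > 1).

    Assumption: this is the form of the optimal Euclidean tradeoff curve.
    """
    remainder = math.sqrt(2 * c * c - 1) - c * c * math.sqrt(rho_q)
    if remainder < 0:
        raise ValueError("rho_q is larger than needed even with rho_u = 0")
    return (remainder / (c * c - 1)) ** 2


def balanced_exponent(c):
    """Value of rho_q = rho_u on the curve, namely 1 / (2c^2 - 1)."""
    return 1 / (2 * c * c - 1)
```

For $c=2$ the balanced point is $\rho_{q}=\rho_{u}=1/7$, and under the assumed curve the near-linear space endpoint $\rho_{u}=0$ corresponds to $\rho_{q}=(2c^{2}-1)/c^{4}=7/16$.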

Locality-sensitive filters and maps.

Locality-sensitive filtering [24] differs from locality-sensitive hashing in that it uses locality-sensitive subsets of space (filters) rather than locality-sensitive partitions (hash functions) to solve the approximate near neighbor problem. An example of a locality-sensitive filter family is the distribution over balls of some fixed radius in Hamming space. This idea is further extended to allow asymmetry by using different filters for queries and updates [100, 54]. It turns out that the filter family consisting of pairs of concentric balls in Hamming space can be used to solve the approximate near-neighbor problem with optimal space-time tradeoffs, matching the lower bound of Andoni et al. [16] for random data. Chapter 3 further introduces locality-sensitive filtering and space-time tradeoffs.

In even greater generality we can think of locality-sensitive hashing and filtering as being approaches to constructing randomized mappings $M\colon X\to 2^{R}$ (where $2^{R}$ denotes the power set of $R$ ) from a space $(X,\operatorname{dist})$ to a collection of $|R|$ buckets that satisfy certain properties. Recent work on set similarity search (Chapter 4) and improvements to the standard locality-sensitive hashing framework (Chapter 2) explores these ideas and obtains efficient search algorithms by deviating from the standard approach.

2 Part II: Pseudorandom hashing and generators

The second part of this thesis contains results on efficient pseudorandom hash functions and pseudorandom number generators. We are interested in replacing the use of true randomness in randomized algorithms and data structures with the output of a pseudorandom hash function or generator, stretching a small random seed into a much larger output of pseudorandom values, while retaining guarantees on the performance of these algorithms. For a primer on the general study of pseudorandom generators see [79].

Universal hashing.

The pseudorandomness part of this thesis focuses on one specific type of pseudorandomness known as $k$ -wise independence or $k$ -independence, first introduced to the field of computer science through the concept of universal hashing by Carter and Wegman [40].

Definition 1.4.

Let $k$ be a positive integer and let $\mathcal{F}$ be a family of functions from $U$ to $R$ . We say that $\mathcal{F}$ is a $k$ -independent family of functions if for every choice of $k$ distinct keys $x_{1},\dots,x_{k}$ and arbitrary values $y_{1},\dots,y_{k}$ we have that

$$\Pr_{f\sim\mathcal{F}}[f(x_{1})=y_{1}\wedge\dots\wedge f(x_{k})=y_{k}]=|R|^{-k}.$$

Furthermore, we say that $f$ is $k$ -independent when it is selected uniformly at random from a family of $k$ -independent functions.

We can sample a $k$ -independent hash function $f(x)=\sum_{i=0}^{k-1}a_{i}x^{i}\bmod p$ by sampling each $a_{i}$ uniformly at random from the set $\{0,1,\dots,p-1\}$ where $p$ is prime. In fact, the family of polynomials of degree at most $k-1$ over a finite field is $k$ -independent [92]. We are typically interested in applications where the size of the universe $u=|U|$ is much larger than the degree of independence $k$ .
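The polynomial construction is short enough to state in code. The Python sketch below evaluates $f(x)=\sum_{i=0}^{k-1}a_{i}x^{i}\bmod p$ by Horner's rule and, for a tiny prime, enumerates the whole family to confirm that $k$ distinct keys receive uniformly distributed value tuples. The function names are hypothetical, and keys are assumed to be integers in $\{0,\dots,p-1\}$.

```python
import itertools
import random
from collections import Counter


def polynomial_hash(coeffs, p):
    """Return f(x) = sum_i coeffs[i] * x^i mod p, evaluated with Horner's rule."""
    def f(x):
        acc = 0
        for a in reversed(coeffs):
            acc = (acc * x + a) % p
        return acc
    return f


def sample_k_independent_hash(k, p, rng):
    """Sample the k coefficients a_0, ..., a_{k-1} uniformly from {0, ..., p-1}."""
    return polynomial_hash([rng.randrange(p) for _ in range(k)], p)


def joint_distribution(keys, p):
    """Count the value tuples (f(x_1), ..., f(x_k)) over all p^k polynomials of degree < k."""
    counts = Counter()
    for coeffs in itertools.product(range(p), repeat=len(keys)):
        f = polynomial_hash(coeffs, p)
        counts[tuple(f(x) for x in keys)] += 1
    return counts
```

With $p=5$ and three distinct keys, each of the $5^{3}$ possible value tuples is produced by exactly one polynomial, which is the definition of $3$ -independence for this family.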

Different types of hashing-based dictionaries work for $k$ -independent hash functions with $k$ much smaller than the number of elements in the dictionary which we denote by $n$ . For example, it was shown that $5$ -independence suffices for linear probing to ensure expected constant time per operation [124]. It is known that $\Theta(\log n)$ -independence suffices for Cuckoo hashing [127], but $5$ -independence is not enough to ensure constant amortized cost per operation [63]. For a brief introduction to the use of random hashing in algorithms and data structures see [69].

Fast hashing and lower bound.

For applications that require super-constant independence, the time to evaluate the hash function can be a performance bottleneck. A $k$ -independent polynomial hash function can be stored using $O(k)$ words and evaluated using time $O(k)$ on a word-RAM, assuming constant time arithmetic over the finite field. What if we are willing to use more space to represent a $k$ -independent hash function $f\sim\mathcal{F}$ in order to reduce the evaluation time? Siegel [151] gave a powerful cell-probe lower bound for this problem, showing that for a $k$ -independent hash function with domain size $u$ , even if we use space roughly $O(ku^{1/t})$ for some $t\geq 1$ the evaluation time has to be $\Omega(t)$ .

Siegel also showed the existence of a matching upper bound based on highly unbalanced bipartite expander graphs $G=(U\cup V,E)$ with left vertex set $U$ corresponding to the domain of the hash function, right vertex set $V$ of size $|V|=O(ku^{1/t})$ , and left outdegree $d=O(t)$ . Given an appropriate expander graph $G$ we can sample a $k$ -independent hash function $f\colon U\to R$ by associating each vertex $v\in V$ with a random element from $R$ where we assume that $(R,+)$ is an abelian group, such as the integers under modular arithmetic. To compute $f(x)$ we take the sum of the random elements associated with the neighbors of the vertex $x\in U$ and return the result.

Unfortunately we only know of the existence of such optimal expander graphs by the probabilistic method: a random bipartite graph has the right properties for optimal $k$ -independent hashing with overwhelming probability if we parameterize the graph generation process correctly. Several works, Siegel’s original paper included, attempt to approach the performance of such optimal bipartite expander graphs by the use of probabilistic constructions [72, 123, 158, 58]. In Chapter 9 we show a probabilistic construction with space usage and evaluation time that almost matches the lower bound. Finding optimal explicit constructions remains a major open problem.
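The evaluation rule of the graph-based construction can be sketched as follows, taking $R$ to be the integers modulo some modulus. This is an illustration only: the random graph is stored explicitly, which uses space proportional to $|U|\cdot d$ and therefore defeats the purpose of the construction, and no expansion property is verified. All names and parameters are hypothetical.

```python
import random


def sample_random_graph_hash(universe_size, num_right, degree, modulus, rng):
    """Sample a random left-regular bipartite graph and a random table indexed by V.

    Assumption: each left vertex picks its `degree` neighbors independently and
    uniformly at random; the graph is stored explicitly for illustration only.
    """
    neighbors = [[rng.randrange(num_right) for _ in range(degree)]
                 for _ in range(universe_size)]
    table = [rng.randrange(modulus) for _ in range(num_right)]
    return neighbors, table


def evaluate_graph_hash(neighbors, table, x, modulus):
    """f(x) = sum of the table entries at the neighbors of x, in the group Z_modulus."""
    return sum(table[v] for v in neighbors[x]) % modulus
```

Evaluating $f(x)$ touches $d$ table entries, matching the $O(t)$ probes of the upper bound when $d=O(t)$ and the neighbors of $x$ can be listed in time $O(d)$ .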

Other approaches to the problem of finding fast hash functions with theoretical guarantees include the study of tabulation hashing and its variations, which have guarantees beyond what can be derived from the degree of $k$ -independence [159], simulating uniformly random hashing in constant time on a subset of the universe [123], reusing randomness by splitting the problem into sub-problems that share a single highly random hash function [71], and extracting additional randomness from the input to the hash function [59].

Generating $k$ -independent random variables.

The generation of $k$ -independent random variables differs from random hashing by allowing the algorithm designer to specify where to evaluate a $k$ -independent function $f$ in order to generate a sequence of variables $f(x_{1}),f(x_{2}),\dots$ that is $k$ -independent. The problem of generating a sequence of $k$ -independent random variables is therefore easier than the problem of constructing a data structure to represent a random $k$ -independent hash function that an adversary can choose to evaluate at an arbitrary point.

We can take a standard $k$ -independent polynomial hash function and evaluate it in $k$ points in time $k\operatorname{poly}\log k$ using fast multipoint evaluation algorithms [27, 164], giving us a generator of $k$ -independent random variables with generation time $\operatorname{poly}\log k$ per variable that uses space $O(k)$ . This in itself shows that the task of generation is easier than hashing, as it would be impossible to evaluate a $k$ -independent hash function in time $\operatorname{poly}\log k$ using space $O(k)$ , if for example $k=\Theta(\log u)$ . In Chapter 8 we show how to generate $k$ -independent variables in constant time, independent of $k$ , using space $k\operatorname{poly}\log k$ .

3 Overview and contributions

This thesis is divided into two parts. The first part presents algorithms and lower bounds for various problems related to similarity search. The second part presents algorithms for the efficient generation of high-quality pseudorandom numbers, as well as efficient hash functions. The chapters are based on the following papers:

I. Similarity search.

2. Tobias Christiani: Fast locality-sensitive hashing frameworks for approximate near neighbor search [53]. 2017. Unpublished.

3. Tobias Christiani: A Framework for Similarity Search with Space-Time Tradeoffs using Locality-Sensitive Filtering [54]. SODA 2017.

4. Tobias Christiani and Rasmus Pagh: Set similarity search beyond MinHash [56]. STOC 2017.

5. Tobias Christiani, Rasmus Pagh and Johan Sivertsen: Scalable and robust set similarity join [57]. ICDE 2018.

6. Martin Aumüller, Tobias Christiani, Rasmus Pagh and Francesco Silvestri: Distance-sensitive hashing [22]. PODS 2018.

7. Tobias Christiani: Optimal Boolean locality-sensitive hashing. 2018. Unpublished.

II. Pseudorandomness.

8. Tobias Christiani and Rasmus Pagh: Generating $k$ -independent variables in constant time [55]. FOCS 2014.

9. Tobias Christiani, Rasmus Pagh and Mikkel Thorup: From Independence to Expansion and Back Again [58]. STOC 2015.

We proceed by giving a brief description of the contribution of each chapter.

3.1 Part I: Similarity search

Chapter 2: Fast locality-sensitive hashing frameworks.

This chapter begins by surveying different techniques for constructing a solution to the approximate near neighbor problem from a family of locality-sensitive hash functions. Given a family $\mathcal{H}$ of locality-sensitive hash functions, the standard Indyk-Motwani framework (Theorem 1.1) uses $O(n^{\rho}\log n)$ functions from $\mathcal{H}$ to solve the approximate near neighbor problem. During a query all of these hash functions are evaluated, dominating the query time. For many LSH schemes the time to evaluate a single function is $O(d)$ or greater, as witnessed for example by SimHash or MinHash, further exacerbating the problem. Building on recent work by Dahlgaard et al. [64] we show that the number of locality-sensitive hash functions can be reduced to $O(\log^{2}n)$ in general, yielding an improved LSH framework. We combine this result with a technique from another LSH framework by Andoni and Indyk [10] to reduce the word-RAM complexity of this improved framework by a logarithmic factor to $O(n^{\rho})$ .

Chapter 3: Space-time tradeoffs for similarity search.

This chapter introduces a framework for solving the approximate near neighbor problem with space-time tradeoffs using locality-sensitive filtering. We show concrete solutions on the unit sphere under cosine similarity with extensions to $\ell_{p}$ -space for every $0<p\leq 2$ . These results improve and generalize prior work [100, 95]. We also include a lower bound on the space-time tradeoff that is tight, but suffers from some important restrictions. A paper by Andoni et al. [16] has since shown a strengthened lower bound and an improved upper bound through the use of data-dependent techniques. An early version of the paper behind this chapter formed part of my master’s thesis. At the end of the chapter we have added an improved locality-sensitive filtering framework compared to the one in the main text, building on ideas introduced in Chapters 2 and 4.

Chapter 4: Set similarity search beyond MinHash.

In this chapter we consider the problem of set similarity search under Braun-Blanquet similarity $\operatorname{sim}_{B}(x,y)=|x\cap y|/\max(|x|,|y|)$ . We show that the $(s_{1},s_{2})$ -similarity problem in this setting can be solved with an exponent of $\rho=\log(1/s_{1})/\log(1/s_{2})$ and that this is tight among solutions based on data-independent locality-sensitive maps. The upper bound is based on a novel construction inspired by branching processes and interestingly, although it is data-independent, it outperforms the best known data-dependent techniques for a large portion of the parameter space $0\leq s_{2}<s_{1}<1$ . The lower bound follows from a reduction to the standard $(r,cr)$ -near neighbor problem in Hamming space for $r,cr\ll d/2$ . In this setting the lower bound by O’Donnell et al. [121] is tight and we are able to show that it extends to Braun-Blanquet similarity for every choice of $0\leq s_{2}<s_{1}<1$ . This is interesting in the light of the gap in our knowledge when it comes to the usual $(\alpha,\beta)$ -similarity problem for cosine similarity, as explained in the introduction.

Chapter 6: Lower bounds for asymmetric locality-sensitive hashing.

In this chapter we derive lower bounds (on the $\rho$ -value) for asymmetric locality-sensitive hashing. Our lower bound covers the case of asymmetric families for approximate near neighbor search, as well as the case of approximate furthest neighbor search where we are interested in having the collision probability of $(h,g)\sim\mathcal{A}$ increase in the distance between points. We show that our lower bounds are tight against existing symmetric constructions in the case of the application to near neighbor search, and that this construction can easily be modified to yield an optimal asymmetric construction for furthest neighbor search.

Chapter 7: Optimal Boolean locality-sensitive hashing.

In this chapter we show that, among the class of Boolean locality-sensitive hash functions $h\colon\{-1,1\}^{d}\to\{-1,1\}$ , bit-sampling is an optimal LSH (minimizes the $\rho$ -value) for the $((1-\alpha)d/2,(1-\beta)d/2)$ -near neighbor problem in Hamming space for every choice of $0\leq\beta<\alpha<1$ . This stands in contrast to the lower bound by O’Donnell et al. [121] which is unrestricted with respect to the range of the locality-sensitive hash functions. Bit-sampling only matches this unrestricted lower bound in the case where $\alpha,\beta$ approach $1$ . Our result settles the question of optimal Boolean locality-sensitive hashing for Hamming space and shows that we have to look towards families of hash functions with a larger range in order to further improve the $\rho$ -value compared to bit-sampling. Andoni et al. [13] have shown lower bounds on the $\rho$ -value on the unit sphere as a function of the size of the range of the hash function.

3.2 Part II: Pseudorandom hashing and number generation

Chapter 8: Generating $k$ -independent random variables in constant time.

We investigate the problem of efficiently generating $k$ -independent random variables and give an explicit generator of $k$ -independent random variables with constant generation time, independent of $k$ . The explicit construction combines multipoint evaluation of polynomials over finite fields with a cascading construction of explicit bipartite expander graphs by Capalbo et al. [39]. The space usage of this construction is $k\operatorname{poly}\log k$ with a very large exponent in the polynomial. We also show a randomized version of the same construction that uses a randomly generated bipartite graph. This reduces the space overhead to $O(\log^{3}k)$ at the cost of introducing an error probability (the generated sequence may fail to be $k$ -independent) that is polynomially small in $k$ . We implement a version of the generator that combines a random bipartite expander with fast multipoint evaluation of polynomials over $\mathbb{F}_{2^{64}}$ and show that it scales well, even for generating $k=2^{20}$ -independent variables.

Chapter 9: Near-optimal $k$ -independent hashing.

In this chapter we attack the problem of constructing fast $k$ -independent random hash functions. We use the fact that there is a sort of duality between randomized bipartite expander graphs and $k$ -independent random hash functions. A bipartite expander graph that expands on subsets of size $k$ can be used to construct a $k$ -independent family of functions, and a $k$ -independent function is likely to represent a bipartite expander that expands on subsets of size $k$ . We take a small bipartite expander graph and apply an inefficient graph product that preserves its expansion properties while increasing the size of the left vertex set (the size of the domain of the resulting hash function). Then we use this resulting bipartite expander graph to construct a $k$ -independent random hash function that now represents a new expander on a larger domain with optimal properties. By applying this strategy recursively using different graph products we are able to give randomized constructions of $k$ -independent hash functions in the word-RAM model that almost match Siegel’s cell probe lower bound [151].

4 Conclusion and open problems

4.1 Similarity search

We have shown new upper and lower bounds for problems related to approximate similarity search in high-dimensional spaces, showing improved locality-sensitive hashing frameworks, lower bounds for Boolean locality-sensitive hashing, and going beyond locality-sensitive hashing in several different directions with asymmetric locality-sensitive hashing, space-time tradeoffs through locality-sensitive filtering, and locality-sensitive maps for set similarity search.

Optimal data-independent locality-sensitive hashing.

It remains open to close the gap between the upper and lower bounds on the $\rho$ -value of $((1-\alpha)d/2,(1-\beta)d/2,p_{1},p_{2})$ -sensitive families in Hamming space (shown in Figure 1). Existing lower bounds seem to have explored the limits of what can be shown with our current understanding of hypercontractive inequalities and Fourier analysis of Boolean functions. We conjecture that the ball-based LSH construction with the $\rho$ -value given in equation 2 is asymptotically optimal for every choice of $0\leq\beta<\alpha<1$ .

Orthogonal search.

Suppose we are interested in an asymmetric locality-sensitive hashing scheme for the unit sphere under cosine similarity that can be used to search for orthogonal vectors. For this purpose we want the probability of collision to be as high as possible for $0$ -correlated (orthogonal) vectors and have the probability of collision decrease as the correlation becomes positive or negative. Let $p(\alpha)$ denote the probability of collision of the asymmetric locality-sensitive hashing scheme for a pair of $\alpha$ -correlated vectors. The current best upper bound on $\rho=\log(1/p(0))/\log(1/\max(p(\alpha),p(-\alpha)))$ is given by $(1-\alpha^{2})/(1+\alpha^{2})$ [22]. The lower bound presented in Chapter 6 only implies $\rho\geq(1-|\alpha|)/(1+|\alpha|)$ . Obtaining a “two-sided” lower bound that simultaneously relates $p(0)$ to both $p(\alpha)$ and $p(-\alpha)$ has close ties to the open symmetric Gaussian problem [119]. It is conjectured that the upper bound is tight.

Simple data-dependent constructions.

It is an important open problem to find simpler data-dependent solutions to approximate near neighbor search. Despite the intuitive appeal of using the data to inform the construction of the solution, relatively few people have succeeded in making theoretical progress in this area [14, 17, 18]. Perhaps by relaxing the problem slightly, for example by only requiring that queries that follow a specific distribution succeed with constant probability, progress can be made. An example of such a query distribution could be to sample one of the $n$ data points uniformly at random and sample the query from a ball around the data point. Attacking the problem for data structures that use near-linear space in $n$ also seems like a promising approach.

4.2 $k$ -independent hashing and generation

We have shown near-optimal results for $k$ -independent hashing and generation.

Optimal explicit unbalanced bipartite expander graphs.

The main open problem in this area is the explicit construction of highly unbalanced bipartite expander graphs with optimal properties. We would like to be able to evaluate the neighbor function $\Gamma\colon U\to V^{d}$ of a left $d$ -regular bipartite expander graph with optimal parameters (matching Siegel’s lower bound for $k$ -independent hashing) using time that is at most polynomial in the bit-length of the input. For the application to random hashing we would furthermore like to be able to list the $d$ neighbors of a vertex in time $O(d)$ . The construction in Chapter 9 is essentially able to solve this task in time $O(d\log d)$ , so it would require a very clean explicit construction to yield an improvement to the efficiency of random hashing in practice. Results on the construction of explicit bipartite expanders by Guruswami et al. [84] and preprocessing polynomials [96] are based directly on results such as the fundamental theorem of algebra and the Chinese remainder theorem and give hope that there exists a simple explicit construction.

Constant time generators with minimal space.

The fast generators in Chapter 8 use polynomials over finite fields and require space $k\operatorname{poly}\log k$ . Through the sequential evaluation of hash functions presented in Chapter 9 we can remove the need for arithmetic over finite fields, but it seems that if we want to use minimal space the evaluation time will still be $O(\log u)$ with space usage $k\operatorname{poly}(\log u,\log k)$ . Is it possible to get constant-time generation in a restricted word-RAM model without multiplication using space $O(k)$ ?

Part I Similarity search

Chapter 2 Fast locality-sensitive hashing frameworks

‘Renewed shall be blade that was broken’

The Indyk-Motwani Locality-Sensitive Hashing (LSH) framework (STOC 1998) is a general technique for constructing a data structure to answer approximate near neighbor queries by using a distribution $\mathcal{H}$ over locality-sensitive hash functions that partition space. For a collection of $n$ points, after preprocessing, the query time is dominated by $O(n^{\rho}\log n)$ evaluations of hash functions from $\mathcal{H}$ and $O(n^{\rho})$ hash table lookups and distance computations where $\rho\in(0,1)$ is determined by the locality-sensitivity properties of $\mathcal{H}$ . It follows from a recent result by Dahlgaard et al. (FOCS 2017) that the number of locality-sensitive hash functions can be reduced to $O(\log^{2}n)$ , leaving the query time to be dominated by $O(n^{\rho})$ distance computations and $O(n^{\rho}\log n)$ additional word-RAM operations. We state this result as a general framework and provide a simpler analysis showing that the number of lookups and distance computations closely match the Indyk-Motwani framework. Using ideas from another locality-sensitive hashing framework by Andoni and Indyk (SODA 2006) we are able to reduce the number of additional word-RAM operations to $O(n^{\rho})$ .

5 Introduction

The $(r_{1},r_{2})$ -approximate near neighbor problem is the problem of preprocessing a collection $P$ of $n$ points in a space $(X,\operatorname{dist})$ into a data structure that after preprocessing supports the following query operation: Given a query point $q\in X$ , if there exists a point $x\in P$ with $\operatorname{dist}(q,x)\leq r_{1}$ , then the data structure is guaranteed to return a point $x^{\prime}\in P$ such that $\operatorname{dist}(q,x^{\prime})<r_{2}$ .

Indyk and Motwani [91] introduced a general framework for constructing solutions to the approximate near neighbor problem using a technique known as locality-sensitive hashing (LSH). The framework takes a distribution over hash functions $\mathcal{H}$ with the property that near points are more likely to collide under a random $h\sim\mathcal{H}$ . During preprocessing a number of locality-sensitive hash functions are sampled from $\mathcal{H}$ and used to hash the points of $P$ into buckets. The query algorithm evaluates the same hash functions on the query point and looks into the associated buckets to find an approximate near neighbor.

The locality-sensitive hashing framework of Indyk and Motwani has had a large impact in both theory and practice (see surveys [12] and [165] for an introduction), and many of the best known (data-independent) solutions to the approximate near neighbor problem in high-dimensional spaces, such as Euclidean space [11], the unit sphere under inner product similarity [13], and sets under Jaccard similarity [33] come in the form of families of locality-sensitive hash functions that can be plugged into the Indyk-Motwani LSH framework. Recent work on data-dependent locality-sensitive hashing has further improved solutions for $\ell_{p}$ -spaces and cosine similarity [14, 17, 16], but these solutions typically do not come directly in the form of a distribution over locality-sensitive hash functions and as such it is unclear whether the techniques in this paper can yield further speedups to these results.

Definition 2.1 (Locality-sensitive hashing [91]).

Let $(X,\operatorname{dist})$ be a distance space and let $\mathcal{H}$ be a distribution over functions $h\colon X\to R$ . We say that $\mathcal{H}$ is $(r_{1},r_{2},p_{1},p_{2})$ -sensitive if for $x,y\in X$ and $h\sim\mathcal{H}$ we have that:

•

If $\operatorname{dist}(x,y)\leq r_{1}$ then $\Pr[h(x)=h(y)]\geq p_{1}$ .

•

If $\operatorname{dist}(x,y)\geq r_{2}$ then $\Pr[h(x)=h(y)]\leq p_{2}$ .
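As a concrete instance of Definition 2.1, consider bit sampling on $\{0,1\}^{d}$ under Hamming distance, where $h$ returns a uniformly random coordinate of its input. Two points at distance $r$ collide with probability $1-r/d$ , so the family is $(r_{1},r_{2},1-r_{1}/d,1-r_{2}/d)$ -sensitive. The following is a minimal sketch of this standard example; the function names are hypothetical and chosen for illustration only. It computes the collision probability exactly by enumerating all $d$ functions in the family.

```python
from fractions import Fraction

def bit_sampling_family(d):
    """All d functions h_i(x) = x[i]; sampling h ~ H picks i uniformly at random."""
    return [lambda x, i=i: x[i] for i in range(d)]

def collision_probability(family, x, y):
    """Exact Pr[h(x) = h(y)] for h drawn uniformly from a finite family."""
    hits = sum(1 for h in family if h(x) == h(y))
    return Fraction(hits, len(family))

def hamming(x, y):
    """Number of coordinates on which x and y differ."""
    return sum(a != b for a, b in zip(x, y))

d = 8
H = bit_sampling_family(d)
x    = (0, 1, 1, 0, 1, 0, 0, 1)
near = (0, 1, 1, 0, 1, 0, 0, 0)   # differs from x in 1 coordinate
far  = (1, 0, 0, 1, 1, 0, 0, 1)   # differs from x in 4 coordinates

p_near = collision_probability(H, x, near)   # equals 1 - 1/8
p_far = collision_probability(H, x, far)     # equals 1 - 4/8
```

With $r_{1}=1$ and $r_{2}=4$ this family is $(1,4,7/8,1/2)$ -sensitive for $d=8$ .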

The Indyk-Motwani framework takes a $(r_{1},r_{2},p_{1},p_{2})$ -sensitive family $\mathcal{H}$ and constructs a data structure that solves the approximate near neighbor problem for parameters $r_{1}<r_{2}$ with some positive constant probability of success. We will refer to this randomized approximate version of the near neighbor problem as the $(r_{1},r_{2})$ -near neighbor problem, where we require queries to succeed with probability at least $1/2$ (see Definition 2.2). To simplify the exposition we will assume throughout the introduction, unless otherwise stated, that $0<p_{2}<p_{1}<1$ are constant, that a hash function $h\in\mathcal{H}$ can be stored in $n/\log n$ words of space, and for $\rho=\log(1/p_{1})/\log(1/p_{2})\in(0,1)$ that a point $x\in X$ can be stored in $O(n^{\rho})$ words of space. The assumption of a constant gap between $p_{1}$ and $p_{2}$ allows us to avoid performing distance computations by instead using the $1$ -bit sketching scheme of Li and König [103] together with the family $\mathcal{H}$ to approximate distances (see Section 8.1 for details). In the remaining part of the paper we will state our results without any such assumptions to ensure, for example, that our results hold in the important case where $p_{1},p_{2}$ may depend on $n$ or the dimensionality of the space [11, 13].

Theorem 2.1 (Indyk-Motwani [91, 86], simplified).

Let $\mathcal{H}$ be $(r_{1},r_{2},p_{1},p_{2})$ -sensitive and let $\rho=\frac{\log(1/p_{1})}{\log(1/p_{2})}$ , then there exists a solution to the $(r_{1},r_{2})$ -near neighbor problem using $O(n^{1+\rho})$ words of space and with query time dominated by $O(n^{\rho}\log n)$ evaluations of functions from $\mathcal{H}$ .

The query time of the Indyk-Motwani framework is dominated by the number of evaluations of locality-sensitive hash functions. To make matters worse, almost all of the best known and most widely used locality-sensitive families have an evaluation time that is at least linear in the dimensionality of the underlying space [33, 47, 66, 11, 13]. Significant effort has been devoted to the problem of reducing the evaluation complexity of locality-sensitive hash families [156, 75, 65, 13, 97, 147, 148, 64], while the question of how many independent locality-sensitive hash functions are actually needed to solve the $(r_{1},r_{2})$ -near neighbor problem has received relatively little attention [10, 64].

This paper aims to bring attention to, strengthen, generalize, and simplify results that reduce the number of locality-sensitive hash functions used to solve the $(r_{1},r_{2})$ -near neighbor problem. In particular, we will extract a general framework from a technique introduced by Dahlgaard et al. [64] in the context of set similarity search under Jaccard similarity, showing that the number of locality-sensitive hash functions can be reduced to $O(\log^{2}n)$ in general. Reducing the number of locality-sensitive hash functions allows us to spend time $O(n^{\rho}/\log^{2}n)$ per hash function evaluation without increasing the overall complexity of the query algorithm — something which is particularly useful in Euclidean space where the best known LSH upper bounds offer a tradeoff between the $\rho$ -value that can be achieved and the evaluation complexity of the locality-sensitive hash function [11, 13, 97].

The main technical contribution of this paper is to reduce the word-RAM complexity of the general LSH framework from $O(n^{\rho}\log n)$ to $O(n^{\rho})$ by combining techniques from Dahlgaard et al. and Andoni and Indyk [10].

5.1 Related work

Indyk-Motwani.

The Indyk-Motwani framework uses $L=O(n^{\rho})$ independent partitions of space, each formed by overlaying $k=O(\log n)$ random partitions induced by $k$ random hash functions from a locality-sensitive family $\mathcal{H}$ . The parameter $k$ is chosen such that a random partition has the property that a pair of points $x,y\in X$ with $\operatorname{dist}(x,y)\leq r_{1}$ has probability $n^{-\rho}$ of ending up in the same part of the partition, while a pair of points with $\operatorname{dist}(x,y)\geq r_{2}$ has probability $n^{-1}$ of colliding. By randomly sampling $L=O(n^{\rho})$ such partitions we are able to guarantee that a pair of near points will collide with constant probability in at least one of them. Applying these $L$ partitions to our collection of data points $P$ and storing the result of each partition of $P$ in a hash table we obtain a data structure that solves the $(r_{1},r_{2})$ -near neighbor problem as outlined in Theorem 2.1 above. Sections 7 and 7.1 contain a more complete description of LSH-based frameworks and the Indyk-Motwani framework.

Andoni-Indyk.

As previously mentioned, many locality-sensitive hash functions happen to have a super-constant evaluation time. This motivated Andoni and Indyk to introduce a replacement for the Indyk-Motwani framework in a paper on substring near neighbor search [10]. The key idea is to re-use hash functions from a small collection of size $m\ll L$ by forming all $\binom{m}{t}$ combinations of $t$ hash functions. This technique is also known as tensoring and has seen some use in the work on alternative solutions to the approximate near neighbor problem, in particular the work on locality-sensitive filtering [73, 24, 54]. By applying the tensoring technique the Andoni-Indyk framework reduces the number of hash functions to $O(\exp(\sqrt{\rho\log n\log\log n}))=n^{o(1)}$ as stated in Theorem 2.2.

Theorem 2.2 (Andoni-Indyk [10], simplified).

Let $\mathcal{H}$ be $(r_{1},r_{2},p_{1},p_{2})$ -sensitive and let $\rho=\frac{\log(1/p_{1})}{\log(1/p_{2})}$ , then there exists a solution to the $(r_{1},r_{2})$ -near neighbor problem using $O(n^{1+\rho})$ words of space and with query time dominated by $O(\exp(\sqrt{\rho\log n\log\log n}))$ evaluations of functions from $\mathcal{H}$ and $O(n^{\rho})$ other word-RAM operations.

The paper by Andoni and Indyk did not state this result explicitly as a theorem in the same form as the Indyk-Motwani framework; the analysis made some implicit restrictive assumptions on $p_{1},p_{2}$ and ignored integer constraints. Perhaps for these reasons the result does not appear to have received much attention, although it has seen some limited use in practice [152]. In Section 7.2 we present a slightly different version of the Andoni-Indyk framework together with an analysis that satisfies integer constraints, providing a more accurate assessment of the performance of the framework in the general, unrestricted case.

Dahlgaard-Knudsen-Thorup.

The paper by Dahlgaard et al. [64] introduced a different technique for constructing the $L$ hash functions/partitions from a smaller collection of $m$ hash functions from $\mathcal{H}$ . Instead of forming all combinations of subsets of size $t$ as in the Andoni-Indyk framework, they sample $k$ hash functions from the collection to form each of the $L$ partitions. The paper focused on a particular application to set similarity search under Jaccard similarity, and stated the result in terms of a solution to this problem. In Section 7.3 we provide a simplified and tighter analysis to yield a general framework:

Theorem 2.3 (Dahlgaard-Knudsen-Thorup [64], simplified).

Let $\mathcal{H}$ be $(r_{1},r_{2},p_{1},p_{2})$ -sensitive and let $\rho=\frac{\log(1/p_{1})}{\log(1/p_{2})}$ , then there exists a solution to the $(r_{1},r_{2})$ -near neighbor problem using $O(n^{1+\rho})$ words of space and with query time dominated by $O(\log^{2}n)$ evaluations of functions from $\mathcal{H}$ and $O(n^{\rho}\log n)$ other word-RAM operations.

The analysis of [64] indicates that the Dahlgaard-Knudsen-Thorup framework, when compared to the Indyk-Motwani framework, would use at least $50$ times as many partitions (and a corresponding increase in the number of hash table lookups and distance computations) to solve the $(r_{1},r_{2})$ -near neighbor problem with success probability at least $1/2$ . Using elementary tools, the analysis in this paper shows that we only have to use twice as many partitions as the Indyk-Motwani framework to obtain the same guarantee of success.

Number of hash functions.

To provide some idea of the number of hash functions $H$ used by the different frameworks, Figure 2 shows the value of $\log_{2}H$ that is obtained by the Indyk-Motwani (IM), Andoni-Indyk (AI), and Dahlgaard-Knudsen-Thorup (DKT) frameworks according to the analysis in Section 7 for $p_{1}=1/2$ and every value of $0<p_{2}<1/2$ for a solution to the $(r_{1},r_{2})$ -near neighbor problem on a collection of $n=2^{30}$ points with success probability at least $1/2$ . Note that Figure 2 shows an upper bound on the number of hash functions used by the frameworks according to the analysis in order to provide a solution with theoretical guarantees to the approximate near neighbor problem for any data set, and not the actual setting required for a particular data set (we haven’t actually performed an experiment on $2^{30}$ points). In the analysis behind Figure 2 we have attempted to minimize $H$ within each respective framework.

Figure 2 reveals that the number of hash functions used by the Indyk-Motwani framework exceeds $2^{30}$ , the size of the collection of points $P$ , as $p_{2}$ approaches $p_{1}$ . In addition, locality-sensitive hash functions used in practice such as Charikar’s SimHash [47] and $p$ -stable LSH [66] have evaluation time $O(d)$ for points in $\mathbb{R}^{d}$ . These two factors might help explain why a linear scan over sketches of the entire collection of points is a popular approach to solve the approximate near neighbor problem in practice [168, 80]. The Andoni-Indyk framework reduces the number of hash functions by several orders of magnitude, and the Dahlgaard-Knudsen-Thorup framework presents another improvement of several orders of magnitude. Since the word-RAM complexity of the DKT framework matches the number of hash functions used by the IM framework, the gap between the solid line (DKT) and the dotted line (IM) gives some indication of the time we can spend on evaluating a single hash function in the DKT framework without suffering a noticeable increase in the query time.

5.2 Contribution

Improved word-RAM complexity.

In addition to our work on the Andoni-Indyk and Dahlgaard-Knudsen-Thorup frameworks as mentioned above, we show how the word-RAM complexity of the DKT framework can be reduced by a logarithmic factor. The solution is a simple combination of the DKT sampling technique and the AI tensoring technique: First we use the DKT sampling technique twice to construct two collections of $\sqrt{L}$ partitions. Then we use the AI tensoring technique to form $L=\sqrt{L}\times\sqrt{L}$ pairs of partitions from the two collections. Below we state our main Theorem 2.4 in its general form where we make no implicit assumptions about $\mathcal{H}$ ( $p_{1}$ and $p_{2}$ are not assumed to be constant and can depend on for example $n$ ) or about the complexity of storing a point or a hash function, or computing the distance between pairs of points in the space $(X,\operatorname{dist})$ .

Theorem 2.4.

Let $\mathcal{H}$ be $(r_{1},r_{2},p_{1},p_{2})$ -sensitive and let $\rho=\log(1/p_{1})/\log(1/p_{2})$ , then there exists a solution to the $(r_{1},r_{2})$ -near neighbor problem with the following properties:

•

The query complexity is dominated by $O(\log_{1/p_{2}}^{2}(n)/p_{1})$ evaluations of functions from $\mathcal{H}$ , $O(n^{\rho})$ distance computations, and $O(n^{\rho}/p_{1})$ other word-RAM operations.

•

The solution uses $O(n^{1+\rho}/p_{1})$ words of space in addition to the space required to store the data and $O(\log_{1/p_{2}}^{2}(n)/p_{1})$ functions from $\mathcal{H}$ .

Under the same simplifying assumptions used in the statements of Theorems 2.1, 2.2, and 2.3, our main Theorem 2.4 can be stated as Theorem 2.3 with the word-RAM complexity reduced by a logarithmic factor to $O(n^{\rho})$ . This improvement in the word-RAM complexity comes at the cost of a (rather small) constant factor increase in the number of hash functions, lookups, and distance computations compared to the DKT framework. By varying the size $m$ of the collection of hash functions from $\mathcal{H}$ and performing independent repetitions we can obtain a tradeoff between the number of hash functions and the number of lookups. In Section 9 we remark on some possible improvements in the case where $p_{2}$ is large.

Distance sketching using LSH.

Finally, we combine Theorem 2.4 with the 1-bit sketching scheme of Li and König [103] where we use the locality-sensitive hash family to create sketches that allow us to leverage word-level parallelism and avoid direct distance computations. This sketching technique is well known and has been used before in combination with LSH-based approximate similarity search [57], but we believe there is some value in the simplicity of the analysis and in a clear statement of the combination of the two results as given in Theorem 2.5, for example in the important case where $0<p_{2}<p_{1}<1$ are constant.

Theorem 2.5.

Let $\mathcal{H}$ be $(r_{1},r_{2},p_{1},p_{2})$ -sensitive and let $\rho=\log(1/p_{1})/\log(1/p_{2})$ , then there exists a solution to the $(r_{1},r_{2})$ -near neighbor problem with the following properties:

•

The complexity of the query operation is dominated by $O(\log^{2}(n)/(p_{1}-p_{2})^{2})$ evaluations of hash functions from $\mathcal{H}$ and $O(n^{\rho}/(p_{1}-p_{2})^{2})$ other word-RAM operations.

•

The solution uses $O(n^{1+\rho}/p_{1}+n/(p_{1}-p_{2})^{2})$ words of space in addition to the space required to store the data and $O(\log^{2}(n)/(p_{1}-p_{2})^{2})$ hash functions from $\mathcal{H}$ .

6 Preliminaries

Problem and dynamization.

We begin by defining the version of the approximate near neighbor problem that the frameworks presented in this paper will be solving:

Definition 2.2.

Let $P\subseteq X$ be a collection of $|P|=n$ points in a distance space $(X,\operatorname{dist})$ . A solution to the $(r_{1},r_{2})$ -near neighbor problem is a data structure that supports the following query operation: Given a query point $q\in X$ , if there exists a point $x\in P$ with $\operatorname{dist}(q,x)\leq r_{1}$ , then, with probability at least $1/2$ , return a point $x^{\prime}\in P$ such that $\operatorname{dist}(q,x^{\prime})<r_{2}$ .

We aim for solutions with a failure probability that is upper bounded by $1/2$ . The standard trick of using $\eta$ independent repetitions of the data structure allows us to reduce the probability of failure to $1/2^{\eta}$ . For the sake of simplicity we restrict our attention to static solutions, meaning that we do not concern ourselves with the complexity of updates to the underlying set $P$ , although it is simple to modify the static solutions presented in this paper to dynamic solutions where the update complexity essentially matches the query complexity [122, 86].

LSH powering.

The Indyk-Motwani framework and the Andoni-Indyk framework will make use of the following standard powering technique described in the introduction as “overlaying partitions”. Let $k\geq 1$ be an integer and let $\mathcal{H}$ denote a locality-sensitive family of hash functions as in Definition 2.1. We will use the notation $\mathcal{H}^{k}$ to denote the distribution over functions $h^{\prime}\colon X\to R^{k}$ where

$$h^{\prime}(x)=(h_{1}(x),\dots,h_{k}(x))$$

and $h_{1},\dots,h_{k}$ are sampled independently at random from $\mathcal{H}$ . It is easy to see that $\mathcal{H}^{k}$ is $(r_{1},r_{2},p_{1}^{k},p_{2}^{k})$ -sensitive. To deal with some special cases we define $\mathcal{H}^{0}$ to be the family consisting of a single constant function.
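The effect of powering can be checked exactly on a small example. The sketch below assumes the bit-sampling family on $\{0,1\}^{d}$ (a standard family, not specific to this paper) and uses hypothetical helper names; it enumerates all $d^{k}$ functions in $\mathcal{H}^{k}$ and confirms that the collision probability of a pair of points is raised to the $k$ th power.

```python
from fractions import Fraction
from itertools import product

def powered_family(d, k):
    """All functions h'(x) = (x[i_1], ..., x[i_k]) over index tuples in {0..d-1}^k."""
    return [lambda x, idx=idx: tuple(x[i] for i in idx)
            for idx in product(range(d), repeat=k)]

def collision_probability(family, x, y):
    """Exact Pr[h(x) = h(y)] for h drawn uniformly from a finite family."""
    hits = sum(1 for h in family if h(x) == h(y))
    return Fraction(hits, len(family))

x = (0, 1, 1, 0)
y = (0, 1, 0, 0)   # differs from x in one of d = 4 coordinates

p = collision_probability(powered_family(4, 1), x, y)     # single function from H
p_k = collision_probability(powered_family(4, 3), x, y)   # function from H^3
```

Here `p` is $3/4$ and `p_k` equals `p ** 3`, in line with $\mathcal{H}^{k}$ being $(r_{1},r_{2},p_{1}^{k},p_{2}^{k})$ -sensitive.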

Model of computation.

We will work in the standard word-RAM model of computation [85] with a word length of $\Theta(\log n)$ bits where $n$ denotes the size of the collection $P$ to be searched in the $(r_{1},r_{2})$ -near neighbor problem. During the preprocessing stage of our solutions we will assume access to a source of randomness that allows us to sample independently from a family $\mathcal{H}$ and to seed pairwise independent hash functions [41, 42]. The latter can easily be accomplished by augmenting the model with an instruction that generates a uniformly random word in constant time and using that to seed the tables of a Zobrist hash function [173].

7 Frameworks

Overview.

We will describe frameworks that take as input a $(r_{1},r_{2},p_{1},p_{2})$ -sensitive family $\mathcal{H}$ and a collection $P$ of $n$ points and construct a data structure that solves the $(r_{1},r_{2})$ -near neighbor problem. The frameworks described in this paper all use the same high-level technique of constructing $L$ hash functions $g_{1},\dots,g_{L}$ that are used to partition space such that a pair of points $x,y$ with $\operatorname{dist}(x,y)\leq r_{1}$ will end up in the same part of one of the $L$ partitions with probability at least $1/2$ . That is, for $x,y$ with $\operatorname{dist}(x,y)\leq r_{1}$ we have that $\Pr[\exists l\in[L]\colon g_{l}(x)=g_{l}(y)]\geq 1/2$ where $[L]$ is used to denote the set $\{1,2,\dots,L\}$ . At the same time we ensure that the expected number of collisions between pairs of points $x,y$ with $\operatorname{dist}(x,y)\geq r_{2}$ is at most one in each partition.

Preprocessing and queries.

During the preprocessing phase, for each of the $L$ hash functions $g_{1},\dots,g_{L}$ we compute the partition of the collection of points $P$ induced by $g_{l}$ and store it in a hash table in the form of key-value pairs $(z,\{x\in P\mid g_{l}(x)=z\})$ . To reduce space usage we store only a single copy of the collection $P$ and store references to $P$ in our $L$ hash tables. To guarantee lookups in constant time we can use the perfect hashing scheme by Fredman et al. [76] to construct our hash tables. We will assume that hash values $z=g_{l}(x)$ fit into $O(1)$ words. If this is not the case we can use universal hashing [40] to operate on fingerprints of the hash values.

We perform a query for a point $q$ as follows: for $l=1,\dots,L$ we compute $g_{l}(q)$ , retrieve the set of points $\{x\in P\mid g_{l}(x)=g_{l}(q)\}$ , and compute the distance between $q$ and each point in the set. If we encounter a point $x^{\prime}$ with $\operatorname{dist}(q,x^{\prime})<r_{2}$ then we return $x^{\prime}$ and terminate. If after querying the $L$ sets no such point is encountered we return a special symbol $\varnothing$ and terminate.
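The preprocessing and query steps shared by all frameworks can be sketched as follows. This is an illustrative outline only: the names `build_tables` and `query` are hypothetical, Python dictionaries stand in for the perfect hashing scheme of [76], and `None` plays the role of the symbol $\varnothing$ .

```python
from collections import defaultdict

def build_tables(points, gs):
    """One table per g_l, mapping a hash value z to the indices of points with g_l(x) = z."""
    tables = []
    for g in gs:
        table = defaultdict(list)
        for idx, x in enumerate(points):
            table[g(x)].append(idx)   # store references into `points`, not copies
        tables.append(table)
    return tables

def query(q, points, gs, tables, dist, r2):
    """Return a point at distance < r2 from q found in a colliding bucket, else None."""
    for g, table in zip(gs, tables):
        for idx in table.get(g(q), ()):
            if dist(q, points[idx]) < r2:
                return points[idx]
    return None

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

points = [(0, 0, 0, 0), (1, 1, 1, 1), (0, 0, 1, 1)]
# Two fixed partitions chosen by hand for illustration; a framework would sample these.
gs = [lambda x: (x[0], x[1]), lambda x: (x[2], x[3])]
tables = build_tables(points, gs)

found = query((0, 0, 0, 1), points, gs, tables, hamming, r2=2)
missing = query((1, 0, 1, 0), points, gs, tables, hamming, r2=1)
```

The first query collides with $(0,0,0,0)$ under $g_{1}$ and returns it, while the second query collides with no stored point and returns `None`. The frameworks below differ only in how the list `gs` is sampled and evaluated.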

We will proceed by describing and analyzing the solutions to the $(r_{1},r_{2})$ -near neighbor problem for different approaches to sampling, storing, and computing the $L$ hash functions $g_{1},\dots,g_{L}$ , resulting in the different frameworks as mentioned in the introduction.

7.1 Indyk-Motwani

To solve the $(r_{1},r_{2})$ -near neighbor problem using the Indyk-Motwani framework we sample $L$ hash functions $g_{1},\dots,g_{L}$ independently at random from the family $\mathcal{H}^{k}$ where we set $k=\lceil\log(n)/\log(1/p_{2})\rceil$ and $L=\lceil(\ln 2)/p_{1}^{k}\rceil$ . Correctness of the data structure follows from the observation that the probability that a pair of points $x,y$ with $\operatorname{dist}(x,y)\leq r_{1}$ does not collide under a randomly sampled $g_{l}\sim\mathcal{H}^{k}$ is at most $1-p_{1}^{k}$ . We can therefore upper bound the probability that a near pair of points does not collide under any of the hash functions by $(1-p_{1}^{k})^{L}\leq\exp(-p_{1}^{k}L)\leq 1/2$ using a standard bound stated as Lemma 2.3 in Appendix 11.

In the worst case, the query operation computes $L$ hash functions from $\mathcal{H}^{k}$ corresponding to $Lk$ hash functions from $\mathcal{H}$ . For a query point $q$ the expected number of points $x^{\prime}\in P$ with $\operatorname{dist}(q,x^{\prime})\geq r_{2}$ that collide with $q$ under a randomly sampled $g_{l}\sim\mathcal{H}^{k}$ is at most $np_{2}^{k}\leq np_{2}^{\log(n)/\log(1/p_{2})}=1$ . It follows from linearity of expectation that the total expected number of distance computations during a query is at most $L$ . The result is summarized in Theorem 2.6 from which the simplified Theorem 2.1 follows.

Theorem 2.6 (Indyk-Motwani [91, 86]).

Given a $(r_{1},r_{2},p_{1},p_{2})$ -sensitive family $\mathcal{H}$ we can construct a data structure that solves the $(r_{1},r_{2})$ -near neighbor problem such that for $k=\lceil\log(n)/\log(1/p_{2})\rceil$ and $L=\lceil(\ln 2)/p_{1}^{k}\rceil$ the data structure has the following properties:

•

The query operation uses at most $Lk$ evaluations of hash functions from $\mathcal{H}$ , expected $L$ distance computations, and $O(Lk)$ other word-RAM operations.

•

The data structure uses $O(nL)$ words of space in addition to the space required to store the data and $Lk$ hash functions from $\mathcal{H}$ .

Theorem 2.6 gives a bound on the expected number of distance computations while the simplified version stated in Theorem 2.1 uses Markov’s inequality and independent repetitions to remove the expectation from the bound by treating an excessive number of distance computations as a failure.
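To get a feeling for the parameters in Theorem 2.6, the following sketch evaluates $k$ and $L$ numerically together with the two quantities used in the analysis. The function name and the example values of $n$ , $p_{1}$ , and $p_{2}$ are assumptions made for illustration.

```python
import math

def indyk_motwani_parameters(n, p1, p2):
    """k and L as in Theorem 2.6, plus the two bounds used in the analysis."""
    k = math.ceil(math.log(n) / math.log(1 / p2))
    L = math.ceil(math.log(2) / p1 ** k)
    failure_bound = (1 - p1 ** k) ** L   # Pr[a near pair collides under no g_l]
    far_collisions = n * p2 ** k         # expected far collisions per partition
    return k, L, failure_bound, far_collisions

k, L, failure_bound, far_collisions = indyk_motwani_parameters(n=10**6, p1=0.5, p2=0.25)
```

For these values we get $k=10$ and $L=710$ , so a query evaluates up to $Lk=7100$ functions from $\mathcal{H}$ , while the failure bound stays below $1/2$ and the expected number of far collisions per partition stays below one.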

7.2 Andoni-Indyk

In 2006 Andoni and Indyk, as part of a paper on the substring near neighbor problem, introduced an improvement to the Indyk-Motwani framework that reduces the number of locality-sensitive hash functions [10]. Their improvement comes from the use of a technique that we will refer to as tensoring: setting the hash functions $g_{1},\dots,g_{L}$ to be all $t$ -tuples from a collection of $m$ functions sampled from $\mathcal{H}^{k/t}$ where $m\ll L$ . The analysis in [10] shows that by setting $m=n^{\rho/t}$ and repeating the entire scheme $t!$ times, the total number of hash functions can be reduced to $O(\exp(\sqrt{\rho\log n\log\log n}))$ when setting $t=\sqrt{\frac{\rho\log n}{\log\log n}}$ . This analysis ignores integer constraints on $t$ , $k$ , and $m$ , and implicitly places restrictions on $p_{1}$ and $p_{2}$ in relation to $n$ (e.g. $0<p_{2}<p_{1}<1$ are constant). We will introduce a slightly different scheme that takes into account integer constraints and analyze it without restrictions on the properties of $\mathcal{H}$ .

Assume that we are given a $(r_{1},r_{2},p_{1},p_{2})$ -sensitive family $\mathcal{H}$ . Let $\eta,t,k_{1},k_{2},m_{1},m_{2}$ be non-negative integer parameters. Each of the $L$ hash functions $g_{1},\dots,g_{L}$ will be formed by concatenating one hash function from each of $t$ collections of $m_{1}$ hash functions from $\mathcal{H}^{k_{1}}$ and concatenating a last hash function from a collection of $m_{2}$ hash functions from $\mathcal{H}^{k_{2}}$ . We take all $m_{1}^{t}m_{2}$ hash functions of the above form and repeat $\eta$ times for a total of $L=\eta m_{1}^{t}m_{2}$ hash functions constructed from a total of $H=\eta(m_{1}k_{1}t+m_{2}k_{2})$ hash functions from $\mathcal{H}$ . In Appendix 12 we set parameters, leaving $t$ variable, and provide an analysis of this scheme, showing that $L$ matches the Indyk-Motwani framework bound of $O(1/p_{1}^{k})$ up to a constant where $k=\lceil\log(n)/\log(1/p_{2})\rceil$ as in Theorem 2.6.
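The saving from tensoring comes from evaluating each base function once and then combining cached values. The sketch below is a simplified illustration with hypothetical names: it uses $t$ collections of equal size and omits the extra collection from $\mathcal{H}^{k_{2}}$ and the $\eta$ repetitions of the scheme described above. A small wrapper class counts evaluations so that the saving is visible.

```python
from itertools import product

class CountingHash:
    """Wraps a hash function and counts how often it is evaluated."""
    def __init__(self, fn):
        self.fn = fn
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return self.fn(x)

def tensored_hash_values(collections, x):
    """Evaluate every base function once, then form all tuples with one value per collection."""
    cached = [[h(x) for h in coll] for coll in collections]   # m * t evaluations
    return list(product(*cached))                             # m ** t combined hash values

x = (3, 1, 4, 1, 5, 9)
# Coordinate lookups stand in for functions sampled from a locality-sensitive family.
collections = [
    [CountingHash(lambda x, i=i: x[i]) for i in (0, 1, 2)],
    [CountingHash(lambda x, i=i: x[i]) for i in (3, 4, 5)],
]
values = tensored_hash_values(collections, x)
evaluations = sum(h.calls for coll in collections for h in coll)
```

With $t=2$ collections of $m=3$ functions we obtain $m^{t}=9$ combined hash values from only $mt=6$ evaluations.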

Setting $t$ .

It remains to show how to set $t$ to obtain a good bound on the number of hash functions $H$ . Note that in practice we can simply set $t=\operatorname*{arg\,min}_{t}H$ by trying $t=1,\dots,k$ . If we ignore integer constraints and place certain restrictions on $\mathcal{H}$ as in the original tensoring scheme by Andoni and Indyk we want to set $t$ to minimize the expression $t^{t}n^{\rho/t}$ . This minimum is obtained when setting $t$ such that $t^{2}\log t=\rho\log n$ . We therefore cannot do much better than setting $t=\sqrt{\rho\log(n)/\log\log n}$ which gives the bound $H=O(\exp(\sqrt{\rho\log(n)\log\log n}))$ as shown in [10]. To allow for easy comparison with the Indyk-Motwani framework without placing restrictions on $\mathcal{H}$ we set $t=\lceil\sqrt{k}\rceil$ , resulting in Theorem 2.7.
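The choice of $t$ can be explored numerically. The following sketch, with hypothetical function names and example values $\rho=1/2$ and $n=2^{30}$ , minimizes the simplified expression $t^{t}n^{\rho/t}$ over integers by brute force and compares the result with the closed-form setting $t=\sqrt{\rho\log(n)/\log\log n}$ . It works with logarithms to avoid large intermediate values.

```python
import math

def log_cost(t, rho, n):
    """Natural logarithm of t^t * n^(rho / t)."""
    return t * math.log(t) + (rho / t) * math.log(n)

def best_integer_t(rho, n, t_max):
    """Integer t in 1..t_max minimizing t^t * n^(rho / t)."""
    return min(range(1, t_max + 1), key=lambda t: log_cost(t, rho, n))

rho, n = 0.5, 2**30
t_star = best_integer_t(rho, n, t_max=30)
t_formula = math.sqrt(rho * math.log(n) / math.log(math.log(n)))
```

For these values the integer minimizer is $t=2$ and the closed-form setting is roughly $1.85$ , so rounding the formula gives the same choice.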

Theorem 2.7.

Given a $(r_{1},r_{2},p_{1},p_{2})$ -sensitive family $\mathcal{H}$ we can construct a data structure that solves the $(r_{1},r_{2})$ -near neighbor problem such that for $k=\lceil\log(n)/\log(1/p_{2})\rceil$ , $H=k(\sqrt{k}/p_{1})^{\sqrt{k}}$ , and $L=\lceil 1/p_{1}^{k}\rceil$ the data structure has the following properties:

•

The query operation uses $O(H)$ evaluations of functions from $\mathcal{H}$ , $O(L)$ distance computations, and $O(L+H)$ other word-RAM operations.

•

The data structure uses $O(nL)$ words of space in addition to the space required to store the data and $O(H)$ hash functions from $\mathcal{H}$ .

Thus, compared to the Indyk-Motwani framework we have gone from using $O(k(1/p_{1})^{k})$ locality-sensitive hash functions to $O(k(\sqrt{k}/p_{1})^{\sqrt{k}})$ locality-sensitive hash functions. Figure 2 shows the actual number of hash functions of the revised version of the Andoni-Indyk scheme as analyzed in Appendix 12 when $t$ is set to minimize $H$ .

7.3 Dahlgaard-Knudsen-Thorup

In a recent paper Dahlgaard et al. [64] introduce a different technique for reducing the number of locality-sensitive hash functions. The idea is to construct each hash value $g_{l}(x)$ by sampling and concatenating $k$ hash values from a collection of $km$ pre-computed hash functions from $\mathcal{H}$ . Dahlgaard et al. applied this technique to provide a fast solution to the approximate near neighbor problem for sets under Jaccard similarity. In this paper we use the same technique to derive a general framework solution that works with every family of locality-sensitive hash functions, reducing the number of locality-sensitive hash functions compared to the Indyk-Motwani and Andoni-Indyk frameworks.

Let $[n]$ denote the set of integers $\{1,2,\dots,n\}$ . For $i\in[k]$ and $j\in[m]$ let $h_{i,j}\sim\mathcal{H}$ denote a hash function in our collection. To sample from the collection we use $k$ pairwise independent hash functions [42] of the form $f_{i}\colon[L]\to[m]$ and set

$$g_{l}(x)=(h_{1,f_{1}(l)}(x),\dots,h_{k,f_{k}(l)}(x)).$$
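The sampling scheme can be sketched as follows, where $g_{l}(x)$ takes its $i$ th component from the pre-computed value $h_{i,f_{i}(l)}(x)$ . All names are hypothetical, indices start at zero rather than one, and the index functions use the common $((al+b)\bmod p)\bmod m$ construction, which is only approximately uniform over $[m]$ and serves here as a stand-in for the pairwise independent family of [42]. Coordinate lookups again stand in for functions sampled from $\mathcal{H}$ .

```python
import random

PRIME = 2_147_483_647   # the Mersenne prime 2^31 - 1; assumed to exceed L

def sample_index_function(m, rng):
    """f(l) = ((a*l + b) mod PRIME) mod m, an approximately pairwise independent map into range(m)."""
    a = rng.randrange(1, PRIME)
    b = rng.randrange(0, PRIME)
    return lambda l: ((a * l + b) % PRIME) % m

def dkt_hash_values(x, hs, fs, L):
    """hs[i][j] plays the role of h_{i,j} and fs[i] of f_i. Returns g_0(x), ..., g_{L-1}(x)."""
    k = len(hs)
    cached = [[h(x) for h in row] for row in hs]            # k * m evaluations in total
    return [tuple(cached[i][fs[i](l)] for i in range(k))    # O(k) table lookups per g_l
            for l in range(L)]

rng = random.Random(0)
d, k, m, L = 16, 3, 4, 10
hs = [[(lambda x, c=rng.randrange(d): x[c]) for _ in range(m)] for _ in range(k)]
fs = [sample_index_function(m, rng) for _ in range(k)]
x = tuple(rng.randrange(2) for _ in range(d))
gx = dkt_hash_values(x, hs, fs, L)
```

Only $km$ functions from $\mathcal{H}$ are evaluated regardless of $L$ , but assembling each $g_{l}(x)$ still costs $O(k)$ operations.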

To show correctness of this scheme we will make use of an elementary one-sided version of Chebyshev’s inequality stating that for a random variable $Z$ with mean $\mu>0$ and variance $\sigma^{2}<\infty$ we have that $\Pr[Z\leq 0]\leq\sigma^{2}/(\mu^{2}+\sigma^{2})$ . For completeness we have included the proof of this inequality in Lemma 2.5 in Appendix 11. We will apply this inequality to upper bound the probability that there are no collisions between close pairs of points. For two points $x$ and $y$ let $Z_{l}=\mathds{1}\{g_{l}(x)=g_{l}(y)\}$ so that $Z=\sum_{l=1}^{L}Z_{l}$ denotes the number of collisions under the $L$ hash functions. To apply the inequality we need to derive an expression for the expectation and the variance of the random variable $Z$ . Let $p=\Pr_{h\sim\mathcal{H}}[h(x)=h(y)]$ then by linearity of expectation we have that $\mu=\operatorname*{\mathbb{E}}[Z]=Lp^{k}$ . To bound $\sigma^{2}=\operatorname*{\mathbb{E}}[Z^{2}]-\mu^{2}$ we proceed by bounding $\operatorname*{\mathbb{E}}[Z^{2}]$ where we note that $Z_{l}=\prod_{i=1}^{k}Y_{l,i}$ for $Y_{l,i}=\mathds{1}\{h_{i,f_{i}(l)}(x)=h_{i,f_{i}(l)}(y)\}$ and make use of the independence between $Y_{l,i}$ and $Y_{l^{\prime},i^{\prime}}$ for $i\neq i^{\prime}$ .

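$$\operatorname*{\mathbb{E}}[Z^{2}]=\sum_{l\neq l^{\prime}}\operatorname*{\mathbb{E}}[Z_{l}Z_{l^{\prime}}]+\sum_{l=1}^{L}\operatorname*{\mathbb{E}}[Z_{l}]=\sum_{l\neq l^{\prime}}\prod_{i=1}^{k}\operatorname*{\mathbb{E}}[Y_{l,i}Y_{l^{\prime},i}]+\mu\leq L^{2}\left(\operatorname*{\mathbb{E}}[Y_{l,i}Y_{l^{\prime},i}]\right)^{k}+\mu.$$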

We have that $\operatorname*{\mathbb{E}}[Y_{l,i}Y_{l^{\prime},i}]=\Pr[f_{i}(l)=f_{i}(l^{\prime})]p+\Pr[f_{i}(l)\neq f_{i}(l^{\prime})]p^{2}=(1/m)p+(1-1/m)p^{2}$ which follows from the pairwise independence of $f_{i}$ . Let $\varepsilon>0$ and set $m=\lceil\frac{1-p_{1}}{p_{1}}\frac{k}{\ln(1+\varepsilon)}\rceil$ then for $p\geq p_{1}$ we have that $\left(\operatorname*{\mathbb{E}}[Y_{l,i}Y_{l^{\prime},i}]\right)^{k}\leq(1+\varepsilon)p^{2k}$ . This allows us to bound the variance of $Z$ by $\sigma^{2}\leq\varepsilon\mu^{2}+\mu$ resulting in the following lower bound on the probability of collision between similar points.

Lemma 2.1.

For $\varepsilon>0$ let $m\geq\lceil\frac{1-p_{1}}{p_{1}}\frac{k}{\ln(1+\varepsilon)}\rceil$ , then for every pair of points $x,y$ with $\operatorname{dist}(x,y)\leq r_{1}$ we have that

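$$\Pr[\exists l\in[L]\colon g_{l}(x)=g_{l}(y)]\geq 1-\frac{1+\varepsilon Lp_{1}^{k}}{1+(1+\varepsilon)Lp_{1}^{k}}.$$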

By setting $\varepsilon=1/4$ and $L=\lceil(2\ln(2))/p_{1}^{k}\rceil$ we obtain an upper bound on the failure probability of $1/2$ . Setting the size of each of the $k$ collections of pre-computed hash values to $m=\lceil 5k/p_{1}\rceil$ is sufficient to yield the following solution to the $(r_{1},r_{2})$ -near neighbor problem where we provide exact bounds on the number of lookups $L$ and hash functions $H$ :

Theorem 2.8 (Dahlgaard-Knudsen-Thorup [64]).

Given a $(r_{1},r_{2},p_{1},p_{2})$ -sensitive family $\mathcal{H}$ we can construct a data structure that solves the $(r_{1},r_{2})$ -near neighbor problem such that for $k=\lceil\log(n)/\log(1/p_{2})\rceil$ , $H=k\lceil 5k/p_{1}\rceil$ , and $L=\lceil(2\ln(2))/p_{1}^{k}\rceil$ the data structure has the following properties:

•

The query operation uses at most $H$ evaluations of hash functions from $\mathcal{H}$ , expected $L$ distance computations, and $O(Lk)$ other word-RAM operations.

•

The data structure uses $O(nL)$ words of space in addition to the space required to store the data and $H$ hash functions from $\mathcal{H}$ .

Compared to the Indyk-Motwani framework we have reduced the number of locality-sensitive hash functions $H$ from $O(k(1/p_{1})^{k})$ to $O(k^{2}/p_{1})$ at the cost of using twice as many lookups. To reduce the number of lookups further we can decrease $\varepsilon$ and perform several independent repetitions. This comes at the cost of an increase in the number of hash functions $H$ .
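The parameter choices behind Theorem 2.8 can be checked numerically. The sketch below, with a hypothetical function name and example values for $n$ , $p_{1}$ , and $p_{2}$ , plugs the variance bound $\sigma^{2}\leq\varepsilon\mu^{2}+\mu$ into the one-sided Chebyshev inequality, using the lower bound $\mu\geq Lp_{1}^{k}$ for a near pair. Since the resulting bound decreases in $\mu$ , substituting the lower bound yields a valid upper bound on the failure probability.

```python
import math

def dkt_parameters(n, p1, p2, eps=0.25):
    """Parameters of the DKT framework and the resulting bound on Pr[Z <= 0]."""
    k = math.ceil(math.log(n) / math.log(1 / p2))
    m = math.ceil((1 - p1) / p1 * k / math.log(1 + eps))   # size of each of the k collections
    L = math.ceil(2 * math.log(2) / p1 ** k)
    mu = L * p1 ** k                        # lower bound on E[Z] for a near pair
    var = eps * mu ** 2 + mu                # upper bound on the variance of Z
    failure_bound = var / (mu ** 2 + var)   # one-sided Chebyshev bound
    return k, m, L, failure_bound

k, m, L, failure_bound = dkt_parameters(n=10**6, p1=0.5, p2=0.25)
```

For these values $k=10$ , $m=45$ (below the sufficient value $\lceil 5k/p_{1}\rceil=100$ ), and $L=1420$ , which is twice the number of lookups of the Indyk-Motwani framework for the same input, while the number of functions from $\mathcal{H}$ drops from $Lk=7100$ to $km=450$ .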

8 Reducing the word-RAM complexity

One drawback of the DKT framework is that each hash value $g_{l}(x)$ still takes $O(k)$ word-RAM operations to compute, even after the underlying locality-sensitive hash functions are known. This results in a bound on the total number of additional word-RAM operations of $O(Lk)$ . We show how to combine the DKT universal hashing technique with the AI tensoring technique to ensure that the running time is dominated by $O(L)$ distance computations and $O(H)$ hash function evaluations. The idea is to use the DKT scheme to construct two collections of respectively $L_{1}$ and $L_{2}$ hash functions, and then to use the AI tensoring approach to form $g_{1},\dots,g_{L}$ as the $L=L_{1}\times L_{2}$ combinations of functions from the two collections. The number of lookups can be reduced by applying tensoring several times in independent repetitions, but for the sake of simplicity we use a single repetition. For the usual setting of $k=\lceil\log(n)/\log(1/p_{2})\rceil$ let $k_{1}=\lceil k/2\rceil$ and $k_{2}=\lfloor k/2\rfloor$ . Set $L_{1}=\lceil 6(1/p_{1})^{k_{1}}\rceil$ and $L_{2}=\lceil 6(1/p_{1})^{k_{2}}\rceil$ . According to Lemma 2.1 if we set $\varepsilon=1/6$ the success probability of each collection is at least $3/4$ and by a union bound the probability that either collection fails to contain a colliding hash function is at most $1/2$ . This concludes the proof of our main Theorem 2.4.
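The constants in this argument can be verified with exact rational arithmetic. The sketch below uses hypothetical function names and the example values $k=10$ and $p_{1}=1/2$ ; it evaluates the one-sided Chebyshev bound for each of the two collections with $\varepsilon=1/6$ and adds the two bounds as in the union bound. The helper `combine` illustrates why each $g_{l}$ costs $O(1)$ operations once the $L_{1}+L_{2}$ partial hash values are known.

```python
import math
from fractions import Fraction

def one_sided_failure_bound(mu, eps):
    """sigma^2 / (mu^2 + sigma^2) with sigma^2 replaced by its upper bound eps*mu^2 + mu."""
    var = eps * mu ** 2 + mu
    return var / (mu ** 2 + var)

def tensored_parameters(k, p1, eps=Fraction(1, 6)):
    """Sizes of the two DKT collections and a union bound on the failure probability."""
    k1, k2 = (k + 1) // 2, k // 2   # ceil(k/2) and floor(k/2)
    L1 = math.ceil(6 / p1 ** k1)
    L2 = math.ceil(6 / p1 ** k2)
    fail = (one_sided_failure_bound(L1 * p1 ** k1, eps)
            + one_sided_failure_bound(L2 * p1 ** k2, eps))
    return L1, L2, fail

def combine(values1, values2):
    """All L1 * L2 pairs of partial hash values; each pair is formed in constant time."""
    return [(a, b) for a in values1 for b in values2]

L1, L2, fail = tensored_parameters(k=10, p1=Fraction(1, 2))
pairs = combine(["a", "b"], [0, 1, 2])
```

For these values $L_{1}=L_{2}=192$ , each collection fails with probability at most $1/4$ , and the combined bound is exactly $1/2$ . The total number of lookups is $L_{1}L_{2}=36/p_{1}^{k}$ , a constant factor above the DKT framework.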

8.1 Sketching

The theorems of the previous section made no assumptions on the word-RAM complexity of distance computations and instead stated the number of distance computations as part of the query complexity. We can use a $(r_{1},r_{2},p_{1},p_{2})$ -sensitive family $\mathcal{H}$ to create sketches that allow us to efficiently approximate the distance between pairs of points, provided that the gap between $p_{1}$ and $p_{2}$ is sufficiently large. In this section we will re-state the results of Theorem 2.4 when applying the family $\mathcal{H}$ to create sketches using the 1-bit sketching scheme of Li and König [103]. Let $b$ be a positive integer denoting the length of the sketches in bits. The advantage of this scheme is that we can use word level parallelism to evaluate a sketch of $b$ bits in time $O(b/\log n)$ in our word-RAM model with word length $\Theta(\log n)$ .

For $i=1,\dots,b$ let $h_{i}\colon X\to R$ denote a randomly sampled locality-sensitive hash function from $\mathcal{H}$ and let $f_{i}\colon R\to\{0,1\}$ denote a randomly sampled universal hash function. We let $s(x)\in\{0,1\}^{b}$ denote the sketch of a point $x\in X$ where we set the $i$ th bit of the sketch $s(x)_{i}=f_{i}(h_{i}(x))$ . For two points $x,y\in X$ the probability that they agree on the $i$ th bit is $1$ if the points collide under $h_{i}$ and $1/2$ otherwise.
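The construction can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses bit sampling in Hamming space as a stand-in for the locality-sensitive family $\mathcal{H}$ , and a multiply-add hash modulo a Mersenne prime as a stand-in for the universal hash functions $f_{i}$ . The names `make_sketcher` and `agreements` are hypothetical. The helper counts agreeing bits, which is how we read the comparison against $\lambda b$ in this section.

```python
import random

PRIME = (1 << 61) - 1  # Mersenne prime used by the toy universal hash

def make_sketcher(d, b, seed=0):
    """Sample b pairs (h_i, f_i) and return a function x -> b-bit sketch."""
    rng = random.Random(seed)
    coords = [rng.randrange(d) for _ in range(b)]        # h_i(x) = x[coords[i]]
    mults = [rng.randrange(1, PRIME) for _ in range(b)]  # parameters of f_i
    adds = [rng.randrange(PRIME) for _ in range(b)]

    def sketch(x):
        # i-th bit is f_i(h_i(x)), with f_i mapping the hash value to {0, 1}
        return [((mults[i] * x[coords[i]] + adds[i]) % PRIME) % 2
                for i in range(b)]

    return sketch

def agreements(sx, sy):
    """Number of positions where two sketches agree."""
    return sum(1 for u, v in zip(sx, sy) if u == v)

sketch = make_sketcher(d=16, b=64, seed=1)
x = [0, 1] * 8
print(agreements(sketch(x), sketch(x)))  # identical points agree on every bit
```

A word-RAM implementation would pack the bits into machine words and count agreements with XOR and population count; the list representation here is chosen for readability only.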

[TABLE]

We will apply these sketches during our query procedure instead of direct distance computations when searching through the points in the $L$ buckets, comparing them to our query point $q$ . Let $\lambda\in(0,1)$ be a parameter that will determine whether we report a point or not. For sketches of length $b$ we will return a point $x$ if $\left\lVert s(q)-s(x)\right\rVert_{1}>\lambda b$ . An application of Hoeffding’s inequality gives us the following properties of the sketch:

Lemma 2.2.

Let $\mathcal{H}$ be a $(r_{1},r_{2},p_{1},p_{2})$ -sensitive family and let $\lambda=(1+p_{2})/2+(p_{1}-p_{2})/4$ , then for sketches of length $b\geq 1$ and for every pair of points $x,y\in X$ :

•

If $\operatorname{dist}(x,y)\leq r_{1}$ then $\Pr[\left\lVert s(x)-s(y)\right\rVert_{1}\leq\lambda b]\leq e^{-b(p_{1}-p_{2})^{2}/8}$ .

•

If $\operatorname{dist}(x,y)\geq r_{2}$ then $\Pr[\left\lVert s(x)-s(y)\right\rVert_{1}>\lambda b]\leq e^{-b(p_{1}-p_{2})^{2}/8}$ .

If we replace the exact distance computations with sketches we want to avoid two events: Failing to report a point with $\operatorname{dist}(q,x)\leq r_{1}$ and reporting a point $x$ with $\operatorname{dist}(q,x)\geq r_{2}$ . By setting $b=O(\ln(n)/(p_{1}-p_{2})^{2})$ and applying a union bound over the $n$ events that the sketch fails for a point in our collection $P$ we obtain Theorem 2.5.
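To see where the sketch length comes from, one can solve $n\,e^{-b(p_{1}-p_{2})^{2}/8}\leq\delta$ for $b$ , which gives $b\geq 8\ln(n/\delta)/(p_{1}-p_{2})^{2}$ . The snippet below is a minimal sketch of that calculation; the target failure probability `delta` and the function names are assumptions introduced for illustration, and the constants follow Lemma 2.2 as stated.

```python
import math

def sketch_threshold(p1, p2):
    """Threshold lambda from Lemma 2.2."""
    return (1 + p2) / 2 + (p1 - p2) / 4

def sketch_length(n, p1, p2, delta):
    """Smallest b with n * exp(-b (p1 - p2)^2 / 8) <= delta."""
    gap = p1 - p2
    return math.ceil(8 * math.log(n / delta) / gap ** 2)

def union_bound_failure(n, b, p1, p2):
    """Union bound over n points on the probability that some sketch fails."""
    return n * math.exp(-b * (p1 - p2) ** 2 / 8)

b = sketch_length(n=1000, p1=0.75, p2=0.25, delta=0.1)
print(sketch_threshold(0.75, 0.25), b, union_bound_failure(1000, b, 0.75, 0.25))
```

The quadratic dependence on $1/(p_{1}-p_{2})$ is what makes sketching attractive only when the gap between the collision probabilities is reasonably large.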

9 The number of hash functions in corner cases

When the collision probabilities of the $(r_{1},r_{2},p_{1},p_{2})$ -sensitive family $\mathcal{H}$ are close to one we get the behavior displayed in Figure 3 where we have set $p_{1}=0.9$ . Here it may be possible to reduce the number of hash functions by applying the DKT framework to the family $\mathcal{H}^{\tau}$ for some positive integer $\tau$ . That is, instead of applying the DKT technique directly to $\mathcal{H}$ we first apply the powering trick to produce the family $\mathcal{H}^{\tau}$ . The number of locality-sensitive hash functions from $\mathcal{H}$ used by the DKT framework is given by $H=O((\log(n)/\log(1/p_{2}))^{2}/p_{1})$ . If we instead use the family $\mathcal{H}^{\tau}$ the expression becomes $H=O(\tau(\log(n)/\log(1/p_{2}^{\tau}))^{2}/p_{1}^{\tau})=O((\log(n)/\log(1/p_{2}))^{2}/(\tau p_{1}^{\tau}))$ . Ignoring integer constraints, the value of $\tau$ that maximizes $\tau p_{1}^{\tau}$ , thereby minimizing $H$ , is given by $\tau=1/\ln(1/p_{1})$ . Discretizing, the resulting number of hash functions when setting $\tau=\lceil 1/\ln(1/p_{1})\rceil$ is given by $H=O(\rho(\log n)^{2}/(p_{1}\log(1/p_{2})))$ . For constant $\rho$ and large $p_{2}$ this reduces the number of hash functions by a factor $1/\log(1/p_{2})$ .
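The choice of $\tau$ follows from setting the derivative of $\tau p_{1}^{\tau}$ to zero: $p_{1}^{\tau}(1+\tau\ln p_{1})=0$ gives $\tau=1/\ln(1/p_{1})$ . The following small check, with hypothetical function names, evaluates the discretized choice for $p_{1}=0.9$ and compares it against a brute-force search over integer values of $\tau$ .

```python
import math

def powering_exponent(p1):
    """Discretized maximizer of tau * p1**tau."""
    return math.ceil(1.0 / math.log(1.0 / p1))

def gain(tau, p1):
    """The quantity tau * p1**tau that appears in the denominator of H."""
    return tau * p1 ** tau

p1 = 0.9
tau = powering_exponent(p1)
best = max(gain(t, p1) for t in range(1, 200))
print(tau, gain(tau, p1), best, gain(tau, p1) / gain(1, p1))
```

For this value of $p_{1}$ the rounded choice attains the integer optimum up to floating-point error; in general the integer optimum lies at the floor or the ceiling of $1/\ln(1/p_{1})$ , so rounding up costs at most a constant factor.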

The behavior for small values of $p_{1}$ is displayed in Figure 4 where we have set $p_{1}=0.1$ .

10 Conclusion and open problems

We have shown that there exists a simple and general framework for solving the $(r_{1},r_{2})$ -near neighbor problem using only few locality-sensitive hash functions and with a reduced word-RAM complexity matching the number of lookups. The analysis in this paper indicates that the performance of the Dahlgaard-Knudsen-Thorup framework is highly competitive compared to the Indyk-Motwani framework in practice, especially when locality-sensitive hash functions are expensive to evaluate, as is often the case.

An obvious open problem is to provide a framework that uses fewer than $O(k^{2}/p_{1})$ locality-sensitive hash functions. Another direction would be to find a lower bound on the number of independent locality-sensitive hash functions required to solve the ANN problem using LSH in a suitably restricted model.

Acknowledgement

I want to thank Rasmus Pagh for commenting on an earlier version of this manuscript and for making me aware of the application of the tensoring technique in [152] that led me to the Andoni-Indyk framework [10].

11 Appendix: Inequalities

We make use of the following standard inequalities for the exponential function. See [111, Chapter 3.6.2] for more details.

Lemma 2.3.

Let $n,t\in\mathbb{R}$ such that $n\geq 1$ and $|t|\leq n$ then $e^{-t}(1-t^{2}/n)\leq(1-t/n)^{n}\leq e^{-t}$ .

Lemma 2.4.

For $t\geq 0$ we have that $e^{-t}\leq 1-t+t^{2}/2$ .

We make use of a one-sided version of Chebyshev’s inequality to show correctness of the Dahlgaard-Knudsen-Thorup LSH framework.

Lemma 2.5 (Cantelli’s inequality).

Let $Z$ be a random variable with $\operatorname*{\mathbb{E}}[Z]=\mu>0$ and $\mathrm{Var}[Z]=\sigma^{2}<\infty$ then $\Pr[Z\leq 0]\leq\sigma^{2}/(\mu^{2}+\sigma^{2})$ .

Proof.

For every $s\in\mathbb{R}$ we have that

[TABLE]

Next we apply Markov’s inequality

[TABLE]

Set $s=\sigma^{2}/\mu$ and use that $\sigma^{2}=s\mu$ to simplify

[TABLE]

∎
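For readers who want the intermediate steps spelled out, the standard argument behind Cantelli’s inequality can be written as follows. This is a reconstruction of the textbook proof, offered as an assumption about the steps rather than a verbatim restatement of the displays above; the squaring step uses $\mu+s>0$ , which holds for the final choice $s=\sigma^{2}/\mu$ .

```latex
\begin{align*}
\Pr[Z \le 0]
  &= \Pr[\mu - Z + s \ge \mu + s]
   \le \Pr\!\left[(\mu - Z + s)^{2} \ge (\mu + s)^{2}\right] \\
  &\le \frac{\mathbb{E}\!\left[(\mu - Z + s)^{2}\right]}{(\mu + s)^{2}}
   = \frac{\sigma^{2} + s^{2}}{(\mu + s)^{2}}
   && \text{(Markov's inequality)} \\
  &= \frac{s\mu + s^{2}}{(\mu + s)^{2}}
   = \frac{s}{\mu + s}
   = \frac{\sigma^{2}}{\mu^{2} + \sigma^{2}}
   && \text{(with } s = \sigma^{2}/\mu \text{)}
\end{align*}
```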

To analyze the 1-bit sketching scheme by Li and König we make use of Hoeffding’s inequality:

Lemma 2.6 (Hoeffding [88, Theorem 1]).

Let $X_{1},X_{2},\dots,X_{n}$ be independent random variables satisfying $0\leq X_{i}\leq 1$ for $i\in[n]$ . Define $\bar{X}=(X_{1}+X_{2}+\dots+X_{n})/n$ and $\mu=\operatorname*{\mathbb{E}}[\bar{X}]$ , then:

•

For $0<\varepsilon<1-\mu$ we have that $\Pr[\bar{X}-\mu\geq\varepsilon]\leq e^{-2n\varepsilon^{2}}$ .

•

For $0<\varepsilon<\mu$ we have that $\Pr[\bar{X}-\mu\leq-\varepsilon]\leq e^{-2n\varepsilon^{2}}$ .

12 Appendix: Analysis of the Andoni-Indyk framework

Let $\varphi$ denote the probability that a pair of points $x,y$ with $\operatorname{dist}(x,y)\leq r_{1}$ collide in a single repetition of the scheme. A collision occurs if and only if there exists at least one hash function in each of the underlying $t+1$ collections where the points collide. It follows that

[TABLE]

To guarantee a collision with probability at least $1/2$ it suffices to set $\eta=\lceil\ln(2)/\varphi\rceil$ .
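The choice of $\eta$ can be verified in one line, reading $\eta$ as the number of independent repetitions (our interpretation of the notation) and using $1-\varphi\leq e^{-\varphi}$ :

```latex
\Pr[\text{no collision in any of the } \eta \text{ repetitions}]
  = (1-\varphi)^{\eta}
  \le e^{-\varphi\eta}
  \le e^{-\ln 2}
  = \tfrac{1}{2}
  \qquad \text{for } \eta \ge \ln(2)/\varphi .
```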

We will proceed by analyzing this scheme where we let $t\geq 1$ be variable and set parameters as follows:

[TABLE]

To upper bound $L$ we begin by lower bounding $\varphi$ . The second part of $\varphi$ can be lower bounded using Lemma 2.3 to yield $(1-(1-p_{1}^{k_{2}})^{m_{2}})\geq 1-1/e$ . To lower bound $(1-(1-p_{1}^{k_{1}})^{m_{1}})^{t}$ we first note that in the case where $p_{1}^{k_{1}}>1/t$ we have $m_{1}=1$ and the expression can be lower bounded by $p_{1}^{k_{1}t}=(p_{1}^{k_{1}}m_{1})^{t}\geq(p_{1}^{k_{1}}m_{1})^{t}/2e$ . The same lower bound holds in the case where $t=1$ . In the case where $p_{1}^{k_{1}}\leq 1/t$ and $t\geq 2$ we make use of Lemmas 2.3 and 2.4 to derive the lower bound.

[TABLE]

Using the bound $(p_{1}^{k_{1}}m_{1}(1-1/t))^{t}\geq(p_{1}^{k_{1}}m_{1})^{t}/2e$ we have that

[TABLE]

We can then bound the number of lookups and the expected number of distance computations

[TABLE]

Note that this matches the upper bound of the Indyk-Motwani LSH framework up to a constant factor.

To bound the number of hash functions from $\mathcal{H}$ we use that $k_{1}\leq k/t\leq k$ and $k_{2}<t$ .

[TABLE]

Chapter 3 Space-time tradeoffs for similarity search

‘All that is gold does not glitter’

We present a framework for similarity search based on Locality-Sensitive Filtering (LSF), generalizing the Indyk-Motwani (STOC 1998) Locality-Sensitive Hashing (LSH) framework to support space-time tradeoffs. Given a family of filters, defined as a distribution over pairs of subsets of space that satisfies certain locality-sensitivity properties, we can construct a dynamic data structure that solves the approximate near neighbor problem on a collection of $n$ points in $d$ -dimensional space with query time $dn^{\rho_{q}+o(1)}$ , update time $dn^{\rho_{u}+o(1)}$ , and space usage $dn+n^{1+\rho_{u}+o(1)}$ . The space-time tradeoff is tied to the tradeoff between query time and update time (insertions/deletions), controlled by the exponents $\rho_{q},\rho_{u}$ that are determined by the filter family.

Locality-sensitive filtering was introduced by Becker et al. (SODA 2016) together with a framework yielding a single, balanced, tradeoff between query time and space, further relying on the assumption of an efficient oracle for the filter evaluation algorithm. We extend the LSF framework to support space-time tradeoffs and through a combination of existing techniques we remove the oracle assumption.

Laarhoven (arXiv 2015), building on Becker et al., introduced a family of filters with space-time tradeoffs for the high-dimensional unit sphere under inner product similarity and analyzed it for the important special case of random data. We show that a small modification to the family of filters gives a simpler analysis that we use, together with our framework, to provide guarantees for worst-case data. Through an application of Bochner’s Theorem from harmonic analysis by Rahimi & Recht (NIPS 2007), we are able to extend our solution on the unit sphere to $\mathbb{R}^{d}$ under the class of similarity measures corresponding to real-valued characteristic functions. For the characteristic functions of $s$ -stable distributions we obtain a solution to the $(r,cr)$ -near neighbor problem in $\ell_{s}^{d}$ -spaces with query and update exponents $\rho_{q}=\frac{c^{s}(1+\lambda)^{2}}{(c^{s}+\lambda)^{2}}$ and $\rho_{u}=\frac{c^{s}(1-\lambda)^{2}}{(c^{s}+\lambda)^{2}}$ where $\lambda\in[-1,1]$ is a tradeoff parameter. This result improves upon the space-time tradeoff of Kapralov (PODS 2015) and is shown to be optimal in the case of a balanced tradeoff, matching the LSH lower bound by O’Donnell et al. (ITCS 2011) and a similar LSF lower bound proposed in this paper. Finally, we show a lower bound for the space-time tradeoff on the unit sphere that matches Laarhoven’s and our own upper bound in the case of random data.

13 Introduction

Let $(X,\operatorname{dist})$ denote a space over a set $X$ equipped with a symmetric measure of dissimilarity $\operatorname{dist}$ (a distance function in the case of metric spaces). We consider the $(r,cr)$ -near neighbor problem first introduced by Minsky and Papert [110, p. 222] in the 1960’s. A solution to the $(r,cr)$ -near neighbor problem for a set $P$ of $n$ points in $(X,\operatorname{dist})$ takes the form of a data structure that supports the following operation: given a query point $x\in X$ , if there exists a data point $y\in P$ such that $\operatorname{dist}(x,y)\leq r$ then report a data point $y^{\prime}\in P$ such that $\operatorname{dist}(x,y^{\prime})\leq cr$ . In some spaces it turns out to be convenient to work with a measure of similarity rather than dissimilarity. We use $\operatorname{sim}$ to denote a symmetric measure of similarity and define the $(\alpha,\beta)$ -similarity problem to be the $(-\alpha,-\beta)$ -near neighbor problem in $(X,-\operatorname{sim})$ .

A solution to the $(r,cr)$ -near neighbor problem can be viewed as a fundamental building block that yields solutions to many other similarity search problems such as the $c$ -approximate nearest neighbor problem [89, 86]. In particular, the $(r,cr)$ -near neighbor problem is well-studied in $\ell_{s}^{d}$ -spaces where the data points lie in $\mathbb{R}^{d}$ and distances are measured by $\operatorname{dist}(x,y)=\left\lVert x-y\right\rVert_{s}=(\sum_{i=1}^{d}|x_{i}-y_{i}|^{s})^{1/s}$ . Notable spaces include the Euclidean space $(\mathbb{R}^{d},\left\lVert\cdot\right\rVert_{2})$ , Hamming space $(\{0,1\}^{d},\left\lVert\cdot\right\rVert_{1})$ , and the $d$ -dimensional unit sphere $\mathbb{S}^{d}=\{x\in\mathbb{R}^{d}\mid\left\lVert x\right\rVert_{2}=1\}$ under inner product (cosine) similarity $\operatorname{sim}(x,y)=\langle{x},{y}\rangle=\sum_{i=1}^{d}x_{i}y_{i}$ .

Curse of dimensionality.

All known solutions to the $(r,cr)$ -near neighbor problem for $c=1$ (the exact near neighbor problem) either suffer from a space usage that is exponential in $d$ or a query time that is linear in $n$ [86]. This phenomenon is known as the “curse of dimensionality” and has been observed both in theory and practice. For example, Alman and Williams [6] recently showed that the existence of an algorithm for determining whether a set of $n$ points in $d$ -dimensional Hamming space contains a pair of points that are exact near neighbors with a running time strongly subquadratic in $n$ would refute the Strong Exponential Time Hypothesis (SETH) [169]. This result holds even when $d$ is rather small, $d=O(\log n)$ . From a practical point of view, Weber et al. [167] showed that the performance of many of the tree-based approaches to similarity search from the field of computational geometry [68] degrades rapidly to a linear scan as the dimensionality increases.

Approximation to the rescue.

If we allow an approximation factor of $c>1$ then there exist solutions to the $(r,cr)$ -near neighbor problem with query time that is strongly sublinear in $n$ and space polynomial in $n$ where both the space and time complexity of the solution depend only polynomially on $d$ . Techniques for overcoming the curse of dimensionality through approximation were discovered independently by Kushilevitz et al. [99] and Indyk and Motwani [91]. The latter, classical work by Indyk and Motwani [91, 86] introduced a general framework for solving the $(r,cr)$ -near neighbor problem known as Locality-Sensitive Hashing (LSH). The introduction of the LSH framework has inspired an extensive literature (see e.g. [12, 165] for surveys) that represents the state of the art in terms of solutions to the $(r,cr)$ -near neighbor problem in high-dimensional spaces [91, 47, 66, 129, 11, 12, 9, 13, 17, 95, 19, 24, 100].

Hashing and filtering frameworks.

The LSH framework and the more recent LSF framework introduced by Becker et al. [24] produce data structures that solve the $(r,cr)$ -near neighbor problem with query and update time $dn^{\rho+o(1)}$ and space usage $dn+n^{1+\rho+o(1)}$ . The LSH (LSF) framework takes as input a distribution over partitions (subsets) of space with the locality-sensitivity property that close points are more likely to be contained in the same part (subset) of a randomly sampled element from the distribution. The frameworks proceed by constructing a data structure that associates each point in space with a number of memory locations or “buckets” where data points are stored. During a query operation the buckets associated with the query point are searched by computing the distance to every data point in the bucket, returning the first suitable candidate. The set of memory locations associated with a particular point is independent of whether an update operation or a query operation is being performed. This symmetry between the query and update algorithm results in solutions to the near neighbor problem with a balanced space-time tradeoff. The exponent $\rho$ is determined by the locality-sensitivity properties of the family of partitions/hash functions (LSH) or subsets/filters (LSF) and is typically upper bounded by an expression that depends only on the approximation factor $c$ . For example, Indyk and Motwani [91] gave a simple locality-sensitive family of hash functions for Hamming space with an exponent of $\rho\leq 1/c$ . This exponent was later shown to be optimal by O’Donnell et al. [121] who gave a lower bound of $\rho\geq 1/c-o_{d}(1)$ in the setting where $r$ and $cr$ are small compared to $d$ . The advantage of having a general framework for similarity search lies in the reduction of the $(r,cr)$ -near neighbor problem to the, often simpler and easier to analyze, problem of finding a locality-sensitive family of hash functions or filters for the space of interest.

Space-time tradeoffs.

Space-time tradeoffs for solutions to the $(r,cr)$ -near neighbor problem is an active line of research that can be motivated by practical applications where it is desirable to choose the tradeoff between query time and update time (space usage) that is best suited for the application and memory hierarchy at hand [129, 106, 9, 95, 100]. Existing solutions typically have query time $dn^{\rho_{q}+o(1)}$ , update time (insertions/deletions) $dn^{\rho_{u}+o(1)}$ , and use space $dn+n^{1+\rho_{u}+o(1)}$ where the query and update exponents $\rho_{q},\rho_{u}$ that control the space-time tradeoff depend on the approximation factor $c$ and on a tradeoff parameter $\lambda\in[-1,1]$ .

This paper combines a number of existing techniques [24, 100, 73] to provide a general framework for similarity search with space-time tradeoffs. The framework is used to show improved upper bounds on the space-time tradeoff in the well-studied setting of $\ell_{s}$ -spaces and the unit sphere under inner product similarity. Finally, we show a new lower bound on the space-time tradeoff for the unit sphere that matches an upper bound for random data on the unit sphere by Laarhoven [100]. We proceed by stating our contribution and briefly surveying the relevant literature in terms of frameworks, upper bounds, and lower bounds as well as some recent developments. See Table 1 for an overview.

13.1 Contribution

Before stating our results we give a definition of locality-sensitive filtering that supports asymmetry in the framework query and update algorithm, yielding space-time tradeoffs.

Definition 3.1.

Let $(X,\operatorname{dist})$ be a space and let $\mathcal{F}$ be a probability distribution over $\{(Q,U)\mid Q\subseteq X,U\subseteq X\}$ . We say that $\mathcal{F}$ is $(r,cr,p_{1},p_{2},p_{q},p_{u})$ -sensitive if for all points $x,y\in X$ and $(Q,U)$ sampled randomly from $\mathcal{F}$ the following holds:

•

If $\operatorname{dist}(x,y)\leq r$ then $\Pr[x\in Q,y\in U]\geq p_{1}$ .

•

If $\operatorname{dist}(x,y)>cr$ then $\Pr[x\in Q,y\in U]\leq p_{2}$ .

•

$\Pr[x\in Q]\leq p_{q}$ and $\Pr[x\in U]\leq p_{u}$ .

We refer to $(Q,U)$ as a filter and to $Q$ as the query filter and $U$ as the update filter.

Our main contribution is a general framework for similarity search with space-time tradeoffs that takes as input a locality-sensitive family of filters.

Theorem 3.1.

Suppose we have access to a family of filters that is $(r,cr,p_{1},p_{2},p_{q},p_{u})$ -sensitive. Then we can construct a fully dynamic data structure that solves the $(r,cr)$ -near neighbor problem with query time $dn^{\rho_{q}+o(1)}$ , update time $dn^{\rho_{u}+o(1)}$ , and space usage $dn+n^{1+\rho_{u}+o(1)}$ where $\rho_{q}=\frac{\log(p_{q}/p_{1})}{\log(p_{q}/p_{2})}$ and $\rho_{u}=\frac{\log(p_{u}/p_{1})}{\log(p_{q}/p_{2})}$ .
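The exponents in Theorem 3.1 are simple functions of the four probabilities. The helper below is a minimal sketch of that formula; the function name and the numeric inputs are hypothetical and chosen only to illustrate a symmetric and an asymmetric setting.

```python
import math

def lsf_exponents(p1, p2, pq, pu):
    """Query and update exponents (rho_q, rho_u) from Theorem 3.1."""
    denom = math.log(pq / p2)
    return math.log(pq / p1) / denom, math.log(pu / p1) / denom

# Symmetric case p_q = p_u = 1: both exponents equal log(1/p1) / log(1/p2).
rho_q_sym, rho_u_sym = lsf_exponents(p1=0.5, p2=0.25, pq=1.0, pu=1.0)

# Asymmetric case with a sparser update filter (illustrative numbers).
rho_q, rho_u = lsf_exponents(p1=0.005, p2=0.0005, pq=0.1, pu=0.01)
print(rho_q_sym, rho_u_sym, rho_q, rho_u)
```

In the symmetric case the formula reduces to the familiar LSH-style exponent, while shrinking $p_{u}$ relative to $p_{q}$ lowers the update exponent at the expense of the query exponent.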

We give a worst-case analysis of a slightly modified version of Laarhoven’s [100] filter family for the unit sphere and plug it into our framework to obtain the following theorem.

Theorem 3.2.

For every choice of $0\leq\beta<\alpha<1$ and $\lambda\in[-1,1]$ there exists a solution to the $(\alpha,\beta)$ -similarity problem in $(\mathbb{S}^{d},\langle{\cdot},{\cdot}\rangle)$ that satisfies the guarantees from Theorem 3.1 with exponents $\rho_{q}=\left.\frac{(1-\alpha^{1+\lambda})^{2}}{1-\alpha^{2}}\middle/\frac{(1-\alpha^{\lambda}\beta)^{2}}{1-\beta^{2}}\right.$ and $\rho_{u}=\left.\frac{(\alpha^{\lambda}-\alpha)^{2}}{1-\alpha^{2}}\middle/\frac{(1-\alpha^{\lambda}\beta)^{2}}{1-\beta^{2}}\right.$ .
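As a sanity check on these expressions, consider the balanced case $\lambda=0$ . Both exponents then coincide, and using $1-\alpha^{2}=(1-\alpha)(1+\alpha)$ they simplify as follows:

```latex
\rho_{q} = \rho_{u}
  = \left.\frac{(1-\alpha)^{2}}{1-\alpha^{2}} \middle/ \frac{(1-\beta)^{2}}{1-\beta^{2}}\right.
  = \frac{1-\alpha}{1+\alpha}\cdot\frac{1+\beta}{1-\beta},
  \qquad\text{which equals } \frac{1-\alpha}{1+\alpha} \text{ when } \beta = 0 .
```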

We show how an elegant and powerful application of Bochner’s Theorem [140] by Rahimi and Recht [138] allows us to extend the solution on the unit sphere to a large class of similarity measures, yielding as a special case solutions for $\ell_{s}$ -space.

Theorem 3.3.

For every choice of $c\geq 1$ , $s\in(0,2]$ , and $\lambda\in[-1,1]$ there exists a solution to the $(r,cr)$ -near neighbor problem in $\ell_{s}^{d}$ that satisfies the guarantees from Theorem 3.1 with exponents $\rho_{q}=\frac{c^{s}(1+\lambda)^{2}}{(c^{s}+\lambda)^{2}}$ and $\rho_{u}=\frac{c^{s}(1-\lambda)^{2}}{(c^{s}+\lambda)^{2}}$ .
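To get a feeling for the tradeoff in Theorem 3.3, the exponents can be evaluated for a few values of $\lambda$ . The snippet below is a small sketch with a hypothetical function name; it checks that $\lambda=0$ gives the balanced exponent $1/c^{s}$ and that $\lambda=1$ gives $\rho_{u}=0$ .

```python
def lp_exponents(c, s, lam):
    """Query and update exponents (rho_q, rho_u) from Theorem 3.3."""
    cs = c ** s
    denom = (cs + lam) ** 2
    return cs * (1 + lam) ** 2 / denom, cs * (1 - lam) ** 2 / denom

balanced = lp_exponents(c=2, s=2, lam=0)      # rho_q = rho_u = 1 / c^s
near_linear = lp_exponents(c=2, s=2, lam=1)   # rho_u = 0, rho_q = 4 c^s / (c^s + 1)^2
print(balanced, near_linear)
```

At $\lambda=1$ the query exponent $4c^{s}/(c^{s}+1)^{2}$ is strictly below one for every $c>1$ , since $(c^{s}+1)^{2}-4c^{s}=(c^{s}-1)^{2}>0$ .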

This result improves upon the state of the art for every choice of asymmetric query/update exponents $\rho_{q}\neq\rho_{u}$ [129, 11, 9, 95]. We conjecture that this tradeoff is optimal among the class of algorithms that independently of the data determine which locations in memory to probe during queries and updates. In the case of a balanced space-time tradeoff where we set $\rho_{q}=\rho_{u}$ our approach matches existing, optimal [121], data-independent solutions in $\ell_{s}$ -spaces [91, 66, 11, 117].

The LSF framework is very similar to the LSH framework, especially in the case where the filter family is symmetric ( $Q=U$ for every filter in $\mathcal{F}$ ). In this setting we show that the LSH lower bound by O’Donnell et al. applies to the LSF framework as well [121], confirming that the results of Theorem 3.3 are optimal when we set $\rho_{q}=\rho_{u}$ .

Theorem 3.4 (informal).

Every filter family that is symmetric and $(r,cr,p_{1},p_{2},p_{q},p_{u})$ -sensitive in $\ell_{s}^{d}$ must have $\rho=\frac{\log(p_{u}/p_{1})}{\log(p_{q}/p_{2})}\geq 1/c^{s}-o_{d}(1)$ when $r=\omega_{d}(1)$ is chosen to be sufficiently small.

Finally we show a lower bound on the space-time tradeoff that can be obtained in the LSF framework. Our lower bound suffers from two important restrictions. First the filter family must be regular, meaning that all query filters and all update filters are of the same size. Secondly, the size of the query and update filter cannot differ by too much.

Theorem 3.5 (informal).

Every regular filter family that is $((1-\alpha)d/2,(1-\beta)d/2,p_{1},p_{2},p_{q},p_{u})$ -sensitive in $d$ -dimensional Hamming space with asymmetry controlled by $\lambda\in[-1,1]$ cannot simultaneously have that $\rho_{q}<\frac{(1-\alpha^{1+\lambda})^{2}}{1-\alpha^{2}}-o_{d}(1)$ and $\rho_{u}<\frac{(\alpha^{\lambda}-\alpha)^{2}}{1-\alpha^{2}}-o_{d}(1)$ .

Together our upper and lower bounds imply that the filter family of concentric balls in Hamming space is asymptotically optimal for random data.

Techniques.

The LSF framework in Theorem 3.1 relies on a careful combination of “powering” and “tensoring” techniques. For positive integers $m$ and $\tau$ with $m\gg\tau$ the tensoring technique, a variant of which was introduced by Dubiner [73], allows us to simulate a collection of $\binom{m}{\tau}$ filters from a collection of $m$ filters by considering the intersection of all $\tau$ -subsets of filters. Furthermore, given a point $x\in X$ we can efficiently list the simulated filters that contain $x$ . This latter property is crucial as we typically need $\operatorname{poly}(n)$ filters to split our data into sufficiently small buckets for the search to be efficient. The powering technique lets us amplify the locality-sensitivity properties of a filter family in the same way that powering is used in the LSH framework [91, 12, 121].
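The listing property of tensoring can be illustrated with a toy example. The sketch below is an assumption-laden simplification, not the construction used in the framework: base filters are modeled as Python predicates, only one side (query or update) is shown, and the function name is hypothetical. The point is that one evaluates the $m$ base filters once and then enumerates $\tau$ -subsets of the base filters that contain $x$ , rather than testing all $\binom{m}{\tau}$ simulated filters.

```python
from itertools import combinations

def simulated_filters_containing(x, base_filters, tau):
    """List the tau-subsets of base filters whose intersection contains x."""
    # Evaluate each of the m base filters exactly once.
    hits = [i for i, contains in enumerate(base_filters) if contains(x)]
    # A simulated filter contains x iff all tau of its base filters do.
    return list(combinations(hits, tau))

# Toy base filters: filter j accepts points with a positive j-th coordinate.
base_filters = [lambda x, j=j: x[j] > 0 for j in range(5)]
x = (1, -1, 1, 1, -1)
print(simulated_filters_containing(x, base_filters, tau=2))
```

The running time is proportional to $m$ plus the number of simulated filters that actually contain $x$ , which is what makes a polynomial number of simulated filters affordable.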

To obtain results for worst-case data on the unit sphere we analyze a filter family based on standard normal projections using the same techniques as Andoni et al. [13] together with existing tail bounds on bivariate Gaussians. The approximate kernel embedding technique by Rahimi and Recht [138] is used to extend the solution on the unit sphere to a large class of similarity measures, yielding Theorem 3.3 as a special case.

The lower bound in Theorem 3.4 relies on an argument by contradiction against the LSH lower bounds by O’Donnell et al. [121] and uses a theoretical, inefficient, construction of a locality-sensitive family of hash functions from a locality-sensitive family of filters that is similar to the spherical LSH by Andoni et al. [14].

Finally, the space-time tradeoff lower bound from Theorem 3.5 is obtained through an application of an isoperimetric inequality by O’Donnell [120, Ch. 10] and is similar in spirit to the LSH lower bound by Motwani et al. [114].

13.2 Related work

The LSH framework takes a distribution $\mathcal{H}$ over hash functions that partition space with the property that the probability of two points landing in the same partition is an increasing function of their similarity.

Definition 3.2.

Let $(X,\operatorname{dist})$ be a space and let $\mathcal{H}$ be a probability distribution over functions $h\colon X\to R$ . We say that $\mathcal{H}$ is $(r,cr,p,q)$ -sensitive if for all points $x,y\in X$ and $h$ sampled randomly from $\mathcal{H}$ the following holds:

•

If $\operatorname{dist}(x,y)\leq r$ then $\Pr[h(x)=h(y)]\geq p$ .

•

If $\operatorname{dist}(x,y)>cr$ then $\Pr[h(x)=h(y)]\leq q$ .

The properties of $\mathcal{H}$ determine a parameter $\rho<1$ that governs the space and time complexity of the solution to the $(r,cr)$ -near neighbor problem.

Theorem 3.6 (LSH framework [91, 86]).

Suppose we have access to a $(r,cr,p,q)$ -sensitive hash family. Then we can construct a fully dynamic data structure that solves the $(r,cr)$ -near neighbor problem with query time $dn^{\rho+o(1)}$ , update time $dn^{\rho+o(1)}$ , and with a space usage of $dn+n^{1+\rho+o(1)}$ where $\rho=\frac{\log(1/p)}{\log(1/q)}$ .
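As a concrete instance of the exponent in Theorem 3.6, consider the classical bit-sampling family for $d$ -dimensional Hamming space, where a hash function returns a uniformly random coordinate. Points at distance at most $r$ collide with probability at least $p=1-r/d$ and points at distance more than $cr$ collide with probability at most $q=1-cr/d$ . The following small sketch, with illustrative values of $d$ , $r$ and $c$ that are not taken from the text, evaluates $\rho$ and compares it to $1/c$ .

```python
import math

def bit_sampling_rho(d, r, c):
    """LSH exponent log(1/p) / log(1/q) for bit sampling in Hamming space."""
    p = 1 - r / d        # collision probability at distance r
    q = 1 - c * r / d    # collision probability at distance c * r
    return math.log(1 / p) / math.log(1 / q)

rho = bit_sampling_rho(d=128, r=8, c=2)
print(rho, 1 / 2)
```

For these inputs the exponent is slightly below $1/c$ , consistent with the bound $\rho\leq 1/c$ for Hamming space mentioned in the introduction.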

The LSF framework by Becker et al. [24] takes a symmetric $(r,cr,p_{1},p_{2},p_{q},p_{u})$ -sensitive filter family $\mathcal{F}$ and produces a data structure that solves the $(r,cr)$ -near neighbor problem with the same properties as the one produced by the LSH framework where instead we have $\rho=\frac{\log(p_{q}/p_{1})}{\log(p_{q}/p_{2})}$ . In addition, the framework assumes access to an oracle that is able to efficiently list the relevant filters containing a point $x\in X$ out of a large collection of filters. The LSF framework in this paper removes this assumption, showing how to construct an efficient oracle as part of the framework.

In terms of frameworks that support space-time tradeoffs, Panigrahy [129] developed a framework based on LSH that supports the two extremes of the space-time tradeoff. In the language of Theorem 3.1, Panigrahy’s framework supports either setting $\rho_{u}=0$ for a solution that uses near-linear space at the cost of a slower query time, or setting $\rho_{q}=0$ for a solution with query time $n^{o(1)}$ at the cost of a higher space usage. To obtain near-linear space the framework stores every data point in $n^{o(1)}$ partitions induced by randomly sampled hash functions from a $(r,cr,p,q)$ -sensitive LSH family $\mathcal{H}$ . In comparison, the standard LSH framework from Theorem 3.6 uses $n^{\rho}$ such partitions where $\rho$ is determined by $\mathcal{H}$ . For each partition induced by $h\in\mathcal{H}$ the query algorithm in Panigrahy’s framework generates a number of random points $z$ in a ball around the query point $x$ and searches the parts of the partition $h(z)$ that they hash to. The query time is bounded by $n^{\hat{\rho}+o(1)}$ where $\hat{\rho}=\frac{I(h(z)|x,h)}{\log(1/q)}$ and $I(h(z)|x,h)$ denotes conditional entropy, i.e. the query time is determined by how hard it is to guess where $z$ hashes to given that we know $x$ and $h$ . Panigrahy’s technique was used in a number of follow-up works that improve on solutions for specific spaces, but to our knowledge none of them state a general framework with space-time tradeoffs [106, 9, 95].

Upper bounds.

As is standard in the literature we state results in $\ell_{s}$ -spaces in terms of the properties of a solution to the $(r,cr)$ -near neighbor problem. For results on the unit sphere under inner product similarity $(\mathbb{S}^{d},\langle{\cdot},{\cdot}\rangle)$ we instead use the $(\alpha,\beta)$ -similarity terminology, defined in the introduction, as we find it to be cleaner and more intuitive while aligning better with the analysis. The $\ell_{s}$ -spaces, particularly $\ell_{1}$ and $\ell_{2}$ , as well as $(\mathbb{S}^{d},\langle{\cdot},{\cdot}\rangle)$ are some of the most well-studied spaces for similarity search and are also widely used in practice [165]. Furthermore, fractional norms ( $\ell_{s}$ for $s\neq 1,2$ ) have been shown to perform better than the standard norms in certain use cases [2] which motivates finding efficient solutions to the near neighbor problem in general $\ell_{s}$ -space.

In the case of a balanced space-time tradeoff the best data-independent upper bounds for the $(r,cr)$ -near neighbor problem in $\ell_{s}^{d}$ are solutions with an LSH exponent of $\rho=1/c^{s}$ for $0<s\leq 2$ . This result is obtained through a combination of techniques. For $0<s\leq 1$ the LSH based on $s$ -stable distributions by Datar et al. [66] can be used to obtain an exponent of $(1+\varepsilon)/c^{s}$ for an arbitrarily small constant $\varepsilon>0$ . For $1<s\leq 2$ the ball-carving LSH by Andoni and Indyk [11] for Euclidean space can be extended to $\ell_{s}$ using the technique described by Nguyen [117, Section 5.5]. Theorem 3.3 matches (and potentially improves in the case of $0<s<1$ ) these results with a single unified technique and analysis that we find to be simpler.

For space-time tradeoffs in Euclidean space (again extending to $\ell_{s}$ for $1<s<2$ ) Kapralov [95], improving on Panigrahy’s results [129] in Euclidean space and using similar techniques, obtains a solution with query exponent $\rho_{q}=\frac{c^{2}(1+\lambda)^{2}}{(c^{2}+\lambda)^{2}-c^{2}(1+\lambda^{2})/2-\lambda^{2}}$ and update exponent $\rho_{u}=\frac{c^{2}(1-\lambda)^{2}}{(c^{2}+\lambda)^{2}-c^{2}(1+\lambda^{2})/2-\lambda^{2}}$ under the condition that $c^{2}\geq(1+\lambda)^{2}/2+\lambda+\varepsilon$ where $\varepsilon>0$ is an arbitrary positive constant. Comparing to our Theorem 3.3 it is easy to see that we improve upon Kapralov’s space-time tradeoff for all choices of $c$ and $\lambda$ . In addition, Theorem 3.3 represents the first solution to the $(r,cr)$ -near neighbor problem in Euclidean space that for every choice of constant $c>1$ obtains sublinear query time ( $\rho_{q}<1$ ) using only near-linear space ( $\rho_{u}=0$ ). Due to the restrictions on Kapralov’s result he is only able to obtain sublinear query time for $c>\sqrt{3}$ when the space usage is restricted to be near-linear. It appears that our improvements can primarily be attributed to our techniques allowing a more direct analysis. Kapralov uses a variation of Panigrahy’s LSH-based technique of, depending on the desired space-time tradeoff, either querying or updating additional memory locations around a point $x\in X$ in the partition induced by $h\in\mathcal{H}$ . For a query point $x$ and a near neighbor $y$ his argument for correctness is based on guaranteeing that both the query algorithm and update algorithm visit the part $h(z)$ where $z$ is a point lying between $x$ and $y$ , possibly leading to a loss of efficiency in the analysis. More details on the comparison of Theorem 3.3 to Kapralov’s result can be found in Appendix 23.
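The comparison with Kapralov’s exponents can be checked numerically. The sketch below, with hypothetical function names, evaluates both pairs of formulas as stated in this section for Euclidean space ( $s=2$ ). It plugs values into Kapralov’s expressions purely as formulas, without enforcing the side condition on $c^{2}$ .

```python
import math

def ours(c, lam):
    """Exponents (rho_q, rho_u) of Theorem 3.3 for s = 2."""
    c2 = c * c
    denom = (c2 + lam) ** 2
    return c2 * (1 + lam) ** 2 / denom, c2 * (1 - lam) ** 2 / denom

def kapralov(c, lam):
    """Kapralov's exponents (rho_q, rho_u) as quoted in the text."""
    c2 = c * c
    denom = (c2 + lam) ** 2 - c2 * (1 + lam ** 2) / 2 - lam ** 2
    return c2 * (1 + lam) ** 2 / denom, c2 * (1 - lam) ** 2 / denom

# Near-linear space (lam = 1): Kapralov's query exponent becomes 4 / (c^2 + 1),
# which drops below one only for c > sqrt(3); ours is 4 c^2 / (c^2 + 1)^2.
print(kapralov(math.sqrt(3), 1)[0], ours(math.sqrt(3), 1)[0])
print(kapralov(2, 1)[0], ours(2, 1)[0])
```

Because Kapralov’s denominator subtracts the positive quantity $c^{2}(1+\lambda^{2})/2+\lambda^{2}$ from $(c^{2}+\lambda)^{2}$ , his exponents are larger than those of Theorem 3.3 wherever both are defined.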

In terms of space-time tradeoffs on the unit sphere, Laarhoven [100] modifies a filter family introduced by Becker et al. [24] to support space-time tradeoffs, obtaining a solution for random data on the unit sphere (the $(\alpha,\beta)$ -similarity problem with $\beta=o_{d}(1)$ ) with query exponent $\rho_{q}=\frac{(1-\alpha^{1+\lambda})^{2}}{1-\alpha^{2}}$ and update exponent $\rho_{u}=\frac{(\alpha^{\lambda}-\alpha)^{2}}{1-\alpha^{2}}$ . Theorem 3.2 extends this result to provide a solution to the $(\alpha,\beta)$ -similarity problem on the unit sphere for every choice of $0\leq\beta<\alpha<1$ . This extension to worst case data is crucial for obtaining our results for $\ell_{s}$ -spaces in Theorem 3.3. We note that there exist other data-independent techniques (e.g. Valiant [162, Alg. 25]) for extending solutions on the unit sphere to $\ell_{2}$ , but they also require a solution for worst-case data on the unit sphere to work.

Lower bounds

The performance of an LSH-based solution to the near neighbor problem in a given space that uses a $(r,cr,p,q)$ -sensitive family of hash functions $\mathcal{H}$ is summarized by the value of the exponent $\rho=\frac{\log(1/p)}{\log(1/q)}$ . It is therefore of interest to lower bound $\rho$ in terms of the approximation factor $c$ . Motwani et al. [114] proved the first lower bound for LSH families in $d$ -dimensional Hamming space. They show that for every choice of $c\geq 1$ there is some choice of $r$ for which $\rho\geq 0.462/c$ must hold as $d$ goes to infinity, under the assumption that $q$ is not too small ( $q\geq 2^{-o(d)}$ ).

As part of an effort to show lower bounds for data-dependent locality-sensitive hashing, Andoni and Razenshteyn [19] strengthened the lower bound by Motwani et al. to $\rho\geq 1/(2c-1)$ in Hamming space. These lower bounds are initially shown in Hamming space and can then be extended to $\ell_{s}$ -space and the unit sphere by the fact that a solution in these spaces can be used to yield a solution in Hamming space, contradicting the lower bound if $\rho$ is too small. Translated to $(\alpha,\beta)$ -similarity on the unit sphere, which is the primary setting for the lower bounds on LSF space-time tradeoffs in this paper, the lower bound by Andoni and Razenshteyn shows that an LSH on the unit sphere must have $\rho\geq\frac{1-\alpha}{1+\alpha}$ which is tight in the case of random data [13].

The lower bound uses properties of random walks over a partition of Hamming space: A random walk starting from a random point $x\in\{-1,1\}^{d}$ is likely to “walk out” of the part identified by $h(x)$ in the partition induced by $h$ . The space-time tradeoff lower bound in Theorem 3.5 relies on a similar argument that lower bounds the probability that a random walk starting from a subset $Q$ ends up in another subset $U$ , corresponding nicely to query and update filters in the LSF framework.

Using related techniques O’Donnell et al. [121] showed tight LSH lower bounds for $\ell_{s}$ -space of $\rho\geq 1/c^{s}$ . The work by Andoni et al. [15] and Panigrahy et al. [130, 131] gives cell probe lower bounds for the $(r,cr)$ -near neighbor problem, showing that in Euclidean space a solution with a query complexity of $t$ probes requires space at least $n^{1+\Omega(1/tc^{2})}$ . For more details on these lower bounds and how they relate to the upper bounds on the unit sphere see [16, 100].

Data-dependent solutions

The solutions to the $(r,cr)$ -near neighbor problems considered in this paper are all data-independent. For the LSH and LSF frameworks this means that the choice of hash functions or filters used by the data structure, determining the mapping between points in space and the memory locations that are searched during the query and update algorithm, is made without knowledge of the data. Data-independent solutions to the $(r,cr)$ -near neighbor problem for worst-case data have been the state of the art until recent breakthroughs by Andoni et al. [14] and Andoni and Razenshteyn [17] showing improved solutions to the $(r,cr)$ -near neighbor problem in Euclidean space using data-dependent techniques. In this setting the solution obtained by Andoni and Razenshteyn has an exponent of $\rho=1/(2c^{2}-1)$ compared to the optimal data-independent exponent of $\rho=1/c^{2}$ . Furthermore, they show that this exponent is optimal for data-dependent solutions in a restricted model [19].

Recent developments

Recent work by Andoni et al. [16], done independently of and concurrently with this paper, shows that Laarhoven’s upper bound for random data on the unit sphere can be combined with data-dependent techniques [17] to yield a space-time tradeoff in Euclidean space with $\rho_{u},\rho_{q}$ satisfying $c^{2}\sqrt{\rho_{q}}+(c^{2}-1)\sqrt{\rho_{u}}=\sqrt{2c^{2}-1}$ . In the balanced case $\rho_{q}=\rho_{u}$ this recovers the data-dependent exponent $\rho=1/(2c^{2}-1)$ . This improves the result of Theorem 3.3 and matches the lower bound in Theorem 3.5. In the same paper they also show a lower bound matching our lower bound in Theorem 3.5. Their lower bound is set in a more general model that captures both the LSH and LSF frameworks, and they are able to remove some of the technical restrictions, such as the filter family being regular, that weaken the lower bound in this paper. In spite of these results we still believe that this paper presents an important contribution by providing a general and simple framework with space-time tradeoffs as well as improved data-independent solutions to nearest neighbor problems in $\ell_{s}$ -space and on the unit sphere. We would also like to point out the simplicity and power of using Rahimi and Recht’s [138] result to extend solutions on the unit sphere to spaces with similarity measures corresponding to real-valued characteristic functions, further described in Appendix 21.

14 A framework with space-time tradeoffs

We use a combination of powering and tensoring techniques to amplify the locality-sensitive properties of our initial filter family, and to simulate a large collection of filters that we can evaluate efficiently. We proceed by stating the relevant properties of these techniques which we then combine to yield our Theorem 3.1.

Lemma 3.1 (powering).

Given a $(r,cr,p_{1},p_{2},p_{q},p_{u})$ -sensitive filter family $\mathcal{F}$ for $(X,\operatorname{dist})$ and a positive integer $\kappa$ , define the family $\mathcal{F}^{\kappa}$ as follows: we sample a filter $F=(Q,U)$ from $\mathcal{F}^{\kappa}$ by sampling $(Q_{1},U_{1}),\dots,(Q_{\kappa},U_{\kappa})$ independently from $\mathcal{F}$ and setting $(Q,U)=(\bigcap_{i=1}^{\kappa}Q_{i},\bigcap_{i=1}^{\kappa}U_{i})$ . The family $\mathcal{F}^{\kappa}$ is $(r,cr,p_{1}^{\kappa},p_{2}^{\kappa},p_{q}^{\kappa},p_{u}^{\kappa})$ -sensitive for $(X,\operatorname{dist})$ .

Let $\mathbf{F}$ denote a collection (indexed family) of $m$ filters and let $\mathbf{Q}$ and $\mathbf{U}$ denote the corresponding collections of query and update filters, that is, for $i\in\{1,\dots,m\}$ we have that $\mathbf{F}_{i}=(\mathbf{Q}_{i},\mathbf{U}_{i})$ . Given a positive integer $\tau\leq m$ (typically $\tau\ll m$ ) we define $\mathbf{F}^{\otimes\tau}\!$ to be the collection of filters formed by taking all the intersections of $\tau$ -combinations of filters from $\mathbf{F}$ , that is, for every $I\subseteq\{1,\dots,m\}$ with $|I|=\tau$ we have that

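$$\mathbf{F}^{\otimes\tau}_{I}=\Bigl(\bigcap_{i\in I}\mathbf{Q}_{i},\ \bigcap_{i\in I}\mathbf{U}_{i}\Bigr).$$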

The following properties of the tensoring technique will be used to provide correctness, running time, and space usage guarantees for the LSF data structure that will be introduced in the next subsection. We refer to the evaluation time of a collection of filters $\mathbf{F}$ as the time it takes, given a point $x\in X$ to prepare a list of query filters $\mathbf{Q}(x)\subseteq\mathbf{Q}$ containing $x$ and a list of update filters $\mathbf{U}(x)\subseteq\mathbf{U}$ containing $x$ such that the next element of either list can be reported in constant time. We say that a pair of points $(x,y)$ is contained in a filter $(Q,U)$ if $x\in Q$ and $y\in U$ .

Lemma 3.2 (tensoring).

Let $\mathcal{F}$ be a filter family that is $(r,cr,p_{1},p_{2},p_{q},p_{u})$ -sensitive in $(X,\operatorname{dist})$ . Let $\tau$ be a positive integer and let $\mathbf{F}$ denote a collection of $m=\lceil\tau/p_{1}\rceil$ independently sampled filters from $\mathcal{F}$ . Then the collection $\mathbf{F}^{\otimes\tau}\!$ of $\binom{m}{\tau}$ filters has the following properties:

•

If $(x,y)$ have distance at most $r$ then with probability at least $1/2$ there exists a filter in $\mathbf{F}^{\otimes\tau}\!$ containing $(x,y)$ .

•

If $(x,y)$ have distance greater than $cr$ then the expected number of filters in $\mathbf{F}^{\otimes\tau}\!$ containing $(x,y)$ is at most $p_{2}^{\tau}\binom{m}{\tau}$ .

•

In expectation, a point $x$ is contained in at most $p_{q}^{\tau}\binom{m}{\tau}$ query filters and at most $p_{u}^{\tau}\binom{m}{\tau}$ update filters in $\mathbf{F}^{\otimes\tau}\!$ .

•

The evaluation time and space complexity of $\mathbf{F}^{\otimes\tau}\!$ is dominated by the time it takes to evaluate and store $m$ filters from $\mathcal{F}$ .

Proof.

To prove the first property we note that there exists a filter in $\mathbf{F}^{\otimes\tau}\!$ containing $(x,y)$ if at least $\tau$ filters in $\mathbf{F}$ contain $(x,y)$ . The binomial distribution has the property that the median is at least as great as the mean rounded down [93]. By the choice of $m$ we have that the expected number of filters in $\mathbf{F}$ containing $(x,y)$ is at least $\tau$ and the result follows. The second and third properties follow from the linearity of expectation and the fourth is trivial. ∎
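The first property of Lemma 3.2 can be checked numerically on a small instance. The following is a minimal sketch, not the implementation used in this paper: the function names are hypothetical, and the tensored filters are enumerated explicitly as $\tau$ -subsets of the indices of base filters containing a point, which is an assumption made for readability.

```python
import math
from itertools import combinations


def tensor_filter_ids(containing, tau):
    """Enumerate the tau-subsets I of base-filter indices that all contain a point.

    `containing` lists the indices i of base filters that contain the point;
    each subset I identifies one filter of the tensored collection.
    """
    return combinations(sorted(containing), tau)


def prob_at_least(m, p, tau):
    """Pr[Bin(m, p) >= tau], computed exactly from the binomial pmf."""
    return sum(math.comb(m, k) * p**k * (1 - p) ** (m - k) for k in range(tau, m + 1))


# Example values (assumptions): tau = 3 and p1 = 1/8, so m = ceil(tau / p1) = 24.
tau, p1 = 3, 0.125
m = math.ceil(tau / p1)

# Probability that at least tau of the m base filters contain a close pair,
# i.e. that some tensored filter contains it.
hit_probability = prob_at_least(m, p1, tau)
```

Only the $m$ base filters are evaluated; the $\binom{m}{\tau}$ tensored filters are generated lazily from the indices that contain the point, which is the source of the fourth property.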

14.1 The LSF data structure

We will introduce a dynamic data structure that solves the $(r,cr)$ -near neighbor problem on a set of points $P\subseteq X$ . The data structure has access to a $(r,cr,p_{1},p_{2},p_{q},p_{u})$ -sensitive filter family $\mathcal{F}$ in the sense that it knows the parameters of the family and is able to sample, store, and evaluate filters from $\mathcal{F}$ in time $dn^{o(1)}$ .

The data structure supports an initialization operation that initializes a collection of filters $\mathbf{F}$ where for every filter we maintain a (possibly empty) set of points from $X$ . After initialization the data structure supports three operations: insert, delete, and query. The insert (delete) operation takes as input a point $x\in X$ and adds (removes) the point from the set of points associated with each update filter in $\mathbf{F}$ that contains $x$ . The query operation takes as input a point $x\in X$ . For each query filter in $\mathbf{F}$ that contains $x$ we proceed by computing the dissimilarity $\operatorname{dist}(x,y)$ to every point $y$ associated with the filter. If a point $y$ satisfying $\operatorname{dist}(x,y)\leq cr$ is encountered, then $y$ is returned and the query algorithm terminates. If no such point is found, the query algorithm returns a special symbol “ $\varnothing$ ” and terminates.
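The three operations can be sketched as follows. This is an illustrative toy version under stated assumptions: the class name `LSFIndex` is hypothetical, filters are given as an explicit list of predicate pairs rather than being simulated through powering and tensoring, and every filter is scanned on each operation instead of using an efficient filter evaluation algorithm.

```python
class LSFIndex:
    """Toy locality-sensitive filtering index over an explicit list of filters.

    Each filter is a pair (in_q, in_u) of membership predicates. The explicit
    list is an assumption for illustration; it replaces the simulated collection.
    """

    def __init__(self, filters, dist, r, c):
        self.filters = filters
        self.dist = dist
        self.threshold = c * r
        # One (possibly empty) set of points per update filter.
        self.buckets = [set() for _ in filters]

    def insert(self, x):
        # Add x to the set of every update filter that contains it.
        for i, (_, in_u) in enumerate(self.filters):
            if in_u(x):
                self.buckets[i].add(x)

    def delete(self, x):
        # Remove x from the set of every update filter that contains it.
        for i, (_, in_u) in enumerate(self.filters):
            if in_u(x):
                self.buckets[i].discard(x)

    def query(self, x):
        # Scan the points stored under each query filter that contains x and
        # return the first point within distance c * r.
        for i, (in_q, _) in enumerate(self.filters):
            if in_q(x):
                for y in self.buckets[i]:
                    if self.dist(x, y) <= self.threshold:
                        return y
        return None  # plays the role of the special "empty" symbol
```

With one-dimensional points, overlapping intervals as symmetric filters, and absolute difference as the distance, a query returns a stored point within distance $cr$ whenever the two share a filter, and `None` otherwise.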

The data structure will combine the powering and tensoring techniques in order to simulate the collection of filters $\mathbf{F}$ from two smaller collections: $\mathbf{F}_{1}$ consisting of $m_{1}$ filters from $\mathcal{F}^{\kappa_{1}}$ and $\mathbf{F}_{2}$ consisting of $m_{2}$ filters from $\mathcal{F}^{\kappa_{2}}$ . The collection of simulated filters $\mathbf{F}$ is formed by taking all filters $(Q_{1}\cap Q_{2},U_{1}\cap U_{2})$ where $(Q_{1},U_{1})$ is a member of $\mathbf{F}_{1}^{\otimes\tau}\!$ and $(Q_{2},U_{2})$ is a member of $\mathbf{F}_{2}$ . It is due to the integer constraints on the parameter $\tau$ in the tensoring technique and the parameter $\kappa$ in the powering technique that we simulate our filters from two underlying collections instead of just one. This gives us more freedom to hit a target level of amplification of the simulated filters which in turn makes it possible for the framework to support efficient solutions for a wider range of parameters of LSF families.

The initialization operation takes $\mathcal{F}$ and parameters $m_{1},\kappa_{1},\tau,m_{2},\kappa_{2}$ and samples and stores $\mathbf{F}_{1}$ and $\mathbf{F}_{2}$ . The filter evaluation algorithm used by the insert, delete, and query operation takes a point $x\in X$ and computes for $\mathbf{F}_{1}$ and $\mathbf{F}_{2}$ , depending on the operation, the list of update or query filters containing $x$ . From these lists we are able to generate the list of filters in $\mathbf{F}$ containing $x$ .

Setting the parameters of the data structure to guarantee correctness while balancing the contribution to the query time from the filter evaluation algorithm, the number of filters containing the query point, and the number of distant points examined, we obtain a partially dynamic data structure that solves the $(r,cr)$ -near neighbor problem with failure probability $\delta\leq 1/2+1/(2e)$ . Using a standard dynamization technique by Overmars and Leeuwen [122, Thm. 1] we obtain a fully dynamic data structure resulting in Theorem 3.1. The details of the proof have been deferred to Appendix 19.

15 Gaussian filters on the unit sphere

In this section we show properties of a family of filters for the unit sphere $\mathbb{S}^{d}$ under inner product similarity. Later we will show how to make use of this family to solve the near neighbor problem in other spaces, including $\ell_{s}$ for $0<s\leq 2$ .

Lemma 3.3.

For every choice of $0\leq\beta<\alpha<1$ , $\lambda\in[-1,1]$ , and $t>0$ let $\mathcal{G}$ denote the family of filters defined as follows: we sample a filter $(Q,U)$ from $\mathcal{G}$ by sampling $z\sim\mathcal{N}^{d}(0,1)$ and setting

[TABLE]

Then $\mathcal{G}$ is locality-sensitive on the unit sphere under inner product similarity with exponents

[TABLE]

Laarhoven’s filter family [100] is identical to $\mathcal{G}$ except that he normalizes the projection vectors $z$ to have unit length. The properties of $\mathcal{G}$ can easily be verified with a simple back-of-the-envelope analysis using two facts. First, for a standard normal random variable $Z$ we have that $\Pr[Z>t]\approx e^{-t^{2}/2}$ . Second, Gaussian projections $\langle{x},{z}\rangle$ are invariant to rotations, which allows us to analyze the projection of arbitrary points $x,y\in\mathbb{S}^{d}$ with inner product $\langle{x},{y}\rangle=\alpha$ in a two-dimensional setting $x=(1,0)$ and $y=(\alpha,\sqrt{1-\alpha^{2}})$ without any loss of generality. The proofs of Lemma 3.3 and Theorem 3.2 have been deferred to Appendix 20.
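A Gaussian filter of this kind can be sketched in a few lines. The thresholds below are an assumption: they are inferred from the tail events analyzed in Appendix 20 and from the exponents in Theorem 3.2, with the query filter accepting $\langle{x},{z}\rangle>\alpha^{\lambda}t$ and the update filter accepting $\langle{x},{z}\rangle>t$ . The function name is hypothetical.

```python
import random


def sample_gaussian_filter(d, alpha, lam, t, rng):
    """Sample z ~ N(0, I_d) and return membership tests (in_q, in_u).

    Assumed thresholds: <x, z> > alpha**lam * t for the query filter and
    <x, z> > t for the update filter.
    """
    z = [rng.gauss(0.0, 1.0) for _ in range(d)]
    query_threshold = (alpha**lam) * t

    def projection(x):
        return sum(xi * zi for xi, zi in zip(x, z))

    def in_q(x):
        return projection(x) > query_threshold

    def in_u(x):
        return projection(x) > t

    return in_q, in_u
```

Under this reading, $\lambda>0$ lowers the query threshold, so every point in the update filter is also in the query filter: more filters are probed at query time and fewer are written at update time. The case $\lambda=0$ makes the two filters identical, and $\lambda<0$ reverses the roles.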

16 Space-time tradeoffs under kernel similarity

In this section we will show how to combine the Gaussian filters for the unit sphere with kernel approximation techniques in order to solve the $(\alpha,\beta)$ -similarity problem over $(\mathbb{R}^{d},S)$ for the class of similarity measures of the form $S(x,y)=k(x-y)$ where $k\colon\mathbb{R}^{d}\times\mathbb{R}^{d}\to\mathbb{R}$ is a real-valued characteristic function [161]. For this class of functions there exists a feature map $\psi$ into a (possibly infinite-dimensional) dot product space such that $k(x,y)=\langle{\psi(x)},{\psi(y)}\rangle$ . Through an elegant combination of Bochner’s Theorem and Euler’s Theorem, detailed in Appendix 21, Rahimi and Recht [138] show how to construct approximate feature maps, i.e., for every $k$ we can construct a function $v$ with the property that $\langle{v(x)},{v(y)}\rangle\approx\langle{\psi(x)},{\psi(y)}\rangle=k(x-y)$ . We state a variant of their result for a mapping onto the unit sphere.

Lemma 3.4.

For every real-valued characteristic function $k$ and every positive integer $l$ there exists a family of functions $\mathcal{V}\subseteq\{v\mid v\colon\mathbb{R}^{d}\to\mathbb{S}^{l}\}$ such that for every $x,y\in\mathbb{R}^{d}$ and $\varepsilon>0$ we have that

[TABLE]

Theorem 3.10 in Appendix 21 shows that Theorem 3.2 holds with the space $(\mathbb{S}^{d},\langle{\cdot},{\cdot}\rangle)$ replaced by $(\mathbb{R}^{d},k)$ .

16.1 Tradeoffs in $\ell_{s}^{d}$ -space

Consider the $(r,cr)$ -near neighbor problem in $\ell_{s}^{d}$ for $0<s\leq 2$ . We solve this problem by first applying the approximate feature map from Lemma 3.4 for the characteristic function of a standard $s$ -stable distribution [174], mapping the data onto the unit sphere, and then applying our solution from Theorem 3.2 to solve the appropriate $(\alpha,\beta)$ -similarity problem on the unit sphere. The characteristic functions of $s$ -stable distributions take the following form:

Lemma 3.5 (Lévy [102]).

For every positive integer $d$ and $0<s\leq 2$ there exists a characteristic function $k\colon\mathbb{R}^{d}\times\mathbb{R}^{d}\to[0,1]$ of the form

[TABLE]

A result by Chambers et al. [46] shows how to sample efficiently from $s$ -stable distributions.

To sketch the proof of Theorem 3.3 we proceed by upper bounding the exponents $\rho_{q}$ , $\rho_{u}$ from Theorem 3.2 when applying Lemma 3.4 to get $\alpha\geq e^{-r^{s}}-\varepsilon$ and $\beta\leq e^{-c^{s}r^{s}}+\varepsilon$ . We make use of the following standard fact (see e.g. [142]) that can be derived from the Taylor expansion of the exponential function: for $x\geq 0$ it holds that $1-x\leq e^{-x}\leq 1-x+x^{2}/2$ . Scaling the data points such that $r^{s}=o(1)$ and inserting the above values of $\alpha\approx 1-r^{s}$ and $\beta\approx 1-c^{s}r^{s}$ into the expressions for $\rho_{q}$ , $\rho_{u}$ in Lemma 3.3 we can set parameters $t$ and $l$ such that Theorem 3.3 holds.
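The substitution in this sketch can be carried out to first order as follows. This is an informal calculation that ignores the $\varepsilon$ and $o(1)$ terms and uses the exponents of Theorem 3.2; the shorthand $a=r^{s}$ is introduced here only for illustration.

```latex
\begin{align*}
a &:= r^{s}\to 0, \qquad \alpha = 1-a+O(a^{2}), \qquad \beta = 1-c^{s}a+O(a^{2}), \qquad \alpha^{\lambda}=1-\lambda a+O(a^{2}),\\
\frac{(1-\alpha^{1+\lambda})^{2}}{1-\alpha^{2}} &= \frac{(1+\lambda)^{2}a}{2}\,(1+O(a)),\\
\frac{(\alpha^{\lambda}-\alpha)^{2}}{1-\alpha^{2}} &= \frac{(1-\lambda)^{2}a}{2}\,(1+O(a)),\\
\frac{(1-\alpha^{\lambda}\beta)^{2}}{1-\beta^{2}} &= \frac{(c^{s}+\lambda)^{2}a}{2c^{s}}\,(1+O(a)),\\
\rho_{q} &\to \frac{c^{s}(1+\lambda)^{2}}{(c^{s}+\lambda)^{2}}, \qquad \rho_{u}\to\frac{c^{s}(1-\lambda)^{2}}{(c^{s}+\lambda)^{2}}.
\end{align*}
```

As a sanity check, $\lambda=0$ gives $\rho_{q}=\rho_{u}=1/c^{s}$ , in line with the LSH lower bound for $\ell_{s}$ -space quoted earlier, and $\lambda=1$ gives $\rho_{u}=0$ with $\rho_{q}=4c^{s}/(c^{s}+1)^{2}<1$ for every $c>1$ .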

17 Lower bounds

We begin by stating the lower bound on the LSH exponent $\rho=\log(1/p)/\log(1/q)$ by O’Donnell et al. [121].

Theorem 3.7 (O’Donnell et al. [121]).

Fix $d\in\mathbb{N}$ , $1<c<\infty$ , $0<s<\infty$ and $0<q<1$ . Then for a certain choice of $r=\omega_{d}(1)$ and under the assumption that $q\geq 2^{-o(d)}$ we have that every $(r,cr,p,q)$ -sensitive family of hash functions for $\ell_{s}^{d}$ must satisfy

[TABLE]

The following lemma shows how to use a filter family $\mathcal{F}$ to construct a hash family $\mathcal{H}$ .

Lemma 3.6.

Given a symmetric family of filters that is $(r,cr,p_{1},p_{2},p_{q},p_{u})$ -sensitive in $(X,\operatorname{dist})$ we can construct a $(r,cr,p_{1}/(2p_{q}),p_{2}/p_{q})$ -sensitive family of hash functions in $(X,\operatorname{dist})$ .

Proof.

Given the filter family $\mathcal{F}$ we sample a random function $h$ from the hash family $\mathcal{H}$ by taking an infinite sequence of independently sampled filters $(F_{i})_{i=0}^{\infty}$ from $\mathcal{F}$ and setting $h(x)=\min\,\{i\mid x\in F_{i}\}$ . The probability of collision is given by

[TABLE]

and the result follows from the properties of $\mathcal{F}$ . ∎
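The construction in this proof resembles MinHash and can be sketched as below. The class name `FilterHash`, the lazily extended list standing in for the infinite sequence, and the safety cap `max_filters` are all assumptions made for illustration; filters are represented by a single membership predicate since the family is symmetric.

```python
import random


class FilterHash:
    """h(x) = min{ i : x in F_i } over a lazily extended sequence of sampled filters."""

    def __init__(self, sample_filter, max_filters=10_000):
        self.sample_filter = sample_filter  # callable returning a membership predicate
        self.filters = []  # filters sampled so far, shared by all evaluations
        self.max_filters = max_filters  # safety cap (assumption); the proof uses an infinite sequence

    def __call__(self, x):
        i = 0
        while i < self.max_filters:
            if i == len(self.filters):
                # Extend the sequence only when an evaluation reaches its end.
                self.filters.append(self.sample_filter())
            if self.filters[i](x):
                return i
            i += 1
        raise RuntimeError("no filter contained x within the cap")
```

Two points collide exactly when the first filter containing either of them contains both, which is where the ratios $p_{1}/p_{q}$ and $p_{2}/p_{q}$ in Lemma 3.6 come from.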

If the LSH family in Lemma 3.6 had $p=p_{1}/p_{q}$ and $q=p_{2}/p_{q}$ then the lower bound would follow immediately. We apply the powering technique from Lemma 3.1 to the underlying filter family in order to make the factor $2$ in $p_{1}/(2p_{q})$ disappear in the statement of $\rho$ as $d$ tends to infinity.

Theorem 1.4.

Every symmetric $(r,cr,p_{1},p_{2},p_{q},p_{u})$ -sensitive filter family $\mathcal{F}$ for $\ell_{s}^{d}$ must satisfy the lower bound of Theorem 3.7 with $p=p_{1}/p_{q}$ and $q=p_{2}/p_{q}$ .

Proof.

Given a family $\mathcal{F}$ that satisfies the requirements from Theorem 3.7 there exists an integer $\kappa=\omega_{d}(1)$ such that the hash family $\mathcal{H}$ that results from applying Lemma 3.6 to the powered family $\mathcal{F}^{\kappa}$ also satisfies the requirements from Theorem 3.7. The constructed family $\mathcal{H}$ is $(r,cr,p,q)$ -sensitive for $p=(1/2)\cdot(p_{1}/p_{q})^{\kappa}$ and $q=(p_{2}/p_{q})^{\kappa}$ . By our choice of $\kappa$ we have that $\log(1/p)/\log(1/q)=\log(p_{q}/p_{1})/\log(p_{q}/p_{2})+o_{d}(1)$ and the lower bound on $\log(1/p)/\log(1/q)$ from Theorem 3.7 applies. ∎
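The effect of powering on the factor $1/2$ can be seen numerically. In the sketch below the function names and the example probabilities are assumptions chosen for illustration; the gap between the two exponents equals $\ln 2/(\kappa\ln(p_{q}/p_{2}))$ and therefore vanishes as $\kappa$ grows.

```python
import math


def powered_hash_exponent(p1, p2, pq, kappa):
    """Exponent log(1/p) / log(1/q) of the hash family built from the powered filters."""
    p = 0.5 * (p1 / pq) ** kappa
    q = (p2 / pq) ** kappa
    return math.log(1 / p) / math.log(1 / q)


def filter_exponent(p1, p2, pq):
    """Limiting exponent log(pq / p1) / log(pq / p2)."""
    return math.log(pq / p1) / math.log(pq / p2)
```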

17.1 Asymmetric lower bound

The lower bound is based on an isoperimetric-type inequality that holds for randomly correlated points in Hamming space. We say that the pair of points $(x,y)$ is $\alpha$ -correlated if $x$ is a random point in $\{-1,1\}^{d}$ and $y$ is formed by taking $x$ and independently flipping each bit with probability $(1-\alpha)/2$ . We are now ready to state O’Donnell’s generalized small-set expansion theorem. Notice the similarity to the value of $p_{1}$ for the Gaussian filter family described in Section 15 and Appendix 20.

Lemma 3.7 ([120, p. 285]).

For every $0\leq\alpha<1$ , $-1\leq\lambda\leq 1$ , and $Q,U\subseteq\{-1,1\}^{d}$ satisfying that $|Q|/2^{d}=(|U|/2^{d})^{\alpha^{2\lambda}}$ we have

[TABLE]

The argument for the lower bound assumes a regular $(r,cr,p_{1},p_{2},p_{q},p_{u})$ -sensitive filter family $\mathcal{F}$ for Hamming space where we set $r=(1-\alpha)d/2$ and $cr=(1-\beta)d/2$ for some choice of $0<\beta<\alpha<1$ . We then proceed by deriving constraints on $p_{1}$ , $p_{2}$ , $p_{q}$ , $p_{u}$ , and minimize $\rho_{q}$ and $\rho_{u}$ subject to those constraints. The proof of Theorem 1.5 is provided in Appendix 22.
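The choice $r=(1-\alpha)d/2$ matches the expected Hamming distance of an $\alpha$ -correlated pair, since each of the $d$ bits is flipped independently with probability $(1-\alpha)/2$ . A small sampling sketch, with hypothetical function names, makes this concrete:

```python
import random


def correlated_pair(d, alpha, rng):
    """Sample x uniformly from {-1, 1}^d and flip each bit with probability (1 - alpha) / 2."""
    x = [rng.choice((-1, 1)) for _ in range(d)]
    flip = (1 - alpha) / 2
    y = [-xi if rng.random() < flip else xi for xi in x]
    return x, y


def hamming(x, y):
    """Number of coordinates in which x and y differ."""
    return sum(xi != yi for xi, yi in zip(x, y))
```

For large $d$ the Hamming distance concentrates around $(1-\alpha)d/2$ and the normalized inner product $\langle{x},{y}\rangle/d$ around $\alpha$ .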

Theorem 1.5.

Fix $0<\beta<\alpha<1$ . Then for every regular $((1-\alpha)d/2,(1-\beta)d/2,p_{1},p_{2},p_{q},p_{u})$ -sensitive filter family in $d$ -dimensional Hamming space with $|Q|/2^{d}=(|U|/2^{d})^{\alpha^{2\lambda}}$ where $\lambda$ satisfies $\alpha+2\sqrt{\ln(d)/d}\leq\alpha^{\lambda}\leq 1/(\alpha-2\sqrt{\ln(d)/d})$ it must hold that

[TABLE]

when $p_{q}$ is set to minimize $\rho_{q}$ and we assume that $|U|/2^{d}\geq 2^{-o_{d}(1)}$ .

18 Open problems

An important open problem is to find simple and practical data-dependent solutions to the $(r,cr)$ -near neighbor problem. Current solutions, the Gaussian filters in this paper included, suffer from $o(1)$ terms in the exponents that decrease very slowly in $n$ . A lower bound for the unit sphere by Andoni et al. [13] indicates that this might be unavoidable.

Another interesting open problem is finding the shape of provably exactly optimal filters in different spaces. In the random data setting in Hamming space, this problem boils down to maximizing the number of pairs of points below a certain distance threshold that are contained in a subset of the space of a certain size. This is a fundamental problem in combinatorics that has been studied by, among others, [94], but a complete answer remains elusive. The LSH and LSF lower bounds [114, 121, 19], along with classical isoperimetric inequalities such as Harper’s Theorem and more recent work summarized in the book by O’Donnell [120], hint that the answer is somewhere between a subcube and a generalized sphere.

A recent result by Chierichetti and Kumar [49] characterizes the set of transformations of LSH-able similarity measures as the set of probability-generating functions. This seems to have deep connections to the results of this paper, which use characteristic functions that allow well-known kernel transformations. It seems possible that this paper can be viewed as a semi-explicit construction of their result, or that their result can be described as an application of Bochner’s Theorem.

Acknowledgment

I would like to thank Rasmus Pagh for suggesting the application of Rahimi & Recht’s result [138] and the MinHash-like [32] connection between LSF and LSH used in Theorem 1.4. I would also like to thank Gregory Valiant and Udi Wieder for useful discussions about locality-sensitive filtering and the analysis of boolean functions during my stay at Stanford. Finally, I would like to thank the Scalable Similarity Search group at the IT University of Copenhagen for feedback during the writing process, and in particular Martin Aumüller for pointing out the importance of a general framework for locality-sensitive filtering with space-time tradeoffs.

19 Appendix: Framework

We state a version of Theorem 3.1 where the parameters of the filter family are allowed to depend on $n$ .

Theorem 3.1.

Suppose we have access to a filter family that is $(r,cr,p_{1},p_{2},p_{q},p_{u})$ -sensitive. Then we can construct a fully dynamic data structure that solves the $(r,cr)$ -near neighbor problem. If $1/p_{1}$ , $1/\log(p_{q}/p_{2})$ , and $\exp(\log(1/p_{1})/\log(\min(p_{q},p_{u})/p_{1}))$ are $n^{o(1)}$ , then the data structure has

–

query time $dn^{\rho_{q}+o(1)}$ ,

–

update time $n^{\rho_{u}+o(1)}+dn^{o(1)}$ ,

–

space usage $n^{1+\rho_{u}+o(1)}+dn+dn^{o(1)}$

where

[TABLE]

To prove Theorem 3.1, we begin by setting the parameters mentioned in the description of the LSF data structure in Section 14.1.

[TABLE]

We will now briefly explain the reasoning behind the parameter settings. Begin by observing that the powering and tensoring techniques both amplify the filters from $\mathcal{F}$ . Let $m=\binom{m_{1}}{\tau}\cdot m_{2}$ denote the number of simulated filters in our collection $\mathbf{F}$ and let $a=\tau\kappa_{1}+\kappa_{2}$ be an integer denoting the number of times each filter has been amplified. Ignoring the time it takes to evaluate the filters, the query time is determined by the sum of the number of filters that contain a query point and the number of distant points associated with those filters that the query algorithm inspects. The expected number of activated filters is given by $mp_{q}^{a}$ while the worst-case expected number of distant points to be inspected by the query algorithm is given by $nmp_{2}^{a}$ . Balancing the contribution to the query time from these two effects (ignoring the $O(d)$ factor from distance computations) results in a target value of $a=\lceil\log(n)/\log(p_{q}/p_{2})\rceil$ . Compared to having an oracle that is able to list the filters in a collection that contain a point, there is a small loss in efficiency from using the tensoring technique due to the increase in the number of filters required to guarantee correctness. The parameters of the LSF data structure are therefore set to minimize the use of tensoring such that the time spent evaluating our collection of filters roughly matches the minimum of the query and update time.
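The balancing step can be illustrated with concrete numbers. The ratio of the two contributions is $nmp_{2}^{a}/(mp_{q}^{a})=n(p_{2}/p_{q})^{a}$ , and the target value of $a$ is the smallest integer that brings this ratio down to at most $1$ . The function name and the example values below are assumptions for illustration only.

```python
import math


def amplification_target(n, pq, p2):
    """Smallest integer a with n * (p2 / pq)**a <= 1, i.e. ceil(log n / log(pq / p2))."""
    return math.ceil(math.log(n) / math.log(pq / p2))


# Example values (assumptions): one million points, pq = 0.2, p2 = 0.01.
n, pq, p2 = 10**6, 0.2, 0.01
a = amplification_target(n, pq, p2)

# Expected number of distant points inspected per activated query filter.
distant_per_filter = n * (p2 / pq) ** a
```

With these values one level less of amplification would leave more than one expected distant point per activated filter, so the distance computations would dominate the query time.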

Consider the initialization operation of the LSF data structure with the parameter settings from above. We have that $\kappa_{2}\leq\kappa_{1}$ implying that $m_{2}=O(m_{1})$ . The initialization time and the space usage of the data structure prior to any insertions is dominated by the time and space used to sample and store the filters in $\mathbf{F}_{1}$ . By the assumption that a filter from $\mathcal{F}$ can be sampled in $O(d)$ operations and stored using $O(d)$ words, we get a space and time bound on the initialization operation of

[TABLE]

Importantly, this bound also holds for the running time of the filter evaluation algorithm, that is, the preprocessing time required for constant time generation of the next element in the list of filters in $\mathbf{F}$ containing a point. In the following analysis of the update and query time we will temporarily ignore the running time of the filter evaluation algorithm.

The expected time to insert or delete a point is dominated by the number of update filters in $\mathbf{F}$ that contain it. The probability that a particular update filter in $\mathbf{F}$ contains a point is given by $p_{u}^{a}$ . Using a standard upper bound on the binomial coefficient we get that $m=O(e^{\tau}/p_{1}^{a})$ resulting in an expected update time of

[TABLE]

In the worst case where every data point is at distance greater than $cr$ from the query point and has collision probability $p_{2}$ the expected query time can be upper bounded by

[TABLE]

With respect to the correctness of the query algorithm, if a near neighbor $y$ to the query point $x$ exists in $P$ , then it is found by the query algorithm if $(x,y)$ is contained in a filter in $\mathbf{F}_{1}^{\otimes\tau}\!$ as well as in a filter in $\mathbf{F}_{2}$ . By Lemma 3.2 the first event happens with probability at least $1/2$ and by the choice of $m_{2}$ , the second event happens with probability at least $1-(1-p_{1}^{\kappa_{2}})^{1/p_{1}^{\kappa_{2}}}\geq 1-1/e$ . From the independence between $\mathbf{F}_{1}$ and $\mathbf{F}_{2}$ both events happen with probability at least $(1/2)(1-1/e)$ , so we can upper bound the failure probability by $\delta\leq 1-(1/2)(1-1/e)=(1/2)(1+1/e)$ . This completes the proof of Theorem 3.1.
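The two probabilities in this argument are easy to check numerically. The sketch below assumes that $\mathbf{F}_{2}$ contains $\lceil 1/p\rceil$ filters, where $p$ stands for $p_{1}^{\kappa_{2}}$ ; this value of $m_{2}$ and the function name are assumptions made for illustration.

```python
import math


def success_lower_bound(p):
    """Lower bound on the probability that a close pair shares a simulated filter.

    p stands for p1**kappa2; the number of filters in F_2 is assumed to be ceil(1 / p).
    Returns (probability of a hit in F_2, overall lower bound).
    """
    m2 = math.ceil(1 / p)
    second_event = 1 - (1 - p) ** m2  # some filter in F_2 contains the pair
    first_event = 0.5  # guaranteed by Lemma 3.2 for the tensored collection
    return second_event, first_event * second_event
```

For every $p\in(0,1)$ the second event has probability at least $1-1/e$ , so the failure probability stays below $(1/2)(1+1/e)\approx 0.684$ .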

20 Appendix: Gaussian filters

In this section we upper and lower bound the probability mass in the tail of the bivariate standard normal distribution when the correlation between the two standard normals is at most $\beta$ (upper bound) or at least $\alpha$ (lower bound). We make use of the following upper and lower bounds on the univariate standard normal as well as an upper bound for the multivariate case.

Lemma 3.8 (Follows Szarek & Werner [153]).

Let $Z$ be a standard normal random variable. Then, for every $t\geq 0$ we have that

[TABLE]

Lemma 3.9 (Lu & Li [104]).

Let $z$ be a $d$ -dimensional vector of i.i.d. standard normal random variables and let $D\subset\mathbb{R}^{d}$ be a closed convex domain that does not contain the origin. Let $\Delta$ denote the Euclidean distance from the origin to the unique closest point in $D$ . Then we have that

[TABLE]

Lemma 3.10 (Tail upper bound).

For $\alpha,\lambda,t,\beta$ satisfying $0<\alpha<1$ , $-1\leq\lambda\leq 1$ , $t>0$ , and $-1<\beta<\alpha$ every pair of standard normal random variables $(X,Y)$ with correlation $\beta^{\prime}\leq\beta$ satisfies

[TABLE]

where $\Delta^{2}=(1+\frac{(\alpha^{\lambda}-\beta)^{2}}{1-\beta^{2}})t^{2}$ .

Proof.

For $\beta^{\prime}=-1$ the result is trivial. For values of $\beta^{\prime}$ in the range $-1<\beta^{\prime}\leq\beta$ we use the $2$ -stability of the normal distribution to analyze a tail bound for $(X,Y)$ in terms of a Gaussian projection vector $z=(Z_{1},Z_{2})$ applied to unit vectors $x,y\in\mathbb{R}^{2}$ . That is, we can define $X=\langle{z},{x}\rangle$ and $Y=\langle{z},{y}\rangle$ for some appropriate choice of $x$ and $y$ . Without loss of generality we set $x=(1,0)$ and note that for $\operatorname*{\mathbb{E}}[XY]=\beta^{\prime}$ we must have that $y=(\beta^{\prime},\sqrt{1-\beta^{\prime 2}})$ . If we consider the region of $\mathbb{R}^{2}$ where $z$ satisfies $X\geq t\land Y\geq\alpha^{\lambda}t$ we get a closed domain $D$ defined by $z=(Z_{1},Z_{2})$ such that $Z_{1}\geq t$ and $Z_{2}\geq(\alpha^{\lambda}t-\beta^{\prime}Z_{1})/(\sqrt{1-\beta^{\prime 2}})$ . The squared Euclidean distance from the origin to the closest point in $D$ is at least $\Delta^{2}$ , as can be seen from the fact that $\Delta^{2}$ is decreasing in $\beta$ . Combining this observation with Lemma 3.9 we get the desired result. ∎

Lemma 3.11 (Tail lower bound).

For $\alpha,\lambda,t$ satisfying $0<\alpha<1$ , $-1\leq\lambda\leq 1$ , and $t>0$ every pair of standard normal random variables $(X,Y)$ with correlation $\alpha^{\prime}\geq\alpha$ satisfies

[TABLE]

where $\Delta^{2}=(1+\frac{(\alpha^{\lambda}-\alpha)^{2}}{1-\alpha^{2}})t^{2}$ .

Proof.

For $\alpha^{\prime}=1$ the result follows directly from Lemma 3.8. For $\alpha^{\prime}<1$ we use the trick from the proof of Lemma 3.10 and define $X=\langle{z},{x}\rangle$ and $Y=\langle{z},{y}\rangle$ where $x=(1,0)$ and $y=(\alpha,\sqrt{1-\alpha^{2}})$ and $z=(Z_{1},Z_{2})$ is a vector of two i.i.d. standard normal random variables. This allows us to rewrite the probability as follows:

[TABLE]

By the restrictions on $\alpha$ and $\lambda$ we have that $(\alpha^{\lambda}-\alpha)t/\sqrt{1-\alpha^{2}}\leq t/\alpha$ . The result follows from applying the lower bound from Lemma 3.8 and noting that the bound is increasing in $\alpha$ . ∎

20.1 Space-time tradeoffs on the unit sphere

Summarizing the bound from the previous section, the family $\mathcal{G}$ from Lemma 3.3 satisfies that

[TABLE]

We combine the Gaussian filters with Theorem 3.1 to show that we can solve the $(\alpha,\beta)$ -similarity problem efficiently for the full range of space/time tradeoffs, even when $\alpha,\beta$ are allowed to depend on $n$ , as long as the gap $\alpha-\beta$ is not too small.

Theorem 3.2.

For every choice of $0\leq\beta<\alpha<1$ and $\lambda\in[-1,1]$ such that $\alpha-\beta\geq(\ln n)^{-\zeta}$ for some constant $\zeta<1/2$ , we can construct a fully dynamic data structure that solves the $(\alpha,\beta)$ -similarity problem in $(\mathbb{S}^{d},\langle{\cdot},{\cdot}\rangle)$ and satisfies the guarantees from Theorem 3.1 with exponents $\rho_{q}=\left.\frac{(1-\alpha^{1+\lambda})^{2}}{1-\alpha^{2}}\middle/\frac{(1-\alpha^{\lambda}\beta)^{2}}{1-\beta^{2}}\right.$ and $\rho_{u}=\left.\frac{(\alpha^{\lambda}-\alpha)^{2}}{1-\alpha^{2}}\middle/\frac{(1-\alpha^{\lambda}\beta)^{2}}{1-\beta^{2}}\right.$ .

Proof.

Assuming that $\alpha-\beta\geq(\ln n)^{-\zeta}$ there exists a constant $\varepsilon>0$ such that setting the parameter $t$ of $\mathcal{G}$ to satisfy $t^{2}/2=\frac{1-\beta^{2}}{(1-\alpha^{\lambda}\beta)^{2}}(\ln n)^{\varepsilon}$ makes the family of filters satisfy the assumptions in Theorem 3.1 while guaranteeing that the second terms in $\rho_{q}$ and $\rho_{u}$ from Lemma 3.3 are $o(1)$ . ∎
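The exponents of Theorem 3.2 are simple to evaluate, which helps when exploring the tradeoff. The function name below is hypothetical and the sketch ignores the $o(1)$ terms.

```python
def sphere_exponents(alpha, beta, lam):
    """Leading terms of (rho_q, rho_u) from Theorem 3.2 for 0 <= beta < alpha < 1, lam in [-1, 1]."""
    denominator = (1 - alpha**lam * beta) ** 2 / (1 - beta**2)
    rho_q = (1 - alpha ** (1 + lam)) ** 2 / (1 - alpha**2) / denominator
    rho_u = (alpha**lam - alpha) ** 2 / (1 - alpha**2) / denominator
    return rho_q, rho_u
```

For random data ( $\beta=0$ ) the balanced choice $\lambda=0$ gives $\rho_{q}=\rho_{u}=(1-\alpha)/(1+\alpha)$ , matching the LSH lower bound on the unit sphere, while $\lambda=1$ gives $\rho_{u}=0$ (near-linear space) and $\lambda=-1$ gives $\rho_{q}=0$ .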

Remark 3.1.

Theorem 3.2 aims for simplicity and generality while allowing $\alpha$ and $\beta$ to depend on $n$ . For specific values of $\alpha,\beta,\lambda$ it is easy to find better bounds on the probabilities (e.g. the bounds by Savage [142]) and to adjust $t$ in Lemma 3.3 to avoid powering (setting $\kappa_{1}=1,\kappa_{2}=0$ ) in the LSF framework.

21 Appendix: Approximate feature maps, characteristic functions, and Bochner’s Theorem

We begin by defining what a characteristic function is and listing some properties that are useful for our application. More information about characteristic functions can be found in the books by Lukacs [105] and Ushakov [161].

Lemma 3.12 ([105, 161]).

Let $Z$ denote a random variable with distribution function $\mu$ . Then the characteristic function $k(\Delta)$ of $Z$ is defined as

$$k(\Delta)=\operatorname*{\mathbb{E}}[e^{i\Delta Z}]=\int_{-\infty}^{\infty}e^{i\Delta z}\,d\mu(z)$$

and it has the following properties:

- A distribution function is symmetric if and only if its characteristic function is real and even.

- Every characteristic function $k(\Delta)$ is uniformly continuous, has $k(0)=1$ , and $|k(\Delta)|\leq 1$ for all real $\Delta$ .

- Suppose that $k(\Delta)$ denotes the characteristic function of an absolutely continuous distribution, then $\lim_{\Delta\rightarrow\infty}|k(\Delta)|=0$ .

- Let $X$ and $Y$ be independent random variables with characteristic functions $k_{X}$ and $k_{Y}$ . Then the characteristic function of $Z=(X,Y)$ is given by $k(x,y)=k_{X}(x)k_{Y}(y)$ .

Bochner’s Theorem reveals the relation between characteristic functions and the class of real-valued functions $k(x,y)$ that admit a feature space representation $k(x,y)=\langle{\phi(x)},{\phi(y)}\rangle$ .

Theorem 3.8 (Bochner’s Theorem [140]).

A function $k:\mathbb{R}^{d}\times\mathbb{R}^{d}\to[0,1]$ is positive definite if and only if it can be written in the form

[TABLE]

where $\mu$ is the probability density function of a symmetric distribution.

Rahimi & Recht’s [138] family of approximate feature maps $\mathcal{V}$ is constructed from Bochner’s Theorem by making use of Euler’s Theorem as follows:

[TABLE]

where the third equality makes use of the fact that $k(x,y)$ is real-valued to remove the complex part of the integral and the fifth equality uses that $2\cos(x)\cos(y)=\cos(x+y)+\cos(x-y)$ .

Now that we have an approximate feature map onto the sphere for the class of shift-invariant kernels, we will take a closer look at what functions this class contains, and what their applications are for similarity search. Given an arbitrary similarity function, we would like to be able to determine whether it is indeed a characteristic function. Unfortunately, there are no known simple techniques for answering this question in general. However, the machine learning literature contains many applications of different shift-invariant kernels [144] and many common distributions have real characteristic functions (see Appendix B in [161] for a long list of examples). Characteristic functions are also well studied from a mathematical perspective [105, 161], and a number of different necessary and sufficient conditions are known. A classical result by Pólya [135] gives simple sufficient conditions for a function to be a characteristic function. Through the vectorization property from Lemma 3.12, Pólya’s conditions directly imply the existence of a large class of similarity measures on $\mathbb{R}^{d}$ that can fit into the above framework.

Theorem 3.9 (Pólya [135]).

Every even continuous function $k:\mathbb{R}\to\mathbb{R}$ satisfying the properties

- $k(0)=1$

- $\lim_{\Delta\to\infty}k(\Delta)=0$

- $k(\Delta)$ is convex for $\Delta>0$

is a characteristic function.
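
As an illustration of Pólya's conditions, the sketch below checks them numerically on a finite grid. This is only a sanity check under our own assumptions (a finite grid on $[0,50]$ , a fixed tolerance, and the value at the right endpoint used as a proxy for the limit), not a proof; the helper name `polya_conditions_hold` is hypothetical. A natural test case is $k(\Delta)=e^{-|\Delta|}$ , the characteristic function of the standard Cauchy distribution.

```python
import math

def polya_conditions_hold(k, upper=50.0, steps=5000, tol=1e-9):
    """Grid-based check of Polya's sufficient conditions for an even function k."""
    h = upper / steps
    if abs(k(0.0) - 1.0) > tol:          # k(0) = 1
        return False
    if abs(k(upper)) > 1e-6:             # proxy for k(Delta) -> 0
        return False
    for i in range(1, steps):
        t = i * h
        if abs(k(t) - k(-t)) > tol:      # evenness
            return False
        # Non-negative second difference as a discrete convexity test on Delta > 0.
        if k(t - h) - 2.0 * k(t) + k(t + h) < -tol:
            return False
    return True
```

Note that the conditions are sufficient but not necessary: the Gaussian function $e^{-\Delta^{2}}$ is a characteristic function, yet it is concave near the origin and therefore fails the convexity check.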

Based on the results of Section 16.1 one could hope for the existence of characteristic functions of the form $k(\Delta)=e^{-|\Delta|^{s}}$ for $s>2$ but it is known that such functions cannot exist [25, Theorem D.8]. Furthermore, Marcinkiewicz [108] shows that a function of the form $k(\Delta)=\exp(-\operatorname{poly}(\Delta))$ cannot be a characteristic function if the degree of the polynomial is greater than two.

We state a more complete, constructive version of Lemma 3.4 as well as the proof here.

Lemma 3.13.

Let $k$ be a real-valued characteristic function with associated distribution function $\mu$ and let $l$ be a positive integer. Consider the family of functions $\mathcal{V}\subseteq\{v\mid v\colon\mathbb{R}^{d}\to\mathbb{S}^{l}\}$ where a randomly sampled function $v$ is defined by, independently for $j=1,\dots,l$ , sampling $v$ from $\mu$ and $b$ uniformly on $[0,2\pi]$ , letting $\hat{v}(x)_{j}=\sqrt{2/l}\cos(\langle{v},{x}\rangle+b)$ and normalizing $v(x)=\frac{\hat{v}(x)}{\left\lVert\hat{v}(x)\right\rVert}$ . The family $\mathcal{V}$ has the property that for every $x,y\in\mathbb{R}^{d}$ and $\varepsilon>0$ we have that

[TABLE]

Proof.

Since $l\cdot\hat{v}(x)_{j}\hat{v}(y)_{j}$ is bounded between $2$ and $-2$ , and we have independence for different values of $j$ , Hoeffding’s inequality [88] can be applied to show that for every fixed pair of points $x,y$ and $\hat{\varepsilon}>0$ it holds that

[TABLE]

From the properties of characteristic functions we have that $k(x,x)=1$ and $k(x,y)\leq 1$ . The bound on the deviation of

[TABLE]

from $k(x,y)$ follows from setting $\hat{\varepsilon}=\varepsilon/4$ and using a union bound over the probabilities that the deviation of one of the inner products is too large. ∎
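
The construction in Lemma 3.13 can be sketched in a few lines. The example below is an illustration under stated assumptions rather than the construction used in the proofs: it fixes the Gaussian kernel $k(x-y)=e^{-\lVert x-y\rVert^{2}/2}$ , whose associated distribution is the standard normal, and it uses Python's `random` module as the source of randomness. The names `sample_feature_map` and `inner` are ours.

```python
import math
import random

def sample_feature_map(d, l, rng):
    """Sample a map from R^d to unit vectors with l coordinates (Gaussian kernel)."""
    # One direction (drawn from the standard normal) and one shift per coordinate.
    dirs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(l)]
    shifts = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(l)]
    scale = math.sqrt(2.0 / l)

    def v(x):
        raw = [scale * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + b)
               for w, b in zip(dirs, shifts)]
        norm = math.sqrt(sum(c * c for c in raw))
        return [c / norm for c in raw]   # normalize onto the unit sphere

    return v

def inner(u, w):
    """Standard inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, w))
```

For two points `x` and `y`, `inner(v(x), v(y))` should then be close to `math.exp(-0.5 * dist2)` where `dist2` is the squared Euclidean distance, with the deviation shrinking as `l` grows.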

Combining the approximate feature map onto the unit sphere with Theorem 3.2 we obtain the following:

Theorem 3.10.

Let $k\colon\mathbb{R}^{d}\to\mathbb{R}$ be a characteristic function and define the similarity measure $S(x,y)=k(x-y)$ . Assume that we have access to samples from the distribution associated with $k$ . Then Theorem 3.2 holds with $(\mathbb{S}^{d},\langle{\cdot},{\cdot}\rangle)$ replaced by $(\mathbb{R}^{d},S)$ .

Proof.

According to Lemma 3.13, we can set $l=n^{o(1)}$ to obtain a map $v\colon\mathbb{R}^{d}\to\mathbb{S}^{l}$ such that the inner product on $\mathbb{S}^{l}$ preserves the pairwise similarity between $n^{O(1)}$ points with additive error $\varepsilon=o(1)$ . This map has a space and time complexity of $O(dl)=dn^{o(1)}$ . After applying $v$ to the data we can solve the $(\alpha,\beta)$ -similarity problem on $(\mathbb{R}^{d},k(x-y))$ by solving the $(\alpha-\varepsilon,\beta+\varepsilon)$ -similarity problem on $(\mathbb{S}^{d},\langle{\cdot},{\cdot}\rangle)$ . We can use Theorem 3.2 to construct a fully dynamic data structure for solving this problem, adjusting the parameter $\lambda$ so that it continues to lie in the admissible range. The space and time complexities follow. ∎

22 Appendix: Proof of tradeoff lower bound

Consider $\rho_{q}=\frac{\log(p_{q}/p_{1})}{\log(p_{q}/p_{2})}$ . Subject to the (implicit) LSF constraint that $p_{q},p_{u}>p_{1}>p_{2}>0$ we see that $\rho_{q}$ is minimized by setting $p_{q},p_{2}$ as small as possible and $p_{1}$ as large as possible. We will therefore derive lower bounds on $p_{q},p_{2}$ and an upper bound on $p_{1}$ . For every value of $p_{1}$ and $p_{2}$ we minimize $\rho_{q},\rho_{u}$ by choosing $p_{q}$ as small as possible.

For a random point $x\in\{-1,1\}^{d}$ it must hold that $\Pr_{\mathcal{F}}[x\in Q]=|Q|/2^{d}$ . This implies the existence of a fixed point $y\in\{-1,1\}^{d}$ with the property that $\Pr_{\mathcal{F}}[y\in Q]\geq|Q|/2^{d}$ . A regular filter family must therefore satisfy that $p_{q}\geq|Q|/2^{d}$ and $p_{u}\geq|U|/2^{d}$ . Let $\lambda$ be defined as in Lemma 3.7; then by a similar argument we have that $p_{2}\geq(|U|/2^{d})^{1+\alpha^{2\lambda}}$ .

In order to upper bound $p_{1}$ we make use of Lemma 3.7 together with the following lemma that follows directly from an application of Hoeffding’s inequality [88].

Lemma 3.14.

For every $0<\varepsilon<(1-\alpha)/2$ we have that

[TABLE]

In the following derivation, assume that $\alpha,\varepsilon$ satisfy $0<\varepsilon<(1-\alpha)/2$ , let $x,y$ denote randomly $(\alpha+\varepsilon)$ -correlated vectors in $\{-1,1\}^{d}$ , and assume that $\alpha+\varepsilon\leq\alpha^{\lambda}\leq 1/(\alpha+\varepsilon)$ , then

[TABLE]

Summarizing the bounds:

[TABLE]

When minimizing $\rho_{q}$ we have that $\log(p_{q}/p_{2})=-\log(|U|/2^{d})$ . Setting $\varepsilon=2\sqrt{\ln(d)/d}$ results in $\log(1/p_{1})\geq-\frac{1+\alpha^{2\lambda}-2\alpha^{\lambda}(\alpha+\varepsilon)}{1-\alpha^{2}}\log(|U|/2^{d})-O(1/d^{2})$ . Putting things together:

[TABLE]

The derivation of the lower bound for $\rho_{u}$ is almost the same and the resulting expression is

[TABLE]

23 Appendix: Comparison to Kapralov

Kapralov uses $\alpha$ to denote a parameter controlling the space-time tradeoff for his solution to the $(r,cr)$ -near neighbor problem in Euclidean space. For every choice of tradeoff parameter $\alpha\in[0,1]$ , assuming that $c^{2}\geq 3(1-\alpha)^{2}-\alpha^{2}+\varepsilon$ for arbitrarily small constant $\varepsilon>0$ , Kapralov [95] obtains query and update exponents

[TABLE]

We convert Kapralov’s notation to our own by setting $\lambda=1-2\alpha$ . To compare, Kapralov sets $\alpha=0$ for near-linear space and we set $\lambda=1$ . We want to write Kapralov’s exponents in the form

[TABLE]

for some $x$ that we will proceed to derive. We have that $(1-\alpha)^{2}=(1+\lambda)^{2}/4$ and $\alpha^{2}=(1-\lambda)^{2}/4$ . Multiplying the numerator and denominator in Kapralov’s exponents by $c^{2}$ we can write Kapralov’s exponents as

[TABLE]

We have that

[TABLE]

For every choice of $\lambda\in[-1,1]$ , and under the assumption that $c^{2}\geq(1+\lambda)^{2}/2+\lambda+\varepsilon$ for an arbitrarily small constant $\varepsilon>0$ , this allows us to write Kapralov’s exponents as

[TABLE]

To compare Kapralov’s result against our own for search in $\ell_{s}$ -spaces we consider the exponents from Theorem 3.3, ignoring additive $o(1)$ terms:

[TABLE]

Setting $\lambda=1$ we obtain a data structure that uses near-linear space and we get a query exponent $\rho_{q}=16/25$ while Kapralov obtains an exponent of $\rho_{q}=16/20$ , ignoring $o(1)$ terms. At the other end of the tradeoff, setting $\lambda=-1$ , we get a data structure with query time $n^{o(1)}$ and update exponent $\rho_{u}=16/9$ while Kapralov gets an update exponent of $\rho_{u}=4$ , again ignoring additive $o(1)$ terms.

The assumption made by Kapralov that $c^{2}\geq(1+\lambda)^{2}/2+\lambda+\varepsilon$ means that in the case of a near-linear space data structure ( $\lambda=1$ ) sublinear query time can only be obtained for $c>\sqrt{3}$ . In contrast, Theorem 3.3 gives sublinear query time for every constant $c>1$ .
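
For completeness, the conversion of Kapralov's assumption into our notation can be checked directly. The following worked derivation only uses the substitution $\lambda=1-2\alpha$ from above, i.e. $1-\alpha=(1+\lambda)/2$ and $\alpha=(1-\lambda)/2$ :

```latex
\begin{align*}
3(1-\alpha)^{2}-\alpha^{2}
  &= \tfrac{3}{4}(1+\lambda)^{2}-\tfrac{1}{4}(1-\lambda)^{2} \\
  &= \tfrac{1}{4}\bigl(2+8\lambda+2\lambda^{2}\bigr) \\
  &= \tfrac{1}{2}(1+\lambda)^{2}+\lambda .
\end{align*}
```

At $\lambda=1$ the right-hand side equals $3$ , which is where the requirement $c>\sqrt{3}$ for near-linear space comes from.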

24 Appendix: Details about dynamization and the model of computation

In order to obtain fully dynamic data structures we apply a powerful dynamization result of Overmars and Leeuwen [122] for decomposable searching problems. Their result allows us to turn a partially dynamic data structure into a fully dynamic data structure, supporting arbitrary sequences of queries and updates, at the cost of a constant factor in the space and running time guarantees. Suppose we have a partially dynamic data structure that solves the $(r,cr)$ -near neighbor problem on a set of $n$ points. By partially dynamic we mean that, after initialization on a set $P$ of $n$ points, the data structure supports $\Theta(n)$ updates without changing the query time by more than a constant factor. Let $T_{q}(n)$ , $T_{u}(n)$ , and $T_{c}(n)$ denote the query time, update time, and construction time of such a data structure containing $n$ points. Then, by Theorem 1 of Overmars and Leeuwen [122], there exists a fully dynamic version of the data structure with query time $O(T_{q}(n))$ and update time $O(T_{u}(n)+T_{c}(n)/n)$ that uses only a constant factor additional space. The data structures presented in this paper, as well as most related constructions from the literature, have the property that $T_{c}(n)/n=O(T_{u}(n))$ , allowing us to go from a partially dynamic to a fully dynamic data structure “for free” in big O notation.

In terms of guaranteeing that the query operation solves the $(r,cr)$ -near neighbor problem on the set of points $P$ currently inserted into the data structure, we allow a constant failure probability $\delta<1$ , typically around $1/2$ , and omit it from our statements. We make the standard assumption that the adversary does not have knowledge of the randomness used by the data structure. Say we have a data structure with constant failure probability and a bound on the expected space usage. Then, for every positive integer $T$ we can create a collection of $O(\log T)$ independent repetitions of the data structure such that for every sequence of $T$ operations it holds with high probability in $T$ that the space usage will never exceed the expectation by more than a constant factor and no query will fail.
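
The number of independent repetitions needed to drive down the query failure probability can be estimated with a short calculation. The sketch below makes a simplifying assumption of ours: a query on the combined structure fails only if it fails in all $r$ independent copies, so the failure probability is $\delta^{r}$ . It ignores the space bound, which the text handles with the same repetition argument. The function name `repetitions_needed` and the parameter `c` are hypothetical.

```python
def repetitions_needed(delta, T, c=2):
    """Smallest r such that delta**r <= T**(-c).

    delta: failure probability of a single copy, 0 < delta < 1.
    T: number of operations in the sequence.
    c: exponent of the target per-query failure probability T**(-c).
    """
    if not (0 < delta < 1):
        raise ValueError("need 0 < delta < 1")
    if T < 1:
        raise ValueError("need T >= 1")
    target = float(T) ** (-c)
    r, fail = 0, 1.0
    while fail > target:
        fail *= delta   # one more independent copy
        r += 1
    return r
```

With a per-query failure probability of $T^{-c}$ , a union bound over $T$ queries gives total failure probability at most $T^{1-c}$ , so any constant $c\geq 2$ yields $r=O(\log T)$ repetitions.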

24.1 Model of computation

We use the standard word RAM model as defined by Hagerup [85] with a word size of $\Theta(\log n)$ bits. Unless otherwise stated, we make the assumption that a point in $(X,D)$ can be stored in $d$ words and that the dissimilarity between two arbitrary points can be computed in $d$ operations where $d$ is a positive integer that corresponds to the dimension in the various well-studied settings mentioned in the main text. Furthermore, when describing framework-based solutions to the $(r,cr)$ -near neighbor problem, we make the assumption that we can sample, evaluate, and represent elements from $\mathcal{F}$ and $\mathcal{H}$ with negligible error using space and time $dn^{o(1)}$ .

Many of the LSH and LSF families rely on random samples from the standard normal distribution. We will ignore potential problems resulting from rounding due to the fact that our model only supports finite precision arithmetic. This approach is standard in the literature and can be justified by noting that the error introduced by rounding is negligible. Furthermore, there exist small pseudorandom standard normal distributions that support sampling using only a few uniformly distributed bits as noted by Charikar [47]. In much of the related literature the model of computation is left unspecified and statements about the complexity of solutions to the $(r,cr)$ -near neighbor problem are usually made with respect to particular operations such as the hash function computations, distance computations, etc., leaving out other details [91, 86].

25 Addendum: An improved framework

The LSF framework in Theorem 3.1 suffers from large lower-order terms that depend on the $(r,cr,p_{1},p_{2},p_{q},p_{u})$ -sensitivity properties of $\mathcal{F}$ . With the parameterization in Appendix 19 the framework uses $O(\tau n^{\min(\rho_{q},\rho_{u})})$ filters from $\mathcal{F}$ where $\tau\leq\log(1/p_{1})/\log(\min(p_{q},p_{u})/p_{1})$ . In addition, the query and update time have a multiplicative factor $e^{\tau}$ which can potentially be very large and where we have to assume explicitly that $e^{\tau}=n^{o(1)}$ . We will use a combination of techniques in recent work on set similarity search [56] and fast locality-sensitive hashing frameworks [64, 53] to give an improved LSF framework with more precise complexity bounds.

The data structure produced by the framework follows the high-level approach as outlined in Section 14.1: queries and updates are mapped to a collection of buckets that are searched for similar points in the case of a query, or updated to store a reference to the point in the case of an update. Let $V\colon X\to R$ denote the mapping from query points to buckets and $W\colon X\to R$ denote the corresponding map for updates. The set of buckets $V(x)$ will be identified by the “survivors” of $w$ branching processes through $k$ collections of $m$ filters, similarly to the Chosen Path algorithm [56].

The data structure is initialized by sampling $k$ collections of $m$ filters. We will use the notation $Q_{i,j}$ ( $U_{i,j}$ ) to denote the $j$ th query (update) filter in the $i$ th collection. For $i=1,\dots,k$ let $h_{i}\colon[w]\times[m]^{i}\to[0,1]$ denote a pairwise independent random hash function. Let $\lambda\in[0,1]$ be a parameter to be determined later and let $\circ$ denote vector concatenation, then the locality-sensitive map $V$ is defined recursively as follows:

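$$V_{0}(x)=[w],\qquad V_{i}(x)=\{p\circ j\mid p\in V_{i-1}(x)\land h_{i}(p\circ j)<\lambda\land x\in Q_{i,j}\},\qquad V(x)=V_{k}(x).$$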

The map $W$ is defined in the same way except it uses $U_{i,j}$ instead of $Q_{i,j}$ .
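
The branching process behind $V$ and $W$ can be sketched directly from its description: a path $p$ at level $i-1$ is extended by $j\in[m]$ whenever $h_{i}(p\circ j)<\lambda$ and the point passes the $j$ th filter of the $i$ th collection. The code below is a naive illustration under our own assumptions: filters are modelled as Python predicates, and `make_hash` is a simple seeded stand-in for the pairwise independent hash functions $h_{i}$ , not an actual pairwise independent family. All names are hypothetical.

```python
import random

def make_hash(seed):
    """Stand-in for h_i: maps a path (tuple of ints) to a value in [0, 1)."""
    def h(path):
        return random.Random(hash((seed,) + path)).random()
    return h

def branching_map(x, filters, hashes, w, lam):
    """Return the set of surviving paths after k levels.

    filters[i][j] is a predicate playing the role of Q_{i+1,j+1} (or U_{i+1,j+1}).
    hashes[i] plays the role of h_{i+1}.
    """
    level = {(s,) for s in range(w)}           # level 0: the w starting points
    for filt_i, h_i in zip(filters, hashes):   # levels 1, ..., k
        nxt = set()
        for p in level:
            for j, contains in enumerate(filt_i):
                path = p + (j,)
                if h_i(path) < lam and contains(x):
                    nxt.add(path)
        level = nxt
    return level
```

Calling `branching_map` with the query filters gives the buckets probed by a query, and calling it with the update filters (and the same hash functions) gives the buckets an update writes to. This version spends time proportional to $m$ per path and level, which is exactly what the fast evaluation technique of Section 25.1 avoids.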

Properties.

To show that the maps $V,W$ provide an efficient solution to the $(r,cr)$ -near neighbor problem we need to show the following:

•

An upper bound on the expected size of $V(x)$ and $W(x)$ to bound the expected number of buckets probed during queries/updates.

•

An upper bound on the expected size of $V(x)\cap W(y)$ when $\operatorname{dist}(x,y)>cr$ to bound the expected number of distant points that will be encountered during the linear scan part of the query algorithm.

•

That $V(x)\cap W(y)$ is non-empty with constant probability when $\operatorname{dist}(x,y)\leq r$ to guarantee that the query algorithm encounters a point at distance at most $r$ with constant probability, provided such a point exists.

By the independence between the different levels $i=1,\dots,k$ in the branching process we have that

[TABLE]

Given $x,y\in X$ define $Z_{i}=V_{i}(x)\cap W_{i}(y)$ . Define $p=\Pr[x\in Q,y\in U]$ where $(Q,U)$ is sampled from $\mathcal{F}$ . The expected number of collisions between $x$ and $y$ at level $i$ is then given by

$$\operatorname*{\mathbb{E}}[|Z_{i}|]=w(m\lambda p)^{i}.$$

To show correctness of the scheme we will use Chebyshev’s inequality to show that with constant probability we have $|Z_{i}|>0$ for points $x,y$ with $\operatorname{dist}(x,y)\leq r$ . We proceed by upper bounding $\operatorname*{\mathbb{E}}[|Z_{i}|^{2}]$ in order to bound the variance $\mathrm{Var}[|Z_{i}|]=\operatorname*{\mathbb{E}}[|Z_{i}|^{2}]-\operatorname*{\mathbb{E}}[|Z_{i}|]^{2}$ . To ease the derivation we define $Y_{p,j}=\mathds{1}\{h_{i}(p\circ j)<\lambda\land(x,y)\in(Q_{i,j},U_{i,j})\}$ where we suppress the subscript $i$ . Without loss of generality we can assume that $p=p_{1}$ since $\operatorname{dist}(x,y)\leq r$ .

[TABLE]

Since $i\leq k$ if we set $m\geq\ln(1+\varepsilon)k/p_{1}$ we have that

[TABLE]

We will set the parameters in order to give a simple upper bound on the worst-case performance of the data structure. The constants can be improved.

[TABLE]

We can now bound the variance of $|Z_{k}|$ as follows:

[TABLE]

where we use the fact that $p_{1}\leq\min(p_{q},p_{u})$ . By Chebyshev’s inequality we have that

[TABLE]

By our parameter setting we have $\operatorname*{\mathbb{E}}[|Z_{k}|]=w(m\lambda p_{1})^{k}\geq 10k$ so $x,y$ collide with probability at least $7/10$ under $V,W$ , ensuring correctness.

25.1 Fast evaluation

We will use a hashing trick to compute $V_{k}(x)$ in expected time $O(k\operatorname*{\mathbb{E}}[|V_{k}(x)|])$ . This technique is only briefly mentioned in [56]. Observe that for the correctness argument to hold, it suffices that the hash functions $h_{1},\dots,h_{k}$ are sampled independently from a pairwise independent family [41, 42]. At the $i$ th step in the computation of $V_{k}(x)$ we wish to determine, for each $p\in V_{i-1}(x)$ the set of $j\in[m]$ satisfying $h_{i}(p\circ j)<\lambda$ and $x\in Q_{i,j}$ . In order to answer this efficiently we will make use of the property that a pairwise independent hash function can be decomposed as

$$h_{i}(p\circ j)=g_{i}(p)\oplus f_{i}(j)$$

where $g_{i},f_{i}$ are pairwise independent and $\oplus$ denotes addition in an abelian group. For concreteness assume that $g_{i},f_{i}$ map to $b$ -bit strings and let $\oplus$ denote the exclusive-or operator. If we view the $b$ -bit output of $h_{i}(p\circ j)$ as an integer in the set $\{0,1,\dots,2^{b}-1\}$ using the standard base two representation, the original condition $h_{i}(p\circ j)<\lambda$ can be transformed into the condition $h_{i}(p\circ j)<\lambda M$ where $M=2^{b}-1$ . By choosing $b=\Theta(\log n)$ we can with high probability determine whether the condition is satisfied without reading more than $b$ bits, so we can effectively treat the output of the hash function as a real number at the cost of a small increase in the failure probability of the data structure.

Continuing with the new representation, in order for $h_{i}(p\circ j)<\lambda M$ we must have that the leading $\kappa=b-\lceil\log_{2}(\lambda M)\rceil-1$ bits of the output of $g_{i}(p)\oplus f_{i}(j)$ are all zeroes. Given the leading $\kappa$ bits of $g_{i}(p)$ we can restrict our attention to $j\in[m]$ with the same value in the leading $\kappa$ bits of $f_{i}(j)$ . At the beginning of the query algorithm, for each $i\in[k]$ we determine the subset $J\subseteq[m]$ of indices $j$ such that $x\in Q_{i,j}$ . We then create a table with $2^{\kappa}$ linked lists and for each $j\in J$ we append $j$ to the linked list at the table entry given by the leading $\kappa$ bits of $f_{i}(j)$ . The running time and space usage of preparing these additional data structures is dominated by the complexity of evaluating and storing $O(k^{2}/p_{1})$ filters from $\mathcal{F}$ .

Now, given $V_{i-1}(x)$ we can compute $V_{i}(x)$ in expected time $O(m\lambda p_{q}|V_{i-1}(x)|)$ by, for each $p\in V_{i-1}(x)$ , looking up the relevant table entry (given by the leading $\kappa$ bits of $g_{i}(p)$ ) and verifying whether the elements of the linked list satisfy the hashing condition. Every element of the linked list found in this way satisfies the hashing condition with constant probability by our setting of $\kappa$ . To implement $g_{i}$ and $f_{i}$ we can use simple tabulation hashing [173].
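
The bucketing trick can be illustrated for a single level as follows. This is a sketch under our own assumptions: the values of $g_{i}(p)$ and $f_{i}(j)$ are taken as given $b$ -bit integers (random stand-ins rather than tabulation hashing), Python lists replace the linked lists, and $\kappa$ is clamped to the range $[0,b]$ . The function names are hypothetical.

```python
import math

def num_leading_bits(b, lam):
    """kappa = b - ceil(log2(lam * M)) - 1 with M = 2^b - 1, clamped to [0, b]."""
    M = 2 ** b - 1
    kappa = b - math.ceil(math.log2(lam * M)) - 1
    return max(0, min(b, kappa))

def leading_bits(value, b, kappa):
    """The kappa most significant bits of a b-bit integer."""
    return value >> (b - kappa)

def build_table(f_values, b, kappa):
    """Bucket each index j by the leading kappa bits of f(j)."""
    table = {}
    for j, fj in f_values.items():
        table.setdefault(leading_bits(fj, b, kappa), []).append(j)
    return table

def survivors(g_p, f_values, table, b, kappa, threshold):
    """All j with (g_p XOR f(j)) < threshold, found via a single bucket lookup."""
    bucket = table.get(leading_bits(g_p, b, kappa), [])
    return [j for j in bucket if (g_p ^ f_values[j]) < threshold]
```

Here `f_values` should contain only the indices $j\in J$ whose filter accepts the point, and `threshold` plays the role of $\lambda M$ . The lookup is correct because the leading $\kappa$ bits of an exclusive-or are zero exactly when the two arguments agree on those bits, so every $j$ satisfying the condition lies in the bucket that is inspected.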

One problem remains: long paths $p\in[w]\times[m]^{i}$ can take super-constant time to hash. To prevent this we again use hashing to create $O(\log n)$ -bit fingerprints of the paths that we work on instead. A conservative upper bound on the expected time to compute $V_{k}(x)$ is $O(k\operatorname*{\mathbb{E}}[|V_{k}(x)|])$ since $\operatorname*{\mathbb{E}}[|V_{i}(x)|]$ is non-decreasing in $i$ and the expected time spent at level $i$ is upper bounded by $O(\operatorname*{\mathbb{E}}[|V_{i}(x)|])$ . We use the same approach to compute $W_{k}(x)$ .

25.2 Framework

We are now ready to state the properties of the new framework.

Theorem 3.11.

Given a $(r,cr,p_{q},p_{u},p_{1},p_{2})$ -sensitive family $\mathcal{F}$ we can construct a fully dynamic data structure that solves the $(r,cr)$ -near neighbor problem. Define $k=\lceil\log(n)/\log(p_{q}/p_{2})\rceil$ , then:

•

The data structure uses $O(kn(p_{u}/p_{1})^{k})$ words of space in addition to the space required to store $n$ data points and $O(k^{2}/p_{1})$ filters from $\mathcal{F}$ .

•

The query operation uses $O(k^{2}(p_{q}/p_{1})^{k})$ word-RAM operations, $O(k(p_{q}/p_{1})^{k})$ distance computations, and $O(k^{2}/p_{1})$ filter evaluations.

•

The update operation uses $O(k^{2}(p_{u}/p_{1})^{k})$ word-RAM operations and $O(k^{2}/p_{1})$ filter evaluations.

Compared to the usual formulation where the query time is stated as $n^{\rho_{q}+o(1)}$ Theorem 3.11 offers a more precise statement of the complexity and can be converted to the other formulation. The lower order terms are now confined to the multiplicative factor $k$ which is a standard expression that also appears in the LSH framework as $k=\lceil\log(n)/\log(1/p_{2})\rceil$ where $p_{2}$ is an upper bound on the collision probability between pairs of points $x,y$ with $\operatorname{dist}(x,y)>cr$ . The analysis can be tightened further by not using $k$ as an upper bound for $\sum_{s=0}^{k-1}(p_{1}/\min(p_{q},p_{u}))^{s}$ when bounding the variance, but removing the multiplicative dependence on $k$ entirely as in the improved LSH framework [53] is an interesting open problem.
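
To see how the bounds of Theorem 3.11 behave for concrete parameters, the expressions inside the $O(\cdot)$ can be evaluated with all hidden constants set to one. This is an illustrative sketch only; the function name `framework_costs` and the dictionary keys are ours.

```python
import math

def framework_costs(n, p_q, p_u, p_1, p_2):
    """Evaluate the expressions in Theorem 3.11 with hidden constants set to 1."""
    if not (p_q >= p_1 > p_2 > 0 and p_u >= p_1):
        raise ValueError("need p_q, p_u >= p_1 > p_2 > 0")
    k = math.ceil(math.log(n) / math.log(p_q / p_2))
    query_growth = (p_q / p_1) ** k
    update_growth = (p_u / p_1) ** k
    return {
        "k": k,
        "space_words": k * n * update_growth,
        "query_word_ram_ops": k ** 2 * query_growth,
        "query_distance_computations": k * query_growth,
        "update_word_ram_ops": k ** 2 * update_growth,
        "filter_evaluations": k ** 2 / p_1,
    }
```

Since $k\approx\log(n)/\log(p_{q}/p_{2})$ , the factor $(p_{q}/p_{1})^{k}$ is roughly $n^{\rho_{q}}$ with $\rho_{q}=\log(p_{q}/p_{1})/\log(p_{q}/p_{2})$ , which recovers the usual formulation.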

Chapter 4 Set similarity search beyond MinHash

‘From the ashes, a fire shall be woken’

We consider the problem of approximate set similarity search under Braun-Blanquet similarity $B(x,y)=|x\cap y|/\max(|x|,|y|)$ . The $(b_{1},b_{2})$ -approximate Braun-Blanquet similarity search problem is to preprocess a collection of sets $P$ such that, given a query set $q$ , if there exists $x\in P$ with $B(q,x)\geq b_{1}$ , then we can efficiently return $x^{\prime}\in P$ with $B(q,x^{\prime})>b_{2}$ .

We present a simple data structure that solves this problem with space usage $O(n^{1+\rho}\log n+\sum_{x\in P}|x|)$ and query time $O(|q|n^{\rho}\log n)$ where $n=|P|$ and $\rho=\log(1/b_{1})/\log(1/b_{2})$ . Making use of existing lower bounds for locality-sensitive hashing by O’Donnell et al. [121] we show that this value of $\rho$ is tight across the parameter space, i.e., for every choice of constants $0<b_{2}<b_{1}<1$ .
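
For concreteness, Braun-Blanquet similarity and the exponent $\rho=\log(1/b_{1})/\log(1/b_{2})$ can be computed as follows. This is a small illustrative sketch; the function names are ours and sets are represented as Python sets.

```python
import math

def braun_blanquet(x, y):
    """B(x, y) = |x ∩ y| / max(|x|, |y|) for two non-empty sets."""
    x, y = set(x), set(y)
    if not x or not y:
        raise ValueError("both sets must be non-empty")
    return len(x & y) / max(len(x), len(y))

def rho(b1, b2):
    """Exponent log(1/b1) / log(1/b2) for thresholds 0 < b2 < b1 < 1."""
    if not (0 < b2 < b1 < 1):
        raise ValueError("need 0 < b2 < b1 < 1")
    return math.log(1 / b1) / math.log(1 / b2)
```

For example, the sets `{1, 2, 3}` and `{2, 3, 4, 5}` share two elements and the larger set has four, so their Braun-Blanquet similarity is $1/2$ .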

In the case where all sets have the same size our solution strictly improves upon the value of $\rho$ that can be obtained through the use of state-of-the-art data-independent techniques in the Indyk-Motwani locality-sensitive hashing framework [91] such as Broder’s MinHash [35] for Jaccard similarity and Andoni et al.’s cross-polytope LSH [13] for cosine similarity. Surprisingly, even though our solution is data-independent, for a large part of the parameter space we outperform the currently best data-dependent method by Andoni and Razenshteyn [17].

26 Introduction

In this paper we consider the approximate set similarity problem or, equivalently, the problem of approximate Hamming near neighbor search in sparse vectors. Data that can be represented as sparse vectors is ubiquitous — a typical example is the representation of text documents as term vectors, where non-zero vector entries correspond to occurrences of words (or shingles). In order to perform identification of near-identical text documents in web-scale collections, Broder et al. [30, 36] designed and implemented MinHash (a.k.a. min-wise hashing), now understood as a locality-sensitive hash function [86]. This allowed approximate answers to similarity queries to be computed much faster than by other methods, and in particular made it possible to cluster the web pages of the AltaVista search engine (for the purpose of eliminating near-duplicate search results). Almost two decades after it was first described, MinHash remains one of the most widely used locality-sensitive hashing methods as witnessed by thousands of citations of [30, 36] as well as the ACM Paris Kanellakis Theory and Practice Award that Broder shared with Indyk and Charikar in 2012.

A similarity measure maps a pair of vectors to a similarity score in $[0,1]$ . It will often be convenient to interpret a vector $x\in\{0,1\}^{d}$ as the set $\{i\;|\;x_{i}=1\}$ . With this convention the Jaccard similarity of two vectors can be expressed as $J(x,y)=|x\cap y|/|x\cup y|$ . In approximate similarity search we are interested in the problem of searching a data set $P\subseteq\{0,1\}^{d}$ for a vector of similarity at least $j_{1}$ with a query vector $q\in\{0,1\}^{d}$ , but allow the search algorithm to return a vector of similarity $j_{2}<j_{1}$ . To simplify the exposition we will assume throughout the introduction that all vectors are $t$ -sparse, i.e., have the same Hamming weight $t$ .

Recent theoretical advances in data structures for approximate near neighbor search in Hamming space [17] make it possible to beat the asymptotic performance of MinHash-based Jaccard similarity search (using the LSH framework of [86]) in cases where the similarity threshold $j_{2}$ is not too small. However, numerical computations suggest that MinHash is always better when $j_{2}<1/45$ .

In this paper we address the problem: Can similarity search using MinHash be improved in general? We give an affirmative answer in the case where all sets have the same size $t$ by introducing Chosen Path: a simple data-independent search method that strictly improves MinHash, and is always better than the data-dependent method of [17] when $j_{2}<1/9$ . Similar to data-independent locality-sensitive filtering (LSF) methods [24, 100, 54] our method works by mapping each data (or query) vector to a set of keys that must be stored (or looked up). The name Chosen Path stems from the way the mapping is constructed: as paths in a layered random graph where the vertices at each layer are identified with the set $\{1,\dots,d\}$ of dimensions, and where a vector $x$ is only allowed to choose paths that stick to non-zero components $x_{i}$ . This is illustrated in Figure 5.
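
The claimed improvement over MinHash for sets of equal size can be checked numerically. For two sets of the same size $t$ with intersection size $i$ we have $J=i/(2t-i)$ and $B=i/t$ , so $B=2J/(1+J)$ . The sketch below compares the MinHash exponent $\log(1/j_{1})/\log(1/j_{2})$ with the exponent $\log(1/b_{1})/\log(1/b_{2})$ stated in the chapter abstract for Braun-Blanquet similarity, after converting the Jaccard thresholds. The function names are ours.

```python
import math

def jaccard_to_braun_blanquet(j):
    """Convert Jaccard similarity to Braun-Blanquet similarity for equal-size sets."""
    return 2 * j / (1 + j)

def rho_minhash(j1, j2):
    """Exponent of MinHash in the LSH framework for Jaccard thresholds j2 < j1."""
    return math.log(1 / j1) / math.log(1 / j2)

def rho_chosen_path(j1, j2):
    """Exponent log(1/b1)/log(1/b2) after converting Jaccard thresholds to b1, b2."""
    b1 = jaccard_to_braun_blanquet(j1)
    b2 = jaccard_to_braun_blanquet(j2)
    return math.log(1 / b1) / math.log(1 / b2)
```

For instance, with $j_{1}=0.5$ and $j_{2}=0.2$ the MinHash exponent is about $0.431$ while the converted exponent is about $0.369$ .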

26.1 Related Work

High-dimensional approximate similarity search methods can be characterized in terms of their $\rho$ -value which is the exponent for which queries can be answered in time $O(dn^{\rho})$ , where $n$ is the size of the set $P$ and $d$ denotes the dimensionality of the space. Here we focus on the “balanced” case where we aim for space $O(n^{1+\rho}+dn)$ , but note that there now exist techniques for obtaining other trade-offs between query time and space overhead [16, 54].

Locality-sensitive hashing methods.

We begin by describing results for Hamming space, which is a special case of similarity search on the unit sphere (many of the results cited apply to the more general case). In Hamming space the focus has traditionally been on the $\rho$ -value that can be obtained for solutions to the $(r,cr)$ -approximate near neighbor problem: Preprocess a set of points $P\subseteq\{0,1\}^{d}$ such that, given a query point $q$ , if there exists $x\in P$ with $\left\lVert x-q\right\rVert_{1}\leq r$ , then return $x^{\prime}\in P$ with $\left\lVert x^{\prime}-q\right\rVert_{1}<cr$ . In the literature this problem is often presented as the $c$ -approximate near neighbor problem where bounds for the $\rho$ -value are stated in terms of $c$ and, in the case of upper bounds, hold for every choice of $r$ , while lower bounds may only hold for specific choices of $r$ .

O’Donnell et al. [121] have shown that the value $\rho=1/c$ for $c$ -approximate near neighbor search in Hamming space, obtained in the seminal work of Indyk and Motwani [91], is the best possible in terms of $c$ for schemes based on Locality-Sensitive Hashing (LSH). However, the lower bound only applies when the distances of interest, $r$ and $cr$ , are relatively small compared to $d$ , and better upper bounds are known for large distances. Notably, other LSH schemes for angular distance on the unit sphere such as cross-polytope LSH [13] give lower $\rho$ -values for large distances. Extensions of the lower bound of [121] to cover more of the parameter space were recently given in [16, 54]. Until recently the best $\rho$ -value known in terms of $c$ was $1/c$ , but in a sequence of papers Andoni et al. [14, 17] have shown how to use data-dependent LSH techniques to achieve $\rho=1/(2c-1)+o_{n}(1)$ , bypassing the lower bound framework of [121] which assumes the LSH to be independent of data.

Set similarity search.

There exists a large number of different measures of set similarity with various applications for which it would be desirable to have efficient approximate similarity search algorithms [51]. Given a measure of similarity assume that we have access to a family $\mathcal{H}$ of locality-sensitive hash functions (defined in Section 27) such that for every pair of sets $x,y$ it holds that

$$\Pr[h(x)=h(y)]=\operatorname{sim}(x,y)$$

when $h$ is sampled randomly from $\mathcal{H}$ . We will refer to a family of locality-sensitive hash functions with this specific property as a similarity-sensitive family. Given a similarity-sensitive family we can use the LSH framework to construct a solution for the $(s_{1},s_{2})$ -approximate similarity search problem with exponent $\rho=\log(1/s_{1})/\log(1/s_{2})$ .

Regarding the existence of similarity-sensitive families it was shown by Charikar [47] that if the similarity measure $\operatorname{sim}(x,y)$ admits a similarity-sensitive LSH, then $1-\operatorname{sim}(x,y)$ must be a metric. Recently, Chierichetti and Kumar [49] showed that, given a similarity $\operatorname{sim}$ that admits a similarity-sensitive LSH, the transformed similarity $f(\operatorname{sim})$ will continue to admit an LSH if $f(\cdot)$ is a probability generating function. The existence of an LSH for a similarity measure $\operatorname{sim}$ will therefore give rise to the existence of solutions to the approximate similarity search problem for the much larger class of similarities $f(\operatorname{sim})$ . However, this still leaves open the problem of finding efficient explicit constructions, and as it turns out, the property of similarity-sensitive families $\Pr[h(x)=h(y)]=\operatorname{sim}(x,y)$ , while intuitively appealing and useful for similarity estimation, does not necessarily imply that the LSH is optimal for solving the approximate search problem. In fact, it was recently shown [50] that for Braun-Blanquet there does not exist an LSH scheme with $\Pr[h(x)=h(y)]=B(x,y)=|x\cap y|/\max(|x|,|y|)$ . Moreover, it was shown that MinHash achieves a two-approximation to Braun-Blanquet similarity and that this is optimal for LSH schemes.

The problem of finding tight upper and lower bounds on the $\rho$ -value that can be obtained through the LSH framework for data-independent $(s_{1},s_{2})$ -approximate similarity search across the entire parameter space $(s_{1},s_{2})$ remains open for two of the most common measures of set similarity: Jaccard similarity $J(x,y)=|x\cap y|/|x\cup y|$ and cosine similarity $C(x,y)=|x\cap y|/\sqrt{|x||y|}$ .

A random function from the MinHash family $\mathcal{H}_{\text{minhash}}$ hashes a set $x\subseteq\{1,\dots,d\}$ to the first element of $x$ in a random permutation of the set $\{1,\dots,d\}$ . For $h\sim\mathcal{H}_{\text{minhash}}$ we have that $\Pr[h(x)=h(y)]=J(x,y)$ , yielding an LSH solution to the approximate Jaccard similarity search problem. For cosine similarity the SimHash family $\mathcal{H}_{\text{simhash}}$ , introduced by Charikar [47], works by sampling a random hyperplane in $\mathbb{R}^{d}$ that passes through the origin and hashing $x$ according to what side of the hyperplane it lies on. For $h\sim\mathcal{H}_{\text{simhash}}$ we have that $\Pr[h(x)=h(y)]=1-\arccos(C(x,y))/\pi$ , which can be used to derive a solution for cosine similarity, although not the clean solution that we could have hoped for in the style of MinHash for Jaccard similarity. There exist a number of different data-independent LSH approaches [156, 14, 13] that improve upon the $\rho$ -value of SimHash. Perhaps surprisingly, it turns out that these approaches yield lower $\rho$ -values for the $(j_{1},j_{2})$ -approximate Jaccard similarity search problem compared to MinHash for certain combinations of $(j_{1},j_{2})$ . Unfortunately, while asymptotically superior, these techniques suffer from a non-trivial $o_{n}(1)$ -term in the exponent that only decreases very slowly with $n$ . In comparison, both MinHash and SimHash are simple to describe and have closed expressions for their $\rho$ -values. Furthermore, MinHash and SimHash both have the advantage of being efficient in the sense that a hash function can be represented using space $O(d)$ and the time to compute $h(x)$ is $O(|x|)$ .
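The MinHash collision property can be checked exhaustively on a tiny universe. The following is a minimal sketch, not part of the original text: the function names are illustrative, and enumerating all $d!$ permutations is only feasible for very small $d$.

```python
from fractions import Fraction
from itertools import permutations


def minhash(x, perm):
    """Return the first element of the set x in the order given by perm."""
    return next(e for e in perm if e in x)


def collision_probability(x, y, d):
    """Exact Pr[h(x) = h(y)] over all d! permutations of {1, ..., d}."""
    perms = list(permutations(range(1, d + 1)))
    hits = sum(minhash(x, p) == minhash(y, p) for p in perms)
    return Fraction(hits, len(perms))


def jaccard(x, y):
    """Jaccard similarity |x & y| / |x | y| as an exact fraction."""
    return Fraction(len(x & y), len(x | y))


x, y = {1, 2, 3}, {2, 3, 4}
print(collision_probability(x, y, d=5), jaccard(x, y))
```

The two printed values agree because the first element of $x\cup y$ in the permutation is uniformly distributed over $x\cup y$ , and the hashes collide exactly when that element lies in $x\cap y$ .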

In Table 2 we show how the upper bounds for similarity search under different measures of set similarity relate to each other in the case where all sets are $t$ -sparse. In addition to Hamming distance and Jaccard similarity, we consider Braun-Blanquet similarity [28] defined as

$$B(x,y)=\frac{|x\cap y|}{\max(|x|,|y|)}\tag{3}$$

which for $t$ -sparse vectors is identical to cosine similarity. When the query and the sets in $P$ can have different sizes the picture becomes muddled, and the question of which of the known algorithms is best for each measure of similarity is complicated and can depend on $(s_{1},s_{2})$ . In Section 30 we treat the problem of different set sizes and provide a brief discussion for Jaccard similarity, specifically in relation to our upper bound for Braun-Blanquet similarity.

Similarity search under set similarity and the batched version often referred to as set similarity join [20, 23] have also been studied extensively in the information retrieval and database literature, but mostly without providing theoretical guarantees on performance. Recently the notion of containment search, where the similarity measure is the (unnormalized) intersection size, was studied in the LSH framework [150]. This is a special case of maximum inner product search [150, 5]. However, these techniques do not give improvements in our setting.

Similarity estimation.

Finally, we mention that another application of MinHash [30, 36] is the (easier) problem of similarity estimation, where the task is to condense each vector $x$ into a short signature $s(x)$ in such a way that the similarity $J(x,y)$ can be estimated from $s(x)$ and $s(y)$ . A related similarity estimation technique was independently discovered by Cohen [60]. Thorup [157] has shown how to perform similarity estimation using just a small amount of randomness in the definition of the function $s(\cdot)$ . In another direction, Mitzenmacher et al. [112] showed that it is possible to improve the performance of MinHash for similarity estimation when the Jaccard similarity is close to 1, but for smaller similarities it is known that succinct encodings of MinHash such as the one in [103] come within a constant factor of the optimal space for storing $s(x)$ [128]. Curiously, our improvement to MinHash in the context of similarity search comes when the similarity is neither too large nor too small. Our techniques do not seem to yield any improvement for the similarity estimation problem.

26.2 Contribution

We show the following upper bound for approximate similarity search under Braun-Blanquet similarity:

Theorem 4.1.

For every choice of constants $0<b_{2}<b_{1}<1$ we can solve the $(b_{1},b_{2})$ -approximate similarity search problem under Braun-Blanquet similarity with query time $O(|q|n^{\rho}\log n)$ and space usage $O(n^{1+\rho}\log n+\sum_{x\in P}|x|)$ where $\rho=\log(1/b_{1})/\log(1/b_{2})$ .

In the case where the sets are $t$ -sparse our Theorem 4.1 gives the first strict improvement on the $\rho$ -value for approximate Jaccard similarity search compared to the data-independent LSH approaches of MinHash and Angular LSH. Figure 6 shows an example of the improvement for a slice of the parameter space. The improvement is based on a new locality-sensitive mapping that considers a specific random collection of length- $k$ paths on the vertex set $\{1,\dots,d\}$ , and associates each vector $x$ with the paths in the collection that only visit vertices in $\{i\;|\;x_{i}=1\}$ . Our data structure exploits the fact that similar vectors will be associated with a common path with constant probability, while vectors with low similarity have a negligible probability of sharing a path. However, the collection of paths has size superlinear in $n$ , so an efficient method is required for locating the paths associated with a particular vector. Our choice of the collection of paths balances two opposing constraints: It is random enough to match the filtering performance of a truly random collection of sets, and at the same time it is structured enough to allow efficient search for sets matching a given vector. The search procedure is comparable in simplicity to the classical techniques of bit sampling, MinHash, SimHash, and $p$ -stable LSH, and we believe it might be practical. This is in contrast to most theoretical advances in similarity search in the past ten years that suffer from $o(1)$ terms in the exponent of complexity bounds.

Intuition.

Recall that we will think of a vector $x\in\{0,1\}^{d}$ also as a set, $\{i\;|\;x_{i}=1\}$ . MinHash can be thought of as a way of sampling an element $i_{x}$ from $x$ , namely, we let $i_{x}=\operatorname*{arg\,min}_{i\in x}h(i)$ where $h$ is a random hash function. For sets $x$ and $y$ the probability that $i_{x}=i_{y}$ equals their Jaccard similarity $J(x,y)$ , which is much higher than if the samples had been picked independently. Consider the case in which $|x|=|y|=t$ , so $J(x,y)=\frac{|x\cap y|}{2t-|x\cap y|}$ . Another way of sampling is to compute $I_{x}=x\cap{b}$ , where $\Pr[i\in{b}]=1/t$ , independently for each $i\in[d]$ . The expected size of $I_{x}$ is 1, so this sample has the same expected “cost” as the MinHash-based sample. But if the Jaccard similarity is small, the latter samples are more likely to overlap:

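$$\Pr[I_{x}\cap I_{y}\neq\emptyset]=1-(1-1/t)^{|x\cap y|}\approx\frac{|x\cap y|}{t}\qquad\text{compared to}\qquad\Pr[i_{x}=i_{y}]=\frac{|x\cap y|}{2t-|x\cap y|}\approx\frac{|x\cap y|}{2t},$$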

nearly a factor of 2 improvement. In fact, whenever $|x\cap y|<0.6\,t$ we have $\Pr[I_{x}\cap I_{y}\neq\emptyset]>\Pr[i_{x}=i_{y}]$ . So in a certain sense, MinHash is not the best way of collecting evidence for the similarity of two sets. (This observation is not new, and has been made before e.g. in [62].)
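The comparison between the two sampling schemes can be made concrete numerically. The sketch below assumes, as in the text, that $|x|=|y|=t$ and that each $i\in[d]$ is included in ${b}$ independently with probability $1/t$ , so that $I_{x}\cap I_{y}=x\cap y\cap{b}$ ; the function names are illustrative.

```python
def minhash_collision(t, overlap):
    """Pr[i_x = i_y] = J(x, y) for two t-sparse sets sharing `overlap` elements."""
    return overlap / (2 * t - overlap)


def bernoulli_overlap(t, overlap):
    """Pr[I_x and I_y intersect]: at least one shared element is sampled into b."""
    return 1 - (1 - 1 / t) ** overlap


t = 1000
for overlap in (10, 300, 600, 700):
    print(overlap,
          round(bernoulli_overlap(t, overlap), 4),
          round(minhash_collision(t, overlap), 4))
```

For small overlaps the first probability is close to twice the second, and the advantage disappears somewhere between $|x\cap y|=0.6\,t$ and $|x\cap y|=0.7\,t$ .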

Locality-sensitive maps.

The intersection of the samples $I_{x}$ does not correspond directly to hash collisions, so it is not clear how to turn this insight into an algorithm in the LSH framework. Instead, we will consider a generalization of both the locality-sensitive filtering (LSF) and LSH frameworks where we define a distribution $\mathcal{M}$ over maps $M\colon\{0,1\}^{d}\to 2^{R}$ . The map $M$ performs the same task as the LSH data structure: It takes a vector $x$ and returns a set of memory locations $M(x)\subseteq\{1,\dots,R\}$ . A randomly sampled map $M\sim\mathcal{M}$ has the property that if a pair of points $x,y$ are close then $M(x)\cap M(y)\neq\emptyset$ with constant probability, while if $x,y$ are distant then the expected size of $M(x)\cap M(y)$ is small (much smaller than $1$ ). It is now straightforward to see that this distribution can be used to construct a data structure for similarity search by storing each data point $x\in P$ in the set of memory locations or buckets $M(x)$ . A query for a point $y$ is performed by computing the similarity between $y$ and every point $x$ contained in the set of buckets $M(y)$ , reporting the first sufficiently similar point found.

Chosen Path.

It turns out that to most efficiently filter out vectors of low similarity in the setting where all sets have equal size, we would like to map each data point $x\in\{0,1\}^{d}$ to a collection $M(x)$ of random subsets of $\{1,\dots,d\}$ that are contained in $x$ . Furthermore, to best distinguish similar from dissimilar vectors when solving the approximate similarity search problem, we would like the random subsets of $\{1,\dots,d\}$ to have size $\Theta(\log n)$ . This leads to another obstacle: The collection of subsets of $\{1,\dots,d\}$ required to ensure that $M(x)\cap M(y)\neq\emptyset$ for similar points, i.e., that $M$ maps to a subset contained in $x\cap y$ , is very large. The space usage and evaluation time of a locality-sensitive map $M$ to fully random subsets of $\{1,\dots,d\}$ would far exceed $n$ , rendering the solution useless. To overcome this we create the samples in a gradual, correlated way using a pairwise independent branching process that turns out to yield “sufficiently random” samples for the argument to go through.

Lower bound.

On the lower bound side we show that our solution for Braun-Blanquet similarity is best possible in terms of parameters $b_{1}$ and $b_{2}$ within the class of solutions that can be characterized as data-independent locality-sensitive maps. The lower bound works by showing that a family of locality-sensitive maps for Braun-Blanquet similarity with a $\rho$ -value below $\log(1/b_{1})/\log(1/b_{2})$ can be used to construct a locality-sensitive hash family for the $c$ -approximate near neighbor problem in Hamming space with a $\rho$ -value below $1/c$ , thereby contradicting the LSH lower bound by O’Donnell et al. [121]. We state the lower bound here in terms of locality-sensitive hashing, formally defined in Section 27.

Theorem 4.2.

For every choice of constants $0<{b_{2}}<{b_{1}}<1$ any $({b_{1}},{b_{2}},p_{1},p_{2})$ -sensitive hash family $\mathcal{H}_{B}$ for $\{0,1\}^{d}$ under Braun-Blanquet similarity must satisfy

[TABLE]

The details showing how this LSH lower bound implies a lower bound for locality-sensitive maps are given in Section 29.

27 Preliminaries

As stated above we will view $x\in\{0,1\}^{d}$ both as a vector and as a subset of $[d]=\{1,\dots,d\}$ . Define $x$ to be $t$ -sparse if $|x|=t$ ; we will be interested in the setting where $t\leq d/2$ , and typically the sparse setting $t\ll d$ . Although many of the concepts we use hold for general spaces, for simplicity we state definitions in the same setting as our results: the boolean hypercube $\{0,1\}^{d}$ under some measure of similarity $\operatorname{sim}\colon\{0,1\}^{d}\times\{0,1\}^{d}\rightarrow[0,1]$ .

Definition 4.1.

(Approximate similarity search) Let $P\subset\{0,1\}^{d}$ be a set of $|P|=n$ data vectors, let $\operatorname{sim}\colon\{0,1\}^{d}\times\{0,1\}^{d}\rightarrow[0,1]$ be a similarity measure, and let $s_{1},s_{2}\in[0,1]$ such that $s_{1}>s_{2}$ . A solution to the $(s_{1},s_{2})$ -similarity search problem is a data structure that supports the following query operation: on input $q\in\{0,1\}^{d}$ for which there exists a vector $x\in P$ with $\operatorname{sim}(x,q)\geq s_{1}$ , return $x^{\prime}\in P$ with $\operatorname{sim}(x^{\prime},q)>s_{2}$ .

Our data structures are randomized, and queries succeed with probability at least $1/2$ (the probability can be made arbitrarily close to $1$ by independent repetition). Sometimes similarity search is formulated as searching for vectors that are near $q$ according to the distance measure $\operatorname{dist}(x,y)=1-\operatorname{sim}(x,y)$ . For our purposes it is natural to phrase conditions in terms of similarity, but we compare to solutions originally described as “near neighbor” methods.

Many of the best known solutions to approximate similarity search problems are based on a technique of randomized space partitioning. This technique has been formalized in the locality-sensitive hashing framework [91] and the closely related locality-sensitive filtering framework [24, 54].

Definition 4.2.

(Locality-sensitive hashing [91]) A $({s_{1}},{s_{2}},p_{1},p_{2})$ -sensitive family of hash functions for a similarity measure $\operatorname{sim}\colon\{0,1\}^{d}\times\{0,1\}^{d}\to[0,1]$ is a distribution $\mathcal{H}_{\operatorname{sim}}$ over functions $h\colon\{0,1\}^{d}\to R$ such that for all $x,y\in\{0,1\}^{d}$ and random $h$ sampled according to $\mathcal{H}_{\operatorname{sim}}$ :

•

If $\operatorname{sim}(x,y)\geq s_{1}$ then $\Pr[h(x)=h(y)]\geq p_{1}$ .

•

If $\operatorname{sim}(x,y)\leq s_{2}$ then $\Pr[h(x)=h(y)]\leq p_{2}$ .

The range $R$ of the family will typically be fairly small such that an element of $R$ can be represented in a constant number of machine words. In the following we assume for simplicity that the family of hash functions is efficient such that $h(x)$ can be computed in time $O(|x|)$ . Furthermore, we will assume that the time to compute the similarity $\operatorname{sim}(x,y)$ can be upper bounded by the time it takes to compute the size of the intersection of preprocessed sets, i.e., $O(\min(|x|,|y|))$ .

Given a locality-sensitive family it is quite simple to obtain a solution to the approximate similarity search problem, essentially by hashing points to buckets such that close points end up in the same bucket while distant points are kept apart.

Lemma 4.1 (LSH framework [91, 86]).

Given a $(s_{1},s_{2},p_{1},p_{2})$ -sensitive family of hash functions it is possible to solve the $(s_{1},s_{2})$ -similarity search problem with query time $O(|q|n^{\rho}\log n)$ and space usage $O(n^{1+\rho}+\sum_{x\in P}|x|)$ where $\rho=\log(1/p_{1})/\log(1/p_{2})$ .

The upper bound presented in this paper does not quite fit into the existing frameworks. However, we would like to apply existing LSH lower bound techniques to our algorithm. Therefore we define a more general framework that captures solutions constructed using the LSH and LSF framework, as well as the upper bound presented in this paper.

Definition 4.3 (Locality-sensitive map).

A $(s_{1},s_{2},m_{1},m_{2})$ -sensitive family of maps for a similarity measure $\operatorname{sim}\colon\{0,1\}^{d}\times\{0,1\}^{d}\to[0,1]$ is a distribution $\mathcal{M}_{\operatorname{sim}}$ over mappings $M\colon\{0,1\}^{d}\to 2^{R}$ (where $2^{R}$ denotes the power set of $R$ ) such that for all $x,y\in\{0,1\}^{d}$ and random $M\in\mathcal{M}_{\operatorname{sim}}$ :

1. $\operatorname*{\mathbb{E}}[|M(x)|]\leq m_{1}$ .

2. If $\operatorname{sim}(x,y)\leq s_{2}$ then $\operatorname*{\mathbb{E}}[|M(x)\cap M(y)|]\leq m_{2}$ .

3. If $\operatorname{sim}(x,y)\geq s_{1}$ then $\Pr[M(x)\cap M(y)\neq\emptyset]\geq 1/2$ .

Once we have a family of locality-sensitive maps $\mathcal{M}$ we can use it to obtain a solution to the $(s_{1},s_{2})$ -similarity search problem.

Lemma 4.2.

Given a $(s_{1},s_{2},m_{1},m_{2})$ -sensitive family of maps $\mathcal{M}$ we can solve the $(s_{1},s_{2})$ -similarity search problem with query time $O(m_{1}+nm_{2}|q|+T_{M})$ and space usage $O(nm_{1}+\sum_{x\in P}|x|)$ where $T_{M}$ is the time to evaluate a map $M\in\mathcal{M}$ .

Proof.

We construct the data structure by sampling a map $M$ from $\mathcal{M}$ and use it to place points in $P$ into buckets. To run a query for a point $q$ we proceed by evaluating $M(q)$ and computing the similarity between $q$ and the points in the buckets associated with $M(q)$ . If a sufficiently similar point is found we return it. We get rid of the expectation in the guarantees by independent repetitions and applying Markov’s inequality. ∎
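The construction in the proof can be sketched as a small bucket index. This is an illustrative sketch only: the class name `LSMIndex` is hypothetical, sets are represented as Python `frozenset` objects, and the map `M` is supplied by the caller (the toy map below, which uses every element of a set as a bucket, is a stand-in and not a map with good sensitivity parameters). The independent repetitions mentioned in the proof are omitted.

```python
from collections import defaultdict


def braun_blanquet(x, y):
    """Braun-Blanquet similarity |x & y| / max(|x|, |y|)."""
    return len(x & y) / max(len(x), len(y))


class LSMIndex:
    """Store each point x in every bucket of M(x); scan the buckets of M(q) at query time."""

    def __init__(self, points, M):
        self.M = M
        self.buckets = defaultdict(list)
        for x in points:
            for location in M(x):
                self.buckets[location].append(x)

    def query(self, q, s2, sim=braun_blanquet):
        """Return the first stored point with similarity above s2, or None."""
        for location in self.M(q):
            for x in self.buckets.get(location, ()):
                if sim(x, q) > s2:
                    return x
        return None


# Toy map (an assumption for illustration): every element of the set is a bucket.
points = [frozenset({1, 2, 3}), frozenset({7, 8, 9})]
index = LSMIndex(points, M=set)
print(index.query(frozenset({1, 2, 4}), s2=0.5))
```

The query cost splits into the evaluation of $M(q)$ , the lookup of at most $|M(q)|$ buckets, and one similarity computation per candidate found, which corresponds to the three terms in the query time of Lemma 4.2.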

Model of computation.

We assume the standard word RAM model [85] with word size $\Theta(\log n)$ , where $n=|P|$ . In order to be able to draw random functions from a family of functions we augment the model with an instruction that generates a machine word uniformly at random in constant time.

28 Upper Bound

We will describe a family of locality-sensitive maps $\mathcal{M}_{B}$ for solving the $(b_{1},b_{2})$ -similarity search problem under Braun-Blanquet similarity (3). After describing $\mathcal{M}_{B}$ we will give an efficient implementation of $M\in\mathcal{M}_{B}$ and show how to set parameters to obtain our Theorem 4.1.

28.1 Chosen Path

The Chosen Path family $\mathcal{M}_{B}$ is defined by $k$ random hash functions $h_{1},\dots,h_{k}$ where $h_{i}\colon[w]\times[d]^{i}\to[0,1]$ and $w$ is a positive integer. The evaluation of a map $M_{k}\in\mathcal{M}_{B}$ proceeds in a sequence of $k+1$ steps that can be analyzed as a Galton-Watson branching process, originally devised to investigate population growth under the assumption of identical and independent offspring distributions. In the first step $i=0$ we create a population of $w$ starting points

$$M_{0}(x)=[w].\tag{4}$$

In subsequent steps, every path that has survived so far produces offspring according to a random process that depends on $h_{i}$ and the element $x\in\{0,1\}^{d}$ being evaluated. We use $p\circ j$ to denote concatenation of a path $p$ with a vertex $j$ .

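$$M_{i}(x)=\Big\{p\circ j\;\Big|\;p\in M_{i-1}(x)\wedge h_{i}(p\circ j)<\frac{x_{j}}{b_{1}|x|}\Big\}.\tag{5}$$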

Observe that $h_{i}(p\circ j)<\frac{x_{j}}{b_{1}|x|}$ can only hold when $x_{j}=1$ , so the paths in $M_{i}(x)$ are constrained to $j\in x$ . The set $M(x)=M_{k}(x)$ is given by the paths that survive to the $k$ th step. We will proceed by bounding the evaluation time of $M\in\mathcal{M}_{B}$ as well as showing the locality-sensitive properties of $\mathcal{M}_{B}$ . In particular, for similar points $x,y\in\{0,1\}^{d}$ with $B(x,y)\geq b_{1}$ we will show that with probability at least $1/2$ there will be a path that is chosen by both $x$ and $y$ .
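The branching process is short enough to sketch directly. The code below is a minimal illustration under stated assumptions rather than the implementation analyzed in this section: a set $x$ is given as a non-empty Python set of integers, a path is a tuple whose first entry is the starting point in $[w]$ , and the hash functions $h_{i}$ are simulated by a seeded BLAKE2b digest (the hypothetical helper `unit_hash`) instead of the 2-independent functions discussed in Section 28.2.

```python
import hashlib


def unit_hash(seed, level, path):
    """Map (seed, level, path) to a pseudo-random value in [0, 1]."""
    data = repr((seed, level, path)).encode()
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "big") / 2**64


def chosen_path(x, b1, k, w, seed=0):
    """Return M_k(x): the paths of length k that survive when evaluated on x."""
    x = frozenset(x)
    # For j in x the survival threshold is x_j / (b1 |x|) = 1 / (b1 |x|);
    # for j not in x the threshold is 0, so only j in x needs to be tried.
    threshold = 1.0 / (b1 * len(x))
    paths = {(start,) for start in range(w)}  # step i = 0: w starting points
    for i in range(1, k + 1):
        paths = {p + (j,) for p in paths for j in x
                 if unit_hash(seed, i, p + (j,)) < threshold}
    return paths


x = {1, 2, 3, 4, 5, 6}
y = {4, 5, 6, 7, 8, 9}
shared = chosen_path(x, b1=0.5, k=3, w=4) & chosen_path(y, b1=0.5, k=3, w=4)
print(len(shared))
```

Because both evaluations use the same hash values, any path chosen by both $x$ and $y$ visits only vertices in $x\cap y$ , which is what makes shared paths evidence of similarity.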

Lemma 4.3 (Properties of Chosen Path).

For all $x,y\in\{0,1\}^{d}$ , integer $i\geq 0$ , and random $M\in\mathcal{M}_{B}$ :

1. $\operatorname*{\mathbb{E}}[|M_{i}(x)|]\leq(1/b_{1})^{i}w$ .

2. If $B(x,y)<b_{2}$ then $\operatorname*{\mathbb{E}}[|M_{i}(x)\cap M_{i}(y)|]\leq(b_{2}/b_{1})^{i}w$ .

3. If $B(x,y)\geq b_{1}$ then $\Pr[M_{i}(x)\cap M_{i}(y)\neq\emptyset]\geq w/(i+w)$ .

Proof.

We prove each property by induction on $i$ . The base cases $i=0$ follow from (4). Now consider the inductive step for property 1. Let $\mathds{1}\{\mathcal{P}\}$ denote the indicator function for predicate $\mathcal{P}$ . Using independence of the hash functions $h_{i}$ we get:

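$$\operatorname*{\mathbb{E}}[|M_{i}(x)|]=\operatorname*{\mathbb{E}}\Big[\sum_{p\in M_{i-1}(x)}\sum_{j\in[d]}\mathds{1}\Big\{h_{i}(p\circ j)<\frac{x_{j}}{b_{1}|x|}\Big\}\Big]\leq\operatorname*{\mathbb{E}}[|M_{i-1}(x)|]\sum_{j\in[d]}\frac{x_{j}}{b_{1}|x|}=\frac{\operatorname*{\mathbb{E}}[|M_{i-1}(x)|]}{b_{1}}\leq(1/b_{1})^{i}w.$$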

The last inequality uses the induction hypothesis. We use the same approach for the second property where we let $X_{i}=M_{i}(x)\cap M_{i}(y)$ .

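$$\operatorname*{\mathbb{E}}[|X_{i}|]=\operatorname*{\mathbb{E}}\Big[\sum_{p\in X_{i-1}}\sum_{j\in[d]}\mathds{1}\Big\{h_{i}(p\circ j)<\frac{x_{j}}{b_{1}|x|}\wedge h_{i}(p\circ j)<\frac{y_{j}}{b_{1}|y|}\Big\}\Big]\leq\operatorname*{\mathbb{E}}[|X_{i-1}|]\frac{|x\cap y|}{b_{1}\max(|x|,|y|)}=\operatorname*{\mathbb{E}}[|X_{i-1}|]\frac{B(x,y)}{b_{1}}\leq(b_{2}/b_{1})^{i}w.$$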

To prove the third property we bound the variance of $|X_{i}|$ and apply Chebyshev’s inequality to bound the probability of $X_{i}=\emptyset$ . First consider the case where $|x|\leq 1/b_{1}$ and $|y|\leq 1/b_{1}$ . Here it must hold that $|X_{i}|>0$ , since intersecting paths exist ( $b_{1}>0$ ) and always activate. In all other cases we have that

$$\operatorname*{\mathbb{E}}[|X_{i}|]=(B(x,y)/b_{1})^{i}w.$$

Knowing the expected value we can apply Chebyshev’s inequality once we have an upper bound for $\mathrm{Var}[|X_{i}|]=\operatorname*{\mathbb{E}}[|X_{i}|^{2}]-\operatorname*{\mathbb{E}}[|X_{i}|]^{2}$ . Specifically we show that $\operatorname*{\mathbb{E}}[|X_{i}|^{2}]\leq(w^{2}+wi)(B(x,y)/b_{1})^{2i}$ , i.e., that $\mathrm{Var}[|X_{i}|]\leq wi(B(x,y)/b_{1})^{2i}$ , by induction on $i$ . To simplify notation we define the indicator variable

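$$Y_{p,j}=\mathds{1}\Big\{h_{i}(p\circ j)<\frac{x_{j}}{b_{1}|x|}\wedge h_{i}(p\circ j)<\frac{y_{j}}{b_{1}|y|}\Big\}$$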

where we suppress the subscript $i$ . First observe that

[TABLE]

By (5) we see that $|X_{i}|=\sum_{p\in X_{i-1}}\sum_{j\in[d]}Y_{p,j}$ , which means:

[TABLE]

The third property now follows from a one-sided version of Chebyshev’s inequality applied to $|X_{i}|$ . ∎

28.2 Implementation details

Lemma 4.3 continues to hold when the hash functions $h_{1},\dots,h_{k}$ are individually 2-independent (and mutually independent) since we only use bounds on the first and second moment of the hash values. We can therefore use a simple and practical scheme such as Zobrist hashing [173] that hashes strings of $\Theta(\log n)$ bits to strings of $\Theta(\log n)$ bits in $O(1)$ time using space, say, $O(n^{1/2})$ . It is not hard to see that the domain and range of $h_{1},\dots,h_{k}$ can be compressed to $O(\log n)$ bits (causing a negligible increase in the failure probability of the data structure). We simply hash the paths $p\in M_{i}(x)$ to intermediate values of $O(\log n)$ bits, avoiding collisions with high probability, and in a similar vein, with high probability $O(\log n)$ bits of precision suffice to determine whether $h_{i}(p\circ j)<\frac{x_{j}}{b_{1}|x|}$ .

We now consider how to parameterize $\mathcal{M}_{B}$ to solve the $(b_{1},b_{2})$ -similarity problem for Braun-Blanquet similarity on a set $P$ of $|P|=n$ points for every choice of constant parameters $0<b_{2}<b_{1}<1$ , independent of $n$ . Note that we exclude $b_{1}=1$ (which would correspond to identical vectors that can be found in time $O(1)$ by resorting to standard hashing) and $b_{2}=0$ (for which every data point would be a valid answer to a query). We set parameters

[TABLE]

from which it follows that $\mathcal{M}_{B}$ is $(b_{1},b_{2},m_{1},m_{2})$ -sensitive with $m_{1}=n^{\rho}w/b_{1}$ and $m_{2}=n^{\rho-1}w$ where $\rho=\log(1/b_{1})/\log(1/b_{2})$ . To bound the expected evaluation time of $M_{k}$ we use Zobrist hashing as well as intermediate hashes for the paths as described above. In the $i$ th step in the branching process the expected number of hash function evaluations is bounded by $|q|$ times the number of paths alive at step $i-1$ . We can therefore bound the expected time to compute $M_{k}(q)$ by

[TABLE]

This completes the proof of Theorem 4.1.

Footnote 1: We know of a way of replacing the multiplicative factor $|q|$ in equation (6) by an additive term of $O(|q|k)$ by choosing the hash functions $h_{i}$ carefully, but do not discuss this improvement here since $|q|$ can be assumed to be polylogarithmic and our focus is on the exponent of $n$ .
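As a numerical sanity check of the sensitivity parameters, the sketch below evaluates the bounds of Lemma 4.3 for one concrete parameter choice. The choice $k=\lceil\log n/\log(1/b_{2})\rceil$ and $w=k$ is an assumption made for illustration: it is consistent with the stated values of $m_{1}$ and $m_{2}$ and makes property 3 of Lemma 4.3 yield probability at least $1/2$ , but it is not necessarily the exact setting used in the proof.

```python
from math import ceil, log


def chosen_path_parameters(n, b1, b2):
    """Return (k, w, rho, m1, m2) for an assumed parameter setting."""
    k = ceil(log(n) / log(1 / b2))   # assumed path length
    w = k                            # assumed number of starting points
    rho = log(1 / b1) / log(1 / b2)
    m1 = (1 / b1) ** k * w           # bound on E|M_k(x)| (property 1)
    m2 = (b2 / b1) ** k * w          # bound on E|M_k(x) & M_k(y)| for distant pairs
    return k, w, rho, m1, m2


n, b1, b2 = 10**6, 0.5, 0.25
k, w, rho, m1, m2 = chosen_path_parameters(n, b1, b2)
print(k, w, rho, m1, n**rho * w / b1, m2, n**(rho - 1) * w)
```

With these values the computed $m_{1}$ and $m_{2}$ stay below $n^{\rho}w/b_{1}$ and $n^{\rho-1}w$ respectively, matching the sensitivity claimed above.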

28.3 Comparison

We will proceed by comparing our Theorem 4.1 to results that can be achieved using existing techniques. Again we focus on the setting where data points and query points are exactly $t$ -sparse. An overview of different techniques for three measures of similarity is shown in Table 2. To summarize: The Chosen Path algorithm of Theorem 4.1 improves upon all existing data-independent results over the entire $0<b_{2}<b_{1}<1$ parameter space. Furthermore, we improve upon the best known data-dependent techniques [17] for a large part of the parameter space (see Figure 9). The details of the comparisons are given in Appendix 33.

MinHash.

For $t$ -sparse vectors there is a 1-1 mapping between Braun-Blanquet and Jaccard similarity. In this setting $J(x,y)=B(x,y)/(2-B(x,y))$ . Let $b_{1}=2j_{1}/(j_{1}+1)$ and $b_{2}=2j_{2}/(j_{2}+1)$ be the Braun-Blanquet similarities corresponding to Jaccard similarities $j_{1}$ and $j_{2}$ . The LSH framework using MinHash achieves $\rho_{\text{minhash}}=\log\left(\tfrac{b_{1}}{2-b_{1}}\right)/\log\left(\tfrac{b_{2}}{2-b_{2}}\right)$ ; this should be compared to $\rho=\log(b_{1})/\log(b_{2})$ achieved in Theorem 4.1. Since the function $f(z)=\log(\tfrac{z}{2-z})/\log z$ is monotonically increasing in $[0,1]$ we have that $\rho/\rho_{\text{minhash}}=f(b_{2})/f(b_{1})<1$ , i.e., $\rho$ is always smaller than $\rho_{\text{minhash}}$ . As an example, for $j_{1}=0.2$ and $j_{2}=0.1$ we get $\rho=0.644...$ while $\rho_{\text{minhash}}=0.698...$ . Figure 7 shows the difference for the whole parameter space.

Angular LSH.

Since our vectors are exactly $t$ -sparse Braun-Blanquet similarities correspond directly to dot products (which in turn correspond to angles). Thus we can apply angular LSH such as SimHash [47] or cross-polytope LSH [13]. As observed in [54] one can express the $\rho$ -value of cross-polytope LSH in terms of dot products as $\rho_{\text{angular}}=\tfrac{1-b_{1}}{1+b_{1}}/\tfrac{1-b_{2}}{1+b_{2}}$ . Since the function $f^{\prime}(z)=(1+z)\log(z)/(1-z)$ is negative and monotonically increasing in $[0,1]$ we have that $\rho/\rho_{\text{angular}}=f^{\prime}(b_{1})/f^{\prime}(b_{2})<1$ , i.e., $\rho$ is always smaller than $\rho_{\text{angular}}$ . In the above example, for $j_{1}=0.2$ and $j_{2}=0.1$ we have $\rho_{\text{angular}}=0.722...$ which is about $0.078$ more than Chosen Path. See Figure 8 for a visualization of the difference for the whole parameter space.

Data-dependent Hamming LSH.

The Hamming distance between two $t$ -sparse vectors with Braun-Blanquet similarity $b$ is $2t(1-b)$ , since the intersection of the vectors has size $tb$ . This means that $(b_{1},b_{2})$ -similarity search under Braun-Blanquet similarity can be reduced to Hamming similarity search with approximation factor $c=(2t(1-b_{2}))/(2t(1-b_{1}))=(1-b_{2})/(1-b_{1})$ . As mentioned above, the data-dependent LSH technique of [17] achieves $\rho=1/(2c-1)$ ignoring $o_{n}(1)$ terms. In terms of $b_{1}$ and $b_{2}$ this is $\rho_{\text{datadep}}=\frac{1-b_{1}}{1+b_{1}-2b_{2}}$ , which is incomparable to the $\rho$ of Theorem 4.1. In Appendix 33 we show that $\rho<\rho_{\text{datadep}}$ whenever $b_{2}\leq 1/5$ , or equivalently, whenever $j_{2}\leq 1/9$ . Revisiting the above example, for $j_{1}=0.2$ and $j_{2}=0.1$ we have $\rho_{\text{datadep}}=0.6875$ which is about $0.043$ more than Chosen Path. Figure 9 gives a comparison covering the whole parameter space.
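The running example $j_{1}=0.2$ , $j_{2}=0.1$ can be reproduced from the closed-form expressions in this section. The function names below are illustrative; the formulas are the ones stated above for $t$ -sparse vectors.

```python
from math import log


def braun_blanquet_from_jaccard(j):
    """For t-sparse sets, J = B / (2 - B), hence B = 2J / (1 + J)."""
    return 2 * j / (1 + j)


def rho_chosen_path(b1, b2):
    return log(1 / b1) / log(1 / b2)


def rho_minhash(b1, b2):
    return log(b1 / (2 - b1)) / log(b2 / (2 - b2))


def rho_angular(b1, b2):
    return ((1 - b1) / (1 + b1)) / ((1 - b2) / (1 + b2))


def rho_datadep(b1, b2):
    return (1 - b1) / (1 + b1 - 2 * b2)


b1 = braun_blanquet_from_jaccard(0.2)
b2 = braun_blanquet_from_jaccard(0.1)
for name, f in [("chosen path", rho_chosen_path), ("minhash", rho_minhash),
                ("angular", rho_angular), ("data-dependent", rho_datadep)]:
    print(f"{name}: {f(b1, b2):.4f}")
```

The printed values correspond to the figures quoted in the MinHash, Angular LSH, and data-dependent comparisons.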

29 Lower bound

In this section we will show a locality-sensitive hashing lower bound for $\{0,1\}^{d}$ under Braun-Blanquet similarity. We will first show that LSH lower bounds apply to the class of solutions to the approximate similarity search problem that are based on locality-sensitive maps, thereby including our own upper bound. Next we will introduce some relevant tools from the literature, in particular the LSH lower bounds for Hamming space by O’Donnell et al. [121] which we use, through a reduction, to show LSH lower bounds under Braun-Blanquet similarity.

Lower bounds for locality-sensitive maps.

Because our upper bound is based on a family of locality-sensitive maps $\mathcal{M}_{B}$ rather than on LSH, we first show that LSH lower bounds apply to LSM-based solutions. This is not too surprising as both the LSH and LSF frameworks produce LSM-based solutions. We note that the idea of showing lower bounds for a more general class of algorithms that encompasses both LSH and LSF was used by Andoni et al. [16] in their list-of-points data structure lower bound for the space-time tradeoff of solutions to the approximate near neighbor problem in the random data regime. We use the approach of Christiani [54] to convert an LSM family into an LSH family using MinHash.

Lemma 4.4.

Suppose we have a $(s_{1},s_{2},m_{1},m_{2})$ -sensitive family of maps $\mathcal{M}$ . Then we can construct a $(s_{1},s_{2},p_{1},p_{2})$ -sensitive family of hash functions $\mathcal{H}$ with $p_{1}=1/8m$ and $p_{2}=m_{2}/m$ where $m=\lceil 8m_{1}\rceil$ .

Proof.

We sample a function $h$ from $\mathcal{H}$ by sampling a function $M$ from $\mathcal{M}$ , modify $M$ to output a set of fixed size, and apply MinHash to the resulting set. For $M\in\mathcal{M}$ we define the function $\tilde{M}$ where we ensure that the size of the output set is $m$ . We note that the purpose of this step is to be able to simultaneously lower bound $p_{1}$ and upper bound $p_{2}$ for $\mathcal{H}$ when we apply MinHash to the resulting sets.

[TABLE]

We proceed by applying MinHash to the set $\tilde{M}(x)$ . Let $\pi$ denote a random permutation of the range of $\tilde{M}$ and define

[TABLE]

We then have

[TABLE]

summing over the finite set of all possible Jaccard similarities $\xi=a/b$ with $a,b\in\{0,1,\dots,2m\}$ . It is now fairly simple to lower bound $p_{1}$ and upper bound $p_{2}$ . Assume that $x,y$ satisfy that $\operatorname{sim}(x,y)\geq s_{1}$ . To lower bound $p_{1}$ we use a union bound together with Markov’s inequality to bound the following probability:

[TABLE]

We therefore have that $\Pr[\tilde{M}(x)\cap\tilde{M}(y)\neq\emptyset]\geq 1/4$ . In the event of a nonempty intersection the probability of collision is given by $J(\tilde{M}(x),\tilde{M}(y))\geq 1/2m$ , allowing us to conclude that $p_{1}\geq 1/8m$ .

Bounding the collision probability for distant pairs of points $x,y$ with $\operatorname{sim}(x,y)\leq s_{2}$ we get

[TABLE]

∎
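The conversion from maps to hash functions can be sketched as follows. The padding rule used here is an assumption made for illustration, chosen to be consistent with the bounds in the proof: a point whose map output has at most $m$ elements is padded with dummy elements unique to that point, and a point whose output is larger than $m$ is replaced entirely by dummy elements. The random permutation $\pi$ is simulated by ranking elements with a seeded BLAKE2b digest, and all function names are hypothetical.

```python
import hashlib


def make_rank(seed):
    """Return a ranking function that stands in for a random permutation."""
    def rank(item):
        data = repr((seed, item)).encode()
        return hashlib.blake2b(data, digest_size=8).digest()
    return rank


def fixed_size(x, M, m):
    """Return a set of exactly m elements derived from M(x) (assumed padding rule)."""
    key = tuple(sorted(x))                      # identifies the point x
    buckets = M(x)
    if len(buckets) > m:                        # too large: use dummies only
        return {("pad", key, i) for i in range(m)}
    real = {("bucket", b) for b in buckets}
    return real | {("pad", key, i) for i in range(m - len(real))}


def lsm_to_lsh(M, m, seed=0):
    """Return h(x) = the minimum-rank element of the fixed-size set for x."""
    rank = make_rank(seed)
    return lambda x: min(fixed_size(x, M, m), key=rank)


# Toy map (an assumption for illustration): every element of the set is a bucket.
h = lsm_to_lsh(M=set, m=4, seed=3)
print(h({1, 2, 3}), h({3, 4}))
```

Since dummy elements are unique to a point, two distinct points can only collide on a genuine bucket in $M(x)\cap M(y)$ , and since every output has size exactly $m$ , the Jaccard similarity of two outputs with a nonempty intersection is at least $1/2m$ .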

We are now ready to justify the statement that LSH lower bounds apply to LSM, allowing us to restrict our attention to proving LSH lower bounds for Braun-Blanquet similarity.

Corollary 4.1.

Suppose that we have an LSM-based solution to the $(s_{1},s_{2})$ -similarity search problem with query time $O(n^{\rho})$ . Then there exists a family $\mathcal{H}$ of locality-sensitive hash functions with $\rho(\mathcal{H})=\rho+O(1/\log n)$ .

Proof.

The existence of the LSM-based solution implies that for every $n$ there exists a $(s_{1},s_{2},m_{1},m_{2})$ -sensitive family of maps $\mathcal{M}$ with $m_{1}=O(n^{\rho})$ and $nm_{2}=O(n^{\rho})$ . The upper bound on $\rho$ follows from applying Lemma 4.4. ∎

LSH lower bounds for Hamming space.

There exist a number of powerful results that lower bound the $\rho$ -value that is attainable by locality-sensitive hashing and related approaches in various settings [114, 132, 121, 19, 54, 16]. O’Donnell et al. [121] showed an LSH lower bound of $\rho=\log(1/p_{1})/\log(1/p_{2})\geq 1/c-o_{d}(1)$ for $d$ -dimensional Hamming space under the assumption that $p_{2}$ is not too small compared to $d$ , i.e., $\log(1/p_{2})=o(d)$ . The lower bound by O’Donnell et al. holds for $(r,cr,p_{1},p_{2})$ -sensitive families for a particular choice of $r$ that depends on $d$ , $p_{2}$ , and $c$ , and where $r$ is small compared to $d$ (for instance, we have that $r=\tilde{\Theta}(d^{2/3})$ when $c$ and $p_{2}$ are constant).

We state a simplified version of the lower bound due to O’Donnell et al. where $r=\sqrt{d}$ that we will use as a tool to prove our lower bound for Braun-Blanquet similarity. The full proof of Lemma 4.5 is given in Appendix 32.

Lemma 4.5.

For every $d\in\mathbb{N}$ , $1/d\leq p_{2}\leq 1-1/d$ , and $1\leq c\leq d^{1/8}$ every $(\sqrt{d},c\sqrt{d},p_{1},p_{2})$ -sensitive hash family $\mathcal{H}$ for $\{0,1\}^{d}$ under Hamming distance must have

[TABLE]

In general, good lower bounds for the entire parameter space $(r,cr)$ are not known, although the techniques by O’Donnell et al. appear to yield a bound of $\rho\gtrsim\log(1-2r/d)/\log(1-2cr/d)$ . This is far from tight as can be seen by comparing it to the bit-sampling [91] upper bound of $\rho=\log(1-r/d)/\log(1-cr/d)$ . Existing lower bounds are tight in two different settings. First, in the setting where $cr\approx d/2$ (random data), lower bounds [114, 73, 19] match various instantiations of angular LSH [156, 14, 13]. Second, in the setting where $r\ll d$ , the lower bound by O’Donnell et al. [121] becomes $\rho\gtrsim\log(1-2r/d)/\log(1-2cr/d)\approx 1/c$ , matching bit-sampling LSH [91] as well as Angular LSH.
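The gap between the two expressions, and their agreement when $r\ll d$ , can be seen numerically. This is a small illustrative sketch with hypothetical function names; it evaluates the bit-sampling upper bound and the expression $\log(1-2r/d)/\log(1-2cr/d)$ mentioned above, ignoring lower-order terms.

```python
from math import log1p


def rho_bit_sampling(r, c, d):
    """Bit-sampling upper bound log(1 - r/d) / log(1 - cr/d)."""
    return log1p(-r / d) / log1p(-c * r / d)


def rho_lower_expression(r, c, d):
    """The expression log(1 - 2r/d) / log(1 - 2cr/d) from the discussion above."""
    return log1p(-2 * r / d) / log1p(-2 * c * r / d)


for r, d in [(10, 100), (10, 10**6)]:
    print(r, d, rho_lower_expression(r, 2, d), rho_bit_sampling(r, 2, d))
```

For $r/d=0.1$ and $c=2$ the two values differ noticeably, while for $r/d=10^{-5}$ both are within $10^{-4}$ of $1/c=0.5$ .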

29.1 Braun-Blanquet LSH lower bound

We are now ready to prove the LSH lower bound from Theorem 4.2. The lower bound together with Corollary 4.1 shows that the $\rho$ -value of Theorem 4.1 is best possible up to $o_{d}(1)$ terms within the class of data-independent locality-sensitive maps for Braun-Blanquet similarity. Furthermore, the lower bound also applies to angular distance on the unit sphere where it comes close to matching the best known upper bounds for much of the parameter space as can be seen from Figure 8.

Proof sketch.

The proof works by assuming the existence of a $(b_{1},b_{2},p_{1},p_{2})$ -sensitive family $\mathcal{H}_{B}$ for $\{0,1\}^{d}$ under Braun-Blanquet similarity with $\rho=\log(1/b_{1})/\log(1/b_{2})-\gamma$ for some $\gamma>0$ . We use a transformation $T$ from Hamming space to Braun-Blanquet similarity to show that the existence of $\mathcal{H}_{B}$ implies the existence of a $(r,cr,p_{1}^{\prime},p_{2}^{\prime})$ -sensitive family $\mathcal{H}_{H}$ for $D$ -dimensional Hamming space that will contradict the lower bound of O’Donnell et al. [121] as stated in Lemma 4.5 for some appropriate choice of $\gamma=\gamma(d,p_{2})$ .

We proceed by giving an informal description of a simple “tensoring” technique for converting a similarity search problem in Hamming space into a Braun-Blanquet set similarity problem for target similarity thresholds $b_{1},b_{2}$ . For $x\in\{0,1\}^{d}$ define

$$\tilde{x}=\{(i,x_{i})\mid i\in[d]\}$$

and for a positive integer $\tau$ define $x^{\otimes\tau}=\{(v_{1},\dots,v_{\tau})\mid v_{i}\in\tilde{x}\}$ . We have that $|x^{\otimes\tau}|=|\tilde{x}|^{\tau}=d^{\tau}$ and

$$B(x^{\otimes\tau},y^{\otimes\tau})=\left(1-\frac{r}{d}\right)^{\tau}$$

where $r=\left\lVert x-y\right\rVert_{1}$ . For every choice of constants $0<b_{2}<b_{1}<1$ we can choose $d$ , $\tau$ , $r$ , and $c\geq 1$ such that $(1-r/d)^{\tau}\approx b_{1}$ and $(1-cr/d)^{\tau}\approx b_{2}$ . Now, given an LSH family for Braun-Blanquet with $\rho<\log(1/b_{1})/\log(1/b_{2})$ we would be able to obtain an LSH family for Hamming space with

[TABLE]

For appropriate choices of parameters this would contradict the O’Donnell et al. LSH lower bound of $\rho\gtrsim 1/c$ for Hamming space. The proof itself is mostly an exercise in setting parameters and applying the right bounds and approximations to make everything fit together with the intuition above. Importantly, we use sampling in order to map to a dimension that is much lower than the $d^{\tau}$ from the proof sketch in order to make the proof hold for small values of $p_{2}$ in relation to $d$ .
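The tensoring step of the proof sketch can be checked on a toy instance. The sketch below assumes the encoding $\tilde{x}=\{(i,x_{i})\mid i\in[d]\}$ , which satisfies $|\tilde{x}|=d$ ; the function names are hypothetical and the explicit enumeration of $d^{\tau}$ tuples is only feasible for very small $d$ and $\tau$ .

```python
from itertools import product
import math

def tilde(x):
    # Assumed encoding: one (position, bit) pair per coordinate, so |tilde(x)| = d.
    return {(i, bit) for i, bit in enumerate(x)}

def tensor(x, tau):
    # All tau-tuples over tilde(x); the result has d**tau elements.
    return set(product(tilde(x), repeat=tau))

def braun_blanquet(a, b):
    return len(a & b) / max(len(a), len(b))

x = (1, 0, 1, 1, 0, 0, 1, 0)
y = (1, 0, 0, 1, 0, 1, 1, 0)                  # differs from x in two positions
d, tau = len(x), 3
r = sum(xi != yi for xi, yi in zip(x, y))     # Hamming distance
similarity = braun_blanquet(tensor(x, tau), tensor(y, tau))
assert math.isclose(similarity, (1 - r / d) ** tau)
```

The proof itself avoids materializing the $d^{\tau}$ -dimensional sets by sampling, as noted above.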

Hamming distance to Braun-Blanquet similarity.

Let $d\in\mathbb{N}$ and let $0<b_{2}<b_{1}<1$ be constant as in Theorem 4.2. Let $\varepsilon\geq 1/d$ be a parameter to be determined. We want to show how to use a transformation $T\colon\{0,1\}^{D}\to\{0,1\}^{d}$ from Hamming distance to Braun-Blanquet similarity together with our family $\mathcal{H}_{B}$ to construct a $(r,cr,p_{1}^{\prime},p_{2}^{\prime})$ -sensitive family $\mathcal{H}_{H}$ for $D$ -dimensional Hamming space with parameters

[TABLE]

where $p_{1}^{\prime}$ and $p_{2}^{\prime}$ remain to be determined.

The function $T$ takes as parameters positive integers $t$ , $l$ , and $\tau$ . The output of $T$ consists of $t$ concatenated $l$ -bit strings, each of Hamming weight one. Each of the $t$ strings is constructed independently at random according to the following process: Sample a vector of indices ${i}=(i_{1},i_{2},\dots,i_{\tau})$ uniformly at random from $[D]^{\tau}$ and define $x_{{i}}\in\{0,1\}^{\tau}$ as $x_{{i}}=x_{i_{1}}\circ x_{i_{2}}\circ\dots\circ x_{i_{\tau}}$ . Let $z(x)\in\{0,1\}^{2^{\tau}}$ be indexed by $j\in\{0,1\}^{\tau}$ and set the bits of $z(x)$ as follows:

$$z(x)_{j}=\begin{cases}1&\text{if }x_{{i}}=j,\\0&\text{otherwise.}\end{cases}$$

Next we apply a random function $g\colon\{0,1\}^{\tau}\to[l]$ in order to map $z(x)$ down to an $l$ -bit string ${r}(z(x))$ of Hamming weight one while approximately preserving Braun-Blanquet similarity. For $i\in[l]$ we set

$${r}(z(x))_{i}=\bigvee_{j\colon g(j)=i}z(x)_{j}.$$

Finally we set

$$T(x)={r}_{1}(z_{1}(x))\circ{r}_{2}(z_{2}(x))\circ\dots\circ{r}_{t}(z_{t}(x))$$

where each ${r}_{i}(z_{i}(x))$ is constructed independently at random.
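As an illustration, the following sketch samples a map of this form, following our reading of the construction described above. The name `sample_T`, the representation of the output as the set of (block, position) pairs that hold a one, the lazily materialized random function $g$ , and the toy parameter values are all assumptions made for the example.

```python
import random

def sample_T(D, t, l, tau, seed=0):
    """Sample a random map T on {0,1}^D; T(x) is the set of positions set to one."""
    rng = random.Random(seed)
    blocks = []
    for _ in range(t):
        idx = [rng.randrange(D) for _ in range(tau)]   # tau sampled coordinates
        g = {}                                         # random g: {0,1}^tau -> [l], filled lazily
        blocks.append((idx, g))

    def T(x):
        ones = set()
        for k, (idx, g) in enumerate(blocks):
            pattern = tuple(x[i] for i in idx)         # the restriction x_i, i.e. the one-hot z(x)
            if pattern not in g:
                g[pattern] = rng.randrange(l)          # fix g(pattern) on first use
            ones.add((k, g[pattern]))                  # block k has a single one
        return ones

    return T

def braun_blanquet(a, b):
    return len(a & b) / max(len(a), len(b))
```

Since every block contributes exactly one position, $|T(x)|=t$ holds by construction, and two inputs agree on a block whenever they agree on the sampled coordinates or collide under $g$ .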

We state the properties of $T$ for the following parameter setting:

[TABLE]

Lemma 4.6.

For every $d\in\mathbb{N}$ and $D=2^{d}$ there exists a distribution over functions of the form $T\colon\{0,1\}^{D}\to\{0,1\}^{d}$ such that for all $x,y\in\{0,1\}^{D}$ and random $T$ :

1. $|T(x)|=t$ .

2. If $\left\lVert x-y\right\rVert_{1}\leq r$ then $B(T(x),T(y))\geq{b_{1}}$ with probability at least $1-e^{-t\varepsilon^{2}/2}$ .

3. If $\left\lVert x-y\right\rVert_{1}>cr$ then $B(T(x),T(y))<{b_{2}}$ with probability at least $1-2e^{-t\varepsilon^{2}/32}$ .

Proof.

The first property is trivial. For the second property we consider $x,y$ with $\left\lVert x-y\right\rVert_{1}\leq r$ where we would like to lower bound

$$B(T(x),T(y))=\frac{|T(x)\cap T(y)|}{\max(|T(x)|,|T(y)|)}.$$

We know that $|T(x)|=|T(y)|=t$ so it remains to lower bound the size of the intersection $|T(x)\cap T(y)|$ . Consider the expectation

[TABLE]

We have that $z(x)=z(y)$ if $x$ and $y$ take on the same value in the $\tau$ underlying bit-positions that are sampled to construct $z$ . Under the assumption that $\varepsilon\geq 1/d$ , then for $d$ greater than some sufficiently large constant we can use a standard approximation to the exponential function (detailed in Lemma 4.10 in Appendix 32) to show that

[TABLE]

Seeing as $|T(x)\cap T(y)|$ is the sum of $t$ independent Bernoulli trials we can apply Hoeffding’s inequality to yield the following bound:

[TABLE]

This proves the second property of $T$ .

For the third property we consider the Braun-Blanquet similarity of distant pairs of points $x,y$ with $\left\lVert x-y\right\rVert_{1}>cr$ . Again, under our assumption that $\varepsilon\geq 1/d$ and for $d$ greater than some constant we have

[TABLE]

There are two things that can cause the event $B(T(x),T(y))<{b_{2}}$ to fail. First, the sum of the $t$ independent Bernoulli trials for the event $z(x)=z(y)$ can deviate too much from its expected value. Second, the mapping down to $l$ -bit strings that takes place from $z(x)$ to ${r}(z(x))$ can lead to an additional increase in the similarity due to collisions. Let $Z$ denote the sum of the $t$ Bernoulli trials for the events $z(x)=z(y)$ associated with $T$ . We again apply a standard Hoeffding bound to show that

[TABLE]

Let $X$ denote the number of collisions when performing the universe reduction to $l$ -bit strings. By our choice of $l$ we have that $E[X]\leq(\varepsilon/8)t$ . Another application of Hoeffding’s inequality shows that

[TABLE]

We therefore get that

[TABLE]

This proves the third property of $T$ . ∎

Contradiction.

To summarize, using the random map $T$ together with the LSH family $\mathcal{H}_{B}$ we can obtain an $(r,cr,{p_{1}}^{\prime},{p_{2}}^{\prime})$ -sensitive family $\mathcal{H}_{H}$ for $D$ -dimensional Hamming space with ${p_{1}}^{\prime}={p_{1}}-\delta$ and ${p_{2}}^{\prime}={p_{2}}+\delta$ for $\delta=2e^{-t\varepsilon^{2}/32}$ . For our choice of $c=\frac{\ln(1/(b_{2}-\varepsilon))}{\ln(1/(b_{1}+\varepsilon))}$ we plug the family $\mathcal{H}_{H}$ into the lower bound of Lemma 4.5 and use that $O(D^{-1/4})=O(\varepsilon)$ which follows from our constraint that $\varepsilon\geq 1/d$ .

[TABLE]

Under our assumed properties of $\mathcal{H}_{B}$ , we can upper bound the value of $\rho$ for $\mathcal{H}_{H}$ . For simplicity we temporarily define $\lambda=2\delta/{p_{2}}$ and assume that $\lambda/\ln(1/{p_{2}})\leq 1/2$ and $\ln(1/{p_{2}})\geq 1$ . The latter property holds without loss of generality through use of the standard LSH powering technique [91, 86, 121] that allows us to transform an LSH family with ${p_{2}}<1$ to a family that has ${p_{2}}\leq 1/e$ without changing its associated $\rho$ -value.

[TABLE]

We get a contradiction between our upper bound and lower bound for $\rho(\mathcal{H}_{H})$ whenever $\gamma$ violates the following relation that summarizes the bounds:

[TABLE]

In order for a contradiction to occur, the value of $\gamma$ has to satisfy

[TABLE]

By our setting of $t=\lfloor d/l\rfloor$ and $l=\lceil 8/\varepsilon\rceil$ we have that $\delta=e^{-\Omega(d\varepsilon^{3})}$ . We can cause a contradiction for a setting of $\varepsilon^{3}=K\frac{\ln(d/{p_{2}})}{d}$ where $K$ is some constant and where we assume that $d$ is greater than some constant. The value of $\gamma$ for which the lower bound holds can be upper bounded by

[TABLE]

This completes the proof of Theorem 4.2.

30 Equivalent set similarity problems

In this section we consider how to use our data structure for Braun-Blanquet similarity search to support other similarity measures such as Jaccard similarity. We already observed in the introduction that a direct translation exists between several similarity measures whenever the size of every set is fixed to $t$ . Call an $(s_{1},s_{2})$ -similarity search problem ( $t$ , $t^{\prime}$ )-regular if $P$ is restricted to vectors of weight $t$ and queries are restricted to vectors of weight $t^{\prime}$ . Obviously, a $(t,t^{\prime})$ -regular similarity search problem is no harder than the general similarity search problem, but it also cannot be too much easier when expressed as a function of the thresholds $(s_{1},s_{2})$ : For every pair $(t,t^{\prime})\in\{0,\dots,d\}^{2}$ we can construct a ( $t$ , $t^{\prime}$ )-regular data structure (such that each point $x\in P$ is represented in the $d+1$ data structures with $t=|x|$ ), and answer a query for $q\in\{0,1\}^{d}$ by querying all data structures with $t^{\prime}=|q|$ . Thus, the time and space for the general $(s_{1},s_{2})$ -similarity search problem are at most $d+1$ times larger than the time and space of the most expensive ( $t$ , $t^{\prime}$ )-regular data structure. This does not mean that we cannot get better bounds in terms of other parameters, and in particular we expect that the difficulty of $(t,t^{\prime})$ -regular similarity search problems depends on parameters $t$ and $t^{\prime}$ .

Dimension reduction.

If the dimension is large, a factor of $d$ may be significant. However, for most natural similarity measures an $(s_{1},s_{2})$ -similarity problem in $d\gg(\log n)^{3}$ dimensions can be reduced to a logarithmic number of $(s^{\prime}_{1},s^{\prime}_{2})$ -similarity problems on $P^{\prime}\subseteq\{0,1\}^{d^{\prime}}$ in $d^{\prime}=(\log n)^{3}$ dimensions with $s^{\prime}_{1}=s_{1}-O(1/\log n)$ and $s^{\prime}_{2}=s_{2}+O(1/\log n)$ . Since the similarity gap is close to the one in the original problem, $s^{\prime}_{1}-s^{\prime}_{2}=s_{1}-s_{2}-O(1/\log n)$ , where $s_{1}$ and $s_{2}$ are assumed to be independent of $n$ , the difficulty ( $\rho$ -value) remains essentially the same. First, split $P$ into $\log d$ size classes $P_{i}$ such that vectors in class $i$ have size in $[2^{i},2^{i+1})$ . For each size class the reduction is done independently and works by a standard technique: sample a sequence of random sets $I_{j}\subseteq\{1,\dots,d\}$ , $j=1,\dots,d^{\prime}$ , and set $x^{\prime}_{j}=\vee_{\ell\in I_{j}}x_{\ell}$ . The size of each set $I_{j}$ is chosen such that $Pr[x^{\prime}_{j}=1]\approx 1/\log(n)$ when $|x|=2^{i+1}$ . By Chernoff bounds this mapping preserves the relative weight of vectors up to size $2^{i}\log n$ up to an additive $O(1/\log n)$ term with high probability. Assume now that the similarity measure is such that for vectors in $P_{i}$ we only need to consider $|q|$ in the range from $2^{i}/\log n$ to $2^{i}\log n$ (since if the size difference is larger, the similarity is negligible). Then we can apply Chernoff bounds to the relative weights of the dimension-reduced vectors $x^{\prime}$ , $q^{\prime}$ and the intersection $x^{\prime}\cap q^{\prime}$ . In particular, we get that the Jaccard similarity of a pair of vectors is preserved up to an additive error of $O(1/\log n)$ with high probability.
The class of similarity measures for which dimension reduction to $(\log n)^{O(1)}$ dimensions is possible is large, and we do not attempt to characterize it here. Instead, we just note that for such similarity measures we can determine the complexity of similarity search up to a factor $(\log n)^{O(1)}$ by only considering regular search problems.

Equivalence of regular similarity search problems.

We call a set similarity measure on $\{0,1\}^{d}$ symmetric if it can be written in the form $S(q,x)=f_{d,|q|,|x|}(|q\cap x|)$ , where each function $f_{d,|q|,|x|}\colon\mathbb{N}\rightarrow[0,1]$ is nondecreasing. All 59 set similarity measures listed in the survey [51], normalized to yield similarities in $[0,1]$ , are symmetric. In particular this is the case for Jaccard similarity (where $J(q,x)=|q\cap x|/(|q|+|x|-|q\cap x|)$ ) and for Braun-Blanquet similarity. For a symmetric similarity measure, the predicate $\operatorname{sim}(q,x)\geq s_{1}$ is equivalent to the predicate $|q\cap x|\geq i_{1}$ , where $i_{1}=\min\{i\;|\;f_{d,t^{\prime},t}(i)\geq s_{1}\}$ , and $\operatorname{sim}(q,x)>s_{2}$ is equivalent to the predicate $|q\cap x|\geq i_{2}$ , where $i_{2}=\min\{i\;|\;f_{d,t^{\prime},t}(i)>s_{2}\}$ . This means that every ( $t$ , $t^{\prime}$ )-regular $(s_{1},s_{2})$ -similarity search problem on $P\subseteq\{0,1\}^{d}$ is equivalent to an $(i_{1}/d,i_{2}/d)$ -similarity search problem on $P$ , where $\operatorname{sim}(q,x)=|x\cap q|/d$ . In other words, all symmetric similarity search problems can be translated to each other, and it suffices to study a single one, such as Braun-Blanquet similarity.
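The translation from similarity thresholds to overlap thresholds can be made concrete with a small sketch. The helper names are hypothetical, and exact rational arithmetic is used so that the comparisons $f(i)\geq s_{1}$ and $f(i)>s_{2}$ are not affected by rounding.

```python
from fractions import Fraction

def jaccard_from_overlap(i, t_q, t_x):
    # f_{d,|q|,|x|}(i) for Jaccard similarity: i / (|q| + |x| - i).
    return Fraction(i, t_q + t_x - i)

def braun_blanquet_from_overlap(i, t_q, t_x):
    # f_{d,|q|,|x|}(i) for Braun-Blanquet similarity: i / max(|q|, |x|).
    return Fraction(i, max(t_q, t_x))

def overlap_threshold(f, s, t_q, t_x, strict=False):
    """Smallest overlap i with f(i) >= s (or f(i) > s if strict), or None."""
    for i in range(min(t_q, t_x) + 1):
        value = f(i, t_q, t_x)
        if value > s or (not strict and value == s):
            return i
    return None

# (10, 10)-regular Jaccard search with s1 = 1/2 and s2 = 1/4.
i1 = overlap_threshold(jaccard_from_overlap, Fraction(1, 2), 10, 10)
i2 = overlap_threshold(jaccard_from_overlap, Fraction(1, 4), 10, 10, strict=True)
```

In this example an overlap of at least $7$ corresponds to $J\geq 1/2$ (since $7/13\geq 1/2>6/14$ ) and an overlap of at least $5$ corresponds to $J>1/4$ .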

Jaccard similarity.

We briefly discuss Jaccard similarity since it is the most widely used measure of set similarity. If we consider the problem of $(j_{1},j_{2})$ -approximate Jaccard similarity search in the $(t,t^{\prime})$ -regular case with $t\neq t^{\prime}$ then our Theorem 4.1 is no longer guaranteed to yield the lowest value of $\rho$ among competing data-independent approaches such as MinHash and Angular LSH. To simplify the comparison between different measures we introduce parameters $\beta$ and $b$ defined by $|y|=\beta|x|$ and $b=|x\cap y|/|x|$ (note that $0\leq b\leq\beta\leq 1$ ). The three primary measures of set similarity considered in this paper can then be written as follows:

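$$B(x,y)=\frac{|x\cap y|}{\max(|x|,|y|)}=b,\qquad J(x,y)=\frac{|x\cap y|}{|x\cup y|}=\frac{b}{1+\beta-b},\qquad C(x,y)=\frac{|x\cap y|}{\sqrt{|x||y|}}=\frac{b}{\sqrt{\beta}}.$$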
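As a sanity check, the closed forms obtained by substituting $|y|=\beta|x|$ and $|x\cap y|=b|x|$ into the standard definitions of Braun-Blanquet, Jaccard, and cosine similarity can be compared against direct computation on sets. The function names and the example sets are our own.

```python
import math

def jaccard(x, y):
    return len(x & y) / len(x | y)

def cosine(x, y):
    return len(x & y) / math.sqrt(len(x) * len(y))

def braun_blanquet(x, y):
    return len(x & y) / max(len(x), len(y))

def in_terms_of_beta_b(beta, b):
    # Closed forms valid for 0 <= b <= beta <= 1.
    return {
        "braun_blanquet": b,
        "jaccard": b / (1 + beta - b),
        "cosine": b / math.sqrt(beta),
    }

x = set(range(10))           # |x| = 10
y = set(range(6, 12))        # |y| = 6, |x ∩ y| = 4
beta, b = len(y) / len(x), len(x & y) / len(x)
closed = in_terms_of_beta_b(beta, b)
assert math.isclose(closed["jaccard"], jaccard(x, y))
assert math.isclose(closed["cosine"], cosine(x, y))
assert math.isclose(closed["braun_blanquet"], braun_blanquet(x, y))
```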

As shown in Figure 10, among angular LSH, MinHash, and Chosen Path, the technique with the lowest $\rho$ -value differs depending on the parameters $(j_{1},j_{2})$ and the asymmetry $\beta$ .

We know that Chosen Path is optimal and strictly better than the competing data-independent techniques across the entire parameter space $(j_{1},j_{2})$ when $\beta=1$ , but it remains open to find tight upper and lower bounds in the case where $\beta\neq 1$ .

31 Conclusion and open problems

We have seen that, perhaps surprisingly, there exists a relatively simple way of strictly improving the $\rho$ -value for data-independent set similarity search in the case where all sets have the same size. To implement the required locality-sensitive map efficiently we introduce a new technique based on branching processes that could possibly lead to more efficient solutions in other settings.

It remains an open problem to find tight upper and lower bounds on the $\rho$ -value for Jaccard and cosine similarity search that hold for the entire parameter space in the general setting with arbitrary set sizes. Perhaps a modified version of the Chosen Path algorithm can yield an improved solution to Jaccard similarity search in general. One approach is to generalize the condition $h_{i}(p\circ j)<x_{j}/(b_{1}|x|)$ to use different thresholds for queries and updates. This yields different space-time tradeoffs when applying the Chosen Path algorithm to Jaccard similarity search.

Another interesting question is if the improvement shown for sparse vectors can be achieved in general for inner product similarity. A similar, but possibly easier, direction would be to consider weighted Jaccard similarity.

Acknowledgment

We thank Thomas Dybdahl Ahle for comments on a previous version of this manuscript.

32 Appendix: Details behind the lower bound

32.1 Tools

For clarity we state some standard technical lemmas that we use to derive LSH lower bounds.

Lemma 4.7 (Hoeffding [88, Theorem 1]).

Let $X_{1},X_{2},\dots,X_{n}$ be independent random variables satisfying $0\leq X_{i}\leq 1$ for $i\in[n]$ . Define $X=X_{1}+X_{2}+\dots+X_{n}$ , $Z=X/n$ , and $\mu=\operatorname*{\mathbb{E}}[Z]$ , then:

- For $\hat{\mu}\geq\mu$ and $0<\varepsilon<1-\hat{\mu}$ we have that $\Pr[Z-\hat{\mu}\geq\varepsilon]\leq e^{-2n\varepsilon^{2}}$ .

- For $\hat{\mu}\leq\mu$ and $0<\varepsilon<\hat{\mu}$ we have that $\Pr[Z-\hat{\mu}\leq-\varepsilon]\leq e^{-2n\varepsilon^{2}}$ .

Lemma 4.8 (Chernoff [113, Thm. 4.4 and 4.5]).

Let $X_{1},\dots,X_{n}$ be independent Poisson trials and define $X=\sum_{i=1}^{n}X_{i}$ and $\mu=\operatorname*{\mathbb{E}}[X]$ . Then, for $0<\varepsilon<1$ we have

- $\Pr[X\geq(1+\varepsilon)\mu]\leq e^{-\varepsilon^{2}\mu/3}$ .

- $\Pr[X\leq(1-\varepsilon)\mu]\leq e^{-\varepsilon^{2}\mu/2}$ .

Lemma 4.9 (Bounding the logarithm [160]).

For $x>-1$ we have that $\tfrac{x}{1+x}\leq\ln(1+x)\leq x$ .

Lemma 4.10 (Approximating the exponential function [115, Prop. B.3]).

For all $t,n\in\mathbb{R}$ with $|t|\leq n$ we have that $e^{t}(1-\tfrac{t^{2}}{n})\leq(1+\tfrac{t}{n})^{n}\leq e^{t}$ .

32.2 Proof of Lemma 4.5

Preliminaries.

We will reuse the notation of Section 3 of O’Donnell et al. [121].

Definition 4.4.

For $0\leq\lambda<1$ we say that $(x,y)$ are $(1-\lambda)$ -correlated if $x$ is chosen uniformly at random from $\{0,1\}^{d}$ and $y$ is constructed by rerandomizing each bit from $x$ independently at random with probability $\lambda$ .
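The definition can be illustrated by sampling a correlated pair directly. In the sketch below the function names are hypothetical; each rerandomized bit differs from the original with probability $1/2$ , so the expected Hamming distance of a $(1-\lambda)$ -correlated pair is $(\lambda/2)d$ .

```python
import random

def correlated_pair(d, lam, rng):
    """Sample a (1 - lam)-correlated pair (x, y) of d-bit vectors."""
    x = [rng.randrange(2) for _ in range(d)]
    # Each bit of y is redrawn uniformly with probability lam, else copied from x.
    y = [rng.randrange(2) if rng.random() < lam else xi for xi in x]
    return x, y

def hamming(x, y):
    return sum(xi != yi for xi, yi in zip(x, y))

rng = random.Random(0)
x, y = correlated_pair(10_000, 0.2, rng)
print(hamming(x, y), "observed vs expected", 0.2 / 2 * 10_000)
```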

Let $(x,y)$ be $e^{-t}$ -correlated and let $\mathcal{H}$ be a family of hash functions on $\{0,1\}^{d}$ , then we define

$$\mathbb{K}_{\mathcal{H}}(t)=\Pr_{h\sim\mathcal{H},\,(x,y)\;e^{-t}\text{-correlated}}[h(x)=h(y)].$$

We have that $\mathbb{K}_{\mathcal{H}}(t)$ is a log-convex function which implies the following property that underlies the lower bound:

Lemma 4.11.

For every family of hash functions $\mathcal{H}$ on $\{0,1\}^{d}$ , every $t\geq 0$ , and $c\geq 1$ we have

[TABLE]

The idea behind the proof is to tie $p_{1}$ to $\mathbb{K}_{\mathcal{H}}(t)$ and $p_{2}$ to $\mathbb{K}_{\mathcal{H}}(ct)$ through Chernoff bounds and then apply Lemma 4.11 to show that $\rho\gtrsim 1/c$ .

Proof.

Begin by assuming that we have a family $\mathcal{H}$ that satisfies the conditions of Lemma 4.5. Note that the expected Hamming distance between $(1-\lambda)$ -correlated points $x$ and $y$ is given by $(\lambda/2)d$ . We set $\lambda_{p_{1}}/2=d^{-1/2}-d^{-5/8}$ and $\lambda_{p_{2}}/2=cd^{-1/2}+2cd^{-5/8}$ and let $(x,y)$ denote $(1-\lambda_{p_{1}})$ -correlated random strings and $(x,x^{\prime})$ denote $(1-\lambda_{p_{2}})$ -correlated random strings. By standard Chernoff bounds we get the following guarantees:

[TABLE]

We will establish a relationship between $\mathbb{K}_{\mathcal{H}}(t_{p_{1}})$ and ${p_{1}}$ on the one hand, and $\mathbb{K}_{\mathcal{H}}(t_{p_{2}})$ and ${p_{2}}$ on the other hand, for the following choice of parameters $t_{p_{1}}$ and $t_{p_{2}}$ :

[TABLE]

By the properties of $\mathcal{H}$ and from the definition of $\mathbb{K}_{\mathcal{H}}$ we have that

[TABLE]

Let $\delta=\max\{\Pr[\left\lVert x-y\right\rVert_{1}\geq r],\Pr[\left\lVert x-x^{\prime}\right\rVert_{1}\leq cr]\}=e^{-\Omega(d^{1/4})}$ . By Lemma 4.11 and our setting of $t_{p_{1}}$ and $t_{p_{2}}$ we can use the bounds on the natural logarithm from Lemma 4.9 to show the following:

[TABLE]

We proceed by lower bounding $\rho$ where we make use of the inequalities derived above.

[TABLE]

By Lemma 4.11 combined with the restrictions on our parameters, for $d$ greater than some constant we have that $\mathbb{K}_{\mathcal{H}}(t_{p_{2}})\geq\mathbb{K}_{\mathcal{H}}(t_{p_{1}})^{2c}\geq({p_{1}}/2)^{2c}\geq(2d)^{-2c}\geq(2d)^{-2d^{1/8}}$ . Furthermore, we lower bound $\ln(1/\mathbb{K}_{\mathcal{H}}(t_{p_{2}}))$ by using that $\mathbb{K}_{\mathcal{H}}(t_{p_{2}})\leq{p_{2}}+\delta$ together with the restriction that ${p_{2}}\leq 1-1/d$ and the properties of $\delta$ . For $d$ greater than some constant it therefore holds that $\mathbb{K}_{\mathcal{H}}(t_{p_{2}})\leq 1-1/2d$ from which it follows that $\ln(1/\mathbb{K}_{\mathcal{H}}(t_{p_{2}}))\geq 1/2d$ .

[TABLE]

By the arguments above we have that

[TABLE]

Inserting the lower bound for $\frac{\ln(1/\mathbb{K}_{\mathcal{H}}(t_{p_{1}}))}{\ln(1/\mathbb{K}_{\mathcal{H}}(t_{p_{2}}))}$ results in the lemma.

33 Appendix: Comparisons

For completeness we state the proofs behind the comparisons between the $\rho$ -values obtained by the Chosen Path algorithm and other LSH techniques.

33.1 MinHash

For data sets with fixed sparsity and Braun-Blanquet similarities $0<b_{2}<b_{1}<1$ we have that $\rho/\rho_{\text{minhash}}=f(b_{2})/f(b_{1})$ where $f(x)=\log(x/(2-x))/\log(x)$ . If $f(x)$ is monotone increasing in $(0,1)$ then $\rho/\rho_{\text{minhash}}<1$ . For $x\in(0,1)$ we have that $\operatorname{sign}(f^{\prime}(x))=\operatorname{sign}(g(x))$ where $g(x)=x\ln(x)+(2-x)\ln(2-x)$ . The function $g(x)$ equals zero at $x=1$ and has the derivative $g^{\prime}(x)=\ln(x)-\ln(2-x)$ which is negative for values of $x\in(0,1)$ , so $g(x)>0$ in the interval. We can therefore see that $f^{\prime}(x)$ is positive in the interval and it follows that $\rho<\rho_{\text{minhash}}$ for every choice of $0<b_{2}<b_{1}<1$ .
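The monotonicity argument can be spot-checked numerically. The sketch below uses $\rho=\log(b_{1})/\log(b_{2})$ and the fixed-weight conversion $j=b/(2-b)$ from Braun-Blanquet to Jaccard similarity; the function names and the grid are our own choices, and a check on finitely many points is of course no substitute for the proof.

```python
import math

def rho_chosen_path(b1, b2):
    return math.log(b1) / math.log(b2)

def rho_minhash(b1, b2):
    # Fixed-weight setting: Jaccard similarity j = b / (2 - b).
    j1, j2 = b1 / (2 - b1), b2 / (2 - b2)
    return math.log(j1) / math.log(j2)

def f(x):
    return math.log(x / (2 - x)) / math.log(x)

grid = [k / 100 for k in range(1, 100)]
# f is increasing on the grid ...
assert all(f(a) < f(b) for a, b in zip(grid, grid[1:]))
# ... and therefore rho < rho_minhash for every grid pair b2 < b1.
assert all(rho_chosen_path(b1, b2) < rho_minhash(b1, b2)
           for b1 in grid for b2 in grid if b2 < b1)
```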

33.2 Angular LSH

We have that $\rho/\rho_{\text{angular}}<1$ if $f(x)=\ln(x)\frac{1+x}{1-x}$ is a monotone increasing function for $x\in(0,1)$ . For $x\in(0,1)$ we have that $\operatorname{sign}(f^{\prime}(x))=\operatorname{sign}(g(x))$ where $g(x)=(1-x^{2})/2+x\ln x$ . We note that $g(1)=0$ and $g^{\prime}(x)=1-x+\ln x$ . Therefore, if $g^{\prime}(x)<0$ for $x\in(0,1)$ it holds that $g(x)>0$ and $f(x)$ is monotone increasing in the same interval. We have that $g^{\prime}(1)=0$ and $g^{\prime\prime}(x)=-1+1/x>0$ implying that $g^{\prime}(x)<0$ in the interval.

33.3 Data-dependent LSH

Lemma 4.12.

Let $0<b_{2}<b_{1}<1$ and fix $\rho=1/2$ such that $b_{1}=\sqrt{b_{2}}$ . Then we have that $\rho<\rho_{\text{datadep}}$ for every value of $b_{2}<1/4$ .

Proof.

We will compare $\rho=\log(b_{1})/\log(b_{2})$ and $\rho_{\text{datadep}}=\frac{1-b_{1}}{1+b_{1}-2b_{2}}$ when $\rho$ is fixed at $\rho=1/2$ , or equivalently, $b_{1}=\sqrt{b_{2}}$ . We can solve the quadratic equation $1/2=\frac{1-\sqrt{b_{2}}}{1+\sqrt{b_{2}}-2b_{2}}$ to see that for $0<b_{2}<1$ we have that $\rho=\rho_{\text{datadep}}$ only when $b_{2}=1/4$ . The derivative of $\rho_{\text{datadep}}$ with respect to $b_{2}$ is negative when $b_{1}=\sqrt{b_{2}}$ . Under this restriction we therefore have that $\rho<\rho_{\text{datadep}}$ for $b_{2}<1/4$ which is equivalent to $j_{2}<1/7$ in the fixed-weight setting. ∎

To compare $\rho$ -values over the full parameter space we use the following two lemmas.

Lemma 4.13.

For every choice of fixed $0<\rho<1$ let $b_{2}=b_{1}^{1/\rho}$ . Then $\rho_{\text{datadep}}=\frac{1-b_{1}}{1+b_{1}-2b_{2}}$ is decreasing in $b_{1}$ for $b_{1}\in(0,1)$ .

Proof.

The sign of the derivative of $\rho_{\text{datadep}}$ with respect to $b_{1}$ is equal to the sign of the function $g(x)=-\rho x^{-1/\rho}+\rho-1+x^{-1}$ for $x\in(0,1)$ . We have that $g(1)=0$ and $g^{\prime}(x)=x^{-1/\rho-1}-x^{-2}>0$ for $x\in(0,1)$ which shows that $g(x)<0$ in the interval. ∎

Lemma 4.14.

For $1/5=b_{2}<b_{1}<1$ we have that $\rho<\rho_{\text{datadep}}$ .

Proof.

For fixed $b_{2}=1/5$ consider $f(b_{1})=\rho-\rho_{\text{datadep}}$ as a function of $b_{1}$ in the interval $[1/5,1]$ . We want to show that $f(b_{1})<0$ for $b_{1}\in(1/5,1)$ . In the endpoints the function takes the value $f(1/5)=f(1)=0$ . Between the endpoints we find that $f^{\prime}(b_{1})=-\frac{1}{\ln(5)b_{1}}+\frac{8/5}{(3/5+b_{1})^{2}}$ and that $f^{\prime}(b_{1})=0$ reduces to a quadratic equation with only one solution $b_{1}^{*}$ in $[1/5,1]$ . By Lemma 4.12 we know that for $b_{2}=1/5$ and $b_{1}=1/\sqrt{5}$ it holds that $f(b_{1})<0$ . Since $f(1/5)=f(1)=0$ , $f^{\prime}(b_{1})=0$ only in a single point in $[1/5,1]$ , and $f(1/\sqrt{5})<0$ we can conclude that the lemma holds. ∎

Corollary 4.2.

For every choice of $b_{1},b_{2}$ satisfying $0<b_{2}\leq 1/5$ and $b_{2}<b_{1}<1$ we have that $\rho<\rho_{\text{datadep}}$ .

Proof.

If $b_{2}=1/5$ the property holds by Lemma 4.14. If $b_{2}<1/5$ we define new variables $\hat{b}_{1},\hat{b}_{2}$ , setting $\hat{b}_{1}=\hat{b}_{2}^{\rho(b_{1},b_{2})}$ and initially consider $\hat{b}_{2}=1/5$ . In this setting we again have that $\rho(\hat{b}_{1},\hat{b}_{2})<\rho_{\text{datadep}}(\hat{b}_{1},\hat{b}_{2})$ . According to Lemma 4.13 it holds that $\rho_{\text{datadep}}$ is decreasing in $b_{1}$ , and hence also in $b_{2}=b_{1}^{1/\rho}$ , for fixed $\rho$ . Therefore, as $\hat{b}_{2}$ decreases to $\hat{b}_{2}=b_{2}$ where $\hat{b}_{1}=b_{1}$ we have that $\rho(\hat{b}_{1},\hat{b}_{2})=\rho$ remains constant while $\rho_{\text{datadep}}$ increases. Since it held that $\rho<\rho_{\text{datadep}}$ at the initial values of $\hat{b}_{1},\hat{b}_{2}$ it must also hold for $b_{1},b_{2}$ . ∎
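The statement of Corollary 4.2 can be spot-checked on a grid, using $\rho=\log(b_{1})/\log(b_{2})$ and $\rho_{\text{datadep}}=\frac{1-b_{1}}{1+b_{1}-2b_{2}}$ as in the proof of Lemma 4.12. The function names and the grid resolution are our own; the check illustrates the claim but does not replace the argument above.

```python
import math

def rho(b1, b2):
    return math.log(b1) / math.log(b2)

def rho_datadep(b1, b2):
    return (1 - b1) / (1 + b1 - 2 * b2)

def corollary_holds_on_grid(steps=100):
    """Check rho < rho_datadep for grid points with 0 < b2 <= 1/5 and b2 < b1 < 1."""
    for i in range(1, steps + 1):
        b2 = 0.2 * i / steps
        for k in range(1, steps):
            b1 = b2 + (1 - b2) * k / steps
            if not rho(b1, b2) < rho_datadep(b1, b2):
                return False
    return True

print(corollary_holds_on_grid())
```

At $b_{2}=1/4$ and $b_{1}=1/2$ the two expressions coincide, which is the crossover point identified in Lemma 4.12.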

Numerical comparison of MinHash and Data-dep. LSH.

Comparing $\rho_{\text{minhash}}$ to $\rho_{\text{datadep}}$ we can verify numerically that even for $b_{2}$ fixed as low as $b_{2}=1/23$ we can find values of $b_{1}$ (for example $b_{1}=0.995$ ) such that $\rho_{\text{minhash}}>\rho_{\text{datadep}}$ .
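The numerical comparison can be reproduced with a few lines. The sketch assumes the fixed-weight conversion $j=b/(2-b)$ for MinHash and the expression for $\rho_{\text{datadep}}$ used in Lemma 4.12; the function names are our own.

```python
import math

def rho_minhash(b1, b2):
    j1, j2 = b1 / (2 - b1), b2 / (2 - b2)   # Jaccard similarities in the fixed-weight setting
    return math.log(j1) / math.log(j2)

def rho_datadep(b1, b2):
    return (1 - b1) / (1 + b1 - 2 * b2)

b1, b2 = 0.995, 1 / 23
print(rho_minhash(b1, b2), rho_datadep(b1, b2))
```

For these parameters the MinHash value is slightly larger than the data-dependent value, whereas for a moderate choice such as $b_{1}=1/2$ the order is reversed.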

Chapter 5 Adaptive similarity join

‘A light from the shadows shall spring’

Set similarity join is a fundamental and well-studied database operator. It is usually studied in the exact setting where the goal is to compute all pairs of sets that exceed a given similarity threshold (measured e.g. as Jaccard similarity). But set similarity join is often used in settings where 100% recall may not be important — indeed, where the exact set similarity join is itself only an approximation of the desired result set.

We present a new randomized algorithm for set similarity join that can achieve any desired recall up to 100%, and show theoretically and empirically that it significantly improves on existing methods. The present state-of-the-art exact methods are based on prefix-filtering, the performance of which depends on the prevalence of rare elements in the sets. Our method is robust against the absence of such structure in the data. At 90% recall our algorithm is often more than an order of magnitude faster than state-of-the-art exact methods, depending on how well a data set lends itself to prefix filtering. Our experiments on benchmark data sets also show that the method is several times faster than comparable approximate methods. Our algorithm makes use of recent theoretical advances in high-dimensional sketching and indexing.

34 Introduction

It is increasingly important for data processing and analysis systems to be able to work with data that is imprecise, incomplete, or noisy. Similarity join has emerged as a fundamental primitive in data cleaning and entity resolution over the last decade [21, 48, 141]. In this paper we focus on set similarity join: Given collections $R$ and $S$ of sets the task is to compute

$$R{\;\bowtie_{\lambda}\;}S=\{(x,y)\in R\times S\mid\operatorname{sim}(x,y)\geq\lambda\}$$

where $\operatorname{sim}(\cdot,\cdot)$ is a similarity measure and $\lambda$ is a threshold parameter. We deal with sets $x,y\subseteq\{1,\dots,d\}$ , where the number $d$ of distinct tokens can be naturally thought of as the dimensionality of the data.

Many measures of set similarity exist [51], but perhaps the most well-known such measure is the Jaccard similarity,

$$J(x,y)=\frac{|x\cap y|}{|x\cup y|}.$$

For example, the sets $x=\{$ IT, University, Copenhagen $\}$ and $y=\{$ University, Copenhagen, Denmark $\}$ have Jaccard similarity $J(x,y)=1/2$ which could suggest that they both correspond to the same entity. In the context of entity resolution we want to find a set $T$ that contains $(x,y)\in R\times S$ if and only if $x$ and $y$ correspond to the same entity. The quality of the result can be measured in terms of precision $|(R{\;\bowtie_{\lambda}\;}S)\cap T|/|R{\;\bowtie_{\lambda}\;}S|$ and recall $|(R{\;\bowtie_{\lambda}\;}S)\cap T|/|T|$ , both of which should be as high as possible. We will be interested in methods that achieve 100% precision, but that might not have 100% recall. We refer to methods with 100% recall as exact, and others as approximate.
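The example above can be reproduced with a brute-force join, which also serves as a reference point for the problem definition. This quadratic-time sketch is for illustration only and is not the CPSJoin algorithm; the function names are our own.

```python
def jaccard(x, y):
    return len(x & y) / len(x | y)

def similarity_join(R, S, lam, sim=jaccard):
    """All pairs (x, y) in R x S with sim(x, y) >= lam, by exhaustive comparison."""
    return [(x, y) for x in R for y in S if sim(x, y) >= lam]

x = frozenset({"IT", "University", "Copenhagen"})
y = frozenset({"University", "Copenhagen", "Denmark"})
assert jaccard(x, y) == 0.5            # two shared tokens out of four distinct tokens
assert similarity_join([x], [y], 0.5) == [(x, y)]
```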

34.1 Our Contributions

We present a new approximate set similarity join algorithm: Chosen Path Similarity Join (CPSJoin). We cover its theoretical underpinnings, and show experimentally that it achieves high recall with a substantial speedup compared to state-of-the-art exact techniques. The key ideas behind CPSJoin are:

• A new recursive filtering technique inspired by the recently proposed ChosenPath index for set similarity search [56], adding new ideas to make the method parameter-free, near-linear space, and adaptive to a given data set.

• The use of efficient sketches for estimating set similarity [103] that take advantage of modern hardware.

We compare CPSJoin to the exact set similarity join algorithms in the comprehensive empirical evaluation of Mann et al. [107], using the same data sets, and to other approximate set similarity join methods suggested in the literature. We find that CPSJoin outperforms other approximate methods and scales better than exact methods when the sets are relatively large (100 tokens or more) and the similarity threshold is low (e.g. Jaccard similarity 0.5) where we see speedups of more than an order of magnitude at 90% recall. Our experiments on benchmark datasets show that exact methods are faster in the case of high similarity thresholds, when the average set size is small, and when sets have many rare elements, whereas approximate methods are faster in the case of low similarity thresholds and when sets are large. This finding is consistent with theory and is further corroborated by experiments on synthetic datasets.

34.2 Related Work

For space reasons we present just a sample of the most related previous work, and refer to the book of Augsten and Böhlen [21] for a survey of algorithms for exact similarity join in relational databases, covering set similarity joins as well as joins based on string similarity.

Exact Similarity Join.

Early work on similarity join focused on the important special case of detecting near-duplicates with similarity close to 1, see e.g. [31, 141]. A sequence of results starting with the seminal paper of Bayardo et al. [23] studied the range of thresholds that could be handled. Recently, Mann et al. [107] conducted a comprehensive study of 7 state-of-the-art algorithms for exact set similarity join for Jaccard similarity threshold $\lambda\in\{0.5,0.6,0.7,0.8,0.9\}$ . These algorithms all use the idea of prefix filtering [48], which generates a sequence of candidate pairs of sets that includes all pairs with similarity above the threshold. The methods differ in how much additional filtering is carried out. For example, [171] applies additional length and suffix filters to prune the candidate pairs.

Prefix filtering uses an inverted index that for each element stores a list of the sets in the collection containing that element. Given a set $x$ , assume that we wish to find all sets $y$ such that $|x\cap y|>t$ . A valid result set $y$ must be contained in at least one of the inverted lists associated with any subset of $|x|-t$ elements of $x$ , or we would have $|x\cap y|\leq t$ . In particular, to speed up the search, prefix filtering looks at the elements of $x$ that have the shortest inverted lists.
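A minimal sketch of the basic prefix filtering idea for an overlap threshold is given below. It is not the optimized ALL implementation evaluated by Mann et al.; the function names and the toy collection are our own.

```python
from collections import defaultdict

def build_inverted_index(sets):
    """Map each element to the ids of the sets that contain it."""
    index = defaultdict(list)
    for sid, s in enumerate(sets):
        for e in s:
            index[e].append(sid)
    return index

def overlap_query(x, t, sets, index):
    """Return the ids of all sets y in the collection with |x ∩ y| > t."""
    if len(x) <= t:
        return []
    # Any |x| - t elements of x suffice; pick those with the shortest inverted lists.
    prefix = sorted(x, key=lambda e: len(index.get(e, ())))[: len(x) - t]
    candidates = {sid for e in prefix for sid in index.get(e, ())}
    # Verification step: compute the exact overlap for each candidate.
    return sorted(sid for sid in candidates if len(x & sets[sid]) > t)

sets = [{1, 2, 3, 4}, {3, 4, 5, 6}, {7, 8, 9}, {1, 2, 3, 9}]
index = build_inverted_index(sets)
result = overlap_query({1, 2, 3, 4}, 2, sets, index)
```

Correctness follows from the argument above: a set $y$ that misses all $|x|-t$ probed elements can share at most the remaining $t$ elements with $x$ .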

The main finding by Mann et al. is that while more advanced filtering techniques do yield speedups on some data sets, an optimized version of the basic prefix filtering method (referred to as “ALL”) is always competitive within a factor 2.16, and most often the fastest of the algorithms. For this reason we will be comparing our results against ALL.

Locality-sensitive hashing.

Locality-sensitive hashing (LSH) is a theoretically well-founded randomized method for generating candidate pairs [78]. A family of locality-sensitive hash functions is a distribution over functions with the property that similar points (or sets in our case) are more likely to have the same function value. We know only of a few papers using LSH techniques to solve similarity join. Cohen et al. [61] used LSH techniques for set similarity join in a knowledge discovery context before the advent of prefix filtering. They sketch a way of choosing parameters suitable for a given data set, but we are not aware of existing implementations of this approach. Chakrabarti et al. [44] improved plain LSH with an adaptive similarity estimation technique, BayesLSH, that reduces the cost of checking candidate pairs and typically improves upon an implementation of the basic prefix filtering method by $2$ – $20\times$ . Our experiments include a comparison against both methods [61, 44]. We refer to the survey paper [125] for an overview of newer theoretical developments on LSH-based similarity joins, but point out that these developments have not matured sufficiently to yield practical improvements.

Distance estimation.

Similar to BayesLSH [44] we make use of algorithms for similarity estimation, but in contrast to BayesLSH we use algorithms that make use of bit-level parallelism. This approach works when there exists a way of picking a random hash function $h$ such that

$$\Pr[h(x)=h(y)]=\operatorname{sim}(x,y) \tag{9}$$

for every choice of sets $x$ and $y$ . Broder et al. [35] presented such a hash function for Jaccard similarity, now known as “minhash” or “minwise hashing”. In the context of distance estimation, 1-bit minwise hashing of Li and König [103] maps minhash values to a compact sketch, often using just 1 or 2 machine words. Still, this is sufficient information to be able to estimate the Jaccard similarity of two sets $x$ and $y$ just based on the Hamming distance of their sketches.
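To illustrate how a Hamming distance between 1-bit sketches yields a Jaccard estimate, here is a small Python sketch. It rests on assumptions that are not from the thesis: a SplitMix64-style mixer stands in for a random hash function, each bit is the lowest bit of the minimum hash value, and the helper names are hypothetical. Two bits agree with probability $J+(1-J)/2=(1+J)/2$, which gives the estimator used below.

```python
import random

MASK64 = (1 << 64) - 1

def mix64(z):
    """SplitMix64 finalizer, an assumed stand-in for a random hash function."""
    z = (z + 0x9E3779B97F4A7C15) & MASK64
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return z ^ (z >> 31)

def sample_seeds(bits, rng):
    """One 64-bit seed per sketch bit (hypothetical helper)."""
    return [rng.getrandbits(64) for _ in range(bits)]

def one_bit_sketch(x, seeds):
    """Bit i is the lowest bit of the smallest hash value of x under seed i."""
    sketch = 0
    for i, seed in enumerate(seeds):
        smallest = min(mix64(j ^ seed) for j in x)
        sketch |= (smallest & 1) << i
    return sketch

def estimate_jaccard(sx, sy, bits):
    """Bits agree with probability (1 + J) / 2, so J is about 1 - 2 * hamming / bits."""
    hamming = bin(sx ^ sy).count("1")  # popcount of the XOR
    return 1.0 - 2.0 * hamming / bits
```

Python integers play the role of the machine words here; the XOR and popcount act on all sketch bits at once, which is the bit-level parallelism referred to above.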

Locality-sensitive mappings.

Several recent theoretical advances in high-dimensional indexing [16, 54, 56] have used an approach that can be seen as a generalization of LSH. We refer to this approach as locality-sensitive mappings (also known as locality-sensitive filters in certain settings). The idea is to construct a function $F$ , mapping a set $x$ into a set of machine words, such that:

• If $\operatorname{sim}(x,y)\geq\lambda$ then $F(x)\cap F(y)$ is nonempty with some fixed probability $\varphi>0$ .

• If $\operatorname{sim}(x,y)<\lambda$ , then the expected intersection size $\operatorname*{\mathbb{E}}[|F(x)\cap F(y)|]$ is “small”.

Here the exact meaning of “small” depends on the difference $\lambda-\operatorname{sim}(x,y)$ , but in a nutshell, if it is the case that almost all pairs have similarity significantly below $\lambda$ then we can expect $|F(x)\cap F(y)|=0$ for almost all pairs. Performing the similarity join amounts to identifying all candidate pairs $x,y$ for which $F(x)\cap F(y)\neq\varnothing$ (for example by creating an inverted index), and computing the similarity of each candidate pair. To our knowledge these indexing methods have not been tried out in practice, probably because they are rather complicated. An exception is the recent paper [56], which is relatively simple, and indeed our join algorithm is inspired by the index described in that paper.

35 Preliminaries

The CPSJoin algorithm solves the $(\lambda,\varphi)$ -similarity join problem with a probabilistic guarantee on recall, formalized as follows:

Definition 5.1.

An algorithm solves the $(\lambda,\varphi)$ -similarity join problem with threshold $\lambda\in(0,1)$ and recall probability $\varphi\in(0,1)$ if for every $(x,y)\in S\bowtie_{\lambda}R$ the output $L\subseteq S\bowtie_{\lambda}R$ of the algorithm satisfies $\Pr[(x,y)\in L]\geq\varphi$ .

It is important to note that the probability is over the random choices made by the algorithm, and not over a random choice of $(x,y)$ . This means that for any $(x,y)\in S\bowtie_{\lambda}R$ the probability that the pair is not reported in $r$ independent repetitions of the algorithm is bounded by $(1-\varphi)^{r}$ . For example if $\varphi=0.9$ it takes just $r=3$ repetitions to bound the recall to at least $99.9\%$ .
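The arithmetic behind the repetition count can be checked with a few lines of Python. The function names are hypothetical, and the small tolerance only guards against floating-point rounding.

```python
def miss_probability(phi, r):
    """Upper bound on the chance that a fixed pair is missed in r independent runs."""
    return (1.0 - phi) ** r

def repetitions_for(phi, target, tol=1e-12):
    """Smallest r with 1 - (1 - phi)^r >= target, up to the tolerance tol."""
    r = 0
    while 1.0 - miss_probability(phi, r) < target - tol:
        r += 1
    return r
```

With $\varphi=0.9$ and a target of $0.999$ the loop stops at $r=3$, since $(1-0.9)^{3}=0.001$.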

35.1 Similarity Measures

Our algorithm can be used with a broad range of similarity measures through randomized embeddings. This allows it to be used with, for example, Jaccard and cosine similarity thresholds.

Embeddings map data from one space to another while approximately preserving distances, with accuracy that can be tuned. In our case we are interested in embeddings that map data to sets of tokens. We can transform any so-called LSHable similarity measure $\operatorname{sim}$ , where we can choose $h$ to make (9) hold, into a set similarity measure by the following randomized embedding: For a parameter $t$ pick hash functions $h_{1},\dots,h_{t}$ independently from a family satisfying (9). The embedding of $x$ is the following set of size $t$ :

$$f(x)=\{(i,h_{i}(x))\;|\;i=1,\dots,t\}.$$

It follows from (9) that the expected size of the intersection $f(x)\cap f(y)$ is $t\cdot\operatorname{sim}(x,y)$ . Furthermore, it follows from standard concentration inequalities that the size of the intersection will be close to the expectation with high probability. For our experiments with Jaccard similarity thresholds $\geq 0.5$ , we found that $t=64$ gave sufficient precision for $>90\%$ recall.

In summary we can perform the similarity join $R{\;\bowtie_{\lambda}\;}S$ for any LSHable similarity measure by creating two corresponding relations $R^{\prime}=\{f(x)\;|\;x\in R\}$ and $S^{\prime}=\{f(y)\;|\;y\in S\}$ , and computing $R^{\prime}{\;\bowtie_{\lambda}\;}S^{\prime}$ with respect to the similarity measure

$$B(f(x),f(y))=|f(x)\cap f(y)|/t.$$

This measure is the special case of Braun-Blanquet similarity where the sets are known to have size $t$ [51]. Our implementation will take advantage of the set size $t$ being fixed, though it is easy to extend to general Braun-Blanquet similarity.
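As an illustration of the embedding, the following Python sketch maps sets to $t$ tokens of the form (index, minhash value) and compares the embedded sets by their normalized intersection size. It assumes a small universe `range(d)` so that each minhash function can be an explicit random permutation; the function names are hypothetical.

```python
import random

def sample_minhashes(d, t, rng):
    """t independent random permutations of range(d); rank[j] is the position of token j."""
    ranks = []
    for _ in range(t):
        rank = list(range(d))
        rng.shuffle(rank)
        ranks.append(rank)
    return ranks

def embed(x, ranks):
    """One (index, minhash) token per hash function, so the result has size t."""
    return frozenset((i, min(x, key=rank.__getitem__)) for i, rank in enumerate(ranks))

def braun_blanquet(fx, fy, t):
    """Intersection size divided by the fixed set size t."""
    return len(fx & fy) / t
```

For two sets with Jaccard similarity $0.5$, the value of `braun_blanquet` on their embeddings concentrates around $0.5$ as $t$ grows.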

The class of LSHable similarity measures is large, as discussed in [49]. If approximation errors are tolerable, even edit distance can be handled by our algorithm [45, 172].

35.2 Notation

We are interested in sets $S$ where each element $x\in S$ is a set with elements from some universe $[d]=\{1,2,3,\cdots,d\}$ . To avoid confusion we sometimes use “record” for $x\in S$ and “token” for the elements of $x$ . Throughout this paper we will think of a record $x$ both as a set of tokens from $[d]$ , as well as a vector from $\{0,1\}^{d}$ , where:

$$x_{i}=\begin{cases}1&\text{if }i\in x,\\0&\text{otherwise.}\end{cases}$$

It is clear that these representations are equivalent. The set $\{1,4,5\}$ is equivalent to $(1,0,0,1,1,0,\cdots,0)$ , $\{1,d\}$ is equivalent to $(1,0,\cdots,0,1)$ , etc.

36 Overview of approach

Our high-level approach is recursive and works as follows. To compute $R{\;\bowtie_{\lambda}\;}S$ we consider each $x\in R$ and either:

1. Compare $x$ to each record in $S$ (referred to as “brute forcing” $x$ ), or

2. create several subproblems $S_{i}{\;\bowtie_{\lambda}\;}R_{i}$ with $x\in R_{i}\subseteq R$ , $S_{i}\subseteq S$ , and solve them recursively.

The approach of [56] corresponds to choosing option 2 until reaching a certain level $k$ of the recursion, where we finish the recursion by choosing option 1. This makes sense for certain worst-case data sets, but we propose an improved parameter-free method that is better at adapting to the given data distribution. In our method the decision on which option to choose depends on the size of $S$ and the average similarity of $x$ to the records of $S$ . We choose option 1 if $S$ has size below some (constant) threshold, or if the average Braun-Blanquet similarity of $x$ and $S$ , $\tfrac{1}{|S|}\sum_{y\in S}B(x,y)$ , is close to the threshold $\lambda$ . In the former case it is cheap to finish the recursion. In the latter case many records $y\in S$ will have $B(x,y)$ larger than or close to $\lambda$ , so we do not expect to be able to produce output pairs with $x$ in sublinear time in $|S|$ .

If neither of these pruning conditions applies we choose option 2 and include $x$ in recursive subproblems as described below. But first we note that the decision of which option to use can be made efficiently for each $x$ , since the average similarity of pairs from $R\times S$ can be computed from token frequencies in time $O(t|R|+t|S|)$ . Pseudocode for a self-join version of CPSJoin is provided in Algorithms 1 and 2.

36.1 Recursion

We would like to ensure that for each pair $(x,y)\in R{\;\bowtie_{\lambda}\;}S$ the pair is computed in one of the recursive subproblems, i.e., that $(x,y)\in R_{i}{\;\bowtie_{\lambda}\;}S_{i}$ for some $i$ . In particular, we want the expected number of subproblems containing $(x,y)$ to be at least 1, i.e.,

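$$\operatorname*{\mathbb{E}}\left[\left|\{i\;|\;(x,y)\in R_{i}{\;\bowtie_{\lambda}\;}S_{i}\}\right|\right]\geq 1. \tag{11}$$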

To achieve (11) for every pair $(x,y)\in R{\;\bowtie_{\lambda}\;}S$ we proceed as follows: for each $i\in\{1,\dots,d\}$ we recurse with probability $1/(\lambda t)$ on the subproblem $R_{i}{\;\bowtie_{\lambda}\;}S_{i}$ with sets

$$R_{i}=\{x\in R\;|\;i\in x\}\quad\text{and}\quad S_{i}=\{y\in S\;|\;i\in y\},$$

where $t$ denotes the size of records in $R$ and $S$ . It is not hard to check that (11) is satisfied for every pair $(x,y)$ with $B(x,y)\geq\lambda$ . Of course, expecting one subproblem to contain $(x,y)$ does not directly imply a good probability that $(x,y)$ is contained in at least one subproblem. But it turns out that we can use results from the theory of branching processes to show such a bound; details are provided in section 37.

37 Chosen Path Similarity Join

The CPSJoin algorithm solves the $(\lambda,\varphi)$ -set similarity join (Definition 5.1) for every choice of $\lambda\in(0,1)$ and with a guarantee on $\varphi$ that we will lower bound in the analysis.

To simplify the exposition we focus on a self-join version where we are given a set $S$ of $n$ subsets of $[d]$ and we wish to report $L\subseteq S{\;\bowtie_{\lambda}\;}S$ . Handling a general join $S{\;\bowtie_{\lambda}\;}R$ follows the overview in section 36 and requires no new ideas: Essentially consider a self-join on $S\cup R$ but make sure to consider only pairs in $S\times R$ for output. We also make the simplifying assumption that all sets in $S$ have a fixed size $t$ . As argued in section 35.1 the general case can be reduced to this one by embedding.

37.1 Description

The CPSJoin algorithm (see Algorithm 1 for pseudocode) works by recursively splitting the data set on elements of $[d]$ that are selected according to a random process, forming a recursion tree with $S$ at the root and subsets of $S$ that are non-increasing in size as we get further down the tree. The randomized splitting has the property that the probability of a pair of sets $(x,y)$ being in a random subproblem is increasing as a function of $|x\cap y|$ .

Before each recursive splitting step we run the BruteForce subprocedure (see Algorithm 2 for pseudocode) that identifies subproblems that are best solved by brute force. It has two parts:

1. If $S$ is below some constant size, controlled by the parameter limit, we report $S\bowtie_{\lambda}S$ exactly using a simple loop with $O(|S|^{2})$ distance computations (BruteForcePairs) and exit the recursion. In our experiments we have set limit to $250$ , with the precise choice seemingly not having a large effect as shown experimentally in Section 39.2.
2. If $S$ is larger than limit the second part activates: for every $x\in S$ we check whether the expected number of distance computations involving $x$ is going to decrease by continuing the recursion. If this is not the case, we immediately compare $x$ against every point in $S$ (BruteForcePoint), reporting close pairs, and proceed by removing $x$ from $S$ . The BruteForce procedure is then run again on the reduced set.

This procedure where we choose to handle some points by brute force crucially separates our algorithm from many other approximate similarity join methods in the literature that typically are LSH-based [126, 61]. By efficiently being able to remove points at the “right” time, before they generate too many expensive comparisons further down the tree, we are able to beat the performance of other approximate similarity join techniques in both theory and practice. Another benefit of this approach is that it reduces the number of parameters compared to the usual LSH setting where the depth of the tree has to be selected by the user.
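The two parts of BruteForce and the splitting step can be sketched in Python as follows. This is an illustrative sketch rather than the thesis's Algorithms 1 and 2: the function names are hypothetical, the brute-force check is a single pass with incrementally updated token counts instead of a rerun after each removal, and exact similarity computations replace the sketches used in the implementation.

```python
import random
from collections import Counter

def similar(x, y, lam, t):
    """Braun-Blanquet test for two records of fixed size t."""
    return len(x & y) >= lam * t

def _cpsjoin(records, ids, lam, t, rng, out, limit, eps):
    # Part 1: small subproblems are solved exactly by comparing all pairs.
    if len(ids) <= limit:
        for a, i in enumerate(ids):
            for j in ids[a + 1:]:
                if similar(records[i], records[j], lam, t):
                    out.add((min(i, j), max(i, j)))
        return
    # Part 2: brute force points whose average similarity to the rest is too high.
    count = Counter(tok for i in ids for tok in records[i])
    alive = set(ids)
    for i in ids:
        n = len(alive)
        if n <= 1:
            break
        overlap = sum(count[tok] - 1 for tok in records[i])
        if overlap > (1 - eps) * lam * t * (n - 1):
            for j in alive:
                if j != i and similar(records[i], records[j], lam, t):
                    out.add((min(i, j), max(i, j)))
            alive.remove(i)
            for tok in records[i]:
                count[tok] -= 1
    # Splitting: recurse on each token independently with probability 1 / (lam * t).
    for tok, c in count.items():
        if c > 1 and rng.random() < 1.0 / (lam * t):
            bucket = [i for i in ids if i in alive and tok in records[i]]
            _cpsjoin(records, bucket, lam, t, rng, out, limit, eps)

def cpsjoin_run(records, lam, t, seed, limit=250, eps=0.1):
    """One randomized run; returns a set of index pairs (i, j) with i < j."""
    out = set()
    _cpsjoin(records, list(range(len(records))), lam, t, random.Random(seed), out, limit, eps)
    return out
```

Every reported pair is verified exactly, so a run never produces false positives; recall is obtained by taking the union over several runs with different seeds.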

37.2 Comparison to Chosen Path

The CPSJoin algorithm is inspired by the Chosen Path algorithm [56] for the approximate near neighbor problem and uses the same underlying random splitting tree that we will refer to as the Chosen Path Tree. In the approximate near neighbor problem, the task is to construct a data structure that takes a query point and correctly reports an approximate near neighbor, if such a point exists in the data set. Using the Chosen Path data structure directly to solve the $(\lambda,\varphi)$ -set similarity join problem has several drawbacks that we avoid in the CPSJoin algorithm. First, the Chosen Path data structure is parameterized in a non-adaptive way to provide guarantees for worst-case data, vastly increasing the amount of work done compared to the optimal parameterization when data is not worst-case. Our recursion rule avoids this and instead continuously adapts to the distribution of distances as we traverse down the tree. Secondly, the data structure uses space $O(n^{1+\rho})$ where $\rho>0$ , storing the Chosen Path Tree of size $O(n^{\rho})$ for every data point. The CPSJoin algorithm, instead of storing the whole tree, essentially performs a depth-first traversal, using only near-linear space in $n$ in addition to the space required to store the output. Finally, the Chosen Path data structure only has to report a single point that is approximately similar to a query point, and can report points with similarity $<\lambda$ . To solve the approximate similarity join problem the CPSJoin algorithm has to satisfy reporting guarantees for every pair of points $(x,y)$ in the exact join.

37.3 Analysis

The Chosen Path Tree for a set $x\subseteq[d]$ is defined by a random process: at each node, starting from the root, we sample a random hash function $r\colon[d]\to[0,1]$ and construct children for every element $j\in x$ such that $r(j)<\frac{1}{\lambda|x|}$ . Nodes at depth $k$ in the tree are identified by their path $p=(j_{1},\dots,j_{k})$ . Formally, the set of nodes at depth $k>0$ in the Chosen Path Tree for $x$ is given by

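$$F_{k}(x)=\left\{p\circ j\;\middle|\;p\in F_{k-1}(x)\wedge j\in x\wedge r_{p}(j)<\frac{1}{\lambda|x|}\right\},$$

where $r_{p}$ is the hash function sampled at the node with path $p$ , $p\circ j$ denotes vector concatenation and $F_{0}(x)=\varnothing$ . The subset of the data set $S$ that survives to a node with path $p=(j_{1},\dots,j_{k})$ is given by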

$$S_{p}=\{x\in S\;|\;p\in F_{k}(x)\}.$$

The random process underlying the Chosen Path Tree belongs to the well-studied class of Galton-Watson branching processes [87]. Originally these were devised to answer questions about the growth and decline of family names in a model of population growth assuming i.i.d. offspring for every member of the population across generations [166]. In order to make statements about the properties of the CPSJoin algorithm we study in turn the branching processes of the Chosen Path Tree associated with a point $x$ , a pair of points $(x,y)$ , and a set of points $S$ . Note that we use the same random hash functions for different points in $S$ .
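The role of the shared hash functions can be seen in a small Python sketch of the path sets. It is an illustration under assumptions: the per-node hash function is simulated by seeding Python's `random.Random` with the seed, the path and the token, the root is represented by the empty path, and the names are hypothetical.

```python
import random

def r(seed, path, j):
    """Shared pseudorandom value in [0, 1) for token j at the node with the given path."""
    return random.Random(f"{seed}|{path}|{j}").random()

def chosen_paths(x, k, lam, seed=0):
    """All paths of length k in the Chosen Path Tree of the set x."""
    level = {()}  # the root, represented by the empty path
    for _ in range(k):
        level = {p + (j,) for p in level for j in x
                 if r(seed, p, j) < 1.0 / (lam * len(x))}
    return level
```

Because the same values `r(seed, p, j)` are used for every set, two sets of equal size share exactly those paths of one set whose tokens all lie in the other set, which is why the probability of sharing a node grows with $|x\cap y|$.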

Brute forcing.

The BruteForce subprocedure described by Algorithm 2 takes two global parameters: $\mathtt{limit}\geq 1$ and $\varepsilon\geq 0$ . The parameter $\mathtt{limit}$ controls the size of $S$ below which we abandon the recursive splitting of the CPSJoin algorithm in favor of a simple exact similarity join by brute force pairwise distance computations. The second parameter, $\varepsilon$ , controls the sensitivity of the BruteForce step to the expected number of comparisons that a point $x\in S$ will generate if allowed to continue in the branching process. The larger $\varepsilon$ the more aggressively we will resort to the brute force procedure. In practice we typically think of $\varepsilon$ as a small constant, say $\varepsilon=0.05$ , but for some of our theoretical results we will need a sub-constant setting of $\varepsilon\approx 1/\log(n)$ to show certain running time guarantees. The BruteForce step removes a point $x$ from the Chosen Path branching process, instead opting to compare it against every other point $y\in S$ , if it satisfies the condition

$$\frac{1}{|S|-1}\sum_{y\in S\setminus\{x\}}\frac{|x\cap y|}{t}>(1-\varepsilon)\lambda.$$

In the pseudocode of Algorithm 2 we let count denote a hash table that keeps track of the number of times each element $j\in[d]$ appears in $S$ . This allows us to evaluate the condition in equation (37.3) for an element $x\in S$ in time $O(|x|)$ by rewriting it as

$$\frac{1}{t(|S|-1)}\sum_{j\in x}(\mathtt{count}[j]-1)>(1-\varepsilon)\lambda.$$
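The rewriting relies on the identity $\sum_{y\in S\setminus\{x\}}|x\cap y|=\sum_{j\in x}(\mathtt{count}[j]-1)$ : each token of $x$ is shared with every other record that contains it. A short Python check of this identity, with hypothetical function names and an assumed brute-force condition of the form "average similarity exceeds $(1-\varepsilon)\lambda$":

```python
from collections import Counter

def average_similarity_direct(i, S, t):
    """Average of |x & y| / t over all records y other than x = S[i]; costs O(|S| * t)."""
    x = S[i]
    total = sum(len(x & y) for k, y in enumerate(S) if k != i)
    return total / (t * (len(S) - 1))

def average_similarity_counts(x, count, n, t):
    """The same quantity from token counts; costs O(|x|) once count is available."""
    return sum(count[j] - 1 for j in x) / (t * (n - 1))

def should_brute_force(x, count, n, t, lam, eps):
    """Assumed form of the condition: average similarity above (1 - eps) * lam."""
    return average_similarity_counts(x, count, n, t) > (1 - eps) * lam
```

The table `count` is built once per node in time $O(t|S|)$ , after which each record is checked in time proportional to its size.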

We claim that this condition minimizes the expected number of comparisons performed by the algorithm: Consider a node in the Chosen Path Tree associated with a set of points $S$ while running the CPSJoin algorithm. For a point ${x\in S}$ , we can either remove it from $S$ immediately at a cost of $|S|-1$ comparisons, or we can choose to let it continue in the branching process (possibly into several nodes) and remove it later. The expected number of comparisons if we let it continue $k$ levels before removing it from every node that it is contained in, is given by

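$$\operatorname*{\mathbb{E}}\Bigl[\sum_{p\in F_{k}(x)}|S_{p}\setminus\{x\}|\Bigr]=\sum_{y\in S\setminus\{x\}}\left(\frac{1}{\lambda}\cdot\frac{|x\cap y|}{t}\right)^{k}.$$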

This expression is convex and increasing in the similarity $|x\cap y|/t$ between $x$ and other points $y\in S$ , allowing us to state the following remark:

Remark 5.1 (Recursion).

Let $\varepsilon=0$ and consider a set $S$ containing a point $x\in S$ such that $x$ satisfies the recursion condition in equation (37.3). Then the expected number of comparisons involving $x$ if we continue branching exceeds $|S|-1$ at every depth $k\geq 1$ . If $x$ does not satisfy the condition, the opposite is observed.
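The first half of the remark can be illustrated numerically. The sketch below assumes that a pair with similarity $s$ shares $(s/\lambda)^{k}$ nodes at depth $k$ in expectation, so that the expected number of comparisons for $x$ is the sum of these terms over the other points; the function name is hypothetical. At depth $k=1$ the sum equals $(|S|-1)$ times the average similarity divided by $\lambda$ , and for larger $k$ convexity (Jensen's inequality) keeps it above $|S|-1$ whenever the average similarity exceeds $\lambda$ .

```python
def expected_comparisons(sims, lam, k):
    """Sum over the other points y of (sim(x, y) / lam) ** k (assumed cost model)."""
    return sum((s / lam) ** k for s in sims)
```

For example, with similarities $0.9, 0.9, 0.1$ to the three other points and $\lambda=0.5$ , the average similarity is above $\lambda$ and the expected cost at depth $1$ is $3.8$ , compared to $3$ comparisons for removing $x$ immediately.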

Tree depth.

We proceed by bounding the maximal depth of the set of paths in the Chosen Path Tree that are explored by the CPSJoin algorithm. Having this information will allow us to bound the space usage of the algorithm and will also form part of the argument for the correctness guarantee. Assume that the parameter limit in the BruteForce step is set to some constant value, say $\mathtt{limit}=100$ . Consider a point $x\in S$ and let $S^{\prime}=\{y\in S\mid|x\cap y|/t\leq(1-\varepsilon)\lambda\}$ be the subset of points in $S$ that are not too similar to $x$ . For every $y\in S^{\prime}$ the expected number of vertices in the Chosen Path Tree at depth $k$ that contain both $x$ and $y$ is upper bounded by

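$$\left(\frac{1}{\lambda}\cdot\frac{|x\cap y|}{t}\right)^{k}\leq(1-\varepsilon)^{k}\leq e^{-\varepsilon k}.$$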

Since $|S^{\prime}|\leq n$ we use Markov’s inequality to show the following bound:

Lemma 5.1.

Let $x,y\in S$ satisfy $|x\cap y|/t\leq(1-\varepsilon)\lambda$ . Then the probability that there exists a vertex at depth $k$ in the Chosen Path Tree that contains both $x$ and $y$ is at most $e^{-\varepsilon k}$ .

If $x$ does not share any paths with points that have similarity that falls below the threshold for brute forcing, then the only points that remain are ones that will cause $x$ to be brute forced. This observation leads to the following probabilistic bound on the tree depth:

Lemma 5.2.

With high probability the maximal depth of paths explored by the CPSJoin algorithm is $O(\log(n)/\varepsilon)$ .
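One way to see where the $O(\log(n)/\varepsilon)$ bound comes from is a union bound over pairs. The following is a sketch, not the full proof: it assumes Lemma 5.1 is applied to each of the fewer than $n^{2}$ pairs, and the constant $c>0$ is a parameter introduced here for illustration.

```latex
\Pr\bigl[\exists\, x,y\in S \text{ with } |x\cap y|/t\le(1-\varepsilon)\lambda
        \text{ sharing a vertex at depth } k\bigr]
\;\le\; n^{2}e^{-\varepsilon k}
\;\le\; n^{-c}
\qquad\text{for}\qquad k\ge\frac{(c+2)\ln n}{\varepsilon}.
```

At such a depth every point in a node shares it only with points of similarity above $(1-\varepsilon)\lambda$ , so its average similarity exceeds the brute-force threshold and it is removed.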

Correctness.

Let $x$ and $y$ be two sets of equal size $t$ such that $B(x,y)=|x\cap y|/t\geq\lambda$ . We are interested in lower bounding the probability that there exists a path of length $k$ in the Chosen Path Tree that has been chosen by both $x$ and $y$ , i.e. $\Pr\left[F_{k}(x\cap y)\neq\varnothing\right]$ . Agresti [3] showed an upper bound on the probability that a branching process becomes extinct after at most $k$ steps. We use it to show the following lower bound on the probability of a close pair of points colliding at depth $k$ in the Chosen Path Tree.

Lemma 5.3 (Agresti [3]).

If $\operatorname{sim}(x,y)\geq\lambda$ then for every $k>0$ we have that $\Pr[F_{k}(x\cap y)\neq\varnothing]\geq\frac{1}{k+1}$ .

The bound on the depth of the Chosen Path Tree for $x$ explored by the CPSJoin algorithm in Lemma 5.2 then implies a lower bound on $\varphi$ .

Lemma 5.4.

Let $0<\lambda<1$ be constant. Then for every set $S$ of $|S|=n$ points the CPSJoin algorithm solves the set similarity join problem with $\varphi=\Omega(\varepsilon/\log(n))$ .

Remark 5.2.

This analysis is very conservative: if either $x$ or $y$ is removed by the BruteForce step prior to reaching the maximum depth then it only increases the probability of collision. We note that similar guarantees can be obtained when using fast pseudorandom hash functions as shown in the paper introducing the Chosen Path algorithm [56].

Space usage.

We can obtain a trivial bound on the space usage of the CPSJoin algorithm by combining Lemma 5.2 with the observation that every call to CPSJoin on the stack uses additional space at most $O(n)$ . The result is stated in terms of working space: the total space usage when not accounting for the space required to store the data set itself (our algorithms use references to data points and only read the data when performing comparisons) as well as disregarding the space used to write down the list of results.

Lemma 5.5.

With high probability the working space of the CPSJoin algorithm is at most $O(n\log(n)/\varepsilon)$ .

Remark 5.3.

We conjecture that the expected working space is $O(n)$ due to the size of $S$ being geometrically decreasing in expectation as we proceed down the Chosen Path Tree.

Running time.

We will bound the running time of a solution to the general set similarity self-join problem that uses several calls to the CPSJoin algorithm in order to piece together a list of results $L\subseteq S{\;\bowtie_{\lambda}\;}S$ . In most of the previous related work, inspired by Locality-Sensitive Hashing, the fine-grainedness of the randomized partition of space, here represented by the Chosen Path Tree in the CPSJoin algorithm, has been controlled by a single global parameter $k$ [78, 126]. In the Chosen Path setting this rule would imply that we run the splitting step without performing any brute force comparison until reaching depth $k$ where we proceed by comparing $x$ against every other point in nodes containing $x$ , reporting close pairs. In recent work by Ahle et al. [4] it was shown how to obtain additional performance improvements by setting an individual depth $k_{x}$ for every $x\in S$ . We refer to these stopping strategies as global and individual, respectively. Together with our recursion strategy, this gives rise to the following stopping criteria for when to compare a point $x$ against everything else contained in a node:

• Global: Fix a single depth $k$ for every $x\in S$ .

• Individual: For every $x\in S$ fix a depth $k_{x}$ .

• Adaptive: Remove $x$ when the expected number of comparisons is non-decreasing in the tree-depth.

Let $T$ denote the running time of our similarity join algorithm. We aim to show the following relation between the running times of the different stopping criteria when applied to the Chosen Path Tree:

[TABLE]

First consider the global strategy. We set $k$ to balance the contribution to the running time from the expected number of vertices containing a point, given by $(1/\lambda)^{k}$ , and the expected number of comparisons between pairs of points at depth $k$ , resulting in the following expected running time for the global strategy:

[TABLE]

The global strategy is a special case of the individual strategy, and it must therefore hold that $\operatorname*{\mathbb{E}}[T_{\text{Individual}}]\leq\operatorname*{\mathbb{E}}[T_{\text{Global}}]$ . The expected running time for the individual strategy is upper bounded by:

[TABLE]

We will now argue that the expected running time of the CPSJoin algorithm under the adaptive stopping criterion is no more than a constant factor greater than $\operatorname*{\mathbb{E}}[T_{\text{Individual}}]$ when we set the global parameters of the BruteForce subroutine as follows:

[TABLE]

Let $x\in S$ and consider a path $p$ where $x$ is removed from $S_{p}$ by the BruteForce step. Let $k_{x}^{\prime}$ denote the depth of the node (length of $p$ ) at which $x$ is removed. Compared to the individual strategy that removes $x$ at depth $k_{x}$ we are in one of three cases, also displayed in Figure 11.

1. The point $x$ is removed from $p$ at depth $k_{x}^{\prime}=k_{x}$ .

2. The point $x$ is removed from $p$ at depth $k_{x}^{\prime}<k_{x}$ .

3. The point $x$ is removed from $p$ at depth $k_{x}^{\prime}>k_{x}$ .

The underlying random process behind the Chosen Path Tree is not affected by our choice of termination strategy. In the first case we therefore have that the expected running time is upper bounded by the same (conservative) expression as the one used by the individual strategy. In the second case we remove $x$ earlier than we would have under the individual strategy. For every $x\in S$ we have that $k_{x}\leq 1/\varepsilon$ since for larger values of $k_{x}$ the expected number of nodes containing $x$ exceeds $n$ . We therefore have that $k_{x}-k_{x}^{\prime}\leq 1/\varepsilon$ . Let $S^{\prime}$ denote the set of points in the node where $x$ was removed by the BruteForce subprocedure. There are two rules that could have triggered the removal of $x$ : Either $|S^{\prime}|=O(1)$ or the condition in equation (37.3) was satisfied. In the first case, the expected cost of following the individual strategy would have been $\Omega(1)$ simply from the $1/\lambda$ children containing $x$ in the next step. This is no more than a constant factor smaller than the adaptive strategy. In the second case, when the condition in equation (37.3) is activated we have that the expected number of comparisons involving $x$ resulting from $S^{\prime}$ if we had continued under the individual strategy is at least

[TABLE]

which is no better than what we get with the adaptive strategy. In the third case where we terminate at depth $k_{x}^{\prime}>k_{x}$ , if we retrace the path to depth $k_{x}$ we know that $x$ was not removed in this node, implying that the expected number of comparisons when continuing the branching process on $x$ is decreasing compared to removing $x$ at depth $k_{x}$ . We have shown that the expected running time of the adaptive strategy is no greater than a constant times the expected running time of the individual strategy.

We are now ready to state our main theoretical contribution, stated below as Theorem 5.1. The theorem combines the above argument that compares the adaptive strategy against the individual strategy together with Lemma 5.2 and Lemma 5.4, and uses $O(\log^{2}n)$ runs of the CPSJoin algorithm to solve the set similarity join problem for every choice of constant parameters $\lambda,\varphi$ .

Theorem 5.1.

For every LSHable similarity measure and every choice of constant threshold $\lambda\in(0,1)$ and probability of recall $\varphi\in(0,1)$ we can solve the $(\lambda,\varphi)$ -set similarity join problem on every set $S$ of $n$ points using working space $\tilde{O}(n)$ and with expected running time

[TABLE]

38 Implementation

We implement an optimized version of the CPSJoin algorithm for solving the Jaccard similarity self-join problem. In our experiments (described in Section 39) we compare the CPSJoin algorithm against the approximate methods of MinHash LSH [78, 35] and BayesLSH [44], as well as the AllPairs [23] exact similarity join algorithm. The code for our experiments is written in C++ and uses the benchmarking framework and data sets of the recent experimental survey on exact similarity join algorithms by Mann et al. [107]. For our implementation we assume that each set $x$ is represented as a list of 32-bit unsigned integers. We proceed by describing the details of each implementation in turn.

38.1 Chosen Path Similarity Join

The implementation of the CPSJoin algorithm follows the structure of the pseudocode in Algorithm 1 and Algorithm 2, but makes use of a few heuristics, primarily sampling and sketching, in order to speed things up. The parameter setting is discussed and investigated experimentally in section 39.2.

Preprocessing.

Before running the algorithm we use the embedding described in section 35.1. Specifically $t$ independent MinHash functions $h_{1},\dots,h_{t}$ are used to map each set $x\in S$ to a list of $t$ hash values $(h_{1}(x),\dots,h_{t}(x))$ . The MinHash function is implemented using Zobrist hashing [173] from 32 bits to 64 bits with 8-bit characters. We sample a MinHash function $h$ by sampling a random Zobrist hash function $g$ and let $h(x)=\arg\!\min_{j\in x}g(j)$ . Zobrist hashing (also known as simple tabulation hashing) has been shown theoretically to have strong MinHash properties and is very fast in practice [134, 159]. We set $t=128$ in our experiments, see discussion later.

During preprocessing we also prepare sketches using the 1-bit minwise hashing scheme of Li and König [103]. Let $\ell$ denote the length in 64-bit words of a sketch $\hat{x}$ of a set $x\in S$ . We construct sketches for a data set $S$ by independently sampling $64\times\ell$ MinHash functions $h_{i}$ and Zobrist hash functions $g_{i}$ that map from 32 bits to 1 bit. The $i$ th bit of the sketch $\hat{x}$ is then given by $g_{i}(h_{i}(x))$ . In the experiments we set $\ell=8$ .

Similarity estimation using sketches.

We use 1-bit minwise hashing sketches for fast similarity estimation in the BruteForcePairs and BruteForcePoint subroutines of the BruteForce step of the CPSJoin algorithm. Given two sketches, $\hat{x}$ and $\hat{y}$ , we compute the number of bits in which they differ by going through the sketches word for word, computing the popcount of their XOR using the _mm_popcnt_u64 intrinsic, which gcc translates into a single instruction on modern hardware. Let $\hat{J}(x,y)$ denote the estimated similarity of a pair of sets $(x,y)$ . If $\hat{J}(x,y)$ is below a threshold $\hat{\lambda}\approx\lambda$ , we exclude the pair from further consideration. Otherwise we compute the exact similarity and report the pair if $J(x,y)\geq\lambda$ .

The speedup from using sketches comes at the cost of introducing false negatives: A pair of sets $(x,y)$ with $J(x,y)\geq\lambda$ may have an estimated similarity less than $\hat{\lambda}$ , causing us to miss it. We let $\delta$ denote a parameter for controlling the false negative probability of our sketches and set $\hat{\lambda}$ such that for sets $(x,y)$ with $J(x,y)\geq\lambda$ we have that $\Pr[\hat{J}(x,y)<\hat{\lambda}]<\delta$ . In our experiments we set the sketch false negative probability to be $\delta=0.05$ .
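One possible way to pick $\hat{\lambda}$ for a given $\delta$ is sketched below in Python. The thesis does not spell out the procedure, so this is an assumption: the sketch bits are treated as independent, two bits agree with probability $(1+J)/2$ , and the worst case $J=\lambda$ is used. The function names are hypothetical.

```python
from math import comb

def prob_fewer_matches(m, bits, p):
    """P[X < m] for X ~ Binomial(bits, p), the number of agreeing sketch bits."""
    return sum(comb(bits, i) * p**i * (1 - p)**(bits - i) for i in range(m))

def sketch_threshold(lam, bits, delta):
    """Largest m with P[X < m] < delta when J = lam, and the matching
    similarity threshold 2 * m / bits - 1 (assumed selection rule)."""
    p = (1 + lam) / 2
    m, cdf = 0, 0.0
    while m < bits:
        cdf += comb(bits, m) * p**m * (1 - p)**(bits - m)  # cdf is now P[X < m + 1]
        if cdf >= delta:
            break
        m += 1
    return m, 2 * m / bits - 1
```

A pair is then kept whenever at least `m` sketch bits agree. Since the number of agreeing bits is stochastically increasing in $J$ , the bound on the false negative probability carries over to every pair with $J(x,y)\geq\lambda$ .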

Splitting step.

In the recursive step of the CPSJoin algorithm (Algorithm 1) the set $S$ is split into buckets $S_{j}$ using the following heuristic: Instead of sampling a random hash function and evaluating it on each element $j\in x$ , we sample an expected $1/\lambda$ elements from $[t]$ and split $S$ according to the corresponding minhash values from the preprocessing step. This saves the linear overhead in the size of our sets $t$ , reducing the time spent placing each set into buckets to $O(1)$ . Internally, a collection of sets $S$ is represented as a C++ std::vector<uint32_t> of set ids. The collection of buckets $S_{j}$ is implemented using Google’s dense_hash hash map implementation from the sparse_hash package [81].
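The splitting heuristic can be sketched in Python as follows. This is a simplified illustration, not the C++ implementation: `signatures[sid]` is assumed to hold the $t$ precomputed minhash values of set `sid`, each coordinate is selected independently with probability $1/(\lambda t)$ , and plain dictionaries replace the hash map used in the experiments. The function name is hypothetical.

```python
import random
from collections import defaultdict

def split(ids, signatures, lam, rng=None):
    """Return buckets as (coordinate, minhash value, member ids) triples."""
    rng = rng or random.Random()
    t = len(signatures[ids[0]])
    buckets = []
    for i in range(t):
        if rng.random() < 1.0 / (lam * t):  # expected 1 / lam selected coordinates
            by_value = defaultdict(list)
            for sid in ids:
                by_value[signatures[sid][i]].append(sid)  # one lookup per set
            buckets.extend((i, v, members) for v, members in by_value.items())
    return buckets
```

Each set is placed into one bucket per selected coordinate, so the expected work per set is proportional to $1/\lambda$ and independent of $t$ .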

BruteForce step.

Having reduced the overhead for each set $x\in S$ to $O(1)$ in the splitting step, we wish to do the same for the BruteForce step (described in Algorithm 2), at least in the case where we do not call the BruteForcePairs or BruteForcePoint subroutines. The main problem is that we spend time $O(t)$ for each set when constructing the count hash map and estimating the average similarity of $x$ to sets in $S\setminus\{x\}$ . To get around this we construct a 1-bit minwise hashing sketch $\hat{s}$ of length $64\times\ell$ for the set $S$ using sampling and our precomputed 1-bit minwise hashing sketches. The sketch $\hat{s}$ is constructed as follows: Randomly sample $64\times\ell$ elements of $S$ and set the $i$ th bit of $\hat{s}$ to be the $i$ th bit of the $i$ th sample from $S$ . This allows us to estimate the average similarity of a set $x$ to sets in $S$ in time $O(\ell)$ using word-level parallelism. A set $x$ is removed from $S$ if its estimated average similarity is greater than $(1-\varepsilon)\lambda$ . To further speed up the running time we only call the BruteForce subroutine once for each call to CPSJoin, calling BruteForcePoint on all points that pass the check rather than recomputing $\hat{s}$ each time a point is removed. Pairs of sets that pass the sketching check are verified using the same verification procedure as the AllPairs implementation by Mann et al. [107]. In our experiments we set the parameter $\varepsilon=0.1$ . Duplicates are removed by sorting and performing a single linear scan.

Repetitions.

In theory, for any constant desired recall $\varphi\in(0,1)$ , $O(\log^{2}n)$ independent repetitions of the CPSJoin algorithm suffice. In practice this number of repetitions is prohibitively large and we therefore fix the number of independent repetitions used in our experiments at ten. With this setting we were able to achieve more than $90\%$ recall across all datasets and similarity thresholds.

38.2 MinHash LSH

We implement a locality-sensitive hashing similarity join using MinHash according to the pseudocode in Algorithm 3. A single run of the MinHash algorithm can be divided into two steps: First we split the sets into buckets according to the hash values of $k$ concatenated MinHash functions $h(x)=(h_{1}(x),\dots,h_{k}(x))$ . Next we iterate over all non-empty buckets and run BruteForcePairs to report all pairs of points with similarity above the threshold $\lambda$ . The BruteForcePairs subroutine is shared between the MinHash and CPSJoin implementation. MinHash therefore uses 1-bit minwise sketches for similarity estimation in the same way as in the implementation of the CPSJoin algorithm described above.
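A single repetition of the two steps can be sketched as below. This is an illustrative simplification rather than the implementation used in the experiments: each MinHash function is modelled as an explicit random permutation of the token universe, candidate pairs are verified with an exact Jaccard computation instead of 1-bit minwise sketches, and all function names are assumptions. Sets are assumed to be non-empty.

```python
import random
from collections import defaultdict
from itertools import combinations

def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)

def minhash_lsh_repetition(sets, k, lam, rng):
    """One repetition: bucket by k concatenated MinHash values, then brute force."""
    universe = sorted(set().union(*sets))
    # Model each MinHash function as a random ranking of the universe.
    rankings = []
    for _ in range(k):
        perm = universe[:]
        rng.shuffle(perm)
        rankings.append({token: rank for rank, token in enumerate(perm)})
    # Step 1: split the sets into buckets keyed by (h_1(x), ..., h_k(x)).
    buckets = defaultdict(list)
    for idx, s in enumerate(sets):
        key = tuple(min(r[token] for token in s) for r in rankings)
        buckets[key].append(idx)
    # Step 2: brute-force all pairs inside each non-empty bucket.
    results = set()
    for members in buckets.values():
        for i, j in combinations(members, 2):
            if jaccard(sets[i], sets[j]) >= lam:
                results.add((i, j))
    return results
```

Two sets land in the same bucket only if all $k$ MinHash values agree, which happens with probability $J(x,y)^{k}$.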

The parameter $k$ can be set for each dataset and similarity threshold $\lambda$ to minimize the combined cost of lookups and similarity estimations performed by the algorithm. This approach was mentioned by Cohen et al. [61] but we were unable to find an existing implementation. In practice we set $k$ to the value that results in the minimum estimated running time when running the first part (splitting step) of the algorithm for values of $k$ in the range $\{2,3,\dots,10\}$ and estimating the running time by looking at the number of buckets and their sizes. Once $k$ is fixed we know that each repetition of the algorithm has probability at least $\lambda^{k}$ of reporting a pair $(x,y)$ with $J(x,y)\geq\lambda$ . For a desired recall $\varphi$ we can therefore set $L=\lceil\ln(1/(1-\varphi))/\lambda^{k}\rceil$ . In our experiments we report the actual number of repetitions required to obtain a desired recall rather than using the setting of $L$ required for worst-case guarantees.
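The worst-case setting of $L$ follows from bounding the probability that all $L$ repetitions miss a pair: $(1-\lambda^{k})^{L}\leq e^{-\lambda^{k}L}\leq 1-\varphi$ whenever $L\geq\ln(1/(1-\varphi))/\lambda^{k}$ . A small helper, with a hypothetical name, makes the magnitude concrete:

```python
import math

def repetitions_for_recall(lam, k, phi):
    """Worst-case number of repetitions L = ceil(ln(1 / (1 - phi)) / lam**k)."""
    return math.ceil(math.log(1.0 / (1.0 - phi)) / lam ** k)

# Example: threshold 0.5, k = 3 concatenated hash functions, 90% recall.
L = repetitions_for_recall(0.5, 3, 0.9)
miss_probability = (1 - 0.5 ** 3) ** L  # probability that every repetition misses
```

For $\lambda=0.5$ , $k=3$ and $\varphi=0.9$ this gives $L=19$ , and the probability of missing a pair at exactly the threshold is then below $0.1$ .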

38.3 AllPairs

To compare our approximate methods against a state-of-the-art exact similarity join we use Bayardo et al.’s AllPairs algorithm [23] as recently implemented in the set similarity join study by Mann et al. [107]. The study by Mann et al. compares implementations of several different exact similarity join methods and finds that the simple AllPairs algorithm is most often the fastest choice. Furthermore, for Jaccard similarity, the AllPairs algorithm was at most $2.16$ times slower than the best out of six different competing algorithms across all the data sets and similarity thresholds used, and for most runs AllPairs is at most $11\%$ slower than the best exact algorithm (see Table 7 in Mann et al. [107]). Since our experiments run in the same framework, use the same datasets, and use the same thresholds as Mann et al.’s study, we consider their AllPairs implementation to be a good representative of exact similarity join methods for Jaccard similarity.

38.4 BayesLSH

For a comparison against previous experimental work on approximate similarity joins we use an implementation of BayesLSH in C as provided by the BayesLSH authors [44, 43]. The BayesLSH package features a choice between AllPairs and LSH as candidate generation method. For the verification step there is a choice between BayesLSH and BayesLSH-lite. Both verification methods use sketching to estimate similarities between candidate pairs. The difference between BayesLSH and BayesLSH-lite is that the former uses sketching to estimate the similarity of pairs that pass the sketching check, whereas the latter uses an exact similarity computation if a pair passes the sketching check. Since the approximate methods in our CPSJoin and MinHash implementations correspond to the approach of BayesLSH-lite we restrict our experiments to this choice of verification algorithm. In our experiments we will use BayesLSH to represent the fastest of the two candidate generation methods, combined with BayesLSH-lite for the verification step.

39 Experiments

We run experiments using the implementations of CPSJoin, MinHash, BayesLSH, and AllPairs described in the previous section. In the experiments we perform self-joins under Jaccard similarity for similarity thresholds $\lambda\in\{0.5,0.6,0.7,0.8,0.9\}$ . We are primarily interested in measuring the join time of the algorithms, but we also look at the number of candidate pairs $(x,y)$ considered by the algorithms during the join as a measure of performance. Note that the preprocessing step of the approximate methods only has to be performed once for each set and similarity measure and can be re-used for different similarity joins; we therefore do not count it towards our reported join times. In practice the preprocessing time is at most a few minutes for the largest data sets.

Data sets.

The performance is measured across $10$ real world data sets along with $4$ synthetic data sets described in Table 3. All datasets except for the TOKENS datasets were provided by the authors of [107], where descriptions and sources for each data set can also be found. Note that we have excluded a synthetic ZIPF dataset used in the study by Mann et al. [107] because it has no results for our similarity thresholds of interest. Experiments are run on versions of the datasets where duplicate records are removed and any records containing only a single token are ignored.

In addition to the datasets from the study of Mann et al. we add three synthetic datasets TOKENS10K, TOKENS15K, and TOKENS20K, designed to showcase the robustness of the approximate methods. These datasets have relatively few unique tokens, but each token appears in many sets. Each of the TOKENS datasets was generated from a universe of $1000$ tokens ( $d=1000$ ) and each token is contained in, respectively, $10,000$ , $15,000$ , and $20,000$ different sets as denoted by the name. The sets in the TOKENS datasets were generated by sampling a random subset of the set of possible tokens, rejecting tokens that had already been used in more than the maximum number of sets ( $10,000$ for TOKENS10K). To sample sets with expected Jaccard similarity $\lambda^{\prime}$ the size of our sampled sets should be set to $(2\lambda^{\prime}/(1+\lambda^{\prime}))d$ . For $\lambda^{\prime}\in\{0.95,0.85,0.75,0.65,0.55\}$ the TOKENS datasets each have $100$ random sets planted with expected Jaccard similarity $\lambda^{\prime}$ . This ensures an increasing number of results for our experiments where we use thresholds $\lambda\in\{0.5,0.6,0.7,0.8,0.9\}$ . The remaining sets have expected Jaccard similarity $0.2$ . We believe that the TOKENS datasets give a good indication of the performance on real-world data that has the property that most tokens appear in a large number of sets.
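The set size $(2\lambda^{\prime}/(1+\lambda^{\prime}))d$ can be checked with a short calculation. Assuming two sets are independent uniform random subsets of size $pd$ , the expected intersection is $p^{2}d$ and the expected union is $(2p-p^{2})d$ , so the ratio of expectations is $p/(2-p)$ ; solving $p/(2-p)=\lambda^{\prime}$ gives $p=2\lambda^{\prime}/(1+\lambda^{\prime})$ . The helper names below are hypothetical, and the ratio of expectations is used as an approximation to the expected Jaccard similarity.

```python
def tokens_set_size(lam_prime, d):
    """Set size giving (approximately) expected Jaccard similarity lam_prime."""
    return 2.0 * lam_prime / (1.0 + lam_prime) * d

def approx_expected_jaccard(size, d):
    """Ratio of expected intersection to expected union for two random subsets."""
    p = size / d
    return p / (2.0 - p)

# Background sets (similarity 0.2) and the most similar planted sets (0.95), d = 1000.
background_size = tokens_set_size(0.2, 1000)
planted_size = tokens_set_size(0.95, 1000)
```

With $d=1000$ the background sets contain roughly $333$ tokens each, while sets planted at $\lambda^{\prime}=0.95$ contain roughly $974$ tokens.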

Recall.

In our experiments we aim for a recall of at least $90\%$ for the approximate methods. In order to achieve this for the CPSJoin and MinHash algorithms we perform a number of repetitions after the preprocessing step, stopping when the desired recall has been achieved. This is done by measuring the recall against the recall of AllPairs and stopping when reaching $90\%$ . In practice this approach is not feasible as the size of the true result set is not known. However, it can be efficiently estimated using sampling if it is not too small. Another approach is to perform the number of repetitions required to obtain the theoretical guarantees on recall as described for CPSJoin in Section 37.3 and for MinHash in Section 38.2. Unfortunately, with our current analysis of the CPSJoin algorithm the number of repetitions required to guarantee theoretically a recall of $90\%$ far exceeds the number required in practice as observed in our experiments where ten independent repetitions always suffice. For BayesLSH using LSH as the candidate generation method, the recall probability with the default parameter setting is $95\%$ , although we experience a recall closer to $90\%$ in our experiments.
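The stopping rule used in the experiments can be sketched as a simple loop. This is an illustration under assumptions: `repetition` is a hypothetical callable that runs one independent repetition of an approximate join and returns the set of reported pairs, and `exact_pairs` is the result set of the exact join, which is only available in an experimental setting.

```python
def run_until_recall(repetition, exact_pairs, target=0.9, max_repetitions=1000):
    """Repeat an approximate join until the target recall is reached.

    Returns the number of repetitions performed and the recall achieved.
    """
    found = set()
    recall = 1.0 if not exact_pairs else 0.0
    for r in range(1, max_repetitions + 1):
        # Accumulate the true results reported so far.
        found |= repetition() & exact_pairs
        if exact_pairs:
            recall = len(found) / len(exact_pairs)
        if recall >= target:
            return r, recall
    return max_repetitions, recall
```

Outside an experimental setting the size of the true result set would have to be estimated, for example by sampling as noted above, or the number of repetitions would have to be fixed in advance.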

Hardware.

All experiments were run on an Intel Xeon E5-2690v4 CPU at 2.60GHz with $35$ MB L3, $256$ kB L2 and $32$ kB L1 cache and $512$ GB of RAM. Since a single experiment is always confined to a single CPU core we ran several experiments in parallel [155] to better utilize our hardware.

39.1 Results

Join time.

Table 39.1 shows the average join time in seconds over five independent runs, when approximate methods are required to have at least $90\%$ recall. We have omitted timings for BayesLSH since it was always slower than all other methods, and in most cases it timed out after 20 minutes when using LSH as candidate generation method. The join time for MinHash is always greater than the corresponding join time for CPSJoin except in a single setting: the dataset KOSARAK with threshold $\lambda=0.5$ . Since CPSJoin is typically $2$ to $4\times$ faster than MinHash we can restrict our attention to comparing AllPairs and CPSJoin where the picture becomes more interesting.

Bibliography


[1] ACM. ACM paris kanellakis theory and practice award. https://awards.acm.org/award_winners/charikar_0308379 , 2012. [Online; accessed 26-April-2018].
[2] C. Aggarwal, D. A. Keim, and A. Hinneburg. On the surprising behaviour of distance metrics in high dimensional space. In Proc. ICDT ’01 , pages 420–434, 2001.
[3] A. Agresti. Bounds on the extinction time distribution of a branching process. Advances in Applied Probability , 6(2):322–335, 1974.
[4] T. D. Ahle, M. Aumüller, and R. Pagh. Parameter-free locality sensitive hashing for spherical range reporting. In Proc. SODA ’17 , pages 239–256, 2017.
[5] T. D. Ahle, R. Pagh, I. P. Razenshteyn, and F. Silvestri. On the complexity of inner product similarity join. In Proc. PODS ’16 , pages 151–164, 2016.
[6] J. Alman and R. Williams. Probabilistic polynomials and hamming nearest neighbors. In Proc. FOCS ’15 , pages 136–150, 2015.
[7] N. Alon and N. Asaf. k-wise independent random graphs. In Proc. FOCS ’08 , pages 813–822, 2008.
[8] N. Alon, O. Goldreich, J. Håstad, and R. Peralta. Simple constructions of almost k-wise independent random variables. Random Structures & Algorithms , 3(3):289–304, 1992.

