Algorithms for Similarity Search and Pseudorandomness
Tobias Christiani

TL;DR
This paper advances algorithms for approximate near neighbor search and pseudorandom number generation, providing new frameworks, bounds, and practical algorithms with improved efficiency and theoretical guarantees.
Contribution
It introduces new frameworks and bounds for ANN search using locality-sensitive hashing and develops high-quality pseudorandom number generators with optimal or near-optimal resource usage.
Findings
Reduced evaluations and complexity in ANN algorithms.
Established tight bounds for space-time tradeoffs in ANN.
Developed high-quality pseudorandom number generators with constant time.
| Dataset | # sets / | avg. set size | sets / tokens |
| AOL | |||
| BMS-POS | |||
| DBLP | |||
| ENRON | |||
| FLICKR | |||
| LIVEJ | |||
| KOSARAK | |||
| NETFLIX | |||
| ORKUT | |||
| SPOTIFY | |||
| UNIFORM | |||
| TOKENS10K | |||
| TOKENS15K | |||
| TOKENS20K |
| Parameter | Description | Test | Final |
| limit | Brute force limit | ||
| Sketch word length | |||
| Size of MinHash set | |||
| Brute force aggressiveness | |||
| Sketch false negative prob. |
| Dataset | Threshold | Threshold | ||
| ALL | CP | ALL | CP | |
| 8.5E+09 | 7.4E+09 | 6.2E+08 | 2.9E+09 | |
| AOL | 8.5E+09 | 1.4E+09 | 6.2E+08 | 3.1E+07 |
| 1.3E+08 | 1.2E+08 | 1.6E+06 | 1.5E+06 | |
| 2.0E+09 | 9.2E+08 | 2.7E+08 | 3.3E+08 | |
| BMS-POS | 1.8E+09 | 1.7E+08 | 2.6E+08 | 4.9E+06 |
| 1.1E+07 | 1.0E+07 | 2.0E+05 | 1.8E+05 | |
| 6.6E+09 | 4.6E+08 | 1.2E+09 | 1.3E+08 | |
| DBLP | 1.9E+09 | 4.6E+07 | 7.2E+08 | 4.3E+05 |
| 1.7E+06 | 1.6E+06 | 9.1E+03 | 8.5E+03 | |
| 2.8E+09 | 3.7E+08 | 2.0E+08 | 1.5E+08 | |
| ENRON | 1.8E+09 | 6.7E+07 | 1.3E+08 | 2.1E+07 |
| 3.1E+06 | 2.9E+06 | 1.2E+06 | 1.2E+06 | |
| 5.7E+08 | 2.1E+09 | 9.3E+07 | 9.0E+08 | |
| FLICKR | 4.1E+08 | 1.1E+09 | 6.3E+07 | 3.8E+08 |
| 6.6E+07 | 6.1E+07 | 2.5E+07 | 2.3E+07 | |
| 2.6E+09 | 4.7E+09 | 7.4E+07 | 4.2E+08 | |
| KOSARAK | 2.5E+09 | 2.1E+09 | 6.8E+07 | 2.1E+07 |
| 2.3E+08 | 2.1E+08 | 4.4E+05 | 4.1E+05 | |
| 9.0E+09 | 2.8E+09 | 5.8E+08 | 1.2E+09 | |
| LIVEJ | 8.3E+09 | 3.6E+08 | 5.6E+08 | 1.8E+07 |
| 2.4E+07 | 2.2E+07 | 8.1E+05 | 7.6E+05 | |
| 8.6E+10 | 1.3E+09 | 1.0E+10 | 4.3E+08 | |
| NETFLIX | 1.3E+10 | 3.1E+07 | 3.4E+09 | 6.4E+05 |
| 1.0E+06 | 9.5E+05 | 2.4E+04 | 2.2E+04 | |
| 5.1E+09 | 1.1E+09 | 3.0E+08 | 7.2E+08 | |
| ORKUT | 3.9E+09 | 1.3E+06 | 2.6E+08 | 8.1E+04 |
| 9.0E+04 | 8.4E+04 | 5.6E+03 | 5.3E+03 | |
| 5.0E+06 | 1.2E+08 | 4.7E+05 | 8.5E+07 | |
| SPOTIFY | 4.8E+06 | 3.1E+05 | 4.6E+05 | 2.7E+03 |
| 2.0E+04 | 1.8E+04 | 2.0E+02 | 1.9E+02 | |
| 1.5E+10 | 1.7E+08 | 8.1E+09 | 4.9E+07 | |
| TOKENS10K | 4.1E+08 | 5.7E+06 | 4.1E+08 | 1.9E+06 |
| 1.3E+05 | 1.3E+05 | 7.4E+04 | 6.9E+04 | |
| 3.6E+10 | 3.0E+08 | 1.9E+10 | 8.1E+07 | |
| TOKENS15K | 9.6E+08 | 7.2E+06 | 9.6E+08 | 1.9E+06 |
| 1.4E+05 | 1.3E+05 | 7.5E+04 | 6.9E+04 | |
| 6.4E+10 | 4.4E+08 | 3.4E+10 | 1.0E+08 | |
| TOKENS20K | 1.7E+09 | 8.8E+06 | 1.7E+09 | 1.9E+06 |
| 1.4E+05 | 1.4E+05 | 7.9E+04 | 7.4E+04 | |
| 2.5E+09 | 3.7E+08 | 6.5E+08 | 1.3E+08 | |
| UNIFORM005 | 2.0E+09 | 9.5E+06 | 6.1E+08 | 3.9E+04 |
| 2.6E+05 | 2.4E+05 | 1.4E+03 | 1.3E+03 | |
| Horner | FFT | FFT + Expander | |||||
| ns | ns | ns | |||||
| 177 | 243 | 64 | 8 | 15 | |||
| 361 | 294 | 64 | 8 | 16 | |||
| 730 | 338 | 64 | 8 | 19 | |||
| 1470 | 375 | 64 | 8 | 23 | |||
| 2950 | 412 | 64 | 8 | 24 | |||
| 5902 | 449 | 64 | 8 | 25 | |||
| 11808 | 487 | 32 | 8 | 35 | |||
| 23627 | 523 | 64 | 16 | 43 | |||
| 47183 | 561 | 32 | 16 | 54 | |||
| 94429 | 599 | 64 | 8 | 68 | |||
| 188258 | 638 | 64 | 8 | 69 | |||
| 376143 | 678 | 64 | 8 | 77 | |||
| 751781 | 719 | 64 | 8 | 85 | |||
| 1505016 | 765 | 64 | 8 | 93 | |||
| 3015969 | 808 | 32 | 8 | 110 | |||
| 6082313 | 864 | 64 | 16 | 175 | |||
Algorithms for Similarity Search and Pseudorandomness
Tobias Christiani
Supervisor: Rasmus Pagh
May 2018
Abstract
We study the problem of approximate near neighbor (ANN) search and show the following results:
- An improved framework for solving the ANN problem using locality-sensitive hashing, reducing the number of evaluations of locality-sensitive hash functions and the word-RAM complexity compared to the standard framework.
- A framework for solving the ANN problem with space-time tradeoffs as well as tight upper and lower bounds for the space-time tradeoff of framework solutions to the ANN problem under cosine similarity.
- A novel approach to solving the ANN problem on sets along with a matching lower bound, improving the state of the art. A self-tuning version of the algorithm is shown through experiments to outperform existing similarity join algorithms.
- Tight lower bounds for asymmetric locality-sensitive hashing which has applications to the approximate furthest neighbor problem, orthogonal vector search, and annulus queries.
- A proof of the optimality of a well-known Boolean locality-sensitive hashing scheme.
We study the problem of efficient algorithms for producing high-quality pseudorandom numbers and obtain the following results:
- A deterministic algorithm for generating pseudorandom numbers of arbitrarily high quality in constant time using near-optimal space.
- A randomized construction of a family of hash functions that outputs pseudorandom numbers of arbitrarily high quality with space usage and running time nearly matching known cell-probe lower bounds.
Resumé
We study a fundamental problem in approximate search, the approximate near neighbor (ANN) problem, and show the following results:
- An improved general solution to the ANN problem that reduces the number of evaluations of locality-sensitive hash functions.
- A general solution to the ANN problem that allows space-time tradeoffs, together with tight upper and lower bounds for the ANN problem with space-time tradeoffs under cosine similarity.
- A new approach to solving the ANN problem on sets together with a matching lower bound. An adaptive version of the algorithm for approximate similarity join is shown through experiments to be competitive.
- Tight lower bounds for asymmetric locality-sensitive hashing, which has applications to approximate search for furthest neighbors, orthogonal vectors, and annulus queries.
- A proof of optimality for a well-known family of Boolean locality-sensitive hash functions.
We study the problem of finding efficient algorithms for producing high-quality pseudorandomness and obtain the following results:
- A deterministic algorithm for generating pseudorandom numbers of arbitrarily high quality in constant time and with near-optimal space usage.
- A randomized construction of a family of hash functions that map to pseudorandom numbers of arbitrarily high quality, with evaluation time and space usage close to the lower bound.
Acknowledgements.
I am very grateful to Rasmus Pagh for advising me for almost five years. Most of my favorite results have come through my collaboration with Rasmus where his overview and technical strength complement my intuition. I always feel that Rasmus is ready to listen to my ideas, and to encourage me and guide my research in the right direction. I cannot imagine a better advisor. I would like to thank my colleagues in the 4B corridor at ITU for creating a friendly academic environment where I enjoy spending my time. In particular I would like to thank my current and former office mates Johan Sivertsen, Matteo Dusefante, Thomas Ahle, Martin Aumüller, Morten Stöckel, and Ninh Pham. Thore Husfeldt also deserves special thanks for his stimulating lunch discussions on issues ranging from superintelligence to immigration policy. I would like to thank Greg Valiant for hosting my stay at Stanford in the fall of 2015 and Michael Mitzenmacher for hosting my stay at Harvard in the fall of 2017. Specifically I would like to thank Josh Alman, Michael Kim, Zhao Song, and Aviad Rubinstein for making my time spent abroad a pleasant and social experience. Finally I want to thank my parents Tom and Kirsten for supporting me, my brother Anders for tolerating living with a somewhat disorganized PhD student, and my girlfriend Elisabeth for listening to me and helping me through difficult times.
All that is gold does not glitter,
Not all those who wander are lost;
The old that is strong does not wither,
Deep roots are not reached by the frost.

From the ashes, a fire shall be woken,
A light from the shadows shall spring;
Renewed shall be blade that was broken,
The crownless again shall be king.
J. R. R. Tolkien (1892–1973)
Chapter 1 Introduction
1 Part I: Similarity search
Similarity search in large collections of high-dimensional objects is a problem that is well-motivated by numerous applications. Consider for example the representation of an image by a $d$-dimensional feature vector $x$, where each entry $x_i$ denotes the fraction of pixels of color $i$ in the image. Given a collection of images $P$ and a query image $q$, we could for example be interested in finding the nearest neighbor of $q$: the image $x \in P$ such that the distance $\mathrm{dist}(q, x)$ is minimized, for some appropriate choice of distance function. Applications of near neighbor search include:
- Classification: Given a collection of labelled objects $P$ and an unlabelled object $q$, classify $q$ according to the label of its nearest neighbor in $P$.
- Recommender systems: Find similar users, movies, songs, books etc. to be used for recommendation.
- Duplicate detection: Remove near-identical objects from a collection, for example duplicate web pages from the index of a search engine.
The trivial solution to the near neighbor problem would be to iterate through every $x \in P$ and compute $\mathrm{dist}(q, x)$ while keeping track of the nearest neighbor found so far. If we let $n = |P|$ denote the size of our collection and assume that it takes time $O(d)$ to compute the distance between a pair of objects, then the trivial solution uses time $O(nd)$.
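The linear scan described above can be sketched in a few lines. This is a minimal illustration that assumes Euclidean distance and points stored as tuples of floats; the function name `nearest_neighbor` and the sample data are hypothetical and not taken from the thesis.

```python
import math

def nearest_neighbor(points, query):
    """Return (index, distance) of the point in `points` closest to `query`."""
    best_index, best_dist = -1, math.inf
    for i, x in enumerate(points):
        dist = math.dist(x, query)  # one distance computation per point
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index, best_dist

# Hypothetical toy collection and query.
P = [(0.0, 0.0), (1.0, 1.0), (3.0, 4.0)]
index, dist = nearest_neighbor(P, (0.9, 1.2))
```

Every query touches all points in the collection, which is the cost that the data structures discussed in the rest of this chapter aim to avoid.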
Suppose we are interested in preprocessing the collection $P$ into a data structure that supports answering queries faster than the trivial solution. In two-dimensional Euclidean space there exists a solution based on the Voronoi diagram of $P$ with space usage $O(n)$ and query time $O(\log n)$ [68]. In higher dimensions, the best known solutions to the nearest neighbor problem suffer from either space usage or query time that is exponential in $d$ [86]. This phenomenon is known as the “curse of dimensionality” and has recently been substantiated by conditional hardness results [6, 67, 170, 139], showing for example that the problem of finding all nearest neighbors in a collection of $n$ points in $d$-dimensional Euclidean space cannot be solved in time subquadratic in $n$ when $d$ grows sufficiently fast with $n$ unless the Strong Exponential Time Hypothesis (SETH) is false [169].
In order to efficiently solve similarity search problems in high dimensional spaces, researchers and practitioners have turned to approximate solutions. Instead of finding the exact nearest neighbor of a query point, we settle for finding a point that is at most some approximation factor $c > 1$ times further away than the nearest neighbor. The algorithms and data structures for similarity search in this thesis are primarily aimed at providing efficient solutions to the approximate near neighbor problem defined as follows:
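Definition 1.1.
Let $P \subseteq X$ be a collection of $n$ points in a distance space $(X, \mathrm{dist})$. A solution to the $(r, cr)$-near neighbor problem is a data structure that supports the following query operation: Given a query $q \in X$, if there exists $x \in P$ with $\mathrm{dist}(q, x) \le r$ return $x' \in P$ with $\mathrm{dist}(q, x') \le cr$.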
The $(r, cr)$-near neighbor problem differs from the nearest neighbor problem by searching for any point within a fixed radius $r$ of the query point and allowing us to return points at distance up to $cr$ even though better candidates exist. It will be convenient to also define the $(s_1, s_2)$-similarity problem as the natural equivalent of the $(r, cr)$-near neighbor problem where we measure similarities rather than distances, i.e., we wish to find a point with similarity at least $s_1$ and we are willing to accept points with similarity at least $s_2 < s_1$.
By allowing an approximation factor $c > 1$ it is possible to solve the $(r, cr)$-near neighbor problem in Euclidean space (and many other spaces) with query time that is sublinear in $n$ and polynomial in $d$ using space polynomial in $n$ and $d$ [91, 66]. However, even approximation has its limits when it comes to alleviating the curse of dimensionality. Rubinstein [139] has recently shown that unless SETH is false, for every choice of constants $a, \delta > 0$ there exists $\varepsilon > 0$ such that a solution to the $(1+\varepsilon)$-approximate near neighbor problem with preprocessing time $O(n^{a})$ must use query time $\Omega(n^{1-\delta})$.
1.1 Locality-sensitive hashing
One of the most successful approaches for finding solutions to the approximate near neighbor problem in various spaces is known as locality-sensitive hashing, commonly abbreviated as LSH (see [12, 165] for more information). The idea behind locality-sensitive hashing is to construct a distribution $\mathcal{H}$ over functions $h \colon X \to R$ that are used to partition the space $X$. This randomized partitioning scheme is locality-sensitive in the sense that close points are more likely to hash to the same part of a randomly sampled partition. When discussing locality-sensitive hashing, we will sometimes refer to a distribution of locality-sensitive hash functions as a family.
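Definition 1.2 (Locality-sensitive hashing [91]).
Let $(X, \mathrm{dist})$ be a distance space and let $\mathcal{H}$ be a distribution over functions $h \colon X \to R$. We say that $\mathcal{H}$ is $(r, cr, p_1, p_2)$-sensitive if for $x, y \in X$ and $h \sim \mathcal{H}$ we have that:
- If $\mathrm{dist}(x, y) \le r$ then $\Pr[h(x) = h(y)] \ge p_1$.
- If $\mathrm{dist}(x, y) \ge cr$ then $\Pr[h(x) = h(y)] \le p_2$.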
We can speed up approximate near neighbor searches at the cost of some additional preprocessing by partitioning the set of points $P$ according to randomly sampled locality-sensitive hash functions $h_1, \dots, h_L$. A query for a point $q$ proceeds by considering the points of $P$ that collide with $q$ under $h_1, \dots, h_L$. Intuitively we want to sample enough hash functions such that the ball of radius $r$ around every potential query point $q$ is covered by the union of the parts $h_i^{-1}(h_i(q))$. This approach yields the following general LSH framework for solving the approximate near neighbor problem (for more details see Chapter 2).
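Theorem 1.1 (Indyk-Motwani [91, 86], simplified).
Let $\mathcal{H}$ be $(r, cr, p_1, p_2)$-sensitive and let $\rho = \frac{\log(1/p_1)}{\log(1/p_2)}$, then there exists a solution to the $(r, cr)$-near neighbor problem using $O(n^{1+\rho})$ words of space and with query time dominated by $O(n^{\rho} \log n)$ evaluations of functions from $\mathcal{H}$.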
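The framework behind this theorem can be illustrated with a toy index. The following is a minimal sketch and not the construction analyzed in the thesis: it assumes binary vectors under Hamming distance, uses coordinate sampling as the hash family, and takes the number of concatenated hash functions `K` and the number of tables `L` as hand-picked parameters rather than deriving them from $n$, $p_1$ and $p_2$. The names `LSHIndex` and `hamming` are hypothetical.

```python
import random
from collections import defaultdict

def hamming(x, y):
    """Number of coordinates where x and y differ."""
    return sum(a != b for a, b in zip(x, y))

class LSHIndex:
    """Toy LSH index: L hash tables, each keyed by K concatenated hash values."""

    def __init__(self, points, K, L, seed=0):
        rng = random.Random(seed)
        d = len(points[0])
        self.points = points
        # Each table samples K coordinates uniformly at random (with replacement).
        self.coords = [[rng.randrange(d) for _ in range(K)] for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]
        for idx, x in enumerate(points):
            for coords, table in zip(self.coords, self.tables):
                table[self._key(x, coords)].append(idx)

    @staticmethod
    def _key(x, coords):
        return tuple(x[i] for i in coords)

    def query(self, q, max_dist):
        """Return the index of a colliding point within max_dist of q, or None."""
        for coords, table in zip(self.coords, self.tables):
            for idx in table.get(self._key(q, coords), ()):
                if hamming(self.points[idx], q) <= max_dist:
                    return idx
        return None

# Hypothetical data: 50 random points in the 32-dimensional hypercube.
rng = random.Random(1)
points = [tuple(rng.randrange(2) for _ in range(32)) for _ in range(50)]
index = LSHIndex(points, K=8, L=10)
found = index.query(points[3], max_dist=4)
```

Increasing `K` makes far points less likely to share a bucket with the query, while increasing `L` raises the chance that a near point collides in at least one table; the theorem balances these two effects.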
1.2 Examples
To further introduce locality-sensitive hashing and the approach to solving the approximate near neighbor problem used in this thesis, we will present three simple and powerful families of locality-sensitive hash functions: Bit-sampling by Indyk and Motwani [91], MinHash by Broder [35], and SimHash by Charikar [47]. Indyk, Broder, and Charikar received the 2012 ACM Paris Kanellakis Theory and Practice Award “for their groundbreaking work on Locality-Sensitive Hashing that has had great impact in many fields of computer science including computer vision, databases, information retrieval, machine learning, and signal processing” [1]. We proceed by describing each of these families in turn, introducing relevant notation as we go along.
Bit-sampling.
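Indyk and Motwani introduced a simple family of locality-sensitive hash functions for the $d$-dimensional Boolean hypercube $\{0,1\}^d$ under Hamming distance $\mathrm{dist}(x, y) = |\{ i \in [d] : x_i \ne y_i \}|$ where $[d]$ denotes the set $\{1, \dots, d\}$. We sample a function $h \sim \mathcal{H}$ by sampling $i$ uniformly at random in $[d]$ and setting $h(x) = x_i$. It is easy to see that a pair of points $x, y$ fail to collide under a random hash function if and only if $i$ is sampled from the set of coordinates where $x$ and $y$ differ.
$$\Pr_{h \sim \mathcal{H}}[h(x) = h(y)] = 1 - \frac{\mathrm{dist}(x, y)}{d}.$$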
Suppose we want to use this function to solve the $(r, cr)$-near neighbor problem in Hamming space $(\{0,1\}^d, \mathrm{dist})$. Then, from Theorem 1.1 we obtain a query exponent of
$$\rho = \frac{\log(1/p_1)}{\log(1/p_2)} = \frac{\log(1 - r/d)}{\log(1 - cr/d)} \le \frac{1}{c},$$
where the details behind the last inequality can be found in [86]. In conclusion, bit-sampling gives a solution to the $(r, cr)$-near neighbor problem in Hamming space with query time roughly $n^{1/c}$ and space usage and preprocessing time roughly $n^{1+1/c}$.
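As a quick numerical check of the exponent, the sketch below evaluates $\rho = \log(1/p_1)/\log(1/p_2)$ for bit-sampling, assuming the collision probabilities $p_1 = 1 - r/d$ and $p_2 = 1 - cr/d$ and requiring $cr < d$. The function name `bit_sampling_rho` and the parameter values are hypothetical.

```python
import math

def bit_sampling_rho(d, r, c):
    """Exponent log(1/p1) / log(1/p2) with p1 = 1 - r/d and p2 = 1 - c*r/d."""
    if not 0 < c * r < d:
        raise ValueError("need 0 < c*r < d so that both probabilities lie in (0, 1)")
    p1 = 1 - r / d
    p2 = 1 - c * r / d
    return math.log(1 / p1) / math.log(1 / p2)

# Hypothetical instance: 128-dimensional hypercube, radius 8, approximation factor 2.
rho = bit_sampling_rho(d=128, r=8, c=2)
```

For these values the exponent comes out slightly below $1/c = 0.5$, in line with the inequality above.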
MinHash.
MinHash is a family of locality-sensitive hash functions with applications to similarity search and similarity estimation on sets under Jaccard similarity. Given sets $x, y \subseteq [d]$ their Jaccard similarity is defined by $J(x, y) = |x \cap y| / |x \cup y|$.
A random hash function from the MinHash distribution is specified by a random permutation of $[d]$ and hashes a set $x$ to the first element of $x$ in this permutation. The permutation can be specified by a uniformly random hash function $g \colon [d] \to [0,1]$ where $[0,1]$ denotes the closed interval from $0$ to $1$. Specifically, we sample a random $h \sim \mathcal{H}$ by sampling a uniformly random hash function $g$ and setting
$$h(x) = \arg\min_{i \in x} g(i).$$
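Two sets $x$ and $y$ collide under a random hash function if and only if the smallest element of $x \cup y$ in the random order is contained in $x \cap y$. Otherwise, the smallest element of $x \cup y$ is in $x \setminus y$ or the smallest element of $x \cup y$ is in $y \setminus x$ and there is no way the sets hash to the same element. Since the smallest element of $x \cup y$ is uniformly distributed we get that
$$\Pr[h(x) = h(y)] = \frac{|x \cap y|}{|x \cup y|} = J(x, y).$$
MinHash gives a solution to the $(s_1, s_2)$-similarity problem with exponent $\rho = \frac{\log(1/s_1)}{\log(1/s_2)}$.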
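The collision property of MinHash can be checked empirically. This is a minimal sketch under a few assumptions: elements are integers in `range(universe_size)`, the random order is given by independent draws from `random.random()` (which returns values in the half-open interval [0, 1)), and the names `sample_minhash` and `jaccard` are hypothetical.

```python
import random

def sample_minhash(universe_size, rng):
    """Sample one MinHash function over subsets of {0, ..., universe_size - 1}."""
    g = [rng.random() for _ in range(universe_size)]  # random rank g(i) per element
    return lambda s: min(s, key=g.__getitem__)        # element of s with smallest rank

def jaccard(x, y):
    return len(x & y) / len(x | y)

rng = random.Random(42)
x, y = {0, 1, 2, 3}, {2, 3, 4, 5}
trials = 20000
collisions = 0
for _ in range(trials):
    h = sample_minhash(6, rng)
    collisions += h(x) == h(y)
estimate = collisions / trials  # fraction of sampled functions under which x and y collide
```

The empirical collision rate should be close to `jaccard(x, y)`, which is $2/6$ for these two sets.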
SimHash.
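SimHash is a family of Boolean-valued locality-sensitive hash functions for $\mathbb{R}^d$ under cosine similarity $\cos\theta(x, y)$ where $\theta(x, y)$ denotes the angle between $x$ and $y$. We sample a function $h \sim \mathcal{H}$ by sampling a $d$-dimensional standard normal random variable $z$ and setting
$$h(x) = \operatorname{sign}(\langle z, x \rangle).$$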
Intuitively, we sample a random hyperplane that goes through the origin and hash points depending on which side of the hyperplane they are on (the sign of the inner product $\langle z, x \rangle$). Due to the rotational invariance of the standard normal distribution the properties of this scheme can be analyzed in two dimensions. The probability that two points $x, y$ on the unit circle are separated by a random line through the origin is exactly
$$\Pr[h(x) \ne h(y)] = \frac{\theta(x, y)}{\pi}.$$
This scheme yields a solution to the $(s_1, s_2)$-similarity problem under cosine similarity with $\rho = \frac{\log(1 - \arccos(s_1)/\pi)}{\log(1 - \arccos(s_2)/\pi)}$.
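The random hyperplane scheme can be simulated directly. The sketch below assumes vectors given as tuples of floats and represents the sign by a Boolean (`True` for a non-negative inner product); the names `sample_simhash` and `angle` are hypothetical.

```python
import math
import random

def sample_simhash(d, rng):
    """Sample h(x) = [<z, x> >= 0] for a d-dimensional standard normal vector z."""
    z = [rng.gauss(0.0, 1.0) for _ in range(d)]
    return lambda x: sum(zi * xi for zi, xi in zip(z, x)) >= 0

def angle(x, y):
    """Angle between x and y in radians."""
    dot = sum(a * b for a, b in zip(x, y))
    cos = dot / (math.hypot(*x) * math.hypot(*y))
    return math.acos(max(-1.0, min(1.0, cos)))  # clamp against rounding error

rng = random.Random(3)
x, y = (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)  # these two vectors are 45 degrees apart
trials = 20000
collisions = 0
for _ in range(trials):
    h = sample_simhash(3, rng)
    collisions += h(x) == h(y)
estimate = collisions / trials
predicted = 1 - angle(x, y) / math.pi  # collision probability 1 - theta / pi
```

With an angle of $\pi/4$ the predicted collision probability is $0.75$, and the empirical estimate should land close to it.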
1.3 Lower bounds
Given a space $(X, \mathrm{dist})$ and distance thresholds $r < cr$, we are interested in finding an $(r, cr, p_1, p_2)$-sensitive family with a value of $\rho$ that is as small as possible. The primary technique for deriving locality-sensitive hashing lower bounds has been Fourier analysis of Boolean functions under noisy inputs (see the excellent book by O’Donnell for a comprehensive introduction [120]). Lower bounds for locality-sensitive hashing schemes (distributions over functions) often follow from lower bounds on the behaviour of a single function under randomly $\alpha$-correlated inputs, defined as follows:
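Definition 1.3.
For $-1 \le \alpha \le 1$ and $x, y \in \{-1, 1\}^d$ we say that $(x, y)$ is randomly $\alpha$-correlated if the pairs $(x_i, y_i)$ are i.i.d. with $x_i$ uniform in $\{-1, 1\}$ and
$$y_i = \begin{cases} x_i & \text{with probability } \frac{1+\alpha}{2}, \\ -x_i & \text{with probability } \frac{1-\alpha}{2}. \end{cases}$$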
If two vectors $x, y$ are randomly $\alpha$-correlated their expected cosine similarity is $\alpha$, and their expected Hamming distance is given by $\frac{1-\alpha}{2} d$. As the dimensionality $d$ increases, the empirical correlation between $x$ and $y$ will be tightly concentrated around $\alpha$.
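The concentration claim can be observed in a small simulation. This sketch assumes the standard noise model on $\{-1, 1\}^d$, in which each $y_i$ equals $x_i$ with probability $(1+\alpha)/2$ and $-x_i$ otherwise; the function name `sample_correlated_pair` and the parameter values are hypothetical.

```python
import random

def sample_correlated_pair(d, alpha, rng):
    """Sample (x, y) in {-1, 1}^d with y_i = x_i with probability (1 + alpha) / 2."""
    x = [rng.choice((-1, 1)) for _ in range(d)]
    y = [xi if rng.random() < (1 + alpha) / 2 else -xi for xi in x]
    return x, y

rng = random.Random(7)
d, alpha = 10000, 0.5
x, y = sample_correlated_pair(d, alpha, rng)

# Both vectors have norm sqrt(d), so <x, y> / d is their cosine similarity.
correlation = sum(a * b for a, b in zip(x, y)) / d
# Fraction of coordinates where the vectors differ.
hamming_fraction = sum(a != b for a, b in zip(x, y)) / d
```

For any pair in $\{-1, 1\}^d$ the two quantities are tied together by `hamming_fraction == (1 - correlation) / 2`, so concentration of one implies concentration of the other.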
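Let $0 \le \beta < \alpha < 1$ and consider an $(r, cr, p_1, p_2)$-sensitive family for Hamming space with $r = \frac{1-\alpha}{2} d$ and $cr = \frac{1-\beta}{2} d$. Combining lower bounds by O’Donnell et al. [121] and Andoni and Razenshteyn [19] (building on work by Motwani et al. [114]), we have, up to lower order terms, that
$$\rho \ge \max\left\{ \frac{\log(1/\alpha)}{\log(1/\beta)},\ \frac{1-\alpha}{1+\alpha} \right\}. \tag{1}$$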
The lower bounds require that is not too small as a function of . In particular, only the trivial lower bound of holds if can be exponentially small in , but such families are typically not of interest for high-dimensional similarity search where we want . For a more comprehensive discussion of this issue see [121].
Compared against different constructions of locality-sensitive hash families, the two lower bounds comprising equation (1) reveal interesting properties of the Boolean hypercube. As $\alpha, \beta$ approach $1$ the lower bound of $\frac{\log(1/\alpha)}{\log(1/\beta)}$ is the larger of the two bounds. If we convert the lower bound to Hamming distance we get that $\rho \ge 1/c$ for an $(r, cr, p_1, p_2)$-sensitive family when $r/d \to 0$. This lower bound is tight against the bit-sampling LSH of Indyk and Motwani. The bit-sampling family can be described as randomly partitioning the Boolean hypercube into subcubes, so in a sense subcubes are an optimal “shape” for distinguishing between very short random walks and slightly longer random walks in the Boolean hypercube. The lower bound of $1/c$ in Hamming space gives a lower bound of $1/c^p$ for $\ell_p$-spaces (vectors in $\mathbb{R}^d$ under the $\ell_p$-norm $\|x\|_p = (\sum_i |x_i|^p)^{1/p}$). This follows from a direct embedding of the Boolean hypercube in $\ell_p$-space.
As $\beta$ approaches $0$ the lower bound of $\frac{1-\alpha}{1+\alpha}$ dominates. Converted to Hamming distance this bound becomes $\rho \ge \frac{1}{2c-1}$. For $cr = d/2$ this is tight against existing constructions that use balls to partition the hypercube [73, 14, 13]. Loosely speaking, in this regime we see that balls in Hamming space are optimal for simultaneously minimizing volume (capturing $0$-correlated points) while maximizing the probability of capturing positively correlated points.
In Hamming space, the family of locality-sensitive hash functions that give the best known upper bound on the -value can essentially be described as follows: We sample a function by sampling a sequence of balls of radius slightly below with the center of each ball being sampled uniformly at random from . A point is then hashed to the index of the first ball in the sequence that contains . As we increase and decrease the radius of the balls, this scheme has a -value for the -near neighbor problem of
[TABLE]
This scheme also works on the unit sphere if we replace the balls by spherical caps [156, 13]. The size of the gap between the lower bound in equation (1) and the upper bound (2) is shown in Figure 1. Since the gap is less than it is difficult to argue that closing the gap would have huge practical implications, especially since the lower order terms in existing constructions exceed this for most realistic applications [13]. Nevertheless, considering the tools that have gone into proving the existing lower bounds, we believe that it is of fundamental mathematical interest to understand how to best separate -correlated points from -correlated points.
1.4 Beyond locality-sensitive hashing
A common theme among recent advances in the area of theoretical approximate similarity search has been to move beyond standard locality-sensitive hashing [14, 100, 149, 24, 16, 54, 56, 22]. The results in this direction usually modify part of the framework, for example by constructing the locality-sensitive family by looking at the data, but the underlying approach of using locality-sensitive mappings from points to buckets remains the same. This thesis explores several variations of standard locality-sensitive hashing and we therefore briefly introduce some of this work here.
Data-dependent locality-sensitive hashing.
A sequence of papers [14, 17, 19, 18, 16] has explored the idea of data-dependent locality-sensitive hashing: If we allow the construction of $\mathcal{H}$ to depend on the set of data points $P$, how fast can we then solve the approximate near neighbor problem? Andoni and Razenshteyn were able to show matching upper and lower bounds of $\rho = \frac{1}{2c^2 - 1}$ in Euclidean space [17, 19]. This matches standard LSH upper and lower bounds in the case of random instances on the unit sphere, and indeed the construction by Andoni and Razenshteyn is based on a reduction to this case. Unfortunately the construction and its analysis are complicated and suffer from large lower order terms [16], although recent work has found some success in striking a balance between algorithmic simplicity and theoretical optimality using data-dependence in Hamming space [18].
Asymmetric locality-sensitive hashing.
Asymmetric locality-sensitive hashing extends the concept of standard locality-sensitive hashing to cover distributions over pairs of functions $(h, g)$ and studies how the probability of collision between pairs of points can be made to depend on the distance/similarity between the points [149, 22]. This modification to standard locality-sensitive hashing opens up new applications such as approximate search for furthest neighbors, orthogonal vectors [163], and annulus queries (see [22] for an overview). In Chapter 6 we show lower bounds for asymmetric locality-sensitive hashing.
Space-time tradeoffs.
The standard locality-sensitive hashing framework offers a balanced space-time tradeoff that is the result of a symmetric query and update procedure: Every data point is stored in $n^{\rho}$ buckets and during queries we probe $n^{\rho}$ buckets. A line of work has investigated how the query and update parts of the algorithm can be modified to yield different tradeoffs between space usage and query time [129, 106, 9, 95, 100, 54, 16]. Typically the performance of such solutions is expressed by two exponents: $\rho_u$ and $\rho_q$. During updates we store points in $n^{\rho_u}$ buckets and during queries we probe $n^{\rho_q}$ buckets.
Early work in this area focused on how to modify the standard locality-sensitive hashing query and update algorithms using an idea known as multi-probing [106]. Regular locality-sensitive hashing uses $L$ hash functions $h_1, \dots, h_L$. Suppose $h_i(q)$ denotes the $i$th bucket to be probed during the standard LSH query algorithm. By inspecting buckets in the neighborhood of $h_i(q)$, for example by adding some noise $\varepsilon$ to $q$ and probing $h_i(q + \varepsilon)$, we can increase the probability of finding a near neighbor of $q$, which in turn allows us to reduce $L$ while maintaining correctness.
Recent breakthroughs in this area have come by abandoning the locality-sensitive hashing framework in favor of a more direct approach based on locality-sensitive filtering [100, 54]. Finally, Andoni et al. [16] have combined their data-dependent approach to locality-sensitive hashing with the best known space-time tradeoff solutions for random data to obtain optimal space-time tradeoffs, as shown by matching lower bounds. The optimal trade-off between $\rho_u$ and $\rho_q$ for the $(r, cr)$-near neighbor problem in Euclidean space can be described by the equation
$$c^2 \sqrt{\rho_q} + (c^2 - 1) \sqrt{\rho_u} = \sqrt{2c^2 - 1}.$$
For a balanced tradeoff $\rho_q = \rho_u = \rho$ this collapses to $\rho = \frac{1}{2c^2 - 1}$ which is tight for data-dependent locality-sensitive hashing, but the bound has been shown to be tight for every choice of $\rho_u, \rho_q$ that satisfies the equation.
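To get a feel for the tradeoff curve, the sketch below solves for the query exponent given the update exponent. It assumes the tradeoff takes the form $c^2 \sqrt{\rho_q} + (c^2 - 1)\sqrt{\rho_u} = \sqrt{2c^2 - 1}$ reported by Andoni et al. [16]; the function name `query_exponent` is hypothetical.

```python
import math

def query_exponent(c, rho_u):
    """Solve c^2 sqrt(rho_q) + (c^2 - 1) sqrt(rho_u) = sqrt(2 c^2 - 1) for rho_q."""
    root = (math.sqrt(2 * c**2 - 1) - (c**2 - 1) * math.sqrt(rho_u)) / c**2
    if root < 0:
        raise ValueError("rho_u is beyond the point where the query exponent reaches 0")
    return root**2

c = 2.0
balanced = 1 / (2 * c**2 - 1)             # rho_q = rho_u = 1 / (2c^2 - 1)
near_linear_space = query_exponent(c, 0)  # rho_u = 0 gives (2c^2 - 1) / c^4
```

Plugging the balanced value back into `query_exponent` returns the same value, and for $c = 2$ the near-linear space end of the curve gives $\rho_q = 7/16$.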
Locality-sensitive filters and maps.
Locality-sensitive filtering [24] differs from locality-sensitive hashing in that it uses locality-sensitive subsets of space (filters) rather than locality-sensitive partitions (hash functions) to solve the approximate near neighbor problem. An example of a locality-sensitive filter family is the distribution over balls of some fixed radius in Hamming space. This idea is further extended to allow asymmetry by using different filters for queries and updates [100, 54]. It turns out that the filter family consisting of pairs of concentric balls in Hamming space can be used to solve the approximate near neighbor problem with optimal space-time tradeoffs, matching the lower bound of Andoni et al. [16] for random data. Chapter 3 further introduces locality-sensitive filtering and space-time tradeoffs.
In even greater generality we can think of locality-sensitive hashing and filtering as being approaches to constructing randomized mappings $M \colon X \to 2^{R}$ (where $2^{R}$ denotes the power set of $R$) from a space $X$ to a collection of buckets $R$ that satisfy certain properties. Recent work on set similarity search (Chapter 4) and improvements to the standard locality-sensitive hashing framework (Chapter 2) explores these ideas and obtains efficient search algorithms by deviating from the standard approach.
2 Part II: Pseudorandom hashing and generators
The second part of this thesis contains results on efficient pseudorandom hash functions and pseudorandom number generators. We are interested in replacing the use of true randomness in randomized algorithms and data structures with the output of a pseudorandom hash function or generator, stretching a small random seed into a much larger output of pseudorandom values, while retaining guarantees on the performance of these algorithms. For a primer on the general study of pseudorandom generators see [79].
Universal hashing.
The pseudorandomness part of this thesis focuses on one specific type of pseudorandomness known as $k$-wise independence or $k$-independence, first introduced to the field of computer science through the concept of universal hashing by Carter and Wegman [40].
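Definition 1.4.
Let $k$ be a positive integer and let $\mathcal{F}$ be a family of functions from $U$ to $R$. We say that $\mathcal{F}$ is a $k$-independent family of functions if for every choice of $k$ distinct keys $x_1, \dots, x_k \in U$ and arbitrary values $y_1, \dots, y_k \in R$ we have that
$$\Pr_{f \sim \mathcal{F}}[f(x_1) = y_1 \wedge \dots \wedge f(x_k) = y_k] = |R|^{-k}.$$
Furthermore, we say that $f$ is $k$-independent when it is selected uniformly at random from a family of $k$-independent functions.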
We can sample a $k$-independent hash function $f(x) = \sum_{i=0}^{k-1} a_i x^i \bmod p$ by sampling each $a_i$ uniformly at random from the set $\{0, 1, \dots, p-1\}$ where $p$ is prime. In fact, the family of polynomials of degree at most $k-1$ over a finite field is $k$-independent [92]. We are typically interested in applications where the size of the universe $|U|$ is much larger than the degree of independence $k$.
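Polynomial hashing of this kind is short to write down. The following is a minimal sketch assuming keys in $\{0, \dots, p-1\}$ for a prime $p$ and evaluation by Horner's rule; the function names and the choice of the Mersenne prime $2^{61} - 1$ are illustrative assumptions.

```python
import random

def evaluate(coeffs, x, p):
    """Evaluate a_0 + a_1 x + ... + a_{k-1} x^{k-1} mod p using Horner's rule."""
    acc = 0
    for a in reversed(coeffs):
        acc = (acc * x + a) % p
    return acc

def sample_polynomial_hash(k, p, rng):
    """Sample a random polynomial of degree at most k - 1 over the integers mod p."""
    coeffs = [rng.randrange(p) for _ in range(k)]
    return lambda x: evaluate(coeffs, x, p)

# Hypothetical usage: a 4-independent function over keys below 2**61 - 1.
f = sample_polynomial_hash(k=4, p=2**61 - 1, rng=random.Random(0))
value = f(123456789)
```

Evaluation takes $k$ multiplications and additions modulo $p$, so the cost per hash value grows linearly with the degree of independence.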
Different types of hashing-based dictionaries work for $k$-independent hash functions with $k$ much smaller than the number of elements in the dictionary which we denote by $n$. For example, it was shown that $5$-independence suffices for linear probing to ensure expected constant time per operation [124]. It is known that $O(\log n)$-independence suffices for Cuckoo hashing [127], but a small constant degree of independence is not enough to ensure constant amortized cost per operation [63]. For a brief introduction to the use of random hashing in algorithms and data structures see [69].
Fast hashing and lower bound.
For applications that require super-constant independence, the time to evaluate the hash function can be a performance bottleneck. A $k$-independent polynomial hash function can be stored using $k$ words and evaluated using $O(k)$ time on a word-RAM, assuming constant time arithmetic over the finite field. What if we are willing to use more space to represent a $k$-independent hash function in order to reduce the evaluation time? Siegel [151] gave a powerful cell-probe lower bound for this problem, showing that for a $k$-independent hash function with domain size $u$, even if we use space roughly $u^{1/t}$ for some $t < k$ the evaluation time has to be $\Omega(t)$.
Siegel also showed the existence of a matching upper bound based on highly unbalanced bipartite expander graphs with left vertex set corresponding to the domain of the hash function, right vertex set of size roughly $u^{1/t}$, and left outdegree $O(t)$. Given an appropriate expander graph we can sample a $k$-independent hash function $f$ by associating each right vertex with a random element from $R$ where we assume that $R$ is an abelian group, such as the integers under modular arithmetic. To compute $f(x)$ we take the sum of the random elements associated with the neighbors of the vertex $x$ and return the result.
Unfortunately we only know of the existence of such optimal expander graphs by the probabilistic method: a random bipartite graph has the right properties for optimal $k$-independent hashing with overwhelming probability if we parameterize the graph generation process correctly. Several works, Siegel’s original paper included, attempt to approach the performance of such optimal bipartite expander graphs by the use of probabilistic constructions [72, 123, 158, 58]. In Chapter 9 we show a probabilistic construction with space usage and evaluation time that almost matches the lower bound. Finding optimal explicit constructions remains a major open problem.
Other approaches to the problem of finding fast hash functions with theoretical guarantees include the study of tabulation hashing and its variations, which have guarantees beyond what can be derived from the degree of -independence [159], simulating uniformly random hashing in constant time on a subset of the universe [123], reusing randomness by splitting the problem into sub-problems that share a single highly random hash function [71], and extracting additional randomness from the input to the hash function [59].
Generating -independent random variables.
The generation of -independent random variables differs from random hashing by allowing the algorithm designer to specify where to evaluate a -independent function in order to generate a sequence of variables that is -independent. The problem of generating a sequence of -independent random variables is therefore easier than the problem of constructing a data structure to represent a random -independent hash function that an adversary can choose to evaluate at an arbitrary point.
We can take a standard -independent polynomial hash function and evaluate it in points in time using fast multipoint evaluation algorithms [27, 164], giving us a generator of -independent random variables with generation time per variable that uses space . This in itself shows that the task of generation is easier than hashing, as it would be impossible to evaluate a -independent hash function in time using space , if for example . In Chapter 8 we show how to generate -independent variables in constant time, independent of , using space .
3 Overview and contributions
This thesis is divided into two parts. The first part presents algorithms and lower bounds for various problems related to similarity search. The second part presents algorithms for the efficient generation of high-quality pseudorandom numbers, as well as efficient hash functions. The chapters are based on the following papers:
- I. Similarity search.
- 2. Tobias Christiani: Fast locality-sensitive hashing frameworks for approximate near neighbor search [53]. 2017. Unpublished.
- 3. Tobias Christiani: A Framework for Similarity Search with Space-Time Tradeoffs using Locality-Sensitive Filtering [54]. SODA 2017.
- 4. Tobias Christiani and Rasmus Pagh: Set similarity search beyond MinHash [56]. STOC 2017.
- 5. Tobias Christiani, Rasmus Pagh and Johan Sivertsen: Scalable and robust set similarity join [57]. ICDE 2018.
- 6. Martin Aumüller, Tobias Christiani, Rasmus Pagh and Francesco Silvestri: Distance-sensitive hashing [22]. PODS 2018.
- 7. Tobias Christiani: Optimal Boolean locality-sensitive hashing. 2018. Unpublished.
- II. Pseudorandomness.
- 8. Tobias Christiani and Rasmus Pagh: Generating k-independent variables in constant time [55]. FOCS 2014.
- 9. Tobias Christiani, Rasmus Pagh and Mikkel Thorup: From Independence to Expansion and Back Again [58]. STOC 2015.
We proceed by giving a brief description of the contribution of each chapter.
3.1 Part I: Similarity search
Chapter 2: Fast locality-sensitive hashing frameworks.
This chapter begins by surveying different techniques for constructing a solution to the approximate near neighbor problem from a family of locality-sensitive hash functions. Given a family of locality-sensitive hash functions, the standard Indyk-Motwani framework (Theorem 1.1) uses functions from to solve the approximate near neighbor problem. During a query all of these hash functions are evaluated, dominating the query time. For many LSH schemes the time to evaluate a single function is or greater, as witnessed for example by SimHash or MinHash, further exacerbating the problem. Building on recent work by Dahlgaard et al. [64] we show that the number of locality-sensitive hash functions can be reduced to in general, yielding an improved LSH framework. We combine this result with a technique from another LSH framework by Andoni and Indyk [10] to reduce the word-RAM complexity of this improved framework by a logarithmic factor to .
Chapter 3: Space-time tradeoffs for similarity search.
This chapter introduces a framework for solving the approximate near neighbor problem with space-time tradeoffs using locality-sensitive filtering. We show concrete solutions on the unit sphere under cosine similarity with extensions to -space for every . These results improve and generalize prior work [100, 95]. We also include a lower bound on the space-time tradeoff that is tight, but suffers from some important restrictions. A paper by Andoni et al. [16] has since shown a strengthened lower bound and an improved upper bound through the use of data-dependent techniques. An early version of the paper behind this chapter formed part of my master’s thesis. At the end of the chapter we have added an improved locality-sensitive filtering framework compared to the one in the main text, building on ideas introduced in Chapters 2 and 4.
Chapter 4: Set similarity search beyond MinHash.
In this chapter we consider the problem of set similarity search under Braun-Blanquet similarity . We show that the -similarity problem in this setting can be solved with an exponent of and that this is tight among solutions based on data-independent locality-sensitive maps. The upper bound is based on a novel construction inspired by branching processes and interestingly, although it is data-independent, it outperforms the best known data-dependent techniques for a large portion of the parameter space . The lower bound follows from a reduction to the standard -near neighbor problem in Hamming space for . In this setting the lower bound by O’Donnell et al. [121] is tight and we are able to show that it extends to Braun-Blanquet similarity for every choice of . This is interesting in the light of the gap in our knowledge when it comes to the usual -similarity problem for cosine similarity, as explained in the introduction.
Chapter 6: Lower bounds for asymmetric locality-sensitive hashing.
In this chapter we derive lower bounds (on the -value) for asymmetric locality-sensitive hashing. Our lower bound covers the case of asymmetric families for approximate near neighbor search, as well as the case of approximate furthest neighbor search where we are interested in having the collision probability of increase in the distance between points. We show that our lower bounds are tight against existing symmetric constructions in the case of the application to near neighbor search, and that this construction can easily be modified to yield an optimal asymmetric construction for furthest neighbor search.
Chapter 7: Optimal Boolean locality-sensitive hashing.
In this chapter we show that, among the class of Boolean locality-sensitive hash functions , bit-sampling is an optimal LSH (minimizes the -value) for the -near neighbor problem in Hamming space for every choice of . This stands in contrast to the lower bound by O’Donnell et al. [121] which is unrestricted with respect to the range of the locality-sensitive hash functions. Bit-sampling only matches this unrestricted lower bound in the case where approach . Our result settles the question of optimal Boolean locality-sensitive hashing for Hamming space and shows that we have to look towards families of hash functions with a larger range in order to further improve the -value compared to bit-sampling. Andoni et al. [13] have shown lower bounds on the -value on the unit sphere as a function of the size of the range of the hash function.
3.2 Part II: Pseudorandom hashing and number generation
Chapter 8: Generating -independent random variables in constant time.
We investigate the problem of efficiently generating -independent random variables and give an explicit generator of -independent random variables with constant generation time, independent of . The explicit construction combines multipoint evaluation of polynomials over finite fields with a cascading construction of explicit bipartite expander graphs by Capalbo et al. [39]. The space usage of this construction is with a very large exponent in the polynomial. We also show a randomized version of the same construction that uses a randomly generated bipartite graph. This reduces the space overhead to at the cost of introducing an error probability (the generated sequence may fail to be -independent) that is polynomially small in . We implement a version of the generator that combines a random bipartite expander with fast multipoint evaluation of polynomials over and show that it scales well, even for generating -independent variables.
Chapter 9: Near-optimal -independent hashing.
In this chapter we attack the problem of constructing fast -independent random hash functions. We use the fact that there is a sort of duality between randomized bipartite expander graphs and -independent random hash functions. A bipartite expander graph that expands on subsets of size can be used to construct a -independent family of functions, and a -independent function is likely to represent a bipartite expander that expands on subsets of size . We take a small bipartite expander graph and apply an inefficient graph product that preserves its expansion properties while increasing the size of the left vertex set (the size of the domain of the resulting hash function). Then we use this resulting bipartite expander graph to construct a -independent random hash function that now represents a new expander on a larger domain with optimal properties. By applying this strategy recursively using different graph products we are able to give randomized constructions of -independent hash functions in the word-RAM model that almost match Siegel’s cell probe lower bound [151].
4 Conclusion and open problems
4.1 Similarity search
We have shown new upper and lower bounds for problems related to approximate similarity search in high-dimensional spaces, showing improved locality-sensitive hashing frameworks, lower bounds for Boolean locality-sensitive hashing, and going beyond locality-sensitive hashing in several different directions with asymmetric locality-sensitive hashing, space-time tradeoffs through locality-sensitive filtering, and locality-sensitive maps for set similarity search.
Optimal data-independent locality-sensitive hashing.
It remains open to close the gap between the upper and lower bounds on the -value of -sensitive families in Hamming space (shown in Figure 1). Existing lower bounds seem to have explored the limits of what can be shown with our current understanding of hypercontractive inequalities and Fourier analysis of Boolean functions. We conjecture that the ball-based LSH construction with the -value given in equation 2 is asymptotically optimal for every choice of .
Orthogonal search.
Suppose we are interested in an asymmetric locality-sensitive hashing scheme for the unit sphere under cosine similarity that can be used to search for orthogonal vectors. For this purpose we want the probability of collision to be as high as possible for 0-correlated (orthogonal) vectors and have the probability of collision decrease as the correlation becomes positive or negative. Let denote the probability of collision of the asymmetric locality-sensitive hashing scheme for a pair of -correlated vectors. The current best upper bound on is given by [22]. The lower bound presented in Chapter 6 only implies . Obtaining a “two-sided” lower bound that simultaneously relates to both and has close ties to the open symmetric Gaussian problem [119]. It is conjectured that the upper bound is tight.
Simple data-dependent constructions.
It is an important open problem to find simpler data-dependent solutions to approximate near neighbor search. Despite the intuitive appeal of using the data to inform the construction of the solution, relatively few people have succeeded in making theoretical progress in this area [14, 17, 18]. Perhaps by relaxing the problem slightly, for example by only requiring that queries that follow a specific distribution succeed with constant probability, progress can be made. An example of such a query distribution could be to sample one of the data points uniformly at random and sample the query from a ball around the data point. Attacking the problem for data structures that use near-linear space in also seems like a promising approach.
4.2 -independent hashing and generation
We have shown near-optimal results for -independent hashing and generation.
Optimal explicit unbalanced bipartite expander graphs.
The main open problem in this area is the explicit construction of highly unbalanced bipartite expander graphs with optimal properties. We would like to be able to evaluate the neighbor function of a left -regular bipartite expander graph with optimal parameters (matching Siegel’s lower bound for -independent hashing) using time that is at most polynomial in the bit-length of the input. For the application to random hashing we would furthermore like to be able to list the neighbors of a vertex in time . The construction in Chapter 9 is essentially able to solve this task in time , so it would require a very clean explicit construction to yield an improvement to the efficiency of random hashing in practice. Results on the construction of explicit bipartite expanders by Guruswami et al. [84] and preprocessing polynomials [96] are based directly on results such as the fundamental theorem of algebra and the Chinese remainder theorem and give hope that there exists a simple explicit construction.
Constant time generators with minimal space.
The fast generators in Chapter 8 use polynomials over finite fields and require space . Through the sequential evaluation of hash functions presented in Chapter 9 we can remove the need for arithmetic over finite fields, but it seems that if we want to use minimal space the evaluation time will still be with space usage . Is it possible to get constant-time generation in a restricted word-RAM model without multiplication using space ?
Part I Similarity search
Chapter 2 Fast locality-sensitive hashing frameworks
‘Renewed shall be blade that was broken’
The Indyk-Motwani Locality-Sensitive Hashing (LSH) framework (STOC 1998) is a general technique for constructing a data structure to answer approximate near neighbor queries by using a distribution over locality-sensitive hash functions that partition space. For a collection of points, after preprocessing, the query time is dominated by evaluations of hash functions from and hash table lookups and distance computations where is determined by the locality-sensitivity properties of . It follows from a recent result by Dahlgaard et al. (FOCS 2017) that the number of locality-sensitive hash functions can be reduced to , leaving the query time to be dominated by distance computations and additional word-RAM operations. We state this result as a general framework and provide a simpler analysis showing that the number of lookups and distance computations closely match the Indyk-Motwani framework. Using ideas from another locality-sensitive hashing framework by Andoni and Indyk (SODA 2006) we are able to reduce the number of additional word-RAM operations to .
5 Introduction
The -approximate near neighbor problem is the problem of preprocessing a collection of points in a space into a data structure that after preprocessing supports the following query operation: Given a query point , if there exists a point with , then the data structure is guaranteed to return a point such that .
Indyk and Motwani [91] introduced a general framework for constructing solutions to the approximate near neighbor problem using a technique known as locality-sensitive hashing (LSH). The framework takes a distribution over hash functions with the property that near points are more likely to collide under a random . During preprocessing a number of locality-sensitive hash functions are sampled from and used to hash the points of into buckets. The query algorithm evaluates the same hash functions on the query point and looks into the associated buckets to find an approximate near neighbor.
The locality-sensitive hashing framework of Indyk and Motwani has had a large impact in both theory and practice (see surveys [12] and [165] for an introduction), and many of the best known (data-independent) solutions to the approximate near neighbor problem in high-dimensional spaces, such as Euclidean space [11], the unit sphere under inner product similarity [13], and sets under Jaccard similarity [33] come in the form of families of locality-sensitive hash functions that can be plugged into the Indyk-Motwani LSH framework. Recent work on data-dependent locality-sensitive hashing has further improved solutions for -spaces and cosine similarity [14, 17, 16], but these solutions typically do not come directly in the form of a distribution over locality-sensitive hash functions and as such it is unclear whether the techniques in this paper can yield further speedups to these results.
Definition 2.1 (Locality-sensitive hashing [91]).
Let be a distance space and let be a distribution over functions . We say that is -sensitive if for and we have that:
- If then .
- If then .
The Indyk-Motwani framework takes a -sensitive family and constructs a data structure that solves the approximate near neighbor problem for parameters with some positive constant probability of success. We will refer to this randomized approximate version of the near neighbor problem as the -near neighbor problem, where we require queries to succeed with probability at least (see Definition 2.2). To simplify the exposition we will assume throughout the introduction, unless otherwise stated, that are constant, that a hash function can be stored in words of space, and for that a point can be stored in words of space. The assumption of a constant gap between and allows us to avoid performing distance computations by instead using the 1-bit sketching scheme of Li and König [103] together with the family to approximate distances (see Section 8.1 for details). In the remaining part of the paper we will state our results without any such assumptions to ensure, for example, that our results hold in the important case where may depend on or the dimensionality of the space [11, 13].
Theorem 2.1 (Indyk-Motwani [91, 86], simplified).
Let be -sensitive and let , then there exists a solution to the -near neighbor problem using words of space and with query time dominated by evaluations of functions from .
The query time of the Indyk-Motwani framework is dominated by the number of evaluations of locality-sensitive hash functions. To make matters worse, almost all of the best known and most widely used locality-sensitive families have an evaluation time that is at least linear in the dimensionality of the underlying space [33, 47, 66, 11, 13]. Significant effort has been devoted to the problem of reducing the evaluation complexity of locality-sensitive hash families [156, 75, 65, 13, 97, 147, 148, 64], while the question of how many independent locality-sensitive hash functions are actually needed to solve the -near neighbor problem has received relatively little attention [10, 64].
This paper aims to bring attention to, strengthen, generalize, and simplify results that reduce the number of locality-sensitive hash functions used to solve the -near neighbor problem. In particular, we will extract a general framework from a technique introduced by Dahlgaard et al. [64] in the context of set similarity search under Jaccard similarity, showing that the number of locality-sensitive hash functions can be reduced to in general. Reducing the number of locality-sensitive hash functions allows us to spend time per hash function evaluation without increasing the overall complexity of the query algorithm — something which is particularly useful in Euclidean space where the best known LSH upper bounds offer a tradeoff between the -value that can be achieved and the evaluation complexity of the locality-sensitive hash function [11, 13, 97].
The main technical contribution of this paper is to reduce the word-RAM complexity of the general LSH framework from to by combining techniques from Dahlgaard et al. and Andoni and Indyk [10].
5.1 Related work
Indyk-Motwani.
The Indyk-Motwani framework uses independent partitions of space, each formed by overlaying random partitions induced by random hash functions from a locality-sensitive family . The parameter is chosen such that a random partition has the property that a pair of points with has probability of ending up in the same part of the partition, while a pair of points with has probability of colliding. By randomly sampling such partitions we are able to guarantee that a pair of near points will collide with constant probability in at least one of them. Applying these partitions to our collection of data points and storing the result of each partition of in a hash table we obtain a data structure that solves the -near neighbor problem as outlined in Theorem 2.1 above. Sections 7 and 7.1 contain a more complete description of LSH-based frameworks and the Indyk-Motwani framework.
Andoni-Indyk.
As previously mentioned, many locality-sensitive hash functions happen to have a super-constant evaluation time. This motivated Andoni and Indyk to introduce a replacement for the Indyk-Motwani framework in a paper on substring near neighbor search [10]. The key idea is to re-use hash functions from a small collection of size by forming all combinations of hash functions. This technique is also known as tensoring and has seen some use in the work on alternative solutions to the approximate near neighbor problem, in particular the work on locality-sensitive filtering [73, 24, 54]. By applying the tensoring technique the Andoni-Indyk framework reduces the number of hash functions to as stated in Theorem 2.2.
Theorem 2.2 (Andoni-Indyk [10], simplified).
Let be -sensitive and let , then there exists a solution to the -near neighbor problem using words of space and with query time dominated by evaluations of functions from and other word-RAM operations.
The paper by Andoni and Indyk did not state this result explicitly as a theorem in the same form as the Indyk-Motwani framework; the analysis made some implicit restrictive assumptions on and ignored integer constraints. Perhaps for these reasons the result does not appear to have received much attention, although it has seen some limited use in practice [152]. In Section 7.2 we present a slightly different version of the Andoni-Indyk framework together with an analysis that satisfies integer constraints, providing a more accurate assessment of the performance of the framework in the general, unrestricted case.
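The tensoring idea behind the Andoni-Indyk framework can be sketched as follows. This is an illustration under stated assumptions rather than the framework as analyzed in Section 7.2: it uses bit-sampling in Hamming space as the base family, forms one key per size-`t` subset of a small collection, and all names and parameter values are hypothetical.

```python
import itertools
import random


def bit_sampling(dim, rng):
    """One locality-sensitive function for Hamming space: read a single random coordinate."""
    i = rng.randrange(dim)
    return lambda x: x[i]


def tensored_keys(base_values, t):
    """Combine cached base hash values into one key per size-t subset of indices."""
    return [tuple(base_values[i] for i in subset)
            for subset in itertools.combinations(range(len(base_values)), t)]


rng = random.Random(7)
dim = 16
base = [bit_sampling(dim, rng) for _ in range(6)]

x = tuple(rng.getrandbits(1) for _ in range(dim))
base_values = [f(x) for f in base]      # only 6 evaluations of functions from the family
keys = tensored_keys(base_values, t=2)  # C(6, 2) = 15 partition keys
```

The point of the sketch is the ratio: the number of partition keys grows combinatorially in the size of the base collection, while the number of hash function evaluations per query stays equal to that size.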
Dahlgaard-Knudsen-Thorup.
The paper by Dahlgaard et al. [64] introduced a different technique for constructing the hash functions/partitions from a smaller collection of hash functions from . Instead of forming all combinations of subsets of size as in the Andoni-Indyk framework, they sample hash functions from the collection to form each of the partitions. The paper focused on a particular application to set similarity search under Jaccard similarity, and stated the result in terms of a solution to this problem. In Section 7.3 we provide a simplified and tighter analysis to yield a general framework:
Theorem 2.3 (Dahlgaard-Knudsen-Thorup [64], simplified).
Let be -sensitive and let , then there exists a solution to the -near neighbor problem using words of space and with query time dominated by evaluations of functions from and other word-RAM operations.
The analysis of [64] indicates that the Dahlgaard-Knudsen-Thorup framework, when compared to the Indyk-Motwani framework, would use at least times as many partitions (and a corresponding increase in the number of hash table lookups and distance computations) to solve the -near neighbor problem with success probability at least . Using elementary tools, the analysis in this paper shows that we only have to use twice as many partitions as the Indyk-Motwani framework to obtain the same guarantee of success.
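The sampling technique can be illustrated with a small sketch. It assumes a layout with `k` rows of `m` base hash functions where each partition picks one function per row; bit-sampling serves as the base family, and Python's pseudorandom generator replaces the pairwise independent hash functions used for the selection in the actual framework. All names and values are hypothetical.

```python
import random


def sample_columns(k, m, num_partitions, rng):
    """For each partition, choose one column index per row of the k-by-m grid of hash functions."""
    return [[rng.randrange(m) for _ in range(k)] for _ in range(num_partitions)]


def partition_keys(grid_values, columns):
    """Assemble one length-k key per partition from precomputed values grid_values[i][c]."""
    return [tuple(grid_values[i][c] for i, c in enumerate(choice))
            for choice in columns]


rng = random.Random(11)
k, m, num_partitions, dim = 3, 4, 10, 32
# grid[i][c] is the coordinate read by the bit-sampling function in row i, column c.
grid = [[rng.randrange(dim) for _ in range(m)] for _ in range(k)]
columns = sample_columns(k, m, num_partitions, rng)

x = tuple(rng.getrandbits(1) for _ in range(dim))
grid_values = [[x[coord] for coord in row] for row in grid]  # k * m hash evaluations
keys = partition_keys(grid_values, columns)                  # one key of length k per partition
```

In this sketch the hash functions are evaluated once per grid cell, and the remaining query work is the bookkeeping needed to assemble each key from the cached values.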
Number of hash functions.
To provide some idea of the number of hash functions used by the different frameworks, Figure 2 shows the value of that is obtained by the Indyk-Motwani (IM), Andoni-Indyk (AI), and Dahlgaard-Knudsen-Thorup (DKT) frameworks according to the analysis in Section 7 for and every value of for a solution to the -near neighbor problem on a collection of points with success probability at least . Note that Figure 2 shows an upper bound on the number of hash functions used by the frameworks according to the analysis in order to provide a solution with theoretical guarantees to the approximate near neighbor problem for any data set, and not the actual setting required for a particular data set (we haven’t actually performed an experiment on points). In the analysis behind Figure 2 we have attempted to minimize within each respective framework.
Figure 2 reveals that the number of hash functions used by the Indyk-Motwani framework exceeds , the size of the collection of points , as approaches . In addition, locality-sensitive hash functions used in practice such as Charikar’s SimHash [47] and -stable LSH [66] have evaluation time for points in . These two factors might help explain why a linear scan over sketches of the entire collection of points is a popular approach to solve the approximate near neighbor problem in practice [168, 80]. The Andoni-Indyk framework reduces the number of hash functions by several orders of magnitude, and the Dahlgaard-Knudsen-Thorup framework presents another improvement of several orders of magnitude. Since the word-RAM complexity of the DKT framework matches the number of hash functions used by the IM framework, the gap between the solid line (DKT) and the dotted line (IM) gives some indication of the time we can spend on evaluating a single hash function in the DKT framework without suffering a noticeable increase in the query time.
5.2 Contribution
Improved word-RAM complexity.
In addition to our work on the Andoni-Indyk and Dahlgaard-Knudsen-Thorup frameworks as mentioned above, we show how the word-RAM complexity of the DKT framework can be reduced by a logarithmic factor. The solution is a simple combination of the DKT sampling technique and the AI tensoring technique: First we use the DKT sampling technique twice to construct two collections of partitions. Then we use the AI tensoring technique to form pairs of partitions from the two collections. Below we state our main Theorem 2.4 in its general form where we make no implicit assumptions about ( and are not assumed to be constant and can depend on for example ) or about the complexity of storing a point or a hash function, or computing the distance between pairs of points in the space .
Theorem 2.4.
Let be -sensitive and let , then there exists a solution to the -near neighbor problem with the following properties:
- The query complexity is dominated by evaluations of functions from , distance computations, and other word-RAM operations.
- The solution uses words of space in addition to the space required to store the data and functions from .
Under the same simplifying assumptions used in the statements of Theorems 2.1, 2.2, and 2.3, our main Theorem 2.4 can be stated as Theorem 2.3 with the word-RAM complexity reduced by a logarithmic factor to . This improvement in the word-RAM complexity comes at the cost of a (rather small) constant factor increase in the number of hash functions, lookups, and distance computations compared to the DKT framework. By varying the size of the collection of hash functions from and performing independent repetitions we can obtain a tradeoff between the number of hash functions and the number of lookups. In Section 9 we remark on some possible improvements in the case where is large.
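The pairing step of the combined construction can be sketched in a few lines. The example assumes that two collections of partition keys have already been produced by the sampling technique; the key values below are made up for illustration, and the function name is hypothetical.

```python
import itertools


def paired_keys(keys_a, keys_b):
    """Tensor two collections of partition keys: one combined key per pair."""
    return [(a, b) for a, b in itertools.product(keys_a, keys_b)]


keys_a = [(0, 1), (1, 1), (0, 0)]  # hypothetical keys from the first sampled collection
keys_b = [(1, 0), (0, 0), (1, 1)]  # hypothetical keys from the second sampled collection
combined = paired_keys(keys_a, keys_b)
```

In this sketch each combined key costs a single pairing operation once the two smaller collections exist, so the per-key assembly work is paid only for the smaller collections. This is the intuition for how pairing can lower the bookkeeping cost.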
Distance sketching using LSH.
Finally, we combine Theorem 2.4 with the 1-bit sketching scheme of Li and König [103] where we use the locality-sensitive hash family to create sketches that allow us to leverage word-level parallelism and avoid direct distance computations. This sketching technique is well known and has been used before in combination with LSH-based approximate similarity search [57], but we believe there is some value in the simplicity of the analysis and in a clear statement of the combination of the two results as given in Theorem 2.5, for example in the important case where are constant.
Theorem 2.5.
Let be -sensitive and let , then there exists a solution to the -near neighbor problem with the following properties:
- The complexity of the query operation is dominated by evaluations of hash functions from and other word-RAM operations.
- The solution uses words of space in addition to the space required to store the data and hash functions from .
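A minimal sketch of 1-bit sketching from locality-sensitive hash values is given below. It assumes that each hash value is mapped to a single bit by a random lookup table, so that two bits agree with probability (1 + p) / 2 when the underlying hash values collide with probability p. The hash functions, value range, and sketch length are hypothetical stand-ins.

```python
import random


def one_bit_sketch(point, hash_functions, bit_functions):
    """Pack one bit per locality-sensitive hash value into a Python integer."""
    sketch = 0
    for j, (h, b) in enumerate(zip(hash_functions, bit_functions)):
        sketch |= b(h(point)) << j
    return sketch


def estimated_collision_probability(s1, s2, length):
    """Invert the agreement rate (1 + p) / 2 using the number of differing bits."""
    differing = bin(s1 ^ s2).count("1")  # XOR followed by a population count
    return 1.0 - 2.0 * differing / length


rng = random.Random(5)
length = 64
# Hypothetical stand-ins: hash function j reads coordinate j mod 4, values lie in {0, ..., 9}.
hash_functions = [lambda p, j=j: p[j % 4] for j in range(length)]
bit_tables = [[rng.getrandbits(1) for _ in range(10)] for _ in range(length)]
bit_functions = [lambda v, t=t: t[v] for t in bit_tables]

a = one_bit_sketch((1, 2, 3, 4), hash_functions, bit_functions)
b = one_bit_sketch((1, 2, 3, 4), hash_functions, bit_functions)
```

Comparing two sketches takes one XOR and one population count per machine word, which is the word-level parallelism that replaces direct distance computations.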
6 Preliminaries
Problem and dynamization.
We begin by defining the version of the approximate near neighbor problem that the frameworks presented in this paper will be solving:
Definition 2.2.
Let be a collection of points in a distance space . A solution to the -near neighbor problem is a data structure that supports the following query operation: Given a query point , if there exists a point with , then, with probability at least , return a point such that .
We aim for solutions with a failure probability that is upper bounded by . The standard trick of using independent repetitions of the data structure allows us to reduce the probability of failure to . For the sake of simplicity we restrict our attention to static solutions, meaning that we do not concern ourselves with the complexity of updates to the underlying set , although it is simple to modify the static solutions presented in this paper to dynamic solutions where the update complexity essentially matches the query complexity [122, 86].
LSH powering.
The Indyk-Motwani framework and the Andoni-Indyk framework will make use of the following standard powering technique described in the introduction as “overlaying partitions”. Let be an integer and let denote a locality-sensitive family of hash functions as in Definition 2.1. We will use the notation to denote the distribution over functions where
$$h(x) = (h_1(x), \dots, h_k(x))$$
and are sampled independently at random from . It is easy to see that is -sensitive. To deal with some special cases we define to be the family consisting of a single constant function.
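The effect of powering on collision probabilities can be checked exactly on a small example. The sketch below assumes bit-sampling in Hamming space as the base family and enumerates every way of choosing `k` coordinates; the function name and the example points are illustrative.

```python
import itertools
from fractions import Fraction


def powered_collision_probability(x, y, k):
    """Exact collision probability of k concatenated bit-sampling functions, by enumeration."""
    d = len(x)
    hits = sum(all(x[i] == y[i] for i in idx)
               for idx in itertools.product(range(d), repeat=k))
    return Fraction(hits, d ** k)


x = (0, 1, 1, 0)
y = (0, 1, 1, 1)  # Hamming distance 1, so a single function collides with probability 3/4
p = powered_collision_probability(x, y, k=2)
```

Because the `k` functions are sampled independently, the collision probability of the concatenation is the `k`-th power of the single-function probability. With `k = 0` the enumeration contains only the empty tuple, so every pair collides, matching the convention of a single constant function.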
Model of computation.
We will work in the standard word-RAM model of computation [85] with a word length of bits where denotes the size of the collection to be searched in the -near neighbor problem. During the preprocessing stage of our solutions we will assume access to a source of randomness that allows us to sample independently from a family and to seed pairwise independent hash functions [41, 42]. The latter can easily be accomplished by augmenting the model with an instruction that generates a uniformly random word in constant time and using that to seed the tables of a Zobrist hash function [173].
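A Zobrist (simple tabulation) hash function seeded from random words can be sketched as follows. The character width, the number of characters, and the output length are illustrative assumptions, and Python's pseudorandom generator stands in for the instruction that produces a uniformly random word.

```python
import random


def sample_zobrist(num_chars, char_bits, out_bits, rng):
    """Fill one table of random out_bits-bit words per character position of the key."""
    return [[rng.getrandbits(out_bits) for _ in range(1 << char_bits)]
            for _ in range(num_chars)]


def zobrist_hash(tables, char_bits, key):
    """XOR together one table entry per character of the key."""
    mask = (1 << char_bits) - 1
    h = 0
    for table in tables:
        h ^= table[key & mask]
        key >>= char_bits
    return h


rng = random.Random(3)
tables = sample_zobrist(num_chars=4, char_bits=8, out_bits=32, rng=rng)
value = zobrist_hash(tables, 8, 0xDEADBEEF)
```

Each evaluation performs one table lookup and one XOR per character, so a 32-bit key split into four 8-bit characters needs four lookups.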
7 Frameworks
Overview.
We will describe frameworks that take as input a -sensitive family and a collection of points and construct a data structure that solves the -near neighbor problem. The frameworks described in this paper all use the same high-level technique of constructing hash functions that are used to partition space such that a pair of points with will end up in the same part of one of the partitions with probability at least . That is, for with we have that where is used to denote the set . At the same time we ensure that the expected number of collisions between pairs of points with is at most one in each partition.
Preprocessing and queries.
During the preprocessing phase, for each of the hash functions we compute the partition of the collection of points induced by and store it in a hash table in the form of key-value pairs . To reduce space usage we store only a single copy of the collection and store references to in our hash tables. To guarantee lookups in constant time we can use the perfect hashing scheme by Fredman et al. [76] to construct our hash tables. We will assume that hash values fit into words. If this is not the case we can use universal hashing [40] to operate on fingerprints of the hash values.
We perform a query for a point as follows: for we compute , retrieve the set of points , and compute the distance between and each point in the set. If we encounter a point with then we return and terminate. If after querying the sets no such point is encountered we return a special symbol and terminate.
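The preprocessing and query steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the construction analyzed in the text: it uses bit sampling in Hamming space as the locality-sensitive family, ordinary Python dictionaries in place of perfect hashing, and the hypothetical names `LSHIndex`, `make_bit_sampling_hash`, and `max_dist` (standing in for the distance threshold at which a point is reported).

```python
import random
from collections import defaultdict


def hamming(x, y):
    """Hamming distance between two equal-length tuples of bits."""
    return sum(a != b for a, b in zip(x, y))


def make_bit_sampling_hash(dim, k, rng):
    """Concatenate k random coordinate projections (assumed example family)."""
    coords = [rng.randrange(dim) for _ in range(k)]
    return lambda point: tuple(point[i] for i in coords)


class LSHIndex:
    def __init__(self, points, dim, k, num_tables, seed=0):
        rng = random.Random(seed)
        self.points = points  # single copy of the collection
        self.hashes = [make_bit_sampling_hash(dim, k, rng)
                       for _ in range(num_tables)]
        self.tables = []
        for g in self.hashes:
            table = defaultdict(list)
            for idx, p in enumerate(points):
                table[g(p)].append(idx)  # store references, not copies
            self.tables.append(table)

    def query(self, q, max_dist):
        """Return the first colliding point within max_dist, else None."""
        for g, table in zip(self.hashes, self.tables):
            for idx in table.get(g(q), ()):
                if hamming(self.points[idx], q) <= max_dist:
                    return self.points[idx]
        return None  # plays the role of the special "no point found" symbol
```

The query loop terminates as soon as a sufficiently close point is found, so the worst-case cost is one hash evaluation per table plus one distance computation per colliding point.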
We will proceed by describing and analyzing the solutions to the -near neighbor problem for different approaches to sampling, storing, and computing the hash functions , resulting in the different frameworks as mentioned in the introduction.
7.1 Indyk-Motwani
To solve the -near neighbor problem using the Indyk-Motwani framework we sample hash functions independently at random from the family where we set and . Correctness of the data structure follows from the observation that the probability that a pair of points with does not collide under a randomly sampled is at most . We can therefore upper bound the probability that a near pair of points does not collide under any of the hash functions by using a standard bound stated as Lemma 2.3 in Appendix 11.
In the worst case, the query operation computes hash functions from corresponding to hash functions from . For a query point the expected number of points with that collide with under a randomly sampled is at most . It follows from linearity of expectation that the total expected number of distance computations during a query is at most . The result is summarized in Theorem 2.6 from which the simplified Theorem 2.1 follows.
Theorem 2.6 (Indyk-Motwani [91, 86]).
Given a -sensitive family we can construct a data structure that solves the -near neighbor problem such that for and the data structure has the following properties:
- •
The query operation uses at most evaluations of hash functions from , expected distance computations, and other word-RAM operations.
- •
The data structure uses words of space in addition to the space required to store the data and hash functions from .
Theorem 2.6 gives a bound on the expected number of distance computations while the simplified version stated in Theorem 2.1 uses Markov’s inequality and independent repetitions to remove the expectation from the bound by treating an excessive number of distance computations as a failure.
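To make the parameter choices concrete, the following sketch computes them numerically. It assumes the textbook setting of the Indyk-Motwani framework, namely k = ⌈log n / log(1/p2)⌉ concatenated functions per table and L = ⌈1/p1^k⌉ tables; the function name and the symbols `n`, `p1`, `p2` are our own labels and may differ from the exact constants used in the theorem.

```python
import math


def indyk_motwani_parameters(n, p1, p2):
    """Assumed textbook parameters: k concatenations and L tables."""
    k = math.ceil(math.log(n) / math.log(1.0 / p2))
    L = math.ceil(1.0 / p1 ** k)
    rho = math.log(1.0 / p1) / math.log(1.0 / p2)
    return k, L, rho


def expected_far_collisions_per_table(n, p2, k):
    """Upper bound on the expected number of far points colliding with a query."""
    return n * p2 ** k


def failure_probability(p1, k, L):
    """Probability that a near pair collides in none of the L tables."""
    return (1.0 - p1 ** k) ** L
```

For example, with n = 10^6, p1 = 0.9, and p2 = 0.5 this choice gives k = 20 and L = 9, with less than one expected far collision per table and a failure probability below 1/e.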
7.2 Andoni-Indyk
In 2006 Andoni and Indyk, as part of a paper on the substring near neighbor problem, introduced an improvement to the Indyk-Motwani framework that reduces the number of locality-sensitive hash functions [10]. Their improvement comes from the use of a technique that we will refer to as tensoring: setting the hash functions to be all -tuples from a collection of functions sampled from where . The analysis in [10] shows that by setting and repeating the entire scheme times, the total number of hash functions can be reduced to when setting . This analysis ignores integer constraints on , , and , and implicitly places restrictions on and in relation to (e.g. are constant). We will introduce a slightly different scheme that takes into account integer constraints and analyze it without restrictions on the properties of .
Assume that we are given a -sensitive family . Let be non-negative integer parameters. Each of the hash functions will be formed by concatenating one hash function from each of collections of hash functions from and concatenating a last hash function from a collection of hash functions from . We take all hash functions of the above form and repeat times for a total of hash functions constructed from a total of hash functions from . In Appendix 12 we set parameters, leaving variable, and provide an analysis of this scheme, showing that matches the Indyk-Motwani framework bound of up to a constant where as in Theorem 2.6.
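The core of the tensoring idea can be illustrated with a short sketch. It covers only the basic step of combining one function from each of t collections of m functions, not the revised scheme with a separate last collection and repetitions; bit sampling is again an assumed example family, and the names `sample_collections`, `tensored_hash_values`, `t`, `m`, and `k` are hypothetical.

```python
import itertools
import random


def sample_collections(dim, t, m, k, seed=0):
    """Sample t collections of m functions, each a concatenation of k bit samples."""
    rng = random.Random(seed)

    def sample_function():
        coords = [rng.randrange(dim) for _ in range(k)]
        return lambda point: tuple(point[i] for i in coords)

    return [[sample_function() for _ in range(m)] for _ in range(t)]


def tensored_hash_values(collections, point):
    """Return all m**t combined hash values of a point."""
    # Evaluate every underlying function exactly once: t * m evaluations.
    values = [[h(point) for h in coll] for coll in collections]
    # Combine one value from each collection: m**t combined keys.
    return [sum(combo, ()) for combo in itertools.product(*values)]
```

With t = 2 and m = 3, six underlying evaluations yield nine combined hash values, which is the source of the savings in locality-sensitive hash function evaluations.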
Setting .
It remains to show how to set to obtain a good bound on the number of hash functions . Note that in practice we can simply set by trying . If we ignore integer constraints and place certain restrictions on as in the original tensoring scheme by Andoni and Indyk, we want to set to minimize the expression . This minimum is obtained when setting such that . We therefore cannot do much better than setting which gives the bound as shown in [10]. To allow for easy comparison with the Indyk-Motwani framework without placing restrictions on we set , resulting in Theorem 2.7.
Theorem 2.7.
Given a -sensitive family we can construct a data structure that solves the -near neighbor problem such that for , , and the data structure has the following properties:
- •
The query operation uses evaluations of functions from , distance computations, and other word-RAM operations.
- •
The data structure uses words of space in addition to the space required to store the data and hash functions from .
Thus, compared to the Indyk-Motwani framework we have gone from using locality-sensitive hash functions to locality-sensitive hash functions. Figure 2 shows the actual number of hash functions of the revised version of the Andoni-Indyk scheme as analyzed in Appendix 12 when is set to minimize .
7.3 Dahlgaard-Knudsen-Thorup
In a recent paper Dahlgaard et al. [64] introduce a different technique for reducing the number of locality-sensitive hash functions. The idea is to construct each hash value by sampling and concatenating hash values from a collection of pre-computed hash functions from . Dahlgaard et al. applied this technique to provide a fast solution to the approximate near neighbor problem for sets under Jaccard similarity. In this paper we use the same technique to derive a general framework solution that works with every family of locality-sensitive hash functions, reducing the number of locality-sensitive hash functions compared to the Indyk-Motwani and Andoni-Indyk frameworks.
Let denote the set of integers . For and let denote a hash function in our collection. To sample from the collection we use pairwise independent hash functions [42] of the form and set
[TABLE]
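The sampling step can be sketched as follows, under the assumption that each combined hash value takes its i-th component from row i of a k-by-m table of pre-computed values, with the column chosen by the i-th selector function applied to the table index. The selector used here, ((a*l + b) mod p) mod m, is only approximately pairwise independent and is a stand-in for the families cited in the text; all names are hypothetical.

```python
import random

P = 2**61 - 1  # a Mersenne prime used by the illustrative selector family


def make_pairwise_hash(m, rng):
    """Approximately pairwise independent map from table indices to [0, m)."""
    a = rng.randrange(1, P)
    b = rng.randrange(P)
    return lambda l: ((a * l + b) % P) % m


def dkt_hash_values(precomputed, selectors, L):
    """Build L combined hash values from a k-by-m table of pre-computed values.

    precomputed[i][j] holds the value of the j-th function in row i at the
    point being hashed; selectors[i] picks the column used in row i.
    """
    k = len(precomputed)
    return [tuple(precomputed[i][selectors[i](l)] for i in range(k))
            for l in range(L)]
```

Only k * m locality-sensitive hash functions are evaluated per point, while the number of combined hash values L can be much larger.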
To show correctness of this scheme we will make use of an elementary one-sided version of Chebyshev’s inequality stating that for a random variable with mean and variance we have that . For completeness we have included the proof of this inequality in Lemma 2.5 in Appendix 11. We will apply this inequality to upper bound the probability that there are no collisions between close pairs of points. For two points and let so that denotes the number of collisions under the hash functions. To apply the inequality we need to derive an expression for the expectation and the variance of the random variable . Let then by linearity of expectation we have that . To bound we proceed by bounding where we note that for and make use of the independence between and for .
[TABLE]
We have that which follows from the pairwise independence of . Let and set then for we have that . This allows us to bound the variance of by resulting in the following lower bound on the probability of collision between similar points.
Lemma 2.1.
For let , then for every pair of points with we have that
[TABLE]
By setting and we obtain an upper bound on the failure probability of . Setting the size of each of the collections of pre-computed hash values to is sufficient to yield the following solution to the -near neighbor problem where we provide exact bounds on the number of lookups and hash functions :
Theorem 2.8 (Dahlgaard-Knudsen-Thorup [64]).
Given a -sensitive family we can construct a data structure that solves the -near neighbor problem such that for , , and the data structure has the following properties:
- •
The query operation uses at most evaluations of hash functions from , expected distance computations, and other word-RAM operations.
- •
The data structure uses words of space in addition to the space required to store the data and hash functions from .
Compared to the Indyk-Motwani framework we have reduced the number of locality-sensitive hash functions from to at the cost of using twice as many lookups. To reduce the number of lookups further we can decrease and perform several independent repetitions. This comes at the cost of an increase in the number of hash functions .
8 Reducing the word-RAM complexity
One drawback of the DKT framework is that each hash value still takes word-RAM operations to compute, even after the underlying locality-sensitive hash functions are known. This results in a bound on the total number of additional word-RAM operations of . We show how to combine the DKT universal hashing technique with the AI tensoring technique to ensure that the running time is dominated by distance computations and hash function evaluations. The idea is to use the DKT scheme to construct two collections of respectively and hash functions, and then to use the AI tensoring approach to form as the combinations of functions from the two collections. The number of lookups can be reduced by applying tensoring several times in independent repetitions, but for the sake of simplicity we use a single repetition. For the usual setting of let and . Set and . According to Lemma 2.1 if we set the success probability of each collection is at least and by a union bound the probability that either collection fails to contain a colliding hash function is at most . This concludes the proof of our main Theorem 2.4.
8.1 Sketching
The theorems of the previous section made no assumptions on the word-RAM complexity of distance computations and instead stated the number of distance computations as part of the query complexity. We can use a -sensitive family to create sketches that allow us to efficiently approximate the distance between pairs of points, provided that the gap between and is sufficiently large. In this section we will re-state the results of Theorem 2.4 when applying the family to create sketches using the 1-bit sketching scheme of Li and König [103]. Let be a positive integer denoting the length of the sketches in bits. The advantage of this scheme is that we can use word level parallelism to evaluate a sketch of bits in time in our word-RAM model with word length .
For let denote a randomly sampled locality-sensitive hash function from and let denote a randomly sampled universal hash function. We let denote the sketch of a point where we set the th bit of the sketch . For two points the probability that they agree on the th bit is if the points collide under and otherwise.
[TABLE]
We will apply these sketches during our query procedure instead of direct distance computations when searching through the points in the buckets, comparing them to our query point . Let be a parameter that will determine whether we report a point or not. For sketches of length we will return a point if . An application of Hoeffding’s inequality gives us the following properties of the sketch:
Lemma 2.2.
Let be a -sensitive family and let , then for sketches of length and for every pair of points :
- •
If then .
- •
If then .
If we replace the exact distance computations with sketches we want to avoid two events: Failing to report a point with and reporting a point with . By setting and applying a union bound over the events that the sketch fails for a point in our collection we obtain Theorem 2.5.
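The 1-bit sketching scheme can be illustrated with a small sketch. It assumes MinHash on sets of integers as the locality-sensitive family and a simple linear map modulo a prime as an approximately universal hash to a single bit; both are our own choices for illustration, as are the names `OneBitSketcher` and `agreement`. Sketches are packed into Python integers so that comparison is an XOR followed by a population count, mirroring the word-level parallelism described above.

```python
import random

P = 2**61 - 1  # Mersenne prime for the illustrative hash families


class OneBitSketcher:
    def __init__(self, length, seed=0):
        rng = random.Random(seed)
        self.length = length
        # Per bit: (a, b) define the MinHash function, (c, d) the 1-bit hash.
        self.params = [(rng.randrange(1, P), rng.randrange(P),
                        rng.randrange(1, P), rng.randrange(P))
                       for _ in range(length)]

    def sketch(self, items):
        """Pack one bit per sampled hash function into a single integer."""
        bits = 0
        for i, (a, b, c, d) in enumerate(self.params):
            minhash = min((a * e + b) % P for e in items)  # LSH value
            bit = ((c * minhash + d) % P) & 1              # hash to one bit
            bits |= bit << i
        return bits


def agreement(s1, s2, length):
    """Fraction of sketch positions on which two sketches agree."""
    return 1.0 - bin(s1 ^ s2).count("1") / length
```

Two points agree on a given bit with probability one if they collide under the underlying hash function and with probability roughly one half otherwise, so the agreement fraction concentrates around (1 + p)/2 where p is the collision probability of the pair.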
9 The number of hash functions in corner cases
When the collision probabilities of the -sensitive family are close to one we get the behavior displayed in Figure 3 where we have set . Here it may be possible to reduce the number of hash functions by applying the DKT framework to the family for some positive integer . That is, instead of applying the DKT technique directly to we first apply the powering trick to produce the family . The number of locality-sensitive hash functions from used by the DKT framework is given by . If we instead use the family the expression becomes . Ignoring integer constraints, the value of that maximizes , thereby minimizing , is given by . Discretizing, the resulting number of hash functions when setting is given by . For constant and large this reduces the number of hash functions by a factor .
The behavior for small values of is displayed in Figure 4 where we have set .
10 Conclusion and open problems
We have shown that there exists a simple and general framework for solving the -near neighbor problem using only few locality-sensitive hash functions and with a reduced word-RAM complexity matching the number of lookups. The analysis in this paper indicates that the performance of the Dahlgaard-Knudsen-Thorup framework is highly competitive compared to the Indyk-Motwani framework in practice, especially when locality-sensitive hash functions are expensive to evaluate, as is often the case.
An obvious open problem is to provide a framework that uses fewer than locality-sensitive hash functions. Another direction would be to find a lower bound on the number of independent locality-sensitive hash functions required to solve the ANN problem using LSH in a suitably restricted model.
Acknowledgement
I want to thank Rasmus Pagh for commenting on an earlier version of this manuscript and for making me aware of the application of the tensoring technique in [152] that led me to the Andoni-Indyk framework [10].
11 Appendix: Inequalities
We make use of the following standard inequalities for the exponential function. See [111, Chapter 3.6.2] for more details.
Lemma 2.3.
Let such that and then .
Lemma 2.4.
For we have that .
We make use of a one-sided version of Chebyshev’s inequality to show correctness of the Dahlgaard-Knudsen-Thorup LSH framework.
Lemma 2.5 (Cantelli’s inequality).
Let be a random variable with and then .
Proof.
For every we have that
[TABLE]
Next we apply Markov’s inequality
[TABLE]
Set and use that to simplify
[TABLE]
∎
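For reference, the standard argument for this inequality can be written out as below. This is a sketch under the assumption that the lemma is stated for a random variable $X$ with mean $\mu > 0$ and finite variance $\sigma^2$, and that the free parameter (here called $t$) is set to $\sigma^2/\mu$; the notation is ours and may differ from the displays in the proof above.

```latex
% Assumed setting: E[X] = \mu > 0 and Var[X] = \sigma^2 < \infty.
\begin{align*}
\Pr[X \le 0]
  &= \Pr[\mu - X + t \ge \mu + t]
     && \text{for every } t \ge 0 \\
  &\le \Pr\!\left[(\mu - X + t)^2 \ge (\mu + t)^2\right]
     && \text{since } \mu + t > 0 \\
  &\le \frac{\mathbb{E}\!\left[(\mu - X + t)^2\right]}{(\mu + t)^2}
   = \frac{\sigma^2 + t^2}{(\mu + t)^2}
     && \text{by Markov's inequality} \\
  &= \frac{\sigma^2}{\mu^2 + \sigma^2}
     && \text{for } t = \sigma^2 / \mu .
\end{align*}
```

The second-to-last equality uses $\mathbb{E}[\mu - X] = 0$, so the cross term vanishes and $\mathbb{E}[(\mu - X + t)^2] = \sigma^2 + t^2$.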
To analyze the 1-bit sketching scheme by Li and König we make use of Hoeffding’s inequality:
Lemma 2.6 (Hoeffding [88, Theorem 1]).
Let be independent random variables satisfying for . Define and , then:
- •
For we have that .
- •
For we have that .
12 Appendix: Analysis of the Andoni-Indyk framework
Let denote the probability that a pair of points with collide in a single repetition of the scheme. A collision occurs if and only if there exists at least one hash function in each of the underlying collections where the points collide. It follows that
[TABLE]
To guarantee a collision with probability at least it suffices to set .
We will proceed by analyzing this scheme where we let be variable and set parameters as follows:
[TABLE]
To upper bound we begin by lower bounding . The second part of can be lower bounded using Lemma 2.3 to yield . To lower bound we first note that in the case where we have and the expression can be lower bounded by . The same lower bound holds in the case where . In the case where and we make use of Lemmas 2.3 and 2.4 to derive the lower bound.
[TABLE]
Using the bound we have that
[TABLE]
We can then bound the number of lookups and the expected number of distance computations
[TABLE]
Note that this matches the upper bound of the Indyk-Motwani LSH framework up to a constant factor.
To bound the number of hash functions from we use that and .
[TABLE]
Chapter 3 Space-time tradeoffs for similarity search
‘All that is gold does not glitter’
We present a framework for similarity search based on Locality-Sensitive Filtering (LSF), generalizing the Indyk-Motwani (STOC 1998) Locality-Sensitive Hashing (LSH) framework to support space-time tradeoffs. Given a family of filters, defined as a distribution over pairs of subsets of space that satisfies certain locality-sensitivity properties, we can construct a dynamic data structure that solves the approximate near neighbor problem on a collection of points in -dimensional space with query time , update time , and space usage . The space-time tradeoff is tied to the tradeoff between query time and update time (insertions/deletions), controlled by the exponents that are determined by the filter family.
Locality-sensitive filtering was introduced by Becker et al. (SODA 2016) together with a framework yielding a single, balanced, tradeoff between query time and space, further relying on the assumption of an efficient oracle for the filter evaluation algorithm. We extend the LSF framework to support space-time tradeoffs and through a combination of existing techniques we remove the oracle assumption.
Laarhoven (arXiv 2015), building on Becker et al., introduced a family of filters with space-time tradeoffs for the high-dimensional unit sphere under inner product similarity and analyzed it for the important special case of random data. We show that a small modification to the family of filters gives a simpler analysis that we use, together with our framework, to provide guarantees for worst-case data. Through an application of Bochner’s Theorem from harmonic analysis by Rahimi & Recht (NIPS 2007), we are able to extend our solution on the unit sphere to $\mathbb{R}^d$ under the class of similarity measures corresponding to real-valued characteristic functions. For the characteristic functions of -stable distributions we obtain a solution to the -near neighbor problem in -spaces with query and update exponents and where is a tradeoff parameter. This result improves upon the space-time tradeoff of Kapralov (PODS 2015) and is shown to be optimal in the case of a balanced tradeoff, matching the LSH lower bound by O’Donnell et al. (ITCS 2011) and a similar LSF lower bound proposed in this paper. Finally, we show a lower bound for the space-time tradeoff on the unit sphere that matches Laarhoven’s and our own upper bound in the case of random data.
13 Introduction
Let denote a space over a set equipped with a symmetric measure of dissimilarity (a distance function in the case of metric spaces). We consider the -near neighbor problem first introduced by Minsky and Papert [110, p. 222] in the 1960’s. A solution to the -near neighbor problem for a set of points in takes the form of a data structure that supports the following operation: given a query point , if there exists a data point such that then report a data point such that . In some spaces it turns out to be convenient to work with a measure of similarity rather than dissimilarity. We use to denote a symmetric measure of similarity and define the -similarity problem to be the -near neighbor problem in .
A solution to the -near neighbor problem can be viewed as a fundamental building block that yields solutions to many other similarity search problems such as the -approximate nearest neighbor problem [89, 86]. In particular, the -near neighbor problem is well-studied in -spaces where the data points lie in $\mathbb{R}^d$ and distances are measured by . Notable spaces include the Euclidean space , Hamming space , and the -dimensional unit sphere under inner product (cosine) similarity .
Curse of dimensionality.
All known solutions to the -near neighbor problem for (the exact near neighbor problem) either suffer from a space usage that is exponential in or a query time that is linear in [86]. This phenomenon is known as the “curse of dimensionality” and has been observed both in theory and practice. For example, Alman and Williams [6] recently showed that the existence of an algorithm for determining whether a set of points in -dimensional Hamming space contains a pair of points that are exact near neighbors with a running time strongly subquadratic in would refute the Strong Exponential Time Hypothesis (SETH) [169]. This result holds even when is rather small, . From a practical point of view, Weber et al. [167] showed that the performance of many of the tree-based approaches to similarity search from the field of computational geometry [68] degrades rapidly to a linear scan as the dimensionality increases.
Approximation to the rescue.
If we allow an approximation factor of then there exist solutions to the -near neighbor problem with query time that is strongly sublinear in and space polynomial in where both the space and time complexity of the solution depend only polynomially on . Techniques for overcoming the curse of dimensionality through approximation were discovered independently by Kushilevitz et al. [99] and Indyk and Motwani [91]. The latter, classical work by Indyk and Motwani [91, 86] introduced a general framework for solving the -near neighbor problem known as Locality-Sensitive Hashing (LSH). The introduction of the LSH framework has inspired an extensive literature (see e.g. [12, 165] for surveys) that represents the state of the art in terms of solutions to the -near neighbor problem in high-dimensional spaces [91, 47, 66, 129, 11, 12, 9, 13, 17, 95, 19, 24, 100].
Hashing and filtering frameworks.
The LSH framework and the more recent LSF framework introduced by Becker et al. [24] produce data structures that solve the -near neighbor problem with query and update time and space usage . The LSH (LSF) framework takes as input a distribution over partitions (subsets) of space with the locality-sensitivity property that close points are more likely to be contained in the same part (subset) of a randomly sampled element from the distribution. The frameworks proceed by constructing a data structure that associates each point in space with a number of memory locations or “buckets” where data points are stored. During a query operation the buckets associated with the query point are searched by computing the distance to every data point in the bucket, returning the first suitable candidate. The set of memory locations associated with a particular point is independent of whether an update operation or a query operation is being performed. This symmetry between the query and update algorithm results in solutions to the near neighbor problem with a balanced space-time tradeoff. The exponent is determined by the locality-sensitivity properties of the family of partitions/hash functions (LSH) or subsets/filters (LSF) and is typically upper bounded by an expression that depends only on the approximation factor . For example, Indyk and Motwani [91] gave a simple locality-sensitive family of hash functions for Hamming space with an exponent of . This exponent was later shown to be optimal by O’Donnell et al. [121] who gave a lower bound of in the setting where and are small compared to . The advantage of having a general framework for similarity search lies in the reduction of the -near neighbor problem to the, often simpler and easier to analyze, problem of finding a locality-sensitive family of hash functions or filters for the space of interest.
Space-time tradeoffs.
The study of space-time tradeoffs for solutions to the -near neighbor problem is an active line of research, motivated by practical applications where it is desirable to choose the tradeoff between query time and update time (space usage) that is best suited for the application and memory hierarchy at hand [129, 106, 9, 95, 100]. Existing solutions typically have query time , update time (insertions/deletions) , and use space where the query and update exponents that control the space-time tradeoff depend on the approximation factor and on a tradeoff parameter .
This paper combines a number of existing techniques [24, 100, 73] to provide a general framework for similarity search with space-time tradeoffs. The framework is used to show improved upper bounds on the space-time tradeoff in the well-studied setting of -spaces and the unit sphere under inner product similarity. Finally, we show a new lower bound on the space-time tradeoff for the unit sphere that matches an upper bound for random data on the unit sphere by Laarhoven [100]. We proceed by stating our contribution and briefly surveying the relevant literature in terms of frameworks, upper bounds, and lower bounds as well as some recent developments. See Table 1 for an overview.
13.1 Contribution
Before stating our results we give a definition of locality-sensitive filtering that supports asymmetry in the framework query and update algorithm, yielding space-time tradeoffs.
Definition 3.1.
Let be a space and let be a probability distribution over . We say that is -sensitive if for all points and sampled randomly from the following holds:
- •
If then .
- •
If then .
- •
and .
We refer to as a filter and to as the query filter and as the update filter.
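As an illustration of an asymmetric filter, the sketch below samples a filter from a family based on standard normal projections, the kind of family analyzed later for the unit sphere. The concrete form is an assumption made for this example: a point belongs to the query filter if its inner product with a random Gaussian vector is at least `t_query`, and to the update filter if it is at least `t_update`. The function names and thresholds are hypothetical. The second function estimates by simulation the probability that a query point lands in the query filter while a data point with inner product `alpha` lands in the update filter.

```python
import math
import random


def sample_filter(dim, t_query, t_update, rng):
    """Sample one (query filter, update filter) pair as membership tests."""
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]

    def dot(x):
        return sum(zi * xi for zi, xi in zip(z, x))

    return (lambda x: dot(x) >= t_query), (lambda x: dot(x) >= t_update)


def estimate_joint_probability(alpha, t_query, t_update,
                               dim=8, trials=20000, seed=0):
    """Monte Carlo estimate of Pr[x in query filter and y in update filter]."""
    rng = random.Random(seed)
    # Two unit vectors with inner product alpha; by rotational symmetry of
    # the Gaussian distribution the particular choice does not matter.
    x = [1.0] + [0.0] * (dim - 1)
    y = [alpha, math.sqrt(1.0 - alpha * alpha)] + [0.0] * (dim - 2)
    hits = 0
    for _ in range(trials):
        in_query, in_update = sample_filter(dim, t_query, t_update, rng)
        if in_query(x) and in_update(y):
            hits += 1
    return hits / trials
```

Choosing different thresholds for the two filters changes how many filters contain a query point versus a data point, which is the source of the asymmetry between query time and update time.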
Our main contribution is a general framework for similarity search with space-time tradeoffs that takes as input a locality-sensitive family of filters.
Theorem 3.1.
Suppose we have access to a family of filters that is -sensitive. Then we can construct a fully dynamic data structure that solves the -near neighbor problem with query time , update time , and space usage where and .
We give a worst-case analysis of a slightly modified version of Laarhoven’s [100] filter family for the unit sphere and plug it into our framework to obtain the following theorem.
Theorem 3.2.
For every choice of and there exists a solution to the -similarity problem in that satisfies the guarantees from Theorem 3.1 with exponents and .
We show how an elegant and powerful application of Bochner’s Theorem [140] by Rahimi and Recht [138] allows us to extend the solution on the unit sphere to a large class of similarity measures, yielding as a special case solutions for -space.
Theorem 3.3.
For every choice of , , and there exists a solution to the -near neighbor problem in that satisfies the guarantees from Theorem 3.1 with exponents and .
This result improves upon the state of the art for every choice of asymmetric query/update exponents [129, 11, 9, 95]. We conjecture that this tradeoff is optimal among the class of algorithms that independently of the data determine which locations in memory to probe during queries and updates. In the case of a balanced space-time tradeoff where we set our approach matches existing, optimal [121], data-independent solutions in -spaces [91, 66, 11, 117].
The LSF framework is very similar to the LSH framework, especially in the case where the filter family is symmetric ( for every filter in ). In this setting we show that the LSH lower bound by O’Donnell et al. applies to the LSF framework as well [121], confirming that the results of Theorem 3.3 are optimal when we set .
Theorem 3.4 (informal).
Every filter family that is symmetric and -sensitive in must have when is chosen to be sufficiently small.
Finally we show a lower bound on the space-time tradeoff that can be obtained in the LSF framework. Our lower bound suffers from two important restrictions. First, the filter family must be regular, meaning that all query filters and all update filters are of the same size. Second, the sizes of the query and update filters cannot differ by too much.
Theorem 3.5 (informal).
Every regular filter family that is -sensitive in -dimensional Hamming space with asymmetry controlled by cannot simultaneously have that and .
Together our upper and lower bounds imply that the filter family of concentric balls in Hamming space is asymptotically optimal for random data.
Techniques.
The LSF framework in Theorem 3.1 relies on a careful combination of “powering” and “tensoring” techniques. For positive integers and with the tensoring technique, a variant of which was introduced by Dubiner [73], allows us to simulate a collection of filters from a collection of filters by considering the intersection of all -subsets of filters. Furthermore, given a point we can efficiently list the simulated filters that contain . This latter property is crucial as we typically need filters to split our data into sufficiently small buckets for the search to be efficient. The powering technique lets us amplify the locality-sensitivity properties of a filter family in the same way that powering is used in the LSH framework [91, 12, 121].
To obtain results for worst-case data on the unit sphere we analyze a filter family based on standard normal projections using the same techniques as Andoni et al. [13] together with existing tail bounds on bivariate Gaussians. The approximate kernel embedding technique by Rahimi and Recht [138] is used to extend the solution on the unit sphere to a large class of similarity measures, yielding Theorem 3.3 as a special case.
The lower bound in Theorem 3.4 relies on an argument by contradiction against the LSH lower bounds by O’Donnell et al. [121] and uses a theoretical, inefficient, construction of a locality-sensitive family of hash functions from a locality-sensitive family of filters that is similar to the spherical LSH by Andoni et al. [14].
Finally, the space-time tradeoff lower bound from Theorem 3.5 is obtained through an application of an isoperimetric inequality by O’Donnell [120, Ch. 10] and is similar in spirit to the LSH lower bound by Motwani et al. [114].
13.2 Related work
The LSH framework takes a distribution over hash functions that partition space with the property that the probability of two points landing in the same partition is an increasing function of their similarity.
Definition 3.2.
Let be a space and let be a probability distribution over functions . We say that is -sensitive if for all points and sampled randomly from the following holds:
- •
If then .
- •
If then .
The properties of determine a parameter that governs the space and time complexity of the solution to the -near neighbor problem.
Theorem 3.6 (LSH framework [91, 86]).
Suppose we have access to a -sensitive hash family. Then we can construct a fully dynamic data structure that solves the -near neighbor problem with query time , update time , and with a space usage of where .
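As a small worked example of how the collision probabilities determine the exponent, the sketch below assumes the usual definition of the LSH exponent as log(1/p1) / log(1/p2) and evaluates it for bit sampling in d-dimensional Hamming space, where a random coordinate agrees on two points at distance r with probability 1 - r/d. The function names and the symbols `d`, `r`, `c` are our own labels.

```python
import math


def lsh_exponent(p1, p2):
    """Assumed definition of the exponent: log(1/p1) / log(1/p2)."""
    return math.log(1.0 / p1) / math.log(1.0 / p2)


def bit_sampling_exponent(d, r, c):
    """Exponent of bit sampling in Hamming space for radius r and factor c."""
    p1 = 1.0 - r / d       # collision probability for points at distance r
    p2 = 1.0 - c * r / d   # collision probability for points at distance c*r
    return lsh_exponent(p1, p2)
```

For d = 1000, r = 10, and c = 2 the exponent is approximately 0.4975, slightly below 1/c, in line with the exponent for Hamming space attributed to Indyk and Motwani in the introduction.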
The LSF framework by Becker et al. [24] takes a symmetric -sensitive filter family and produces a data structure that solves the -near neighbor problem with the same properties as the one produced by the LSH framework where instead we have . In addition, the framework assumes access to an oracle that is able to efficiently list the relevant filters containing a point out of a large collection of filters. The LSF framework in this paper removes this assumption, showing how to construct an efficient oracle as part of the framework.
In terms of frameworks that support space-time tradeoffs, Panigrahy [129] developed a framework based on LSH that supports the two extremes of the space-time tradeoff. In the language of Theorem 3.1, Panigrahy’s framework supports either setting for a solution that uses near-linear space at the cost of a slower query time, or setting for a solution with query time at the cost of a higher space usage. To obtain near-linear space the framework stores every data point in partitions induced by randomly sampled hash functions from a -sensitive LSH family . In comparison, the standard LSH framework from Theorem 3.6 uses such partitions where is determined by . For each partition induced by the query algorithm in Panigrahy’s framework generates a number of random points in a ball around the query point and searches the parts of the partition that they hash to. The query time is bounded by where and denotes conditional entropy, i.e. the query time is determined by how hard it is to guess where hashes to given that we know and . Panigrahy’s technique was used in a number of follow-up works that improve on solutions for specific spaces, but to our knowledge none of them state a general framework with space-time tradeoffs [106, 9, 95].
Upper bounds.
As is standard in the literature we state results in -spaces in terms of the properties of a solution to the -near neighbor problem. For results on the unit sphere under inner product similarity we instead use the -similarity terminology, defined in the introduction, as we find it to be cleaner and more intuitive while aligning better with the analysis. The -spaces, particularly and , as well as are some of the most well-studied spaces for similarity search and are also widely used in practice [165]. Furthermore, fractional norms ( for ) have been shown to perform better than the standard norms in certain use cases [2] which motivates finding efficient solutions to the near neighbor problem in general -space.
In the case of a balanced space-time tradeoff the best data-independent upper bounds for the -near neighbor problem in are solutions with an LSH exponent of for . This result is obtained through a combination of techniques. For the LSH based on -stable distributions by Datar et al. [66] can be used to obtain an exponent of for an arbitrarily small constant . For the ball-carving LSH by Andoni and Indyk [11] for Euclidean space can be extended to using the technique described by Nguyen [117, Section 5.5]. Theorem 3.3 matches (and potentially improves in the case of ) these results with a single unified technique and analysis that we find to be simpler.
For space-time tradeoffs in Euclidean space (again extending to for ) Kapralov [95], improving on Panigrahy’s results [129] in Euclidean space and using similar techniques, obtains a solution with query exponent and update exponent under the condition that where is an arbitrary positive constant. Comparing to our Theorem 3.3 it is easy to see that we improve upon Kapralov’s space-time tradeoff for all choices of and . In addition, Theorem 3.3 represents the first solution to the -near neighbor problem in Euclidean space that for every choice of constant obtains sublinear query time () using only near-linear space (). Due to the restrictions on Kapralov’s result he is only able to obtain sublinear query time for when the space usage is restricted to be near-linear. It appears that our improvements can primarily be attributed to our techniques allowing a more direct analysis. Kapralov uses a variation of Panigrahy’s LSH-based technique of, depending on the desired space-time tradeoff, either querying or updating additional memory locations around a point in the partition induced by . For a query point and a near neighbor his argument for correctness is based on guaranteeing that both the query algorithm and update algorithm visit the part where is a point lying between and , possibly leading to a loss of efficiency in the analysis. More details on the comparison of Theorem 3.3 to Kapralov’s result can be found in Appendix 23.
In terms of space-time tradeoffs on the unit sphere, Laarhoven [100] modifies a filter family introduced by Becker et al. [24] to support space-time tradeoffs, obtaining a solution for random data on the unit sphere (the -similarity problem with ) with query exponent and update exponent . Theorem 3.2 extends this result to provide a solution to the -similarity problem on the unit sphere for every choice of . This extension to worst case data is crucial for obtaining our results for -spaces in Theorem 3.3. We note that there exist other data-independent techniques (e.g. Valiant [162, Alg. 25]) for extending solutions on the unit sphere to , but they also require a solution for worst-case data on the unit sphere to work.
Lower bounds
The performance of an LSH-based solution to the near neighbor problem in a given space that uses a -sensitive family of hash functions is summarized by the value of the exponent . It is therefore of interest to lower bound in terms of the approximation factor . Motwani et al. [114] proved the first lower bound for LSH families in -dimensional Hamming space. They show that for every choice of then for some choice of it must hold that as goes to infinity under the assumption that is not too small ().
As part of an effort to show lower bounds for data-dependent locality-sensitive hashing, Andoni and Razenshteyn [19] strengthened the lower bound by Motwani et al. to in Hamming space. These lower bounds are initially shown in Hamming space and can then be extended to -space and the unit sphere by the fact that a solution in these spaces can be used to yield a solution in Hamming space, contradicting the lower bound if is too small. Translated to -similarity on the unit sphere, which is the primary setting for the lower bounds on LSF space-time tradeoffs in this paper, the lower bound by Andoni and Razenshteyn shows that an LSH on the unit sphere must have which is tight in the case of random data [13].
The lower bound uses properties of random walks over a partition of Hamming space: A random walk starting from a random point is likely to “walk out” of the part identified by in the partition induced by . The space-time tradeoff lower bound in Theorem 3.5 relies on a similar argument that lower bounds the probability that a random walk starting from a subset ends up in another subset , corresponding nicely to query and update filters in the LSF framework.
Using related techniques O’Donnell et al. [121] showed tight LSH lower bounds for -space of . The work by Andoni et al. [15] and Panigrahy et al. [130, 131] gives cell probe lower bounds for the -near neighbor problem, showing that in Euclidean space a solution with a query complexity of probes requires space at least . For more details on these lower bounds and how they relate to the upper bounds on the unit sphere see [16, 100].
Data-dependent solutions
The solutions to the -near neighbor problems considered in this paper are all data-independent. For the LSH and LSF frameworks this means that the choice of hash functions or filters used by the data structure, determining the mapping between points in space and the memory locations that are searched during the query and update algorithm, is made without knowledge of the data. Data-independent solutions to the -near neighbor problem for worst-case data have been the state of the art until recent breakthroughs by Andoni et al. [14] and Andoni and Razenshteyn [17] showing improved solutions to the -near neighbor problem in Euclidean space using data-dependent techniques. In this setting the solution obtained by Andoni and Razenshteyn has an exponent of compared to the optimal data-independent exponent of . Furthermore, they show that this exponent is optimal for data-dependent solutions in a restricted model [19].
Recent developments
Recent work by Andoni et al. [16], done independently of and concurrently with this paper, shows that Laarhoven’s upper bound for random data on the unit sphere can be combined with data-dependent techniques [17] to yield a space-time tradeoff in Euclidean space with satisfying . This improves the result of Theorem 3.3 and matches the lower bound in Theorem 3.5. In the same paper they also show a lower bound matching our lower bound in Theorem 3.5. Their lower bound is set in a more general model that captures both the LSH and LSF framework and they are able to remove some of the technical restrictions such as the filter family being regular that weaken the lower bound in this paper. In spite of these results we still believe that this paper presents an important contribution by providing a general and simple framework with space-time tradeoffs as well as improved data-independent solutions to nearest neighbor problems in -space and on the unit sphere. We would also like to point out the simplicity and power of using Rahimi and Recht’s [138] result to extend solutions on the unit sphere to spaces with similarity measures corresponding to real-valued characteristic functions, further described in Appendix 21.
14 A framework with space-time tradeoffs
We use a combination of powering and tensoring techniques to amplify the locality-sensitive properties of our initial filter family, and to simulate a large collection of filters that we can evaluate efficiently. We proceed by stating the relevant properties of these techniques which we then combine to yield our Theorem 3.1.
Lemma 3.1 (powering).
Given a -sensitive filter family for and a positive integer define the family as follows: we sample a filter from by sampling independently from and setting . The family is -sensitive for .
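The powering construction can be illustrated with a minimal sketch. It assumes, purely for illustration, that a filter is represented as a pair of Python predicates (query, update), and it uses a toy symmetric filter on bit vectors that is not one of the families analyzed in this paper. The names `power_filter`, `sample_bit_filter`, and `kappa` are hypothetical. Because the component filters are sampled independently, each containment probability of the powered filter is the corresponding single-filter probability raised to the power `kappa`.

```python
import random

def power_filter(sample_filter, kappa, rng):
    """Sample kappa filters independently and intersect them component-wise."""
    parts = [sample_filter(rng) for _ in range(kappa)]

    def query(x):
        return all(q(x) for q, _ in parts)

    def update(x):
        return all(u(x) for _, u in parts)

    return query, update

def sample_bit_filter(rng, d=8):
    """Toy symmetric filter on {0,1}^d: contains x when a random coordinate is 1."""
    i = rng.randrange(d)
    pred = lambda x: x[i] == 1
    return pred, pred

def estimate_containment(x, kappa, trials, seed=0):
    """Monte Carlo estimate of the probability that a powered query filter contains x."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        query, _ = power_filter(sample_bit_filter, kappa, rng)
        hits += query(x)
    return hits / trials
```

For a vector with half of its coordinates set, a single toy filter contains it with probability 1/2, so with `kappa = 3` the estimate should be close to 1/8.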
Let denote a collection (indexed family) of filters and let and denote the corresponding collections of query and update filters, that is, for we have that . Given a positive integer (typically ) we define to be the collection of filters formed by taking all the intersections of -combinations of filters from , that is, for every with we have that
[TABLE]
The following properties of the tensoring technique will be used to provide correctness, running time, and space usage guarantees for the LSF data structure that will be introduced in the next subsection. We refer to the evaluation time of a collection of filters as the time it takes, given a point to prepare a list of query filters containing and a list of update filters containing such that the next element of either list can be reported in constant time. We say that a pair of points is contained in a filter if and .
Lemma 3.2 (tensoring).
Let be a filter family that is -sensitive in . Let be a positive integer and let denote a collection of independently sampled filters from . Then the collection of filters has the following properties:
- If have distance at most then with probability at least there exists a filter in containing .
- If have distance greater than then the expected number of filters in containing is at most .
- In expectation, a point is contained in at most query filters and at most update filters in .
- The evaluation time and space complexity of is dominated by the time it takes to evaluate and store filters from .
Proof.
To prove the first property we note that there exists a filter in containing if at least filters in contain . The binomial distribution has the property that the median is at least as great as the mean rounded down [93]. By the choice of we have that the expected number of filters in containing is at least and the result follows. The second and third properties follow from the linearity of expectation and the fourth is trivial. ∎
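The tensoring technique can be sketched as follows, assuming for simplicity that each base filter is a single predicate (a symmetric filter) and that a simulated filter is identified by the tuple of indices of the base filters it intersects. The helper names and the coordinate filters are hypothetical and serve only as an illustration. The point of the sketch is that only the base filters are ever evaluated: the simulated filters containing a point are exactly the combinations drawn from the base filters that contain it.

```python
from itertools import combinations
from math import comb

def coordinate_filters(d):
    """Toy base collection: filter i contains x when coordinate i equals 1."""
    return [lambda x, i=i: x[i] == 1 for i in range(d)]

def tensored_filters_containing(x, base_filters, tau):
    """Iterate over the tau-wise intersections of base filters that contain x."""
    active = [i for i, f in enumerate(base_filters) if f(x)]
    return combinations(active, tau)

def count_tensored(x, base_filters, tau):
    """Number of simulated filters containing x, from the active base filters alone."""
    active = sum(1 for f in base_filters if f(x))
    return comb(active, tau)
```

With six coordinate filters and `tau = 2` the simulated collection has 15 filters, but a point contained in three base filters lies in only three of them, and these are listed after evaluating just the six base filters.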
14.1 The LSF data structure
We will introduce a dynamic data structure that solves the -near neighbor problem on a set of points . The data structure has access to a -sensitive filter family in the sense that it knows the parameters of the family and is able to sample, store, and evaluate filters from in time .
The data structure supports an initialization operation that initializes a collection of filters where for every filter we maintain a (possibly empty) set of points from . After initialization the data structure supports three operations: insert, delete, and query. The insert (delete) operation takes as input a point and adds (removes) the point from the set of points associated with each update filter in that contains . The query operation takes as input a point . For each query filter in that contains we proceed by computing the dissimilarity to every point associated with the filter. If a point satisfying is encountered, then is returned and the query algorithm terminates. If no such point is found, the query algorithm returns a special symbol “” and terminates.
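The three operations can be sketched as below. This is a simplified illustration under stated assumptions rather than the data structure of Theorem 3.1: the class name `LSFIndex` and its parameters are hypothetical, filters are explicit pairs of predicates that are all evaluated one by one (the framework instead simulates the collection so that relevant filters can be listed efficiently), points are assumed to be hashable, `threshold` stands in for the dissimilarity bound tested by the query algorithm, and `None` plays the role of the special failure symbol.

```python
class LSFIndex:
    """Naive filter-based index: one bucket of points per filter."""

    def __init__(self, filters, dissimilarity, threshold):
        # filters: list of (query_predicate, update_predicate) pairs, fixed at initialization
        self.filters = filters
        self.dissimilarity = dissimilarity
        self.threshold = threshold
        self.buckets = [set() for _ in filters]

    def insert(self, x):
        # Add x to the bucket of every update filter that contains it.
        for (_, update), bucket in zip(self.filters, self.buckets):
            if update(x):
                bucket.add(x)

    def delete(self, x):
        # Remove x from the bucket of every update filter that contains it.
        for (_, update), bucket in zip(self.filters, self.buckets):
            if update(x):
                bucket.discard(x)

    def query(self, y):
        # Scan the buckets of the query filters containing y and stop at the first hit.
        for (query, _), bucket in zip(self.filters, self.buckets):
            if query(y):
                for x in bucket:
                    if self.dissimilarity(x, y) <= self.threshold:
                        return x
        return None
```

A query only ever compares against points stored under a filter that also contains the query point, which is why the choice of filter family determines both correctness and running time.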
The data structure will combine the powering and tensoring techniques in order to simulate the collection of filters from two smaller collections: consisting of filters from and consisting of filters from . The collection of simulated filters is formed by taking all filters where is a member of and is a member of . It is due to the integer constraints on the parameter in the tensoring technique and the parameter in the powering technique that we simulate our filters from two underlying collections instead of just one. This gives us more freedom to hit a target level of amplification of the simulated filters which in turn makes it possible for the framework to support efficient solutions for a wider range of parameters of LSF families.
The initialization operation takes and parameters and samples and stores and . The filter evaluation algorithm used by the insert, delete, and query operation takes a point and computes for and , depending on the operation, the list of update or query filters containing . From these lists we are able to generate the list of filters in containing .
Setting the parameters of the data structure to guarantee correctness while balancing the contribution to the query time from the filter evaluation algorithm, the number of filters containing the query point, and the number of distant points examined, we obtain a partially dynamic data structure that solves the -near neighbor problem with failure probability . Using a standard dynamization technique by Overmars and Leeuwen [122, Thm. 1] we obtain a fully dynamic data structure resulting in Theorem 3.1. The details of the proof have been deferred to Appendix 19.
15 Gaussian filters on the unit sphere
In this section we show properties of a family of filters for the unit sphere under inner product similarity. Later we will show how to make use of this family to solve the near neighbor problem in other spaces, including for .
Lemma 3.3.
For every choice of , , and let denote the family of filters defined as follows: we sample a filter from by sampling and setting
[TABLE]
Then is locality-sensitive on the unit sphere under inner product similarity with exponents
[TABLE]
Laarhoven’s filter family [100] is identical to except that he normalizes the projection vectors to have unit length. The properties of can easily be verified with a simple back-of-the-envelope analysis using two facts. First, for a standard normal random variable we have that . Second, Gaussian projections are invariant to rotations, which allows us to analyze the projection of arbitrary points with inner product in a two-dimensional setting without any loss of generality. The proofs of Lemma 3.3 and Theorem 3.2 have been deferred to Appendix 20.
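A Gaussian projection filter of this kind can be sketched as follows. The exact thresholds of Lemma 3.3 are not reproduced here: `t_query` and `t_update` are placeholder parameters, and the function names are hypothetical. The sketch also includes the exact standard normal tail, which makes it easy to check numerically that a unit vector falls in a filter with probability equal to a one-dimensional Gaussian tail.

```python
import math
import random

def sample_gaussian_filter(d, t_query, t_update, rng):
    """Sample z with i.i.d. standard normal entries and threshold the projection <z, x>."""
    z = [rng.gauss(0.0, 1.0) for _ in range(d)]

    def projection(x):
        return sum(zi * xi for zi, xi in zip(z, x))

    query = lambda x: projection(x) >= t_query    # placeholder query threshold
    update = lambda x: projection(x) >= t_update  # placeholder update threshold
    return query, update

def normal_tail(t):
    """Exact Pr[Z >= t] for a standard normal random variable Z."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))
```

For a unit vector the projection is itself standard normal, so the fraction of sampled query filters containing it should match `normal_tail(t_query)`, and for nonnegative `t` this tail is at most `exp(-t**2 / 2)`.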
16 Space-time tradeoffs under kernel similarity
In this section we will show how to combine the Gaussian filters for the unit sphere with kernel approximation techniques in order to solve the -similarity problem over for the class of similarity measures of the form where is a real-valued characteristic function [161]. For this class of functions there exists a feature map into a (possibly infinite-dimensional) dot product space such that . Through an elegant combination of Bochner’s Theorem and Euler’s Theorem, detailed in Appendix 21, Rahimi and Recht [138] show how to construct approximate feature maps, i.e., for every we can construct a function with the property that . We state a variant of their result for a mapping onto the unit sphere.
Lemma 3.4.
For every real-valued characteristic function and every positive integer there exists a family of functions such that for every and we have that
[TABLE]
Theorem 3.10 in Appendix 21 shows that Theorem 3.2 holds with the space replaced by .
16.1 Tradeoffs in -space
Consider the -near neighbor problem in for . We solve this problem by first applying the approximate feature map from Lemma 3.4 for the characteristic function of a standard -stable distribution [174], mapping the data onto the unit sphere, and then applying our solution from Theorem 3.2 to solve the appropriate -similarity problem on the unit sphere. The characteristic functions of -stable distributions take the following form:
Lemma 3.5 (Lévy [102]).
For every positive integer and there exists a characteristic function of the form
[TABLE]
A result by Chambers et al. [46] shows how to sample efficiently from -stable distributions.
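As an illustration of such a sampler, the following sketch implements the Chambers-Mallows-Stuck transformation for the symmetric case, assuming a stability index `alpha` in (0, 2] (the parameter name is chosen here) and the normalization in which the characteristic function is `exp(-|t|**alpha)`. It is a sketch for one-dimensional samples only and is not taken from the text of this paper.

```python
import math
import random

def sample_symmetric_stable(alpha, rng):
    """Draw one symmetric alpha-stable sample with characteristic function exp(-|t|**alpha)."""
    u = rng.uniform(-math.pi / 2.0, math.pi / 2.0)  # uniform angle
    w = max(rng.expovariate(1.0), 1e-300)           # unit exponential, guarded against zero
    scale = math.sin(alpha * u) / math.cos(u) ** (1.0 / alpha)
    tail = (math.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha)
    return scale * tail
```

Two special cases give a quick sanity check: `alpha = 2` yields a normal distribution with variance 2, and `alpha = 1` yields a standard Cauchy distribution, whose absolute value has median 1.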
To sketch the proof of Theorem 3.3 we proceed by upper bounding the exponents , from Theorem 3.2 when applying Lemma 3.4 to get and . We make use of the following standard fact (see e.g. [142]) that can be derived from the Taylor expansion of the exponential function: for it holds that . Scaling the data points such that and inserting the above values of and into the expressions for , in Lemma 3.3 we can set parameters and such that Theorem 3.3 holds.
17 Lower bounds
We begin by stating the lower bound on the LSH exponent by O’Donnell et al. [121].
Theorem 3.7 (O’Donnell et al. [121]).
Fix , , and . Then for a certain choice of and under the assumption that we have that every -sensitive family of hash functions for must satisfy
[TABLE]
The following lemma shows how to use a filter family to construct a hash family .
Lemma 3.6.
Given a symmetric family of filters that is -sensitive in we can construct a -sensitive family of hash functions in .
Proof.
Given the filter family we sample a random function from the hash family by taking an infinite sequence of independently sampled filters from and setting . The probability of collision is given by
[TABLE]
and the result follows from the properties of . ∎
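The MinHash-like construction in the proof can be simulated with a short sketch. Several things here are assumptions made for illustration: the infinite sequence of filters is replaced by a seeded pseudorandom sequence truncated at `max_filters`, a filter is a single predicate, and the toy coordinate filter on bit vectors is not one of the families studied in this paper. Writing p for the probability that a filter contains both points and q for the probability that it contains a given point (symbols chosen here), two points collide exactly when the first filter containing either of them contains both, which happens with probability p / (2q - p).

```python
import random

def sample_coordinate_filter(rng, d=8):
    """Toy symmetric filter on {0,1}^d: contains x when a random coordinate is 1."""
    i = rng.randrange(d)
    return lambda x: x[i] == 1

def first_filter_hash(x, seed, sample_filter, max_filters=10_000):
    """Hash x to the index of the first filter in a seeded sequence that contains it."""
    rng = random.Random(seed)  # the seed identifies the hash function
    for index in range(max_filters):
        if sample_filter(rng)(x):
            return index
    return None  # truncation artifact; the idealized sequence is infinite
```

For the toy filter the collision probability reduces to the Jaccard similarity of the two sets of nonzero coordinates, so it can be checked empirically by averaging over many seeds.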
If the LSH family in Lemma 3.6 had and then the lower bound would follow immediately. We apply the powering technique from Lemma 3.1 to the underlying filter family in order to make the factor in disappear in the statement of as tends to infinity.
Theorem 1.4.
Every symmetric -sensitive filter family for must satisfy the lower bound of Theorem 3.7 with and .
Proof.
Given a family that satisfies the requirements from Theorem 3.7 there exists an integer such that the hash family that results from applying Lemma 3.6 to the powered family also satisfies the requirements from Theorem 3.7. The constructed family is -sensitive for and . By our choice of we have that and the lower bound on from Theorem 3.7 applies. ∎
17.1 Asymmetric lower bound
The lower bound is based on an isoperimetric-type inequality that holds for randomly correlated points in Hamming space. We say that the pair of points is -correlated if is a random point in and is formed by taking and independently flipping each bit with probability . We are now ready to state O’Donnell’s generalized small-set expansion theorem. Notice the similarity to the value of for the Gaussian filter family described in Section 15 and Appendix 20.
Lemma 3.7 ([120, p. 285]).
For every , , and satisfying that we have
[TABLE]
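The notion of randomly correlated points used in the lemma above can be made concrete with a small sampler. The sketch takes the per-bit flip probability as an explicit parameter `flip_prob` (a name chosen here) instead of deriving it from the correlation parameter of the text. In the usual encoding of bits as +1 and -1, a flip probability of `flip_prob` corresponds to a per-coordinate correlation of `1 - 2 * flip_prob`.

```python
import random

def correlated_pair(d, flip_prob, rng):
    """Sample x uniformly from {0,1}^d and form y by flipping each bit independently."""
    x = [rng.randrange(2) for _ in range(d)]
    y = [bit ^ (rng.random() < flip_prob) for bit in x]
    return x, y
```

The expected Hamming distance between the two points is `flip_prob * d`, so for large `d` the pair is concentrated around a fixed relative distance.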
The argument for the lower bound assumes a regular -sensitive filter family for Hamming space where we set and for some choice of . We then proceed by deriving constraints on , , , , and minimizing and subject to those constraints. The proof of Theorem 1.5 is provided in Appendix 22.
Theorem 1.5.
Fix . Then for every regular -sensitive filter family in -dimensional Hamming space with and where satisfies it must hold that
[TABLE]
when is set to minimize and we assume that .
18 Open problems
An important open problem is to find simple and practical data-dependent solutions to the -near neighbor problem. Current solutions, the Gaussian filters in this paper included, suffer from terms in the exponents that decrease very slowly in . A lower bound for the unit sphere by Andoni et al. [13] indicates that this might be unavoidable.
Another interesting open problem is finding the shape of provably exactly optimal filters in different spaces. In the random data setting in Hamming space, this problem boils down to maximizing the number of pairs of points below a certain distance threshold that are contained in a subset of the space of a certain size. This is a fundamental problem in combinatorics that has been studied by, among others, [94], but a complete answer remains elusive. The LSH and LSF lower bounds [114, 121, 19], along with classical isoperimetric inequalities such as Harper’s Theorem and more recent work summarized in the book by O’Donnell [120], hint that the answer is somewhere between a subcube and a generalized sphere.
A recent result by Chierichetti and Kumar [49] characterizes the set of transformations of LSH-able similarity measures as the set of probability-generating functions. This seems to have deep connections to the result of this paper that uses characteristic functions that allow well-known kernel transformations. It seems possible that this paper can be viewed as a semi-explicit construction of their result, or that their result can be described as an application of Bochner’s Theorem.
Acknowledgment
I would like to thank Rasmus Pagh for suggesting the application of Rahimi & Recht’s result [138] and the MinHash-like [32] connection between LSF and LSH used in Theorem 1.4. I would also like to thank Gregory Valiant and Udi Wieder for useful discussions about locality-sensitive filtering and the analysis of boolean functions during my stay at Stanford. Finally, I would like to thank the Scalable Similarity Search group at the IT University of Copenhagen for feedback during the writing process, and in particular Martin Aumüller for pointing out the importance of a general framework for locality-sensitive filtering with space-time tradeoffs.
19 Appendix: Framework
We state a version of Theorem 3.1 where the parameters of the filter family are allowed to depend on .
Theorem 3.1.
Suppose we have access to a filter family that is -sensitive. Then we can construct a fully dynamic data structure that solves the -near neighbor problem. Assume that , , and are , then the data structure has
- query time ,
- update time ,
- space usage
where
[TABLE]
To prove Theorem 3.1, we begin by setting the parameters mentioned in the description of the LSF data structure in Section 14.1.
[TABLE]
We will now briefly explain the reasoning behind the parameter settings. Begin by observing that the powering and tensoring techniques both amplify the filters from . Let denote the number of simulated filters in our collection and let be an integer denoting the number of times each filter has been amplified. Ignoring the time it takes to evaluate the filters, the query time is determined by the sum of the number of filters that contain a query point and the number of distant points associated with those filters that the query algorithm inspects. The expected number of activated filters is given by while the worst case expected number of distant points to be inspected by the query algorithm is given by . Balancing the contribution to the query time from these two effects (ignoring the factor from distance computations) results in a target value of . Compared to having an oracle that is able to list the filters from a collection that contains a point, there is a small loss in efficiency from using the tensoring technique due to the increase in the number of filters required to guarantee correctness. The parameters of the LSF data structure are therefore set to minimize the use of tensoring such that the time spent evaluating our collection of filters roughly matches the minimum of the query and update time.
Consider the initialization operation of the LSF data structure with the parameter settings from above. We have that implying that . The initialization time and the space usage of the data structure prior to any insertions is dominated by the time and space used to sample and store the filters in . By the assumption that a filter from can be sampled in operations and stored using words, we get a space and time bound on the initialization operation of
[TABLE]
Importantly, this bound also holds for the running time of the filter evaluation algorithm, that is, the preprocessing time required for constant time generation of the next element in the list of filters in containing a point. In the following analysis of the update and query time we will temporarily ignore the running time of the filter evaluation algorithm.
The expected time to insert or delete a point is dominated by the number of update filters in that contain it. The probability that a particular update filter in contains a point is given by . Using a standard upper bound on the binomial coefficient we get that resulting in an expected update time of
[TABLE]
In the worst case where every data point is at distance greater than from the query point and has collision probability the expected query time can be upper bounded by
[TABLE]
With respect to the correctness of the query algorithm, if a near neighbor to the query point exists in , then it is found by the query algorithm if is contained in a filter in as well as in a filter in . By Lemma 3.2 the first event happens with probability at least and by the choice of , the second event happens with probability at least . From the independence between and we can upper bound the failure probability . This completes the proof of Theorem 3.1.
20 Appendix: Gaussian filters
In this section we upper and lower bound the probability mass in the tail of the bivariate standard normal distribution when the correlation between the two standard normals is at most (upper bound) or at least (lower bound). We make use of the following upper and lower bounds on the univariate standard normal as well as an upper bound for the multivariate case.
Lemma 3.8 (Follows Szarek & Werner [153]).
Let be a standard normal random variable. Then, for every we have that
[TABLE]
Lemma 3.9 (Lu & Li [104]).
Let be a -dimensional vector of i.i.d. standard normal random variables and let be a closed convex domain that does not contain the origin. Let denote the Euclidean distance to the unique closest point in , then we have that
[TABLE]
Lemma 3.10 (Tail upper bound).
For satisfying , , , and every pair of standard normal random variables with correlation satisfies
[TABLE]
where .
Proof.
For the result is trivial. For values of in the range we use the -stability of the normal distribution to analyze a tail bound for in terms of a Gaussian projection vector applied to unit vectors . That is, we can define and for some appropriate choice of and . Without loss of generality we set and note that for we must have that . If we consider the region of $\mathbb{R}^2$ where satisfies we get a closed domain defined by such that and . The squared Euclidean distance from the origin to the closest point in is at least , as can be seen by the fact that is decreasing in . Combining this observation with Lemma 3.9 we get the desired result. ∎
Lemma 3.11 (Tail lower bound).
For satisfying , , and every pair of standard normal random variables with correlation satisfies
[TABLE]
where .
Proof.
For the result follows directly from Lemma 3.8. For we use the trick from the proof of Lemma 3.10 and define and where and and is a vector of two i.i.d. standard normal random variables. This allows us to rewrite the probability as follows:
[TABLE]
By the restrictions on and we have that . The result follows from applying the lower bound from Lemma 3.8 and noting that the bound is increasing in . ∎
20.1 Space-time tradeoffs on the unit sphere
Summarizing the bound from the previous section, the family from Lemma 3.3 satisfies that
[TABLE]
We combine the Gaussian filters with Theorem 3.1 to show that we can solve the -similarity problem efficiently for the full range of space/time tradeoffs, even when are allowed to depend on , as long as the gap is not too small.
Theorem 3.2.
For every choice of and we can construct a fully dynamic data structure that solves the -similarity problem in . Suppose that for some constant , that satisfies the guarantees from Theorem 3.1 with exponents and .
Proof.
Assuming that there exists a constant where by setting the parameter of such that the family of filters satisfies the assumptions in Theorem 3.1 while guaranteeing that the second term in and from Lemma 3.3 are . ∎
Remark 3.1.
Theorem 3.2 aims for simplicity and generality while allowing and to depend on . For specific values of it is easy to find better bounds on the probabilities (e.g. the bounds by Savage [142]) and to adjust in Lemma 3.3 to avoid powering (setting ) in the LSF framework.
21 Appendix: Approximate feature maps, characteristic functions, and Bochner’s Theorem
We begin by defining what a characteristic function is and listing some properties that are useful for our application. More information about characteristic functions can be found in the books by Lukacs [105] and Ushakov [161].
Lemma 3.12 ([105, 161]).
Let denote a random variable with distribution function . Then the characteristic function of is defined as
[TABLE]
and it has the following properties:
- A distribution function is symmetric if and only if its characteristic function is real and even.
- Every characteristic function is uniformly continuous, has , and for all real .
- Suppose that denotes the characteristic function of an absolutely continuous distribution then .
- Let and be independent random variables with characteristic functions and . Then the characteristic function of is given by .
Bochner’s Theorem reveals the relation between characteristic functions and the class of real-valued functions that admit a feature space representation.
Theorem 3.8 (Bochner’s Theorem [140]).
A function is positive definite if and only if it can be written in the form
[TABLE]
where is the probability density function of a symmetric distribution.
Rahimi & Recht’s [138] family of approximate feature maps is constructed from Bochner’s Theorem by making use of Euler’s Theorem as follows:
[TABLE]
where the third equality makes use of the fact that is real-valued to remove the complex part of the integral and the fifth equality uses that .
Now that we have an approximate feature map onto the sphere for the class of shift-invariant kernels, we will take a closer look at what functions this class contains, and what their applications are for similarity search. Given an arbitrary similarity function, we would like to be able to determine whether it is indeed a characteristic function. Unfortunately, there are no known simple techniques for answering this question in general. However, the machine learning literature contains many applications of different shift-invariant kernels [144] and many common distributions have real characteristic functions (see Appendix B in [161] for a long list of examples). Characteristic functions are also well studied from a mathematical perspective [105, 161], and a number of different necessary and sufficient conditions are known. A classical result by Pólya [135] gives simple sufficient conditions for a function to be a characteristic function. Through the vectorization property from Lemma 3.12, Pólya’s conditions directly imply the existence of a large class of similarity measures on $\mathbb{R}^d$ that can fit into the above framework.
Theorem 3.9 (Pólya [135]).
Every even continuous function $\varphi : \mathbb{R} \to \mathbb{R}$ satisfying the properties
- $\varphi(0) = 1$,
- $\lim_{t \to \infty} \varphi(t) = 0$,
- $\varphi(t)$ is convex for $t > 0$
is a characteristic function.
Based on the results of Section 16.1 one could hope for the existence of characteristic functions of the form for but it is known that such functions cannot exist [25, Theorem D.8]. Furthermore, Marcinkiewicz [108] shows that a function of the form cannot be a characteristic function if the degree of the polynomial is greater than two.
We state a more complete, constructive version of Lemma 3.4 as well as the proof here.
Lemma 3.13.
Let be a real-valued characteristic function with associated distribution function and let be a positive integer. Consider the family of functions where a randomly sampled function is defined by, independently for , sampling from and uniformly on , letting and normalizing . The family has the property that for every and we have that
[TABLE]
Proof.
Since is bounded between and , and we have independence for different values of , Hoeffding’s inequality [88] can be applied to show that for every fixed pair of points and it holds that
[TABLE]
From the properties of characteristic functions we have that and . The bound on the deviation of
[TABLE]
from follows from setting and using a union bound over the probabilities that the deviation of one of the inner products is too large. ∎
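The construction in Lemma 3.13 can be sketched for one concrete choice of characteristic function. The sketch assumes the Gaussian kernel, for which the associated distribution is the standard multivariate normal with characteristic function `exp(-||t||**2 / 2)`; the function name `sample_feature_map` and the parameter name `l` for the number of random features are chosen here for illustration. Each feature is a scaled cosine of a random projection with a uniform phase shift, and the resulting vector is normalized onto the unit sphere.

```python
import math
import random

def sample_feature_map(d, l, rng):
    """Sample l random cosine features for the Gaussian kernel and normalize the output."""
    ws = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(l)]  # frequencies
    bs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(l)]          # phase shifts

    def feature(x):
        v = [
            math.sqrt(2.0 / l) * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + b)
            for w, b in zip(ws, bs)
        ]
        norm = math.sqrt(sum(c * c for c in v))
        return [c / norm for c in v]

    return feature
```

For two points `x` and `y` the inner product of their feature vectors should be close to `exp(-||x - y||**2 / 2)`, with the deviation shrinking as `l` grows, in line with the Hoeffding argument in the proof.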
Combining the approximate feature map onto the unit sphere with Theorem 3.2 we obtain the following:
Theorem 3.10.
Let be a characteristic function and define the similarity measure . Assume that we have access to samples from the distribution associated with , then Theorem 3.2 holds with replaced by .
Proof.
According to Lemma 3.13, we can set to obtain a map such that the inner product on preserves the pairwise similarity between points with additive error . This map has a space and time complexity of . After applying to the data we can solve the -similarity problem on by solving the -similarity problem on . We can use Theorem 3.2 to construct a fully dynamic data structure for solving this problem, adjusting the parameter so that it continues to lie in the admissible range. The space and time complexities follow. ∎
22 Appendix: Proof of tradeoff lower bound
Consider . Subject to the (implicit) LSF constraint that we see that is minimized by setting as small as possible and as large as possible. We will therefore derive lower bounds on and an upper bound on . For every value of and we minimize by choosing as small as possible.
For a random point it must hold that . This implies the existence of a fixed point with the property that . A regular filter family must therefore satisfy that and . Let be defined as in Lemma 3.7 then by a similar argument we have that .
In order to upper bound we make use of Lemma 3.7 together with the following lemma that follows directly from an application of Hoeffding’s inequality [88].
Lemma 3.14.
For every we have that
[TABLE]
In the following derivation, assume that satisfies , let denote randomly -correlated vectors in , and assume that , then
[TABLE]
Summarizing the bounds:
[TABLE]
When minimizing we have that . Setting results in . Putting things together:
[TABLE]
The derivation of the lower bound for is almost the same and the resulting expression is
[TABLE]
23 Appendix: Comparison to Kapralov
Kapralov uses to denote a parameter controlling the space-time tradeoff for his solution to the -near neighbor problem in Euclidean space. For every choice of tradeoff parameter , assuming that for arbitrarily small constant , Kapralov [95] obtains query and update exponents
[TABLE]
We convert Kapralov’s notation to our own by setting . To compare, Kapralov sets for near-linear space and we set . We want to write Kapralov’s exponents in the form
[TABLE]
for some that we will proceed to derive. We have that and . Multiplying the numerator and denominator in Kapralov’s exponents by we can write Kapralov’s exponents as
[TABLE]
We have that
[TABLE]
For every choice of , and under the assumption that for an arbitrarily small constant , this allows us to write Kapralov’s exponents as
[TABLE]
To compare Kapralov’s result against our own for search in -spaces we consider the exponents from Theorem 3.3, ignoring additive terms:
[TABLE]
Setting we obtain a data structure that uses near-linear space and we get a query exponent while Kapralov obtains an exponent of , ignoring terms. At the other end of the tradeoff, setting , we get a data structure with query time and update exponent while Kapralov gets an update exponent of , again ignoring additive terms.
The assumption made by Kapralov that means that in the case of a near-linear space data structure () sublinear query time can only be obtained for . In contrast, Theorem 3.3 gives sublinear query time for every constant .
24 Appendix: Details about dynamization and the model of computation
In order to obtain fully dynamic data structures we apply a powerful dynamization result of Overmars and Leeuwen [122] for decomposable searching problems. Their result allows us to turn a partially dynamic data structure into a fully dynamic data structure, supporting arbitrary sequences of queries and updates, at the cost of a constant factor in the space and running time guarantees. Suppose we have a partially dynamic data structure that solves the -near neighbor problem on a set of points. By partially dynamic we mean that, after initialization on a set of points, the data structure supports updates without changing the query time by more than a constant factor. Let , , and denote the query time, update time, and construction time of such a data structure containing points. Then, by Theorem 1 of Overmars and Leeuwen [122], there exists a fully dynamic version of the data structure with query time and update time that uses only a constant factor additional space. The data structures presented in this paper, as well as most related constructions from the literature, have the property that , allowing us to go from a partially dynamic to a fully dynamic data structure “for free” in big O notation.
In terms of guaranteeing that the query operation solves the -near neighbor problem on the set of points currently inserted into the data structure, we allow a constant failure probability , typically around , and omit it from our statements. We make the standard assumption that the adversary does not have knowledge of the randomness used by the data structure. Say we have a data structure with constant failure probability and a bound on the expected space usage. Then, for every positive integer we can create a collection of independent repetitions of the data structure such that for every sequence of operations it holds with high probability in that the space usage will never exceed the expectation by more than a constant factor and no query will fail.
24.1 Model of computation
We use the standard word RAM model as defined by Hagerup [85] with a word size of bits. Unless otherwise stated, we make the assumption that a point in can be stored in words and that the dissimilarity between two arbitrary points can be computed in operations where is a positive integer that corresponds to the dimension in the various well-studied settings mentioned in the main text. Furthermore, when describing framework-based solutions to the -near neighbor problem, we make the assumption that we can sample, evaluate, and represent elements from and with negligible error using space and time .
Many of the LSH and LSF families rely on random samples from the standard normal distribution. We will ignore potential problems resulting from rounding due to the fact that our model only supports finite precision arithmetic. This approach is standard in the literature and can be justified by noting that the error introduced by rounding is negligible. Furthermore, there exist small pseudorandom standard normal distributions that support sampling using only a few uniformly distributed bits as noted by Charikar [47]. In much of the related literature the model of computation is left unspecified and statements about the complexity of solutions to the -near neighbor problem are usually made with respect to particular operations such as the hash function computations, distance computations, etc., leaving out other details [91, 86].
25 Addendum: An improved framework
The LSF framework in Theorem 3.1 suffers from large lower-order terms that depend on the -sensitivity properties of . With the parameterization in Appendix 19 the framework uses filters from where . In addition, the query and update time have a multiplicative factor which can potentially be very large and where we have to assume explicitly that . We will use a combination of techniques in recent work on set similarity search [56] and fast locality-sensitive hashing frameworks [64, 53] to give an improved LSF framework with more precise complexity bounds.
The data structure produced by the framework follows the high-level approach as outlined in Section 14.1: queries and updates are mapped to a collection of buckets that are searched for similar points in the case of a query, or updated to store a reference to the point in the case of an update. Let denote the mapping from query points to buckets and denote the corresponding map for updates. The set of buckets will be identified by the “survivors” of branching processes through collections of filters, similarly to the Chosen Path algorithm [56].
The data structure is initialized by sampling collections of filters. We will use the notation () to denote the th query (update) filter in the th collection. For let denote a pairwise independent random hash function. Let be a parameter to be determined later and let denote vector concatenation, then the locality-sensitive map is defined recursively as follows:
[TABLE]
The map is defined in the same way except it uses instead of .
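The recursive definition can be sketched as a level-by-level branching process over paths. The example below is only an illustration: filters are modeled as Python predicates, the hash functions are replaced by a deterministic keyed digest (a stand-in, not a pairwise independent family), and `branching_map`, `unit_hash`, `threshold`, and `seeds` are hypothetical names. In the framework the filters are sampled from the family; here simple membership tests are used so the output is easy to inspect.

```python
import hashlib


def unit_hash(seed, path):
    """Deterministic stand-in for a hash function mapping paths to [0, 1)."""
    digest = hashlib.blake2b(repr((seed, path)).encode(), digest_size=6).digest()
    return int.from_bytes(digest, "big") / 2**48


def branching_map(x, filter_levels, threshold, seeds):
    """Return the set of surviving paths (tuples of filter indices) for x.

    A path p at level i - 1 is extended by index j when x passes filter j
    of level i and the level-i hash of the extended path is below threshold.
    """
    paths = [()]  # level 0 holds only the empty path
    for level_filters, seed in zip(filter_levels, seeds):
        extended = []
        for path in paths:
            for j, accepts in enumerate(level_filters):
                candidate = path + (j,)
                if accepts(x) and unit_hash(seed, candidate) < threshold:
                    extended.append(candidate)
        paths = extended
    return set(paths)


# Toy filters: filter j accepts a set exactly when it contains element j.
levels = [[(lambda s, j=j: j in s) for j in range(8)] for _ in range(3)]
seeds = [11, 22, 33]
x, y = {0, 1, 2, 3}, {2, 3, 4, 5}
shared = branching_map(x, levels, 0.5, seeds) & branching_map(y, levels, 0.5, seeds)
print(sorted(shared))
```

Because both points are hashed with the same functions, a path survives for both exactly when every filter along it accepts both points. Using a second list of filters in place of `levels` gives the corresponding update map.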
Properties.
To show that the maps provide an efficient solution to the -near neighbor problem we need to show the following:
- An upper bound on the expected size of and to bound the expected number of buckets probed during queries/updates.
- An upper bound on the expected size of when to bound the expected number of distant points that will be encountered during the linear scan part of the query algorithm.
- That is non-empty with constant probability when to guarantee that the query algorithm encounters a point at distance at most with constant probability, provided such a point exists.
By the independence between the different levels in the branching process we have that
[TABLE]
Given define . Define where is sampled from . The expected number of collisions between and at level is then given by
[TABLE]
To show correctness of the scheme we will use Chebyshev’s inequality to show that with constant probability we have for points with . We proceed by upper bounding in order to bound the variance . To ease the derivation we define where we suppress the subscript . Without loss of generality we can assume that since .
[TABLE]
Since if we set we have that
[TABLE]
We will set the parameters in order to give a simple upper bound on the worst-case performance of the data structure. The constants can be improved.
[TABLE]
We can now bound the variance of as follows:
[TABLE]
where we use the fact that . By Chebyshev’s inequality we have that
[TABLE]
By our parameter setting we have so collide with probability at least under , ensuring correctness.
25.1 Fast evaluation
We will use a hashing trick to compute in expected time . This technique is only briefly mentioned in [56]. Observe that for the correctness argument to hold, it suffices that the hash functions are sampled independently from a pairwise independent family [41, 42]. At the th step in the computation of we wish to determine, for each the set of satisfying and . In order to answer this efficiently we will make use of the property that a pairwise independent hash function can be decomposed as
[TABLE]
where are pairwise independent and denotes addition in an abelian group. For concreteness assume that map to -bit strings and let denote the exclusive-or operator. If we view the -bit output of as an integer in the set using the standard base two representation, the original condition can be transformed into the condition where . By choosing we can with high probability determine whether the condition is satisfied without reading more than bits, so we can effectively treat the output of the hash function as a real number at the cost of a small increase in the failure probability of the data structure.
Continuing with the new representation, in order for we must have that the leading bits of the output of are all zeroes. Given the leading bits of we can restrict our attention to with the same value in the leading bits of . At the beginning of the query algorithm, for each we determine the subset such that . We then create a table with linked lists and for each we append to the linked list at the table entry given by the leading bits of . The running time and space usage of preparing these additional data structures are dominated by the complexity of evaluating and storing filters from .
Now, given we can compute in expected time by, for each , looking up the relevant table entry (given by the leading bits of ) and verifying whether the elements of the linked list satisfy the hashing condition. Every element of the linked list found in this way satisfies the hashing condition with constant probability by our setting of . To implement and we can use simple tabulation hashing [173].
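The lookup step can be sketched as follows, assuming the decomposition uses exclusive-or on fixed-length bit strings. The names `build_prefix_table`, `extensions`, `WORD_BITS`, `prefix_bits`, and `limit` are hypothetical, and truly random 32-bit values stand in for the two component hash functions. The point of the sketch is that the exclusive-or of two values can only fall below `limit` when their leading bits agree, so only one table entry has to be inspected per path.

```python
import random
from collections import defaultdict

WORD_BITS = 32  # assumed length of the hash output in bits


def build_prefix_table(items, item_hash, prefix_bits):
    """Group items by the leading prefix_bits of their hash value."""
    table = defaultdict(list)
    for j in items:
        table[item_hash[j] >> (WORD_BITS - prefix_bits)].append(j)
    return table


def extensions(path_hash, table, item_hash, prefix_bits, limit):
    """Return all j with (path_hash XOR item_hash[j]) < limit.

    Requires limit <= 2**(WORD_BITS - prefix_bits), so that the condition
    forces the leading prefix_bits of the XOR to be zero.
    """
    prefix = path_hash >> (WORD_BITS - prefix_bits)
    return [j for j in table.get(prefix, []) if (path_hash ^ item_hash[j]) < limit]


rng = random.Random(7)
items = list(range(1000))
item_hash = {j: rng.getrandbits(WORD_BITS) for j in items}
prefix_bits = 6
limit = 2 ** (WORD_BITS - prefix_bits)  # each j qualifies with probability 1/64
table = build_prefix_table(items, item_hash, prefix_bits)

path_hash = rng.getrandbits(WORD_BITS)
fast = extensions(path_hash, table, item_hash, prefix_bits, limit)
brute = [j for j in items if (path_hash ^ item_hash[j]) < limit]
print(len(fast), sorted(fast) == sorted(brute))
```

The table is built once per level, after which each path costs one lookup plus a scan of a short list, instead of a pass over all candidate indices.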
One problem remains: long paths can take superconstant time to hash. To prevent this we again use hashing to create -bit fingerprints of the paths that we work on instead. A conservative upper bound on the expected time to compute is since is non-decreasing in and the expected time spent at level is upper bounded by . We use the same approach to compute .
25.2 Framework
We are now ready to state the properties of the new framework.
Theorem 3.11.
Given a -sensitive family we can construct a fully dynamic data structure that solves the -near neighbor problem. Define , then:
- The data structure uses words of space in addition to the space required to store data points and filters from .
- The query operation uses word-RAM operations, distance computations, and filter evaluations.
- The update operation uses word-RAM operations and filter evaluations.
Compared to the usual formulation where the query time is stated as Theorem 3.11 offers a more precise statement of the complexity and can be converted to the other formulation. The lower order terms are now confined to the multiplicative factor which is a standard expression that also appears in the LSH framework as where is an upper bound on the collision probability between pairs of points with . The analysis can be tightened further by not using as an upper bound for when bounding the variance, but removing the multiplicative dependence on entirely as in the improved LSH framework [53] is an interesting open problem.
Chapter 4 Set similarity search beyond MinHash
‘From the ashes, a fire shall be woken’
We consider the problem of approximate set similarity search under Braun-Blanquet similarity . The -approximate Braun-Blanquet similarity search problem is to preprocess a collection of sets such that, given a query set , if there exists with , then we can efficiently return with .
We present a simple data structure that solves this problem with space usage and query time where and . Making use of existing lower bounds for locality-sensitive hashing by O’Donnell et al. [121] we show that this value of is tight across the parameter space, i.e., for every choice of constants .
In the case where all sets have the same size our solution strictly improves upon the value of that can be obtained through the use of state-of-the-art data-independent techniques in the Indyk-Motwani locality-sensitive hashing framework [91] such as Broder’s MinHash [35] for Jaccard similarity and Andoni et al.’s cross-polytope LSH [13] for cosine similarity. Surprisingly, even though our solution is data-independent, for a large part of the parameter space we outperform the currently best data-dependent method by Andoni and Razenshteyn [17].
26 Introduction
In this paper we consider the approximate set similarity problem or, equivalently, the problem of approximate Hamming near neighbor search in sparse vectors. Data that can be represented as sparse vectors is ubiquitous — a typical example is the representation of text documents as term vectors, where non-zero vector entries correspond to occurrences of words (or shingles). In order to perform identification of near-identical text documents in web-scale collections, Broder et al. [30, 36] designed and implemented MinHash (a.k.a. min-wise hashing), now understood as a locality-sensitive hash function [86]. This allowed approximate answers to similarity queries to be computed much faster than by other methods, and in particular made it possible to cluster the web pages of the AltaVista search engine (for the purpose of eliminating near-duplicate search results). Almost two decades after it was first described, MinHash remains one of the most widely used locality-sensitive hashing methods as witnessed by thousands of citations of [30, 36] as well as the ACM Paris Kanellakis Theory and Practice Award that Broder shared with Indyk and Charikar in 2012.
A similarity measure maps a pair of vectors to a similarity score in . It will often be convenient to interpret a vector as the set . With this convention the Jaccard similarity of two vectors can be expressed as . In approximate similarity search we are interested in the problem of searching a data set for a vector of similarity at least with a query vector , but allow the search algorithm to return a vector of similarity . To simplify the exposition we will assume throughout the introduction that all vectors are -sparse, i.e., have the same Hamming weight .
Recent theoretical advances in data structures for approximate near neighbor search in Hamming space [17] make it possible to beat the asymptotic performance of MinHash-based Jaccard similarity search (using the LSH framework of [86]) in cases where the similarity threshold is not too small. However, numerical computations suggest that MinHash is always better when .
In this paper we address the problem: Can similarity search using MinHash be improved in general? We give an affirmative answer in the case where all sets have the same size by introducing Chosen Path: a simple data-independent search method that strictly improves MinHash, and is always better than the data-dependent method of [17] when . Similar to data-independent locality-sensitive filtering (LSF) methods [24, 100, 54] our method works by mapping each data (or query) vector to a set of keys that must be stored (or looked up). The name Chosen Path stems from the way the mapping is constructed: as paths in a layered random graph where the vertices at each layer are identified with the set of dimensions, and where a vector is only allowed to choose paths that stick to non-zero components . This is illustrated in Figure 5.
26.1 Related Work
High-dimensional approximate similarity search methods can be characterized in terms of their -value which is the exponent for which queries can be answered in time , where is the size of the set and denotes the dimensionality of the space. Here we focus on the “balanced” case where we aim for space , but note that there now exist techniques for obtaining other trade-offs between query time and space overhead [16, 54].
Locality-sensitive hashing methods.
We begin by describing results for Hamming space, which is a special case of similarity search on the unit sphere (many of the results cited apply to the more general case). In Hamming space the focus has traditionally been on the -value that can be obtained for solutions to the -approximate near neighbor problem: Preprocess a set of points such that, given a query point , if there exists with , then return with . In the literature this problem is often presented as the -approximate near neighbor problem where bounds for the -value are stated in terms of and, in the case of upper bounds, hold for every choice of , while lower bounds may only hold for specific choices of .
O’Donnell et al. [121] have shown that the value for -approximate near neighbor search in Hamming space, obtained in the seminal work of Indyk and Motwani [91], is the best possible in terms of for schemes based on Locality-Sensitive Hashing (LSH). However, the lower bound only applies when the distances of interest, and , are relatively small compared to , and better upper bounds are known for large distances. Notably, other LSH schemes for angular distance on the unit sphere such as cross-polytope LSH [13] give lower -values for large distances. Extensions of the lower bound of [121] to cover more of the parameter space were recently given in [16, 54]. Until recently the best -value known in terms of was , but in a sequence of papers Andoni et al. [14, 17] have shown how to use data-dependent LSH techniques to achieve , bypassing the lower bound framework of [121] which assumes the LSH to be independent of data.
Set similarity search.
There exists a large number of different measures of set similarity with various applications for which it would be desirable to have efficient approximate similarity search algorithms [51]. Given a measure of similarity assume that we have access to a family of locality-sensitive hash functions (defined in Section 27) such that for every pair of sets it holds that
[TABLE]
when is sampled randomly from . We will refer to a family of locality-sensitive hash functions with this specific property as a similarity-sensitive family. Given a similarity-sensitive family we can use the LSH framework to construct a solution for the -approximate similarity search problem with exponent .
Regarding the existence of similarity-sensitive families it was shown by Charikar [47] that if the similarity measure admits a similarity-sensitive LSH, then must be a metric. Recently, Chierichetti and Kumar [49] showed that, given a similarity that admits a similarity-sensitive LSH, the transformed similarity will continue to admit an LSH if is a probability generating function. The existence of an LSH that admits a similarity measure will therefore give rise to the existence of solutions to the approximate similarity search problem for the much larger class of similarities . However, this still leaves open the problem of finding efficient explicit constructions, and as it turns out, the property of similarity-sensitive families , while intuitively appealing and useful for similarity estimation, does not necessarily imply that the LSH is optimal for solving the approximate search problem. In fact, it was recently shown [50] that for Braun-Blanquet there does not exist an LSH scheme with . Moreover, it was shown that MinHash achieves a two-approximation to Braun-Blanquet similarity and that this is optimal for LSH schemes.
The problem of finding tight upper and lower bounds on the -value that can be obtained through the LSH framework for data-independent -approximate similarity search across the entire parameter space remains open for two of the most common measures of set similarity: Jaccard similarity and cosine similarity .
A random function from the MinHash family hashes a set to the first element of in a random permutation of the set . For we have that , yielding an LSH solution to the approximate Jaccard similarity search problem. For cosine similarity the SimHash family , introduced by Charikar [47], works by sampling a random hyperplane in $\mathbb{R}^d$ that passes through the origin and hashing according to what side of the hyperplane it lies on. For we have that , which can be used to derive a solution for cosine similarity, although not the clean solution that we could have hoped for in the style of MinHash for Jaccard similarity. There exist a number of different data-independent LSH approaches [156, 14, 13] that improve upon the -value of SimHash. Perhaps surprisingly, it turns out that these approaches yield lower -values for the -approximate Jaccard similarity search problem compared to MinHash for certain combinations of . Unfortunately, while asymptotically superior, these techniques suffer from a non-trivial -term in the exponent that only decreases very slowly with . In comparison, both MinHash and SimHash are simple to describe and have closed expressions for their -values. Furthermore, MinHash and SimHash both have the advantage of being efficient in the sense that a hash function can be represented using space and the time to compute is .
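Both families are easy to simulate. The sketch below uses the textbook constructions (a uniformly random permutation for MinHash, a random Gaussian normal vector for SimHash) and estimates their collision probabilities empirically; the function names `sample_minhash`, `sample_simhash`, and `collision_rate` are hypothetical, and no attempt is made at the space-efficient representations used in practice.

```python
import math
import random


def sample_minhash(universe_size, rng):
    """Return a function mapping a set to its first element in a random permutation."""
    order = list(range(universe_size))
    rng.shuffle(order)
    rank = {element: position for position, element in enumerate(order)}
    return lambda s: min(s, key=rank.__getitem__)


def sample_simhash(dim, rng):
    """Return a function mapping a vector to the side of a random hyperplane."""
    normal = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return lambda v: sum(a * b for a, b in zip(normal, v)) >= 0.0


def collision_rate(sample_hash, x, y, trials, rng):
    """Fraction of independently sampled hash functions under which x and y collide."""
    hits = 0
    for _ in range(trials):
        h = sample_hash(rng)
        hits += h(x) == h(y)
    return hits / trials


rng = random.Random(3)
x, y = set(range(0, 6)), set(range(3, 9))      # Jaccard similarity 3/9
minhash_rate = collision_rate(lambda r: sample_minhash(9, r), x, y, 4000, rng)

u, v = (1.0, 0.0), (0.0, 1.0)                  # angle pi/2 between u and v
simhash_rate = collision_rate(lambda r: sample_simhash(2, r), u, v, 4000, rng)

print(f"MinHash: {minhash_rate:.3f} (Jaccard {len(x & y) / len(x | y):.3f})")
print(f"SimHash: {simhash_rate:.3f} (1 - angle/pi = {1 - (math.pi / 2) / math.pi:.3f})")
```

The MinHash rate tracks the Jaccard similarity directly, while the SimHash rate depends on the angle between the vectors rather than on the cosine itself, which is why the resulting exponent for cosine similarity is less clean.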
In Table 2 we show how the upper bounds for similarity search under different measures of set similarity relate to each other in the case where all sets are -sparse. In addition to Hamming distance and Jaccard similarity, we consider Braun-Blanquet similarity [28] defined as
$$B(x, y) = \frac{|x \cap y|}{\max(|x|, |y|)}$$
which for -sparse vectors is identical to cosine similarity. When the query and the sets in can have different sizes the picture becomes muddled, and the question of which of the known algorithms is best for each measure of similarity is complicated and can depend on . In Section 30 we treat the problem of different set sizes and provide a brief discussion for Jaccard similarity, specifically in relation to our upper bound for Braun-Blanquet similarity.
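The relationship between the three measures is easy to check numerically. The sketch below uses the standard definition of Braun-Blanquet similarity as the intersection size divided by the larger of the two set sizes; the function names are chosen for the example.

```python
import math


def jaccard(x, y):
    return len(x & y) / len(x | y)


def braun_blanquet(x, y):
    return len(x & y) / max(len(x), len(y))


def cosine(x, y):
    # Cosine similarity of the 0/1 indicator vectors of the two sets.
    return len(x & y) / math.sqrt(len(x) * len(y))


x, y = {1, 2, 3, 4}, {3, 4, 5, 6}   # equal sizes
z = {1, 2}                          # a smaller set

print(braun_blanquet(x, y), cosine(x, y), jaccard(x, y))
print(braun_blanquet(x, z), cosine(x, z), jaccard(x, z))
```

For the equal-size pair, Braun-Blanquet and cosine similarity coincide, and the Jaccard similarity equals b / (2 - b) where b is the Braun-Blanquet similarity. For sets of different sizes the three measures diverge, which is the source of the more complicated picture described above.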
Similarity search under set similarity and the batched version often referred to as set similarity join [20, 23] have also been studied extensively in the information retrieval and database literature, but mostly without providing theoretical guarantees on performance. Recently the notion of containment search, where the similarity measure is the (unnormalized) intersection size, was studied in the LSH framework [150]. This is a special case of maximum inner product search [150, 5]. However, these techniques do not give improvements in our setting.
Similarity estimation.
Finally, we mention that another application of MinHash [30, 36] is the (easier) problem of similarity estimation, where the task is to condense each vector into a short signature in such a way that the similarity can be estimated from and . A related similarity estimation technique was independently discovered by Cohen [60]. Thorup [157] has shown how to perform similarity estimation using just a small amount of randomness in the definition of the function . In another direction, Mitzenmacher et al. [112] showed that it is possible to improve the performance of MinHash for similarity estimation when the Jaccard similarity is close to 1, but for smaller similarities it is known that succinct encodings of MinHash such as the one in [103] come within a constant factor of the optimal space for storing [128]. Curiously, our improvement to MinHash in the context of similarity search comes when the similarity is neither too large nor too small. Our techniques do not seem to yield any improvement for the similarity estimation problem.
26.2 Contribution
We show the following upper bound for approximate similarity search under Braun-Blanquet similarity:
Theorem 4.1.
For every choice of constants we can solve the -approximate similarity search problem under Braun-Blanquet similarity with query time and space usage where .
In the case where the sets are -sparse our Theorem 4.1 gives the first strict improvement on the -value for approximate Jaccard similarity search compared to the data-independent LSH approaches of MinHash and Angular LSH. Figure 6 shows an example of the improvement for a slice of the parameter space. The improvement is based on a new locality-sensitive mapping that considers a specific random collection of length- paths on the vertex set , and associates each vector with the paths in the collection that only visit vertices in . Our data structure exploits that similar vectors will be associated with a common path with constant probability, while vectors with low similarity have a negligible probability of sharing a path. However, the collection of paths has size superlinear in , so an efficient method is required for locating the paths associated with a particular vector. Our choice of the collection of paths balances two opposing constraints: It is random enough to match the filtering performance of a truly random collection of sets, and at the same time it is structured enough to allow efficient search for sets matching a given vector. The search procedure is comparable in simplicity to the classical techniques of bit sampling, MinHash, SimHash, and -stable LSH, and we believe it might be practical. This is in contrast to most theoretical advances in similarity search in the past ten years that suffer from terms in the exponent of complexity bounds.
Intuition.
Recall that we will think of a vector also as a set, . MinHash can be thought of as a way of sampling an element from , namely, we let where is a random hash function. For sets and the probability that equals their Jaccard similarity , which is much higher than if the samples had been picked independently. Consider the case in which , so . Another way of sampling is to compute , where , independently for each . The expected size of is 1, so this sample has the same expected “cost” as the MinHash-based sample. But if the Jaccard similarity is small, the latter samples are more likely to overlap:
[TABLE]
nearly a factor of 2 improvement. In fact, whenever we have . So in a certain sense, MinHash is not the best way of collecting evidence for the similarity of two sets. (This observation is not new, and has been made before e.g. in [62].)
Locality-sensitive maps.
The intersection of the samples does not correspond directly to hash collisions, so it is not clear how to turn this insight into an algorithm in the LSH framework. Instead, we will consider a generalization of both the locality-sensitive filtering (LSF) and LSH frameworks where we define a distribution over maps . The map performs the same task as the LSH data structure: It takes a vector and returns a set of memory locations . A randomly sampled map has the property that if a pair of points are close then with constant probability, while if are distant then the expected size is small (much smaller than ). It is now straightforward to see that this distribution can be used to construct a data structure for similarity search by storing each data point in the set of memory locations or buckets . A query for a point is performed by computing the similarity between and every point contained in the set of buckets , reporting the first sufficiently similar point found.
Chosen Path.
It turns out that to most efficiently filter out vectors of low similarity in the setting where all sets have equal size, we would like to map each data point to a collection of random subsets of that are contained in . Furthermore, to best distinguish similar from dissimilar vectors when solving the approximate similarity search problem, we would like the random subsets of to have size . This leads to another obstacle: The collection of subsets of required to ensure that for similar points, i.e., that maps to a subset contained in , is very large. The space usage and evaluation time of a locality-sensitive map to fully random subsets of would far exceed , rendering the solution useless. To overcome this we create the samples in a gradual, correlated way using a pairwise independent branching process that turns out to yield “sufficiently random” samples for the argument to go through.
Lower bound.
On the lower bound side we show that our solution for Braun-Blanquet similarity is best possible in terms of parameters and within the class of solutions that can be characterized as data-independent locality-sensitive maps. The lower bound works by showing that a family of locality-sensitive maps for Braun-Blanquet similarity with a -value below can be used to construct a locality-sensitive hash family for the -approximate near neighbor problem in Hamming space with a -value below , thereby contradicting the LSH lower bound by O’Donnell et al. [121]. We state the lower bound here in terms of locality-sensitive hashing, formally defined in Section 27.
Theorem 4.2.
For every choice of constants any -sensitive hash family for under Braun-Blanquet similarity must satisfy
[TABLE]
The details showing how this LSH lower bound implies a lower bound for locality-sensitive maps are given in Section 29.
27 Preliminaries
As stated above we will view both as a vector and as a subset of . Define to be -sparse if ; we will be interested in the setting where , and typically the sparse setting . Although many of the concepts we use hold for general spaces, for simplicity we state definitions in the same setting as our results: the boolean hypercube under some measure of similarity .
Definition 4.1.
(Approximate similarity search) Let be a set of data vectors, let be a similarity measure, and let such that . A solution to the -similarity search problem is a data structure that supports the following query operation: on input for which there exists a vector with , return with .
Our data structures are randomized, and queries succeed with probability at least (the probability can be made arbitrarily close to by independent repetition). Sometimes similarity search is formulated as searching for vectors that are near according to the distance measure . For our purposes it is natural to phrase conditions in terms of similarity, but we compare to solutions originally described as “near neighbor” methods.
Many of the best known solutions to approximate similarity search problems are based on a technique of randomized space partitioning. This technique has been formalized in the locality-sensitive hashing framework [91] and the closely related locality-sensitive filtering framework [24, 54].
Definition 4.2.
(Locality-sensitive hashing [91]) A -sensitive family of hash functions for a similarity measure is a distribution over functions such that for all and random sampled according to :
- If then .
- If then .
The range of the family will typically be fairly small such that an element of can be represented in a constant number of machine words. In the following we assume for simplicity that the family of hash functions is efficient such that can be computed in time . Furthermore, we will assume that the time to compute the similarity can be upper bounded by the time it takes to compute the size of the intersection of preprocessed sets, i.e., .
Given a locality-sensitive family it is quite simple to obtain a solution to the approximate similarity search problem, essentially by hashing points to buckets such that close points end up in the same bucket while distant points are kept apart.
Lemma 4.1 (LSH framework [91, 86]).
Given a -sensitive family of hash functions it is possible to solve the -similarity search problem with query time and space usage where .
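A minimal sketch of the classical LSH framework follows. It assumes the standard parameterization: the exponent is log(1/p1) / log(1/p2), each table concatenates k hash functions so that distant points collide with probability about 1/n, and the number of tables is about p1 to the power -k. The names `lsh_parameters`, `LSHIndex`, and `sample_hash` are hypothetical and not taken from the text.

```python
import math
from collections import defaultdict


def lsh_parameters(n, p1, p2):
    """Return (rho, k, tables) under the standard parameter choice (assumed)."""
    rho = math.log(1 / p1) / math.log(1 / p2)
    k = math.ceil(math.log(n) / math.log(1 / p2))  # hashes concatenated per table
    tables = math.ceil(p1 ** -k)                   # independent repetitions
    return rho, k, tables


class LSHIndex:
    """Hash every point into one bucket per table; query the matching buckets."""

    def __init__(self, points, sample_hash, k, tables):
        self.points = points
        # sample_hash() must return a fresh random function from the family.
        self.funcs = [[sample_hash() for _ in range(k)] for _ in range(tables)]
        self.buckets = [defaultdict(list) for _ in range(tables)]
        for idx, x in enumerate(points):
            for t, fs in enumerate(self.funcs):
                self.buckets[t][tuple(f(x) for f in fs)].append(idx)

    def candidates(self, q):
        """Yield each data point sharing a bucket with q, without duplicates."""
        seen = set()
        for t, fs in enumerate(self.funcs):
            for idx in self.buckets[t].get(tuple(f(q) for f in fs), ()):
                if idx not in seen:
                    seen.add(idx)
                    yield self.points[idx]
```

A query would compute the similarity between `q` and each candidate and stop at the first one above the lower threshold.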
The upper bound presented in this paper does not quite fit into the existing frameworks. However, we would like to apply existing LSH lower bound techniques to our algorithm. Therefore we define a more general framework that captures solutions constructed using the LSH and LSF framework, as well as the upper bound presented in this paper.
Definition 4.3 (Locality-sensitive map).
A -sensitive family of maps for a similarity measure is a distribution over mappings (where denotes the power set of ) such that for all and random :
1. .
2. If then .
3. If then .
Once we have a family of locality-sensitive maps we can use it to obtain a solution to the -similarity search problem.
Lemma 4.2.
Given a -sensitive family of maps we can solve the -similarity search problem with query time and space usage where is the time to evaluate a map .
Proof.
We construct the data structure by sampling a map from and use it to place points in into buckets. To run a query for a point we proceed by evaluating and computing the similarity between and the points in the buckets associated with . If a sufficiently similar point is found we return it. We get rid of the expectation in the guarantees by independent repetitions and applying Markov’s inequality. ∎
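The construction in the proof can be sketched as follows, for a single sampled map; the independent repetitions that remove the expectation are omitted. The class name `LSMIndex` and the arguments `lsm` and `similarity` are hypothetical. Any function that sends a point to a set of hashable bucket keys can play the role of the map.

```python
from collections import defaultdict


class LSMIndex:
    """Place each point in every bucket named by its map output."""

    def __init__(self, points, lsm, similarity):
        self.points = points
        self.lsm = lsm                # one sampled map: point -> set of keys
        self.similarity = similarity
        self.buckets = defaultdict(list)
        for idx, x in enumerate(points):
            for key in lsm(x):
                self.buckets[key].append(idx)

    def query(self, q, s2):
        """Scan the buckets named by lsm(q); return a point with similarity > s2."""
        for key in self.lsm(q):
            for idx in self.buckets.get(key, ()):
                if self.similarity(self.points[idx], q) > s2:
                    return self.points[idx]
        return None
```

The query cost splits into evaluating the map, visiting the buckets it names, and computing similarities against the points found there. These are the three terms the lemma accounts for.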
Model of computation.
We assume the standard word RAM model [85] with word size , where . In order to be able to draw random functions from a family of functions we augment the model with an instruction that generates a machine word uniformly at random in constant time.
28 Upper Bound
We will describe a family of locality-sensitive maps for solving the -similarity search problem under Braun-Blanquet similarity (3). After describing we will give an efficient implementation of and show how to set parameters to obtain our Theorem 4.1.
28.1 Chosen Path
The Chosen Path family is defined by random hash functions where and is a positive integer. The evaluation of a map proceeds in a sequence of steps that can be analyzed as a Galton-Watson branching process, originally devised to investigate population growth under the assumption of identical and independent offspring distributions. In the first step we create a population of starting points
[TABLE]
In subsequent steps, every path that has survived so far produces offspring according to a random process that depends on and the element being evaluated. We use to denote concatenation of a path with a vertex .
[TABLE]
Observe that can only hold when , so the paths in are constrained to . The set is given by the paths that survive to the th step. We will proceed by bounding the evaluation time of as well as showing the locality-sensitive properties of . In particular, for similar points with we will show that with probability at least there will be a path that is chosen by both and .
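The branching process can be sketched in a few lines. The defining equations are not reproduced here, so the survival rule is an assumption: an extension of a path by an element of the set survives when its hash value falls below 1 / (b1 * |x|), which makes the expected number of paths grow by a factor 1/b1 per step. The helper `unit_hash` stands in for the random hash functions and is purely illustrative.

```python
import hashlib


def unit_hash(seed, level, path):
    """Deterministic stand-in for a random hash function into [0, 1]."""
    key = f"{seed}|{level}|{','.join(map(str, path))}".encode()
    digest = hashlib.blake2b(key, digest_size=8).digest()
    return int.from_bytes(digest, "big") / 2 ** 64


def chosen_path(x, b1, k, seed=0):
    """Return the set of length-k paths over the non-empty set x that survive."""
    elements = sorted(set(x))
    threshold = 1.0 / (b1 * len(elements))  # assumed survival probability
    paths = [()]                            # the empty path starts the process
    for level in range(1, k + 1):
        paths = [
            p + (j,)
            for p in paths
            for j in elements
            if unit_hash(seed, level, p + (j,)) < threshold
        ]
    return set(paths)
```

Because the hash of a path depends only on the path and the level, two sets of equal size agree on the fate of every path drawn from their intersection. Shared paths are therefore exactly the surviving paths that stay inside the intersection.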
Lemma 4.3 (Properties of Chosen Path).
For all , integer , and random :
1. .
2. If then .
3. If then .
Proof.
We prove each property by induction on . The base cases follow from (4). Now consider the inductive step for property 1. Let denote the indicator function for predicate . Using independence of the hash functions we get:
[TABLE]
The last inequality uses the induction hypothesis. We use the same approach for the second property where we let .
[TABLE]
To prove the third property we bound the variance of and apply Chebyshev’s inequality to bound the probability of . First consider the case where and . Here it must hold that as intersecting paths exist () and always activate. In all other cases we have that
[TABLE]
Knowing the expected value we can apply Chebyshev’s inequality once we have an upper bound for . Specifically we show that , by induction on . To simplify notation we define the indicator variable
[TABLE]
where we suppress the subscript . First observe that
[TABLE]
By (5) we see that , which means:
[TABLE]
The third property now follows from a one-sided version of Chebyshev’s inequality applied to . ∎
28.2 Implementation details
Lemma 4.3 continues to hold when the hash functions are individually 2-independent (and mutually independent) since we only use bounds on the first and second moments of the hash values. We can therefore use a simple and practical scheme such as Zobrist hashing [173] that hashes strings of bits to strings of bits in time using space, say, . It is not hard to see that the domain and range of can be compressed to bits (causing a negligible increase in the failure probability of the data structure). We simply hash the paths to intermediate values of bits, avoiding collisions with high probability, and in a similar vein, with high probability bits of precision suffice to determine whether .
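Zobrist hashing, also known as simple tabulation hashing, can be sketched as below. The key is split into bytes, each byte indexes its own table of random words, and the looked-up words are combined with XOR. The byte-sized characters, the 32-bit output, and the class name are choices made for this illustration.

```python
import random


class ZobristHash:
    """Simple tabulation hashing of fixed-width integer keys."""

    def __init__(self, key_bytes=4, out_bits=32, seed=0):
        rng = random.Random(seed)
        self.key_bytes = key_bytes
        # One table of 256 random words per byte position of the key.
        self.tables = [
            [rng.getrandbits(out_bits) for _ in range(256)]
            for _ in range(key_bytes)
        ]

    def __call__(self, key):
        h = 0
        for i in range(self.key_bytes):
            h ^= self.tables[i][(key >> (8 * i)) & 0xFF]
        return h
```

Evaluation costs one table lookup per byte of the key. Simple tabulation is known to be 3-independent, which is more than the 2-independence the lemma needs.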
We now consider how to parameterize to solve the -similarity problem for Braun-Blanquet similarity on a set of points for every choice of constant parameters , independent of . Note that we exclude (which would correspond to identical vectors that can be found in time by resorting to standard hashing) and (for which every data point would be a valid answer to a query). We set parameters
[TABLE]
from which it follows that is -sensitive with and where . To bound the expected evaluation time of we use Zobrist hashing as well as intermediate hashes for the paths as described above. In the th step in the branching process the expected number of hash function evaluations is bounded by times the number of paths alive at step . We can therefore bound the expected time to compute by
[TABLE]
This completes the proof of Theorem 4.1.
Footnote 1: We know of a way of replacing the multiplicative factor in equation (6) by an additive term of by choosing the hash functions carefully, but do not discuss this improvement here since can be assumed to be polylogarithmic and our focus is on the exponent of .
28.3 Comparison
We will proceed by comparing our Theorem 4.1 to results that can be achieved using existing techniques. Again we focus on the setting where data points and query points are exactly -sparse. An overview of different techniques for three measures of similarity is shown in Table 2. To summarize: The Chosen Path algorithm of Theorem 4.1 improves upon all existing data-independent results over the entire parameter space. Furthermore, we improve upon the best known data-dependent techniques [17] for a large part of the parameter space (see Figure 9). The details of the comparisons are given in Appendix 33.
MinHash.
For -sparse vectors there is a 1-1 mapping between Braun-Blanquet and Jaccard similarity. In this setting . Let and be the Braun-Blanquet similarities corresponding to Jaccard similarities and . The LSH framework using MinHash achieves ; this should be compared to achieved in Theorem 4.1. Since the function is monotonically increasing in we have that , i.e., is always smaller than . As an example, for and we get while . Figure 7 shows the difference for the whole parameter space.
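The comparison can be checked numerically with a short sketch. It assumes that the exponent of Theorem 4.1 is log(1/b1) / log(1/b2) and that MinHash achieves log(1/j1) / log(1/j2), with Jaccard similarity j = b / (2 - b) for sets of equal size. The function names are illustrative, and the thresholds used in the usage note are arbitrary sample values rather than the example from the text.

```python
import math


def rho_chosen_path(b1, b2):
    """Assumed exponent of Theorem 4.1 for Braun-Blanquet thresholds b1 > b2."""
    return math.log(1 / b1) / math.log(1 / b2)


def bb_to_jaccard(b):
    """Jaccard similarity of two equal-size sets with Braun-Blanquet similarity b."""
    return b / (2 - b)


def rho_minhash(b1, b2):
    """MinHash exponent after translating both thresholds to Jaccard similarity."""
    j1, j2 = bb_to_jaccard(b1), bb_to_jaccard(b2)
    return math.log(1 / j1) / math.log(1 / j2)
```

With thresholds 0.5 and 0.25, `rho_chosen_path` gives 0.5 while `rho_minhash` gives log 3 / log 7, roughly 0.565.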
Angular LSH.
Since our vectors are exactly -sparse Braun-Blanquet similarities correspond directly to dot products (which in turn correspond to angles). Thus we can apply angular LSH such as SimHash [47] or cross-polytope LSH [13]. As observed in [54] one can express the -value of cross-polytope LSH in terms of dot products as . Since the function is negative and monotonically increasing in we have that , i.e., is always smaller than . In the above example, for and we have which is about more than Chosen Path. See Figure 8 for a visualization of the difference for the whole parameter space.
Data-dependent Hamming LSH.
The Hamming distance between two -sparse vectors with Braun-Blanquet similarity is , since the intersection of the vectors has size . This means that -similarity search under Braun-Blanquet similarity can be reduced to Hamming similarity search with approximation factor . As mentioned above, the data-dependent LSH technique of [17] achieves ignoring terms. In terms of and this is , which is incomparable to the of Theorem 4.1. In Appendix 33 we show that whenever , or equivalently, whenever . Revisiting the above example, for and we have which is about more than Chosen Path. Figure 9 gives a comparison covering the whole parameter space.
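The reduction to Hamming space can be worked through in code. For two sets of size t with Braun-Blanquet similarity b the intersection has size b * t, so the symmetric difference has size 2 * t * (1 - b). The approximation factor is then the ratio of the far distance to the near distance. The exponent 1 / (2c - 1) used for the data-dependent scheme is the commonly quoted bound for Hamming space, taken here as an assumption with lower-order terms ignored.

```python
def hamming_distance(x, y):
    """Hamming distance between the indicator vectors of two sets."""
    return len(x ^ y)


def approximation_factor(b1, b2):
    """Ratio of the far distance 2t(1 - b2) to the near distance 2t(1 - b1)."""
    return (1 - b2) / (1 - b1)


def rho_data_dependent(b1, b2):
    """Assumed exponent 1 / (2c - 1) of data-dependent Hamming LSH."""
    c = approximation_factor(b1, b2)
    return 1 / (2 * c - 1)
```

Substituting c gives (1 - b1) / (1 + b1 - 2 * b2), which makes the dependence on the two thresholds explicit.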
29 Lower bound
In this section we will show a locality-sensitive hashing lower bound for under Braun-Blanquet similarity. We will first show that LSH lower bounds apply to the class of solutions to the approximate similarity search problem that are based on locality-sensitive maps, thereby including our own upper bound. Next we will introduce some relevant tools from the literature, in particular the LSH lower bounds for Hamming space by O’Donnell et al. [121] which we use, through a reduction, to show LSH lower bounds under Braun-Blanquet similarity.
Lower bounds for locality-sensitive maps.
Because our upper bound is based on a locality-sensitive map and not LSH-based we first show that LSH lower bounds apply to LSM-based solutions. This is not too surprising as both the LSH and LSF frameworks produce LSM-based solutions. We note that the idea of showing lower bounds for a more general class of algorithms that encompasses both LSH and LSF was used by Andoni et al. [16] in their list-of-points data structure lower bound for the space-time tradeoff of solutions to the approximate near neighbor problem in the random data regime. We use the approach of Christiani [54] to convert an LSM family into an LSH family using MinHash.
Lemma 4.4.
Suppose we have a -sensitive family of maps . Then we can construct a -sensitive family of hash functions with and where .
Proof.
We sample a function from by sampling a function from , modify to output a set of fixed size, and apply MinHash to the resulting set. For we define the function where we ensure that the size of the output set is . We note that the purpose of this step is to be able to simultaneously lower bound and upper bound for when we apply MinHash to the resulting sets.
[TABLE]
We proceed by applying MinHash to the set . Let denote a random permutation of the range of and define
[TABLE]
We then have
[TABLE]
summing over the finite set of all possible Jaccard similarities with . It is now fairly simple to lower bound and upper bound . Assume that satisfy that . To lower bound we use a union bound together with Markov’s inequality to bound the following probability:
[TABLE]
We therefore have that . In the event of a nonempty intersection the probability of collision is given by allowing us to conclude that .
Bounding the collision probability for distant pairs of points with we get
[TABLE]
∎
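The MinHash step used in the proof relies on the fact that two sets receive the same minimum under a random permutation with probability equal to their Jaccard similarity. A small simulation illustrates this. The padding of the map output to a fixed size is not modelled, and the function names are hypothetical.

```python
import random


def sample_minhash(universe, rng):
    """Sample a MinHash function: the element of a set ranked first by a random permutation."""
    order = list(universe)
    rng.shuffle(order)
    rank = {element: position for position, element in enumerate(order)}
    return lambda s: min(s, key=rank.__getitem__)


def collision_rate(x, y, universe, trials=2000, seed=0):
    """Fraction of sampled MinHash functions on which x and y collide."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = sample_minhash(universe, rng)
        hits += h(x) == h(y)
    return hits / trials
```

For two sets with three shared elements out of nine in their union, the empirical rate should be close to 1/3.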
We are now ready to justify the statement that LSH lower bounds apply to LSM, allowing us to restrict our attention to proving LSH lower bounds for Braun-Blanquet similarity.
Corollary 4.1.
Suppose that we have an LSM-based solution to the -similarity search problem with query time . Then there exists a family of locality-sensitive hash functions with .
Proof.
The existence of the LSM-based solution implies that for every there exists a -sensitive family of maps with and . The upper bound on follows from applying Lemma 4.4. ∎
LSH lower bounds for Hamming space.
There exist a number of powerful results that lower bound the -value that is attainable by locality-sensitive hashing and related approaches in various settings [114, 132, 121, 19, 54, 16]. O’Donnell et al. [121] showed an LSH lower bound of for -dimensional Hamming space under the assumption that is not too small compared to , i.e., . The lower bound by O’Donnell et al. holds for -sensitive families for a particular choice of that depends on , , and , and where is small compared to (for instance, we have that when and are constant).
We state a simplified version of the lower bound due to O’Donnell et al. where that we will use as a tool to prove our lower bound for Braun-Blanquet similarity. The full proof of Lemma 4.5 is given in Appendix 32.
Lemma 4.5.
For every , , and every -sensitive hash family for under Hamming distance must have
[TABLE]
In general, good lower bounds for the entire parameter space are not known, although the techniques by O’Donnell et al. appear to yield a bound of . This is far from tight as can be seen by comparing it to the bit-sampling [91] upper bound of . Existing lower bounds are tight in two different settings. First, in the setting where (random data), lower bounds [114, 73, 19] match various instantiations of angular LSH [156, 14, 13]. Second, in the setting where , the lower bound by O’Donnell et al. [121] becomes , matching bit-sampling LSH [91] as well as Angular LSH.
29.1 Braun-Blanquet LSH lower bound
We are now ready to prove the LSH lower bound from Theorem 4.2. The lower bound together with Corollary 4.1 shows that the -value of Theorem 4.1 is best possible up to terms within the class of data-independent locality-sensitive maps for Braun-Blanquet similarity. Furthermore, the lower bound also applies to angular distance on the unit sphere where it comes close to matching the best known upper bounds for much of the parameter space as can be seen from Figure 8.
Proof sketch.
The proof works by assuming the existence of a -sensitive family for under Braun-Blanquet similarity with for some . We use a transformation from Hamming space to Braun-Blanquet similarity to show that the existence of implies the existence of a -sensitive family for -dimensional Hamming space that will contradict the lower bound of O’Donnell et al. [121] as stated in Lemma 4.5 for some appropriate choice of .
We proceed by giving an informal description of a simple “tensoring” technique for converting a similarity search problem in Hamming space into a Braun-Blanquet set similarity problem for target similarity thresholds . For define
[TABLE]
and for a positive integer define . We have that and
[TABLE]
where . For every choice of constants we can choose , , , and such that and . Now, given an LSH family for Braun-Blanquet with we would be able to obtain an LSH family for Hamming space with
[TABLE]
For appropriate choices of parameters this would contradict the O’Donnell et al. LSH lower bound of for Hamming space. The proof itself is mostly an exercise in setting parameters and applying the right bounds and approximations to make everything fit together with the intuition above. Importantly, we use sampling in order to map to a dimension that is much lower than the from the proof sketch in order to make the proof hold for small values of in relation to .
Hamming distance to Braun-Blanquet similarity.
Let and let be constant as in Theorem 4.2. Let be a parameter to be determined. We want to show how to use a transformation from Hamming distance to Braun-Blanquet similarity together with our family to construct a -sensitive family for -dimensional Hamming space with parameters
[TABLE]
where and remain to be determined.
The function takes as parameters positive integers , , and . The output of consists of concatenated -bit strings, each of Hamming weight one. Each of the strings is constructed independently at random according to the following process: Sample a vector of indices uniformly at random from and define as . Let be indexed by and set the bits of as follows:
[TABLE]
Next we apply a random function in order to map down to an -bit string of Hamming weight one while approximately preserving Braun-Blanquet similarity. For we set
[TABLE]
Finally we set
[TABLE]
where each is constructed independently at random.
We state the properties of for the following parameter setting:
[TABLE]
Lemma 4.6.
For every and there exists a distribution over functions of the form such that for all and random :
1. .
2. If then with probability at least .
3. If then with probability at least .
Proof.
The first property is trivial. For the second property we consider with where we would like to lower bound
[TABLE]
We know that so it remains to lower bound the size of the intersection . Consider the expectation
[TABLE]
We have that if and take on the same value in the underlying bit-positions that are sampled to construct . Under the assumption that , then for greater than some sufficiently large constant we can use a standard approximation to the exponential function (detailed in Lemma 4.10 in Appendix 32) to show that
[TABLE]
Seeing as is the sum of independent Bernoulli trials we can apply Hoeffding’s inequality to yield the following bound:
[TABLE]
This proves the second property of .
For the third property we consider the Braun-Blanquet similarity of distant pairs of points with . Again, under our assumption that and for greater than some constant we have
[TABLE]
There are two things that can cause the event to fail. First, the sum of the independent Bernoulli trials for the event can deviate too much from its expected value. Second, the mapping down to -bit strings that takes place from to can lead to an additional increase in the similarity due to collisions. Let denote the sum of the Bernoulli trials for the events associated with . We again apply a standard Hoeffding bound to show that
[TABLE]
Let denote the number of collisions when performing the universe reduction to -bit strings. By our choice of we have that . Another application of Hoeffding’s inequality shows that
[TABLE]
We therefore get that
[TABLE]
This proves the third property of . ∎
Contradiction.
To summarize, using the random map together with the LSH family we can obtain an -sensitive family for -dimensional Hamming space with and for . For our choice of we plug the family into the lower bound of Lemma 4.5 and use that which follows from our constraint that .
[TABLE]
Under our assumed properties of , we can upper bound the value of for . For simplicity we temporarily define and assume that and . The latter property holds without loss of generality through use of the standard LSH powering technique [91, 86, 121] that allows us to transform an LSH family with to a family that has without changing its associated -value.
[TABLE]
We get a contradiction between our upper bound and lower bound for whenever violates the following relation that summarizes the bounds:
[TABLE]
In order for a contradiction to occur, the value of has to satisfy
[TABLE]
By our setting of and we have that . We can cause a contradiction for a setting of where is some constant and where we assume that is greater than some constant. The value of for which the lower bound holds can be upper bounded by
[TABLE]
This completes the proof of Theorem 4.2.
30 Equivalent set similarity problems
In this section we consider how to use our data structure for Braun-Blanquet similarity search to support other similarity measures such as Jaccard similarity. We already observed in the introduction that a direct translation exists between several similarity measures whenever the size of every set is fixed to . Call an -similarity search problem (,)-regular if is restricted to vectors of weight and queries are restricted to vectors of weight . Obviously, a -regular similarity search problem is no harder than the general similarity search problem, but it also cannot be too much easier when expressed as a function of the thresholds : For every pair we can construct a (,)-regular data structure (such that each point is represented in the data structures with ), and answer a query for by querying all data structures with . Thus, the time and space for the general -similarity search problem are at most times larger than the time and space of the most expensive (,)-regular data structure. This does not mean that we cannot get better bounds in terms of other parameters, and in particular we expect that the difficulty of -regular similarity search problems depends on parameters and .
Dimension reduction.
If the dimension is large a factor of may be significant. However, for most natural similarity measures a -similarity problem in dimensions can be reduced to a logarithmic number of -similarity problems on in dimensions with and . Since the similarity gap is close to the one in the original problem, , where and are assumed to be independent of , the difficulty (-value) remains essentially the same. First, split into size classes such that vectors in class have size in . For each size class the reduction is done independently and works by a standard technique: sample a sequence of random sets , , and set . The size of each set is chosen such that when . By Chernoff bounds this mapping preserves the relative weight of vectors up to size up to an additive term with high probability. Assume now that the similarity measure is such that for vectors in we only need to consider in the range from to (since if the size difference is larger, the similarity is negligible). Then we can apply Chernoff bounds to the relative weights of the dimension-reduced vectors , and the intersection . In particular, we get that the Jaccard similarity of a pair of vectors is preserved up to an additive error of with high probability. The class of similarity measures for which dimension reduction to dimensions is possible is large, and we do not attempt to characterize it here. Instead, we just note that for such similarity measures we can determine the complexity of similarity search up to a factor by only considering regular search problems.
Equivalence of regular similarity search problems.
We call a set similarity measure on symmetric if it can be written in the form , where each function is nondecreasing. All 59 set similarity measures listed in the survey [51], normalized to yield similarities in , are symmetric. In particular this is the case for Jaccard similarity (where ) and for Braun-Blanquet similarity. For a symmetric similarity measure, the predicate is equivalent to the predicate , where , and is equivalent to the predicate , where . This means that every (,)-regular -similarity search problem on is equivalent to an -similarity search problem on , where . In other words, all symmetric similarity search problems can be translated to each other, and it suffices to study a single one, such as Braun-Blanquet similarity.
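The translation between measures in the regular case can be illustrated for Jaccard and Braun-Blanquet similarity. With set sizes wx and wq fixed, a Jaccard threshold j holds exactly when the intersection size i satisfies i * (1 + j) >= j * (wx + wq). That intersection threshold in turn corresponds to a Braun-Blanquet threshold of i / max(wx, wq). The function names are illustrative, and exact rational arithmetic is used to avoid rounding at the boundary.

```python
from fractions import Fraction
from math import ceil


def jaccard_to_intersection(j, wx, wq):
    """Smallest intersection size i with i / (wx + wq - i) >= j."""
    j = Fraction(j)
    return ceil(j * (wx + wq) / (1 + j))


def equivalent_braun_blanquet(j, wx, wq):
    """Braun-Blanquet threshold selecting the same pairs as Jaccard threshold j."""
    return Fraction(jaccard_to_intersection(j, wx, wq), max(wx, wq))
```

For example, with both sizes equal to 10 a Jaccard threshold of 1/2 becomes an intersection threshold of 7 and hence a Braun-Blanquet threshold of 7/10.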
Jaccard similarity.
We briefly discuss Jaccard similarity since it is the most widely used measure of set similarity. If we consider the problem of -approximate Jaccard similarity search in the -regular case with then our Theorem 4.1 is no longer guaranteed to yield the lowest value of among competing data-independent approaches such as MinHash and Angular LSH. To simplify the comparison between different measures we introduce parameters and defined by and (note that ). The three primary measures of set similarity considered in this paper can then be written as follows:
[TABLE]
As shown in Figure 10 among angular LSH, MinHash, and Chosen Path, the technique with the lowest -value is different depending on the parameters and asymmetry .
We know that Chosen Path is optimal and strictly better than the competing data-independent techniques across the entire parameter space when , but it remains open to find tight upper and lower bounds in the case where .
31 Conclusion and open problems
We have seen that, perhaps surprisingly, there exists a relatively simple way of strictly improving the -value for data-independent set similarity search in the case where all sets have the same size. To implement the required locality-sensitive map efficiently we introduce a new technique based on branching processes that could possibly lead to more efficient solutions in other settings.
It remains an open problem to find tight upper and lower bounds on the -value for Jaccard and cosine similarity search that hold for the entire parameter space in the general setting with arbitrary set sizes. Perhaps a modified version of the Chosen Path algorithm can yield an improved solution to Jaccard similarity search in general. One approach is to generalize the condition to use different thresholds for queries and updates. This yields different space-time tradeoffs when applying the Chosen Path algorithm to Jaccard similarity search.
Another interesting question is if the improvement shown for sparse vectors can be achieved in general for inner product similarity. A similar, but possibly easier, direction would be to consider weighted Jaccard similarity.
Acknowledgment
We thank Thomas Dybdahl Ahle for comments on a previous version of this manuscript.
32 Appendix: Details behind the lower bound
32.1 Tools
For clarity we state some standard technical lemmas that we use to derive LSH lower bounds.
Lemma 4.7 (Hoeffding [88, Theorem 1]).
Let be independent random variables satisfying for . Define , , and , then:
- For and we have that .
- For and we have that .
Lemma 4.8 (Chernoff [113, Thm. 4.4 and 4.5]).
Let be independent Poisson trials and define and . Then, for we have
.
- -
.
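For reference, the multiplicative Chernoff bounds for a sum of independent Poisson trials are commonly stated as follows. This is the usual textbook form, written with symbols ($X$, $\mu$, $\delta$) chosen here for illustration. It is an assumption that it matches the exact statement intended in Lemma 4.8.

```latex
\[
  \Pr\bigl[X \geq (1+\delta)\mu\bigr] \leq e^{-\mu\delta^{2}/3},
  \qquad
  \Pr\bigl[X \leq (1-\delta)\mu\bigr] \leq e^{-\mu\delta^{2}/2},
  \qquad 0 < \delta < 1 .
\]
```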
Lemma 4.9 (Bounding the logarithm [160]).
For we have that .
Lemma 4.10 (Approximating the exponential function [115, Prop. B.3]).
For all with we have that .
32.2 Proof of Lemma 4.5
Preliminaries.
We will reuse the notation of Section 3. from O’Donnell et al. [121].
Definition 4.4.
For we say that are -correlated if is chosen uniformly at random from and is constructed by rerandomizing each bit from independently at random with probability .
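Correlated pairs can be sampled with a few lines of code. The sketch assumes the usual convention for a correlation parameter rho: each bit of the second string is rerandomized independently with probability 1 - rho. Under that convention a rerandomized bit differs from the original with probability 1/2, so the expected Hamming distance is (1 - rho) * d / 2. The function names are hypothetical.

```python
import random


def correlated_pair(d, rho, rng):
    """Sample x uniformly from {0,1}^d and y by rerandomizing bits of x."""
    x = [rng.randrange(2) for _ in range(d)]
    # Each bit is replaced by a fresh uniform bit with probability 1 - rho (assumed).
    y = [rng.randrange(2) if rng.random() < 1 - rho else bit for bit in x]
    return x, y


def mean_distance(d, rho, trials, seed=0):
    """Average Hamming distance over independently sampled correlated pairs."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        x, y = correlated_pair(d, rho, rng)
        total += sum(a != b for a, b in zip(x, y))
    return total / trials
```

With d = 1000 and rho = 0.6 the average distance concentrates around 200, which is the kind of concentration the Chernoff bounds in the proof make precise.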
Let be -correlated and let be a family of hash functions on , then we define
[TABLE]
We have that is a log-convex function which implies the following property that underlies the lower bound:
Lemma 4.11.
For every family of hash functions on , every , and we have
[TABLE]
The idea behind the proof is to tie to and to through Chernoff bounds and then apply Lemma 4.11 to show that .
Proof.
Begin by assuming that we have a family that satisfies the conditions of Lemma 4.5. Note that the expected Hamming distance between -correlated points and is given by . We set and and let denote -correlated random strings and denote )-correlated random strings. By standard Chernoff bounds we get the following guarantees:
[TABLE]
We will establish a relationship between and on the one hand, and and on the other hand, for the following choice of parameters and :
[TABLE]
By the properties of and from the definition of we have that
[TABLE]
Let . By Lemma 4.11 and our setting of and we can use the bounds on the natural logarithm from Lemma 4.9 to show the following:
[TABLE]
We proceed by lower bounding where we make use of the inequalities derived above.
[TABLE]
By Lemma 4.11 combined with the restrictions on our parameters, for greater than some constant we have that . Furthermore, we lower bound by using that together with the restriction that and the properties of . For greater than some constant it therefore holds that from which it follows that .
[TABLE]
By the arguments above we have that
[TABLE]
Inserting the lower bound for results in the lemma.
33 Appendix: Comparisons
For completeness we state the proofs behind the comparisons between the -values obtained by the Chosen Path algorithm and other LSH techniques.
33.1 MinHash
For data sets with fixed sparsity and Braun-Blanquet similarities we have that where . If is monotone increasing in then . For we have that where . The function equals zero at and has the derivative which is negative for values of . We can therefore see that is positive in the interval and it follows that for every choice of .
33.2 Angular LSH
We have that if is a monotone increasing function for . For we have that where . We note that and . Therefore, if for it holds that and is monotone increasing in the same interval. We have that and implying that in the interval.
33.3 Data-dependent LSH
Lemma 4.12.
Let and fix such that . Then we have that for every value of .
Proof.
We will compare and when is fixed at , or equivalently, . We can solve the quadratic equation to see that for we have that only when . The derivative of with respect to is negative when . Under this restriction we therefore have that for which is equivalent to in the fixed-weight setting. ∎
To compare -values over the full parameter space we use the following two lemmas.
Lemma 4.13.
For every choice of fixed let . Then is decreasing in for .
Proof.
The sign of the derivative of with respect to is equal to the sign of the function for . We have that and for which shows that in the interval. ∎
Lemma 4.14.
For we have that .
Proof.
For fixed consider as a function of in the interval . We want to show that for . At the endpoints the function takes the value [math]. Between the endpoints we find that and that is a quadratic form with only one solution in . By Lemma 4.12 we know that for and it holds that . Since , only in a single point in , and we can conclude that the lemma holds. ∎
Corollary 4.2.
For every choice of satisfying and we have that .
Proof.
If the property holds by Lemma 4.14. If we define new variables , setting and initially consider . In this setting we again have that . According to Lemma 4.13 it holds that is decreasing in for fixed . Therefore, as decreases to where we have that remains constant while increases. Since it held that at the initial values of it must also hold for . ∎
Numerical comparison of MinHash and Data-dep. LSH.
Comparing to we can verify numerically that even for fixed as low as we can find values of (for example such that .
Chapter 5 Adaptive similarity join
‘A light from the shadows shall spring’
Set similarity join is a fundamental and well-studied database operator. It is usually studied in the exact setting where the goal is to compute all pairs of sets that exceed a given similarity threshold (measured e.g. as Jaccard similarity). But set similarity join is often used in settings where 100% recall may not be important — indeed, where the exact set similarity join is itself only an approximation of the desired result set.
We present a new randomized algorithm for set similarity join that can achieve any desired recall up to 100%, and show theoretically and empirically that it significantly improves on existing methods. The present state-of-the-art exact methods are based on prefix-filtering, the performance of which depends on the prevalence of rare elements in the sets. Our method is robust against the absence of such structure in the data. At 90% recall our algorithm is often more than an order of magnitude faster than state-of-the-art exact methods, depending on how well a data set lends itself to prefix filtering. Our experiments on benchmark data sets also show that the method is several times faster than comparable approximate methods. Our algorithm makes use of recent theoretical advances in high-dimensional sketching and indexing.
34 Introduction
It is increasingly important for data processing and analysis systems to be able to work with data that is imprecise, incomplete, or noisy. Similarity join has emerged as a fundamental primitive in data cleaning and entity resolution over the last decade [21, 48, 141]. In this paper we focus on set similarity join: Given collections and of sets the task is to compute
[TABLE]
where is a similarity measure and is a threshold parameter. We deal with sets , where the number of distinct tokens can be naturally thought of as the dimensionality of the data.
Many measures of set similarity exist [51], but perhaps the most well-known such measure is the Jaccard similarity,
$$J(x, y) = \frac{|x \cap y|}{|x \cup y|}.$$
For example, the sets {IT, University, Copenhagen} and {University, Copenhagen, Denmark} have Jaccard similarity $2/4 = 1/2$, which could suggest that they both correspond to the same entity. In the context of entity resolution we want to find a set $T$ that contains a pair $(x, y)$ of records if and only if $x$ and $y$ correspond to the same entity. The quality of the result can be measured in terms of precision and recall, both of which should be as high as possible. We will be interested in methods that achieve 100% precision, but that might not have 100% recall. We refer to methods with 100% recall as exact, and others as approximate.
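As a small illustration of the definitions above, the following sketch computes the Jaccard similarity of the two example sets and evaluates a join by comparing all pairs. The function names `jaccard` and `naive_join` are hypothetical and chosen only for this example; the quadratic loop is the baseline that the methods in this chapter aim to avoid.

```python
from itertools import product


def jaccard(x: frozenset, y: frozenset) -> float:
    """Jaccard similarity |x ∩ y| / |x ∪ y| (taken to be 0.0 for two empty sets)."""
    union = len(x | y)
    return len(x & y) / union if union else 0.0


def naive_join(R, S, threshold):
    """Exact similarity join by comparing every pair; quadratic, for illustration only."""
    return [(x, y) for x, y in product(R, S) if jaccard(x, y) >= threshold]


x = frozenset({"IT", "University", "Copenhagen"})
y = frozenset({"University", "Copenhagen", "Denmark"})
print(jaccard(x, y))  # 0.5: two shared tokens out of four distinct tokens
```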
34.1 Our Contributions
We present a new approximate set similarity join algorithm: Chosen Path Similarity Join (CPSJoin). We cover its theoretical underpinnings, and show experimentally that it achieves high recall with a substantial speedup compared to state-of-the-art exact techniques. The key ideas behind CPSJoin are:
- A new recursive filtering technique inspired by the recently proposed ChosenPath index for set similarity search [56], adding new ideas to make the method parameter-free, near-linear space, and adaptive to a given data set.
- The use of efficient sketches for estimating set similarity [103] that take advantage of modern hardware.
We compare CPSJoin to the exact set similarity join algorithms in the comprehensive empirical evaluation of Mann et al. [107], using the same data sets, and to other approximate set similarity join methods suggested in the literature. We find that CPSJoin outperforms other approximate methods and scales better than exact methods when the sets are relatively large (100 tokens or more) and the similarity threshold is low (e.g. Jaccard similarity 0.5) where we see speedups of more than an order of magnitude at 90% recall. Our experiments on benchmark datasets show that exact methods are faster in the case of high similarity thresholds, when the average set size is small, and when sets have many rare elements, whereas approximate methods are faster in the case of low similarity thresholds and when sets are large. This finding is consistent with theory and is further corroborated by experiments on synthetic datasets.
34.2 Related Work
For space reasons we present just a sample of the most related previous work, and refer to the book of Augsten and Böhlen [21] for a survey of algorithms for exact similarity join in relational databases, covering set similarity joins as well as joins based on string similarity.
Exact Similarity Join.
Early work on similarity join focused on the important special case of detecting near-duplicates with similarity close to 1, see e.g. [31, 141]. A sequence of results starting with the seminal paper of Bayardo et al. [23] studied the range of thresholds that could be handled. Recently, Mann et al. [107] conducted a comprehensive study of 7 state-of-the-art algorithms for exact set similarity join for Jaccard similarity threshold . These algorithms all use the idea of prefix filtering [48], which generates a sequence of candidate pairs of sets that includes all pairs with similarity above the threshold. The methods differ in how much additional filtering is carried out. For example, [171] applies additional length and suffix filters to prune the candidate pairs.
Prefix filtering uses an inverted index that for each element stores a list of the sets in the collection containing that element. Given a set , assume that we wish to find all sets such that . A valid result set must be contained in at least one of the inverted lists associated with any subset of elements of , or we would have . In particular, to speed up the search, prefix filtering looks at the elements of that have the shortest inverted lists.
The main finding by Mann et al. is that while more advanced filtering techniques do yield speedups on some data sets, an optimized version of the basic prefix filtering method (referred to as “ALL”) is always competitive within a factor 2.16, and most often the fastest of the algorithms. For this reason we will be comparing our results against ALL.
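To make the prefix filtering idea concrete, here is a minimal sketch of candidate generation for a Jaccard self-join. It is not the ALL implementation evaluated by Mann et al.; the function name, the tie-breaking rule, and the small floating-point tolerance are assumptions made for this example. It relies on the standard observation that a pair with Jaccard similarity at least the threshold must overlap in at least `ceil(threshold * |x|)` tokens of each set `x`, and it assumes tokens are mutually comparable (for example integers).

```python
import math
from collections import Counter, defaultdict


def prefix_filter_candidates(sets, threshold):
    """Return index pairs (i, j), i < j, whose prefixes share a token.

    Every pair with Jaccard similarity >= threshold is among the candidates;
    candidates still have to be verified by computing the actual similarity.
    """
    freq = Counter(tok for s in sets for tok in s)
    # Global token order: rare tokens first, ties broken by token value.
    ordered = [sorted(s, key=lambda tok: (freq[tok], tok)) for s in sets]

    index = defaultdict(list)  # token -> ids of sets with the token in their prefix
    candidates = set()
    for i, toks in enumerate(ordered):
        # Required overlap for this set; the tolerance guards against
        # floating-point error pushing an exact product above an integer.
        need = math.ceil(threshold * len(toks) - 1e-9)
        for tok in toks[: len(toks) - need + 1]:
            for j in index[tok]:
                candidates.add((j, i))
            index[tok].append(i)
    return candidates
```

Because rare tokens come first in the global order, the inverted lists that are scanned tend to be short, which is why the method depends on the presence of rare tokens in the data.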
Locality-sensitive hashing.
Locality-sensitive hashing (LSH) is a theoretically well-founded randomized method for generating candidate pairs [78]. A family of locality-sensitive hash functions is a distribution over functions with the property that similar points (or sets in our case) are more likely to have the same function value. We know of only a few papers using LSH techniques to solve similarity join. Cohen et al. [61] used LSH techniques for set similarity join in a knowledge discovery context before the advent of prefix filtering. They sketch a way of choosing parameters suitable for a given data set, but we are not aware of existing implementations of this approach. Chakrabarti et al. [44] improved plain LSH with an adaptive similarity estimation technique, BayesLSH, that reduces the cost of checking candidate pairs and typically improves upon an implementation of the basic prefix filtering method by –. Our experiments include a comparison against both methods [61, 44]. We refer to the survey paper [125] for an overview of newer theoretical developments on LSH-based similarity joins, but point out that these developments have not matured sufficiently to yield practical improvements.
Distance estimation.
Similar to BayesLSH [44] we make use of algorithms for similarity estimation, but in contrast to BayesLSH we use algorithms that exploit bit-level parallelism. This approach works when there exists a way of picking a random hash function $h$ such that
$$\Pr[h(x) = h(y)] = \mathrm{sim}(x, y) \tag{9}$$
for every choice of sets $x$ and $y$. Broder et al. [35] presented such a hash function for Jaccard similarity, now known as “minhash” or “minwise hashing”. In the context of distance estimation, 1-bit minwise hashing of Li and König [103] maps minhash values to a compact sketch, often using just 1 or 2 machine words. Still, this is sufficient information to be able to estimate the Jaccard similarity of two sets $x$ and $y$ just based on the Hamming distance of their sketches.
Locality-sensitive mappings.
Several recent theoretical advances in high-dimensional indexing [16, 54, 56] have used an approach that can be seen as a generalization of LSH. We refer to this approach as locality-sensitive mappings (also known as locality-sensitive filters in certain settings). The idea is to construct a function , mapping a set into a set of machine words, such that:
- If then is nonempty with some fixed probability .
- If , then the expected intersection size is “small”.
Here the exact meaning of “small” depends on the difference , but in a nutshell, if it is the case that almost all pairs have similarity significantly below then we can expect for almost all pairs. Performing the similarity join amounts to identifying all candidate pairs for which (for example by creating an inverted index), and computing the similarity of each candidate pair. To our knowledge these indexing methods have not been tried out in practice, probably because they are rather complicated. An exception is the recent paper [56], which is relatively simple, and indeed our join algorithm is inspired by the index described in that paper.
35 Preliminaries
The CPSJoin algorithm solves the -similarity join problem with a probabilistic guarantee on recall, formalized as follows:
Definition 5.1.
An algorithm solves the similarity join problem with threshold $\lambda$ and recall probability $\varphi$ if for every pair $(x, y)$ in the exact join result the output $L$ of the algorithm satisfies $\Pr[(x, y) \in L] \geq \varphi$.
It is important to note that the probability is over the random choices made by the algorithm, and not over a random choice of $(x, y)$. This means that for any pair $(x, y)$ in the exact join result the probability that the pair is not reported in $r$ independent repetitions of the algorithm is bounded by $(1 - \varphi)^r$. For example if it takes just repetitions to bound the recall to at least .
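The effect of repetitions can be worked out directly: if a single run reports a given pair with probability at least phi, then r independent runs all miss it with probability at most (1 - phi)^r. The sketch below uses this to compute how many runs are enough for a target per-pair recall. The function names and the numbers in the usage lines are hypothetical and are not taken from the experiments in this chapter.

```python
import math


def miss_probability(phi: float, r: int) -> float:
    """Upper bound on the probability that r independent runs all miss a pair."""
    return (1.0 - phi) ** r


def repetitions_needed(phi: float, target: float) -> int:
    """Smallest r such that 1 - (1 - phi)**r >= target, for 0 < phi, target < 1."""
    if not (0.0 < phi < 1.0 and 0.0 < target < 1.0):
        raise ValueError("phi and target must lie strictly between 0 and 1")
    return max(1, math.ceil(math.log(1.0 - target) / math.log(1.0 - phi)))


# Hypothetical numbers: a run that finds each pair with probability 1/2
# needs 4 repetitions to reach a per-pair recall of at least 90%.
r = repetitions_needed(0.5, 0.9)
print(r, 1.0 - miss_probability(0.5, r))
```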
35.1 Similarity Measures
Our algorithm can be used with a broad range of similarity measures through randomized embeddings. This allows it to be used with, for example, Jaccard and cosine similarity thresholds.
Embeddings map data from one space to another while approximately preserving distances, with accuracy that can be tuned. In our case we are interested in embeddings that map data to sets of tokens. We can transform any so-called LSHable similarity measure $\mathrm{sim}$, where we can choose $h$ to make (9) hold, into a set similarity measure by the following randomized embedding: For a parameter $t$ pick hash functions $h_1, \dots, h_t$ independently from a family satisfying (9). The embedding of $x$ is the following set of size $t$:
$$f(x) = \{(i, h_i(x)) \mid i = 1, \dots, t\}.$$
It follows from (9) that the expected size of the intersection $f(x) \cap f(y)$ is $t \cdot \mathrm{sim}(x, y)$. Furthermore, it follows from standard concentration inequalities that the size of the intersection will be close to the expectation with high probability. For our experiments with Jaccard similarity thresholds , we found that gave sufficient precision for recall.
In summary we can perform the similarity join for any LSHable similarity measure by creating two corresponding relations $R' = \{f(x) \mid x \in R\}$ and $S' = \{f(y) \mid y \in S\}$, and computing $R' \bowtie_\lambda S'$ with respect to the similarity measure
$$B(f(x), f(y)) = \frac{|f(x) \cap f(y)|}{t}.$$
This measure is the special case of Braun-Blanquet similarity where the sets are known to have size $t$ [51]. Our implementation will take advantage of the set size being fixed, though it is easy to extend to general Braun-Blanquet similarity.
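The embedding can be sketched for Jaccard similarity with MinHash as follows. This is an illustration only: the hash function is a stand-in built from `hashlib.blake2b`, not the Zobrist hashing used in the implementation, and the names `token_hash`, `embed` and `braun_blanquet` are hypothetical. Each MinHash returns the token with the smallest hash value, so two sets agree on coordinate `i` with probability equal to their Jaccard similarity. Input sets are assumed to be non-empty.

```python
import hashlib


def token_hash(i, tok) -> int:
    """Deterministic 64-bit hash of token `tok` under the i-th hash function."""
    data = f"{i}:{tok}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def embed(x, t: int) -> frozenset:
    """Map a non-empty set x to the t pairs (i, h_i(x)), with h_i a MinHash."""
    return frozenset((i, min(x, key=lambda tok: token_hash(i, tok))) for i in range(t))


def braun_blanquet(a, b) -> float:
    """|a ∩ b| / max(|a|, |b|); equals |a ∩ b| / t for two embedded sets."""
    return len(a & b) / max(len(a), len(b))


x = frozenset(range(0, 100))
y = frozenset(range(50, 150))   # Jaccard similarity 50/150 = 1/3
fx, fy = embed(x, 256), embed(y, 256)
print(braun_blanquet(fx, fy))   # concentrates around 1/3 as t grows
```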
The class of LSHable similarity measures is large, as discussed in [49]. If approximation errors are tolerable, even edit distance can be handled by our algorithm [45, 172].
35.2 Notation
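We are interested in collections $S$ where each element $x \in S$ is a set with elements from some universe $[d] = \{1, \dots, d\}$. To avoid confusion we sometimes use “record” for $x \in S$ and “token” for the elements of $x$. Throughout this paper we will think of a record $x$ both as a set of tokens from $[d]$, as well as a vector from $\{0, 1\}^d$, where:
$$x_i = \begin{cases} 1 & \text{if } i \in x, \\ 0 & \text{if } i \notin x. \end{cases}$$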
It is clear that these representations are equivalent. The set is equivalent to , is equivalent to , etc.
36 Overview of approach
Our high-level approach is recursive and works as follows. To compute $R \bowtie_\lambda S$ we consider each $x \in R$ and either:
1. compare $x$ to each record in $S$ (referred to as “brute forcing” $x$), or
2. create several subproblems $S_i \bowtie_\lambda R_i$ with $x \in R_i \subseteq R$, $S_i \subseteq S$, and solve them recursively.
The approach of [56] corresponds to choosing option 2 until reaching a certain level of the recursion, where we finish the recursion by choosing option 1. This makes sense for certain worst-case data sets, but we propose an improved parameter-free method that is better at adapting to the given data distribution. In our method the decision on which option to choose depends on the size of and the average similarity of to the records of . We choose option 1 if has size below some (constant) threshold, or if the average Braun-Blanquet similarity of and , , is close to the threshold . In the former case it is cheap to finish the recursion. In the latter case many records will have larger than or close to , so we do not expect to be able to produce output pairs with in sublinear time in .
If neither of these pruning conditions applies we choose option 2 and include in recursive subproblems as described below. But first we note that the decision of which option to use can be made efficiently for each , since the average similarity of pairs from can be computed from token frequencies in time . Pseudocode for a self-join version of CPSJoin is provided in Algorithms 1 and 2.
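The remark about token frequencies can be illustrated for a self-join where every record has exactly `t` tokens. The sum of the intersection sizes between a record and all other records equals the sum, over the record's tokens, of how many other records contain that token. The sketch below uses this identity; the function name is hypothetical, and excluding the record itself from the average is an assumption of this example.

```python
from collections import Counter


def average_similarities(S, t):
    """Average Braun-Blanquet similarity of each record to the other records.

    Assumes len(S) >= 2 and that every record is a set of exactly t tokens.
    After one pass to build `count`, each record costs time linear in t.
    """
    count = Counter(tok for x in S for tok in x)
    n = len(S)
    # sum over y != x of |x ∩ y|  ==  sum over tokens j in x of (count[j] - 1)
    return [sum(count[tok] - 1 for tok in x) / (t * (n - 1)) for x in S]


S = [{1, 2}, {2, 3}, {3, 4}]
print(average_similarities(S, t=2))  # [0.25, 0.5, 0.25]
```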
36.1 Recursion
We would like to ensure that for each pair the pair is computed in one of the recursive subproblems, i.e., that for some . In particular, we want the expected number of subproblems containing to be at least 1, i.e.,
[TABLE]
To achieve (11) for every pair we proceed as follows: for each we recurse with probability on the subproblem with sets
[TABLE]
where denotes the size of records in and . It is not hard to check that (11) is satisfied for every pair with . Of course, expecting one subproblem to contain does not directly imply a good probability that is contained in at least one subproblem. But it turns out that we can use results from the theory of branching processes to show such a bound; details are provided in section 37.
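A single splitting step can be sketched as follows. The sampling probability is not legible in this text, so the sketch assumes each token is chosen independently with probability `1 / (lam * t)`, where `lam` is the threshold and `t` the record size. Under that assumption, a pair sharing at least `lam * t` tokens lands together in at least one subproblem in expectation, which is the property asked for above. The function name `split` is hypothetical.

```python
import random
from collections import defaultdict


def split(S, lam, t, rng):
    """One randomized splitting step over a collection of token sets.

    Each token is chosen independently with probability 1/(lam*t) (an
    assumption of this sketch); every chosen token j defines the subproblem
    consisting of the records that contain j.
    """
    p = min(1.0, 1.0 / (lam * t))
    chosen = {}                      # token -> was it selected in this step?
    buckets = defaultdict(list)
    for x in S:
        for tok in x:
            if tok not in chosen:
                chosen[tok] = rng.random() < p
            if chosen[tok]:
                buckets[tok].append(x)
    return buckets


# Two records of size t = 16 sharing 8 tokens (similarity 0.5 = lam):
x = frozenset(range(16))
y = frozenset(range(8, 24))
rng = random.Random(0)
trials = 4000
shared = sum(
    sum(1 for recs in split([x, y], 0.5, 16, rng).values() if len(recs) == 2)
    for _ in range(trials)
)
print(shared / trials)  # empirical mean number of shared subproblems, near 1
```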
37 Chosen Path Similarity Join
The CPSJoin algorithm solves the -set similarity join (Definition 5.1) for every choice of and with a guarantee on that we will lower bound in the analysis.
To simplify the exposition we focus on a self-join version where we are given a set of subsets of and we wish to report . Handling a general join follows the overview in section 36 and requires no new ideas: Essentially consider a self-join on but make sure to consider only pairs in for output. We also make the simplifying assumption that all sets in have a fixed size . As argued in section 35.1 the general case can be reduced to this one by embedding.
37.1 Description
The CPSJoin algorithm (see Algorithm 1 for pseudocode) works by recursively splitting the data set on elements of that are selected according to a random process, forming a recursion tree with at the root and subsets of that are non-increasing in size as we get further down the tree. The randomized splitting has the property that the probability of a pair of sets being in a random subproblem is increasing as a function of .
Before each recursive splitting step we run the BruteForce subprocedure (see Algorithm 2 for pseudocode) that identifies subproblems that are best solved by brute force. It has two parts:
- If is below some constant size, controlled by the parameter limit, we report exactly using a simple loop with distance computations (BruteForcePairs) and exit the recursion. In our experiments we have set limit to , with the precise choice seemingly not having a large effect as shown experimentally in Section 39.2.
- If is larger than limit the second part activates: for every we check whether the expected number of distance computations involving is going to decrease by continuing the recursion. If this is not the case, we immediately compare against every point in (BruteForcePoint), reporting close pairs, and proceed by removing from . The BruteForce procedure is then run again on the reduced set.
This procedure where we choose to handle some points by brute force crucially separates our algorithm from many other approximate similarity join methods in the literature that typically are LSH-based [126, 61]. By efficiently being able to remove points at the “right” time, before they generate too many expensive comparisons further down the tree, we are able to beat the performance of other approximate similarity join techniques in both theory and practice. Another benefit of this approach is that it reduces the number of parameters compared to the usual LSH setting where the depth of the tree has to be selected by the user.
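The interplay between the BruteForce step and the splitting step can be sketched as below. This is a simplified reading of the description above, not a transcription of Algorithms 1 and 2 (which are not reproduced here). Several details are assumptions of the sketch: the brute-force condition is taken to be "average similarity to the rest of the node exceeds `(1 - eps) * lam`", tokens are sampled with probability `1 / (lam * t)`, records are distinct `frozenset`s of exactly `t` tokens, and `max_depth` is a safety guard added for the example.

```python
import random
from collections import Counter, defaultdict
from itertools import combinations


def similar(x, y, lam, t):
    """Braun-Blanquet test for records that all have exactly t tokens."""
    return len(x & y) >= lam * t


def cpsjoin(S, lam, t, rng, limit=4, eps=0.1, depth=0, max_depth=64, out=None):
    """One run of a simplified CPSJoin-style self-join.

    Returns a set of unordered pairs. Every reported pair is verified, but a
    single run may miss pairs, so recall is only probabilistic.
    """
    out = set() if out is None else out
    S = list(S)
    while True:
        # Part 1: small (or, as a guard, very deep) nodes are solved exactly.
        if len(S) <= limit or depth >= max_depth:
            out.update(frozenset(p) for p in combinations(S, 2)
                       if similar(p[0], p[1], lam, t))
            return out
        # Part 2: brute force one record whose average similarity to the rest
        # of the node is already close to the threshold, then check again.
        count = Counter(tok for x in S for tok in x)
        n = len(S)
        hot = next((x for x in S
                    if sum(count[tok] - 1 for tok in x) > (1 - eps) * lam * t * (n - 1)),
                   None)
        if hot is None:
            break
        out.update(frozenset((hot, y)) for y in S
                   if y != hot and similar(hot, y, lam, t))
        S.remove(hot)
    # Splitting step: recurse on the records containing each chosen token.
    p = min(1.0, 1.0 / (lam * t))
    chosen = {tok for tok in count if rng.random() < p}
    buckets = defaultdict(list)
    for x in S:
        for tok in x & chosen:
            buckets[tok].append(x)
    for bucket in buckets.values():
        if len(bucket) > 1:
            cpsjoin(bucket, lam, t, rng, limit, eps, depth + 1, max_depth, out)
    return out
```

To raise recall, the function would be called several times with fresh randomness and the outputs merged; no depth parameter has to be chosen up front, which is the point made in the paragraph above.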
37.2 Comparison to Chosen Path
The CPSJoin algorithm is inspired by the Chosen Path algorithm [56] for the approximate near neighbor problem and uses the same underlying random splitting tree that we will refer to as the Chosen Path Tree. In the approximate near neighbor problem, the task is to construct a data structure that takes a query point and correctly reports an approximate near neighbor, if such a point exists in the data set. Using the Chosen Path data structure directly to solve the -set similarity join problem has several drawbacks that we avoid in the CPSJoin algorithm. First, the Chosen Path data structure is parameterized in a non-adaptive way to provide guarantees for worst-case data, vastly increasing the amount of work done compared to the optimal parameterization when data is not worst-case. Our recursion rule avoids this and instead continuously adapts to the distribution of distances as we traverse down the tree. Secondly, the data structure uses space where , storing the Chosen Path Tree of size for every data point. The CPSJoin algorithm, instead of storing the whole tree, essentially performs a depth-first traversal, using only near-linear space in in addition to the space required to store the output. Finally, the Chosen Path data structure only has to report a single point that is approximately similar to a query point, and can report points with similarity . To solve the approximate similarity join problem the CPSJoin algorithm has to satisfy reporting guarantees for every pair of points in the exact join.
37.3 Analysis
The Chosen Path Tree for a set is defined by a random process: at each node, starting from the root, we sample a random hash function and construct children for every element such that . Nodes at depth in the tree are identified by their path . Formally, the set of nodes at depth in the Chosen Path Tree for is given by
[TABLE]
where denotes vector concatenation and . The subset of the data set that survives to a node with path is given by
[TABLE]
The random process underlying the Chosen Path Tree belongs to the well studied class of Galton-Watson branching processes [87]. Originally these were devised to answer questions about the growth and decline of family names in a model of population growth assuming i.i.d. offspring for every member of the population across generations [166]. In order to make statements about the properties of the CPSJoin algorithm we study in turn the branching processes of the Chosen Path Tree associated with a point , a pair of points , and a set of points . Note that we use the same random hash functions for different points in .
Brute forcing.
The BruteForce subprocedure described by Algorithm 2 takes two global parameters: limit and $\varepsilon$. The parameter limit controls the minimum size of before we discard the CPSJoin algorithm for a simple exact similarity join by brute force pairwise distance computations. The second parameter, $\varepsilon$, controls the sensitivity of the BruteForce step to the expected number of comparisons that a point will generate if allowed to continue in the branching process. The larger $\varepsilon$, the more aggressively we will resort to the brute force procedure. In practice we typically think of $\varepsilon$ as a small constant, say , but for some of our theoretical results we will need a sub-constant setting of $\varepsilon$ to show certain running time guarantees. The BruteForce step removes a point from the Chosen Path branching process, instead opting to compare it against every other point , if it satisfies the condition
[TABLE]
In the pseudocode of Algorithm 2 we let count denote a hash table that keeps track of the number of times each element appears in . This allows us to evaluate the condition in equation (37.3) for an element in time by rewriting it as
[TABLE]
We claim that this condition minimizes the expected number of comparisons performed by the algorithm: Consider a node in the Chosen Path Tree associated with a set of points while running the CPSJoin algorithm. For a point , we can either remove it from immediately at a cost of comparisons, or we can choose to let continue in the branching process (possibly into several nodes) and remove it later. The expected number of comparisons if we let it continue levels before removing it from every node that it is contained in, is given by
[TABLE]
This expression is convex and increasing in the similarity between and other points , allowing us to state the following remark:
Remark 5.1 (Recursion).
Let and consider a set containing a point such that satisfies the recursion condition in equation (37.3). Then the expected number of comparisons involving if we continue branching exceeds at every depth . If does not satisfy the condition, the opposite is observed.
Tree depth.
We proceed by bounding the maximal depth of the set of paths in the Chosen Path Tree that are explored by the CPSJoin algorithm. Having this information will allow us to bound the space usage of the algorithm and will also form part of the argument for the correctness guarantee. Assume that the parameter limit in the BruteForce step is set to some constant value, say . Consider a point and let be the subset of points in that are not too similar to . For every the expected number of vertices in the Chosen Path Tree at depth that contain both and is upper bounded by
[TABLE]
Since we use Markov’s inequality to show the following bound:
Lemma 5.1.
Let satisfy that then the probability that there exists a vertex at depth in the Chosen Path Tree that contains and is at most .
If does not share any paths with points that have similarity that falls below the threshold for brute forcing, then the only points that remain are ones that will cause to be brute forced. This observation leads to the following probabilistic bound on the tree depth:
Lemma 5.2.
With high probability the maximal depth of paths explored by the CPSJoin algorithm is .
Correctness.
Let and be two sets of equal size such that . We are interested in lower bounding the probability that there exists a path of length in the Chosen Path Tree that has been chosen by both and , i.e. . Agresti [3] showed an upper bound on the probability that a branching process becomes extinct after at most steps. We use it to show the following lower bound on the probability of a close pair of points colliding at depth in the Chosen Path Tree.
Lemma 5.3 (Agresti [3]).
If then for every we have that .
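The branching-process view can be explored numerically. The sketch below models the shared paths of a close pair as a Galton-Watson process with Binomial(m, p) offspring, where m stands for the number of shared tokens and p for the per-token sampling probability (both names, and the choice p = 1/m giving mean offspring 1, are assumptions of this example). It iterates the offspring generating function to obtain the probability that generation k is non-empty. The value 1/(k+1) printed next to it is included only for comparison; the exact form of the bound in the lemma is not legible in this text, so treating it as 1/(k+1) is an assumption.

```python
def survival_probabilities(m: int, p: float, depth: int):
    """Pr[generation k is non-empty] for k = 1..depth.

    Galton-Watson process with Binomial(m, p) offspring, started from a single
    root. Uses q_{k+1} = f(q_k) with f(s) = (1 - p + p*s)**m, where q_k is the
    probability of extinction by generation k.
    """
    q = 0.0  # the root itself is alive, so extinction by generation 0 is impossible
    result = []
    for _ in range(depth):
        q = (1.0 - p + p * q) ** m
        result.append(1.0 - q)
    return result


probs = survival_probabilities(m=10, p=0.1, depth=20)
for k, pr in enumerate(probs, start=1):
    print(k, round(pr, 4), round(1.0 / (k + 1), 4))
```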
The bound on the depth of the Chosen Path Tree for explored by the CPSJoin algorithm in Lemma 5.2 then implies a lower bound on .
Lemma 5.4.
Let be constant. Then for every set of points the CPSJoin algorithm solves the set similarity join problem with .
Remark 5.2.
This analysis is very conservative: if either or is removed by the BruteForce step prior to reaching the maximum depth then it only increases the probability of collision. We note that similar guarantees can be obtained when using fast pseudorandom hash functions as shown in the paper introducing the Chosen Path algorithm [56].
Space usage.
We can obtain a trivial bound on the space usage of the CPSJoin algorithm by combining Lemma 5.2 with the observation that every call to CPSJoin on the stack uses additional space at most . The result is stated in terms of working space: the total space usage when not accounting for the space required to store the data set itself (our algorithm uses references to data points and only reads the data when performing comparisons) as well as disregarding the space used to write down the list of results.
Lemma 5.5.
With high probability the working space of the CPSJoin algorithm is at most .
Remark 5.3.
We conjecture that the expected working space is due to the size of being geometrically decreasing in expectation as we proceed down the Chosen Path Tree.
Running time.
We will bound the running time of a solution to the general set similarity self-join problem that uses several calls to the CPSJoin algorithm in order to piece together a list of results . In most of the previous related work, inspired by Locality-Sensitive Hashing, the fine-grainedness of the randomized partition of space, here represented by the Chosen Path Tree in the CPSJoin algorithm, has been controlled by a single global parameter [78, 126]. In the Chosen Path setting this rule would imply that we run the splitting step without performing any brute force comparison until reaching depth where we proceed by comparing against every other point in nodes containing , reporting close pairs. In recent work by Ahle et al. [4] it was shown how to obtain additional performance improvements by setting an individual depth for every . We refer to these stopping strategies as global and individual, respectively. Together with our recursion strategy, this gives rise to the following stopping criteria for when to compare a point against everything else contained in a node:
- •
Global: Fix a single depth for every .
- •
Individual: For every fix a depth .
- •
Adaptive: Remove when the expected number of comparisons is non-decreasing in the tree-depth.
Let denote the running time of our similarity join algorithm. We aim to show the following relation between the running times of the different stopping criteria when applied to the Chosen Path Tree:
[TABLE]
First consider the global strategy. We set to balance the contribution to the running time from the expected number of vertices containing a point, given by , and the expected number of comparisons between pairs of points at depth , resulting in the following expected running time for the global strategy:
[TABLE]
The global strategy is a special case of the individual case, and it must therefore hold that . The expected running time for the individual strategy is upper bounded by:
[TABLE]
We will now argue that the expected running time of the CPSJoin algorithm under the adaptive stopping criteria is no more than a constant factor greater than when we set the global parameters of the BruteForce subroutine as follows:
[TABLE]
Let $x \in S$ and consider a path $p$ where $x$ is removed from $S_p$ by the BruteForce step. Let $k$ denote the depth of the node (length of $p$) at which $x$ is removed. Compared to the individual strategy that removes $x$ at depth $k_x$ we are in one of three cases, also displayed in Figure 11.
1. The point $x$ is removed from $p$ at depth $k = k_x$.
2. The point $x$ is removed from $p$ at depth $k < k_x$.
3. The point $x$ is removed from $p$ at depth $k > k_x$.
The underlying random process behind the Chosen Path Tree is not affected by our choice of termination strategy. In the first case we therefore have that the expected running time is upper bounded by the same (conservative) expression as the one used by the individual strategy. In the second case we remove earlier than we would have under the individual strategy. For every we have that since for larger values of the expected number of nodes containing exceeds . We therefore have that . Let denote the set of points in the node where was removed by the BruteForce subprocedure. There are two rules that could have triggered the removal of : Either or the condition in equation (37.3) was satisfied. In the first case, the expected cost of following the individual strategy would have been simply from the children containing in the next step. This is no more than a constant factor smaller than the adaptive strategy. In the second case, when the condition in equation (37.3) is activated we have that the expected number of comparisons involving resulting from if we had continued under the individual strategy is at least
[TABLE]
which is no better than what we get with the adaptive strategy. In the third case where we terminate at depth , if we retrace the path to depth we know that was not removed in this node, implying that the expected number of comparisons when continuing the branching process on is decreasing compared to removing at depth . We have shown that the expected running time of the adaptive strategy is no greater than a constant times the expected running time of the individual strategy.
We are now ready to state our main theoretical contribution, stated below as Theorem 5.1. The theorem combines the above argument that compares the adaptive strategy against the individual strategy together with Lemma 5.2 and Lemma 5.4, and uses runs of the CPSJoin algorithm to solve the set similarity join problem for every choice of constant parameters .
Theorem 5.1.
For every LSHable similarity measure and every choice of constant threshold and probability of recall we can solve the -set similarity join problem on every set of points using working space and with expected running time
[TABLE]
38 Implementation
We implement an optimized version of the CPSJoin algorithm for solving the Jaccard similarity self-join problem. In our experiments (described in Section 39) we compare the CPSJoin algorithm against the approximate methods of MinHash LSH [78, 35] and BayesLSH [44], as well as the AllPairs [23] exact similarity join algorithm. The code for our experiments is written in C++ and uses the benchmarking framework and data sets of the recent experimental survey on exact similarity join algorithms by Mann et al. [107]. For our implementation we assume that each set is represented as a list of 32-bit unsigned integers. We proceed by describing the details of each implementation in turn.
38.1 Chosen Path Similarity Join
The implementation of the CPSJoin algorithm follows the structure of the pseudocode in Algorithm 1 and Algorithm 2, but makes use of a few heuristics, primarily sampling and sketching, in order to speed things up. The parameter setting is discussed and investigated experimentally in section 39.2.
Preprocessing.
Before running the algorithm we use the embedding described in section 35.1. Specifically independent MinHash functions are used to map each set to a list of hash values . The MinHash function is implemented using Zobrist hashing [173] from 32 bits to 64 bits with 8-bit characters. We sample a MinHash function by sampling a random Zobrist hash function and let . Zobrist hashing (also known as simple tabulation hashing) has been shown theoretically to have strong MinHash properties and is very fast in practice [134, 159]. We set in our experiments, see discussion later.
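Zobrist (simple tabulation) hashing from 32 bits to 64 bits with 8-bit characters can be sketched as follows: the key is split into four bytes, each byte indexes its own table of 256 random 64-bit words, and the four words are XORed. The class and function names are hypothetical, and returning the minimizing token (rather than the minimum hash value) as the MinHash value is a choice made for this example.

```python
import random


class Tabulation32:
    """Simple tabulation hash from 32-bit keys to 64-bit values."""

    def __init__(self, rng: random.Random):
        # One table of 256 random 64-bit words per 8-bit character of the key.
        self.tables = [[rng.getrandbits(64) for _ in range(256)] for _ in range(4)]

    def __call__(self, key: int) -> int:
        h = 0
        for i in range(4):
            h ^= self.tables[i][(key >> (8 * i)) & 0xFF]
        return h


def minhash(z: Tabulation32, tokens):
    """MinHash value of a non-empty set: the token with the smallest hash."""
    return min(tokens, key=z)


z = Tabulation32(random.Random(1))
print(minhash(z, {3, 17, 4242, 99999}))
```

In a compiled implementation each evaluation is four table lookups and three XORs, which is one reason this scheme is fast in practice.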
During preprocessing we also prepare sketches using the 1-bit minwise hashing scheme of Li and König [103]. Let denote the length in 64-bit words of a sketch of a set . We construct sketches for a data set by independently sampling MinHash functions and Zobrist hash functions that map from 32 bits to 1 bit. The th bit of the sketch is then given by . In the experiments we set .
Similarity estimation using sketches.
We use 1-bit minwise hashing sketches for fast similarity estimation in the BruteForcePairs and BruteForcePoint subroutines of the BruteForce step of the CPSJoin algorithm. Given two sketches, and , we compute the number of bits in which they differ by going through the sketches word for word, computing the popcount of their XOR using the _mm_popcnt_u64 intrinsic, which gcc compiles into a single instruction on modern hardware. Let denote the estimated similarity of a pair of sets . If is below a threshold , we exclude the pair from further consideration. If the estimated similarity is greater than we compute the exact similarity and report the pair if .
The speedup from using sketches comes at the cost of introducing false negatives: A pair of sets with may have an estimated similarity less than , causing us to miss it. We let denote a parameter for controlling the false negative probability of our sketches and set such that for sets with we have that . In our experiments we set the sketch false negative probability to be .
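The sketch-based filter can be illustrated in a few lines. Two 1-bit minwise sketches agree on a bit with probability (1 + J)/2, so with d differing bits out of n the estimate is J ≈ 1 - 2d/n. The code below packs a sketch into a Python integer and counts differing bits via XOR; it uses `hashlib.blake2b` as a stand-in for the Zobrist hash functions of the implementation, and all function names and the `cutoff` parameter are hypothetical.

```python
import hashlib


def _h64(label: str, tok) -> int:
    """Deterministic 64-bit hash of a token under the hash function `label`."""
    data = f"{label}:{tok}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def one_bit_sketch(tokens, n_bits: int) -> int:
    """Pack n_bits 1-bit minwise hashes of a non-empty set into one integer."""
    sketch = 0
    for i in range(n_bits):
        winner = min(tokens, key=lambda tok: _h64(f"min{i}", tok))  # i-th MinHash
        sketch |= (_h64(f"bit{i}", winner) & 1) << i                # hashed to 1 bit
    return sketch


def estimate_jaccard(a: int, b: int, n_bits: int) -> float:
    """Estimate J from the Hamming distance of two sketches."""
    differing = bin(a ^ b).count("1")   # popcount of the XOR
    return 1.0 - 2.0 * differing / n_bits


def passes_filter(a: int, b: int, n_bits: int, cutoff: float) -> bool:
    """Keep a pair for exact verification only if its estimate reaches cutoff."""
    return estimate_jaccard(a, b, n_bits) >= cutoff


n = 512
sx = one_bit_sketch(frozenset(range(0, 100)), n)
sy = one_bit_sketch(frozenset(range(50, 150)), n)      # true J = 1/3
sz = one_bit_sketch(frozenset(range(1000, 1100)), n)   # true J = 0
print(estimate_jaccard(sx, sy, n), estimate_jaccard(sx, sz, n))
```

Choosing `cutoff` below the join threshold trades a few extra exact verifications for a lower false negative probability, which is the role of the sketch threshold described above.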
Splitting step.
In the recursive step of the CPSJoin algorithm (Algorithm 1) the set is split into buckets using the following heuristic: Instead of sampling a random hash function and evaluating it on each element , we sample an expected elements from and split according to the corresponding minhash values from the preprocessing step. This saves the linear overhead in the size of our sets , reducing the time spent placing each set into buckets to . Internally, a collection of sets is represented as a C++ std::vector<uint32_t> of set ids. The collection of buckets is implemented using Google’s dense_hash hash map implementation from the sparse_hash package [81].
BruteForce step.
Having reduced the overhead for each set to in the splitting step, we wish to do the same for the BruteForce step (described in Algorithm 2), at least in the case where we do not call the BruteForcePairs or BruteForcePoint subroutines. The main problem is that we spend time for each set when constructing the count hash map and estimating the average similarity of to sets in . To get around this we construct a 1-bit minwise hashing sketch of length for the set using sampling and our precomputed 1-bit minwise hashing sketches. The sketch is constructed as follows: Randomly sample elements of and set the th bit of to be the th bit of the th sample from . This allows us to estimate the average similarity of a set to sets in in time using word-level parallelism. A set is removed from if its estimated average similarity is greater than . To further speed up the running time we only call the BruteForce subroutine once for each call to CPSJoin, calling BruteForcePoint on all points that pass the check rather than recomputing each time a point is removed. Pairs of sets that pass the sketching check are verified using the same verification procedure as the AllPairs implementation by Mann et al. [107]. In our experiments we set the parameter . Duplicates are removed by sorting and performing a single linear scan.
Repetitions.
In theory, for any constant desired recall it suffices to perform independent repetitions of the CPSJoin algorithm. In practice this number of repetitions is prohibitively large, and we therefore fix the number of independent repetitions used in our experiments at ten. With this setting we were able to achieve more than recall across all datasets and similarity thresholds.
38.2 MinHash LSH
We implement a locality-sensitive hashing similarity join using MinHash according to the pseudocode in Algorithm 3. A single run of the MinHash algorithm can be divided into two steps: First we split the sets into buckets according to the hash values of concatenated MinHash functions . Next we iterate over all non-empty buckets and run BruteForcePairs to report all pairs of points with similarity above the threshold . The BruteForcePairs subroutine is shared between the MinHash and CPSJoin implementations. MinHash therefore uses 1-bit minwise sketches for similarity estimation in the same way as in the implementation of the CPSJoin algorithm described above.
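The two steps of a single MinHash run can be sketched in C++ as follows. This is only an illustration under stated assumptions: the MinHash values of each set are taken to be precomputed, the bucket key is the tuple of selected values stored in a `std::map` rather than a hashed key in a dense hash map, and the names `split_into_buckets` and `candidate_pairs` are hypothetical. Candidate generation here stands in for BruteForcePairs and performs no similarity check.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using SetId = std::uint32_t;
using MinHashes = std::vector<std::uint32_t>;  // precomputed values per set
using Buckets = std::map<std::vector<std::uint32_t>, std::vector<SetId>>;

// Step 1: place each set in the bucket keyed by the concatenation of the
// MinHash values selected by `chosen` (indices of the sampled functions).
Buckets split_into_buckets(const std::vector<MinHashes>& minhashes,
                           const std::vector<std::size_t>& chosen) {
    Buckets buckets;
    for (SetId id = 0; id < minhashes.size(); ++id) {
        std::vector<std::uint32_t> key;
        key.reserve(chosen.size());
        for (std::size_t f : chosen) key.push_back(minhashes[id][f]);
        buckets[key].push_back(id);
    }
    return buckets;
}

// Step 2: enumerate all pairs inside each bucket. A real join would verify
// each pair against the similarity threshold before reporting it.
std::vector<std::pair<SetId, SetId>> candidate_pairs(const Buckets& buckets) {
    std::vector<std::pair<SetId, SetId>> pairs;
    for (const auto& entry : buckets) {
        const std::vector<SetId>& ids = entry.second;
        for (std::size_t i = 0; i < ids.size(); ++i) {
            for (std::size_t j = i + 1; j < ids.size(); ++j) {
                pairs.emplace_back(ids[i], ids[j]);
            }
        }
    }
    return pairs;
}
```

Concatenating more functions yields more, smaller buckets and fewer candidate pairs per run, at the price of a lower chance that a truly similar pair shares a bucket.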
The parameter can be set for each dataset and similarity threshold to minimize the combined cost of lookups and similarity estimations performed by the algorithm. This approach was mentioned by Cohen et al. [61] but we were unable to find an existing implementation. In practice we set to the value that results in the minimum estimated running time when running the first part (splitting step) of the algorithm for values of in the range and estimating the running time by looking at the number of buckets and their sizes. Once is fixed we know that each repetition of the algorithm has probability at least of reporting a pair with . For a desired recall we can therefore set . In our experiments we report the actual number of repetitions required to obtain a desired recall rather than using the setting of required for worst-case guarantees.
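The relation between per-repetition success probability and the number of repetitions can be made concrete with a small calculation. The expressions in the paragraph above were lost in extraction, so the following is a sketch under assumptions rather than the author's formula: it assumes a pair at the similarity threshold `lambda` collides on `k` concatenated MinHash functions with probability `lambda^k`, and it returns the smallest number of independent repetitions whose combined success probability reaches the desired recall. The function name and parameter names are hypothetical.

```cpp
#include <cmath>
#include <cstdint>

// Smallest L such that 1 - (1 - p)^L >= recall, where p = lambda^k is the
// assumed per-repetition probability of reporting a pair at the threshold.
// Preconditions: 0 < lambda <= 1 and 0 < recall < 1.
std::uint64_t repetitions_for_recall(double lambda, unsigned k, double recall) {
    const double p = std::pow(lambda, static_cast<double>(k));
    if (p >= 1.0) return 1;  // every repetition succeeds
    const double needed = std::log(1.0 - recall) / std::log1p(-p);
    return static_cast<std::uint64_t>(std::ceil(needed));
}
```

For example, with `lambda = 0.5`, `k = 3`, and a target recall of 0.9, the per-repetition probability is 0.125 and the function returns 18. This is a worst-case figure for pairs exactly at the threshold, which is why the number of repetitions observed in practice can be smaller.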
38.3 AllPairs
To compare our approximate methods against a state-of-the-art exact similarity join we use Bayardo et al.’s AllPairs algorithm [23] as recently implemented in the set similarity join study by Mann et al. [107]. The study by Mann et al. compares implementations of several different exact similarity join methods and finds that the simple AllPairs algorithm is most often the fastest choice. Furthermore, for Jaccard similarity, the AllPairs algorithm was at most times slower than the best out of six different competing algorithms across all the data sets and similarity thresholds used, and for most runs AllPairs is at most slower than the best exact algorithm (see Table 7 in Mann et al. [107]). Since our experiments run in the same framework, using the same datasets and the same thresholds as Mann et al.’s study, we consider their AllPairs implementation to be a good representative of exact similarity join methods for Jaccard similarity.
38.4 BayesLSH
For a comparison against previous experimental work on approximate similarity joins we use an implementation of BayesLSH in C as provided by the BayesLSH authors [44, 43]. The BayesLSH package features a choice between AllPairs and LSH as the candidate generation method. For the verification step there is a choice between BayesLSH and BayesLSH-lite. Both verification methods use sketching to estimate similarities between candidate pairs. The difference between BayesLSH and BayesLSH-lite is that the former uses sketching to estimate the similarity of pairs that pass the sketching check, whereas the latter uses an exact similarity computation if a pair passes the sketching check. Since the approximate methods in our CPSJoin and MinHash implementations correspond to the approach of BayesLSH-lite we restrict our experiments to this choice of verification algorithm. In our experiments we will use BayesLSH to represent the faster of the two candidate generation methods, combined with BayesLSH-lite for the verification step.
39 Experiments
We run experiments using the implementations of CPSJoin, MinHash, BayesLSH, and AllPairs described in the previous section. In the experiments we perform self-joins under Jaccard similarity for similarity thresholds . We are primarily interested in measuring the join time of the algorithms, but we also look at the number of candidate pairs considered by the algorithms during the join as a measure of performance. Note that the preprocessing step of the approximate methods only has to be performed once for each set and similarity measure, and can be re-used for different similarity joins. We therefore do not count it towards our reported join times. In practice the preprocessing time is at most a few minutes for the largest data sets.
Data sets.
The performance is measured across real-world data sets along with synthetic data sets described in Table 3. All datasets except for the TOKENS datasets were provided by the authors of [107], where descriptions and sources for each data set can also be found. Note that we have excluded a synthetic ZIPF dataset used in the study by Mann et al. [107] because it has no results for our similarity thresholds of interest. Experiments are run on versions of the datasets where duplicate records are removed and any records containing only a single token are ignored.
In addition to the datasets from the study of Mann et al. we add three synthetic datasets TOKENS10K, TOKENS15K, and TOKENS20K, designed to showcase the robustness of the approximate methods. These datasets have relatively few unique tokens, but each token appears in many sets. Each of the TOKENS datasets was generated from a universe of tokens () and each token is contained in respectively, , , and different sets as denoted by the name. The sets in the TOKENS datasets were generated by sampling a random subset of the set of possible tokens, rejecting tokens that had already been used in more than the maximum number of sets ( for TOKENS10K). To sample sets with expected Jaccard similarity the size of our sampled sets should be set to . The TOKENS datasets each have random sets planted with expected Jaccard similarity . This ensures an increasing number of results for our experiments where we use thresholds . The remaining sets have expected Jaccard similarity . We believe that the TOKENS datasets give a good indication of the performance on real-world data that has the property that most tokens appear in a large number of sets.
Recall.
In our experiments we aim for a recall of at least for the approximate methods. In order to achieve this for the CPSJoin and MinHash algorithms we perform a number of repetitions after the preprocessing step, stopping when the desired recall has been achieved. This is done by measuring the recall against the exact result of AllPairs and stopping when reaching . In practice this approach is not feasible as the size of the true result set is not known. However, it can be efficiently estimated using sampling if it is not too small. Another approach is to perform the number of repetitions required to obtain the theoretical guarantees on recall as described for CPSJoin in Section 37.3 and for MinHash in Section 38.2. Unfortunately, with our current analysis of the CPSJoin algorithm the number of repetitions required to theoretically guarantee a recall of far exceeds the number required in practice, as observed in our experiments where ten independent repetitions always suffice. For BayesLSH using LSH as the candidate generation method, the recall probability with the default parameter setting is , although we experience a recall closer to in our experiments.
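The stopping rule used in the experiments, repeating until the measured recall against the exact result reaches the target, can be sketched as follows. This is an illustrative harness and not the experimental code: `recall`, `repeat_until_recall`, and the callback `run_repetition` are hypothetical names, and the exact result set is assumed to be available, which holds only in an experimental setting.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

using Pair = std::pair<std::uint32_t, std::uint32_t>;

// Fraction of the exact result pairs that have been reported so far.
double recall(const std::set<Pair>& reported, const std::set<Pair>& truth) {
    if (truth.empty()) return 1.0;
    std::size_t hits = 0;
    for (const Pair& p : truth) hits += reported.count(p);
    return static_cast<double>(hits) / static_cast<double>(truth.size());
}

// Run independent repetitions until the target recall is reached or the
// repetition budget is exhausted; returns the number of repetitions used.
// `run_repetition(i)` must return the pairs reported by repetition i.
template <typename RunRepetition>
std::size_t repeat_until_recall(RunRepetition run_repetition,
                                const std::set<Pair>& truth, double target,
                                std::size_t max_repetitions) {
    std::set<Pair> reported;  // the set also removes duplicate reports
    std::size_t repetitions = 0;
    while (repetitions < max_repetitions && recall(reported, truth) < target) {
        for (const Pair& p : run_repetition(repetitions)) reported.insert(p);
        ++repetitions;
    }
    return repetitions;
}
```

Recomputing the recall from scratch after every repetition keeps the sketch short; a harness for large result sets would update the hit count incrementally.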
Hardware.
All experiments were run on an Intel Xeon E5-2690v4 CPU at 2.60GHz with MB L,kB L and kB L cache and GB of RAM. Since a single experiment is always confined to a single CPU core we ran several experiments in parallel [155] to better utilize our hardware.
39.1 Results
Join time.
Table 39.1 shows the average join time in seconds over five independent runs, when approximate methods are required to have at least recall. We have omitted timings for BayesLSH since it was always slower than all other methods, and in most cases it timed out after 20 minutes when using LSH as candidate generation method. The join time for MinHash is always greater than the corresponding join time for CPSJoin except in a single setting: the dataset KOSARAK with threshold . Since CPSJoin is typically faster than MinHash we can restrict our attention to comparing AllPairs and CPSJoin where the picture becomes more interesting.
References
- [1] ACM. ACM Paris Kanellakis Theory and Practice Award. https://awards.acm.org/award_winners/charikar_0308379, 2012. [Online; accessed 26-April-2018].
- [2] C. Aggarwal, D. A. Keim, and A. Hinneburg. On the surprising behaviour of distance metrics in high dimensional space. In Proc. ICDT ’01, pages 420–434, 2001.
- [3] A. Agresti. Bounds on the extinction time distribution of a branching process. Advances in Applied Probability, 6(2):322–335, 1974.
- [4] T. D. Ahle, M. Aumüller, and R. Pagh. Parameter-free locality sensitive hashing for spherical range reporting. In Proc. SODA ’17, pages 239–256, 2017.
- [5] T. D. Ahle, R. Pagh, I. P. Razenshteyn, and F. Silvestri. On the complexity of inner product similarity join. In Proc. PODS ’16, pages 151–164, 2016.
- [6] J. Alman and R. Williams. Probabilistic polynomials and Hamming nearest neighbors. In Proc. FOCS ’15, pages 136–150, 2015.
- [7] N. Alon and N. Asaf. k-wise independent random graphs. In Proc. FOCS ’08, pages 813–822, 2008.
- [8] N. Alon, O. Goldreich, J. Håstad, and R. Peralta. Simple constructions of almost k-wise independent random variables. Random Structures & Algorithms, 3(3):289–304, 1992.
