Confirmation Sampling for Exact Nearest Neighbor Search
Tobias Christiani, Rasmus Pagh, Mikkel Thorup

TL;DR
This paper introduces confirmation sampling, a new technique for exact nearest neighbor search using LSH, providing a general reduction and a new query algorithm that achieves efficient, high-probability results.
Contribution
The paper presents confirmation sampling for exact nearest neighbor search, along with a reduction method and a novel query algorithm for LSH Forests that improves parameter tuning.
Findings
Achieves exact nearest neighbor with high probability using fewer queries.
Provides a general reduction transforming small-probability data structures into high-probability solutions.
Develops a new query algorithm for LSH Forests that matches the efficiency of well-tuned larger structures.
Confirmation Sampling for Exact Nearest Neighbor Search
Tobias Christiani (Footnote 1: The research leading to these results has received funding from the European Research Council under the European Union’s 7th Framework Programme (FP7/2007-2013) / ERC grant agreement no. 614331.)
IT University of Copenhagen and BARC
Rasmus Pagh∗ (Footnote 2: Supported by Villum Foundation grant 16582 to Basic Algorithms Research Copenhagen (BARC). Part of this work was done while visiting Simons Institute for the Theory of Computing.)
IT University of Copenhagen and BARC
Mikkel Thorup (Footnote 3: Supported by an Investigator Grant from the Villum Foundation, Grant No. 16582.)
University of Copenhagen and BARC
Abstract
Locality-sensitive hashing (LSH), introduced by Indyk and Motwani in STOC ’98, has been an extremely influential framework for nearest neighbor search in high-dimensional data sets. While theoretical work has focused on the approximate nearest neighbor problems, in practice LSH data structures with suitably chosen parameters are used to solve the exact nearest neighbor problem (with some error probability). Sublinear query time is often possible in practice even for exact nearest neighbor search, intuitively because the nearest neighbor tends to be significantly closer than other data points. However, theory offers little advice on how to choose LSH parameters outside of pre-specified worst-case settings.
We introduce the technique of confirmation sampling for solving the exact nearest neighbor problem using LSH. First, we give a general reduction that transforms a sequence of data structures that each find the nearest neighbor with a small, unknown probability, into a data structure that returns the nearest neighbor with probability $1-\delta$, using as few queries as possible. Second, we present a new query algorithm for the LSH Forest data structure with $L$ trees that is able to return the exact nearest neighbor of a query point within the same time bound as an LSH Forest of $\Omega(L)$ trees with internal parameters specifically tuned to the query and data.
1 Introduction
Locality-sensitive hashing [11] (LSH) is the leading theoretical approach to nearest neighbor problems in high dimensions. In nearest neighbor search we seek to preprocess a point set $P$ such that given a query point $q$, we can quickly return the point in $P$ that is closest to $q$ according to some distance measure $\mathrm{dist}$. Theoretical results are typically formulated as approximation algorithms that allow a point at distance $cr$ to be returned if the nearest neighbor has distance $r$ from the query point, where $c > 1$ is a user-specified approximation factor. In practice the quality parameter of interest is the recall, i.e., the empirical probability of retrieving the nearest neighbor (see e.g. [1]). As we will see below, it is not hard to show that LSH methods can obtain recall arbitrarily close to 1 if parameters are suitably chosen according to the given query and data set. However, choosing parameters well, in an efficient way, is a challenge [12].
1.1 Background
Locality-sensitive hashing.
A locality-sensitive family $\mathcal{H}$ of hash functions (an “LSH family”) has the property that hash collision probability decreases as distance increases. Specifically, for $h \sim \mathcal{H}$ the “hash bucket” $B_h(q) = \{x \in P : h(x) = h(q)\}$ is more likely to contain the nearest neighbor of $q$ than any other element of $P$. For a given data set one would typically use a family $\mathcal{H}$ such that the expected size of $B_h(q)$ is constant (for every $q$ or on average for a certain query distribution) [14]. Given such a family $\mathcal{H}$, suppose the nearest neighbor is $x^*$, and define $p = \Pr_{h \sim \mathcal{H}}[h(q) = h(x^*)]$ to be the probability of a hash collision with the nearest neighbor. Then inspecting $B_{h_i}(q)$ for a sequence of hash functions $h_1, \dots, h_L$ independently sampled from $\mathcal{H}$ we will fail to find $x^*$ with probability $(1-p)^L$. To make this as efficient as possible we can use a hash table that given $h_i(q)$ allows us to retrieve $B_{h_i}(q)$ in time $O(1 + |B_{h_i}(q)|)$. If we assume that the distance between $q$ and a point $x \in P$ can be computed in constant time, the expected time for this procedure is $O(L)$. There are several issues with the above construction:
- If $pL$ is large then the query algorithm still goes through $L$ hash buckets, even though we expect to see $x^*$ within the first $1/p$ buckets.
- If $pL$ is small, the recall is close to zero.
Notice that $p$ depends on the nearest neighbor $x^*$ that we are searching for, resulting in a chicken-and-egg situation: we would like to conduct the search with knowledge of $p$, but we only know $p$ if the search finds $x^*$ (and we know how the collision probability depends on $\mathrm{dist}(q, x^*)$). We will introduce a technique called confirmation sampling for dealing with the former problem of when to terminate the search when we have no knowledge of $p$. The latter problem requires us to take a new look at how to query the so-called LSH forest data structure, described below.
Approximation versus recall.
Early theoretical work on high-dimensional nearest neighbor search dealt with the simpler case of near neighbor search where it is assumed that a maximum distance to the nearest neighbor is known and a point within distance must be returned. A reduction with logarithmic overhead in time and space extends this to solve the approximate nearest neighbor problem with unknown distance [11, 10]. These reductions increase the approximation factor by , with space usage proportional to , and do not seem to provide any guarantee on recall even if used with a near-neighbor data structure with approximation factor .
A data structure known as LSH forest, first described by Charikar [6] and later generalized and baptized by Bawa et al. [5], removes the logarithmic overhead in space but the query algorithm still only provides $c$-approximate results and does not guarantee a specific recall. Indeed, it is not hard to construct examples where there are many $c$-approximate nearest neighbors and the probability of returning the exact nearest neighbor is negligible.
LSH Forest.
Since we will describe a new query algorithm for the LSH Forest data structure we review the data structure here. We will again make use of an LSH family $\mathcal{H}$, but this family can be “weak” in the sense that collision probabilities are large. Assume for simplicity that we can sample and evaluate $h \sim \mathcal{H}$ in constant time. For parameters $L$ and $k$, and for $i = 1, \dots, L$ and $j = 1, \dots, k$, independently sample hash functions $h_{i,j} \sim \mathcal{H}$. Associate each point $x \in P$ with a string $s_i(x) = h_{i,1}(x) h_{i,2}(x) \cdots h_{i,k}(x)$. For $i = 1, \dots, L$ the $i$th part of the LSH Forest is a trie that stores prefixes of the set of strings $S_i = \{s_i(x) : x \in P\}$. Specifically, for each $x \in P$ it stores the shortest prefix of $s_i(x)$ that is unique among strings in $S_i$ (if such a prefix exists, otherwise the whole string $s_i(x)$). A pointer to $x$ is placed in the leaf corresponding to a prefix of $s_i(x)$. The space for the data structure, not counting space for storing the points in $P$, is $O(nkL)$ words naïvely, where $n = |P|$, and can be improved to $O(nL)$ words using path compression [5].
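The construction above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the data structure of [5]: it assumes binary vectors with bit-sampling hash functions, and instead of a trie holding shortest unique prefixes it stores every prefix of every string in a dictionary (the naive variant, without path compression). The names `build_forest` and `bucket` are hypothetical.

```python
import random
from collections import defaultdict

def build_forest(points, L, k, seed=0):
    """Build L prefix indexes over length-k hash strings (naive, uncompressed)."""
    rng = random.Random(seed)
    dim = len(points[0])
    forest = []
    for _ in range(L):
        # Assumption: bit sampling, each hash function reads one random coordinate.
        coords = [rng.randrange(dim) for _ in range(k)]
        buckets = defaultdict(list)  # prefix (tuple of hash values) -> point indices
        for idx, x in enumerate(points):
            s = tuple(x[c] for c in coords)  # the string of x in this tree
            for j in range(k + 1):
                buckets[s[:j]].append(idx)   # register x under every prefix length
        forest.append((coords, buckets))
    return forest

def bucket(forest, i, q, level):
    """Indices of points whose string in tree i shares a length-`level` prefix with q."""
    coords, buckets = forest[i]
    prefix = tuple(q[c] for c in coords[:level])
    return buckets.get(prefix, [])
```

At level 0 every point collides with the query, and each additional level can only shrink the bucket, which is the trade-off the query algorithm has to navigate.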
Querying LSH Forest.
On a query $q$ and for a parameter $k' \le k$, LSH Forest allows us to retrieve the hash bucket $B_{i,k'}(q)$ of points in $P$ whose string in the $i$th trie matches a length-$k'$ prefix of the string of $q$, in time $O(k' + |B_{i,k'}(q)|)$. We will use $p(x)$ as shorthand for the collision probability between $q$ and $x$ under a single hash function. We have
$$\mathbb{E}\left[|B_{i,k'}(q)|\right] = \sum_{x \in P} p(x)^{k'}.$$
The larger the “level” $k'$ is the smaller $|B_{i,k'}(q)|$ is in expectation. Conversely the probability of finding the nearest neighbor $x^*$ in the hash bucket is $p(x^*)^{k'}$, which decreases exponentially with $k'$. The query algorithm described in [5] chooses the level to inspect as the smallest level where the number of collisions is linear, at most $cL$, for some constant $c$. The probability of failing to find the nearest neighbor by inspecting all $L$ buckets at level $k'$ is $(1 - p(x^*)^{k'})^L$, so to bound the failure probability we need to choose $L$ large enough. For example, if the nearest neighbor of $q$ is in a dense cluster of points whose points almost surely reside in the same LSH bucket, the algorithm fails to find the nearest neighbor almost surely. So LSH Forest is only “self-tuning” to a limited extent if high recall is desired: choosing a suitable parameter $L$ requires at least approximate knowledge of the distance distribution from $q$ to points of $P$. Instead, we would like $L$ to be simply a parameter that determines the space usage, and use a different query algorithm that adapts to the data automatically.
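The dense-cluster failure mode can be made concrete with a small calculation. The sketch below is illustrative only: it assumes the per-hash-function collision probabilities are known, uses expected collision counts as a stand-in for the counts the query algorithm would measure, and interprets the level rule as "expected collisions per tree at most a constant". The function names and the example numbers are hypothetical.

```python
def expected_collisions(probs, level):
    """Expected bucket size in one tree: sum of p(x)**level over all points."""
    return sum(p ** level for p in probs)

def failure_probability(p_nn, level, L):
    """Probability that none of L independent buckets contains the nearest neighbor."""
    return (1.0 - p_nn ** level) ** L

def smallest_level_with_few_collisions(probs, c, max_level):
    """Smallest level whose expected collisions per tree are at most c."""
    for level in range(max_level + 1):
        if expected_collisions(probs, level) <= c:
            return level
    return max_level

# Nearest neighbor with collision probability 0.9, surrounded by a dense
# cluster of 1000 points that are only marginally farther away (0.89).
probs = [0.9] + [0.89] * 1000
level = smallest_level_with_few_collisions(probs, c=3, max_level=200)
print(level, failure_probability(0.9, level, L=10))
```

With these numbers the rule is pushed to a deep level before the cluster thins out, and ten trees then miss the nearest neighbor most of the time, which is the situation the text describes.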
1.2 Our results
LSH methods work by performing many iterations, each inspecting a hash table with a small (and unknown) probability $p$ of finding the nearest neighbor. It is easy to see that after $\ln(1/\delta)/p$ iterations the nearest neighbor will be retrieved with probability at least $1-\delta$. We show that this number of iterations can be matched in expectation without knowledge of $p$, and in fact even without estimating any collision probabilities. Using a technique we call confirmation sampling we obtain the following result on LSH-like methods:
Theorem 1.
Suppose there is a sequence of independent, randomized data structures $D_1, D_2, \dots$, such that on query $q$, $D_i$ returns the nearest neighbor of $q$ in $P$ with probability at least $p_q$ and each other point in $P$ with probability at most $p_q$. Let $\delta > 0$ be given. There is an algorithm that depends on $\delta$ but not on $p_q$ that on input $q$ queries data structures $D_1, \dots, D_j$, performs $j$ distance computations, where $\mathbb{E}[j] = O(\ln(1/\delta)/p_q)$, and returns the nearest neighbor of $q$ with probability at least $1-\delta$.
Theorem 1 shows that, at least in the case where we may use quadratic space to store a sufficiently long sequence of data structures , it suffices to focus on minimizing the product of the expected time for and the number of iterations .
In practice one would of course not have access to an unbounded sequence of data structures, but rather to a fixed number of data structures. If these data structures offer a trade-off between query time and probability of returning the nearest neighbor it is still possible to apply Theorem 1: For run confirmation sampling in rounds of steps with time budget for each data structure . Terminate as soon as confirmation sampling returns a result — by a union bound over the rounds the error probability is at most .
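As a quick numeric check of the iteration count that Theorem 1 matches in expectation: if each iteration finds the nearest neighbor independently with probability $p$, then $\lceil \ln(1/\delta)/p \rceil$ iterations leave a failure probability of at most $\delta$, because $(1-p)^{\ln(1/\delta)/p} \le e^{-\ln(1/\delta)} = \delta$. The helper names below are hypothetical.

```python
import math

def iterations_needed(p, delta):
    """Number of independent iterations after which failure probability is at most delta."""
    return math.ceil(math.log(1.0 / delta) / p)

def failure_after(p, iterations):
    """Probability that every one of `iterations` independent tries misses."""
    return (1.0 - p) ** iterations

for p in (0.01, 0.1, 0.5):
    n = iterations_needed(p, delta=0.01)
    print(p, n, failure_after(p, n))
```

The catch, which confirmation sampling addresses, is that this count can only be computed when $p$ is known.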
Our second result addresses how to adapt not only to the collision probability of the nearest neighbor, but to the whole distance distribution from $q$ to points in $P$. In particular, we design and analyze a new adaptive query algorithm for the LSH Forest data structure [6, 5] discussed above. LSH Forest is known to be able to adapt to the distance distribution to some extent, but previous work has required the query algorithm to depend on the distance to the nearest neighbor in $P$. In contrast, our query algorithm is independent of properties of the data. The only requirement is that the LSH family used is monotone in the sense that collision probability is non-increasing with distance. We compare our adaptive algorithm to an optimal algorithm in a class of natural algorithms that choose a level $k' \le k$ and a number of tries $L' \le L$ (which may depend on the distance distribution between $q$ and $P$) and inspect the first $L'$ buckets at level $k'$.
Theorem 2.
Let $\mathrm{OPT}(L, k, \delta)$ denote the optimal cost of a natural algorithm that queries an LSH Forest data structure with $L$ trees and $k$ levels and returns the nearest neighbor with probability at least $1-\delta$. Further assume that the LSH family is monotone. Then there is an adaptive algorithm that queries an LSH Forest data structure with $O(L)$ trees and $k$ levels that returns the nearest neighbor using time $O(\mathrm{OPT}(L, k, 1/n))$ with probability $1 - 1/n$.
LSH Forest is not an asymptotically optimal data structure for approximate nearest neighbor search in general. For example, it is known that data-dependent methods can be asymptotically faster in several important spaces, and data structures obtaining better space-time trade-offs are known [3, 2]. Generalizing our results for exact nearest neighbors to a data-dependent setting, say, in Euclidean space, is an interesting open direction. Note that the data structures in Theorem 1 could be data dependent, though present data-dependent LSH techniques rely on knowing the (approximate) distance to the nearest neighbor.
1.3 Related work
There is a large literature on using LSH for nearest neighbor search in practice, often generalized to the $k$-nearest neighbor problem where the $k$ closest points in $P$ must be returned. For simplicity we concentrate on the case $k = 1$, but most results extend to arbitrary $k$. Many heuristics that work well in practice come without guarantees on either result quality or query time in high dimensions, or provide guarantees only under certain assumptions on the data set.
Guarantees on recall.
In practice, the performance of locality-sensitive hashing techniques is usually measured by their recall: the fraction of the true $k$-nearest neighbors found on average, see e.g. [1, 4]. From a theoretical point of view it is natural to bound the expected recall, i.e., the probability that the nearest neighbor is found. We are aware of only a few works that provide theoretical guarantees on expected recall in conjunction with sublinear query time in high dimensions and without assumptions on data.
Dong et al. [9] outline an “adaptive” method for achieving a given expected recall in the context of multiprobe LSH (with no formal statement of guarantees). The idea is to determine, after inspecting $i$ buckets, whether to terminate or to inspect bucket $i+1$ based on the collision probability $p_i$ between $q$ and the nearest neighbor found in the first $i$ buckets. This requires an efficient method for computing $p_i$, which might not be known, especially for small collision probabilities. This is not just a theoretical problem: prominent LSH methods such as $p$-stable LSH [8] and cross-polytope LSH [1] do not have closed-form expressions for collision probabilities. Our adaptive algorithm is similar in spirit, but entirely avoids having to compute collision probabilities.
For the related near neighbor problem where a search radius $r$ is given it is easier to give guarantees on recall, especially when collision probabilities at distance $r$ can be computed, see e.g. [7].
Parameter tuning.
Since the performance of LSH data structures depends on parameter choices, a lot of work has gone into devising ways of choosing good parameters for a given data set, both during data structure construction and adaptively for the query algorithm. Slaney et al. [14] propose to select parameters based on the “distance profile” of a data set, but their method needs a bound on the distance to the nearest neighbor to function.
The state-of-the-art FALCONN library [1] uses grid search over parameters to empirically estimate the best parameters, assuming that the data and query distributions are identical.
We note that the adaptive method of Dong et al. [9] does not adapt search depth to the distance distribution from the query point $q$. In fact, choosing good parameters for LSH and especially multi-probe LSH was mentioned by Lv et al. [12] as a challenge in the paper celebrating their VLDB 10-year Best Paper Award.
2 Confirmation sampling
Let $\mathcal{D}$ denote a probability distribution with finite support $S$. Further assume that elements of $S$ are equipped with a total ordering relation $\preceq$, and define $x^*$ as the smallest element in the support with respect to the ordering $\preceq$. Consider the problem of identifying $x^*$ given that we only have access to samples from the distribution $\mathcal{D}$ and to the ordering, i.e., given elements $x, y \in S$ we can determine whether $x \prec y$, $x = y$, or $x \succ y$. We propose a simple randomized algorithm for solving this problem that we call confirmation sampling. The algorithm works by drawing samples from $\mathcal{D}$ while keeping track of the smallest element seen so far together with the number of times it has been sampled in addition to the first sample (the number of confirmations). Once the smallest element has been confirmed $t$ times, the algorithm reports that element and terminates. We use $\infty$ to denote an element that is larger than all elements of $S$.
Theorem 3.
Let $\mathcal{D}$ denote a probability distribution with finite support $S$. For an integer $t \ge 1$ and $X \sim \mathcal{D}$ let $p_1 = \Pr[X = x^*]$ and let $p_2$ be the largest sampling probability among elements of $S$ other than $x^*$. Then:
$$\Pr[\textsc{ConfirmationSampling}(\mathcal{D}, t) \ne x^*] \le (1 - p_1)\left(\frac{p_2}{p_1 + p_2}\right)^{t}.$$
The expected number of samples made by ConfirmationSampling is bounded by $(t+1)/p_1$.
Before we show Theorem 3 we observe that it implies Theorem 1: Define an ordering on $P$ by $x \preceq y$ if and only if $\mathrm{dist}(q, x) \le \mathrm{dist}(q, y)$. It can be turned into a total ordering by an arbitrary but fixed tie-breaking rule. Choose $t = \lceil \log_2(1/\delta) \rceil$ and run ConfirmationSampling with the $i$th sample from $\mathcal{D}$ being produced by querying $D_i$ for the nearest neighbor of $q$. Since $p_1 \ge p_q$ and $p_2 \le p_q$ we have that the error probability is bounded by $(p_2/(p_1+p_2))^t \le 2^{-t} \le \delta$.
Proof.
If the algorithm fails to report $x^*$ it must have happened at least $t$ times that the confirmation counter was incremented (line 1) due to a sample $X$ satisfying the condition $X = x$ for a current minimum $x \ne x^*$. We will refer to such events as false confirmations and proceed by upper bounding the probability that the algorithm performs $t$ false confirmations. Prior to each sample the probability of performing a false confirmation is maximized if $x = x'$ for some $x'$ maximizing the sampling probability, i.e., $\Pr[X = x'] = p_2$. Note also that the first sample can never result in a false confirmation. The probability of the algorithm performing $t$ false confirmations before sampling $x^*$ can therefore be upper bounded by the probability that the first sample is not equal to $x^*$, which is $1 - p_1$, and that we in the following samples observe $t$ samples of $x'$ before sampling $x^*$. The probability that we sample $x'$ conditioned on sampling either $x'$ or $x^*$ is exactly $p_2/(p_1+p_2)$, and the probability of this happening $t$ times in a row is $(p_2/(p_1+p_2))^t$.
To analyze the number of samples, consider an infinite sequence of independent samples $X_1, X_2, \dots$, and suppose that in the $i$th iteration the algorithm uses sample $X_i$. Observe that the algorithm terminates no later than iteration $j$ if $x^*$ is sampled $t+1$ times in $X_1, \dots, X_j$. The expected number of iterations needed to sample $x^*$ $t+1$ times is exactly $(t+1)/p_1$. ∎
Theorem 3 is tight in the case where $\mathcal{D}$ only assigns non-zero probability to two elements. In Appendix A we derive the exact distribution of the output of ConfirmationSampling. We observe that for the proof to work, the distribution from which samples are drawn does not need to be the same in each iteration of ConfirmationSampling, as long as $p_1$ is a lower bound on the probability of sampling $x^*$ and $p_2$ is an upper bound on the probability of sampling each element other than $x^*$. If for some $\gamma < 1$ we have that every distribution satisfies $p_2/(p_1+p_2) \le \gamma$ then we can upper bound the error probability by $\gamma^t$.
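The procedure described at the start of this section can be sketched as follows. This is a minimal illustration written from the prose description, not a transcription of the paper's pseudocode: it starts from the first sample rather than from a sentinel larger than every element, and the function name, the `sample` callable, and the `key` parameter are assumptions made for the example. The key is assumed to induce a total order, so two samples with equal keys are the same element.

```python
import random

def confirmation_sampling(sample, t, key=lambda x: x):
    """Return (smallest element seen, number of draws) once it is confirmed t times.

    `sample` is a zero-argument callable returning one draw from the distribution.
    """
    best = sample()          # the first sample never counts as a confirmation
    confirmations = 0
    draws = 1
    while confirmations < t:
        x = sample()
        draws += 1
        if key(x) < key(best):
            best, confirmations = x, 0   # new minimum: restart the count
        elif key(x) == key(best):
            confirmations += 1           # current minimum sampled again
    return best, draws

# Example: the minimum element 0 is sampled with probability 0.2.
rng = random.Random(0)
draw = lambda: rng.choices([0, 1, 2], weights=[0.2, 0.2, 0.6])[0]
print(confirmation_sampling(draw, t=20))
```

Note that the algorithm never needs the sampling probabilities themselves; it only compares elements, which is what makes it applicable when collision probabilities are unknown.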
2.1 Application to locality-sensitive hashing
Assume that we have an LSH family that is tuned to give few collisions between query and non-neighbor points for a given query and data distribution. Such a “tuned” LSH family may be obtained if the query distribution is known as discussed in section 1.3. We can use confirmation sampling to adjust query time according to the distance to the nearest neighbor.
Let $(X, \mathrm{dist})$ denote a distance space. That is, $X$ is equipped with a distance function $\mathrm{dist} \colon X \times X \to \mathbb{R}$. We define locality-sensitive hashing [11] as follows:
Definition 4.
Let $\mathcal{H}$ denote a distribution over functions $h \colon X \to R$. We say that $\mathcal{H}$ is locality-sensitive over $(X, \mathrm{dist})$ if there exists a non-increasing $p \colon \mathbb{R} \to [0, 1]$ such that for all $x, y \in X$ we have that
$$\Pr_{h \sim \mathcal{H}}[h(x) = h(y)] = p(\mathrm{dist}(x, y)).$$
We use the ordering $\preceq$ by distance to $q$ defined above and define a distribution $\mathcal{D}_q$ that is most easily described as a sampling procedure. For now we will not care about the efficiency of implementing the sampling. To create a sample $X \sim \mathcal{D}_q$, sample $h \sim \mathcal{H}$, compute the “bucket”
$$B = \{x \in P : h(x) = h(q)\}.$$
Now define $X$ as the element of $B$ closest to $q$, if such an element exists, and otherwise a random element in $P$. (Footnote 4: The sampling of a random element ensures compatibility with ConfirmationSampling, which requires a sample to be returned even if there is no hash collision. It is not really necessary from an algorithmic viewpoint, but also does not hurt the asymptotic performance.) More precisely: If $B \ne \emptyset$ we pick $X$ as the unique minimum element in $B$ according to the total order $\preceq$, and if $B = \emptyset$ we pick $X$ uniformly at random from $P$.
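Lemma 5.
For $X \sim \mathcal{D}_q$ and any $x \in P$, $\Pr[X = x] \le \Pr[X = x^*]$, where $x^*$ denotes the nearest neighbor of $q$ in $P$.
Proof.
Since $\mathcal{H}$ is locality-sensitive we have that $\Pr[h(q) = h(x)] \le \Pr[h(q) = h(x^*)]$. Thus, writing $B$ for the sampled bucket,
$$\Pr[X = x] \le \Pr[h(q) = h(x)] + \frac{\Pr[B = \emptyset]}{|P|} \le \Pr[h(q) = h(x^*)] + \frac{\Pr[B = \emptyset]}{|P|} = \Pr[X = x^*].$$
∎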
As before Theorem 3 now implies that confirmation sampling succeeds with good probability:
Lemma 6.
Let $x^*$ be the nearest neighbor of $q$ in $P$ (breaking any ties according to $\preceq$). ConfirmationSampling with $t$ confirmations returns $x^*$ with probability at least $1 - 2^{-t}$. The expected number of samples from $\mathcal{D}_q$ is bounded by $(t+1)/p_q$, where $p_q = \Pr_{h \sim \mathcal{H}}[h(q) = h(x^*)]$.
To efficiently sample from $\mathcal{D}_q$ we independently sample $h_1, h_2, \dots \sim \mathcal{H}$, and construct a sequence of hash tables that allow us to find $B_i = \{x \in P : h_i(x) = h_i(q)\}$ in time $O(1 + |B_i|)$. Random samples from $P$ can be realized using an array of pointers to elements of $P$.
We note that the above is not an entirely satisfactory solution, since the number of data structures needed cannot be bounded ahead of time (or rather, $\Omega(n)$ data structures may be needed to succeed, resulting in quadratic space usage). A possible remedy if the algorithm does not terminate after inspecting $L$ hash tables is multi-probing [13, 12] where more than one bucket is inspected in each hash table. Multiprobing increases the probability of finding the nearest neighbor in each hash table. In the next section we consider another approach to dealing with a space-bounded data structure.
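To make the sampling procedure of this section concrete, the sketch below combines it with confirmation sampling for binary vectors under Hamming distance. It is an illustration under several assumptions: the LSH family is bit sampling with `bits_per_hash` concatenated coordinates, a fresh hash function is drawn per sample, and each bucket is found by a linear scan that stands in for a hash table lookup (so the sketch ignores efficiency). All names and default values are hypothetical.

```python
import random

def make_sampler(points, q, bits_per_hash, rng):
    """Return (sample, order): one draw from the bucket-minimum distribution for q."""
    dim = len(q)

    def order(i):
        # Total order on point indices: Hamming distance to q, ties broken by index.
        return (sum(a != b for a, b in zip(points[i], q)), i)

    def sample():
        # Fresh hash function: the values of a few random coordinates.
        coords = [rng.randrange(dim) for _ in range(bits_per_hash)]
        bucket = [i for i, x in enumerate(points)
                  if all(x[c] == q[c] for c in coords)]
        if bucket:
            return min(bucket, key=order)      # closest colliding point
        return rng.randrange(len(points))      # empty bucket: uniform random point

    return sample, order

def nearest_neighbor(points, q, t, bits_per_hash=8, seed=0):
    """Index of the reported nearest neighbor after t confirmations."""
    rng = random.Random(seed)
    sample, order = make_sampler(points, q, bits_per_hash, rng)
    best, confirmations = sample(), 0
    while confirmations < t:
        i = sample()
        if order(i) < order(best):
            best, confirmations = i, 0
        elif i == best:
            confirmations += 1
    return best
```

The number of samples adapts to how often the true nearest neighbor collides with the query: a close neighbor is confirmed after few draws, while a distant one requires many, without the caller ever supplying a collision probability.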
3 Fully adaptive nearest neighbor search
We present an adaptive algorithm for nearest neighbor search in an LSH Forest that succeeds with high probability (Footnote 5: For every choice of constant there exists a constant such that for we can obtain success probability where denotes the size of the set of data points.) and matches the minimum expected running time that can be obtained by a natural algorithm that has full knowledge of the LSH collision probabilities between the query point and all the data points, provided we are allowed a constant factor increase in the number of trees used by the algorithm. We define $\mathrm{OPT}(L, k, \delta)$ as the minimum expected search time that can be achieved by an algorithm with access to an LSH Forest of $L$ trees of depth $k$ where the algorithm can choose to search $L' \le L$ trees at level $k' \le k$ with the requirement that the nearest neighbor should be reported with probability at least $1-\delta$.
[TABLE]
We note that only reflects the optimal running time under the assumption that is bounded away from . If for example we had the multiplicative overhead of in the running time would not be needed.
Overview of our approach.
The algorithm works by measuring the number of collisions at different levels in the LSH Forest and with high probability adapting to search at a level that will result in running time. Ideally, given sufficiently many trees, we would like to search the level that balances the number of hash function evaluations and the expected number of collisions with the query point. However such a level might not exist as the expected number of collisions can decrease by more than a constant factor as we increase the level.
We begin by introducing some notation. Let , where denotes the nearest neighbor to in and define:
[TABLE]
[TABLE]
Observe that is the expected number of collisions with the query point at level , and is the expected running time of an algorithm that searches at level and guarantees reporting the nearest neighbor of with some constant probability. If we let denote the choice of level resulting in the minimum value of then . Finally, define to be the smallest integer such that .
Given that the number of trees is sufficiently large we can show that searching either the first trees at level or the first trees at level results in an expected running time that is bounded by while we report the nearest neighbor with constant probability at least . That is, one of the two levels right around where the number of hash function evaluations and the number of collisions balance out (we have and ) result in optimal running time for constant failure probability. Since we don’t know we can search both of these levels using confirmation sampling, in parallel, until one of them terminates. This gives us an algorithm that with constant probability terminates in time and reports the nearest neighbor. In order to reduce the failure probability to while obtaining optimal running time in the high probability regime we can perform independent repetitions, so that conceptually there are independent forests, and stop the search once a constant fraction terminates.
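The balancing argument above can be explored numerically. The sketch below is an illustration under explicit assumptions rather than the paper's definitions: it takes the expected number of collisions at level $j$ to be $\sum_x p_x^j$, takes the expected cost of a constant-probability search at level $j$ to be $(j + \sum_x p_x^j)/p_{\max}^j$ (hash evaluations plus collisions per tree, times the expected number of trees until the nearest neighbor collides), and assumes a monotone family so that the nearest neighbor has the largest collision probability. Function names and the example numbers are hypothetical.

```python
def level_costs(probs, max_level):
    """Per level j: (expected collisions, assumed expected cost of searching level j)."""
    p_nn = max(probs)  # monotone family: the nearest neighbor collides most often
    rows = []
    for j in range(max_level + 1):
        collisions = sum(p ** j for p in probs)
        rows.append((collisions, (j + collisions) / p_nn ** j))
    return rows

def best_level(rows):
    """Level minimizing the assumed expected cost."""
    return min(range(len(rows)), key=lambda j: rows[j][1])

def balance_level(rows):
    """Smallest level where expected collisions drop to at most the level itself."""
    for j, (collisions, _) in enumerate(rows):
        if collisions <= j:
            return j
    return None

rows = level_costs([0.9] + [0.5] * 1000, max_level=30)
print(best_level(rows), balance_level(rows))
```

In this example the cost-minimizing level sits right at the point where hash evaluations and collisions balance, which is the behaviour the two-level search relies on.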
Query algorithm and parameters.
There are two circumstances that prevent us from being able to use the approach outlined above. The primary problem is that we don’t know the value of and estimating it appears to be difficult. The solution proposed by our algorithm is to instead search the “empirical” and : we measure the number of collisions at different levels and search level and where is set to the minimum level where the average number of collisions is smaller than . This procedure is described in pseudocode in the for-loop section of Algorithm 2.
The second problem is that restrictions on and can make it necessary for us to search a level , either because or because is too small to ensure that we find the nearest neighbor by searching at level . The second part of Algorithm 2 that runs when deals with this problem by searching through the LSH forests bottom-up until a level that results in optimal running time is encountered.
We aim for matching the running time of up to constant factors when we are allowed to use trees. Algorithm 2 operates on LSH Forests that each has trees where is a sufficiently large power of two. The confirmation sampling used to search in these forests has a parameter setting of since we only need each search to terminate and correctly report the nearest neighbor with a sufficiently large constant probability.
The proof of Theorem 2 is based on two arguments. First we will show that the stopping condition that of the forests at a given level terminates within the time budget ensures that the nearest neighbor is always reported with high probability. Second, we show that with high probability the algorithm terminates in time .
Correctness.
The choice of made by the algorithm always satisfies since there can be no more than collisions at any level. If we show correctness with high probability at a fixed level then we can use a simple union bound over the first levels to show that with high probability at every level where of the searches terminate we have found the nearest neighbor of the query point. The instances of confirmation sampling used by Algorithm 2 use confirmations before terminating. According to Lemma 6 the probability of terminating and reporting a point different from the nearest neighbor is at most . By applying a standard Chernoff bound we can show that over independent runs of confirmation sampling with high probability less than the instances will fail to report the nearest neighbor.
Bounding the running time.
We remind the reader that we use to denote the underlying choice of level that minimizes , that denotes the minimum level such that , and that is the choice of level made by the query algorithm.
Consider line 2 of Algorithm 2 where the level is set to the smallest level where the first trees in at least half the forest have at most collisions. This operation can be completed in time per forest by proceeding top-down across all the forests and for each forest summing up the number of collisions across all its tries at the current level until level is reached. We make use of constant-time access to the size of buckets/subtrees as we search down in an LSH Forest trie (either by explicitly storing the size of subtrees when we construct the trie, or by inspecting the pointers to the bucket associated with a given prefix).
We will now argue that with high probability Algorithm 2 terminates in time in each of the two following cases:
Case 1: .
We will show that there exists a value of such that with high probability the algorithm terminates at this value (or earlier) and in time. Consider the first iteration of the for-loop where . Such a exists by the restrictions underlying the choice of level that minimizes and by our freedom to set . By Markov’s inequality the probability that the number of collisions in the first trees of a forest at level is greater than is at most . Therefore it happens with high probability that the algorithm sets where the last inequality follows from the definition of and the assumption that . By our choice of we know that confirmation sampling at level will terminate in each forest with a large constant probability, say, . With high probability we therefore have that in at least of the forests confirmation sampling at level terminates within the budget of . To bound the total running time we use that with high probability for every value of and since is doubled at every step of the for loop we can bound the running time in all LSH forests by .
Case 2: .
Consider first the sub-case where . Suppose there exists a minimum such that , is an integer power of 2, and (the latter condition holds by the assumption ). We previously argued that with high probability the algorithm sets . In the first iteration of the for-loop where takes on this value the following holds: If then level is searched with a sufficiently large budget to ensure termination with high probability. If then level is searched up until tree number , again ensuring termination with high probability. In both of these cases the running time is bounded by . Otherwise, if then with high probability the time spent in the for-loop part of the algorithm is upper bounded by , and if level was not searched in the for-loop then it will be searched in the first step of the bottom-up part of the algorithm (because with high probability) where we are guaranteed to terminate in optimal time with high probability.
Consider now the sub-case where . Let denote the largest level satisfying and . The query algorithm will terminate with high probability when having searched sufficiently many trees at level . We will proceed by bounding the cost up to the point where trees have been searched in half of the forests at level . The cost of running the for-loop part of the algorithm is bounded by with high probability. The number of collisions encountered through the bottom-up search when having searched level is with high probability bounded by since by our choice of . Finally, the cost of searching at level until of the forests terminate is bounded by with high probability.
Next we show that the sum of all these costs is bounded by . For every it holds by monotonicity that and it follows that for every we have . Applying this inequality we get the bound that is used to bound the number of collisions from the bottom-up search. The same approach also gives a bound on the number of collisions at level . In order to bound the contribution from the for-loop note that where the last inequality holds by the definition of . It also holds that by the choice of . Combining these two inequalities . The bound on the total running time is then given by .
4 Conclusion and open problems
We have introduced confirmation sampling as a technique for identifying the minimum element from a discrete distribution. Confirmation sampling works particularly well when the minimum element is at least as likely to be sampled as other elements. Combining confirmation sampling with locality-sensitive hashing we obtain a randomized solution to the exact nearest neighbor search problem that works without knowledge of the probability of collision between pairs of points. We use these techniques to design a new adaptive query algorithm for the LSH Forest data structure with trees that returns the nearest neighbor of a query point with the same time bound that is achieved if the query algorithm has access to an LSH forest of trees with internal parameters specifically tuned to the query and data.
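The basic sampling loop summarized above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's pseudocode: the names `confirmation_sampling`, `draw`, `key`, and `t` are hypothetical, and we assume that the first sighting of an element counts as one of its `t` confirmations.

```python
import random
from typing import Callable, TypeVar

T = TypeVar("T")


def confirmation_sampling(draw: Callable[[], T], t: int,
                          key: Callable[[T], float] = lambda x: x) -> T:
    """Return the smallest element seen once it has been drawn t times.

    Assumption (not taken from the paper): the first sighting of an
    element counts as one confirmation, so t = 1 returns the first sample.
    """
    if t < 1:
        raise ValueError("t must be at least 1")
    best = draw()
    count = 1
    while count < t:
        x = draw()
        if key(x) < key(best):
            best, count = x, 1   # strictly smaller element: restart the count
        elif key(x) == key(best):
            count += 1           # another confirmation of the current minimum
        # strictly larger elements are ignored
    return best


# Toy usage: the minimum element (1) is also the most likely one.
rng = random.Random(0)
values = [1, 2, 3, 4]
weights = [0.4, 0.3, 0.2, 0.1]
print(confirmation_sampling(lambda: rng.choices(values, weights)[0], t=5))
```

In a nearest neighbor setting one would presumably let `draw` return a candidate point from a freshly probed LSH bucket and let `key` be the distance to the query point. That wiring is an assumption of this sketch.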
We can use confirmation sampling with LSH to solve the k-nearest neighbor problem with high probability in by keeping track of the top-k closest points and requiring each to be confirmed times. If we are able to compute the collision probabilities, we can use the adaptive stopping rule of Dong et al. [9] to stop the search once we have sampled buckets, where is the collision probability between the query point and the kth nearest neighbor candidate found by the query algorithm. This stopping rule guarantees that if is a k-nearest neighbor to the query point, and the LSH family is monotone, then is reported with probability at least . It would be interesting to find a similarly efficient stopping rule for that works without knowledge of the collision probabilities.
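One possible reading of the top-k variant described above is sketched below. The function name `confirm_top_k`, the parameters `draw`, `k`, `t`, and `max_samples`, and the choice to count every sighting of an element as a confirmation are all assumptions made for illustration. The adaptive stopping rule of Dong et al. [9] is not implemented here.

```python
from collections import Counter
from typing import Callable, List


def confirm_top_k(draw: Callable[[], float], k: int, t: int,
                  max_samples: int = 1_000_000) -> List[float]:
    """Sample until each of the k smallest distinct elements seen so far
    has been drawn at least t times, then return those k elements.

    Assumption (hypothetical): every sighting of an element counts as a
    confirmation, and max_samples is a safety budget of this sketch.
    """
    if k < 1 or t < 1:
        raise ValueError("k and t must be at least 1")
    counts: Counter = Counter()
    for _ in range(max_samples):
        counts[draw()] += 1
        smallest = sorted(counts)[:k]  # re-sorting is simple, not efficient
        if len(smallest) == k and all(counts[x] >= t for x in smallest):
            return smallest
    raise RuntimeError("sample budget exhausted before k elements were confirmed")
```

Re-sorting after every sample keeps the sketch short. A practical version would more likely maintain a bounded heap of the k best candidates.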
Our adaptive query algorithm for the LSH Forest data structure makes use of union bounds over the levels of the data structure when showing correctness, and also uses the fact that with high probability it does not search too far (something which could potentially cost time ). When we compare our performance against an optimally tuned algorithm that must succeed with high probability, we can afford to pay for this extra overhead. It remains an open problem to find an adaptive query algorithm that matches an optimally tuned algorithm when we only require constant success probability, even if we can compute collision probabilities.
Appendix A Exact distribution of the output of ConfirmationSampling
Suppose , where indices are chosen such that is non-decreasing in : . Given a distribution and a parameter let denote the probability that ConfirmationSampling reports element . Consider an infinite sequence of i.i.d. samples from . If is sampled times before a single sample of with then the algorithm reports . It is easy to see that since the only way that gets reported is if the first samples are equal to . We can extend this idea to obtain the expression for .
Lemma 7.
[TABLE]
Proof.
We will gradually reveal information about the outcomes of the sequence in order to arrive at the expression in the Lemma. We begin by asking the question for each whether or . Only if the first samples are equal to does the algorithm report . Otherwise we can restrict our attention to the elements and ask the same question for and so on. ∎
For a specific choice of distribution we can compare the exact probability that confirmation sampling fails to report the minimum element with our upper bound in Theorem 3. From inspection: if we consider the uniform distribution the failure probability appears identical for and as we increase the upper bound is at most twice as large as the actual failure probability.
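The revealing argument in the proof can be turned into a small computation. The sketch below is our own derivation under explicit assumptions and is not a transcription of the expression in Lemma 7: elements are distinct, `p[i]` is the probability of the (i+1)-th smallest element, and the first sighting of an element counts as one of its `t` confirmations. Under these assumptions the largest element is reported exactly when the first `t` samples all equal it, and otherwise the same question is asked about the remaining elements with renormalized probabilities.

```python
from typing import List, Sequence


def output_distribution(p: Sequence[float], t: int) -> List[float]:
    """Probability that each element is reported, smallest element first.

    Assumptions (hypothetical conventions of this sketch): elements are
    distinct, p[0] > 0, and the first sighting counts as one confirmation.
    """
    prefix = []
    total = 0.0
    for pi in p:
        total += pi
        prefix.append(total)  # prefix[i] = p[0] + ... + p[i]

    q = [0.0] * len(p)
    survive = 1.0  # probability that no larger element has been reported
    for i in range(len(p) - 1, -1, -1):
        # Chance that element i is confirmed t times before anything smaller,
        # looking only at samples among the i + 1 smallest elements.
        a = (p[i] / prefix[i]) ** t if prefix[i] > 0 else 0.0
        q[i] = survive * a
        survive *= 1.0 - a
    return q


# Failure probability (minimum not reported) for a uniform distribution.
n, t = 8, 3
q = output_distribution([1.0 / n] * n, t)
print(1.0 - q[0])
```

The smallest element always has `a = 1`, so the reported probabilities sum to one.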
References
- [1] A. Andoni, P. Indyk, T. Laarhoven, I. Razenshteyn, and L. Schmidt. Practical and optimal LSH for angular distance. In Proc. NIPS ’15, pages 1225–1233, 2015.
- [2] A. Andoni, T. Laarhoven, I. P. Razenshteyn, and E. Waingarten. Optimal hashing-based time-space trade-offs for approximate near neighbors. In Proc. SODA ’17, pages 47–66, 2017.
- [3] A. Andoni and I. Razenshteyn. Optimal data-dependent hashing for approximate near neighbors. In Proc. STOC ’15, pages 793–801, 2015.
- [4] M. Aumüller, E. Bernhardsson, and A. Faithfull. ANN-Benchmarks: A benchmarking tool for approximate nearest neighbor algorithms. In Proc. SISAP ’17, pages 34–49, 2017.
- [5] M. Bawa, T. Condie, and P. Ganesan. LSH forest: self-tuning indexes for similarity search. In Proc. WWW ’05, pages 651–660, 2005.
- [6] M. Charikar. Similarity estimation techniques from rounding algorithms. In Proc. STOC ’02, pages 380–388, 2002.
- [7] T. Christiani, R. Pagh, and J. Sivertsen. Scalable and robust set similarity join. CoRR, abs/1707.06814, 2017.
- [8] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proc. SOCG ’04, pages 253–262, 2004.
