Performance bounds for nearest neighbor search with k-d trees
Marco Bazzani, Sanjoy Dasgupta

TL;DR
This paper proves non-asymptotic bounds on the runtime of comprehensive search and the accuracy of defeatist search in k-d trees, giving a theoretical explanation for why both strategies perform poorly in high dimensions.
Contribution
It offers the first non-asymptotic bounds on the runtime of comprehensive search and the accuracy of defeatist search for k-d trees in high-dimensional settings.
Findings
When the dimension is at least polylogarithmic in the number of data points, defeatist search is no more likely to return the nearest neighbor than random guessing.
Comprehensive search visits all cells with high probability in high dimensions.
On uniform data, comprehensive search visits at most 2^{O(d)} cells under certain conditions.
Abstract
The k-d tree is one of the oldest and most widely used data structures for nearest neighbor search. It partitions Euclidean space into axis-aligned rectangular cells. There are two standard ways to find the nearest neighbor to a query in a k-d tree. Defeatist search returns the closest data point in the query's cell, while comprehensive search also searches other cells as needed to guarantee it finds the nearest neighbor. Both strategies are commonly believed to perform poorly in high dimensions, but there have been few theoretical results explaining this. We prove non-asymptotic bounds on the runtime of comprehensive search and the accuracy of defeatist search. Under mild distributional assumptions, when the dimension is at least polylogarithmic in the number of data points, defeatist search is no more likely to return the nearest neighbor than random guessing, and…
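
The two search strategies described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. The names `Node`, `build`, `defeatist`, and `comprehensive` are hypothetical, and the construction rule (median splits that cycle through the coordinates, with a `leaf_size` cap) is an assumption, since the abstract does not specify one. The comprehensive search here prunes using the distance to the splitting hyperplane, a common simplification that may visit more cells than an exact cell-distance test but still returns a true nearest neighbor.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, ...]


@dataclass
class Node:
    points: Optional[List[Point]] = None  # set only at leaves (cells)
    axis: int = 0                         # splitting coordinate (internal nodes)
    threshold: float = 0.0                # splitting value (internal nodes)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build(points: List[Point], leaf_size: int = 4, depth: int = 0) -> Node:
    """Assumed rule: split at the median, cycling through the axes by depth."""
    if len(points) <= leaf_size:
        return Node(points=points)
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return Node(axis=axis, threshold=pts[mid][axis],
                left=build(pts[:mid], leaf_size, depth + 1),
                right=build(pts[mid:], leaf_size, depth + 1))


def defeatist(root: Node, q: Point) -> Point:
    """Descend to the query's cell and return the closest point in it."""
    node = root
    while node.points is None:
        node = node.left if q[node.axis] < node.threshold else node.right
    return min(node.points, key=lambda p: sq_dist(p, q))


def comprehensive(root: Node, q: Point) -> Tuple[Point, int]:
    """Return a true nearest neighbor and the number of cells visited."""
    best_point: Optional[Point] = None
    best_d2 = float("inf")
    visited = 0

    def visit(node: Node) -> None:
        nonlocal best_point, best_d2, visited
        if node.points is not None:
            visited += 1
            for p in node.points:
                d2 = sq_dist(p, q)
                if d2 < best_d2:
                    best_point, best_d2 = p, d2
            return
        if q[node.axis] < node.threshold:
            near, far = node.left, node.right
        else:
            near, far = node.right, node.left
        visit(near)
        # The far side can hold a closer point only if the splitting
        # hyperplane is nearer to q than the best distance found so far.
        if (q[node.axis] - node.threshold) ** 2 < best_d2:
            visit(far)

    visit(root)
    return best_point, visited
```

Calling `defeatist(root, q)` inspects exactly one cell and may miss the nearest neighbor, whereas `comprehensive(root, q)` always returns one and reports how many cells it had to inspect. That cell count is the quantity that the runtime bounds for comprehensive search concern.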
