Neighborhood Stability as a Measure of Nearest Neighbor Searchability
Thomas Vecchiato, Sebastian Bruch

TL;DR
This paper introduces two stability measures that predict the effectiveness of clustering-based approximate nearest neighbor search in high-dimensional Euclidean spaces, enabling dataset searchability assessment without prior clustering.
Contribution
It proposes novel neighborhood stability measures that evaluate dataset and clustering quality, providing analytical tools to assess searchability for clustering-based ANNS.
Findings
Clustering-NSM predicts ANNS accuracy based on clustering quality.
Point-NSM assesses dataset clusterability and searchability.
Measures are applicable to various distance functions, including inner product.
Abstract
Clustering-based Approximate Nearest Neighbor Search (ANNS) organizes a set of points into partitions, and searches only a few of them to find the nearest neighbors of a query. Despite its popularity, there are virtually no analytical tools to determine the suitability of clustering-based ANNS for a given dataset -- what we call "searchability." To address that gap, we present two measures for flat clusterings of high-dimensional points in Euclidean space. First is Clustering-Neighborhood Stability Measure (clustering-NSM), an internal measure of clustering quality -- a function of a clustering of a dataset -- that we show to be predictive of ANNS accuracy. The second, Point-Neighborhood Stability Measure (point-NSM), is a measure of clusterability -- a function of the dataset itself -- that is predictive of clustering-NSM. The two together allow us to determine whether a dataset is…
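As an illustration of the clustering-based ANNS setup the abstract describes (partition the points, then search only a few partitions per query), here is a minimal sketch in plain Python. It is not the authors' implementation: the use of Lloyd's k-means for partitioning, the function names, and the `nprobe` parameter (the number of partitions probed) are assumptions made for this example.

```python
import random

def sq_dist(a, b):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=10, seed=0):
    # Plain Lloyd's k-means (an assumed choice of clustering algorithm).
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    # Final assignment so that partitions match the returned centroids.
    assign = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
              for p in points]
    return centroids, assign

def build_index(points, k, seed=0):
    # Group point indices by the partition they were assigned to.
    centroids, assign = kmeans(points, k, seed=seed)
    partitions = [[] for _ in range(k)]
    for i, a in enumerate(assign):
        partitions[a].append(i)
    return centroids, partitions

def search(query, points, centroids, partitions, top_k, nprobe):
    # Probe only the nprobe partitions whose centroids are closest to the query.
    order = sorted(range(len(centroids)),
                   key=lambda c: sq_dist(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in partitions[c]]
    candidates.sort(key=lambda i: sq_dist(query, points[i]))
    return candidates[:top_k]

def exact_search(query, points, top_k):
    # Brute-force baseline used to measure accuracy.
    return sorted(range(len(points)),
                  key=lambda i: sq_dist(query, points[i]))[:top_k]

def recall(approx, exact):
    # Fraction of the true nearest neighbors that the approximate search found.
    return len(set(approx) & set(exact)) / len(exact)
```

Comparing `recall(search(...), exact_search(...))` at a small `nprobe` across datasets gives an empirical version of the accuracy that the abstract says clustering-NSM predicts. Setting `nprobe` equal to the number of partitions reduces the search to the exact baseline.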
Peer Reviews
Decision · Submitted to ICLR 2026
Novel and High-Impact Problem: The problem of a priori algorithm selection is difficult and valuable. The paper's focus on "searchability" addresses a pain point familiar to any practitioner who has had to "guess and check" ANN indexing strategies. Practical Utility: If the proposed measures are computationally efficient, they could save significant time and resources by providing a strong signal for or against using clustering-based ANN without the need to complete the costly experiment. Gene
The major incentive for designing NSMs is to save the run-time of the whole clustering + ANN approach, hence the utility of point-NSM is entirely dependent on its computational complexity. The measure is described as a "statistic summarizing the distribution of point-NSMs," where a single point's NSM is derived from its "r nearest neighbors." Calculating nearest neighbors for all points can be a quadratic operation, which is prohibitive. I think the point of clustering is to "shrink" the scope of
Picking an appropriate ANN datastructure for a dataset is a very practical problem as there are many choices available to practitioners. Furthermore, the introduced measure is demonstrated to have a high correlation with performance on (simple) ANN datastructures based on clustering.
- One of the paper's drawbacks is the following: it attempts to define and measure a dataset's suitability for clustering-based ANNS, which is a function of a specific clustering algorithm, rather than simply measuring the dataset's clusterability itself. Thus, it is not clear to me why one would study "searchability" (a function of a clustering) instead of the dataset's intrinsic clusterability. In what practical situation would a dataset that is "clusterable" (e.g., as measured by k-means loss)
(S1) The paper proposes a theoretically sound and novel clustering measure that fulfills the four axioms of Ben-David and Ackerman (2008), which are considered fundamental requirements for a clustering quality function. It establishes a theoretical framework showing that clustering-NSM satisfies these axioms. Furthermore, the authors derive probabilistic bounds connecting point-NSM and clustering-NSM under the assumption that the data follow a spherical flat clustering structure and are uniforml
(W1) Generalizability of the empirical evaluation of point-NSM for clusterability: The computation of point-NSM requires nearest-neighbor calculations across the entire dataset, which can be computationally expensive. The authors acknowledge this in their empirical evaluation of point-NSM for clusterability, where they mitigate the cost by randomly subsampling 5% of the points to estimate the point-NSM distribution. But does the contribution stay the same when increasing the amount of subsampli
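To make the cost concern raised in the reviews concrete, the following sketch computes the r nearest neighbors by brute force for a random subsample of points only, using the 5% fraction mentioned above as the default. It deliberately stops at the neighbor lists: the formula that turns a neighborhood into a point-NSM value is not given in these excerpts, so it is not reproduced here. The function name and the returned distance-evaluation count are assumptions for illustration.

```python
import random

def sq_dist(a, b):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def sampled_neighbors(points, r, fraction=0.05, seed=0):
    # Brute-force r nearest neighbors for a random subsample of the dataset.
    # Cost is s * (n - 1) distance evaluations for s sampled points,
    # versus n * (n - 1) when every point is processed.
    rng = random.Random(seed)
    n = len(points)
    s = max(1, round(fraction * n))
    sample = rng.sample(range(n), s)
    neighbors = {}
    for i in sample:
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: sq_dist(points[i], points[j]))
        neighbors[i] = others[:r]
    evaluations = s * (n - 1)
    return neighbors, evaluations
```

Rerunning with a larger `fraction` and comparing the resulting statistics is one way to probe the reviewer's question about sensitivity to the subsampling rate.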
Taxonomy
Topics: Advanced Clustering Algorithms Research · Data Management and Algorithms · Topological and Geometric Data Analysis
