The Impacts of Data, Ordering, and Intrinsic Dimensionality on Recall in Hierarchical Navigable Small Worlds

Owen Pendrigh Elliott; Jesse Clark

arXiv:2405.17813·cs.IR·June 10, 2025

TL;DR

This paper investigates how data properties, insertion order, and intrinsic dimensionality affect the recall of HNSW in vector search. It finds that insertion order alone can shift recall by up to 12 percentage points and argues for more nuanced benchmarks.

Contribution

It provides a comprehensive analysis of HNSW's performance across diverse datasets, highlighting the impact of intrinsic dimensionality and insertion order on recall, and emphasizes the importance of realistic benchmarking.

Findings

01

Recall is linked to intrinsic dimensionality and insertion order.

02

Insertion order can shift recall by up to 12 percentage points.

03

Benchmark dataset choice can alter model rankings by up to three positions.
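
The findings above are stated in terms of recall and percentage-point differences. As a minimal sketch of how such a comparison could be computed, the snippet below measures recall@k against exact nearest-neighbour results and reports the gap between two insertion orders. The function names (`recall_at_k`, `mean_recall`) and all ID lists are hypothetical toy values chosen for illustration; they are not taken from the paper, and the paper's own evaluation code may differ.

```python
from typing import Sequence


def recall_at_k(approx_ids: Sequence[int], exact_ids: Sequence[int], k: int) -> float:
    """Fraction of the true top-k neighbours present in the approximate top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    truth = set(exact_ids[:k])
    found = set(approx_ids[:k])
    return len(truth & found) / k


def mean_recall(approx: Sequence[Sequence[int]],
                exact: Sequence[Sequence[int]],
                k: int) -> float:
    """Average recall@k over all queries."""
    if len(approx) != len(exact) or not exact:
        raise ValueError("need the same non-zero number of queries in both inputs")
    return sum(recall_at_k(a, e, k) for a, e in zip(approx, exact)) / len(exact)


# Toy data (made up): exact neighbours for two queries, and the results
# returned by two hypothetical indexes built with different insertion orders.
exact = [[1, 2, 3, 4], [5, 6, 7, 8]]
order_a = [[1, 2, 3, 9], [5, 6, 7, 8]]    # recovers 3/4 and 4/4
order_b = [[1, 2, 9, 10], [5, 6, 11, 8]]  # recovers 2/4 and 3/4

recall_a = mean_recall(order_a, exact, k=4)
recall_b = mean_recall(order_b, exact, k=4)

# Difference expressed in percentage points, the unit used in the findings.
gap_pp = (recall_a - recall_b) * 100
print(f"recall A={recall_a:.3f}, recall B={recall_b:.3f}, gap={gap_pp:.1f} pp")
```

Reporting the gap in percentage points rather than as a relative change keeps it directly comparable across datasets whose baseline recall differs.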

Abstract

Vector search systems, pivotal in AI applications, often rely on the Hierarchical Navigable Small Worlds (HNSW) algorithm. However, the behaviour of HNSW under real-world scenarios using vectors generated with deep learning models remains under-explored. Existing Approximate Nearest Neighbours (ANN) benchmarks and research typically over-rely on simplistic datasets like MNIST or SIFT1M and fail to reflect the complexity of current use-cases. Our investigation focuses on HNSW's efficacy across a spectrum of datasets, including synthetic vectors tailored to mimic specific intrinsic dimensionalities, widely used retrieval benchmarks with popular embedding models, and proprietary e-commerce image data with CLIP models. We survey the most popular HNSW vector databases and collate their default parameters to provide a realistic fixed parameterisation for the duration of the paper.…

Taxonomy

Topics: Multi-Agent Systems and Negotiation · AI-based Problem Solving and Planning · Competitive and Knowledge Intelligence

Methods: Contrastive Language-Image Pre-training