QVCache: A Query-Aware Vector Cache
Anıl Eren Göçer, Ioanna Tsakalidou, Hamish Nicholson, Kyoungmin Kim, Anastasia Ailamaki

TL;DR
QVCache introduces a query-aware caching layer for vector databases that cuts query latency by up to 1000x while maintaining high recall, enabling scalable and efficient approximate nearest neighbor search at a small memory cost.
Contribution
QVCache is the first query-level caching system for ANN search that dynamically learns similarity thresholds, operating as a backend-agnostic layer with bounded memory and latency.
Findings
Reduces query latency by up to 1000x
Maintains high recall comparable to underlying ANN systems
Operates with a megabyte-scale memory footprint
Abstract
Vector databases have become a cornerstone of modern information retrieval, powering applications in recommendation, search, and retrieval-augmented generation (RAG) pipelines. However, scaling approximate nearest neighbor (ANN) search to high recall under strict latency SLOs remains fundamentally constrained by memory capacity and I/O bandwidth. Disk-based vector search systems suffer severe latency degradation at high accuracy, while fully in-memory solutions incur prohibitive memory costs at billion-scale. Despite the central role of caching in traditional databases, vector search lacks a general query-level caching layer capable of amortizing repeated query work. We present QVCache, the first backend-agnostic, query-level caching system for ANN search with bounded memory footprint. QVCache exploits semantic query repetition by performing similarity-aware caching rather than…
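The abstract's core idea, reusing the results of a semantically repeated query instead of re-running ANN search, can be illustrated with a small cache placed in front of an arbitrary backend. This is a minimal sketch under stated assumptions, not QVCache's algorithm: the class name `SimilarityQueryCache`, the single fixed L2 `threshold`, LRU eviction, and the linear scan over cached queries are all hypothetical choices. QVCache itself learns its similarity thresholds dynamically, which this sketch does not model.

```python
import math
from collections import OrderedDict
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Tuple[float, ...]


class SimilarityQueryCache:
    """Hypothetical query-level cache placed in front of an ANN backend.

    Assumption: one fixed L2 distance threshold decides whether a new query
    is close enough to a cached query to reuse its results. QVCache learns
    its similarity thresholds dynamically; that is not modeled here.
    """

    def __init__(self, backend_search: Callable[[Vector, int], List[int]],
                 capacity: int, threshold: float) -> None:
        self.backend_search = backend_search  # any backend: (query, k) -> neighbor ids
        self.capacity = capacity              # max cached queries, bounds memory
        self.threshold = threshold            # max distance that counts as a hit
        self.entries: "OrderedDict[Vector, List[int]]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def _closest_cached(self, q: Vector) -> Tuple[Optional[Vector], float]:
        # Linear scan for clarity; a real cache would index its cached queries.
        best_key, best_dist = None, math.inf
        for key in self.entries:
            d = math.dist(q, key)
            if d < best_dist:
                best_key, best_dist = key, d
        return best_key, best_dist

    def search(self, query: Sequence[float], k: int) -> List[int]:
        q = tuple(query)
        key, dist = self._closest_cached(q)
        if key is not None and dist <= self.threshold and len(self.entries[key]) >= k:
            self.hits += 1
            self.entries.move_to_end(key)  # refresh LRU position
            return self.entries[key][:k]
        # Miss: pay the full backend cost, then remember the answer.
        self.misses += 1
        result = list(self.backend_search(q, k))
        self.entries[q] = result
        self.entries.move_to_end(q)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used query
        return result


# Usage: a brute-force "backend" over a toy dataset.
DATA = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]


def brute_force(q: Vector, k: int) -> List[int]:
    return sorted(range(len(DATA)), key=lambda i: math.dist(q, DATA[i]))[:k]


cache = SimilarityQueryCache(brute_force, capacity=2, threshold=0.1)
print(cache.search((0.1, 0.0), 2))   # miss -> [0, 1]
print(cache.search((0.12, 0.0), 2))  # near-duplicate query, hit -> [0, 1]
print(cache.hits, cache.misses)      # 1 1
```

In this sketch, the fixed `capacity` is what keeps memory bounded, and the threshold trades hit rate against recall: a looser threshold serves more queries from the cache, but the cached neighbors may differ from the true neighbors of the new query.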
Taxonomy
Topics: Information Retrieval and Search Behavior · Caching and Content Delivery · Advanced Image and Video Retrieval Techniques
