HAVEN: High-Bandwidth Flash Augmented Vector Engine for Large-Scale Approximate Nearest-Neighbor Search Acceleration

Po-Kai Hsu; Weihong Xu; Qunyou Liu; Tajana Rosing; Shimeng Yu

arXiv:2603.01175·cs.AR·March 3, 2026

Open Access

TL;DR

HAVEN introduces a GPU architecture with High-Bandwidth Flash that keeps billion-scale vector databases on-device, reducing latency by up to 40x and increasing reranking throughput by up to 20x for large-scale approximate nearest neighbor search in retrieval-augmented generation.

Contribution

The paper presents HAVEN, a novel GPU architecture augmented with High-Bandwidth Flash technology, allowing entire large-scale vector databases to reside on-device, thus eliminating off-GPU data movement bottlenecks.

Findings

01  Up to 20x increase in reranking throughput

02  Up to 40x reduction in latency for billion-scale datasets

03  Enables high-recall retrieval with GPU-based reranking

Abstract

Retrieval-Augmented Generation (RAG) relies on large-scale Approximate Nearest Neighbor Search (ANNS) to retrieve semantically relevant context for large language models. Among ANNS methods, IVF-PQ offers an attractive balance between memory efficiency and search accuracy. However, achieving high recall requires a reranking step that fetches full-precision vectors, and billion-scale vector databases must reside in CPU DRAM or on SSD because GPU HBM capacity is limited. This off-GPU data movement adds substantial latency and degrades throughput. We propose HAVEN, a GPU architecture augmented with High-Bandwidth Flash (HBF), a recently introduced die-stacked 3D NAND technology engineered to deliver terabyte-scale capacity and hundreds of GB/s of read bandwidth. By integrating HBF and a near-storage search unit as an on-package complement to HBM, HAVEN enables…
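To make the reranking step in the abstract concrete, here is a minimal Python sketch of a two-stage search: a shortlist is chosen using compressed vectors, then full-precision vectors are fetched only for that shortlist and distances are recomputed exactly. This is an illustration under stated assumptions, not the paper's implementation. The function names (`quantize`, `search_with_rerank`, `exact_top_k`) are hypothetical, simple grid rounding stands in for product quantization, and the IVF partitioning step is omitted entirely.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(v, step=0.5):
    """Crude lossy compression: snap each coordinate to a grid.
    Assumption: this is only a stand-in for real PQ codes."""
    return [round(x / step) * step for x in v]

def search_with_rerank(query, compressed, full_precision, k, n_candidates):
    """Two-stage search: shortlist on compressed vectors, rerank on exact ones."""
    # Stage 1: approximate distances over the compressed database.
    shortlist = sorted(
        range(len(compressed)),
        key=lambda i: sq_dist(query, compressed[i]),
    )[:n_candidates]
    # Stage 2: fetch full-precision vectors for the shortlist only and
    # recompute distances exactly. In a real system this fetch is the
    # off-GPU data movement the abstract describes.
    reranked = sorted(shortlist, key=lambda i: sq_dist(query, full_precision[i]))
    return reranked[:k]

def exact_top_k(query, full_precision, k):
    """Brute-force ground truth, used here only to check the sketch."""
    order = sorted(
        range(len(full_precision)),
        key=lambda i: sq_dist(query, full_precision[i]),
    )
    return order[:k]

# Toy data: 500 random 8-dimensional vectors (illustrative sizes only).
random.seed(0)
database = [[random.uniform(-1, 1) for _ in range(8)] for _ in range(500)]
codes = [quantize(v) for v in database]
query = [random.uniform(-1, 1) for _ in range(8)]

top = search_with_rerank(query, codes, database, k=5, n_candidates=50)
truth = exact_top_k(query, database, k=5)
recall = len(set(top) & set(truth)) / len(truth)
print("reranked ids:", top, "recall@5:", recall)
```

Raising `n_candidates` generally improves recall but increases the number of full-precision vectors that must be read, which is why the location and bandwidth of the full-precision store matter for this step.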

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Natural Language Processing Techniques · Advanced Data Storage Technologies