HAVEN: High-Bandwidth Flash Augmented Vector Engine for Large-Scale Approximate Nearest-Neighbor Search Acceleration
Po-Kai Hsu, Weihong Xu, Qunyou Liu, Tajana Rosing, Shimeng Yu

TL;DR
HAVEN introduces a GPU architecture with High-Bandwidth Flash to enable billion-scale vector database storage on-device, significantly reducing latency and increasing throughput for large-scale approximate nearest neighbor search in retrieval-augmented generation.
Contribution
The paper presents HAVEN, a novel GPU architecture augmented with High-Bandwidth Flash technology, allowing entire large-scale vector databases to reside on-device, thus eliminating off-GPU data movement bottlenecks.
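To make the capacity argument concrete, here is a rough back-of-envelope sketch of storage footprints at billion scale. The embedding dimension (768), the FP32 element size, and the 64-byte PQ code size are illustrative assumptions, not figures from the paper.

```python
def footprint_gb(num_vectors, bytes_per_vector):
    """Return the storage footprint in gigabytes (1 GB = 10**9 bytes)."""
    return num_vectors * bytes_per_vector / 1e9

NUM_VECTORS = 1_000_000_000  # "billion-scale", as stated in the summary
DIM = 768                    # assumed embedding dimension (hypothetical)
FP32_BYTES = 4               # assumed full-precision element size
PQ_CODE_BYTES = 64           # assumed PQ code size per vector (hypothetical)

# Full-precision vectors are what the reranking step has to read.
full_gb = footprint_gb(NUM_VECTORS, DIM * FP32_BYTES)
# Compressed PQ codes are what the first-stage search scans.
pq_gb = footprint_gb(NUM_VECTORS, PQ_CODE_BYTES)

print(f"full-precision vectors: {full_gb:.0f} GB")
print(f"PQ codes:               {pq_gb:.0f} GB")
```

Under these assumptions the compressed codes need tens of gigabytes, while the full-precision vectors used for reranking need terabytes, which is the capacity range the summary attributes to HBF rather than HBM.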
Findings
Up to 20x increase in reranking throughput
Up to 40x reduction in latency for billion-scale datasets
Enables high-recall retrieval with GPU-based reranking
Abstract
Retrieval-Augmented Generation (RAG) relies on large-scale Approximate Nearest Neighbor Search (ANNS) to retrieve semantically relevant context for large language models. Among ANNS methods, IVF-PQ offers an attractive balance between memory efficiency and search accuracy. However, achieving high recall requires a reranking step that fetches full-precision vectors, and billion-scale vector databases must reside in CPU DRAM or on SSD because of the limited capacity of GPU HBM. This off-GPU data movement introduces substantial latency and throughput degradation. We propose HAVEN, a GPU architecture augmented with High-Bandwidth Flash (HBF), a recently introduced die-stacked 3D NAND technology engineered to deliver terabyte-scale capacity and hundreds of GB/s of read bandwidth. By integrating HBF and a near-storage search unit as an on-package complement to HBM, HAVEN enables…
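The two-stage pattern the abstract describes (a cheap search over compressed vectors, then reranking with full-precision vectors) can be sketched as follows. This is a minimal illustration, not the paper's implementation: it omits IVF partitioning and PQ codebooks, and `search_with_rerank`, `approx_vectors`, and the toy data are hypothetical names and values, with hand-written lossy vectors standing in for PQ reconstructions.

```python
import heapq

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def search_with_rerank(query, approx_vectors, full_vectors, num_candidates, k):
    """Return the ids of the k nearest vectors after exact reranking."""
    # Stage 1: shortlist candidates using the lossy (compressed) vectors.
    candidates = heapq.nsmallest(
        num_candidates,
        range(len(approx_vectors)),
        key=lambda i: sq_dist(query, approx_vectors[i]),
    )
    # Stage 2: fetch full-precision vectors for the shortlist only and
    # recompute exact distances. This fetch is the step that goes off-GPU
    # when the full-precision vectors live in CPU DRAM or on SSD.
    return heapq.nsmallest(
        k, candidates, key=lambda i: sq_dist(query, full_vectors[i])
    )

# Toy data: the lossy stand-ins distort ids 1 and 2 enough to swap their order.
full_vectors = [[1.0, 1.0], [0.4, 0.4], [0.6, 0.6], [5.0, 5.0]]
approx_vectors = [[1.0, 1.0], [0.5, 0.5], [0.8, 0.8], [5.0, 5.0]]
query = [0.62, 0.62]

without_rerank = search_with_rerank(query, approx_vectors, full_vectors, 1, 1)
with_rerank = search_with_rerank(query, approx_vectors, full_vectors, 2, 1)
print(without_rerank, with_rerank)
```

In this toy case the compressed search alone returns id 1, while reranking a two-item shortlist recovers the true nearest neighbor, id 2. A larger shortlist raises recall but also increases the number of full-precision vectors that must be fetched.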
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Natural Language Processing Techniques · Advanced Data Storage Technologies
