RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference

Yaoqi Chen; Jinkai Zhang; Baotong Lu; Qianxi Zhang; Chengruidong Zhang; Jing Liu; Jingjia Luo; Di Liu; Huiqiang Jiang; Qi Chen; Bailu Ding; Xiao Yan; Jiawei Jiang; Chen Chen; Mingxing Zhang; Cheng Li; Yuqing Yang; Fan Yang; Mao Yang

arXiv:2505.02922·cs.LG·April 28, 2026

TL;DR

RetroInfer is a vector storage engine for long-context LLM inference. It keeps the KV cache in a sparsity-aware store and retrieves only the tokens that matter for the current decoding step, which raises decoding throughput (up to 4.4X at 120K context) while maintaining full-attention-level accuracy.

Contribution

RetroInfer introduces an attention-aware vector index and GPU-CPU buffer management that together improve the trade-off between accuracy and retrieval cost in long-context inference.

Findings

1. Achieves up to 4.4X decoding throughput at a 120K-token context length.
2. Attains up to 12.2X speedup over sparse attention baselines at 1 million tokens.
3. Maintains full-attention-level accuracy during long-context inference.

Abstract

Recent large language models (LLMs) are rapidly extending their context windows, yet inference throughput lags due to increasing GPU memory and bandwidth demands. This is because the key-value (KV) cache, an intermediate structure storing token representations, grows linearly with context length and requires an iterative linear scan for attention computation. A promising direction to accelerate long-context inference is to exploit attention's inherent sparsity by offloading the KV cache to CPU memory and retrieving only a small subset of tokens important to the current generation step. However, prior sparse attention approaches struggle to balance accuracy and retrieval cost due to varying sparsity patterns and inefficient GPU-CPU memory management. We present RetroInfer, a vector storage engine that realizes a sparsity-based KV cache for long-context inference. RetroInfer introduces…
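To make the sparsity idea in the abstract concrete, here is a minimal, illustrative sketch of top-k sparse attention for a single query. It is not RetroInfer's index or buffer manager: `top_k_tokens` is a hypothetical stand-in that does an exact linear scan where a real system would use an index lookup, and `make_demo` builds synthetic data in which a few "hot" keys are deliberately aligned with the query. All function names and parameters are assumptions made for the example.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values, token_ids):
    """Scaled dot-product attention of one query over the given token positions."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, keys[i])) for i in token_ids]
    weights = softmax(scores)
    dim = len(values[0])
    return [
        sum(w * values[i][d] for w, i in zip(weights, token_ids))
        for d in range(dim)
    ]


def top_k_tokens(query, keys, k):
    """Pick the k keys with the largest dot product to the query.

    Hypothetical stand-in for a retrieval step: this is an exact linear scan,
    not an approximate index.
    """
    scores = [sum(q * kk for q, kk in zip(query, key)) for key in keys]
    return sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]


def make_demo(n_tokens=512, dim=16, n_hot=8, seed=0):
    """Synthetic KV cache where n_hot keys dominate the attention scores."""
    rng = random.Random(seed)
    raw = [rng.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in raw))
    # Rescale the query so its squared norm equals dim.
    query = [x * math.sqrt(dim) / norm for x in raw]
    keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
    values = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
    hot = rng.sample(range(n_tokens), n_hot)
    for i in hot:
        # Assumption for the demo: hot keys point in the query's direction.
        keys[i] = [3.0 * q for q in query]
    return query, keys, values, hot


def compare_full_and_sparse():
    """Return the largest per-dimension gap between full and top-k attention."""
    query, keys, values, hot = make_demo()
    full = attend(query, keys, values, range(len(keys)))
    picked = top_k_tokens(query, keys, k=len(hot))
    sparse = attend(query, keys, values, picked)
    return max(abs(a - b) for a, b in zip(full, sparse))
```

On this synthetic data, attending over only the selected tokens closely tracks full attention because nearly all of the softmax mass sits on the hot keys. Real attention distributions vary across heads, layers, and queries, which is the difficulty the abstract attributes to prior sparse attention approaches.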

Code & Models

Repositories

microsoft/RetrievalAttention (GitHub)
