RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference
Yaoqi Chen, Jinkai Zhang, Baotong Lu, Qianxi Zhang, Chengruidong Zhang, Jing Liu, Jingjia Luo, Di Liu, Huiqiang Jiang, Qi Chen, Bailu Ding, Xiao Yan, Jiawei Jiang, Chen Chen, Mingxing Zhang, Cheng Li, Yuqing Yang, Fan Yang, Mao Yang

TL;DR
RetroInfer is a vector storage engine for long-context LLM inference. It exploits attention sparsity to retrieve only the KV cache entries that matter for the current generation step, raising throughput while maintaining accuracy.
Contribution
RetroInfer introduces an attention-aware vector index and GPU-CPU buffer management that improve the tradeoff between accuracy and retrieval cost in long-context inference.
Findings
Achieves up to 4.4X decoding throughput at 120K context length.
Attains up to 12.2X speedup over sparse attention baselines at 1 million tokens.
Maintains full-attention-level accuracy during long-context inference.
Abstract
Recent large language models (LLMs) are rapidly extending their context windows, yet inference throughput lags due to increasing GPU memory and bandwidth demands. This is because the key-value (KV) cache, an intermediate structure storing token representations, grows linearly with context length and requires an iterative linear scan for attention computation. A promising direction to accelerate long-context inference is to exploit attention's inherent sparsity by offloading the KV cache to CPU memory and retrieving only a small subset of tokens important to the current generation step. However, prior sparse attention approaches struggle to balance accuracy and retrieval cost due to varying sparsity patterns and inefficient GPU-CPU memory management. We present RetroInfer, a vector storage engine that realizes a sparsity-based KV cache for long-context inference. RetroInfer introduces…
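The general idea the abstract describes, attending over only a small subset of important tokens instead of scanning the whole KV cache, can be illustrated with a minimal single-query sketch. This is a generic top-k sparse attention example, not RetroInfer's index or buffer manager: the function names (`attention`, `top_k_indices`) and the toy data are assumptions made for illustration, and the exact top-k selection here still scans every key, which is the cost an approximate vector index would be meant to avoid.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values, indices=None):
    """Single-query scaled dot-product attention.

    If `indices` is given, only those KV-cache entries are attended to
    (sparse attention); otherwise every entry is used (full attention).
    """
    indices = list(range(len(keys))) if indices is None else list(indices)
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, keys[i])) for i in indices]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, indices)) for d in range(dim)]


def top_k_indices(query, keys, k):
    """Exact top-k keys by dot product (hypothetical stand-in for an index lookup)."""
    scores = [sum(q * c for q, c in zip(query, key)) for key in keys]
    return sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]


# Toy KV cache: two keys align strongly with the query, two do not.
query = [4.0, 0.0]
keys = [[0.0, 1.0], [5.0, 0.0], [0.0, -1.0], [4.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 2.0]]

selected = top_k_indices(query, keys, k=2)
full = attention(query, keys, values)
sparse = attention(query, keys, values, indices=selected)
print(selected, full, sparse)
```

When the attention distribution is peaked, as in this toy cache, the sparse output stays very close to the full-attention output while touching only half of the entries. How well that holds in practice depends on the sparsity pattern, which is the accuracy versus retrieval-cost tradeoff the abstract refers to.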
