KVBuffer: IO-aware Serving for Linear Attention

Longwei Zou; Lin Zhong

arXiv:2605.19049·cs.LG·May 20, 2026

TL;DR

KVBuffer is an IO-aware serving mechanism that improves the efficiency of linear attention in long-context inference by buffering recent keys and values, reducing latency and increasing request capacity.

Contribution

The paper introduces a buffering approach for serving linear attention that improves memory efficiency and decoding speed, particularly under speculative decoding and for short contexts.

Findings

01 Reduces linear attention decoding latency by up to 45.17%.

02 Increases the maximum number of served requests by 5x for speculative decoding.

03 Enables chunkwise computation and parallel verification of draft tokens.

Abstract

Linear attention has recently gained significant attention for long-context inference due to its constant decoding cost with respect to context length. However, existing serving systems typically serve linear attention by recurrently computing and updating a large linear attention state in every decoding step. Since the state is much larger than the per-token key and value, recurrent decoding incurs substantial memory access and becomes inefficient for serving linear attention. In this paper, we propose KVBuffer, an IO-aware serving mechanism for linear attention. By buffering recent keys and values, KVBuffer enables serving systems to compute linear attention outputs in more flexible and memory-efficient ways. For decoding, KVBuffer enables chunkwise computation, which reduces average memory access and decoding latency by deferring state updates and applying them in batch. For…
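To make the chunkwise decoding idea in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes the simplest unnormalized form of linear attention (state S accumulates outer products k v^T, output is S^T q) with no decay, gating, or normalization, and every name in it (`recurrent_decode`, `buffered_decode`, `chunk_size`, and the helpers) is hypothetical. The sketch compares a baseline that rewrites the full state at every step with a buffered variant that keeps recent keys and values aside and folds them into the state once per chunk.

```python
import random

def outer_add(state, k, v):
    """In-place rank-1 update: state[i][j] += k[i] * v[j]."""
    for i, ki in enumerate(k):
        row = state[i]
        for j, vj in enumerate(v):
            row[j] += ki * vj

def read_state(state, q):
    """Return state^T q, a vector of length d_v."""
    d_v = len(state[0])
    return [sum(q[i] * state[i][j] for i in range(len(q))) for j in range(d_v)]

def recurrent_decode(qs, ks, vs, d_k, d_v):
    """Baseline: update the full d_k x d_v state at every decoding step."""
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        outer_add(state, k, v)
        outputs.append(read_state(state, q))
    return outputs

def buffered_decode(qs, ks, vs, d_k, d_v, chunk_size):
    """Assumed buffered variant: defer state updates, apply them per chunk."""
    state = [[0.0] * d_v for _ in range(d_k)]
    buf_k, buf_v = [], []
    outputs = []
    state_writes = 0  # how many times the large state is rewritten
    for q, k, v in zip(qs, ks, vs):
        buf_k.append(k)
        buf_v.append(v)
        # Contribution of tokens already folded into the state.
        out = read_state(state, q)
        # Contribution of buffered tokens, computed from keys and values directly.
        for bk, bv in zip(buf_k, buf_v):
            score = sum(qi * ki for qi, ki in zip(q, bk))
            for j in range(d_v):
                out[j] += score * bv[j]
        outputs.append(out)
        # Flush: apply all deferred updates in one batch, then empty the buffer.
        if len(buf_k) == chunk_size:
            for bk, bv in zip(buf_k, buf_v):
                outer_add(state, bk, bv)
            buf_k.clear()
            buf_v.clear()
            state_writes += 1
    return outputs, state_writes

# Small usage example with random inputs (sizes chosen only for illustration).
random.seed(0)
d_k, d_v, steps, chunk = 4, 3, 10, 4
rand_vec = lambda n: [random.uniform(-1.0, 1.0) for _ in range(n)]
qs = [rand_vec(d_k) for _ in range(steps)]
ks = [rand_vec(d_k) for _ in range(steps)]
vs = [rand_vec(d_v) for _ in range(steps)]

baseline = recurrent_decode(qs, ks, vs, d_k, d_v)
buffered, state_writes = buffered_decode(qs, ks, vs, d_k, d_v, chunk)
max_err = max(abs(a - b) for ra, rb in zip(baseline, buffered) for a, b in zip(ra, rb))
```

In this toy setting both paths produce the same outputs up to floating-point rounding, while the buffered path rewrites the state once per `chunk_size` steps instead of once per step. The sketch only counts state writes; the latency and capacity figures reported for KVBuffer come from the paper's own system and are not reproduced by this example.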
