KVBuffer: IO-aware Serving for Linear Attention
Longwei Zou, Lin Zhong

TL;DR
KVBuffer is an IO-aware serving mechanism that improves the efficiency of linear attention in long-context inference by buffering recent keys and values, reducing latency and increasing request capacity.
Contribution
It introduces a novel buffering approach for linear attention serving that enhances memory efficiency and decoding speed, especially for speculative decoding and short contexts.
Findings
Reduces linear attention decoding latency by up to 45.17%.
Increases the maximum number of requests that can be served by 5x for speculative decoding.
Enables chunkwise computation and parallel verification of draft tokens.
Abstract
Linear attention has recently gained significant attention for long-context inference due to its constant decoding cost with respect to context length. However, existing serving systems typically serve linear attention by recurrently computing and updating a large linear attention state in every decoding step. Since the state is much larger than the per-token key and value, recurrent decoding incurs substantial memory access and becomes inefficient for serving linear attention. In this paper, we propose KVBuffer, an IO-aware serving mechanism for linear attention. By buffering recent keys and values, KVBuffer enables serving systems to compute linear attention outputs in more flexible and memory-efficient ways. For decoding, KVBuffer enables chunkwise computation, which reduces average memory access and decoding latency by deferring state updates and applying them in batch. For…
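The chunkwise decoding idea in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes the simplest unnormalized, ungated form of linear attention (state S updated as S + k vᵀ, output q S), and the function names, the `chunk` parameter, and the update counter are hypothetical. The sketch keeps recent keys and values in a buffer, computes each output from the stale state plus the buffered tokens, and folds the buffer into the state only once per chunk.

```python
import numpy as np

def recurrent_decode(S, qs, ks, vs):
    """Baseline: rewrite the full (d_k, d_v) state at every decoding step."""
    outs = []
    for q, k, v in zip(qs, ks, vs):
        S = S + np.outer(k, v)          # one state update per token
        outs.append(q @ S)
    return S, np.stack(outs)

def buffered_decode(S, qs, ks, vs, chunk=4):
    """Sketch of buffered decoding: defer state updates, apply them in batch."""
    outs, buf_k, buf_v = [], [], []
    state_updates = 0
    for q, k, v in zip(qs, ks, vs):
        buf_k.append(k)
        buf_v.append(v)
        K, V = np.stack(buf_k), np.stack(buf_v)   # (n, d_k), (n, d_v)
        # Output = stale-state term + attention over the buffered tokens.
        outs.append(q @ S + (q @ K.T) @ V)
        if len(buf_k) == chunk:
            S = S + K.T @ V             # one batched state update per chunk
            buf_k, buf_v = [], []
            state_updates += 1
    return S, np.stack(outs), (buf_k, buf_v), state_updates

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_k, d_v, T = 8, 5, 10
    S0 = rng.standard_normal((d_k, d_v))
    qs = rng.standard_normal((T, d_k))
    ks = rng.standard_normal((T, d_k))
    vs = rng.standard_normal((T, d_v))
    _, ref = recurrent_decode(S0, qs, ks, vs)
    _, out, _, n_updates = buffered_decode(S0, qs, ks, vs, chunk=4)
    print("outputs match:", np.allclose(ref, out), "| state updates:", n_updates)
```

Under these assumptions both functions produce the same outputs, but the buffered version writes the d_k × d_v state once per chunk instead of once per token, at the cost of reading a few small buffered keys and values at each step. Tokens still in the buffer when decoding stops have not yet been folded into the state.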
