Attention Once Is All You Need: Efficient Streaming Inference with Stateful Transformers
Victor Norgren

TL;DR
This paper presents a stateful transformer inference method that cuts query latency in streaming workloads by keeping a persistent KV cache up to date as data arrives and by pre-evaluating registered questions during idle GPU time.
Contribution
It introduces a stateful session model with a persistent KV cache and a multi-tenant continuous-batching scheduler, making query latency independent of accumulated context size and giving up to 5.9x speedup in streaming inference.
Findings
Achieves up to 5.9x speedup over traditional inference engines.
Keeps query latency at O(|q|), independent of accumulated context size.
Enables multiple sessions to coexist efficiently on a single GPU.
Abstract
Conventional transformer inference engines are request-driven, paying an O(n) prefill cost on every query. In streaming workloads, where data arrives continuously and queries probe an ever-growing context, this cost is prohibitive. We introduce a data-driven computational model centred on stateful sessions: a persistent KV cache advanced incrementally as new data arrives, so prefill is moved off the critical path and query latency becomes O(|q|), independent of accumulated context size. Building on this, Flash Queries reclaim idle GPU cycles between data arrivals to pre-evaluate registered questions and return cached answers before the user asks, a pattern that is structurally impossible in stateless engines because they discard intermediate state between requests. A multi-tenant continuous-batching scheduler with cell-budget admission and prefix-aware grouped prefill lets dozens of…
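The difference between the request-driven and data-driven models can be illustrated with a toy cost counter. This is a minimal sketch, not the paper's implementation: `StatelessEngine`, `StatefulSession`, `toy_answer`, and the idea of measuring cost as "tokens processed on the critical path" are all assumptions made for illustration, and a plain list of tokens stands in for a real KV cache.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Tokens = List[str]


def toy_answer(cache: Tokens, question: Tokens) -> str:
    # Hypothetical stand-in for model output: counts how often the
    # last token of the question appears in the accumulated context.
    return str(cache.count(question[-1]))


@dataclass
class StatelessEngine:
    """Request-driven baseline: no state survives between requests."""
    context: Tokens = field(default_factory=list)

    def ingest(self, tokens: Tokens) -> None:
        # Data is only stored; nothing is computed until a query arrives.
        self.context.extend(tokens)

    def query_cost(self, question: Tokens) -> int:
        # Critical-path tokens: full prefill of the context plus the question.
        return len(self.context) + len(question)


@dataclass
class StatefulSession:
    """Data-driven session: the cache is advanced as data arrives."""
    kv_cache: Tokens = field(default_factory=list)  # stand-in for K/V entries
    background_tokens: int = 0  # prefill work done off the critical path
    registered: List[Tuple[str, ...]] = field(default_factory=list)
    flash_answers: Dict[Tuple[str, ...], str] = field(default_factory=dict)

    def register(self, question: Tokens) -> None:
        # Register a question to be pre-evaluated whenever the cache changes.
        key = tuple(question)
        self.registered.append(key)
        self.flash_answers[key] = toy_answer(self.kv_cache, question)

    def ingest(self, tokens: Tokens) -> None:
        # Incremental prefill: only the new tokens are processed.
        self.kv_cache.extend(tokens)
        self.background_tokens += len(tokens)
        # Toy Flash Query step: refresh cached answers between arrivals.
        for key in self.registered:
            self.flash_answers[key] = toy_answer(self.kv_cache, list(key))

    def query(self, question: Tokens) -> Tuple[str, int]:
        # Returns (answer, critical-path tokens processed).
        key = tuple(question)
        if key in self.flash_answers:
            return self.flash_answers[key], 0  # answer was precomputed
        return toy_answer(self.kv_cache, question), len(question)
```

In this sketch, feeding both objects the same stream makes `StatelessEngine.query_cost` grow with the total number of ingested tokens, while `StatefulSession.query` reports a cost that depends only on the question length, or zero when the question was registered in advance. The ingestion work has not disappeared; it is accounted for in `background_tokens`, which mirrors the abstract's point that prefill is moved off the critical path rather than eliminated.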
