Sparse Prefix Caching for Hybrid and Recurrent LLM Serving
Mikhail Shirokikh, Sergey Nikolenko

TL;DR
This paper introduces a sparse prefix caching method for hybrid and recurrent LLMs that optimizes latency by storing exact recurrent states at strategic checkpoints, improving efficiency for shared prefix requests.
Contribution
It formalizes sparse prefix caching as a checkpoint placement problem under a distribution over overlap depths, solvable exactly with an O(NM) dynamic program, and evaluates it on workloads where requests share a non-trivial prefix.
Findings
Outperforms fixed-budget baselines on Pareto frontier metrics.
Uses fewer checkpoints while maintaining or improving performance.
Most beneficial when many requests share substantial prefixes.
Abstract
Prefix caching is a key latency optimization for autoregressive LLM serving, yet existing systems assume dense per-token key/value reuse. State-space models change the structure of the problem: a recurrent layer can resume from a single stored state rather than requiring the entire token history. This asymmetry opens a new design point between no reuse and dense caching: store exact recurrent states at a sparse set of checkpoint positions and, on a cache hit, resume from the deepest stored checkpoint and recompute the remaining suffix exactly. We formalize sparse prefix caching as checkpoint placement under a distribution over overlap depths, yielding an exact O(NM) dynamic program. For use cases where requests share a non-trivial prefix (e.g. asking different questions about a single long document), we show that our method consistently improves the Pareto frontier traced by standard…
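To make the checkpoint placement idea concrete, here is a minimal sketch under stated assumptions; it is not the paper's algorithm. It assumes the cost of a request with overlap depth d is the number of tokens recomputed after the deepest checkpoint at or below d, that position 0 (the empty state) is always available for free, and that `p[d]` is the probability of overlap depth d. The names `place_checkpoints` and `tokens_to_recompute` are hypothetical. For readability it uses a straightforward O(N²M) recurrence rather than the O(NM) dynamic program mentioned in the abstract.

```python
from bisect import bisect_right
from itertools import accumulate


def place_checkpoints(p, max_checkpoints):
    """Choose up to max_checkpoints positions in 1..N that minimize the
    expected number of recomputed tokens, where p[d] is the probability
    that a request overlaps the cached prefix to depth d (d = 0..N).

    Assumed cost model (an illustration, not taken from the paper):
    a request at depth d recomputes d - c tokens, where c is the deepest
    checkpoint <= d and position 0 is always a free checkpoint.
    """
    n = len(p) - 1
    # Prefix sums: P[k] = sum of p[d] for d < k, Q[k] = sum of d * p[d] for d < k.
    P = [0.0] + list(accumulate(p))
    Q = [0.0] + list(accumulate(d * pd for d, pd in enumerate(p)))

    def seg(i, j):
        # Expected recompute for depths in [i, j) when all resume from checkpoint i.
        return (Q[j] - Q[i]) - i * (P[j] - P[i])

    inf = float("inf")
    # f[k][i]: best cost for depths < i with k checkpoints placed, the last one at i.
    f = [[inf] * (n + 1) for _ in range(max_checkpoints + 1)]
    parent = [[None] * (n + 1) for _ in range(max_checkpoints + 1)]
    f[0][0] = 0.0
    for k in range(1, max_checkpoints + 1):
        for i in range(1, n + 1):
            for j in range(i):
                if f[k - 1][j] == inf:
                    continue
                cand = f[k - 1][j] + seg(j, i)
                if cand < f[k][i]:
                    f[k][i] = cand
                    parent[k][i] = j

    # Close the last segment at N + 1 and pick the best (k, last checkpoint).
    best_cost, best_k, best_i = seg(0, n + 1), 0, 0
    for k in range(1, max_checkpoints + 1):
        for i in range(1, n + 1):
            total = f[k][i] + seg(i, n + 1)
            if total < best_cost:
                best_cost, best_k, best_i = total, k, i

    # Walk parent pointers back to recover the chosen positions.
    checkpoints = []
    k, i = best_k, best_i
    while k > 0:
        checkpoints.append(i)
        i = parent[k][i]
        k -= 1
    return sorted(checkpoints), best_cost


def tokens_to_recompute(checkpoints, depth):
    """On a cache hit at the given depth, resume from the deepest stored
    checkpoint <= depth (or from position 0) and return the suffix length."""
    idx = bisect_right(checkpoints, depth)
    start = checkpoints[idx - 1] if idx > 0 else 0
    return depth - start
```

For example, with `p = [0, 0, 0.5, 0, 0.5]` (half of requests overlap to depth 2, half to depth 4) and a budget of two checkpoints, the sketch places them at depths 2 and 4, so no tokens are recomputed in either case. With a budget of one, either position leaves an expected recompute of one token.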
