Sparse Prefix Caching for Hybrid and Recurrent LLM Serving

Mikhail Shirokikh; Sergey Nikolenko

arXiv:2605.05219·cs.LG·May 8, 2026

TL;DR

This paper introduces sparse prefix caching for hybrid and recurrent LLMs. It stores exact recurrent states at a sparse set of checkpoint positions, so a request that shares a prefix with earlier requests can resume from the deepest stored checkpoint and recompute only the remaining suffix, which reduces latency.

Contribution

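It formalizes sparse prefix caching as checkpoint placement under a distribution over overlap depths, which yields an exact O(NM) dynamic program, and it evaluates the method on workloads where requests share a non-trivial prefix, such as different questions about a single long document.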

Findings

01 Outperforms fixed-budget baselines on Pareto frontier metrics.

02 Uses fewer checkpoints while maintaining or improving performance.

03 Most beneficial when many requests share substantial prefixes.

Abstract

Prefix caching is a key latency optimization for autoregressive LLM serving, yet existing systems assume dense per-token key/value reuse. State-space models change the structure of the problem: a recurrent layer can resume from a single stored state rather than requiring the entire token history. This asymmetry opens a new design point between no reuse and dense caching: store exact recurrent states at a sparse set of checkpoint positions and, on a cache hit, resume from the deepest stored checkpoint and recompute the remaining suffix exactly. We formalize sparse prefix caching as checkpoint placement under a distribution over overlap depths, yielding an exact O(NM) dynamic program. For use cases where requests share a non-trivial prefix (e.g. asking different questions about a single long document), we show that our method consistently improves the Pareto frontier traced by standard…
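
The checkpoint placement idea in the abstract can be illustrated with a small sketch. This is not the paper's algorithm: it is a straightforward O(N^2 M) dynamic program under an assumed cost model, where the cost of a cache hit at overlap depth d is the number of tokens recomputed from the deepest checkpoint at or before d, and position 0 (the empty state) is always available for free. The names `place_checkpoints` and `resume_point`, and the layout of the distribution `p`, are hypothetical choices made for this illustration.

```python
from bisect import bisect_right
from itertools import accumulate


def place_checkpoints(p, m):
    """Choose at most m checkpoint positions in 1..N minimizing expected recompute.

    p[d] is the assumed probability that a request overlaps the cached prefix
    to depth exactly d, for d = 0..N. Returns (expected_cost, positions).
    """
    n = len(p) - 1
    end = n + 1  # sentinel position that closes the last segment
    # Prefix sums: s0[i] = sum of p[d] for d < i, s1[i] = sum of d * p[d] for d < i.
    s0 = [0.0] + list(accumulate(p))
    s1 = [0.0] + list(accumulate(d * q for d, q in enumerate(p)))

    def seg(a, b):
        # Expected recompute for depths a..b-1 when the deepest checkpoint is a.
        return (s1[b] - s1[a]) - a * (s0[b] - s0[a])

    inf = float("inf")
    # f[k][b]: minimal cost over depths < b, with the k-th placed position at b.
    # Layer 0 holds only the free position 0.
    f = [[inf] * (end + 1) for _ in range(m + 2)]
    arg = [[-1] * (end + 1) for _ in range(m + 2)]
    f[0][0] = 0.0
    for k in range(1, m + 2):
        for b in range(1, end + 1):
            for a in range(b):
                if f[k - 1][a] == inf:
                    continue
                cost = f[k - 1][a] + seg(a, b)
                if cost < f[k][b]:
                    f[k][b], arg[k][b] = cost, a

    # The sentinel uses one layer, so layer k corresponds to k - 1 real checkpoints.
    # min() returns the first minimum, i.e. the fewest checkpoints among ties.
    best_k = min(range(1, m + 2), key=lambda k: f[k][end])
    positions = []
    b = end
    for k in range(best_k, 0, -1):
        b = arg[k][b]
        if b > 0:  # skip the free position 0
            positions.append(b)
    positions.reverse()
    return f[best_k][end], positions


def resume_point(checkpoints, overlap):
    """Deepest stored checkpoint at or before `overlap` (0 if none is stored)."""
    i = bisect_right(checkpoints, overlap)
    return checkpoints[i - 1] if i else 0
```

For example, with `p = [0, 0, 0, 0.25, 0, 0, 0.75]` (overlap depth 3 with probability 0.25, depth 6 with probability 0.75), `place_checkpoints(p, 1)` returns `(0.75, [6])` and `place_checkpoints(p, 2)` returns `(0.0, [3, 6])`. On a hit with overlap depth 5 and checkpoints `[3, 6]`, `resume_point` returns 3, so two tokens are recomputed. The triple loop is kept for readability; reaching the O(NM) bound stated in the abstract would need a faster transition that this sketch does not attempt.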
