Irminsul: MLA-Native Position-Independent Caching for Agentic LLM Serving
Bole Ma, Jan Eitzinger, Harald Köstler

TL;DR
Irminsul is a content-addressed caching method for agentic LLM workloads. It preserves cache hits when identical tokens reappear at shifted positions, which raises hit rates and cuts prefill energy, and it avoids the full-key RoPE correction that GQA imposes on earlier position-independent caches.
Contribution
The paper presents Irminsul, a content-addressed caching system for agentic LLM serving. It extends SGLang's radix cache with content-hash keying over CDC-chunked segments and a delta-rotation rule that re-positions the RoPE-dependent part of each cached MLA key.
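To make the content-hashing idea concrete, here is a minimal sketch of content-defined chunking (the usual expansion of CDC) over token IDs. It is not the paper's implementation: the window size, the boundary mask, the SHA-256 keying, and the names `cdc_chunks` and `chunk_keys` are all assumptions chosen for illustration. The point it shows is that chunk boundaries depend only on nearby content, so the same tokens produce the same chunk keys even after a prefix shifts their positions.

```python
import hashlib

def _digest(tokens):
    """SHA-256 over a space-joined rendering of token IDs (illustrative keying)."""
    return hashlib.sha256(" ".join(map(str, tokens)).encode()).digest()

def cdc_chunks(tokens, window=4, mask=0x07):
    """Cut after index i whenever the hash of the last `window` tokens hits the mask.

    The cut decision looks only at content, never at the absolute index,
    so boundaries move together with the tokens. `window` and `mask` are
    hypothetical parameters, not values from the paper.
    """
    chunks, start = [], 0
    for i in range(window - 1, len(tokens)):
        if _digest(tokens[i + 1 - window:i + 1])[0] & mask == 0:
            chunks.append(tokens[start:i + 1])
            start = i + 1
    if start < len(tokens):
        chunks.append(tokens[start:])  # trailing chunk without a closing boundary
    return chunks

def chunk_keys(tokens):
    """Content-hash key for each chunk, in order."""
    return [_digest(c).hex() for c in cdc_chunks(tokens)]

# The same content, once at position 0 and once behind a 5-token prefix.
turn = [(i * 7919) % 1000 for i in range(200)]
shifted = [42, 43, 44, 45, 46] + turn

# Every chunk of `turn` except possibly the first reappears with an
# identical key in `shifted`, even though all positions moved by 5.
reused = set(chunk_keys(turn)[1:]) <= set(chunk_keys(shifted))
```

An exact-prefix cache would find nothing to reuse in `shifted`, because the very first token differs. Keying by content lets the lookup succeed per chunk; what remains is fixing the positional encoding of the reused keys.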
Findings
Recovers up to 83% of prompt tokens beyond what exact-prefix caching reuses
Achieves 63% prefill energy savings per cache hit
Effective across multiple large-scale MLA-MoE deployments
Abstract
Agentic LLM workloads put bit-identical tokens at shifted positions every turn, voiding prefix caches at the first byte of divergence. Operators report cache-hit regressions ranging from moderate slowdowns to severe TTFT spikes of 10-16s on unchanged content. Prior position-independent caching systems correct RoPE on the full key dimension, an architectural cost imposed by GQA, not by caching itself. Multi-Head Latent Attention (MLA), deployed at scale in DeepSeek-V2/V3/R1, Kimi-K2/Moonlight, GLM-5, and Mistral Large 3, factors each KV row into a position-free latent component and a 64-dim RoPE component that is correctable in closed form; this structure motivates content-addressed caching as a natural fit rather than a GQA workaround. We present Irminsul, which extends SGLang's radix cache with content-hash keying over CDC-chunked segments and a delta-rotation rule for the RoPE component. We evaluate three native…
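The closed-form correction mentioned in the abstract follows from the fact that RoPE is a rotation, and rotations compose: a key rotated for position p becomes the key for position p + delta after one further rotation by delta. The sketch below illustrates that identity for a 64-dim vector. It is an assumption-laden illustration rather than Irminsul's kernel: the base of 10000, the interleaved pairing of dimensions, and the names `rope` and `delta_rotate` are not taken from the paper.

```python
import cmath
import random

DIM = 64        # width of the RoPE component named in the abstract
BASE = 10000.0  # assumed RoPE base; the paper's models may use another value

def rope(vec, pos, base=BASE):
    """Apply RoPE at `pos`, treating (vec[2i], vec[2i+1]) as one complex pair.

    The interleaved pairing is an assumption; some implementations pair
    dimension i with dimension i + len(vec) / 2 instead.
    """
    out = []
    for i in range(len(vec) // 2):
        theta = base ** (-2.0 * i / len(vec))
        z = complex(vec[2 * i], vec[2 * i + 1]) * cmath.exp(1j * pos * theta)
        out += [z.real, z.imag]
    return out

def delta_rotate(cached, delta, base=BASE):
    """Move an already-rotated key by `delta` positions (delta may be negative)."""
    return rope(cached, delta, base)

random.seed(0)
raw_key = [random.uniform(-1.0, 1.0) for _ in range(DIM)]

cached = rope(raw_key, 100)        # key as stored when the chunk sat at position 100
moved = delta_rotate(cached, 37)   # chunk now starts 37 tokens later
direct = rope(raw_key, 137)        # what a fresh prefill would have produced

max_err = max(abs(a - b) for a, b in zip(moved, direct))
```

Here `max_err` stays at floating-point rounding level, so the re-positioned key matches a recomputed one without touching the unrotated input. In MLA only the small RoPE component needs this step, since the latent component carries no position.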
