Irminsul: MLA-Native Position-Independent Caching for Agentic LLM Serving

Bole Ma; Jan Eitzinger; Harald Köstler

arXiv:2605.05696 · cs.DC · May 8, 2026


TL;DR

Irminsul is a content-addressed cache for agentic LLM workloads. It raises cache hit rates and cuts prefill energy by reusing cached content whose token positions have shifted, without the full-key RoPE correction that GQA imposes on earlier position-independent caches.

Contribution

The paper presents Irminsul, a content-addressed caching system for agentic LLM serving. It extends SGLang's radix cache with content-hash keying over CDC-chunked segments and a δ-rotation rule that corrects the 64-dimensional positional key component of Multi-Head Latent Attention.
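As a rough illustration of the content-hashing idea, the sketch below splits a token sequence at content-defined boundaries and keys each chunk by a hash of its tokens, so that identical content yields the same keys even after its positions shift. It is not Irminsul's implementation: the names `cdc_chunks` and `chunk_keys`, the window size, the target chunk length, and the use of SHA-256 for both boundary detection and keying are all assumptions made for this example.

```python
import hashlib
import struct

WINDOW = 4      # tokens inspected when deciding a boundary (assumed value)
AVG_CHUNK = 8   # target mean chunk length in tokens (assumed value)


def _pack(tokens):
    """Serialize token IDs as little-endian unsigned 32-bit integers."""
    return struct.pack(f"<{len(tokens)}I", *tokens)


def cdc_chunks(tokens):
    """Split tokens wherever a hash of the trailing window hits a fixed residue.

    The boundary decision looks only at nearby content, never at the absolute
    position, so boundaries realign after an insertion earlier in the sequence.
    """
    chunks, start = [], 0
    for i in range(len(tokens)):
        window = tokens[max(0, i - WINDOW + 1): i + 1]
        digest = hashlib.sha256(_pack(window)).digest()
        if int.from_bytes(digest[:4], "little") % AVG_CHUNK == 0:
            chunks.append(tokens[start: i + 1])
            start = i + 1
    if start < len(tokens):
        chunks.append(tokens[start:])  # trailing partial chunk
    return chunks


def chunk_keys(tokens):
    """Content-hash key for each chunk (hypothetical cache key format)."""
    return [hashlib.sha256(_pack(c)).hexdigest() for c in cdc_chunks(tokens)]


if __name__ == "__main__":
    body = [(i * 7919 + 13) % 50000 for i in range(400)]
    shifted = [1, 2, 3] + body  # same content, every position moved by 3
    shared = set(chunk_keys(body)) & set(chunk_keys(shifted))
    print(f"{len(shared)} of {len(chunk_keys(body))} chunk keys survive the shift")
```

An exact-prefix cache would miss on `shifted` from its first token, whereas the chunk keys above match again once the chunker reaches the first boundary inside the unchanged content.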

Findings

01. Recovers up to 83% of prompt tokens beyond what exact-prefix caching recovers.

02. Achieves 63% prefill energy savings per cache hit.

03. Effective across multiple large-scale MLA-MoE deployments.

Abstract

Agentic LLM workloads put bit-identical tokens at shifted positions every turn, voiding prefix caches at the first byte of divergence. Operators report cache-hit regressions ranging from moderate slowdowns to severe TTFT spikes of 10-16 s on unchanged content. Prior position-independent caching systems correct RoPE on the full $d_{K}$-dimensional key, an architectural cost imposed by GQA, not by caching itself. Multi-Head Latent Attention, deployed at scale in DeepSeek-V2/V3/R1, Kimi-K2/Moonlight, GLM-5, and Mistral Large 3, factors each KV row into a position-free $c_{KV}$ and a 64-dim $k_{r}$ correctable in closed form; this structure motivates content-addressed caching as a natural fit rather than a GQA workaround. We present Irminsul, which extends SGLang's radix cache with content-hash keying over CDC-chunked segments and a $\delta$-rotation rule for $k_{r}$. We evaluate three native…
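The closed-form correction of $k_{r}$ mentioned in the abstract can be motivated by a standard property of RoPE: rotations compose, so a key rotated for position $p$ can be moved to position $p + \delta$ by one further rotation by $\delta$. The sketch below checks that property numerically for a 64-dimensional vector. It is only one plausible reading of a δ-rotation rule, not the paper's stated formula; the function names, the frequency base of 10000, and the pairing of dimensions into complex numbers are assumptions for this example.

```python
import cmath
import random

ROPE_DIM = 64    # size of the positional key component, as in the abstract
BASE = 10000.0   # common RoPE frequency base (assumed, not from the paper)

# One rotation frequency per 2-D pair of dimensions.
THETAS = [BASE ** (-2.0 * i / ROPE_DIM) for i in range(ROPE_DIM // 2)]


def rope(pairs, pos):
    """Rotate each 2-D pair, stored as a complex number, by pos * theta_i."""
    return [z * cmath.exp(1j * pos * t) for z, t in zip(pairs, THETAS)]


def delta_rotate(cached_pairs, delta):
    """Move an already-rotated key by delta positions (hypothetical helper).

    Because R(p + delta) = R(delta) R(p), the correction is one more rotation.
    """
    return rope(cached_pairs, delta)


rng = random.Random(0)
raw = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(ROPE_DIM // 2)]

cached = rope(raw, 100)                # key cached when the token sat at position 100
corrected = delta_rotate(cached, 37)   # the same token now sits at position 137
fresh = rope(raw, 137)                 # what full recomputation would produce

max_err = max(abs(a - b) for a, b in zip(corrected, fresh))
print(f"max deviation from recomputation: {max_err:.2e}")
```

Under these assumptions the correction touches only the 64 positional dimensions, which is consistent with the abstract's point that the position-free $c_{KV}$ needs no adjustment.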
