Krul: Efficient State Restoration for Multi-turn Conversations with Dynamic Cross-layer KV Sharing
Junyi Wen, Junyuan Liang, Zicong Hong, Wuhui Chen, Ting Cai, Zibin Zheng

TL;DR
Krul is a multi-turn LLM inference system that chooses a key-value (KV) cache compression strategy for each conversation based on attention similarity, reducing time-to-first-token and KV cache storage while keeping generation quality comparable to state-of-the-art methods.
Contribution
Krul introduces a dynamic, conversation-specific approach to KV cache compression and restoration. Unlike static methods, which compress the same layer pairs in every conversation, it accounts for how attention pattern similarity varies from one conversation to another.
Findings
Achieves 1.5x-2.68x reduction in time-to-first-token
Reduces KV cache storage by 1.33x-2.35x
Maintains generation quality comparable to state-of-the-art methods
Abstract
Efficient state restoration in multi-turn conversations with large language models (LLMs) remains a critical challenge, primarily due to the overhead of recomputing or loading full key-value (KV) caches for all historical tokens. To address this, existing approaches compress KV caches across adjacent layers with highly similar attention patterns. However, these methods often apply a fixed compression scheme across all conversations, selecting the same layer pairs for compression without considering conversation-specific attention dynamics. This static strategy overlooks variability in attention pattern similarity across different conversations, which can lead to noticeable accuracy degradation. We present Krul, a multi-turn LLM inference system that enables accurate and efficient KV cache restoration. Krul dynamically selects compression strategies based on attention similarity across…
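The difference between a fixed scheme and conversation-specific selection can be illustrated with a minimal sketch. This is not Krul's algorithm, which the truncated abstract does not spell out. It assumes each layer's attention pattern is summarized as a flat vector, that similarity is measured with cosine similarity, and that adjacent pairs are picked greedily above a threshold. The names `select_layer_pairs`, `threshold`, `conv_a`, and `conv_b` are hypothetical, and the attention values are toy numbers.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_layer_pairs(layer_attention, threshold=0.9):
    """Greedily pick non-overlapping adjacent layer pairs whose attention
    summaries are similar enough to share one KV cache.

    layer_attention: one flattened attention summary per layer (hypothetical input).
    threshold: assumed similarity cutoff, not a value from the paper.
    """
    pairs = []
    i = 0
    while i < len(layer_attention) - 1:
        if cosine_similarity(layer_attention[i], layer_attention[i + 1]) >= threshold:
            pairs.append((i, i + 1))
            i += 2  # both layers are used; move past the pair
        else:
            i += 1
    return pairs


# Toy attention summaries for two conversations over the same 4-layer model.
conv_a = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.9, 0.1]]
conv_b = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]]

pairs_a = select_layer_pairs(conv_a)
pairs_b = select_layer_pairs(conv_b)
print(pairs_a)
print(pairs_b)
```

In this toy setup the two conversations yield different layer pairs, so a static scheme that always shared layers 0 and 1 would merge dissimilar layers in the second conversation. That is the failure mode the abstract attributes to fixed compression schemes.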
