Krul: Efficient State Restoration for Multi-turn Conversations with Dynamic Cross-layer KV Sharing

Junyi Wen; Junyuan Liang; Zicong Hong; Wuhui Chen; Ting Cai; Zibin Zheng

arXiv:2507.08045 · cs.CL · August 27, 2025

TL;DR

Krul is a multi-turn LLM inference system that selects key-value (KV) cache compression strategies per conversation based on attention similarity, reducing time-to-first-token and KV cache storage while keeping generation quality comparable to state-of-the-art methods.

Contribution

Krul introduces dynamic, conversation-specific KV cache compression and restoration. Unlike static methods that compress the same layer pairs for every conversation, it accounts for how attention pattern similarity varies from one conversation to another.

Findings

01  Achieves 1.5x-2.68x reduction in time-to-first-token

02  Reduces KV cache storage by 1.33x-2.35x

03  Maintains generation quality comparable to state-of-the-art methods

Abstract

Efficient state restoration in multi-turn conversations with large language models (LLMs) remains a critical challenge, primarily due to the overhead of recomputing or loading full key-value (KV) caches for all historical tokens. To address this, existing approaches compress KV caches across adjacent layers with highly similar attention patterns. However, these methods often apply a fixed compression scheme across all conversations, selecting the same layer pairs for compression without considering conversation-specific attention dynamics. This static strategy overlooks variability in attention pattern similarity across different conversations, which can lead to noticeable accuracy degradation. We present Krul, a multi-turn LLM inference system that enables accurate and efficient KV cache restoration. Krul dynamically selects compression strategies based on attention similarity across…
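The contrast between a fixed scheme and per-conversation selection of layer pairs can be illustrated with a small sketch. This is not Krul's algorithm, which the abstract does not spell out. It assumes each layer's attention pattern is summarized as a flat vector, uses cosine similarity as the similarity measure, and picks non-overlapping adjacent pairs greedily. The function names, the threshold of 0.9, and the toy vectors are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_shared_layer_pairs(
    layer_attention: List[Sequence[float]],
    threshold: float = 0.9,  # assumed cutoff, not taken from the paper
) -> List[Tuple[int, int]]:
    """Pick non-overlapping adjacent layer pairs (i, i + 1) for one conversation.

    Hypothetical helper: pairs are chosen greedily, most similar first,
    and each layer appears in at most one pair.
    """
    scored = [
        (cosine_similarity(layer_attention[i], layer_attention[i + 1]), i)
        for i in range(len(layer_attention) - 1)
    ]
    used = set()
    pairs = []
    # Highest similarity first; ties are broken by the lower layer index.
    for sim, i in sorted(scored, key=lambda t: (-t[0], t[1])):
        if sim < threshold:
            break
        if i in used or i + 1 in used:
            continue
        pairs.append((i, i + 1))
        used.update((i, i + 1))
    return sorted(pairs)


# Toy per-layer attention summaries for two conversations (made-up values).
conv_a = [[1, 0, 0], [1, 0.1, 0], [0, 1, 0], [0, 1, 0.1]]
conv_b = [[1, 0, 0], [0, 1, 0], [0, 1, 0.1], [1, 0, 0]]

print(select_shared_layer_pairs(conv_a))  # [(0, 1), (2, 3)]
print(select_shared_layer_pairs(conv_b))  # [(1, 2)]
```

The two toy conversations yield different pairs, which is the situation a static scheme cannot handle: a fixed choice such as (0, 1) and (2, 3) would merge dissimilar layers in the second conversation.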
