FlowKV: Enhancing Multi-Turn Conversational Coherence in LLMs via Isolated Key-Value Cache Management
Xiang Liu, Hong Chen, Xuming Hu, Xiaowen Chu

TL;DR
FlowKV introduces a multi-turn isolation mechanism for KV Cache management in LLMs, significantly improving multi-turn conversational coherence and reducing information loss without additional training.
Contribution
It proposes a novel multi-turn isolation mechanism for KV Cache management that enhances coherence and performance in multi-turn conversations without requiring model training.
Findings
Outperforms baseline strategies in instruction-following accuracy.
Improves user preference retention from 10.90% to 75.40%.
Effective especially in later conversational turns.
Abstract
Large Language Models (LLMs) are increasingly deployed in multi-turn conversational applications, where the management of the Key-Value (KV) Cache presents a significant bottleneck. The linear growth of the KV Cache with dialogue history imposes substantial computational costs, and existing eviction strategies often degrade performance by repeatedly compressing early conversational context, leading to information loss and context forgetting. This paper introduces FlowKV, a novel multi-turn isolation mechanism for KV Cache management, which can be applied to any KV Cache compression method without training. FlowKV's core innovation is a multi-turn isolation mechanism that preserves the accumulated compressed KV cache from past turns. Compression is then strategically applied only to the newly generated KV pairs of the latest completed turn, effectively preventing the…
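The isolation idea described in the abstract can be illustrated with a toy simulation. The sketch below is not the authors' implementation: `KVEntry`, `keep_top_scored`, `recompress_all`, `isolated_update`, and `simulate` are hypothetical names, the importance scores are synthetic, and a simple "keep the top-scored fraction" rule stands in for whichever compression method is plugged in. It only contrasts two schedules: recompressing the whole history every turn versus compressing each completed turn once and leaving earlier compressed entries untouched.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class KVEntry:
    turn: int       # turn in which this key-value pair was produced
    position: int   # token position within the whole dialogue
    score: float    # synthetic importance score (assumption, e.g. attention mass)


Compressor = Callable[[List[KVEntry], float], List[KVEntry]]


def keep_top_scored(entries: List[KVEntry], ratio: float) -> List[KVEntry]:
    """Hypothetical stand-in for any eviction method: keep the top `ratio` fraction."""
    budget = max(1, int(len(entries) * ratio))
    kept = sorted(entries, key=lambda e: e.score, reverse=True)[:budget]
    return sorted(kept, key=lambda e: e.position)  # restore positional order


def recompress_all(cache, new_turn, ratio, compress: Compressor):
    """Toy baseline: every turn, compress the full history again."""
    return compress(cache + new_turn, ratio)


def isolated_update(cache, new_turn, ratio, compress: Compressor):
    """Isolation: earlier compressed entries are kept as-is; only the new turn is compressed."""
    return cache + compress(new_turn, ratio)


def simulate(update, turns: int = 4, tokens_per_turn: int = 100, ratio: float = 0.5):
    """Return how many entries of each turn survive after the last turn."""
    cache: List[KVEntry] = []
    for t in range(turns):
        new_turn = [
            KVEntry(t, t * tokens_per_turn + i, ((i * 37) % 101) / 101)
            for i in range(tokens_per_turn)
        ]
        cache = update(cache, new_turn, ratio, keep_top_scored)
    return [sum(1 for e in cache if e.turn == t) for t in range(turns)]


baseline = simulate(recompress_all)
isolated = simulate(isolated_update)
print("surviving entries per turn (recompress all):", baseline)
print("surviving entries per turn (isolated):      ", isolated)
```

With these toy settings the isolated schedule keeps 50 entries from every turn (200 in total), while the recompress-all baseline ends with 93 entries and the first turn thinned to fewer than 50. The same numbers also show the trade-off raised in the reviews: compressing each turn only once preserves early context but yields a larger cache.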
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
S1. This paper tackles an important problem of efficient multi-round conversation. S2. The approach proposed in this paper is simple and easy to understand. S3. The experiments were conducted over a range of different baselines.
W1. The judge LLM, GPT-4o, is a legacy model, older than the evaluated Llama3.1 and Qwen2.5 models. I would suggest using a newer SOTA model as a judge. W2. The baseline LLMs are somewhat outdated. Llama3.1 is fine but Qwen2.5 should be replaced by Qwen3. W3. Compressing each round individually seems to have the problem of lower compression ratios compared with re-compression over the entire history, which might be less effective in terms of space saving and extremely long-round memory.
- Addresses a real and important problem in multi-turn efficiency for LLMs, namely the recursive compression and cumulative information loss across dialogue turns. - The proposed approach is simple, general, and easy to integrate with existing KV compression methods (e.g., SnapKV, ChunkKV, Expected Attention).
1. The claimed "multi-turn isolation mechanism" is essentially a straightforward and obvious engineering adaptation of existing frameworks such as `kvpress` to multi-turn settings. This is a natural and expected implementation choice when extending any prefilling compression method to multi-turn use. The paper does not introduce a new compression function or theoretical principle; instead, it modifies the scheduling of existing operations. Hence, the core novelty is minimal. A **deeper** explora
The topic is timely due to the rise of Agentic AI. The observation that SOTA approaches compress the earlier parts of the query-response history more often than later parts is fairly obvious but may not yet have been exploited in the literature. The presentation is overall clear, except for some questions listed below.
The proposed modification of the SOTA approach (in each step compress only the parts that have not been compressed) is straightforward. Compressing each part only once seems to increase the required cache size, which needs to be discussed and experimentally evaluated. Experiments with other SOTA KV Cache methods such as TOVA and KeyDiff would strengthen the argument that the approach of FlowKV generalizes well. The authors performed experiments for prompt length 8192 and output length 4096
1. The proposed isolation mechanism is intuitive and directly addresses the issue of cumulative compression loss in multi-turn LLM interactions. 2. It requires no retraining and can be combined with any existing KV compression method. 3. The figures are informative and greatly aid in understanding the proposed method and experimental results.
1. While the method is well-motivated, the theoretical section (Appendix D) remains descriptive rather than analytical. A more formal quantification of “information degradation under repeated compression” would strengthen the contribution. 2. The study primarily focuses on instruction-following and preference tasks. Additional experiments on open-domain dialogue or reasoning datasets (e.g., LongBench, SCBench full set) would improve generalization claims. In particular, the latency analysis is
Taxonomy
Topics: Service-Oriented Architecture and Web Services · Speech and dialogue systems · Semantic Web and Ontologies
