FlowKV: Enhancing Multi-Turn Conversational Coherence in LLMs via Isolated Key-Value Cache Management

Xiang Liu; Hong Chen; Xuming Hu; Xiaowen Chu

arXiv:2505.15347·cs.CL·October 9, 2025

TL;DR

FlowKV introduces a multi-turn isolation mechanism for KV Cache management in LLMs, significantly improving multi-turn conversational coherence and reducing information loss without additional training.

Contribution

It proposes a novel multi-turn isolation mechanism for KV Cache management that enhances coherence and performance in multi-turn conversations without requiring model training.

Findings

01

Outperforms baseline strategies in instruction-following accuracy.

02

Improves user preference retention (reported range: 10.90% to 75.40%).

03

Particularly effective in later conversational turns.

Abstract

Large Language Models (LLMs) are increasingly deployed in multi-turn conversational applications, where the management of the Key-Value (KV) Cache presents a significant bottleneck. The linear growth of the KV Cache with dialogue history imposes substantial computational costs, and existing eviction strategies often degrade performance by repeatedly compressing early conversational context, leading to information loss and context forgetting. This paper introduces FlowKV, a novel multi-turn isolation mechanism for KV Cache management that can be applied to any KV Cache compression method without training. FlowKV preserves the accumulated compressed KV cache from past turns. Compression is then strategically applied only to the newly generated KV pairs of the latest completed turn, effectively preventing the…
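The update rule described in the abstract can be contrasted with whole-history re-compression in a short Python sketch. Everything here is an illustrative assumption, not FlowKV's implementation: each cache entry is reduced to a hypothetical (turn, score) pair instead of per-layer key/value tensors, `topk_compress` is a hypothetical score-based stand-in for an arbitrary compression method, and scores are assumed to favor recent turns.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for one cached KV pair: (turn index, importance score).
# A real cache stores per-layer key/value tensors; only bookkeeping is modeled.
Entry = Tuple[int, float]
Compressor = Callable[[List[Entry], float], List[Entry]]


def topk_compress(entries: List[Entry], ratio: float) -> List[Entry]:
    """Hypothetical score-based compressor: keep the top `ratio` fraction
    of entries by score, preserving their original order."""
    keep = int(len(entries) * ratio)
    ranked = sorted(range(len(entries)), key=lambda i: entries[i][1], reverse=True)
    return [entries[i] for i in sorted(ranked[:keep])]


def recursive_update(cache: List[Entry], new_turn: List[Entry],
                     compress: Compressor, ratio: float) -> List[Entry]:
    # Baseline: the whole history is re-compressed after every turn,
    # so early entries pass through the compressor again and again.
    return compress(cache + new_turn, ratio)


def isolated_update(cache: List[Entry], new_turn: List[Entry],
                    compress: Compressor, ratio: float) -> List[Entry]:
    # Isolation: previously compressed entries are kept as-is;
    # only the latest completed turn is compressed.
    return cache + compress(new_turn, ratio)


def simulate(update, turns: int = 3, tokens_per_turn: int = 8,
             ratio: float = 0.5) -> Dict[int, int]:
    """Run several turns and count surviving entries per turn."""
    cache: List[Entry] = []
    for t in range(1, turns + 1):
        # Assumed recency bias: entries from later turns score higher.
        new_turn = [(t, t + 0.01 * p) for p in range(tokens_per_turn)]
        cache = update(cache, new_turn, topk_compress, ratio)
    return dict(Counter(turn for turn, _ in cache))


print(simulate(recursive_update))  # {3: 7}
print(simulate(isolated_update))   # {1: 4, 2: 4, 3: 4}
```

Under these assumptions the recursive scheme ends with only turn-3 entries, while the isolated scheme keeps four entries from every turn. The isolated cache is also larger (12 entries versus 7), which is the space trade-off two of the reviews raise.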

Peer Reviews

Decision: ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 4

Strengths

S1. This paper tackles the important problem of efficient multi-round conversation.
S2. The proposed approach is simple and easy to understand.
S3. The experiments cover a range of different baselines.

Weaknesses

W1. The judge LLM, GPT-4o, is a legacy model, older than the evaluated Llama3.1 and Qwen2.5 models. I would suggest using a newer SOTA model as the judge.
W2. The baseline LLMs are somewhat outdated. Llama3.1 is fine, but Qwen2.5 should be replaced by Qwen3.
W3. Compressing each round individually seems to yield lower compression ratios than re-compressing the entire history, which might be less effective for saving space and for memory over conversations with very many rounds.
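The compression-ratio concern in W3 can be made concrete with a back-of-the-envelope count. This sketch assumes turns of equal length, a fixed per-turn keep ratio for the isolated scheme, and, for comparison, a hypothetical baseline that re-compresses the whole history to a fixed token budget; the numbers are illustrative and not taken from the paper.

```python
from typing import List


def isolated_cache_size(turn_lengths: List[int], ratio: float) -> int:
    """Entries kept when each turn is compressed once to `ratio`
    and then frozen (the per-round scheme discussed in W3)."""
    return sum(int(n * ratio) for n in turn_lengths)


def budgeted_cache_size(turn_lengths: List[int], budget: int) -> int:
    """Entries kept when the whole history is re-compressed to a fixed
    budget every turn (hypothetical baseline configuration)."""
    return min(sum(turn_lengths), budget)


turns = [1000] * 10   # illustrative: 10 turns of 1,000 tokens each
total = sum(turns)    # 10,000 tokens of raw history
iso = isolated_cache_size(turns, ratio=0.25)
bud = budgeted_cache_size(turns, budget=1024)
print(iso, iso / total)  # 2500 0.25
print(bud, bud / total)  # 1024 0.1024
```

Under these assumptions the isolated cache grows linearly with the number of turns, while the budgeted cache stays flat once the history exceeds the budget, so the isolated scheme achieves a weaker overall compression ratio.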

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- Addresses a real and important problem in multi-turn efficiency for LLMs, namely the recursive compression and cumulative information loss across dialogue turns.
- The proposed approach is simple, general, and easy to integrate with existing KV compression methods (e.g., SnapKV, ChunkKV, Expected Attention).

Weaknesses

1. The claimed "multi-turn isolation mechanism" is essentially a straightforward and obvious engineering adaptation of existing frameworks such as `kvpress` to multi-turn settings. This is a natural and expected implementation choice when extending any prefilling compression method to multi-turn use. The paper does not introduce a new compression function or theoretical principle; instead, it modifies the scheduling of existing operations. Hence, the core novelty is minimal. A deeper explora…

Reviewer 03 · Rating 2 · Confidence 4

Strengths

The topic is timely due to the rise of agentic AI. The observation that SOTA approaches compress the earlier parts of the query-response history more often than the later parts is fairly obvious but may not yet have been exploited in the literature. The presentation is clear overall, except for some questions listed below.
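The reviewer's observation can be quantified under a deliberately simple model: assume each compression pass keeps a fixed fraction `ratio` of every entry it touches, regardless of which turn the entry came from. Under whole-history re-compression, turn i is compressed when it completes and again after each later turn, i.e. T - i + 1 times by turn T; under isolation it is compressed once. Real methods select entries by score, so this sketch only indicates the trend; the function names are hypothetical.

```python
def compression_passes(turn: int, total_turns: int, isolated: bool) -> int:
    """Number of times entries from `turn` (1-indexed) pass through the
    compressor by the end of `total_turns` (simplified model)."""
    return 1 if isolated else total_turns - turn + 1


def retained_fraction(turn: int, total_turns: int, ratio: float,
                      isolated: bool) -> float:
    # Simplified model: every pass keeps `ratio` of the entries it sees,
    # so the surviving fraction is ratio ** passes.
    return ratio ** compression_passes(turn, total_turns, isolated)


# Five turns, each pass keeping half of what it sees.
for turn in range(1, 6):
    print(turn,
          retained_fraction(turn, 5, 0.5, isolated=False),
          retained_fraction(turn, 5, 0.5, isolated=True))
```

With five turns and a ratio of 0.5, the first turn shrinks to 0.5^5 = 0.03125 (about 3.1%) of its original size under repeated compression, versus 50% under isolation; the most recent turn is treated identically by both schemes.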

Weaknesses

The proposed modification of the SOTA approach (in each step, compress only the parts that have not yet been compressed) is straightforward. Compressing each part only once seems to increase the required cache size, which needs to be discussed and evaluated experimentally. Experiments with other SOTA KV Cache methods such as TOVA and KeyDiff would strengthen the argument that FlowKV's approach generalizes well. The authors performed experiments for prompt length 8192 and output length 4096…

Reviewer 04 · Rating 2 · Confidence 5

Strengths

1. The proposed isolation mechanism is intuitive and directly addresses the issue of cumulative compression loss in multi-turn LLM interactions.
2. It requires no retraining and can be combined with any existing KV compression method.
3. The figures are informative and greatly aid in understanding the proposed method and experimental results.

Weaknesses

1. While the method is well-motivated, the theoretical section (Appendix D) remains descriptive rather than analytical. A more formal quantification of “information degradation under repeated compression” would strengthen the contribution.
2. The study primarily focuses on instruction-following and preference tasks. Additional experiments on open-domain dialogue or reasoning datasets (e.g., LongBench, SCBench full set) would improve generalization claims. In particular, the latency analysis is…

Taxonomy

Topics: Service-Oriented Architecture and Web Services · Speech and dialogue systems · Semantic Web and Ontologies