FlowMM: Cross-Modal Information Flow Guided KV Cache Merging for Efficient Multimodal Context Inference
Kunxi Li, Yufan Xiong, Zhonghua Jiang, Yiyun Zhou, Zhaode Wang, Chengfei Lv, Shengyu Zhang

TL;DR
FlowMM introduces an adaptive cross-modal information flow-guided KV cache merging framework that significantly reduces memory and latency in multimodal models while preserving performance.
Contribution
This work proposes FlowMM, a novel framework that dynamically guides KV cache merging using cross-modal information flow and sensitivity-aware token matching.
Findings
Reduces KV cache memory by 80-95%.
Speeds up decoding by 1.3-1.8×.
Maintains competitive task performance.
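To put the reported 80-95% memory reduction in concrete terms, the following is a rough back-of-the-envelope sketch. The model shape (32 layers, 32 KV heads, head dimension 128, fp16, 8192 cached tokens) is a hypothetical configuration chosen for illustration and is not taken from the paper; the formula is the standard size of a dense KV cache.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Size of a dense KV cache: one key and one value vector per token, head, and layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical model shape (an assumption, not from the paper).
full = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=8192)
print(f"full cache: {full / 1024**3:.2f} GiB")

# Apply the reduction range reported in the findings.
for reduction in (0.80, 0.95):
    remaining = full * (1 - reduction)
    print(f"{reduction:.0%} reduction -> {remaining / 1024**3:.2f} GiB remaining")
```

Under these assumptions the full cache is 4 GiB, so the reported range would leave roughly 0.2 to 0.8 GiB.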
Abstract
Traditional KV cache eviction strategies, which discard less critical KV-pairs based on attention scores, often degrade generation quality, causing context loss or hallucinations. Recent efforts shift toward KV merging, merging eviction tokens with retention tokens based on similarity. However, in multimodal scenarios, distributional biases across modality tokens and attentional biases in cross-modal interactions limit its effectiveness. This work introduces FlowMM, an adaptive framework for cross-modal information flow-guided multimodal KV cache merging. FlowMM leverages cross-modal information flow to dynamically apply layer-specific merging strategies, capturing modality-specific patterns while preserving contextual integrity. Furthermore, we introduce a sensitivity-adaptive token matching mechanism that jointly evaluates token similarity and task-critical sensitivity, merging…
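The difference between eviction and similarity-based merging described in the abstract can be illustrated with a minimal sketch. This is a generic baseline, not FlowMM's method: it keeps the highest-scoring tokens and averages each evicted token into its most similar kept token instead of discarding it. The function name `merge_kv`, the use of cosine similarity on keys, and plain averaging are all assumptions made for illustration; the layer-specific policies and sensitivity-adaptive matching the paper proposes are not modelled here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def merge_kv(keys, values, scores, budget):
    """Keep the `budget` highest-scoring tokens; fold every other token
    into the kept token whose key is most similar (hypothetical baseline)."""
    order = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    kept = sorted(order[:budget])
    evicted = sorted(order[budget:])

    # Each kept token starts as a group containing only itself.
    groups = {i: [i] for i in kept}
    for j in evicted:
        target = max(kept, key=lambda i: cosine(keys[i], keys[j]))
        groups[target].append(j)

    merged_keys = [mean([keys[t] for t in groups[i]]) for i in kept]
    merged_values = [mean([values[t] for t in groups[i]]) for i in kept]
    return kept, merged_keys, merged_values


# Toy cache: tokens 0 and 1 are similar, as are tokens 2 and 3.
keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
values = [[1.0], [3.0], [10.0], [20.0]]
scores = [0.9, 0.1, 0.8, 0.2]  # stand-in for accumulated attention scores

kept, merged_keys, merged_values = merge_kv(keys, values, scores, budget=2)
print(kept, merged_values)
```

Pure eviction would return only the kept tokens' original values; here the information in tokens 1 and 3 survives in averaged form, which is the motivation for merging over eviction.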
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The exposition is clear and the problem is well motivated, with the method inspired by known insights from the multimodal interpretability literature.
- Evaluation on diverse long-context tasks shows strong improvements over baselines in the reported experimental setting.
- Solid ablation studies validate that both major components contribute to performance.
- Critical concern: LOOK-M baseline performance appears severely degraded compared to the original paper. The original LOOK-M paper reports near-lossless performance at similar compression ratios, often matching or exceeding the full cache, whereas FlowMM reports massive degradations for LOOK-M at the same settings. This raises serious questions about whether the gap reflects baseline implementation quality or a genuine failure to generalize to newer architectures. The authors must provide comparison on the original LOOK-M models (LLaVA-v1
1. Motivation and Problem Identification: The paper effectively highlights the limitations of existing KV cache eviction and merging strategies in multimodal contexts, pinpointing "distributional biases" and "attentional biases" as key challenges. This sets a clear and compelling stage for the proposed work.
2. Core Idea: The concept of using cross-modal information flow to guide layer-specific merging is the paper's most significant contribution. The observation that shallow layers are intra-mo
1. Clarity of the Merging Operation: The paper clearly defines how to decide when and what to merge (using ρ^l and sensitivity). However, the exact mechanism of how the merging is performed (the functions f_merge and g_merge in Eq. 5) is somewhat glossed over. A more detailed explanation or a reference to the specific merging function (e.g., weighted averaging based on attention) would be helpful.
2. Definition and Calculation of Sensitivity: The proposal to use attention scores as a proxy for
1. The idea of steering KV merging using estimated cross-modal information flow and layer-specific policies is a clear conceptual step beyond attention-score-based eviction or naive similarity-based merging.
2. The framing of sensitivity-adaptive matching acknowledges that some tokens are risky to merge even if similar, which is a practical insight for long-context multimodal prompts where a few critical tokens can dominate correctness.
3. The abstract clearly separates the components of the p
1. The abstract does not explain how cross-modal information flow is computed. Without a well-justified estimator and ablations that vary it, reviewers cannot assess whether gains come from the estimator or from generic similarity-based merging.
2. The claim of maintaining competitive task performance is broad. It is important to see results across many task types such as OCR-heavy tasks, visual reasoning that requires spatial grounding, audio-text alignment if applicable, and multi turn dialogu
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Caching and Content Delivery
