LongFlow: Efficient KV Cache Compression for Reasoning Models
Yi Su, Zhenxu Tian, Dan Qiao, Yuechi Zhou, Juntao Li, Min Zhang

TL;DR
LongFlow is a KV cache compression technique for long-output reasoning models. It reduces memory and bandwidth costs, raising throughput by up to 11.8x and compressing the KV cache by 80% with minimal accuracy loss.
Contribution
LongFlow introduces an efficient importance estimation metric and a fused kernel, both designed for KV cache compression in long-output reasoning models.
Findings
Achieves up to 11.8x throughput improvement.
Provides 80% KV cache compression with minimal accuracy impact.
Outperforms existing methods in long-output scenarios.
Abstract
Recent reasoning models such as OpenAI-o1 and DeepSeek-R1 have shown strong performance on complex tasks including mathematical reasoning and code generation. However, this performance gain comes with substantially longer output sequences, leading to significantly increased deployment costs. In particular, long outputs require large KV caches, resulting in high memory consumption and severe bandwidth pressure during attention computation. Most existing KV cache optimization methods are designed for long-input, short-output scenarios and are ineffective for the long-output setting of reasoning models. Moreover, importance estimation in prior work is computationally expensive and becomes prohibitive when continuous re-evaluation is required during long generation. To address these challenges, we propose LongFlow, a KV cache compression method with an efficient importance estimation metric…
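To make the memory pressure described in the abstract concrete, here is a rough back-of-the-envelope sketch of KV cache size as a function of output length. The model shape (32 layers, 8 KV heads, head dimension 128, fp16) is a hypothetical configuration chosen for illustration, not one taken from the paper. Reading "80% compression" as "keep about 20% of cached tokens" is also an assumption. The sketch does not implement LongFlow's importance metric or fused kernel.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Approximate KV cache size in bytes for a decoder-only transformer.

    The leading factor of 2 counts one key and one value vector per token,
    per layer, per KV head. bytes_per_elem=2 corresponds to fp16/bf16.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


def kept_tokens(seq_len, compression_pct):
    """Tokens retained if compression_pct percent of cache entries are evicted.

    Assumption: compression is measured as the fraction of tokens removed.
    """
    return seq_len * (100 - compression_pct) // 100


# Hypothetical model shape, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
SEQ_LEN = 32768  # a long reasoning trace

full = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN)
compressed = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM,
                            kept_tokens(SEQ_LEN, 80))

print(f"full cache:       {full / 2**30:.2f} GiB per sequence")
print(f"after 80% pruning: {compressed / 2**30:.2f} GiB per sequence")
```

Under these assumptions, each generated token adds 128 KiB to the cache, so a single 32K-token sequence holds 4 GiB of keys and values. Every decoding step must read this data during attention, so both memory footprint and bandwidth grow with output length.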
