Hold Onto That Thought: Assessing KV Cache Compression On Reasoning
Minghui Liu, Aadi Palnitkar, Tahseen Rabbani, Hyunwoo Jae, Kyle Rui Sang, Dixi Yao, Shayan Shabihi, Fuheng Zhao, Tian Li, Ce Zhang, Furong Huang, Kunpeng Zhang

TL;DR
This paper benchmarks KV cache compression strategies on long-reasoning tasks for large language models. It finds that heavy-hitter methods such as H2O and a decoding-enabled SnapKV variant (SnapKV-D) perform best in reasoning settings, and it highlights a tradeoff between cache size and inference cost.
Contribution
It provides a comprehensive benchmark of cache compression strategies on reasoning tasks, identifying effective methods and tradeoffs for long-context inference in LLMs.
Findings
H2O and SnapKV outperform other strategies on reasoning tasks
Eviction strategies at low budgets enable longer reasoning traces
A tradeoff exists between cache size and inference cost
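The heavy-hitter idea behind the findings above can be sketched roughly as follows. This is a simplified, H2O-style toy and not the authors' implementation: the function name `evict_heavy_hitters`, the `acc_scores` input (attention mass each cached token has accumulated so far), and the `budget`/`recent` parameters are all hypothetical names chosen for illustration.

```python
def evict_heavy_hitters(acc_scores, budget, recent):
    """Return the sorted indices of cached tokens to keep.

    acc_scores: accumulated attention mass per cached token (assumed input).
    budget:     maximum number of tokens the cache may hold.
    recent:     number of most recent tokens that are always kept.
    """
    if recent > budget:
        raise ValueError("recent window cannot exceed the cache budget")
    n = len(acc_scores)
    if n <= budget:
        # Cache still fits in the budget: nothing to evict.
        return list(range(n))
    # Always keep the most recent tokens.
    recent_idx = list(range(n - recent, n))
    # Fill the remaining slots with the highest-scoring older tokens.
    older = range(n - recent)
    n_heavy = budget - recent
    heavy = sorted(older, key=lambda i: acc_scores[i], reverse=True)[:n_heavy]
    return sorted(heavy) + recent_idx


# Six cached tokens, room for four: keep the last two plus the two heaviest.
kept = evict_heavy_hitters([0.9, 0.1, 0.5, 0.05, 0.2, 0.3], budget=4, recent=2)
print(kept)
```

In a real system the scores would be updated at every decoding step and eviction would run per layer and per head; the sketch only shows the selection rule for a single step.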
Abstract
Large language models (LLMs) have demonstrated remarkable performance on long-context tasks, but are often bottlenecked by memory constraints. Namely, the KV cache, which is used to significantly speed up attention computations, grows linearly with context length. A suite of compression algorithms has been introduced to alleviate cache growth by evicting unimportant tokens. However, several popular strategies are targeted towards the prefill phase, i.e., processing long prompt context, and their performance is rarely assessed on reasoning tasks requiring long decoding. In particular, short but complex prompts, such as those in benchmarks like GSM8K and MATH500, often benefit from multi-step reasoning and self-reflection, resulting in thinking sequences thousands of tokens long. In this work, we benchmark the performance of several popular compression strategies on long-reasoning tasks.…
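To make the linear growth of the KV cache concrete, here is a back-of-the-envelope sketch. The default values are assumptions meant to resemble an 8B-parameter model with grouped-query attention (32 layers, 8 KV heads, head dimension 128, 16-bit storage); they are not taken from the paper, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2):
    """Approximate KV cache size in bytes for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    All defaults are illustrative assumptions, not values from the paper.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len


# Memory scales linearly with the number of cached tokens.
for tokens in (1, 1024, 8192):
    print(tokens, kv_cache_bytes(tokens) / 2**20, "MiB")
```

Under these assumptions each token costs 128 KiB, so a reasoning trace of 8,192 tokens occupies about 1 GiB per sequence, which is why a fixed cache budget matters for long decoding.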
Peer Reviews
Decision: Submitted to ICLR 2026
1. Clear empirical gap: Most cache compression work targets long-prompt tasks (e.g., LongBench, RULER), but this paper focuses on long-reasoning scenarios where the generation dominates memory use. This is a genuinely underexplored regime. 2. Comprehensive evaluation: The benchmark suite spans eight reasoning datasets (DROP, ReClor, FOLIO, StrategyQA, CommonSenseQA, OpenBookQA, GSM8K, MATH-500) and multiple models (Llama-3.1-8B-Instruct, DeepSeek-R1-Distill-Llama/Qwen, Nemotron-Nano-8B). This b
1. Limited novelty: The work is primarily empirical. Extending SnapKV for decoding and modifying kvpress are incremental, though valuable, contributions. 2. Implementation details under-specified: The paper lacks detail on how token importance is updated during decoding for SnapKV-D—e.g., whether attention aggregation happens online or at fixed intervals. 3. Normalization and fairness: Because compression methods produce different output lengths, accuracy comparisons may not be normalized by g
1. This paper focuses on an increasingly important setting: long reasoning LLMs, where decode-phase KV dominates. 2. Systematic comparison across multiple KV compression methods and budgets. 3. Empirical evidence that heavy-hitter approaches (SnapKV-D, H2O) outperform naive recency or norm methods for reasoning.
1. The paper evaluates eviction-based strategies but does not compare against modern sparse attention approaches designed for long reasoning (e.g., DeepSeek sparse attention, SeerAttention). These are important reference points for understanding where eviction sits in the design space. 2. No evaluation on multi-turn or interactive reasoning. Long decoding commonly occurs in iterative workflows (MathChat-style step reasoning, planning with feedback). Single-turn benchmarks may not fully reflect p
1. Addresses a critical issue in LLMs related to memory constraints during reasoning tasks. 2. Provides a comprehensive evaluation of multiple KV cache compression strategies across various reasoning benchmarks. 3. Identifies specific strategies (H2O and SnapKV variant) that enhance performance in reasoning tasks.
1. The study focuses on a single LLM (Llama-3.1-8B-Instruct), which may limit the generalizability of the findings to other models. 2. Lacks a discussion on the computational overhead introduced by implementing the recommended compression strategies. 3. Does not provide detailed explanations of the underlying mechanisms of the identified effective strategies.
Taxonomy
Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Topic Modeling
