ForesightKV: Optimizing KV Cache Eviction for Reasoning Models by Learning Long-Term Contribution
Zican Dong, Peiyu Liu, Junyi Li, Zhipeng Chen, Han Peng, Shuo Wang, Wayne Xin Zhao

TL;DR
ForesightKV is a training-based framework that learns which key-value pairs to evict from a large language model's KV cache, balancing memory efficiency against reasoning performance during long-text generation.
Contribution
It introduces a cache eviction approach that combines supervised and reinforcement learning, built on a Golden Eviction algorithm and a Markov Decision Process formulation.
Findings
Outperforms prior methods with half the cache budget
Improves reasoning performance on AIME benchmarks
Combines supervised and reinforcement learning for eviction
Abstract
Recently, large language models (LLMs) have shown remarkable reasoning abilities by producing long reasoning traces. However, as the sequence length grows, the key-value (KV) cache expands linearly, incurring significant memory and computation costs. Existing KV cache eviction methods mitigate this issue by discarding less important KV pairs, but often fail to capture complex KV dependencies, resulting in performance degradation. To better balance efficiency and performance, we introduce ForesightKV, a training-based KV cache eviction framework that learns to predict which KV pairs to evict during long-text generation. We first design the Golden Eviction algorithm, which identifies the optimal KV pairs to evict at each step using future attention scores. These traces and the scores at each step are then distilled via supervised training with a Pairwise Ranking Loss. Furthermore, we…
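The two ideas named in the abstract, picking eviction targets from future attention and distilling that choice with a pairwise ranking loss, can be illustrated with a small sketch. This is not the paper's implementation: the function names are hypothetical, summing attention from later queries is an assumed stand-in for the paper's importance criterion, and the logistic form of the loss is likewise an assumption.

```python
import math


def future_attention_importance(attn, step):
    """Sum the attention each cached key receives from queries after `step`.

    attn[q][k] is the attention weight query q places on key k
    (a causal matrix, so attn[q][k] == 0 whenever k > q).
    ASSUMPTION: summed future attention is used as the importance score.
    """
    n_keys = step + 1
    return [
        sum(attn[q][k] for q in range(step + 1, len(attn)))
        for k in range(n_keys)
    ]


def oracle_evict(attn, step, budget):
    """Return sorted indices of keys to evict so at most `budget` keys remain.

    Keys with the lowest future importance are evicted first.
    """
    importance = future_attention_importance(attn, step)
    n_evict = max(0, len(importance) - budget)
    order = sorted(range(len(importance)), key=lambda k: importance[k])
    return sorted(order[:n_evict])


def _softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pairwise_ranking_loss(scores, keep):
    """Logistic pairwise loss: every kept key should outscore every evicted key.

    scores: predicted importance per key (from a hypothetical learned scorer).
    keep:   booleans from the oracle, True if the key is retained.
    """
    kept = [s for s, k in zip(scores, keep) if k]
    evicted = [s for s, k in zip(scores, keep) if not k]
    if not kept or not evicted:
        return 0.0
    total = sum(_softplus(-(a - b)) for a in kept for b in evicted)
    return total / (len(kept) * len(evicted))


# Toy example: 4 tokens, decide what to evict after step 1 with a budget of 1.
attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.1, 0.2, 0.7, 0.0],
    [0.6, 0.1, 0.1, 0.2],
]
evict = oracle_evict(attn, step=1, budget=1)
keep = [k not in evict for k in range(2)]
loss_good = pairwise_ranking_loss([2.0, -2.0], keep)  # scorer agrees with oracle
loss_bad = pairwise_ranking_loss([-2.0, 2.0], keep)   # scorer disagrees
```

In the toy example, key 0 receives more attention from later queries than key 1, so the oracle evicts key 1, and a scorer that ranks key 0 above key 1 incurs the lower loss. Future attention is only available offline on complete traces, which is why the abstract describes distilling it into a trained predictor.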
Peer Reviews
Decision·Submitted to ICLR 2026
Clear motivation, meaningful gains on math reasoning, smart use of future attention and RL, and promising results for learned memory over heuristic KV strategies.
Heavy reliance on math-specific reward signals, unclear generalization to other tasks, and missing training details that limit reproducibility and practical adoption.
1. Prior eviction work focuses on heuristics; this paper introduces a learned scoring policy. 2. Empirical attention visualizations provide qualitative motivation. 3. Training is parameter-efficient: only a smaller scorer is trained, while the backbone LLM remains frozen. 4. Throughput and batch-size improvements are convincing.
1. All experiments focus on math reasoning (AIME), and the scorer is trained on reasoning traces from STILL-like data. This raises serious concerns about domain overfitting and limits claims of generality. 2. KV eviction is most relevant when serving models with more than 7B parameters. It is unclear whether the attention patterns and eviction policies scale. 3. The RL reward specifically penalizes spikes on low-entropy symbolic tokens. This may not generalize to summarization or code. 4. Many practical workloads require
- The experiments are well-designed, and the ablation studies are well-considered. - The paper is easy to follow.
- In the efficiency evaluation, the authors compress sequence lengths from 16K and 32K to 1K and 2K, achieving substantial efficiency gains through extremely high compression ratios. Notably, even on AIME 24 and AIME 25, where the average length is under 16K, the method incurs significant accuracy drops when the KV budget is limited to 1K, with accuracy losses exceeding 40% in some cases. Thus, such aggressive compression may be impractical in real-world scenarios, limiting the practical applica
Taxonomy
Topics: Big Data and Digital Economy · Topic Modeling · Multimodal Machine Learning Applications
