Lethe: Layer- and Time-Adaptive KV Cache Pruning for Reasoning-Intensive LLM Serving

Hui Zeng; Daming Zhao; Pengfei Yang; WenXuan Hou; Tianyang Zheng; Hui Li; Weiye Ji; Jidong Zhai

arXiv:2511.06029 · cs.LG · December 16, 2025

TL;DR

Lethe is a dynamic KV cache pruning framework for large language models. It adaptively manages key-value caches during long reasoning generations, raising throughput by up to 2.56x while maintaining generation quality.

Contribution

It introduces layerwise sparsity-aware allocation and recency-aware token pruning, addressing the dynamic and layer-sensitive nature of long-form generation in LLMs.
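To make the layerwise allocation idea concrete, here is a minimal sketch of splitting a fixed KV-token budget across layers. It is an illustration under stated assumptions, not the paper's method: this page does not give Lethe's allocation rule, so the per-layer `sparsity` estimate, the proportional split, the per-layer floor, and the function name `allocate_layer_budgets` are all hypothetical.

```python
def allocate_layer_budgets(sparsity, total_budget, min_per_layer=1):
    """Split a total KV-token budget across transformer layers.

    ASSUMPTION: sparsity[i] is a hypothetical redundancy estimate in [0, 1]
    for layer i. Layers judged more redundant get a smaller share.
    """
    num_layers = len(sparsity)
    if total_budget < min_per_layer * num_layers:
        raise ValueError("total_budget is too small for the per-layer floor")

    # Density is the complement of redundancy: how much a layer "needs".
    density = [1.0 - s for s in sparsity]
    total_density = sum(density)

    # Tokens left to distribute once every layer has its guaranteed floor.
    spare = total_budget - min_per_layer * num_layers
    if total_density == 0:
        shares = [spare / num_layers] * num_layers
    else:
        shares = [spare * d / total_density for d in density]

    budgets = [min_per_layer + int(share) for share in shares]

    # Truncation can leave a few tokens unassigned; give them to the
    # densest layers first so the budgets sum to total_budget.
    leftover = total_budget - sum(budgets)
    order = sorted(range(num_layers), key=lambda i: density[i], reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets


# Example: three layers, the first judged highly redundant.
budgets = allocate_layer_budgets([0.9, 0.5, 0.1], total_budget=100, min_per_layer=4)
```

The floor keeps any single layer from being pruned to nothing, a conservative choice made for this sketch rather than something stated in the source.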

Findings

01. Increases throughput by up to 2.56x.

02. Balances efficiency and quality across diverse models.

03. Addresses layer and temporal adaptivity in KV cache management.

Abstract

Generative reasoning with large language models (LLMs) often involves long decoding sequences, leading to substantial memory and latency overheads from accumulating key-value (KV) caches. While existing KV compression methods primarily focus on reducing prefill memory from long input sequences, they fall short in addressing the dynamic and layer-sensitive nature of long-form generation, which is central to reasoning tasks. We propose Lethe, a dynamic KV cache management framework that introduces adaptivity along both the spatial and temporal dimensions of decoding. Along the spatial dimension, Lethe performs layerwise sparsity-aware allocation, assigning token pruning budgets to each transformer layer based on estimated attention redundancy. Along the temporal dimension, Lethe conducts multi-round token pruning during generation, driven by a Recency-Aware Selective Retention (RASR)…
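The abstract is cut off before it describes RASR, so the following is only a rough sketch of what one recency-aware pruning round could look like, not the paper's mechanism. The scoring rule (attention mass discounted exponentially by token age), the always-kept recent window, and the names `select_kept_positions`, `recent_window`, and `decay` are assumptions introduced for illustration.

```python
def select_kept_positions(attn_scores, budget, recent_window=4, decay=0.9):
    """Choose which cached token positions survive one pruning round.

    ASSUMPTION: attn_scores[pos] is some accumulated attention mass for the
    token at `pos`. The decay-by-age scoring is a stand-in, not the RASR rule.
    """
    n = len(attn_scores)
    if budget >= n:
        return list(range(n))  # Cache already fits; nothing to prune.

    # The most recent tokens are always retained.
    recent_window = min(recent_window, budget)
    recent = list(range(n - recent_window, n))
    older = range(n - recent_window)

    def score(pos):
        # Older tokens need more attention mass to stay in the cache.
        age = (n - 1) - pos
        return attn_scores[pos] * (decay ** age)

    ranked = sorted(older, key=score, reverse=True)
    kept_older = ranked[: budget - recent_window]
    return sorted(kept_older) + recent


# Example: 8 cached tokens, keep 5, of which the last 3 are protected.
scores = [0.9, 0.01, 0.02, 0.5, 0.03, 0.04, 0.05, 0.06]
kept = select_kept_positions(scores, budget=5, recent_window=3, decay=0.9)
```

In a multi-round setting, a function like this would be called repeatedly as decoding proceeds, each time with the layer's current budget. How often Lethe triggers a round is not stated on this page.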

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications