DepthKV: Layer-Dependent KV Cache Pruning for Long-Context LLM Inference

Zahra Dehghanighobadi; Asja Fischer

arXiv:2604.24647·cs.CL·April 28, 2026

TL;DR

DepthKV is a layer-dependent KV cache pruning method for long-context LLM inference. It splits a fixed global KV budget across layers according to each layer's sensitivity to pruning, and it outperforms uniform pruning at the same global pruning ratio.

Contribution

The paper proposes a novel layer-dependent pruning framework that allocates KV cache pruning budgets based on layer sensitivity, outperforming uniform pruning strategies.

Findings

01

DepthKV outperforms uniform pruning at the same global ratio.

02

Layer sensitivity varies significantly across model layers.

03

Allocating the pruning budget per layer yields better model performance than a uniform split of the same KV memory budget.
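
To make the idea of a layer-dependent budget concrete, here is a minimal sketch of one way to split a fixed global KV budget across layers. It is not the paper's allocation rule, which this summary does not spell out. The assumptions are: each layer already has a non-negative sensitivity score (how that score is measured is left open), budgets are proportional to those scores, and no layer can keep more tokens than it has cached. The function name `allocate_layer_budgets` and its parameters are hypothetical.

```python
def allocate_layer_budgets(sensitivity, seq_len, keep_ratio):
    """Split a fixed global KV budget across layers in proportion to sensitivity.

    sensitivity: per-layer non-negative scores (assumed given; higher means
                 the layer is hurt more by pruning)
    seq_len:     cached tokens per layer before pruning
    keep_ratio:  global fraction of KV entries to keep, in [0, 1]
    Returns per-layer token counts that sum to the global budget.
    """
    n_layers = len(sensitivity)
    total = round(keep_ratio * seq_len * n_layers)
    total = max(0, min(total, seq_len * n_layers))

    # Proportional shares, capping any layer at seq_len and redistributing
    # the excess among the layers that are not yet capped.
    shares = [0.0] * n_layers
    free = list(range(n_layers))
    remaining = float(total)
    while free:
        weights = [sensitivity[i] for i in free]
        if sum(weights) <= 0:
            weights = [1.0] * len(free)  # fall back to an even split
        wsum = sum(weights)
        alloc = {i: remaining * w / wsum for i, w in zip(free, weights)}
        capped = [i for i in free if alloc[i] >= seq_len]
        if not capped:
            for i in free:
                shares[i] = alloc[i]
            break
        for i in capped:
            shares[i] = float(seq_len)
            remaining -= seq_len
            free.remove(i)

    # Round down, then hand leftover tokens to the largest fractional parts
    # so the integer budgets still sum to the global total.
    budgets = [int(s) for s in shares]
    leftover = total - sum(budgets)
    by_fraction = sorted(range(n_layers),
                         key=lambda i: shares[i] - budgets[i], reverse=True)
    for i in by_fraction:
        if leftover <= 0:
            break
        if budgets[i] < seq_len:
            budgets[i] += 1
            leftover -= 1
    return budgets


# Two layers, 100 cached tokens each, keep half of all entries overall.
print(allocate_layer_budgets([1.0, 3.0], seq_len=100, keep_ratio=0.5))  # [25, 75]
```

A uniform strategy would keep 50 tokens in each layer here. The sketch keeps the same 100 tokens in total but shifts them toward the more sensitive layer. A layer with zero sensitivity receives no budget under this rule, so a practical variant would probably reserve a minimum per layer.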

Abstract

Long-context reasoning is a critical capability of large language models (LLMs), enabling applications such as long-document understanding, summarization, and code generation. However, efficient autoregressive inference relies on the key-value (KV) cache, whose memory footprint grows linearly with sequence length, leading to a major memory bottleneck. To mitigate this overhead, KV cache pruning methods discard cached tokens with low attention scores during inference. Most existing methods apply a uniform pruning ratio across layers, implicitly assuming that all layers contribute equally to overall model performance. We show that this assumption is suboptimal, as layers differ significantly in their sensitivity to pruning. We propose DepthKV, a layer-dependent pruning framework that allocates a fixed global KV budget across layers based on their sensitivity, rather than using a uniform…
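
Two points in the abstract can be illustrated with a short sketch: the linear growth of the KV cache with sequence length, and pruning a single layer's cache by attention score. The memory formula is the standard count of key and value entries. The pruning function is an assumption-laden illustration rather than the authors' procedure: it takes one precomputed score per cached token (how scores are aggregated over heads and queries is not specified here), and the names `kv_cache_bytes` and `prune_layer_cache` are hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch=1):
    """Size of the KV cache in bytes; grows linearly with seq_len.

    The leading factor 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes 16-bit storage (e.g. fp16 or bf16).
    """
    return 2 * batch * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def prune_layer_cache(keys, values, scores, keep):
    """Keep the `keep` cached tokens with the highest attention scores.

    keys, values: per-token cache entries for one layer (any objects)
    scores:       one attention score per cached token (assumed precomputed)
    Returns the pruned keys, pruned values, and the kept token positions.
    """
    keep = max(0, min(keep, len(scores)))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:keep]
    kept = sorted(top)  # restore the original token order
    return [keys[i] for i in kept], [values[i] for i in kept], kept


# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128.
print(kv_cache_bytes(32, 8, 128, seq_len=4096))  # 536870912 bytes (512 MiB)

keys = ["k0", "k1", "k2", "k3"]
values = ["v0", "v1", "v2", "v3"]
scores = [0.10, 0.50, 0.05, 0.35]
print(prune_layer_cache(keys, values, scores, keep=2))
# (['k1', 'k3'], ['v1', 'v3'], [1, 3])
```

Under a uniform strategy, `keep` is the same for every layer. A layer-dependent scheme would call `prune_layer_cache` with a different `keep` per layer while holding the sum of all `keep` values fixed.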
