ARKV: Adaptive and Resource-Efficient KV Cache Management under Limited Memory Budget for Long-Context Inference in LLMs

Jianlong Lei; Shashikant Ilager

arXiv:2603.08727 · cs.AR · March 11, 2026

TL;DR

ARKV introduces an adaptive, data-driven KV cache management framework that dynamically allocates precision levels based on attention importance, significantly reducing memory usage while maintaining high accuracy in long-context LLM inference.

Contribution

The paper presents ARKV, a novel adaptive framework that dynamically manages KV cache precision levels based on attention dynamics, improving memory efficiency without retraining.

Findings

01. Reduces KV memory usage by 4x while preserving ~97% accuracy.

02. Outperforms uniform quantization on GSM8K math reasoning tasks.

03. Maintains full-precision performance on short-context tasks.
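
To put the 4x memory figure in context, here is a minimal sketch of the standard KV cache size calculation. The model shape (32 layers, 8 KV heads, head dimension 128, fp16 storage) and the 128K-token context are hypothetical values chosen for illustration, not figures from the paper, and the function name is likewise an assumption.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Size of a dense KV cache in bytes.

    The leading factor 2 accounts for storing one key tensor and one
    value tensor per layer. bytes_per_elem=2 corresponds to fp16/bf16.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model shape and context length (illustration only).
full = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                      seq_len=131072, batch_size=1)
reduced = full / 4  # what a 4x reduction would leave

GIB = 2 ** 30
print(f"full precision: {full / GIB:.1f} GiB")
print(f"after 4x cut:   {reduced / GIB:.1f} GiB")
```

Under these assumptions the cache costs 128 KiB per token, so it grows linearly with both sequence length and batch size; doubling either doubles the total.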

Abstract

Large Language Models (LLMs) are increasingly deployed in scenarios demanding ultra-long context reasoning, such as agentic workflows and deep research understanding. However, long-context inference is constrained by the KV cache, a transient memory structure that grows linearly with sequence length and batch size, quickly dominating GPU memory usage. Existing memory reduction techniques, including eviction and quantization, often rely on static heuristics and suffer from degraded quality under tight budgets. In this paper, we propose ARKV, a lightweight and adaptive framework that dynamically allocates precision levels to cached tokens based on per-layer attention dynamics and token-level importance. During a short prefill phase, ARKV estimates the original quantization (OQ) ratio of each layer by computing statistical scores such as attention entropy, variance and kurtosis. During…
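
The three statistics named in the abstract (attention entropy, variance, and kurtosis) can be sketched for a single attention distribution as follows. This is an illustrative sketch, not the authors' implementation: the function name, the use of excess kurtosis, and the zero-variance convention are assumptions, and the excerpt does not say how ARKV combines these scores into a per-layer OQ ratio, so that step is left out.

```python
import math


def attention_stats(weights):
    """Return (entropy, variance, excess kurtosis) of one attention row.

    `weights` is a list of non-negative attention probabilities that
    sum to 1, e.g. one softmax row over the cached tokens.
    """
    n = len(weights)
    # Shannon entropy in nats; zero-probability entries contribute nothing.
    entropy = -sum(p * math.log(p) for p in weights if p > 0)
    mean = sum(weights) / n
    var = sum((p - mean) ** 2 for p in weights) / n
    if var == 0:
        # Assumed convention: a perfectly flat row has no tail to measure.
        kurt = 0.0
    else:
        m4 = sum((p - mean) ** 4 for p in weights) / n
        kurt = m4 / var ** 2 - 3.0  # excess kurtosis (assumed variant)
    return entropy, var, kurt


# A flat row spreads attention evenly; a peaked row concentrates it.
flat = attention_stats([0.25, 0.25, 0.25, 0.25])
peaked = attention_stats([0.97, 0.01, 0.01, 0.01])
print("flat:  ", flat)
print("peaked:", peaked)
```

A flat row has maximal entropy and zero variance, while a peaked row has lower entropy and higher variance, which is the kind of signal that could distinguish layers whose attention concentrates on a few tokens.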

Taxonomy

Topics: Big Data and Digital Economy · Parallel Computing and Optimization Techniques · Advanced Neural Network Applications