Adaptive Soft Rolling KV Freeze with Entropy-Guided Recovery: Sublinear Memory Growth for Efficient LLM Inference

Adilet Metinov; Gulida M. Kudakeeva; Bolotbek uulu Nursultan; Gulnara D. Kabaeva

arXiv:2512.11221 · cs.LG · December 15, 2025

TL;DR

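This paper introduces ASR-KF-EGR, a training-free inference-time framework that shrinks the active KV cache of large language models by reversibly freezing the key-value pairs of low-importance tokens and restoring them on demand through entropy-guided recovery. Preliminary experiments report that generation quality is maintained.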

Contribution

The paper proposes a reversible soft-freeze mechanism with entropy-guided recovery and sublinear freeze scheduling, enabling efficient long-context LLM inference without fine-tuning.

Findings

01 Achieves a 55-67% reduction in active KV cache size on LLaMA-3 8B in preliminary experiments

02 Maintains generation quality and passes needle-in-haystack retrieval tests

03 Architecture-agnostic and requires no fine-tuning

Abstract

We present Adaptive Soft Rolling KV Freeze with Entropy-Guided Recovery (ASR-KF-EGR), a training-free inference-time framework for efficient large language model generation. Our method introduces a reversible soft-freeze mechanism that temporarily suspends key-value (KV) updates for low-importance tokens identified within a sliding attention window. Unlike eviction-based approaches that permanently discard context, ASR-KF-EGR preserves all tokens in off-GPU storage and restores them on demand. We extend the framework with sublinear freeze scheduling, where freeze duration grows sublinearly with repeated low-importance detections, preventing over-aggressive compression. Preliminary experiments on LLaMA-3 8B demonstrate 55-67% reduction in active KV cache size while maintaining generation quality and passing needle-in-haystack retrieval tests. The method is architecture-agnostic, requires…
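The soft-freeze and sublinear scheduling ideas described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration, not the authors' implementation: the square-root schedule in `freeze_duration`, the `SoftFreezeCache` class, and the use of plain dictionaries to stand in for GPU memory (`active`) and off-GPU storage (`frozen`) are all hypothetical. The abstract states only that freeze duration grows sublinearly with repeated low-importance detections and that frozen tokens are preserved rather than discarded.

```python
import math
from dataclasses import dataclass, field


def freeze_duration(detections: int, base: float = 1.0) -> int:
    """Hypothetical sublinear schedule: duration grows like sqrt(detections).

    The square root is an assumed choice; the source only says 'sublinear'.
    """
    if detections <= 0:
        return 0
    return max(1, math.floor(base * math.sqrt(detections)))


@dataclass
class SoftFreezeCache:
    """Toy KV cache: 'active' stands in for GPU memory, 'frozen' for off-GPU storage."""
    active: dict = field(default_factory=dict)      # position -> (key, value)
    frozen: dict = field(default_factory=dict)      # position -> (key, value)
    detections: dict = field(default_factory=dict)  # position -> low-importance count
    thaw_at: dict = field(default_factory=dict)     # position -> step when freeze expires

    def add(self, pos, kv):
        self.active[pos] = kv

    def freeze(self, pos, step):
        """Move a low-importance token out of the active set; nothing is discarded."""
        if pos not in self.active:
            return
        self.detections[pos] = self.detections.get(pos, 0) + 1
        self.frozen[pos] = self.active.pop(pos)
        self.thaw_at[pos] = step + freeze_duration(self.detections[pos])

    def restore(self, pos):
        """Bring a frozen token back into the active set (reversibility)."""
        if pos in self.frozen:
            self.active[pos] = self.frozen.pop(pos)
            self.thaw_at.pop(pos, None)

    def tick(self, step):
        """Restore every token whose freeze has expired at this step."""
        for pos in [p for p, t in self.thaw_at.items() if t <= step]:
            self.restore(pos)
```

Under this assumed square-root rule, a token flagged 1, 4, and 9 times is frozen for 1, 2, and 3 steps respectively, so repeated detections lengthen the freeze without letting it grow linearly. An entropy-based trigger could call `restore` early, but the abstract does not specify how that signal is computed, so it is left out of the sketch.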

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Natural Language Processing Techniques