Inference-Time Hyper-Scaling with KV Cache Compression

Adrian Łańcucki; Konrad Staniszewski; Piotr Nawrot; Edoardo M. Ponti

arXiv:2506.05345·cs.LG·November 10, 2025

TL;DR

This paper introduces a novel method called Dynamic Memory Sparsification (DMS) for compressing KV caches in Transformer LLMs, enabling inference-time hyper-scaling that improves accuracy without increasing compute or memory load.

Contribution

The paper proposes DMS, a new KV cache compression technique that maintains accuracy at high compression ratios and demonstrates its effectiveness across multiple LLMs and tasks.

Findings

1. DMS achieves 8× compression with only 1K training steps.

2. DMS improves accuracy on multiple benchmarks for scaled inference.

3. Inference-time hyper-scaling boosts LLM performance without additional latency.

Abstract

Inference-time scaling trades efficiency for increased reasoning accuracy by generating longer or more parallel sequences. However, in Transformer LLMs, generation cost is bottlenecked by the size of the key-value (KV) cache, rather than the number of generated tokens. Hence, we explore inference-time hyper-scaling: by compressing the KV cache, we can generate more tokens within the same compute budget and further improve the accuracy of scaled inference. The success of this approach, however, hinges on the ability of compression methods to preserve accuracy even at high compression ratios. To make hyper-scaling practical, we introduce Dynamic Memory Sparsification (DMS), a novel method for sparsifying KV caches that only requires 1K training steps to achieve 8× compression, while maintaining better accuracy than training-free sparse attention. Instead of prematurely discarding…
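
The budget argument in the abstract can be illustrated with back-of-the-envelope arithmetic. The sketch below is not the paper's method or code: it only counts KV cache bytes for a hypothetical model configuration (32 layers, 8 KV heads, head dimension 128, 16-bit values, all assumed for illustration) and assumes, as an idealization, that a compression ratio of r shrinks the cache by exactly a factor of r.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for num_tokens tokens."""
    # Factor 2: one key vector and one value vector per token, per head, per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


def tokens_within_budget(budget_bytes, compression_ratio, **model):
    """Tokens that fit in budget_bytes if the cache shrinks by compression_ratio.

    Idealized assumption: the compressed cache is exactly 1/compression_ratio
    of the uncompressed size.
    """
    per_token = kv_cache_bytes(1, **model)
    return (budget_bytes * compression_ratio) // per_token


# Hypothetical configuration, chosen only to make the arithmetic concrete.
model = dict(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_value=2)
budget = 2**30  # 1 GiB reserved for the KV cache

baseline = tokens_within_budget(budget, 1, **model)
compressed = tokens_within_budget(budget, 8, **model)
print(baseline, compressed)
```

Under these assumed numbers each token costs 128 KiB of cache, so 1 GiB holds 8,192 tokens uncompressed and 65,536 tokens at 8× compression. Those extra tokens can be spent on longer or more parallel reasoning chains within the same memory footprint.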

Videos

Inference-Time Hyper-Scaling with KV Cache Compression · SlidesLive

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Advanced Neural Network Applications

Methods: Absolute Position Encodings · Layer Normalization · Byte Pair Encoding · Label Smoothing · Softmax · Dropout · Dense Connections · Transformer