Inference-Time Hyper-Scaling with KV Cache Compression
Adrian Łańcucki, Konrad Staniszewski, Piotr Nawrot, Edoardo M. Ponti

TL;DR
This paper introduces a novel method called Dynamic Memory Sparsification (DMS) for compressing KV caches in Transformer LLMs, enabling inference-time hyper-scaling that improves accuracy without increasing compute or memory load.
Contribution
The paper proposes DMS, a new KV cache compression technique that maintains accuracy at high compression ratios and demonstrates its effectiveness across multiple LLMs and tasks.
Findings
DMS achieves 8× compression with only 1K training steps.
DMS improves accuracy on multiple benchmarks for scaled inference.
Inference-time hyper-scaling boosts LLM performance without additional latency.
Abstract
Inference-time scaling trades efficiency for increased reasoning accuracy by generating longer or more parallel sequences. However, in Transformer LLMs, generation cost is bottlenecked by the size of the key-value (KV) cache, rather than the number of generated tokens. Hence, we explore inference-time hyper-scaling: by compressing the KV cache, we can generate more tokens within the same compute budget and further improve the accuracy of scaled inference. The success of this approach, however, hinges on the ability of compression methods to preserve accuracy even at high compression ratios. To make hyper-scaling practical, we introduce Dynamic Memory Sparsification (DMS), a novel method for sparsifying KV caches that only requires 1K training steps to achieve 8× compression, while maintaining better accuracy than training-free sparse attention. Instead of prematurely discarding…
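The budget argument in the abstract can be illustrated with back-of-the-envelope arithmetic. The sketch below is not taken from the paper: the function names and the model shape (32 layers, 8 KV heads, head dimension 128, 2 bytes per value) are hypothetical, and it assumes an idealized cache that retains exactly 1/8 of its entries. It only shows how a fixed KV cache budget translates into more generated tokens once the cache is compressed.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for num_tokens tokens of one sequence."""
    # The leading factor 2 covers one key vector and one value vector per head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


def tokens_within_budget(budget_bytes, compression_ratio, **model_shape):
    """Tokens that fit in budget_bytes if only 1/compression_ratio of entries are kept."""
    per_token = kv_cache_bytes(1, **model_shape)
    return int(budget_bytes * compression_ratio // per_token)


# Hypothetical model shape, chosen only for illustration.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_value=2)

# Fix the budget to the memory used by 4096 uncompressed tokens.
budget = kv_cache_bytes(4096, **shape)

baseline_tokens = tokens_within_budget(budget, 1, **shape)    # no compression
compressed_tokens = tokens_within_budget(budget, 8, **shape)  # idealized 8x compression

print(baseline_tokens, compressed_tokens)
```

Under these assumptions the same cache budget holds eight times as many tokens, which is the headroom hyper-scaling spends on longer or more parallel reasoning sequences. Real compression ratios vary by layer, head, and input, so this is an upper-level intuition rather than a measurement.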
Code & Models
- 🤗 nvidia/KVzap-linear-Qwen3-8B · model · 25 dl · ♡ 1
- 🤗 nvidia/KVzap-mlp-Qwen3-8B · model · 349 dl · ♡ 3
- 🤗 nvidia/KVzap-mlp-Qwen3-32B · model · 20 dl · ♡ 5
- 🤗 nvidia/KVzap-linear-Qwen3-32B · model · 11 dl · ♡ 3
- 🤗 nvidia/KVzap-linear-Llama-3.1-8B-Instruct · model · 194 dl
- 🤗 nvidia/KVzap-mlp-Llama-3.1-8B-Instruct · model · 145 dl · ♡ 3
- 🤗 nvidia/Qwen3-8B-DMS-8x · model · 959 dl · ♡ 34
- 🤗 g023/Qwen3-8B-DMS-8x-4bit-NF4 · model · 136 dl · ♡ 1
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Advanced Neural Network Applications
Methods: Absolute Position Encodings · Layer Normalization · Byte Pair Encoding · Label Smoothing · Softmax · Dropout · Dense Connections · Transformer
