EntropyCache: Decoded Token Entropy Guided KV Caching for Diffusion Language Models
Minsoo Cheong, Donghyun Son, Woosang Lim, Sungjoo Yoo

TL;DR
EntropyCache is a training-free KV caching method for diffusion language models that uses the entropy of newly decoded tokens to decide when to recompute cached states, speeding up inference with minimal decision overhead.
Contribution
It introduces an entropy-based decision mechanism for KV cache updates whose cost is constant, independent of context length and model size.
Findings
Achieves 15.2x-26.4x speedup on standard benchmarks.
Maintains competitive accuracy with minimal decision overhead.
Decision process accounts for only 0.5% of inference time.
Abstract
Diffusion-based large language models (dLLMs) rely on bidirectional attention, which prevents lossless KV caching and requires a full forward pass at every denoising step. Existing approximate KV caching methods reduce this cost by selectively updating cached states, but their decision overhead scales with context length or model depth. We propose EntropyCache, a training-free KV caching method that uses the maximum entropy of newly decoded token distributions as a constant-cost signal for deciding when to recompute. Our design is grounded in two empirical observations: (1) decoded token entropy correlates with KV cache drift, providing a cheap proxy for cache staleness, and (2) feature volatility of decoded tokens persists for multiple steps after unmasking, motivating recomputation of the most recently decoded tokens. The skip-or-recompute decision requires only computation…
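The skip-or-recompute signal described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `threshold` parameter, and the choice to recompute when the maximum entropy exceeds the threshold are assumptions made for illustration only. The sketch shows why the decision cost depends only on the newly decoded tokens and the vocabulary size, not on context length.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(probs):
    """Shannon entropy (in nats) of one token's predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_recompute(new_token_logits, threshold):
    """Hypothetical decision rule: recompute the cache when the maximum
    entropy among the tokens decoded at this step exceeds `threshold`.

    `new_token_logits` holds one logit vector per newly decoded token.
    `threshold` is an assumed tuning knob, not a value from the paper.
    """
    if not new_token_logits:
        return False  # nothing was decoded, so there is no signal to act on
    max_entropy = max(token_entropy(softmax(l)) for l in new_token_logits)
    return max_entropy > threshold


# A confidently decoded token has low entropy; a uniform one has high entropy.
confident = [10.0, 0.0, 0.0, 0.0]
uncertain = [0.0, 0.0, 0.0, 0.0]
print(should_recompute([confident], threshold=0.5))
print(should_recompute([confident, uncertain], threshold=0.5))
```

Taking the maximum rather than the mean means a single uncertain token is enough to trigger recomputation, which matches the abstract's description of using the maximum entropy of the newly decoded token distributions.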
Taxonomy
Topics: Topic Modeling · Advanced Neural Network Applications · Natural Language Processing Techniques
