MAGE: All-[MASK] Block Already Knows Where to Look in Diffusion LLM
Omin Kwon, Yeonjae Kim, Doyeon Kim, Minseo Kim, Yeonhong Park, Jae W. Lee

TL;DR
MAGE is a sparse attention method for block diffusion LLMs: it identifies the important KV cache entries at the first denoising step of each block, enabling sparse attention with up to 3-4x end-to-end speedup and only a few hours of fine-tuning on a single GPU.
Contribution
This work uses attention from the first (All-[MASK]) denoising step of each block to select the KV cache entries that the block attends to, and it outperforms sparse attention methods designed for autoregressive LLMs when those are adapted to block diffusion.
Findings
Achieves near-lossless accuracy with a fraction of the KV budget.
Delivers up to 3-4x end-to-end speedup on long-context benchmarks.
Requires only a few hours of fine-tuning on a single GPU.
Abstract
Block diffusion LLMs are emerging as a promising next paradigm for language generation, but their use of KV caching makes memory access a dominant bottleneck in long-context settings. While dynamic sparse attention has been actively explored, existing methods designed for autoregressive LLMs rely on approximate importance estimation and perform poorly when adapted to block diffusion. This work identifies a key opportunity unique to block diffusion: attention at the first All-[MASK] denoising step reliably predicts important KV entries and budget requirements, enabling MAGE to perform a single exact attention pass per block and reuse it for training-free sparse denoising. Across long-context benchmarks including LongBench and Needle-in-a-Haystack, MAGE achieves near-lossless accuracy with a fraction of the KV budget while delivering up to 3-4x end-to-end speedup, consistently…
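The core idea in the abstract (one exact attention pass at the All-[MASK] step, then sparse attention over the selected KV entries for the remaining denoising steps of the block) can be illustrated with a minimal single-head NumPy sketch. This is not the authors' implementation: the function names, the averaging of attention over the block's queries, and the cumulative-mass rule for choosing the budget (`mass`) are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def select_kv_from_first_step(q_mask, keys, mass=0.95):
    """Pick KV entries using exact attention at the all-[MASK] step.

    q_mask: (B, d) queries of the fully masked block (hypothetical input).
    keys:   (N, d) cached keys of the context.
    mass:   assumed rule: keep the smallest set of entries whose averaged
            attention mass reaches this fraction.
    Returns the sorted indices of the kept KV entries.
    """
    d = keys.shape[1]
    probs = softmax(q_mask @ keys.T / np.sqrt(d))   # (B, N), exact attention
    score = probs.mean(axis=0)                      # (N,), sums to 1
    order = np.argsort(-score)                      # most important first
    cum = np.cumsum(score[order])
    budget = min(int(np.searchsorted(cum, mass)) + 1, len(order))
    return np.sort(order[:budget])

def sparse_attention(q, keys, values, keep):
    """Attention restricted to the KV entries selected at the first step."""
    d = keys.shape[1]
    probs = softmax(q @ keys[keep].T / np.sqrt(d))  # (B, len(keep))
    return probs @ values[keep]
```

In this sketch, `select_kv_from_first_step` would run once per block and its result `keep` would be reused by `sparse_attention` at every later denoising step of that block, so both the selected entries and the budget (`len(keep)`) come from the single exact pass.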
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · Ferroelectric and Negative Capacitance Devices
