TL;DR
DARE introduces token-wise activation reuse for diffusion language models, achieving up to 1.20x per-layer latency reduction with negligible loss in output quality. It can be combined with existing methods for further efficiency gains.
Contribution
The paper proposes two token-wise reuse mechanisms for diffusion LLMs: DARE-KV, which reuses cached key-value (KV) activations, and DARE-O, which reuses output activations. Both improve efficiency without retraining and with minimal performance loss.
Findings
Up to 1.20x per-layer latency reduction.
Reuses up to 87% of attention activations.
Negligible degradation on reasoning and code-generation benchmarks.
Abstract
Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to auto-regressive (AR) models, offering greater expressive capacity and potential for parallel generation and faster inference. However, open-source dLLMs remain immature, lagging behind AR models in both efficiency and quality. We identify an underexplored property of dLLMs: *token-wise redundancy* in bi-directional self-attention. Self-attention activations are highly correlated across tokens, and temporal changes in query representations can predict redundancy in corresponding key, value, and output activations. We introduce DARE, with two complementary mechanisms: DARE-KV, which reuses cached key-value (KV) activations, and DARE-O, which reuses output activations to reduce redundant computation while preserving quality. DARE achieves up to 1.20x per-layer latency reduction and reuses up to 87% of…
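To make the idea of token-wise reuse concrete, here is a minimal NumPy sketch of how query change between two denoising steps could gate key-value recomputation. It is an illustration only, not the authors' implementation: the cosine-similarity criterion, the `threshold` value, and the function names `select_tokens_to_recompute` and `reuse_kv` are all assumptions made for this example.

```python
import numpy as np

def select_tokens_to_recompute(q_prev, q_curr, threshold=0.95):
    """Boolean mask, True where a token's query changed enough to recompute.

    Assumption: change is measured by per-token cosine similarity between
    the query at the previous and the current denoising step.
    """
    num = np.sum(q_prev * q_curr, axis=-1)
    den = np.linalg.norm(q_prev, axis=-1) * np.linalg.norm(q_curr, axis=-1) + 1e-8
    return (num / den) < threshold

def reuse_kv(x, w_k, w_v, k_cache, v_cache, recompute):
    """Recompute K/V rows only for masked tokens; reuse cached rows elsewhere."""
    k, v = k_cache.copy(), v_cache.copy()
    k[recompute] = x[recompute] @ w_k
    v[recompute] = x[recompute] @ w_v
    return k, v

# Toy setup (hypothetical sizes): 8 tokens, hidden size 16.
rng = np.random.default_rng(0)
n_tokens, d = 8, 16
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

x_prev = rng.normal(size=(n_tokens, d))
x_curr = x_prev.copy()
x_curr[:2] = -x_prev[:2]  # only the first two tokens change between steps

mask = select_tokens_to_recompute(x_prev @ w_q, x_curr @ w_q)
k, v = reuse_kv(x_curr, w_k, w_v, x_prev @ w_k, x_prev @ w_v, mask)

reuse_fraction = 1.0 - mask.mean()  # share of tokens served from the cache
```

In this toy case six of the eight tokens are served from the cache, and the patched `k` and `v` match a full recomputation because the reused tokens did not change at all. In a real model the hidden states drift slightly at every step, so reuse introduces an approximation error that a threshold of this kind would trade off against the compute saved.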