Revealing the Attention Floating Mechanism in Masked Diffusion Models
Xin Dai, Pengcheng Huang, Zhenghao Liu, Shuo Wang, Yukun Yan, Chaojun Xiao, Yu Gu, Ge Yu, and Maosong Sun

TL;DR
This paper identifies a dynamic attention pattern in masked diffusion models (MDMs), called Attention Floating, and uses it to explain their strong in-context learning and their performance on knowledge-intensive tasks relative to autoregressive models (ARMs).
Contribution
The paper introduces the concept of Attention Floating in MDMs and analyzes how it is structured across layers, offering insight into where MDMs gain capability over ARMs.
Findings
Attention in MDMs shifts dynamically across steps and layers.
Shallow layers focus on the structural framework; deep layers focus on semantic content.
MDMs outperform ARMs on knowledge-intensive tasks, doubling accuracy.
Abstract
Masked diffusion models (MDMs), which leverage bidirectional attention and a denoising process, are narrowing the performance gap with autoregressive models (ARMs). However, their internal attention mechanisms remain under-explored. This paper investigates the attention behaviors in MDMs, revealing the phenomenon of Attention Floating. Unlike ARMs, where attention converges to a fixed sink, MDMs exhibit dynamic, dispersed attention anchors that shift across denoising steps and layers. Further analysis reveals its Shallow Structure-Aware, Deep Content-Focused attention mechanism: shallow layers utilize floating tokens to build a global structural framework, while deeper layers allocate more capability toward capturing semantic content. Empirically, this distinctive attention pattern provides a mechanistic explanation for the strong in-context learning capabilities of MDMs, allowing them…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Functional Brain Connectivity Studies
