Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models
Sujung Hong, Chanyong Yoon, Seong Jae Hwang

TL;DR
This paper traces repetitive, poorly grounded long-form generation in large diffusion vision-language models (LDVLMs) to two causes, mask prior drift and positional attention collapse, and proposes a training-free, plug-and-play solution.
Contribution
The authors introduce Mask Prior Suppression and Monotonic RoPE Scaling, two training-free techniques that mitigate mask prior drift and positional attention collapse.
Findings
Improved long-form generation quality and visual grounding in LDVLMs.
Robust performance gains across multiple benchmarks.
Effective mitigation of repetitive generation and attention issues.
Abstract
Large diffusion vision-language models (LDVLMs) have recently emerged as a promising alternative to autoregressive models, enabling parallel decoding for efficient inference and leveraging bidirectional attention for global context. Despite these advances, their behavior under long-form generation remains underexplored. In this work, we show that existing LDVLMs suffer from repetitive generation and degraded visual grounding, and identify two underlying causes. First, repetitive generation originates from a mask token prior: since generation tokens are initialized as mask tokens, their hidden representations progressively drift toward a shared prior direction over generation steps. Second, a fundamental misalignment between the positional attention bias and the iterative unmasking process suppresses attention toward informative visual tokens, degrading visual grounding. Based on these…
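The first cause described in the abstract, hidden representations drifting toward a shared prior direction over generation steps, can be illustrated with a toy simulation. The sketch below is not the authors' measurement or method: the update rule, the `rate` parameter, and the function names `mean_alignment` and `simulate_drift` are all assumptions made for illustration. It only shows how such drift could be quantified as the mean cosine similarity between token states and a single shared direction.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_alignment(hidden_states, direction):
    """Average cosine similarity between each hidden state and one direction."""
    return sum(cosine(h, direction) for h in hidden_states) / len(hidden_states)


def simulate_drift(num_tokens=8, dim=16, steps=5, rate=0.3, seed=0):
    """Toy model (an assumption, not the paper's dynamics): at every step each
    token state moves a fixed fraction `rate` toward one shared prior vector."""
    rng = random.Random(seed)
    prior = [rng.gauss(0, 1) for _ in range(dim)]
    states = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
    history = [mean_alignment(states, prior)]
    for _ in range(steps):
        states = [
            [(1 - rate) * h + rate * p for h, p in zip(state, prior)]
            for state in states
        ]
        history.append(mean_alignment(states, prior))
    return history


# One alignment value per step, starting with the random initialization.
alignment_per_step = simulate_drift()
```

In this toy setting the alignment rises at every step, so initially distinct token states become increasingly similar to one another, which is one plausible way to picture how a shared prior could lead to repetitive output. In a real LDVLM the hidden states would come from the model itself rather than from a synthetic update.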
