Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models

Sujung Hong; Chanyong Yoon; Seong Jae Hwang

arXiv:2605.14530 · cs.CV · May 20, 2026

TL;DR

This paper traces repetitive, poorly grounded long-form generation in large diffusion vision-language models (LDVLMs) to two causes, mask prior drift and positional attention collapse, and proposes a training-free, plug-and-play remedy.

Contribution

The authors introduce Mask Prior Suppression and Monotonic RoPE Scaling techniques to mitigate mask prior drift and attention collapse without additional training.

Findings

01. Improved long-form generation quality and visual grounding in LDVLMs.

02. Robust performance gains across multiple benchmarks.

03. Effective mitigation of repetitive generation and attention issues.

Abstract

Large diffusion vision-language models (LDVLMs) have recently emerged as a promising alternative to autoregressive models, enabling parallel decoding for efficient inference and leveraging bidirectional attention for global context. Despite these advances, their behavior under long-form generation remains underexplored. In this work, we show that existing LDVLMs suffer from repetitive generation and degraded visual grounding, and identify two underlying causes. First, repetitive generation originates from a mask token prior: since generation tokens are initialized as mask tokens, their hidden representations progressively drift toward a shared prior direction over generation steps. Second, a fundamental misalignment between the positional attention bias and the iterative unmasking process suppresses attention toward informative visual tokens, degrading visual grounding. Based on these…
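
The drift described in the abstract, hidden representations moving toward a shared prior direction over generation steps, can be pictured with a simple diagnostic. The sketch below is only an illustration under stated assumptions, not the authors' method or measurement: the function names `cosine` and `drift_per_step`, the choice of mean cosine similarity as the drift score, and the toy vectors are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0

def drift_per_step(hidden_by_step, prior_direction):
    """Mean cosine similarity to prior_direction at each generation step.

    hidden_by_step: list over steps; each entry is a list of token hidden vectors.
    prior_direction: hypothetical reference vector standing in for the mask prior.
    """
    return [
        sum(cosine(h, prior_direction) for h in step) / len(step)
        for step in hidden_by_step
    ]

# Toy data made up for illustration; these are not values from the paper.
prior = [1.0, 0.0]
steps = [
    [[0.0, 1.0], [0.0, -1.0]],  # orthogonal to the prior direction
    [[1.0, 1.0], [1.0, -1.0]],  # partly aligned with it
    [[1.0, 0.0], [2.0, 0.0]],   # fully aligned with it
]
scores = drift_per_step(steps, prior)
print(scores)
```

A score that rises across steps, as in this toy input, would be consistent with token representations collapsing onto one direction. In a real model, the hidden states and the reference direction would have to be extracted from the network, which this sketch does not attempt.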
