Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models

Keuntae Kim; Mingyu Kang; Yong Suk Choi

arXiv:2604.05497 · cs.AI · April 8, 2026

TL;DR

This paper identifies two issues that arise when diffusion multimodal large language models (dMLLMs) are combined with Chain-of-Thought reasoning, premature answer generation and weak visual grounding in early timesteps, and proposes methods that improve both reasoning quality and speed.

Contribution

The paper introduces two techniques, Position and Step Penalty and Visual Reasoning Guidance, to improve reasoning accuracy and efficiency in dMLLMs.

Findings

01. Achieved up to 7.5% higher accuracy

02. Delivered more than 3x speedup in reasoning

03. Improved visual grounding and reasoning progression

Abstract

Diffusion large language models (dLLMs) are emerging as promising alternatives to autoregressive (AR) LLMs. Recently, this paradigm has been extended to multimodal tasks, leading to the development of diffusion multimodal large language models (dMLLMs). These models are expected to retain the reasoning capabilities of LLMs while enabling faster inference through parallel generation. However, when combined with Chain-of-Thought (CoT) reasoning, dMLLMs exhibit two critical issues. First, we observe that dMLLMs often generate the final answer token at a very early timestep. This trend indicates that the model determines the answer before sufficient reasoning, leading to degraded reasoning performance. Second, during the initial timesteps, dMLLMs show minimal dependency on visual prompts, exhibiting a fundamentally different pattern of visual information utilization compared to AR…
