DiffThinker: Towards Generative Multimodal Reasoning with Diffusion Models
Zefeng He, Xiaoye Qu, Yafu Li, Tong Zhu, Siyuan Huang, Yu Cheng

TL;DR
DiffThinker introduces a diffusion-based generative framework for multimodal reasoning, reformulating it as an image-to-image task to improve logical consistency and spatial accuracy in vision-centric applications.
Contribution
The paper presents DiffThinker, a novel diffusion-based paradigm for multimodal reasoning, demonstrating significant performance improvements over existing models in vision-centric tasks.
Findings
Outperforms GPT-5 by +314.2% in certain tasks
Achieves +111.6% improvement over Gemini-3-Flash
Surpasses fine-tuned Qwen3-VL-32B baseline by +39.0%
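The summary does not say which metric or tasks these percentages refer to. Assuming they are relative improvements in a score such as accuracy, a gain of +314.2% would mean a score roughly 4.1 times the baseline's. A minimal sketch of that arithmetic follows; the function names and example scores are hypothetical and not taken from the paper.

```python
def relative_improvement(new_score: float, baseline_score: float) -> float:
    """Relative improvement of new_score over baseline_score, in percent."""
    if baseline_score <= 0:
        raise ValueError("baseline_score must be positive")
    return (new_score - baseline_score) / baseline_score * 100.0


def score_ratio(improvement_pct: float) -> float:
    """Convert a relative improvement in percent to a multiplicative ratio."""
    return 1.0 + improvement_pct / 100.0


# Hypothetical scores, for illustration only (not results from the paper).
print(relative_improvement(50.0, 25.0))

# Ratio implied by a +314.2% relative improvement, under the assumption above.
print(round(score_ratio(314.2), 3))
```

Read this way, a large relative gain can still correspond to a modest absolute score when the baseline is low, so the underlying scores matter when comparing the three figures.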
Abstract
While recent Multimodal Large Language Models (MLLMs) have made significant strides in multimodal reasoning, their reasoning processes remain predominantly text-centric, leading to suboptimal performance in complex long-horizon, vision-centric tasks. In this paper, we establish a novel Generative Multimodal Reasoning paradigm and introduce DiffThinker, a diffusion-based reasoning framework. Conceptually, DiffThinker reformulates multimodal reasoning as a native generative image-to-image task, achieving superior logical consistency and spatial precision in vision-centric tasks. We perform a systematic comparison between DiffThinker and MLLMs, providing the first in-depth investigation into the intrinsic characteristics of this paradigm and revealing four core properties: efficiency, controllability, native parallelism, and collaboration. Extensive experiments across four domains…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Constraint Satisfaction and Optimization
