DiffThinker: Towards Generative Multimodal Reasoning with Diffusion Models

Zefeng He; Xiaoye Qu; Yafu Li; Tong Zhu; Siyuan Huang; Yu Cheng

arXiv:2512.24165·cs.CV·January 1, 2026

Open Access · 1 model · 1 dataset

TL;DR

DiffThinker introduces a diffusion-based generative framework for multimodal reasoning, reformulating it as an image-to-image task to improve logical consistency and spatial accuracy in vision-centric applications.

Contribution

The paper presents DiffThinker, a diffusion-based paradigm for multimodal reasoning, and reports gains on vision-centric tasks over GPT-5, Gemini-3-Flash, and a fine-tuned Qwen3-VL-32B baseline.

Findings

01 · Outperforms GPT-5 by 314.2% on certain tasks

02 · Improves on Gemini-3-Flash by 111.6%

03 · Surpasses the fine-tuned Qwen3-VL-32B baseline by 39.0%
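
The page does not say how the percentages in the findings above are computed. A minimal sketch, assuming they are relative improvements over the baseline score, shows what such figures would mean in terms of score ratios. The function names and the example scores are hypothetical and are not taken from the paper.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Percentage gain of `new` over `baseline`, relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline * 100.0


def score_ratio(gain_percent: float) -> float:
    """Ratio new/baseline implied by a relative gain given in percent."""
    return 1.0 + gain_percent / 100.0


# Hypothetical scores chosen only to illustrate the arithmetic; NOT from the paper.
hypothetical_baseline = 20.0
hypothetical_new = 82.84
gain = relative_improvement(hypothetical_baseline, hypothetical_new)
print(f"relative gain: {gain:.1f}%")

# Percentages as listed in the findings, converted to score ratios
# under the relative-improvement assumption.
for reported in (314.2, 111.6, 39.0):
    print(f"+{reported}% -> {score_ratio(reported):.3f}x the baseline score")
```

Under this reading, a gain of 314.2% corresponds to roughly 4.1 times the baseline score, 111.6% to roughly 2.1 times, and 39.0% to roughly 1.4 times. If the paper instead reports absolute percentage-point differences, this conversion does not apply.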

Abstract

While recent Multimodal Large Language Models (MLLMs) have made significant strides in multimodal reasoning, their reasoning processes remain predominantly text-centric, leading to suboptimal performance in complex long-horizon, vision-centric tasks. In this paper, we establish a novel Generative Multimodal Reasoning paradigm and introduce DiffThinker, a diffusion-based reasoning framework. Conceptually, DiffThinker reformulates multimodal reasoning as a native generative image-to-image task, achieving superior logical consistency and spatial precision in vision-centric tasks. We perform a systematic comparison between DiffThinker and MLLMs, providing the first in-depth investigation into the intrinsic characteristics of this paradigm and revealing four core properties: efficiency, controllability, native parallelism, and collaboration. Extensive experiments across four domains…

Code & Models

Models

🤗 yhx12/DiffThinker · model · ♡ 12

Datasets

yhx12/DiffThinker_Eval · dataset · 333 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Constraint Satisfaction and Optimization