Visual Generation Unlocks Human-Like Reasoning through Multimodal World Models

Jialong Wu; Xiaoying Zhang; Hongyi Yuan; Xiangcheng Zhang; Tianhao Huang; Changjing He; Chaoyi Deng; Renrui Zhang; Youbin Wu; Mingsheng Long

arXiv:2601.19834 · cs.AI · January 28, 2026

TL;DR

This paper investigates how visual generation within multimodal models enhances reasoning, especially in physical and spatial domains, by formalizing world models and empirically testing their benefits through a new evaluation suite.

Contribution

The paper introduces a formal analysis of the role of world models in chain-of-thought reasoning and uses controlled experiments to show when visual generation improves reasoning.

Findings

01 Interleaved visual-verbal reasoning outperforms verbal-only in physical tasks.

02 Visual world models help overcome representational limitations.

03 Multimodal models excel in tasks grounded in the physical world.

Abstract

Humans construct internal world models and reason by manipulating the concepts within these models. Recent advances in AI, particularly chain-of-thought (CoT) reasoning, approximate such human cognitive abilities, where world models are believed to be embedded within large language models. Expert-level performance in formal and abstract domains such as mathematics and programming has been achieved in current systems by relying predominantly on verbal reasoning. However, they still lag far behind humans in domains like physical and spatial intelligence, which require richer representations and prior knowledge. The emergence of unified multimodal models (UMMs) capable of both verbal and visual generation has therefore sparked interest in more human-like reasoning grounded in complementary multimodal pathways, though their benefits remain unclear. From a world-model perspective, this paper…

Code & Models

Datasets

thuml/VisWorld-Eval
dataset· 362 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Constraint Satisfaction and Optimization · Action Observation and Synchronization