Visual Generation Unlocks Human-Like Reasoning through Multimodal World Models
Jialong Wu, Xiaoying Zhang, Hongyi Yuan, Xiangcheng Zhang, Tianhao Huang, Changjing He, Chaoyi Deng, Renrui Zhang, Youbin Wu, Mingsheng Long

TL;DR
This paper investigates how visual generation within multimodal models enhances reasoning, especially in physical and spatial domains, by formalizing world models and empirically testing their benefits through a new evaluation suite.
Contribution
It introduces a formal analysis of world models in chain-of-thought reasoning and demonstrates when visual generation improves reasoning through controlled experiments.
Findings
Interleaved visual-verbal reasoning outperforms verbal-only reasoning on physical tasks.
Visual world models help overcome representational limitations.
Multimodal models excel in tasks grounded in the physical world.
Abstract
Humans construct internal world models and reason by manipulating the concepts within these models. Recent advances in AI, particularly chain-of-thought (CoT) reasoning, approximate such human cognitive abilities, where world models are believed to be embedded within large language models. Current systems have achieved expert-level performance in formal and abstract domains such as mathematics and programming by relying predominantly on verbal reasoning. However, they still lag far behind humans in domains like physical and spatial intelligence, which require richer representations and prior knowledge. The emergence of unified multimodal models (UMMs) capable of both verbal and visual generation has therefore sparked interest in more human-like reasoning grounded in complementary multimodal pathways, though their benefits remain unclear. From a world-model perspective, this paper…
Taxonomy
Topics: Multimodal Machine Learning Applications · Constraint Satisfaction and Optimization · Action Observation and Synchronization
