Imagine while Reasoning in Space: Multimodal Visualization-of-Thought
Chengzu Li, Wenshan Wu, Huanyu Zhang, Yan Xia, Shaoguang Mao, Li Dong, Ivan Vulić, Furu Wei

TL;DR
This paper introduces Multimodal Visualization-of-Thought (MVoT), a reasoning paradigm that lets multimodal large language models (MLLMs) generate image visualizations of their reasoning traces, improving performance on complex spatial reasoning tasks.
Contribution
It proposes MVoT, a multimodal reasoning framework that brings visual thinking to MLLMs, together with a token discrepancy loss for autoregressive MLLMs that improves the visual coherence and fidelity of the generated visualizations.
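The summary names a token discrepancy loss but does not define it. As a minimal sketch of one plausible reading, and not the authors' confirmed formulation, the loss can be thought of as penalizing probability mass placed on visual tokens whose embeddings lie far from the ground-truth token's embedding. The function name, argument names, the use of mean squared distance, and the averaging over positions are all assumptions made for illustration.

```python
import numpy as np

def token_discrepancy_loss(logits, target_ids, codebook):
    """Hypothetical sketch: expected embedding distance to the target visual token.

    logits:     (n, V) unnormalized scores over V visual tokens at n positions
    target_ids: (n,)   ground-truth visual token indices
    codebook:   (V, d) visual token embeddings (assumed fixed)
    """
    # Softmax over the visual vocabulary, shifted for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)

    # Mean squared distance from each target embedding to every codebook entry.
    target_emb = codebook[target_ids]                      # (n, d)
    diff = target_emb[:, None, :] - codebook[None, :, :]   # (n, V, d)
    dist = (diff ** 2).mean(axis=2)                        # (n, V)

    # Probability-weighted distance, averaged over positions (an assumption).
    return float((probs * dist).sum(axis=1).mean())
```

Under this reading, a prediction concentrated on the correct token contributes almost nothing, while a prediction concentrated on a token with a distant embedding is penalized in proportion to that distance, which is one way such a term could favor visually similar outputs.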
Findings
MVoT achieves competitive results on spatial reasoning tasks.
It shows significant improvements over Chain-of-Thought in challenging scenarios.
MVoT enables visual reasoning to complement verbal reasoning effectively.
Abstract
Chain-of-Thought (CoT) prompting has proven highly effective for enhancing complex reasoning in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). Yet, it struggles in complex spatial reasoning tasks. Nonetheless, human cognition extends beyond language alone, enabling the remarkable capability to think in both words and images. Inspired by this mechanism, we propose a new reasoning paradigm, Multimodal Visualization-of-Thought (MVoT). It enables visual thinking in MLLMs by generating image visualizations of their reasoning traces. To ensure high-quality visualization, we introduce token discrepancy loss into autoregressive MLLMs. This innovation significantly improves both visual coherence and fidelity. We validate this approach through several dynamic spatial reasoning tasks. Experimental results reveal that MVoT demonstrates competitive performance across…
Taxonomy
Topics: Multimodal Machine Learning Applications · Data Visualization and Analytics · Advanced Graph Neural Networks
