Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

Zhaochen Su; Peng Xia; Hangyu Guo; Zhenhua Liu; Yan Ma; Xiaoye Qu; Jiaqi Liu; Yanshu Li; Kaide Zeng; Zhengyuan Yang; Linjie Li; Yu Cheng; Heng Ji; Junxian He; Yi R. Fung

arXiv:2506.23918 · cs.CV · July 4, 2025

TL;DR

This survey explores the evolution of multimodal reasoning in AI from static image understanding to dynamic, thinking-with-images paradigms, highlighting foundational principles, methods, benchmarks, and future challenges.

Contribution

The survey establishes a three-stage framework for thinking-with-images AI, reviews the core methods at each stage, analyzes benchmarks, and outlines future research directions.

Findings

01 Introduction of a three-stage evolution framework

02 Comprehensive review of core methods at each stage

03 Analysis of benchmarks and future challenges

Abstract

Recent progress in multimodal reasoning has been significantly advanced by textual Chain-of-Thought (CoT), a paradigm where models conduct reasoning within language. This text-centric approach, however, treats vision as a static, initial context, creating a fundamental "semantic gap" between rich perceptual data and discrete symbolic thought. Human cognition often transcends language, utilizing vision as a dynamic mental sketchpad. A similar evolution is now unfolding in AI, marking a fundamental paradigm shift from models that merely think about images to those that can truly think with images. This emerging paradigm is characterized by models leveraging visual information as intermediate steps in their thought process, transforming vision from a passive input into a dynamic, manipulable cognitive workspace. In this survey, we chart this evolution of intelligence along a trajectory of…

Code & Models

Repositories

zhaochen0110/awesome_think_with_images (official)
