Thinking with Generated Images

Ethan Chern; Zhulin Hu; Steffi Chern; Siqi Kou; Jiadi Su; Yan Ma; Zhijie Deng; Pengfei Liu

arXiv:2505.22525 · cs.CV · May 29, 2025

Open Access

TL;DR

This paper introduces a new paradigm for multimodal models to think visually by generating and critiquing intermediate images, significantly enhancing complex visual reasoning capabilities across various domains.

Contribution

The paper presents an approach that lets models generate and critique intermediate visual thoughts, extending visual reasoning beyond fixed input images and text-only chain-of-thought.

Findings

01 Up to 50% improvement in complex multi-object visual tasks

02 Effective decomposition of complex visual problems into manageable steps

03 Enhanced iterative visual hypothesis refinement

Abstract

We present Thinking with Generated Images, a novel paradigm that fundamentally transforms how large multimodal models (LMMs) engage with visual reasoning by enabling them to natively think across text and vision modalities through spontaneous generation of intermediate visual thinking steps. Current visual reasoning with LMMs is constrained to either processing fixed user-provided images or reasoning solely through text-based chain-of-thought (CoT). Thinking with Generated Images unlocks a new dimension of cognitive capability where models can actively construct intermediate visual thoughts, critique their own visual hypotheses, and refine them as integral components of their reasoning process. We demonstrate the effectiveness of our approach through two complementary mechanisms: (1) vision generation with intermediate visual subgoals, where models decompose complex visual tasks into…
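
The generate, critique, refine cycle described in the abstract can be pictured as a simple control loop. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the function `think_with_generated_images`, the callables `generate` and `critique`, the `max_rounds` cap, and the use of plain strings as stand-ins for images are all hypothetical names introduced here.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-ins (not from the paper): an "image" is just a string here.
Image = str
History = List[Tuple[Image, str]]
Generate = Callable[[str, History], Image]
Critique = Callable[[str, Image], Tuple[bool, str]]


def think_with_generated_images(
    prompt: str, generate: Generate, critique: Critique, max_rounds: int = 3
) -> Tuple[Image, History]:
    """Produce a visual hypothesis, critique it, and refine until accepted."""
    history: History = []
    image: Image = ""
    for _ in range(max_rounds):
        # Each new attempt can condition on earlier images and their critiques.
        image = generate(prompt, history)
        accepted, feedback = critique(prompt, image)
        history.append((image, feedback))
        if accepted:
            break
    return image, history


# Toy callables so the loop can be run without any model.
def toy_generate(prompt: str, history: History) -> Image:
    return f"draft-{len(history)}"


def toy_critique(prompt: str, image: Image) -> Tuple[bool, str]:
    ok = image == "draft-2"
    return ok, "ok" if ok else "missing object"


final_image, trace = think_with_generated_images(
    "a cat and a dog", toy_generate, toy_critique
)
```

In a real system a single multimodal model would presumably play both roles; the two callables are kept separate here only to make the control flow easy to read.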

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Language, Metaphor, and Cognition