Do multimodal models imagine electric sheep?
Santhosh Kumar Ramakrishnan, Carl Vondrick, Raja Giryes, Philipp Krähenbühl, Vladlen Koltun

TL;DR
Large multimodal models fine-tuned to predict action sequences for spatial puzzles develop internal visual representations of the intermediate states, and sharpening these mental images improves their reasoning.
Contribution
The paper demonstrates that multimodal models form an implicit, imperfect visual world model as a byproduct of action prediction training, and proposes two ways to sharpen and use this mental imagery to improve reasoning.
Findings
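The model's activations after each action encode meaningful visual information about the intermediate puzzle state.
Supervising open-loop action sequences fosters internal visual representations without any explicit visual supervision.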
Enhancing visual tokens improves reasoning accuracy from 83% to 89%.
Abstract
Yes. We find that large multimodal models develop mental imagery when solving spatial puzzles, and they do imagine sheep when solving sheep puzzles. We fine-tune a Qwen3.5 VLM to solve twelve diverse visual reasoning tasks -- including tangram, jigsaw, sokoban, 3D mental rotation, and rush hour -- that require understanding geometry, spatial relationships, and the consequences of actions. By supervising the model to predict the open-loop sequence of actions to solve a puzzle from an initial state, we show that the model's activations after each action encode meaningful visual information about the intermediate state. This finding suggests that an imperfect visual world model begins to form as a byproduct of learning to select correct actions, in the absence of any explicit visual supervision. Building on this observation, we propose two ways to sharpen and use the mental images formed…
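The claim that activations "encode meaningful visual information about the intermediate state" is the kind of statement usually tested with a probe. The abstract as shown here does not say how the authors read out this information, so the following is only a minimal sketch of one common approach, a linear ridge probe, and not the paper's method. All names (`fit_ridge_probe`, `make_synthetic`) are hypothetical, and the synthetic arrays merely stand in for real per-action hidden states and ground-truth puzzle states.

```python
import numpy as np


def fit_ridge_probe(hidden, targets, l2=1e-2):
    """Closed-form ridge regression mapping hidden states to state features."""
    dim = hidden.shape[1]
    gram = hidden.T @ hidden + l2 * np.eye(dim)
    return np.linalg.solve(gram, hidden.T @ targets)


def r2_score(pred, targets):
    """Coefficient of determination pooled over all target dimensions."""
    ss_res = ((targets - pred) ** 2).sum()
    ss_tot = ((targets - targets.mean(axis=0)) ** 2).sum()
    return 1.0 - ss_res / ss_tot


def make_synthetic(n=2000, state_dim=16, hidden_dim=64, noise=0.1, seed=0):
    """ASSUMPTION: toy data in which hidden states are a noisy linear mix
    of the puzzle state. Real activations would come from the model."""
    rng = np.random.default_rng(seed)
    states = rng.normal(size=(n, state_dim))
    mixing = rng.normal(size=(state_dim, hidden_dim))
    hidden = states @ mixing + noise * rng.normal(size=(n, hidden_dim))
    return hidden, states


hidden, states = make_synthetic()
split = 1500  # first 1500 rows train the probe, the rest evaluate it

# Probe: can the intermediate state be decoded from the activations?
W = fit_ridge_probe(hidden[:split], states[:split])
probe_r2 = r2_score(hidden[split:] @ W, states[split:])

# Control: same probe trained on shuffled targets, which should fail.
perm = np.random.default_rng(1).permutation(split)
W_ctrl = fit_ridge_probe(hidden[:split], states[:split][perm])
control_r2 = r2_score(hidden[split:] @ W_ctrl, states[split:])

print(f"probe R^2:   {probe_r2:.3f}")
print(f"control R^2: {control_r2:.3f}")
```

A large gap between the probe score and the shuffled-target control is what would support a statement like the one in the abstract. A linear probe is deliberately weak, so a high score suggests the information is readily accessible in the activations rather than computed by the probe itself.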
