Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities
Sachit Menon, Richard Zemel, Carl Vondrick

TL;DR
Whiteboard-of-Thought is a novel prompting method that enhances multimodal large language models' visual reasoning by allowing them to draw and process images as part of their reasoning steps, significantly improving performance on spatial tasks.
Contribution
The paper introduces a simple, effective prompting technique enabling multimodal models to perform visual reasoning by drawing and interpreting images without additional training or modules.
Findings
Achieves state-of-the-art results on four visual reasoning tasks.
Enables up to 92% accuracy in settings where chain-of-thought prompting fails.
Identifies key factors influencing the method's success and errors.
Abstract
When presented with questions involving visual thinking, humans naturally switch reasoning modalities, often forming mental images or drawing visual aids. Large language models have shown promising results in arithmetic and symbolic reasoning by expressing intermediate reasoning in text as a chain of thought, yet struggle to extend this capability to answer text queries that are easily solved by visual reasoning, even with extensive multimodal pretraining. We introduce a simple method, whiteboard-of-thought prompting, to unlock the visual reasoning capabilities of multimodal large language models across modalities. Whiteboard-of-thought prompting provides multimodal large language models with a metaphorical 'whiteboard' to draw out reasoning steps as images, then returns these images back to the model for further processing. We find this can be accomplished with no demonstrations or…
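The draw-then-look loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the model "draws" by emitting drawing code that a separate renderer turns into an image, and the names `whiteboard_of_thought`, `Model`, `Renderer`, `fake_model`, and `fake_render` are all hypothetical. The stubs stand in for a real multimodal model and a real renderer so the control flow runs without any API.

```python
from typing import Callable, Optional

# Hypothetical interfaces (assumptions, not from the paper):
#   Model:    (prompt text, optional image bytes) -> text reply
#   Renderer: (drawing code emitted by the model) -> rendered image bytes
Model = Callable[[str, Optional[bytes]], str]
Renderer = Callable[[str], bytes]


def whiteboard_of_thought(query: str, model: Model, render: Renderer) -> str:
    """One draw-then-look round trip for a text-only query."""
    # Step 1: ask the model to express its reasoning as a drawing.
    drawing_code = model(
        "Write code that draws a visualization to help answer:\n" + query,
        None,
    )
    # Step 2: turn the drawing into an image (the "whiteboard").
    image = render(drawing_code)
    # Step 3: hand the image back to the same model for the final answer.
    return model("Using the attached image, answer:\n" + query, image)


# Stub model and renderer so the loop can be exercised offline.
def fake_model(prompt: str, image: Optional[bytes]) -> str:
    if image is None:
        return "draw_circle()"
    return f"answer based on {len(image)} image bytes"


def fake_render(code: str) -> bytes:
    # A real renderer would execute the drawing code and return image bytes.
    return code.encode("utf-8")


result = whiteboard_of_thought("What shape is this?", fake_model, fake_render)
```

Passing the model and renderer as callables keeps the sketch independent of any particular model API or plotting library; a real setup would replace both stubs.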
Taxonomy
Topics: Information Systems Theories and Implementation
