Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, Ranjay Krishna

TL;DR
This paper introduces Sketchpad, a visual sketching framework for multimodal language models that enhances reasoning by allowing models to draw and manipulate visual artifacts, leading to significant performance improvements.
Contribution
Sketchpad enables multimodal LMs to draw and reason with visual sketches and tools, bridging the gap between human-like sketching and AI reasoning capabilities.
Findings
Significant performance gains on math and visual reasoning tasks.
State-of-the-art results on multiple benchmarks.
Effective integration of visual sketching with vision models.
Abstract
Humans draw to facilitate reasoning: we draw auxiliary lines when solving geometry problems; we mark and circle when reasoning on maps; we use sketches to amplify our ideas and relieve our limited-capacity working memory. However, such actions are missing in current multimodal language models (LMs). Current chain-of-thought and tool-use paradigms only use text as intermediate reasoning steps. In this work, we introduce Sketchpad, a framework that gives multimodal LMs a visual sketchpad and tools to draw on the sketchpad. The LM conducts planning and reasoning according to the visual artifacts it has drawn. Different from prior work, which uses text-to-image models to enable LMs to draw, Sketchpad enables LMs to draw with lines, boxes, marks, etc., which is closer to human sketching and better facilitates reasoning. Sketchpad can also use specialist vision models during the sketching…
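The draw-then-reason loop described in the abstract can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in and not the authors' implementation: the `Sketchpad` class, the `stub_model` function, and the `run` loop are invented for illustration, and the "model" is a hard-coded stub rather than a real multimodal LM. The sketch only shows the control flow: the model looks at what has been drawn so far, picks a drawing action (line, box, or mark), and answers once it has enough visual context.

```python
from dataclasses import dataclass, field


@dataclass
class Sketchpad:
    """Hypothetical canvas that records drawing primitives as plain tuples."""
    strokes: list = field(default_factory=list)

    def draw_line(self, start, end):
        self.strokes.append(("line", start, end))

    def draw_box(self, top_left, bottom_right):
        self.strokes.append(("box", top_left, bottom_right))

    def mark(self, point, label):
        self.strokes.append(("mark", point, label))


def stub_model(question, strokes):
    """Stand-in for a multimodal LM (assumption): picks the next action
    from the question and the strokes drawn so far."""
    if not strokes:
        # First step: draw an auxiliary line.
        return ("draw_line", ((0, 0), (4, 3)))
    if len(strokes) == 1:
        # Second step: mark a point of interest on that line.
        return ("mark", ((2, 1.5), "midpoint"))
    # Enough visual context: stop sketching and answer.
    return ("answer", f"answered after {len(strokes)} strokes")


def run(question, model, max_steps=5):
    """Alternate between model decisions and sketchpad updates."""
    pad = Sketchpad()
    for _ in range(max_steps):
        action, args = model(question, pad.strokes)
        if action == "answer":
            return args, pad
        # Dispatch the chosen drawing action to the sketchpad.
        getattr(pad, action)(*args)
    return None, pad


answer, pad = run("Where is the midpoint of the segment?", stub_model)
```

In a real system the stub would be replaced by a call to a multimodal LM that receives the rendered sketch as an image, and the sketchpad would render pixels rather than store tuples.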
Taxonomy
Topics: Digital Storytelling and Education · Speech and dialogue systems
Methods: Balanced Selection
