Chain-of-Sketch: Enabling Global Visual Reasoning
Aryo Lotfi, Enrico Fini, Samy Bengio, Moin Nabi, Emmanuel Abbe

TL;DR
This paper introduces Chain-of-Sketch, a method that improves global visual reasoning in large vision models by breaking complex tasks into intermediate steps with a Markovian structure, enhancing generalization and efficiency.
Contribution
We propose the chain-of-sketch technique with a Markovian structure, enabling better learning and generalization on global reasoning tasks in vision models.
Findings
Large vision models struggle with global reasoning tasks.
Chain-of-sketch improves learning efficiency and generalization.
A Markovian structure in chain-of-sketch (CoS) enhances out-of-distribution performance.
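To make "global task" and "Markovian structure" concrete, here is a minimal toy sketch, not the paper's implementation. It assumes a binary grid connectivity task in the spirit of the Minsky and Papert example: a hand-written rule stands in for the learned model, and each intermediate "sketch" is a grid in which the reachable cells found so far are coloured. The names `sketch_step` and `connected`, and the 0/1/2 cell encoding, are hypothetical choices made for illustration. The step function sees only the current sketch, never the earlier ones, which is the sense of "Markovian" assumed here.

```python
from typing import List, Tuple

# Hypothetical encoding: 0 = wall, 1 = open cell, 2 = open cell already coloured.
Grid = List[List[int]]


def sketch_step(sketch: Grid) -> Grid:
    """Produce the next sketch from the current one only (Markovian step).

    Every open cell that touches a coloured cell becomes coloured.
    """
    rows, cols = len(sketch), len(sketch[0])
    nxt = [row[:] for row in sketch]
    for r in range(rows):
        for c in range(cols):
            if sketch[r][c] != 1:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols and sketch[rr][cc] == 2:
                    nxt[r][c] = 2
                    break
    return nxt


def connected(grid: Grid, source: Tuple[int, int], target: Tuple[int, int]) -> bool:
    """Roll out sketches until nothing changes, then read off the answer.

    Assumes `source` is an open cell.
    """
    sketch = [row[:] for row in grid]
    sketch[source[0]][source[1]] = 2
    while True:
        nxt = sketch_step(sketch)
        if nxt == sketch:  # fixed point: no new cell was coloured
            break
        sketch = nxt
    return sketch[target[0]][target[1]] == 2


# Two grids that differ in a single cell (row 1, column 2) but have opposite labels.
open_path = [[1, 1, 1],
             [0, 0, 1],
             [1, 1, 1]]
blocked = [[1, 1, 1],
           [0, 0, 0],
           [1, 1, 1]]

print(connected(open_path, (0, 0), (2, 0)))
print(connected(blocked, (0, 0), (2, 0)))
```

The two grids differ in one cell far from both endpoints, yet the answer flips, so no small local patch determines the label. Because the step rule depends only on the current sketch, the same rule applies unchanged to larger grids that simply need more steps, which gives some intuition for why a Markovian decomposition might help out of distribution.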
Abstract
Modern vision models have achieved remarkable success in benchmarks where local features provide critical information about the target. There is now a growing interest in tackling tasks requiring more global reasoning, where local features do not provide significant information. Minsky and Papert put forward such tasks in 1969 with their connectivity study, exposing the limitations of the perceptron model. In this paper, we introduce an expanded set of global visual datasets involving graphs, strings, mazes, and image grids. We show that large vision models still struggle to learn these tasks efficiently. Similarly, state-of-the-art multi-modal LLMs perform poorly on these datasets. We explain this learning inefficiency by means of the 'globality degree' measure. To mitigate this, we propose a method called chain-of-sketch (CoS). Similar to the chain-of-thought and scratchpad techniques…
Taxonomy
Topics: Educational Tools and Methods · Online and Blended Learning
