SketchThinker-R1: Towards Efficient Sketch-Style Reasoning in Large Multimodal Models
Ruiyang Zhang, Dongzhan Zhou, Zhedong Zheng

TL;DR
SketchThinker-R1 introduces a training framework that encourages large multimodal models to adopt efficient, human-like sketch-style reasoning, significantly reducing computational costs while maintaining accuracy.
Contribution
The paper presents a novel three-stage training method to instill and enhance sketch-style reasoning in large multimodal models, improving efficiency and interpretability.
Findings
Achieves over 64% reduction in reasoning token cost
Maintains high answer accuracy despite efficiency improvements
Focuses reasoning on key cues for better interpretability
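As a rough illustration of how a token-cost reduction such as the one reported above can be computed, the minimal sketch below compares an average reasoning length against a baseline. The function name `token_reduction` and the token counts are hypothetical, chosen only for illustration; they are not taken from the paper.

```python
def token_reduction(baseline_tokens: float, sketch_tokens: float) -> float:
    """Return the fractional reduction in reasoning tokens relative to a baseline."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return (baseline_tokens - sketch_tokens) / baseline_tokens


# Hypothetical average reasoning lengths (assumed values, not results from the paper).
reduction = token_reduction(baseline_tokens=500, sketch_tokens=180)
print(f"Reasoning token reduction: {reduction:.0%}")
```

Under these assumed counts the reduction is 64%; the figure in the paper would depend on the actual per-benchmark token averages.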
Abstract
Despite the empirical success of extensive, step-by-step reasoning in large multimodal models, long reasoning processes inevitably incur substantial computational overhead, i.e., higher token costs and increased response time, which undermines inference efficiency. In contrast, humans often employ sketch-style reasoning: a concise, goal-directed cognitive process that prioritizes salient information and enables efficient problem-solving. Inspired by this cognitive efficiency, we propose SketchThinker-R1, which incentivizes sketch-style reasoning ability in large multimodal models. Our method consists of three primary stages. In the Sketch-Mode Cold Start stage, we convert standard long reasoning processes into sketch-style reasoning and fine-tune the base multimodal model, instilling an initial sketch-style reasoning capability. Next, we train the SketchJudge Reward Model, which…
Peer Reviews
Decision: ICLR 2026 Poster
1. The method delivers good efficiency gains on multimodal reasoning, with a 64% token reduction, while maintaining accuracy. 2. The method is simple and makes sense. It shows that, by carefully curating the SFT cold-start data and adding a new reward for sketch style, it is possible to achieve a similar reasoning process while greatly reducing the token count. 3. The ablations are well designed and provide good analysis.
1. The major weakness is that the scale of the experiments is very small: the training set contains only 1K prompts, and training runs for only 150 steps. It is questionable whether this method is scalable and generalizable to more tasks. What will happen if a bigger training set is used and more FLOPs are spent on training? Will it keep improving, or is this method more unstable than vanilla R1? The paper would be stronger if it included more scaling experiments. 2. More analysis on the reward models would make the
The motivation is good. Reducing the number of tokens is reasonable for reasoning in multimodal tasks. The three-stage pipeline (cold-start conversion of reasoning style, reward model, RL) is reasonably well designed and coherently described. The reward model is reasonable. Experiments consistently show good results across different domains (visual reasoning, logic, physics), with substantial reductions in token cost while preserving good accuracy.
1. Suggest an experiment on the ratio of the sketch reward. What happens if the sketch coefficient is increased and the format coefficient is lowered? 2. It is unclear whether every baseline shares the same training data or only vanilla R1 does. It would be good to include more details. 3. How much time (in GPU hours) could be saved in practice?
1. This paper introduces a novel and comprehensive three-stage framework, SketchThinker-R1, which moves beyond superficial reasoning length constraints by directly fostering an "intrinsic sketch-style thinking" within large multimodal models. This approach represents a significant paradigm shift from external compression to internal cognitive efficiency. 2. The development and integration of the specialized SketchJudge reward model are a key strength. This model's ability to accurately evaluate
1. This paper relies on a powerful closed-source large language model for generating concise "sketch-style reasoning" data. It remains uncertain whether these generated data consistently encapsulate all necessary reasoning steps, which could potentially impact the interpretability of the subsequent black-box reasoning model and challenge the paper's assumption of mimicking human thought. 2. This paper claims to enhance efficiency without compromising accuracy; however, its accuracy baseline for
Taxonomy
Topics: Multimodal Machine Learning Applications · Intelligent Tutoring Systems and Adaptive Learning · Child and Animal Learning Development
