SATORI-R1: Incentivizing Multimodal Reasoning through Explicit Visual Anchoring
Chuming Shen, Wei Wei, Xiaoye Qu, Yu Cheng

TL;DR
SATORI-R1 introduces a multimodal reasoning framework for VQA that decomposes tasks into verifiable stages with explicit rewards, improving focus and accuracy over baseline models.
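To make "verifiable stages with explicit rewards" concrete, the following is a minimal sketch of how a staged reward could be scored. It is not the paper's reward. The two stages (a predicted image region and a final answer), the IoU check, the exact-match check, and the weights `w_box` and `w_answer` are all assumptions chosen for illustration, as are the names `StageOutput`, `iou`, and `staged_reward`.

```python
from dataclasses import dataclass
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical format


@dataclass
class StageOutput:
    """Hypothetical container for the staged outputs of one model rollout."""
    box: Box      # region the model claims is relevant to the question
    answer: str   # final answer text


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def staged_reward(pred: StageOutput, ref_box: Box, ref_answer: str,
                  w_box: float = 0.5, w_answer: float = 0.5) -> float:
    """Weighted sum of per-stage rewards; each stage is checkable on its own.

    The weights are illustrative assumptions, not values from the paper.
    """
    r_box = iou(pred.box, ref_box)
    match = pred.answer.strip().lower() == ref_answer.strip().lower()
    r_answer = 1.0 if match else 0.0
    return w_box * r_box + w_answer * r_answer


if __name__ == "__main__":
    rollout = StageOutput(box=(0.0, 0.0, 2.0, 2.0), answer="Red")
    print(staged_reward(rollout, ref_box=(1.0, 0.0, 3.0, 2.0), ref_answer="red"))
```

Because each stage has its own checkable score, a rollout that looks at the right region but answers wrongly still receives partial credit, which is one plausible way intermediate supervision could give a denser training signal than a single end-of-chain reward.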
Contribution
It proposes a novel staged reasoning approach with explicit supervision and a new dataset, VQA-Verify, to enhance multimodal reasoning in VQA tasks.
Findings
Achieves up to 15.7% accuracy improvement over baseline.
Attention analysis shows enhanced focus on critical image regions.
Demonstrates consistent performance gains across seven benchmarks.
Abstract
DeepSeek-R1 has demonstrated powerful reasoning capabilities in the text domain through stable reinforcement learning (RL). Recently, works in the multimodal domain have begun to apply RL directly to generate R1-like free-form reasoning for Visual Question Answering (VQA) tasks. However, multimodal tasks differ intrinsically from textual tasks: they rely heavily on understanding the input image to solve the problem. Therefore, such free-form reasoning faces two critical limitations in the VQA task: (1) Extended reasoning chains diffuse visual focus away from task-critical regions, degrading answer accuracy. (2) Unverifiable intermediate steps amplify policy-gradient variance and computational overhead. To address these issues, in this paper, we introduce SATORI ( with…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Neural Network Applications
Methods: Softmax · Attention Is All You Need · Focus
