SATORI-R1: Incentivizing Multimodal Reasoning through Explicit Visual Anchoring

Chuming Shen; Wei Wei; Xiaoye Qu; Yu Cheng

arXiv:2505.19094·cs.CV·December 4, 2025

TL;DR

SATORI-R1 introduces a multimodal reasoning framework for VQA that decomposes tasks into verifiable stages with explicit rewards, improving focus and accuracy over baseline models.

Contribution

It proposes a novel staged reasoning approach with explicit supervision and a new dataset, VQA-Verify, to enhance multimodal reasoning in VQA tasks.
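
To make "verifiable stages with explicit rewards" concrete, here is a minimal sketch of how a staged reward could be scored. It is not taken from the paper: the two stages (a predicted region and a final answer), the use of intersection-over-union (IoU) as the region check, the exact-match answer check, and the equal weights are all assumptions chosen for illustration, and the names `iou` and `staged_reward` are hypothetical.

```python
# Illustrative sketch only: the stage names, metrics, and weights are
# assumptions, not the reward design used by SATORI-R1.

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2), with x1 <= x2, y1 <= y2


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def staged_reward(
    pred_box: Box,
    gold_box: Box,
    pred_answer: str,
    gold_answer: str,
    w_region: float = 0.5,  # hypothetical weight for the region stage
    w_answer: float = 0.5,  # hypothetical weight for the answer stage
) -> float:
    """Combine one checkable score per stage into a single scalar reward."""
    # Stage 1: how well the predicted region matches the annotated one.
    region_score = iou(pred_box, gold_box)
    # Stage 2: case-insensitive exact match on the final answer.
    answer_score = float(
        pred_answer.strip().lower() == gold_answer.strip().lower()
    )
    return w_region * region_score + w_answer * answer_score


if __name__ == "__main__":
    reward = staged_reward((0, 0, 2, 2), (1, 1, 3, 3), "Cat", "cat")
    print(f"reward = {reward:.4f}")
```

Because each stage is scored against an annotation rather than judged as free-form text, every intermediate output contributes a checkable signal instead of only the final answer doing so.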

Findings

1. Achieves up to a 15.7% accuracy improvement over the baseline.

2. Attention analysis shows an enhanced focus on critical image regions.

3. Demonstrates consistent performance gains across seven benchmarks.

Abstract

DeepSeek-R1 has demonstrated powerful reasoning capabilities in the text domain through stable reinforcement learning (RL). Recently, works in the multimodal domain have begun to apply RL directly to generate R1-like free-form reasoning for Visual Question Answering (VQA) tasks. However, multimodal tasks differ intrinsically from textual tasks: they rely heavily on understanding the input image to solve the problem. Such free-form reasoning therefore faces two critical limitations in the VQA task: (1) Extended reasoning chains diffuse visual focus away from task-critical regions, degrading answer accuracy. (2) Unverifiable intermediate steps amplify policy-gradient variance and computational overhead. To address these issues, in this paper, we introduce SATORI (Spatially Anchored Task Optimization with…

Code & Models

Repositories

justairr/satori-r1
PyTorch · Official

Models

justairr/SATORI
model · 1 dl · ♡ 1

Datasets

justairr/VQA-Verify
dataset · 56 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Neural Network Applications

Methods: Softmax · Attention Is All You Need · Focus