Structured Role-Aware Policy Optimization for Multimodal Reasoning
Bingqing Jiang, Difan Zou

TL;DR
This paper introduces SRPO, a role-aware policy optimization method that improves multimodal reasoning by assigning token-level credit based on functional roles, enhancing evidence grounding in vision-language models.
Contribution
SRPO refines the sequence-level GRPO advantage into role-specific token-level advantages without altering the original reward function.
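To make the idea of turning one sequence-level signal into per-token credit concrete, here is a minimal Python sketch. The first function is the standard GRPO-style group normalization (using the population standard deviation, which is an assumption; implementations differ). The second function, its name, and the role weights are hypothetical illustrations only: this summary does not state SRPO's actual refinement rule.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def role_weighted_token_advantages(seq_advantage, roles, role_weights):
    """Broadcast one sequence-level advantage to every token, scaled by role.

    Hypothetical rule for illustration; not the rule used by SRPO.
    """
    return [seq_advantage * role_weights[role] for role in roles]


# Four sampled responses to one prompt, scored by a verifiable 0/1 reward.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Token roles for the first response, and made-up per-role weights.
roles = ["perception", "perception", "reasoning"]
weights = {"perception": 1.5, "reasoning": 1.0}
token_adv = role_weighted_token_advantages(advantages[0], roles, weights)
```

Note that the reward values themselves are never modified here; only the way the resulting advantage is distributed over tokens changes, which mirrors the claim that the original reward function stays untouched.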
Findings
SRPO improves evidence-grounded reasoning across benchmarks.
Role-aware credit assignment enhances visual evidence utilization.
The method does not require external reward models or teachers.
Abstract
Reinforcement learning from verifiable rewards (RLVR), especially with Group Relative Policy Optimization (GRPO), has shown strong potential for improving the reasoning capabilities of large vision-language models (LVLMs). However, in multimodal reasoning, final-answer rewards are typically assigned at the sequence level and do not distinguish the functional roles of different tokens, making it difficult to determine whether a correct answer is supported by task-relevant visual evidence. In this paper, we revisit multimodal RLVR from the perspective of role-aware token-level credit assignment, where structured responses are decomposed into perception tokens for extracting visual evidence and reasoning tokens for deriving answers from that evidence. Based on this perspective, we propose Structured Role-aware Policy Optimization (SRPO), which refines the sequence-level GRPO advantage into…
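The decomposition of a structured response into perception and reasoning tokens can be sketched as follows. The `<perception>` and `<reasoning>` tag names, the `label_roles` helper, and the whitespace tokenization are all assumptions made for illustration; the truncated abstract does not specify the response format or tokenizer.

```python
import re

# Hypothetical tag names; the actual response template is not given in this summary.
SPAN = re.compile(r"<(perception|reasoning)>(.*?)</\1>", re.DOTALL)


def label_roles(response):
    """Return (token, role) pairs for whitespace tokens inside tagged spans.

    Text outside the two tagged spans is ignored in this sketch.
    """
    labeled = []
    for match in SPAN.finditer(response):
        role, text = match.group(1), match.group(2)
        labeled.extend((tok, role) for tok in text.split())
    return labeled


example = (
    "<perception>The bar for 2020 is tallest.</perception>"
    "<reasoning>So the answer is 2020.</reasoning>"
)
pairs = label_roles(example)
```

A per-token role label of this kind is what a role-aware method would need in order to treat evidence-extraction tokens differently from answer-derivation tokens.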
