TL;DR
This paper introduces TW-GRPO, a reinforcement learning framework that improves video reasoning by focusing on salient information and utilizing dense rewards, leading to state-of-the-art results on multiple benchmarks.
Contribution
The paper proposes TW-GRPO, which enhances visual reasoning with token weighting, multi-choice rewards, and data augmentation, addressing key limitations of previous methods.
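To make the idea of a multi-choice reward concrete, here is a minimal sketch of a partial-credit reward for questions with several correct options. The text above does not specify the scoring rule, so the intersection-over-union score used here, along with the function name `multi_choice_reward`, is an assumption made for illustration rather than the paper's definition.

```python
def multi_choice_reward(predicted, correct):
    """Partial-credit reward in [0, 1] for a multi-answer question.

    Assumption: the score is the intersection-over-union of the predicted
    and correct option sets. A binary reward would instead return 1.0 only
    when the two sets match exactly.
    """
    pred, gold = set(predicted), set(correct)
    if not pred and not gold:
        # Nothing to select and nothing selected: treat as fully correct.
        return 1.0
    return len(pred & gold) / len(pred | gold)


# A partially correct answer earns a graded signal instead of zero.
partial = multi_choice_reward(["A", "B"], ["A", "B", "C"])
exact = multi_choice_reward(["A", "C"], ["A", "C"])
wrong = multi_choice_reward(["D"], ["A", "B", "C"])
```

Under this rule, an answer that selects two of three correct options scores 2/3 rather than 0, which is the kind of denser signal the contribution statement describes.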
Findings
Achieves 50.4% accuracy on CLEVRER, outperforming previous models.
Improves MMVU accuracy to 65.8%.
Demonstrates effectiveness of focused reasoning and dense rewards.
Abstract
Recent advancements in reinforcement learning, particularly through Group Relative Policy Optimization (GRPO), have significantly improved multimodal large language models for complex reasoning tasks. However, two critical limitations persist: 1) they often produce unfocused, verbose reasoning chains that obscure salient spatiotemporal cues and 2) binary rewarding fails to account for partially correct answers, resulting in high reward variance and inefficient learning. In this paper, we propose TW-GRPO, a novel framework that enhances visual reasoning with focused thinking and dense reward granularity. Specifically, we employ a token weighting mechanism that prioritizes tokens with high informational density (estimated by intra-group information entropy), suppressing redundant tokens like generic reasoning prefixes. Furthermore, we reformulate RL training by shifting from…
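The token weighting idea in the abstract can be illustrated with a small sketch. The abstract only says that informational density is estimated by intra-group information entropy, so everything specific below is an assumption: the function name `positional_entropy_weights`, the use of empirical token frequencies at each position across the sampled group, the padding token, and the normalization to a mean weight of 1.

```python
import math
from collections import Counter


def positional_entropy_weights(group, pad="<pad>"):
    """Per-position weights from entropy across a group of sampled completions.

    group: list of token lists, all sampled for the same prompt.
    Assumption: a position where every sample emits the same token (for
    example a shared reasoning prefix) has zero entropy and gets zero weight;
    positions where samples disagree get larger weights.
    """
    max_len = max(len(seq) for seq in group)
    n = len(group)
    entropies = []
    for t in range(max_len):
        # Shorter completions are padded so every position has n tokens.
        tokens = [seq[t] if t < len(seq) else pad for seq in group]
        counts = Counter(tokens)
        # Shannon entropy of the empirical token distribution at position t.
        h = sum((c / n) * math.log(n / c) for c in counts.values())
        entropies.append(h)
    mean_h = sum(entropies) / len(entropies)
    if mean_h == 0:
        # All samples are identical: fall back to uniform weights.
        return [1.0] * max_len
    # Normalize so the weights average to 1 over positions.
    return [h / mean_h for h in entropies]


# Three sampled completions sharing a generic prefix and differing at the end.
samples = [
    ["Let", "me", "think", "left"],
    ["Let", "me", "think", "right"],
    ["Let", "me", "think", "behind"],
]
weights = positional_entropy_weights(samples)
```

In this toy group the shared prefix receives zero weight and the final, differing token carries all of it. In a GRPO-style objective such weights would scale each token's contribution to the loss; how TW-GRPO does this exactly is not stated in the text above.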
