TL;DR
This paper introduces VGPO, a framework that enhances visual focus in vision-language models during reasoning by using visual attention compensation and a dual advantage re-weighting strategy.
Contribution
VGPO is a novel approach that mitigates visual forgetting and improves visual attention in multimodal reasoning models.
Findings
VGPO improves visual activation in reasoning tasks.
VGPO achieves better performance in multimodal reasoning.
VGPO effectively counters visual forgetting during reasoning.
Abstract
Reinforcement learning with verifiable rewards (RLVR) has significantly advanced the reasoning ability of vision-language models (VLMs). However, the inherent text-dominated nature of VLMs often leads to insufficient visual faithfulness, characterized by sparse attention activation to visual tokens. More importantly, our empirical analysis reveals that temporal visual forgetting along reasoning steps exacerbates this deficiency. To bridge this gap, we propose Visually-Guided Policy Optimization (VGPO), a novel framework to reinforce visual focus during policy optimization. Specifically, VGPO first introduces a Visual Attention Compensation mechanism that leverages visual similarity to localize and amplify visual cues, while progressively elevating visual expectations in later steps to counteract visual forgetting. Building on this mechanism, we implement a dual-grained advantage…
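The idea of raising visual expectations in later reasoning steps can be illustrated with a small sketch. This is not the paper's formulation: the function names, the linear schedule, and the `base` and `gain` parameters are all hypothetical, and the similarity-based localization mentioned in the abstract is left out. The sketch only shows how a step-dependent expectation could turn a shortfall in visual attention into a per-step weight.

```python
from typing import List


def visual_attention_ratio(attn: List[float], is_visual: List[bool]) -> float:
    """Fraction of one step's attention mass that lands on visual tokens."""
    total = sum(attn)
    if total <= 0:
        return 0.0
    visual = sum(a for a, v in zip(attn, is_visual) if v)
    return visual / total


def expected_ratio(step: int, num_steps: int,
                   base: float = 0.2, gain: float = 0.2) -> float:
    """Hypothetical schedule: expectation rises linearly with the step index."""
    if num_steps <= 1:
        return base
    return base + gain * step / (num_steps - 1)


def compensation_weights(step_attn: List[List[float]], is_visual: List[bool],
                         base: float = 0.2, gain: float = 0.2) -> List[float]:
    """Per-step weight >= 1 that grows with the shortfall below the expectation.

    Assumption for illustration only: weight = 1 + max(0, expected - observed).
    """
    n = len(step_attn)
    weights = []
    for t, attn in enumerate(step_attn):
        observed = visual_attention_ratio(attn, is_visual)
        shortfall = max(0.0, expected_ratio(t, n, base, gain) - observed)
        weights.append(1.0 + shortfall)
    return weights


if __name__ == "__main__":
    # Two reasoning steps over four tokens; the first two tokens are visual.
    is_visual = [True, True, False, False]
    step_attn = [
        [0.3, 0.2, 0.3, 0.2],    # early step: plenty of visual attention
        [0.05, 0.05, 0.5, 0.4],  # later step: attention drifts to text
    ]
    print(compensation_weights(step_attn, is_visual))
```

Under these assumptions, a step that already meets its expectation keeps a weight of 1, while a later step whose visual attention has decayed receives a larger weight, which is one simple way to picture counteracting visual forgetting.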
