TL;DR
This paper introduces Perception-Grounded Policy Optimization (PGPO), a fine-grained policy optimization method that improves multimodal reasoning in large vision-language models by reshaping token-level advantages to emphasize visually grounded tokens, with a reported average gain of 18.7% across benchmarks.
Contribution
The paper proposes PGPO, a new token-level advantage reshaping framework that improves learning signals for visually-dependent tokens in large vision-language models.
Findings
PGPO boosts model performance by 18.7% on average across benchmarks.
It reduces gradient variance and prevents training collapse.
PGPO acts as an effective regularizer for perception-grounded reasoning.
Abstract
While Reinforcement Learning from Verifiable Rewards (RLVR) has advanced reasoning in Large Vision-Language Models (LVLMs), prevailing frameworks suffer from a foundational methodological flaw: by distributing identical advantages across all generated tokens, these methods inherently dilute the learning signals essential for optimizing the critical, visually-grounded steps of multimodal reasoning. To bridge this gap, we formulate Token Visual Dependency, quantifying the causal information gain of visual inputs via the Kullback-Leibler (KL) divergence between visual-conditioned and text-only predictive distributions. Revealing that this dependency is highly sparse and semantically pivotal, we introduce Perception-Grounded Policy Optimization (PGPO), which is a novel fine-grained credit assignment framework that dynamically reshapes advantages at the token level. Through a…
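The abstract defines Token Visual Dependency as a per-token KL divergence between the model's next-token distribution with the image and the same distribution with text only. The sketch below illustrates that quantity on toy logits. It is not the authors' implementation: the direction KL(visual || text-only), the function names, and the mean-normalized reweighting in `reshape_advantages` are all assumptions made for illustration, since the abstract is truncated before it describes how PGPO actually reshapes advantages.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def token_visual_dependency(logits_with_image, logits_text_only):
    """Per-token KL between visual-conditioned and text-only predictions.

    Assumption: the divergence is taken as KL(visual || text-only).
    """
    return [
        kl_divergence(softmax(v), softmax(t))
        for v, t in zip(logits_with_image, logits_text_only)
    ]


def reshape_advantages(advantage, dependency):
    """HYPOTHETICAL reweighting, not the paper's formula.

    Scales one shared sequence-level advantage by each token's dependency,
    normalized so the weights average to 1.
    """
    mean_dep = sum(dependency) / len(dependency)
    if mean_dep == 0:
        return [advantage] * len(dependency)
    return [advantage * d / mean_dep for d in dependency]


# Toy example: 3 generated tokens over a 4-word vocabulary.
with_image = [[2.0, 0.5, 0.1, 0.1], [0.1, 3.0, 0.2, 0.1], [1.0, 1.0, 1.0, 1.0]]
text_only = [[2.0, 0.5, 0.1, 0.1], [1.5, 0.3, 0.2, 0.1], [1.0, 1.0, 1.0, 1.0]]

dependency = token_visual_dependency(with_image, text_only)
token_advantages = reshape_advantages(1.0, dependency)
```

In this toy input only the second token's distribution changes when the image is removed, so it is the only token with nonzero dependency and it receives all of the reweighted advantage, which mirrors the sparsity the abstract describes.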
