TL;DR
This paper introduces KAWHI, a reward reweighting mechanism that enhances large vision-language models by explicitly integrating structured visual information into reinforcement learning, improving multimodal reasoning.
Contribution
KAWHI provides a novel, plug-and-play method for incorporating visual structure into reward optimization, boosting reasoning performance in LVLMs.
Findings
KAWHI consistently improves performance on reasoning benchmarks across models.
It effectively localizes salient visual regions for better alignment.
KAWHI enhances the coupling of visual evidence with reasoning steps.
Abstract
Reinforcement Learning from Verifiable Rewards (RLVR) has substantially enhanced the reasoning capabilities of large language models in abstract reasoning tasks. However, its application to Large Vision-Language Models (LVLMs) remains constrained by a structural representational bottleneck. Existing approaches generally lack explicit modeling and effective utilization of visual information, preventing visual representations from being tightly coupled with the reinforcement learning optimization process and thereby limiting further improvements in multimodal reasoning performance. To address this limitation, we propose KAWHI (Key-Region Aligned Weighted Harmonic Incentive), a plug-and-play reward reweighting mechanism that explicitly incorporates structured visual information into uniform reward policy optimization methods (e.g., GRPO and GSPO). The method adaptively localizes…
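To make "uniform reward" and "reward reweighting" concrete, the following is a minimal sketch and not KAWHI's actual formulation, which the truncated abstract does not give. It assumes GRPO-style group-relative advantages (reward minus group mean, divided by the group's population standard deviation) broadcast identically to every token of a response. The per-token `weights` passed to the reweighting function are purely hypothetical placeholders for some visual-relevance score. The function names and the mean-one normalization are also illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each reward within its sampled group.

    Assumption: population standard deviation; implementations differ here.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def uniform_token_advantages(advantage, num_tokens):
    """Uniform reward: every token in a response gets the same advantage."""
    return [advantage] * num_tokens


def reweighted_token_advantages(advantage, weights):
    """Hypothetical reweighting: scale the advantage per token.

    `weights` stands in for some per-token visual-relevance score (an
    assumption, not the paper's definition). Weights are rescaled to have
    mean 1, so the summed advantage matches the uniform case.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    n = len(weights)
    return [advantage * w * n / total for w in weights]


# Four sampled responses to one prompt, with verifiable 0/1 rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# First response, three tokens: uniform vs. hypothetically reweighted credit.
uniform = uniform_token_advantages(advantages[0], 3)
reweighted = reweighted_token_advantages(advantages[0], [0.5, 2.0, 0.5])
```

In this sketch the reweighted advantages sum to the same total as the uniform ones. Only the distribution of credit across tokens changes, which is one plausible reading of "plug-and-play" for methods such as GRPO and GSPO.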
