TL;DR
MIRL introduces a mutual information-guided reinforcement learning framework that improves vision-language reasoning by efficiently allocating sampling resources and distinguishing perception errors from reasoning failures.
Contribution
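MIRL proposes an MI-based decoupled framework that improves visual perception and reasoning in vision-language models. It targets two problems: sampling budget wasted on trajectories that fail early, and reward blindness, where a sparse answer-correctness reward cannot tell a perception failure from a reasoning failure.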
Findings
Achieves 70.22% average accuracy on six benchmarks.
Surpasses the performance of sampling 16 full trajectories while using only 10 pre-samples.
Reduces complete trajectory sampling by 25%.
Abstract
Vision-Language Models (VLMs) frequently suffer from visual perception errors and hallucinations that compromise answer accuracy in complex reasoning tasks. Reinforcement Learning with Verifiable Rewards (RLVR) offers a promising solution by optimizing policies using answer correctness signals. Despite their effectiveness, prevailing RLVR methods face two critical limitations. First, much of the sampling budget is wasted on trajectories doomed to fail due to early visual description errors. Second, sparse rewards cannot distinguish whether failures stem from visual perception or reasoning stages. We introduce MIRL, a decoupled framework that addresses both limitations by leveraging mutual information (MI) between generated descriptions and visual inputs as a cheap pre-screening signal. This enables intelligent budget allocation toward high-potential trajectories via forking, while…
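The budget-allocation idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes each pre-sampled visual description already has an MI score (the estimator is not specified here), and it uses a simple rule chosen only for illustration: keep the top-scoring descriptions and split the full-trajectory budget as evenly as possible among them. The function name `allocate_forks` and the parameters `budget` and `keep` are hypothetical.

```python
from typing import Dict, List


def allocate_forks(mi_scores: List[float], budget: int, keep: int) -> Dict[int, int]:
    """Map the index of each retained description to its number of forks.

    Hypothetical helper: the scores are assumed to come from some MI estimate
    between a description and the image. The even split is an assumption,
    not the rule used by MIRL.
    """
    if budget < 0 or keep <= 0:
        raise ValueError("budget must be >= 0 and keep must be > 0")
    if not mi_scores:
        return {}

    keep = min(keep, len(mi_scores))
    # Rank description indices by score, highest first; ties go to the lower index.
    ranked = sorted(range(len(mi_scores)), key=lambda i: (-mi_scores[i], i))[:keep]

    # Split the budget evenly; the top-ranked descriptions absorb the remainder.
    base, extra = divmod(budget, keep)
    return {idx: base + (1 if rank < extra else 0) for rank, idx in enumerate(ranked)}


# Four pre-sampled descriptions with made-up scores, five full rollouts to spend.
forks = allocate_forks([0.9, 0.1, 0.5, 0.7], budget=5, keep=2)
print(forks)  # {0: 3, 3: 2}
```

Descriptions that are not retained receive no full rollouts, which is where the saving in complete trajectories would come from under this assumed rule.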
