MIRL: Mutual Information-Guided Reinforcement Learning for Vision-Language Models

Yin Zhang; Jiaxuan Zhao; Zonghan Wu; Zengxiang Li; Junfeng Fang; Kun Wang; Qingsong Wen; Yilei Shao

arXiv:2605.01520·cs.CV·May 5, 2026

TL;DR

MIRL introduces a mutual information-guided reinforcement learning framework that improves vision-language reasoning by efficiently allocating sampling resources and distinguishing perception errors from reasoning failures.

Contribution

The paper proposes an MI-based decoupled framework that strengthens visual perception and reasoning in vision-language models while reducing wasted sampling and reward blindness.

Findings

01. Achieves 70.22% average accuracy across six benchmarks.

02. With only 10 pre-samples, surpasses the performance of sampling 16 full trajectories.

03. Reduces complete trajectory sampling by 25%.

Abstract

Vision-Language Models (VLMs) frequently suffer from visual perception errors and hallucinations that compromise answer accuracy in complex reasoning tasks. Reinforcement Learning with Verifiable Rewards (RLVR) offers a promising solution by optimizing policies using answer correctness signals. Despite their effectiveness, prevailing RLVR methods face two critical limitations. First, much of the sampling budget is wasted on trajectories doomed to fail due to early visual description errors. Second, sparse rewards cannot distinguish whether failures stem from visual perception or reasoning stages. We introduce MIRL, a decoupled framework that addresses both limitations by leveraging mutual information (MI) between generated descriptions and visual inputs as a cheap pre-screening signal. This enables intelligent budget allocation toward high-potential trajectories via forking, while…
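The abstract describes using MI scores on cheap description pre-samples to decide which ones are forked into full trajectories. The sketch below illustrates one way such a budget allocation could look. It is not the authors' method: the abstract does not specify the MI estimator or the forking rule, so `allocate_forks`, its parameters (`budget`, `keep`, `temperature`), and the softmax-weighted split are all hypothetical, and the MI scores are taken as given inputs.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def allocate_forks(
    mi_scores: Sequence[float],
    budget: int,
    keep: int,
    temperature: float = 1.0,
) -> List[int]:
    """Split `budget` full rollouts among the `keep` highest-scoring pre-samples.

    Hypothetical rule (an assumption, not taken from the paper):
    every kept pre-sample gets one rollout, and the remaining rollouts
    are handed out one at a time in proportion to softmax(MI score).
    Pre-samples outside the top `keep` get zero rollouts.
    """
    if not 0 < keep <= len(mi_scores):
        raise ValueError("keep must be between 1 and the number of pre-samples")
    if budget < keep:
        raise ValueError("budget must cover at least one rollout per kept pre-sample")

    # Indices of the `keep` pre-samples with the highest MI score.
    top = sorted(range(len(mi_scores)), key=lambda i: mi_scores[i], reverse=True)[:keep]
    weights = softmax([mi_scores[i] for i in top], temperature)

    # Hand out the spare rollouts greedily (highest-averages rule),
    # which guarantees the total is exactly `budget`.
    extra = [0] * keep
    for _ in range(budget - keep):
        j = max(range(keep), key=lambda k: weights[k] / (extra[k] + 1))
        extra[j] += 1

    forks = [0] * len(mi_scores)
    for j, i in enumerate(top):
        forks[i] = 1 + extra[j]
    return forks


if __name__ == "__main__":
    # Made-up MI scores for four description pre-samples.
    print(allocate_forks([0.1, 2.0, 0.5, 1.5], budget=6, keep=2))
```

Low-scoring descriptions receive no full rollouts here, which is the sense in which a pre-screening signal could avoid spending budget on trajectories that start from a poor visual description.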

Code & Models

Repositories

https://anonymous.4open.science/r/mirl-main
