From Exploration to Exploitation: A Two-Stage Entropy RLVR Approach for Noise-Tolerant MLLM Training

Donglai Xu; Hongzheng Yang; Yuzhi Zhao; Pingping Zhang; Jinpeng Chen; Wenao Ma; Zhijian Hou; Mengyang Wu; Xiaolei Li; Senkang Hu; Ziyi Guan; Jason Chun Lok Li; Lai Man Po

arXiv:2511.07738 · cs.LG · March 31, 2026

TL;DR

This paper introduces a two-stage entropy optimization method for reinforcement learning with verifiable rewards in multimodal large language models, improving noise tolerance and training stability.

Contribution

The paper proposes a phased, token-level approach that switches from entropy maximization to entropy minimization during training, improving robustness to noisy labels in MLLM training.

Findings

01. Outperforms prior methods across three MLLM backbones and various noise settings.

02. Effectively balances exploration and exploitation during training.

03. Achieves superior performance on multiple tasks with noisy data.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) for Multimodal Large Language Models (MLLMs) is highly dependent on high-quality labeled data, which is often scarce and prone to substantial annotation noise in real-world scenarios. Existing unsupervised RLVR methods, including pure entropy minimization, can overfit to incorrect labels and limit the crucial reward ranking signal for Group-Relative Policy Optimization (GRPO). To address these challenges and enhance noise tolerance, we propose a novel two-stage, token-level entropy optimization method for RLVR. This approach dynamically guides the model from exploration to exploitation during training. In the initial exploration phase, token-level entropy maximization promotes diverse and stochastic output generation, serving as a strong regularizer that prevents premature convergence to noisy labels and ensures sufficient…
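The two-stage idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the hard switch at `switch_step`, the weight `beta`, and every function name below are hypothetical choices made for illustration, and the abstract as quoted here does not specify the actual schedule or loss. The sketch computes token-level entropy from logits, flips the sign of an entropy bonus between the exploration and exploitation stages, and shows the standard group-relative advantage normalization used in GRPO, where identical rewards within a group yield zero advantage.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) at one token position."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_token_entropy(logits_per_token):
    """Average token-level entropy over a generated sequence."""
    return sum(token_entropy(l) for l in logits_per_token) / len(logits_per_token)


def entropy_sign(step, switch_step):
    """Assumed hard switch: +1 while exploring (maximize entropy), -1 afterwards."""
    return 1.0 if step < switch_step else -1.0


def entropy_regularized_loss(policy_loss, logits_per_token, step, switch_step, beta=0.01):
    """Hypothetical combined objective: policy_loss - sign * beta * H.

    Minimizing it pushes entropy up in stage 1 and down in stage 2.
    """
    h = mean_token_entropy(logits_per_token)
    return policy_loss - entropy_sign(step, switch_step) * beta * h


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: rewards standardized within one sampled group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]
```

For instance, `group_relative_advantages([1.0, 1.0, 1.0, 1.0])` returns all zeros, which shows why a group of near-identical outputs carries no ranking signal, while the same entropy term lowers the loss before `switch_step` and raises it afterwards.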
