Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning

Zhuoxu Huang; Mengxi Jia; Hao Sun; Xuelong Li; Jungong Han

arXiv:2602.20197 · cs.LG · March 13, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces CalibRL, a hybrid-policy RLVR framework that enables controllable exploration in multi-modal reasoning models, improving training stability and performance by balancing exploration and exploitation through expert-guided mechanisms.

Contribution

CalibRL is a novel hybrid-policy RLVR method that uses distribution-aware advantage weighting and asymmetric activation to control exploration, addressing challenges of entropy collapse and policy degradation.

Findings

1. CalibRL achieves consistent improvements across eight benchmarks.
2. The framework effectively balances exploration and exploitation.
3. It stabilizes training by mitigating distributional mismatch.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has emerged as a primary learning paradigm for enhancing the reasoning capabilities of multi-modal large language models (MLLMs). However, during RL training, the enormous state space of MLLMs and sparse rewards often lead to entropy collapse, policy degradation, or over-exploitation of suboptimal behaviors. This necessitates an exploration strategy that maintains productive stochasticity while avoiding the inefficient exploration that uncontrolled random sampling produces. In this paper, we propose CalibRL, a hybrid-policy RLVR framework that supports controllable exploration with expert guidance, enabled by two key mechanisms. First, a distribution-aware advantage weighting scales updates by group rareness to calibrate the distribution, thereby preserving exploration. Meanwhile, the asymmetric activation function…
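
The two mechanisms named in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not the paper's formulation: the function names, the rarity weight (one minus the share of the group that received the same reward), the sign convention of the log-probability gap, and the `negative_slope` value are all hypothetical. The group-standardized advantage follows the usual GRPO-style normalization, which one of the reviews below says the method integrates with.

```python
import math
from collections import Counter


def group_advantages(rewards):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + 1e-8) for r in rewards]


def rarity_weights(rewards):
    """ASSUMPTION: weight = 1 - (share of the group with the same reward)."""
    counts = Counter(rewards)
    return [1.0 - counts[r] / len(rewards) for r in rewards]


def asymmetric_gate(log_prob_gap, negative_slope=0.1):
    """ASSUMPTION: LeakyReLU on a log-prob gap; negative gaps are damped."""
    if log_prob_gap >= 0:
        return log_prob_gap
    return negative_slope * log_prob_gap


# One group of four sampled responses; only the first is verified correct.
rewards = [1.0, 0.0, 0.0, 0.0]
weights = rarity_weights(rewards)
weighted = [w * a for w, a in zip(weights, group_advantages(rewards))]
```

In this toy group the single correct response is the rare outcome, so its advantage keeps a weight of 0.75 while each of the three common failures is scaled by 0.25. The gate is shown separately because the abstract excerpt is truncated before it explains how the two pieces are combined.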

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The paper is well written and easy to follow. It provides a nice summary of existing methods as well.
2. The proposed algorithm is simple and novel.
3. The empirical performance is strong.

Weaknesses

1. The algorithm seems applicable beyond the visual reasoning domain, but this is not discussed.
2. The effectiveness of the algorithm is not verified at larger scales, such as 30B. This may be too much to ask, though, especially if the paper comes from academia.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. This proposed method introduces a clear hybrid‑policy objective that treats expert trajectories as a relative reference, which helps maintain entropy while steering updates toward verified behaviors. The design uses an intuitive pair of knobs (an asymmetric gate on the policy‑vs‑expert log‑prob gap and a group‑rarity magnitude) that integrates cleanly with GRPO and is easy to implement.
2. Simple, general mechanism: The LeakyReLU‑gated is an elegant way to use demonstrations for relative guida

Weaknesses

1. GeoEval split clarity: The paper constructs GeoEval from validation failures of GPT‑4o CoT filtering, then reports it as a test benchmark with the largest deltas. Please clarify whether this split was ever used for hyper-parameter tuning or early stopping. If yes, results could be optimistically biased; if no, state this explicitly and detail safeguards.
2. Baselines for entropy control: Since the contribution is controllable exploration, it misses comparisons to standard entropy‑regularized

Reviewer 03 · Rating 6 · Confidence 2

Strengths

1. The paper is clearly written and easy to follow.
2. It addresses an important and widely existing problem, that under the SFT-then-RL paradigm, the policy becomes tightly anchored to the expert distribution during the SFT stage. This causes exploration to be restricted within the local neighborhood of expert behaviors, making it difficult to adapt to reward signals or discover more optimal reasoning trajectories.
3. The paper provides comprehensive and convincing ablation studies that support

Weaknesses

1. Most evaluations are math benchmarks. The claim of general multi-modal reasoning would be more convincing with benchmarks involving richer visual, linguistic, or commonsense reasoning modalities (e.g., ScienceQA, MMMU, or multimodal dialogue tasks).

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics