Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning
Zhuoxu Huang, Mengxi Jia, Hao Sun, Xuelong Li, Jungong Han

TL;DR
This paper introduces CalibRL, a hybrid-policy RLVR framework that enables controllable exploration in multi-modal reasoning models, improving training stability and performance by balancing exploration and exploitation through expert-guided mechanisms.
Contribution
CalibRL is a novel hybrid-policy RLVR method that uses distribution-aware advantage weighting and asymmetric activation to control exploration, addressing challenges of entropy collapse and policy degradation.
Findings
CalibRL achieves consistent improvements across eight benchmarks.
The framework effectively balances exploration and exploitation.
It stabilizes training by mitigating distributional mismatch.
Abstract
Reinforcement learning with verifiable rewards (RLVR) has emerged as a primary learning paradigm for enhancing the reasoning capabilities of multi-modal large language models (MLLMs). However, during RL training, the enormous state space of MLLMs and sparse rewards often lead to entropy collapse, policy degradation, or over-exploitation of suboptimal behaviors. This necessitates an exploration strategy that maintains productive stochasticity while avoiding the inefficiency of uncontrolled random sampling. In this paper, we propose CalibRL, a hybrid-policy RLVR framework that supports controllable exploration with expert guidance, enabled by two key mechanisms. First, a distribution-aware advantage weighting scales updates by group rareness to calibrate the distribution, thereby preserving exploration. Meanwhile, the asymmetric activation function…
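The two mechanisms named in the abstract can be illustrated with a small sketch. This is not the paper's formulation: the abstract is truncated and gives no equations, so the rarity weight (one minus the share of the group with the same reward), the LeakyReLU form of the asymmetric gate, the sign convention of the log-probability gap, and all function names below are assumptions chosen for illustration. Only the GRPO-style group standardization of rewards is a standard, well-known step.

```python
import math
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]


def rarity_weights(rewards: List[float]) -> List[float]:
    """Hypothetical rarity weight (an assumption, not the paper's formula):
    1 minus the fraction of the group that received the same reward."""
    n = len(rewards)
    return [1.0 - rewards.count(r) / n for r in rewards]


def leaky_relu(x: float, negative_slope: float = 0.1) -> float:
    """Asymmetric activation: identity for x >= 0, scaled-down slope below 0."""
    return x if x >= 0 else negative_slope * x


def expert_gate(policy_logp: float, expert_logp: float,
                negative_slope: float = 0.1) -> float:
    """Assumed gate on the expert-minus-policy log-prob gap.
    Full signal when the policy under-weights the expert response,
    a small leaked signal otherwise. The sign convention is an assumption."""
    return leaky_relu(expert_logp - policy_logp, negative_slope)


# One group of four sampled responses with binary verifiable rewards.
rewards = [1.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)
weights = rarity_weights(rewards)
# Rare outcomes (the single correct response) keep most of their advantage;
# common outcomes are scaled down.
weighted = [a * w for a, w in zip(advantages, weights)]
```

In this toy group the single correct response is the rare outcome, so it receives the largest weight, while the three identical failures are down-weighted. The gate is kept separate because how the paper combines it with the weighted advantage is not stated in the text above.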
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper is well written and easy to follow. It provides a nice summary of existing methods as well. 2. The proposed algorithm is simple and novel. 3. The empirical performance is strong.
1. The algorithm seems applicable beyond the visual reasoning domain, but this is not discussed. 2. The effectiveness of the algorithm is not verified at larger scales, such as 30B. This may be too much to ask, though, especially if the paper comes from academia.
1. This proposed method introduces a clear hybrid‑policy objective that treats expert trajectories as a relative reference, which helps maintain entropy while steering updates toward verified behaviors. The design uses an intuitive pair of knobs—an asymmetric gate on the policy‑vs‑expert log‑prob gap and a group‑rarity magnitude—that integrates cleanly with GRPO and is easy to implement. 2. Simple, general mechanism: The LeakyReLU‑gated is an elegant way to use demonstrations for relative guida
1. GeoEval split clarity: The paper constructs GeoEval from validation failures of GPT‑4o CoT filtering, then reports it as a test benchmark with the largest deltas. Please clarify whether this split was ever used for hyper-parameter tuning or early stopping. If yes, results could be optimistically biased; if no, state this explicitly and detail safeguards. 2. Baselines for entropy control: Since the contribution is controllable exploration, it misses comparisons to standard entropy‑regularized
1. The paper is clearly written and easy to follow. 2. It addresses an important and widespread problem: under the SFT-then-RL paradigm, the policy becomes tightly anchored to the expert distribution during the SFT stage. This restricts exploration to the local neighborhood of expert behaviors, making it difficult to adapt to reward signals or discover better reasoning trajectories. 3. The paper provides comprehensive and convincing ablation studies that support
1. Most evaluations are math benchmarks. The claim of general multi-modal reasoning would be more convincing with benchmarks involving richer visual, linguistic, or commonsense reasoning modalities (e.g., ScienceQA, MMMU, or multimodal dialogue tasks).
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics
