Asymmetric Advantage Modulation Calibrates Entropy Dynamics in RLVR
Hengrui Gu, Xiaotian Han, Yujing Bian, Feiyi Wang, Kaixiong Zhou

TL;DR
This paper introduces AsymGRPO, a novel advantage modulation method for RLVR that selectively enhances productive entropy and suppresses noisy entropy, improving reasoning performance in large language models.
Contribution
It proposes a channel-wise advantage modulation approach that decouples positive and negative advantage updates, enabling more precise control over exploration and exploitation in RLVR.
Findings
AsymGRPO outperforms existing RLVR methods on five reasoning benchmarks.
Decoupling advantage channels improves the model's reasoning accuracy.
Flexible modulation of advantage channels enhances learning across prompt difficulties.
Abstract
Reinforcement learning with verifiable rewards (RLVR) has substantially improved the reasoning ability of large language models (LLMs), but it often suffers from restricted exploration, where the policy rapidly concentrates on a narrow set of solutions. A common remedy is entropy regularization, which attempts to preserve exploration by increasing policy entropy. However, for LLM-RL, this intervention is highly sensitive to its coefficient, can introduce semantically weak uncertainty, and often yields limited accuracy gains. This motivates a more precise question: which entropy helps reasoning, and which entropy should be reduced? To study this, we parameterize the advantage estimator in Group Relative Policy Optimization (GRPO) into positive and negative outcome-conditioned channels and analyze their entropy dynamics. Our results show that positive-channel modulation raises…
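The idea of splitting the GRPO advantage into positive and negative channels can be illustrated with a minimal sketch. This is not the paper's estimator: the function names and the `pos_scale` / `neg_scale` coefficients are hypothetical, and the sketch assumes the standard group-normalized GRPO advantage (reward minus group mean, divided by group standard deviation) with each channel simply rescaled by its own constant.

```python
import numpy as np


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize rewards within one prompt's group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


def modulated_advantages(rewards, pos_scale=1.0, neg_scale=1.0, eps=1e-8):
    """Hypothetical channel-wise modulation (an assumption, not the paper's formula).

    Positive advantages are multiplied by pos_scale and negative advantages
    by neg_scale, so the two channels can be tuned independently. With binary
    verifiable rewards and a mixed group, the sign of the advantage coincides
    with whether the response was judged correct.
    """
    adv = grpo_advantages(rewards, eps)
    return np.where(adv > 0, pos_scale * adv, neg_scale * adv)


# One group of four sampled responses with binary verifier rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
baseline = modulated_advantages(rewards)  # both scales 1.0: plain GRPO
asymmetric = modulated_advantages(rewards, pos_scale=2.0, neg_scale=0.5)
```

Setting both scales to 1.0 recovers the unmodified group-normalized advantage, which makes the effect of each channel easy to isolate in an ablation. A group whose responses all receive the same reward yields zero advantage in both channels regardless of the scales.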
