Asymmetric Advantage Modulation Calibrates Entropy Dynamics in RLVR

Hengrui Gu; Xiaotian Han; Yujing Bian; Feiyi Wang; Kaixiong Zhou

arXiv:2604.04894·cs.CL·May 13, 2026

TL;DR

This paper introduces AsymGRPO, an advantage modulation method for reinforcement learning with verifiable rewards (RLVR) that selectively enhances productive entropy and suppresses noisy entropy, improving reasoning performance in large language models.

Contribution

The paper proposes channel-wise advantage modulation, which decouples positive and negative advantage updates to allow more precise control over exploration and exploitation in RLVR.

Findings

01

AsymGRPO outperforms existing RLVR methods on five reasoning benchmarks.

02

Decoupling advantage channels improves the model's reasoning accuracy.

03

Flexible modulation of advantage channels enhances learning across prompt difficulties.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has substantially improved the reasoning ability of large language models (LLMs), but it often suffers from restricted exploration, where the policy rapidly concentrates on a narrow set of solutions. A common remedy is entropy regularization, which attempts to preserve exploration by increasing policy entropy. However, for LLM-RL, this intervention is highly sensitive to its coefficient, can introduce semantically weak uncertainty, and often yields limited accuracy gains. This motivates a more precise question: which entropy helps reasoning, and which entropy should be reduced? To study this, we parameterize the advantage estimator in Group Relative Policy Optimization (GRPO) into positive and negative outcome-conditioned channels and analyze their entropy dynamics. Our results show that positive-channel modulation raises…
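
As a rough illustration of the channel split described in the abstract, the sketch below computes group-relative advantages for one prompt's sampled responses and then scales the positive and negative channels independently. The function names and the `pos_scale` / `neg_scale` coefficients are hypothetical, and the paper's actual parameterization may differ; this is a minimal sketch under those assumptions, not AsymGRPO itself.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's group of sampled responses.

    Uses the population standard deviation; eps avoids division by zero
    when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def modulate_channels(advantages, pos_scale=1.0, neg_scale=1.0):
    """Scale positive and negative advantages with separate coefficients.

    ASSUMPTION: a single scalar per channel is an illustrative choice,
    not the modulation rule from the paper.
    """
    return [a * pos_scale if a > 0 else a * neg_scale for a in advantages]


# Binary verifiable rewards for four sampled responses to one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
adv = grpo_advantages(rewards)

# Amplify updates from correct responses, damp updates from incorrect ones.
mod = modulate_channels(adv, pos_scale=1.5, neg_scale=0.5)
print(adv)
print(mod)
```

With both scales set to 1.0 the sketch reduces to the unmodified group-relative advantages, so each coefficient can be varied on its own to observe its separate effect.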
