Your Group-Relative Advantage Is Biased
Fengkai Yang, Zherui Chen, Xiaohan Wang, Xiaodong Lu, Jiajun Chai, Guojun Yin, Wei Lin, Shuai Ma, Fuzhen Zhuang, Deqing Wang, Yaodong Yang, Jianxin Li, Yikun Ban

TL;DR
This paper reveals a bias in group-relative advantage estimation in RLVR for language models, demonstrating its impact on training and proposing an adaptive reweighting method, HA-DW, to improve performance.
Contribution
The paper provides the first theoretical analysis of bias in group-relative advantage estimation and introduces HA-DW, a method to correct this bias in RLVR training.
Findings
The group-relative estimator underestimates advantages for hard prompts and overestimates them for easy prompts.
HA-DW improves performance across five reasoning benchmarks.
Correcting advantage bias enhances robustness and efficiency of RLVR.
Abstract
Reinforcement Learning from Verifier Rewards (RLVR) has emerged as a widely used approach for post-training large language models on reasoning tasks, with group-based methods such as GRPO and its variants gaining broad adoption. These methods rely on group-relative advantage estimation to avoid learned critics, yet its theoretical properties remain poorly understood. In this work, we uncover a fundamental issue of group-based RL: the group-relative advantage estimator is inherently biased relative to the true (expected) advantage. We provide the first theoretical analysis showing that it systematically underestimates advantages for hard prompts and overestimates them for easy prompts, leading to imbalanced exploration and exploitation. To address this issue, we propose History-Aware Adaptive Difficulty Weighting (HA-DW), an adaptive reweighting scheme that adjusts advantage estimates…
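One way a bias of this direction can arise is sketched below. The summary above does not spell out the mechanism, so the setup is an assumption for illustration only: rewards are binary with success probability p, the group mean serves as the baseline, and groups whose responses are all correct or all incorrect are dropped because they carry no learning signal. The function names are hypothetical, and the sketch ignores the standard-deviation normalization used in GRPO. It does not implement HA-DW.

```python
def conditional_group_mean(p: float, group_size: int) -> float:
    """Exact E[group mean | group is non-degenerate] for Bernoulli(p) rewards.

    Assumption (not from the source): groups with 0 or group_size successes
    are discarded, so the baseline is only ever computed on mixed groups.
    """
    if not 0.0 < p < 1.0 or group_size < 2:
        raise ValueError("need 0 < p < 1 and group_size >= 2")
    q = 1.0 - p
    # Probability that the group contains both successes and failures.
    keep_prob = 1.0 - p**group_size - q**group_size
    # The sum over k = 1..G-1 of (k/G) * P(K = k) simplifies to p - p**G.
    return (p - p**group_size) / keep_prob


def advantage_bias(p: float, group_size: int) -> float:
    """Shift of the unnormalized advantage (reward - baseline) caused by
    using the conditional group mean instead of the true success rate p."""
    return p - conditional_group_mean(p, group_size)


if __name__ == "__main__":
    for p in (0.1, 0.3, 0.5, 0.7, 0.9):
        baseline = conditional_group_mean(p, group_size=8)
        bias = advantage_bias(p, group_size=8)
        print(f"p={p:.1f}  E[baseline | kept]={baseline:.4f}  advantage bias={bias:+.4f}")
```

Under these assumptions the baseline is inflated when p < 0.5 (hard prompts), which pushes advantages down, and deflated when p > 0.5 (easy prompts), which pushes them up. At p = 0.5 the two effects cancel.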
Taxonomy
Topics: Topic Modeling · Reinforcement Learning in Robotics · Explainable Artificial Intelligence (XAI)
