Balanced Aggregation: Understanding and Fixing Aggregation Bias in GRPO
Zhiyuan Zeng, Jiameng Huang, Zhangyue Yin, Jiashuo Liu, Ziniu Li, Bingrui Li, Yuhao Wu, Yining Zheng, Ge Zhang, Wenhao Huang, Xipeng Qiu

TL;DR
This paper introduces Balanced Aggregation (BA), a simple method for aggregating token-level policy gradient terms in reinforcement learning with verifiable rewards (RLVR). BA reduces aggregation bias and improves training stability and performance for large language models.
Contribution
It identifies the optimization biases induced by token aggregation and sequence aggregation, and proposes BA as a drop-in alternative for RLVR training of language models.
Findings
BA improves training stability across multiple benchmarks.
BA outperforms standard token and sequence aggregation methods.
Aggregation bias is influenced by response length variation and by the length gap between positive and negative responses.
Abstract
Reinforcement learning with verifiable rewards (RLVR) has become a central paradigm for improving reasoning and code generation in large language models, and GRPO-style training is widely adopted for its simplicity and effectiveness. However, an important design choice remains underexplored: how token-level policy gradient terms are aggregated within each sampled group. Standard GRPO uses sequence aggregation, while recent work has advocated token aggregation as a better alternative. We show that these two rules induce different optimization biases: token aggregation introduces sign-length coupling, while sequence aggregation implicitly downweights longer responses through sequence-level equal weighting. To address this tension, we propose Balanced Aggregation (BA), a simple drop-in replacement that computes token-level means separately within the positive and negative subsets…
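The aggregation rules contrasted in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes each response in a group is represented by a list of scalar per-token terms (stand-ins for the real policy gradient terms), and that the positive and negative subsets are defined by the sign of a per-response advantage. The function names are hypothetical. The abstract is truncated before it says how BA combines the two subset means, so the sketch stops at computing them separately.

```python
from typing import List, Tuple


def token_aggregate(terms: List[List[float]]) -> float:
    """Token aggregation: a single mean over every token in the group."""
    total_tokens = sum(len(seq) for seq in terms)
    return sum(sum(seq) for seq in terms) / total_tokens


def sequence_aggregate(terms: List[List[float]]) -> float:
    """Sequence aggregation: mean over tokens per response, then mean over responses."""
    return sum(sum(seq) / len(seq) for seq in terms) / len(terms)


def subset_token_means(
    terms: List[List[float]], advantages: List[float]
) -> Tuple[float, float]:
    """Token-level means computed separately within the positive and negative subsets.

    Assumption: subsets are split by the sign of the per-response advantage.
    How the two means are combined afterwards is not stated in the excerpt.
    """
    pos = [seq for seq, adv in zip(terms, advantages) if adv > 0]
    neg = [seq for seq, adv in zip(terms, advantages) if adv < 0]
    pos_mean = token_aggregate(pos) if pos else 0.0
    neg_mean = token_aggregate(neg) if neg else 0.0
    return pos_mean, neg_mean


# Toy group: a short positive response (2 tokens) and a long negative one (8 tokens).
advantages = [1.0, -1.0]
terms = [[1.0] * 2, [-1.0] * 8]

token_result = token_aggregate(terms)
sequence_result = sequence_aggregate(terms)
pos_mean, neg_mean = subset_token_means(terms, advantages)

print(token_result, sequence_result, pos_mean, neg_mean)
```

In this toy group, token aggregation lets the longer negative response dominate the total, which is one way to picture sign-length coupling. Sequence aggregation gives both responses equal weight, so each token of the 8-token response counts for a quarter as much as each token of the 2-token response.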
