TL;DR
This paper identifies an implicit advantage symmetry in Group Relative Advantage Estimation (GRAE) that hampers exploration and difficulty adaptation in Reinforcement Learning with Verifiable Rewards (RLVR), and proposes A-GRAE to improve learning efficiency and performance.
Contribution
It uncovers the limitations of symmetric advantage estimation and introduces A-GRAE, a dynamic approach that enhances exploration and difficulty focus in RLVR.
Findings
A-GRAE outperforms standard GRPO across seven benchmarks.
Asymmetry in advantage estimation promotes better exploration.
Curriculum-like sample difficulty shifting improves learning efficiency.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR), particularly GRPO, has become the standard for eliciting LLM reasoning. However, its efficiency in exploration and difficulty adaptation remains an open challenge. In this work, we argue that these bottlenecks stem from an implicit advantage symmetry inherent in Group Relative Advantage Estimation (GRAE). This symmetry induces two critical limitations: (i) at the group level, strict symmetry in weights between correct and incorrect trajectories leaves unsampled action logits unchanged, thereby hindering exploration of novel correct solutions; (ii) at the sample level, the algorithm implicitly prioritizes medium-difficulty samples, remaining agnostic to the non-stationary demands of difficulty focus. Through controlled experiments, we reveal that this symmetric property is sub-optimal, yielding two pivotal insights: (i) asymmetrically…
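The two symmetry claims in the abstract can be illustrated with a small numerical sketch. It assumes the commonly used GRPO-style advantage, (reward minus group mean) divided by group standard deviation, applied to binary verifiable rewards. The function names, the `eps` stabilizer, and the group size of 8 are illustrative assumptions, not taken from the paper, and the sketch does not implement A-GRAE itself.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (assumed GRPO-style form)."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def positive_negative_mass(rewards):
    """Return total advantage weight on correct vs. incorrect trajectories."""
    adv = group_relative_advantages(rewards)
    return adv[adv > 0].sum(), -adv[adv < 0].sum()


def total_magnitude(num_correct, group_size):
    """Sum of |advantage| for a prompt where num_correct of group_size samples pass."""
    rewards = np.array([1.0] * num_correct + [0.0] * (group_size - num_correct))
    return np.abs(group_relative_advantages(rewards)).sum()


if __name__ == "__main__":
    # Group level: positive and negative weights cancel exactly,
    # even for a hard prompt with only 2 of 8 correct samples.
    pos, neg = positive_negative_mass([1, 1, 0, 0, 0, 0, 0, 0])
    print(f"positive mass={pos:.4f}, negative mass={neg:.4f}")

    # Sample level: total update magnitude as a function of success count.
    group_size = 8
    for k in range(group_size + 1):
        print(f"{k}/{group_size} correct -> total |A| = {total_magnitude(k, group_size):.4f}")
```

Under these assumptions, the positive and negative masses match for any group, and the total magnitude is largest when half of the samples are correct and falls to zero when all samples pass or all fail. This is one way to read the abstract's statement that the estimator implicitly prioritizes medium-difficulty samples.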
