AMIR-GRPO: Inducing Implicit Preference Signals into GRPO
Amir Hossein Yari, Fajri Koto

TL;DR
AMIR-GRPO enhances group relative policy optimization by integrating implicit preference signals from intra-group reward rankings, leading to improved reasoning performance and better supervision utilization in large language models.
Contribution
Introduces an implicit DPO-style contrastive regularizer into GRPO, built directly from intra-group reward rankings and requiring no additional annotations, with the aim of improving reasoning accuracy.
Findings
Outperforms standard GRPO on mathematical reasoning benchmarks.
Produces clearer separation between correct and incorrect reasoning.
Achieves broader coverage beyond standard GRPO solutions.
Abstract
Reinforcement learning has become the primary paradigm for aligning large language models (LLMs) on complex reasoning tasks, with group relative policy optimization (GRPO) widely used in large-scale post-training. However, GRPO faces structural limitations in reasoning-heavy settings: sequence-level advantage normalization introduces systematic length bias, penalties for low-quality trajectories are diluted, and the scalar objective discards rich pairwise preference information embedded in within-group reward rankings. As a result, valuable supervision from costly rollouts remains underutilized. We propose AMIR-GRPO, which augments GRPO with an implicit DPO-style contrastive regularizer constructed directly from intra-group reward rankings, requiring no additional annotations. This mechanism amplifies suppression of low-reward trajectories, attenuates response-level length bias, and…
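To make the idea in the abstract concrete, the sketch below shows one plausible way to turn within-group reward rankings into a DPO-style pairwise term alongside standard group-normalized advantages. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `beta` value, the use of every unequal-reward pair, and the uniform averaging over pairs are all hypothetical choices, and the text above does not specify how the regularizer is weighted or combined with the GRPO objective.

```python
import math
from itertools import combinations


def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages as in standard GRPO: (r - mean) / std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_preference_loss(rewards, policy_logps, ref_logps, beta=0.1):
    """DPO-style loss over all pairs in one group with unequal rewards.

    ASSUMPTION: pair selection and uniform averaging are illustrative;
    the paper may select or weight pairs differently.
    """
    losses = []
    for i, j in combinations(range(len(rewards)), 2):
        if rewards[i] == rewards[j]:
            continue  # ties carry no preference signal
        w, l = (i, j) if rewards[i] > rewards[j] else (j, i)
        # Implicit reward margin: log-ratio of winner minus log-ratio of loser.
        margin = beta * (
            (policy_logps[w] - ref_logps[w]) - (policy_logps[l] - ref_logps[l])
        )
        losses.append(-log_sigmoid(margin))
    return sum(losses) / len(losses) if losses else 0.0


# Hypothetical group of four rollouts with binary correctness rewards.
rewards = [1.0, 0.0, 1.0, 0.0]
ref_logps = [-12.0, -15.0, -9.0, -20.0]      # sequence log-probs, reference model
policy_logps = [-11.0, -16.0, -8.5, -21.0]   # sequence log-probs, current policy

advantages = group_advantages(rewards)
pref_loss = pairwise_preference_loss(rewards, policy_logps, ref_logps)
```

In this sketch the pairwise term falls as the policy raises the likelihood of higher-reward responses relative to lower-reward ones, measured against the reference model. It is zero when every reward in the group is tied, which is also the case where group-normalized advantages carry no signal.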
Taxonomy
Topics: Reinforcement Learning in Robotics · Explainable Artificial Intelligence (XAI) · Topic Modeling
