AMIR-GRPO: Inducing Implicit Preference Signals into GRPO

Amir Hossein Yari; Fajri Koto

arXiv:2601.03661·cs.LG·January 8, 2026

Open Access

TL;DR

AMIR-GRPO enhances group relative policy optimization by integrating implicit preference signals from intra-group reward rankings, leading to improved reasoning performance and better supervision utilization in large language models.

Contribution

The paper adds an implicit DPO-style contrastive regularizer to GRPO. The regularizer is built from intra-group reward rankings, requires no extra annotations, and improves reasoning accuracy.
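To make "intra-group reward rankings without extra annotations" concrete, here is a minimal sketch of how preference pairs might be read off the rewards of one rollout group. The function name `preference_pairs` and the `min_gap` threshold are hypothetical illustrations, not names or settings taken from the paper, and the paper's actual pair-selection rule may differ.

```python
from itertools import combinations


def preference_pairs(rewards, min_gap=0.0):
    """Return (winner_idx, loser_idx) pairs for one group of rollouts.

    rewards: scalar rewards, one per response sampled for the same prompt.
    min_gap: hypothetical threshold; pairs whose reward gap does not
             exceed it are treated as ties and skipped.
    """
    pairs = []
    for i, j in combinations(range(len(rewards)), 2):
        gap = rewards[i] - rewards[j]
        if abs(gap) <= min_gap:
            continue  # tied responses carry no preference signal
        # Order each pair so the higher-reward response comes first.
        pairs.append((i, j) if gap > 0 else (j, i))
    return pairs


# Example: four rollouts with a binary correctness reward.
group_rewards = [1.0, 0.0, 1.0, 0.0]
pairs = preference_pairs(group_rewards)
```

Every pair comes from rewards the GRPO rollout already produced, so no human preference labels are involved. A group in which all responses receive the same reward yields no pairs.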

Findings

1. Outperforms standard GRPO on mathematical reasoning benchmarks.

2. Produces clearer separation between correct and incorrect reasoning.

3. Achieves broader coverage beyond standard GRPO solutions.

Abstract

Reinforcement learning has become the primary paradigm for aligning large language models (LLMs) on complex reasoning tasks, with group relative policy optimization (GRPO) widely used in large-scale post-training. However, GRPO faces structural limitations in reasoning-heavy settings: sequence-level advantage normalization introduces systematic length bias, penalties for low-quality trajectories are diluted, and the scalar objective discards rich pairwise preference information embedded in within-group reward rankings. As a result, valuable supervision from costly rollouts remains underutilized. We propose AMIR-GRPO, which augments GRPO with an implicit DPO-style contrastive regularizer constructed directly from intra-group reward rankings, requiring no additional annotations. This mechanism amplifies suppression of low-reward trajectories, attenuates response-level length bias, and…
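The abstract is cut off before it states the objective, so the following is only a rough sketch of the two ingredients it names: GRPO's group-normalized advantages and a DPO-style contrastive term over within-group ranked pairs. It assumes the standard DPO logistic loss, sequence-level log-probabilities, and a frozen reference policy. The names `group_advantages`, `dpo_style_regularizer`, and `beta` are illustrative assumptions, not the paper's definitions.

```python
import math


def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: (r - mean) / std within one rollout group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def dpo_style_regularizer(rewards, logp_policy, logp_ref, beta=0.1):
    """Mean DPO-style loss over all ranked pairs in one group (assumed form).

    For each pair with rewards[w] > rewards[l], the per-pair loss is
    -log sigmoid(beta * ((logp_policy[w] - logp_ref[w])
                         - (logp_policy[l] - logp_ref[l]))).
    beta is a hypothetical temperature, not a value from the paper.
    """
    losses = []
    n = len(rewards)
    for w in range(n):
        for l in range(n):
            if rewards[w] > rewards[l]:
                margin = beta * (
                    (logp_policy[w] - logp_ref[w])
                    - (logp_policy[l] - logp_ref[l])
                )
                # Numerically stable softplus(-margin) == -log sigmoid(margin).
                losses.append(max(-margin, 0.0) + math.log1p(math.exp(-abs(margin))))
    # A group with no reward differences contributes no contrastive signal.
    return sum(losses) / len(losses) if losses else 0.0


rewards = [1.0, 0.0, 1.0, 0.0]
adv = group_advantages(rewards)

logp_ref = [-12.0, -11.0, -15.0, -10.0]
reg_at_ref = dpo_style_regularizer(rewards, logp_ref, logp_ref)

# Policy that raised the winners' log-probs and lowered the losers'.
logp_better = [-10.0, -13.0, -13.0, -12.0]
reg_better = dpo_style_regularizer(rewards, logp_better, logp_ref)
```

Under these assumptions, a training objective would add a weighted copy of the regularizer to the usual GRPO loss. The weight is another hypothetical hyperparameter. The contrastive term shrinks as the policy widens the log-probability margin between higher-reward and lower-reward responses, which is one way to read the abstract's claim of amplified suppression of low-reward trajectories.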

Taxonomy

Topics: Reinforcement Learning in Robotics · Explainable Artificial Intelligence (XAI) · Topic Modeling