A Unified Pair-GRPO Family: From Implicit to Explicit Preference Constraints for Stable and General RL Alignment

Hao Yu

arXiv:2605.06375·cs.LG·May 12, 2026

TL;DR

This paper introduces the Pair-GRPO family, a unified framework for preference-based RL that improves stability, interpretability, and performance in LLM alignment tasks through novel theoretical insights and practical algorithms.

Contribution

The paper develops two variants, Soft-Pair-GRPO and Hard-Pair-GRPO, provides theoretical guarantees for them, and reports stronger performance than existing methods on benchmark tasks.

Findings

01

Under a first-order Taylor expansion around the current policy, Soft-Pair-GRPO's gradient is a positive scalar multiple of GRPO's gradient, which explains its stability.

02

Hard-Pair-GRPO introduces explicit constraints to further reduce gradient noise.

03

The Pair-GRPO family outperforms state-of-the-art baselines in LLM alignment benchmarks.

Abstract

Large language model (LLM) alignment via reinforcement learning from human preferences (RLHF) suffers from unstable policy updates, ambiguous gradient directions, poor interpretability, and high gradient variance in mainstream pairwise preference learning paradigms. To systematically address these limitations, we establish a unified theoretical framework for preference-based RL optimization centered on the Pair-GRPO family, comprising two tightly coupled variants: Soft-Pair-GRPO and Hard-Pair-GRPO. Soft-Pair-GRPO is a minimal modification of Group Relative Policy Optimization (GRPO) that replaces group-normalized scalar rewards with binary pairwise preference rewards, retaining GRPO's clipped surrogate and KL-regularized structure. We prove a critical gradient equivalence theorem: under first-order Taylor expansion around the current policy, Soft-Pair-GRPO's gradient is a positive…
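The reward substitution described in the abstract can be illustrated with a minimal sketch. The function names below are hypothetical, and two details are assumptions that the abstract does not specify: that each binary preference is derived by comparing scalar scores within a sampled group, and that a response's pairwise reward is its fraction of wins against the other group members. The clipped surrogate follows the standard PPO/GRPO form and omits the KL term.

```python
from statistics import mean, pstdev


def group_normalized_advantages(rewards, eps=1e-8):
    """Standard GRPO-style signal: z-score each reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def pairwise_preference_rewards(scores):
    """Assumed aggregation: fraction of the other responses that response i beats.

    Each comparison is binary (1 if preferred, else 0). Comparing scalar
    scores stands in for a preference model and is an illustrative assumption.
    """
    g = len(scores)
    return [
        sum(1.0 for j in range(g) if j != i and scores[i] > scores[j]) / (g - 1)
        for i in range(g)
    ]


def clipped_surrogate(ratios, advantages, clip_eps=0.2):
    """Clipped surrogate objective, averaged over responses (KL term omitted)."""
    terms = []
    for rho, adv in zip(ratios, advantages):
        clipped_rho = min(max(rho, 1.0 - clip_eps), 1.0 + clip_eps)
        terms.append(min(rho * adv, clipped_rho * adv))
    return mean(terms)


if __name__ == "__main__":
    scores = [0.1, 0.9, 0.5, 0.3]  # made-up scores for one group of 4 responses
    print(group_normalized_advantages(scores))
    print(pairwise_preference_rewards(scores))
```

Both signals rank the responses in the same order in this example; the pairwise version depends only on which response wins each comparison, not on the magnitude of the score gap.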
