A Unified Pair-GRPO Family: From Implicit to Explicit Preference Constraints for Stable and General RL Alignment
Hao Yu

TL;DR
This paper introduces the Pair-GRPO family, a unified framework for preference-based RL that improves stability, interpretability, and performance in LLM alignment tasks through novel theoretical insights and practical algorithms.
Contribution
It develops Soft-Pair-GRPO and Hard-Pair-GRPO, providing theoretical guarantees and demonstrating superior performance on benchmark tasks compared to existing methods.
Findings
Under a first-order Taylor expansion around the current policy, Soft-Pair-GRPO's gradient is a positive scalar multiple of GRPO's gradient, which explains its stability.
Hard-Pair-GRPO introduces explicit constraints to further reduce gradient noise.
The Pair-GRPO family outperforms state-of-the-art baselines in LLM alignment benchmarks.
Abstract
Large language model (LLM) alignment via reinforcement learning from human preferences (RLHF) suffers from unstable policy updates, ambiguous gradient directions, poor interpretability, and high gradient variance in mainstream pairwise preference learning paradigms. To systematically address these limitations, we establish a unified theoretical framework for preference-based RL optimization centered on the Pair-GRPO family, comprising two tightly coupled variants: Soft-Pair-GRPO and Hard-Pair-GRPO. Soft-Pair-GRPO is a minimal modification of Group Relative Policy Optimization (GRPO) that replaces group-normalized scalar rewards with binary pairwise preference rewards, retaining GRPO's clipped surrogate and KL-regularized structure. We prove a critical gradient equivalence theorem: under first-order Taylor expansion around the current policy, Soft-Pair-GRPO's gradient is a positive…
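To make the reward substitution described in the abstract concrete, the sketch below contrasts GRPO-style group normalization with one possible binary pairwise preference reward. The abstract is truncated and does not give the paper's exact construction, so `pairwise_preference_rewards` (each response's fraction of wins against the other responses in its group) is a hypothetical illustration, not the authors' definition. The function names and the example scores are also assumptions.

```python
import numpy as np

def group_normalized_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize scalar rewards within one group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def pairwise_preference_rewards(rewards):
    """Hypothetical binary pairwise reward (an assumption, not the paper's formula).

    Each response is compared with every other response in the group;
    a comparison yields 1 for a strict win and 0 otherwise. The result
    is the fraction of comparisons won. Requires at least two responses.
    """
    r = np.asarray(rewards, dtype=float)
    wins = (r[:, None] > r[None, :]).astype(float)  # wins[i, j] = 1 if i beats j
    return wins.sum(axis=1) / (len(r) - 1)

# Illustrative scalar scores for a group of four sampled responses.
scores = [0.2, 0.9, 0.5, 0.1]
adv = group_normalized_advantages(scores)
pref = pairwise_preference_rewards(scores)
```

In this toy setting both signals rank the four responses identically. The pairwise version discards the magnitude of the score gaps and keeps only the win/loss outcomes.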
