General Preference Reinforcement Learning
Muhammad Umer, Muhammad Ahmed Mohsin, Ahsan Bilal, Arslan Chaudhry, Andreas Haupt, Sanmi Koyejo, Emily Fox, John M. Cioffi

TL;DR
This paper introduces General Preference Reinforcement Learning (GPRL), a method that uses a structured, multi-dimensional preference model to improve the alignment and robustness of language models on open-ended tasks.
Contribution
The paper proposes GPRL, which leverages a multi-dimensional preference embedding and a drift monitor to enhance policy updates and prevent reward hacking in language models.
Findings
GPRL achieves a 56.51% win rate on AlpacaEval 2.0.
GPRL outperforms SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench.
GPRL resists reward hacking across extended training.
Abstract
Post-training has split large language model (LLM) alignment into two largely disconnected tracks. Online reinforcement learning (RL) with verifiable rewards drives emergent reasoning on math and code but depends on a programmatic verifier that cannot reach open-ended tasks, while preference optimization handles open-ended generation yet forgoes the continuous exploration that powers online RL. Closing this gap requires a verifier for open-ended quality, but a scalar reward model is the wrong shape for the job. Quality is multi-dimensional, and any scalar score is an incomplete proxy that lets online RL collapse onto whichever axis the score is most sensitive to. We turn instead to the General Preference Model (GPM), which embeds responses into skew-symmetric subspaces and represents preference as a structured, intransitivity-aware comparison. Building on this, we propose General…
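To make the abstract's contrast between a scalar score and a skew-symmetric comparison concrete, here is a minimal sketch. It is not taken from the paper. It assumes one common way to write such a score: pair up embedding coordinates into 2-D subspaces and apply the block [[0, -1], [1, 0]] in each. The function names, the sign convention (positive means the first response is preferred), and the three toy embeddings are all hypothetical.

```python
import math

def skew_score(u, v):
    """Assumed score: sum over 2-D blocks of u[2k]*v[2k+1] - u[2k+1]*v[2k].

    Swapping the arguments flips the sign, so score(u, v) = -score(v, u).
    """
    if len(u) != len(v) or len(u) % 2 != 0:
        raise ValueError("embeddings must share the same even dimension")
    return sum(u[i] * v[i + 1] - u[i + 1] * v[i] for i in range(0, len(u), 2))

def pref_prob(u, v):
    """Map the score to a probability that u is preferred over v (assumed sigmoid link)."""
    return 1.0 / (1.0 + math.exp(-skew_score(u, v)))

def unit(angle_deg):
    """Hypothetical 2-D embedding on the unit circle at the given angle."""
    a = math.radians(angle_deg)
    return [math.cos(a), math.sin(a)]

# Three toy responses spaced 120 degrees apart form a preference cycle.
a, b, c = unit(0), unit(120), unit(240)

for name, (x, y) in {"a over b": (a, b), "b over c": (b, c), "c over a": (c, a)}.items():
    print(f"P({name}) = {pref_prob(x, y):.3f}")
```

In this toy setup each of the three probabilities is above 0.5, so a beats b, b beats c, and c beats a. No assignment of scalar rewards can reproduce that pattern, since r_a > r_b > r_c forces r_a > r_c. This is the sense in which a structured comparison can be intransitivity-aware where a scalar score cannot.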
