TL;DR
This paper introduces the HRC model and DSPPO framework to explicitly disentangle transitive and cyclic preferences in LLM alignment, improving robustness and performance over existing methods.
Contribution
It proposes HRC, a game-theoretic decomposition of preferences into orthogonal transitive (scalar) and cyclic (vector) components, and DSPPO, a self-play optimization method that treats alignment as a time-varying game.
Findings
In synthetic experiments with mixed transitive-cyclic preferences, HRC converges faster and achieves higher accuracy.
HRC+DSPPO outperforms baselines on RewardBench 2 and downstream benchmarks.
Code is publicly available at https://github.com/lab-klc/Hybrid-Reward-Cyclic.
Abstract
Standard RLHF relies on transitive scalar rewards, failing to capture the cyclic nature of human preferences. While some approaches like the General Preference Model (GPM) address this, we identify a theoretical limitation: their implicit formulation entangles hierarchy with cyclicity, failing to guarantee dominant solutions. To address this, we propose the Hybrid Reward-Cyclic (HRC) model, which utilizes game-theoretic decomposition to explicitly disentangle preferences into orthogonal transitive (scalar) and cyclic (vector) components. Complementing this, we introduce Dynamic Self-Play Preference Optimization (DSPPO), which treats alignment as a time-varying game to progressively guide the policy toward the Nash equilibrium. Synthetic data experiments further validate HRC's structural superiority in mixed transitive-cyclic settings, where HRC converges faster and achieves higher…
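The idea of splitting preferences into a transitive part and a cyclic part can be illustrated with a small numerical sketch. This is not the HRC parameterization from the paper, which is not described on this page; it is the standard least-squares split of a skew-symmetric pairwise score matrix on a complete comparison graph, where the row mean gives a scalar reward and the residual is purely cyclic. The function names `decompose` and `inner`, the three-response example, and the reading of `A[i][j]` as a score for response i over response j are all assumptions made for illustration.

```python
def decompose(A):
    """Split a skew-symmetric score matrix A (A[i][j] == -A[j][i]) into
    a transitive part T[i][j] = r[i] - r[j] and a cyclic residual C."""
    n = len(A)
    # Row means give the least-squares fit of A[i][j] ~ r[i] - r[j]
    # when every pair of responses has been compared.
    r = [sum(row) / n for row in A]
    T = [[r[i] - r[j] for j in range(n)] for i in range(n)]
    # Whatever the scalar rewards cannot explain is the cyclic component.
    C = [[A[i][j] - T[i][j] for j in range(n)] for i in range(n)]
    return r, T, C


def inner(X, Y):
    """Frobenius inner product of two matrices given as nested lists."""
    return sum(x * y for row_x, row_y in zip(X, Y) for x, y in zip(row_x, row_y))


# Hypothetical example: a clear hierarchy (rewards) mixed with a
# rock-paper-scissors cycle among the same three responses.
rewards = [1.0, 0.0, -1.0]
cycle = [[0.0, 1.0, -1.0],
         [-1.0, 0.0, 1.0],
         [1.0, -1.0, 0.0]]
A = [[rewards[i] - rewards[j] + cycle[i][j] for j in range(3)] for i in range(3)]

r, T, C = decompose(A)
print("scalar rewards:", r)
print("cyclic part:", C)
print("<T, C> =", inner(T, C))
```

In this toy case the split recovers the hierarchy and the cycle separately, and the two parts have zero inner product, which is the sense of "orthogonal" used in this sketch. A scalar-only reward model fitted to the same matrix would keep `r` and discard `C`.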
