Transitivity Meets Cyclicity: Explicit Preference Decomposition for Dynamic Large Language Model Alignment

Yucong Huang; Xiucheng Li; Kaiqi Zhao; Jing Li

arXiv:2605.17342·cs.CL·May 19, 2026

TL;DR

This paper introduces the HRC model and DSPPO framework to explicitly disentangle transitive and cyclic preferences in LLM alignment, improving robustness and performance over existing methods.

Contribution

The paper proposes a game-theoretic decomposition of preferences into transitive and cyclic components (the HRC model) and a dynamic self-play optimization method (DSPPO) for preference modeling in LLM alignment.

Findings

01 In synthetic experiments with mixed transitive and cyclic preferences, HRC converges faster and achieves higher accuracy.

02 HRC+DSPPO outperforms baselines on RewardBench 2 and downstream benchmarks.

03 Code is publicly available at https://github.com/lab-klc/Hybrid-Reward-Cyclic.

Abstract

Standard RLHF relies on transitive scalar rewards, failing to capture the cyclic nature of human preferences. While some approaches like the General Preference Model (GPM) address this, we identify a theoretical limitation: their implicit formulation entangles hierarchy with cyclicity, failing to guarantee dominant solutions. To address this, we propose the Hybrid Reward-Cyclic (HRC) model, which utilizes game-theoretic decomposition to explicitly disentangle preferences into orthogonal transitive (scalar) and cyclic (vector) components. Complementing this, we introduce Dynamic Self-Play Preference Optimization (DSPPO), which treats alignment as a time-varying game to progressively guide the policy toward the Nash equilibrium. Synthetic data experiments further validate HRC's structural superiority in mixed transitive–cyclic settings, where HRC converges faster and achieves higher…
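To make the idea of orthogonal transitive and cyclic components concrete, here is a minimal sketch on a toy preference matrix. It is not the HRC model from the paper: it assumes a standard Hodge-style split of a skew-symmetric payoff matrix, in which the transitive part comes from per-item scalar scores (row means) and the cyclic part is the residual. The names `decompose`, `inner`, `RPS`, `HIERARCHY`, and `MIXED` are hypothetical and chosen only for illustration.

```python
def decompose(A):
    """Split a skew-symmetric matrix A into transitive and cyclic parts.

    Assumption (not from the paper): scalar scores are the row means of A,
    the transitive part is T[i][j] = r[i] - r[j], and the cyclic part is
    the residual C = A - T.
    """
    n = len(A)
    r = [sum(row) / n for row in A]                        # scalar score per item
    T = [[r[i] - r[j] for j in range(n)] for i in range(n)]
    C = [[A[i][j] - T[i][j] for j in range(n)] for i in range(n)]
    return r, T, C


def inner(X, Y):
    """Frobenius inner product of two square matrices given as nested lists."""
    return sum(x * y for row_x, row_y in zip(X, Y) for x, y in zip(row_x, row_y))


# Pure cycle: item 0 beats 1, 1 beats 2, 2 beats 0 (rock-paper-scissors).
RPS = [[0, 1, -1],
       [-1, 0, 1],
       [1, -1, 0]]

# Pure hierarchy induced by scalar scores 2 > 1 > 0.
HIERARCHY = [[0, 1, 2],
             [-1, 0, 1],
             [-2, -1, 0]]

# A mixed preference structure: hierarchy plus cycle.
MIXED = [[RPS[i][j] + HIERARCHY[i][j] for j in range(3)] for i in range(3)]

r, T, C = decompose(MIXED)
```

On this toy input the split recovers the hierarchy as the transitive part and the rock-paper-scissors cycle as the residual, and the two parts have zero Frobenius inner product. A scalar reward alone can represent only `T`, which is the limitation of transitive rewards that the abstract describes.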

Code & Models

Repositories

lab-klc/Hybrid-Reward-Cyclic (GitHub)
