Theoretical Tensions in RLHF: Reconciling Empirical Success with Inconsistencies in Social Choice Theory
Jiancong Xiao, Zhekun Shi, Kaizhao Liu, Qi Long, Weijie J. Su

TL;DR
This paper examines the theoretical foundations of RLHF, explains why it succeeds empirically despite violating social choice axioms, and proposes a modified reward modeling objective along with new alignment criteria.
Contribution
It provides a theoretical reconciliation of RLHF's performance with social choice axioms and introduces new alignment criteria for future method design.
Findings
RLHF satisfies pairwise majority and Condorcet consistency under realistic assumptions.
A simple modification to reward modeling can ensure consistency properties.
RLHF satisfies preference matching and preference equivalence but not group preference matching.
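The consistency notions named in the findings can be illustrated with a small sketch. It assumes the textbook definition of a Condorcet winner (an option that beats every other option in a head-to-head majority vote) applied to a toy profile of strict rankings. The function names `pairwise_wins` and `condorcet_winner` and the example profiles are hypothetical and are not taken from the paper, whose formal setting may differ.

```python
from itertools import combinations


def pairwise_wins(rankings):
    """Count, for each ordered pair (a, b), how many voters rank a above b.

    `rankings` is a list of strict rankings, each a list of the same
    candidates ordered from most to least preferred (an assumption of
    this sketch).
    """
    candidates = sorted(rankings[0])
    wins = {(a, b): 0 for a in candidates for b in candidates if a != b}
    for ranking in rankings:
        position = {c: i for i, c in enumerate(ranking)}
        for a, b in combinations(candidates, 2):
            if position[a] < position[b]:
                wins[(a, b)] += 1
            else:
                wins[(b, a)] += 1
    return wins


def condorcet_winner(rankings):
    """Return the candidate that wins every pairwise majority vote, or None."""
    candidates = sorted(rankings[0])
    wins = pairwise_wins(rankings)
    for a in candidates:
        if all(wins[(a, b)] > wins[(b, a)] for b in candidates if b != a):
            return a
    return None


# Toy profile: three voters prefer A > B > C, two prefer B > C > A.
profile = [["A", "B", "C"]] * 3 + [["B", "C", "A"]] * 2

# Classic cyclic profile: every option loses some head-to-head vote.
cycle = [["A", "B", "C"], ["B", "C", "A"], ["C", "A", "B"]]

print(condorcet_winner(profile))
print(condorcet_winner(cycle))
```

In the first profile A beats both B and C by three votes to two, so A is the Condorcet winner. The second profile is cyclic, so no Condorcet winner exists and the function returns `None`. Under this textbook reading, a Condorcet-consistent method is one that selects the Condorcet winner whenever one exists.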
Abstract
Despite its empirical success, Reinforcement Learning from Human Feedback (RLHF) has been shown to violate almost all the fundamental axioms in social choice theory -- such as majority consistency, pairwise majority consistency, and Condorcet consistency. This raises a foundational question: why does RLHF perform so well in practice if it fails these seemingly essential properties? In this paper, we resolve this paradox by showing that under mild and empirically plausible assumptions on the preference profile, RLHF does satisfy pairwise majority and Condorcet consistency. These assumptions are frequently satisfied in real-world alignment tasks, offering a theoretical explanation for RLHF's strong practical performance. Furthermore, we show that a slight modification to the reward modeling objective can ensure pairwise majority or Condorcet consistency even under general preference…
Taxonomy
Topics: Gender, Labor, and Family Dynamics
