Theoretical Tensions in RLHF: Reconciling Empirical Success with Inconsistencies in Social Choice Theory

Jiancong Xiao; Zhekun Shi; Kaizhao Liu; Qi Long; Weijie J. Su

arXiv:2506.12350 · stat.ML · June 17, 2025

TL;DR

This paper investigates the theoretical foundations of RLHF, explaining its empirical success despite axiomatic violations, and proposes modifications and new criteria to improve alignment.

Contribution

It provides a theoretical reconciliation of RLHF's performance with social choice axioms and introduces new alignment criteria for future method design.

Findings

01

RLHF satisfies pairwise majority and Condorcet consistency under realistic assumptions.

02

A simple modification to the reward modeling objective can ensure pairwise majority or Condorcet consistency.

03

RLHF satisfies preference matching and preference equivalence but not group preference matching.

Abstract

Despite its empirical success, Reinforcement Learning from Human Feedback (RLHF) has been shown to violate almost all the fundamental axioms in social choice theory -- such as majority consistency, pairwise majority consistency, and Condorcet consistency. This raises a foundational question: why does RLHF perform so well in practice if it fails these seemingly essential properties? In this paper, we resolve this paradox by showing that under mild and empirically plausible assumptions on the preference profile, RLHF does satisfy pairwise majority and Condorcet consistency. These assumptions are frequently satisfied in real-world alignment tasks, offering a theoretical explanation for RLHF's strong practical performance. Furthermore, we show that a slight modification to the reward modeling objective can ensure pairwise majority or Condorcet consistency even under general preference…
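The abstract's central question is whether the ranking induced by a learned reward respects the pairwise majority relation of the labelers. The toy sketch below is not the paper's analysis. It builds a hypothetical preference profile, counts pairwise majorities, finds the Condorcet winner, and fits a Bradley-Terry reward (the standard RLHF reward-modeling likelihood) with the classical MM (Zermelo) iteration. The profile, the helper names (`pairwise_wins`, `condorcet_winner`, `fit_bradley_terry`), and the iteration count are all illustrative assumptions. The profile is deliberately chosen so that the fitted reward ranks a non-Condorcet winner first, the kind of general-case violation the abstract mentions.

```python
import math
from itertools import combinations

# Hypothetical profile (illustrative only): (ranking from most to least preferred, number of labelers).
PROFILE = [
    (("a", "b", "c"), 3),
    (("b", "c", "a"), 2),
]


def pairwise_wins(profile):
    """Return the candidates and wins[(x, y)] = number of labelers preferring x to y."""
    candidates = sorted(profile[0][0])
    wins = {(x, y): 0 for x in candidates for y in candidates if x != y}
    for ranking, count in profile:
        pos = {c: i for i, c in enumerate(ranking)}
        for x, y in combinations(candidates, 2):
            winner, loser = (x, y) if pos[x] < pos[y] else (y, x)
            wins[(winner, loser)] += count
    return candidates, wins


def condorcet_winner(candidates, wins):
    """Return the candidate beating every other by strict pairwise majority, or None."""
    for x in candidates:
        if all(wins[(x, y)] > wins[(y, x)] for y in candidates if y != x):
            return x
    return None


def fit_bradley_terry(candidates, wins, iters=2000):
    """Bradley-Terry MLE via the MM update; returns log-strengths (reward scores).

    Assumes the comparison graph is strongly connected and every candidate
    wins at least once, so the MLE is finite.
    """
    gamma = {c: 1.0 for c in candidates}
    for _ in range(iters):
        new = {}
        for i in candidates:
            total_wins = sum(wins[(i, j)] for j in candidates if j != i)
            denom = sum(
                (wins[(i, j)] + wins[(j, i)]) / (gamma[i] + gamma[j])
                for j in candidates
                if j != i
            )
            new[i] = total_wins / denom
        norm = sum(new.values())  # fix the scale, which the likelihood does not identify
        gamma = {c: g / norm for c, g in new.items()}
    return {c: math.log(g) for c, g in gamma.items()}


candidates, wins = pairwise_wins(PROFILE)
reward = fit_bradley_terry(candidates, wins)
bt_top = max(reward, key=reward.get)
print("Condorcet winner:", condorcet_winner(candidates, wins))  # Condorcet winner: a
print("Top under fitted reward:", bt_top)  # Top under fitted reward: b
```

Because every pair here is compared by all five labelers, the Bradley-Terry MLE orders candidates by their total pairwise wins (a Borda-style count), which favors b (7 wins) over the Condorcet winner a (6 wins). Whether the assumptions studied in the paper exclude profiles like this one is not something this sketch establishes.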
