From RLHF to Direct Alignment: A Theoretical Unification of Preference Learning for Large Language Models
Tarun Raheja, Nilay Pochhi

TL;DR
This paper unifies various preference learning methods for aligning large language models through a theoretical framework, clarifying their differences and guiding practitioners in method selection.
Contribution
It provides a formal unification of preference learning approaches, characterizing them along three axes and establishing key theoretical results and failure modes.
Findings
Shows that preference learning methods differ along three axes: preference model, regularization mechanism, and data distribution.
Establishes scaling laws and conditions for method failure.
Provides a decision guide for practitioners.
Abstract
Aligning large language models (LLMs) with human preferences has become essential for safe and beneficial AI deployment. While Reinforcement Learning from Human Feedback (RLHF) established the dominant paradigm, a proliferation of alternatives -- Direct Preference Optimization (DPO), Identity Preference Optimization (IPO), Kahneman-Tversky Optimization (KTO), Simple Preference Optimization (SimPO), and many others -- has left practitioners without clear guidance on method selection. This survey provides a theoretical unification of preference learning methods, revealing that the apparent diversity reduces to principled choices along three orthogonal axes: (I) Preference Model (what likelihood model underlies the objective), (II) Regularization Mechanism (how deviation from reference policies is controlled), and (III) Data Distribution (online vs. …
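As a rough illustration of axes (I) and (II), the sketch below applies two of the methods named in the abstract, DPO and IPO, to the same quantity: the margin by which the policy favors the chosen response over the rejected one, relative to the reference policy. It uses the per-example objectives as published for those two methods, not the notation of this survey. The function names, argument names, and default values of `beta` and `tau` are assumptions made for illustration.

```python
import math


def log_ratio_margin(logp_w, logp_l, ref_logp_w, ref_logp_l):
    """Policy-vs-reference log-ratio of the chosen response (w)
    minus that of the rejected response (l)."""
    return (logp_w - ref_logp_w) - (logp_l - ref_logp_l)


def dpo_loss(margin, beta=0.1):
    """DPO: logistic (Bradley-Terry) loss, -log sigmoid(beta * margin).
    Computed as a numerically stable softplus of -beta * margin."""
    z = -beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def ipo_loss(margin, tau=0.1):
    """IPO: squared loss pulling the margin toward 1 / (2 * tau)."""
    return (margin - 1.0 / (2.0 * tau)) ** 2


# Hypothetical sequence log-probabilities for one preference pair.
m = log_ratio_margin(logp_w=-10.0, logp_l=-12.0,
                     ref_logp_w=-11.0, ref_logp_l=-11.0)
print(m, dpo_loss(m), ipo_loss(m))
```

Both losses consume the same margin. DPO keeps decreasing as the margin grows, while IPO is minimized at a finite target, which is one way the regularization behavior of the two methods differs.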
Taxonomy
Topics: Mobile Crowdsensing and Crowdsourcing · Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI
