Beyond RLHF: A Unified Theoretical Framework of Alignment
Jihun Yun, Juno Kim, Jongho Park, Junhyuck Kim, Jongha Jon Ryu, Jaewoong Cho, Kwang-Sung Jun

TL;DR
This paper introduces a unified theoretical framework for alignment in large language models. Within that framework it derives several training objectives, including one that resembles RLHF, proves guarantees for them, and validates them empirically.
Contribution
The paper reframes alignment as distribution learning from pairwise preferences and proposes three principled objectives with proven convergence guarantees. The same framework helps explain empirical performance differences between methods.
Findings
The reverse KL minimization objective resembles the RLHF objective, which gives a principled justification for RLHF's effectiveness.
On-policy objectives outperform likelihood-style objectives empirically.
Proposed objectives are competitive with strong baselines across tasks.
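The link between reverse KL minimization and RLHF can be illustrated with a small numerical check. The sketch below assumes the standard KL-regularized form of the RLHF objective, E_pi[r] - beta * KL(pi || pi_ref), which may differ from the exact formulation used in the paper. The three-outcome distributions, the reward values, and all function names are hypothetical and chosen only for illustration. Under that assumption, the objective equals a constant minus beta times the reverse KL divergence from the policy to the tilted target pi*(y) proportional to pi_ref(y) * exp(r(y) / beta).

```python
import math

def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

def kl(p, q):
    """KL(p || q) for discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def rlhf_objective(policy, ref, reward, beta):
    """Assumed KL-regularized objective: E_policy[r] - beta * KL(policy || ref)."""
    expected_reward = sum(p * r for p, r in zip(policy, reward))
    return expected_reward - beta * kl(policy, ref)

def tilted_target(ref, reward, beta):
    """Target distribution proportional to ref(y) * exp(r(y) / beta)."""
    return normalize([q * math.exp(r / beta) for q, r in zip(ref, reward)])

def log_partition(ref, reward, beta):
    """Log of the normalizing constant of the tilted target."""
    return math.log(sum(q * math.exp(r / beta) for q, r in zip(ref, reward)))

# Hypothetical toy problem with three possible responses.
ref = [0.5, 0.3, 0.2]
reward = [0.0, 1.0, 2.0]
beta = 0.5
policy = [0.2, 0.3, 0.5]

target = tilted_target(ref, reward, beta)

# Left side: the assumed RLHF objective evaluated at the policy.
lhs = rlhf_objective(policy, ref, reward, beta)
# Right side: a policy-independent constant minus beta * reverse KL to the target.
rhs = beta * log_partition(ref, reward, beta) - beta * kl(policy, target)

print(abs(lhs - rhs) < 1e-9)
```

Because the constant term does not depend on the policy, maximizing this objective is the same as minimizing KL(policy || target), and the maximizer is the tilted target itself.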
Abstract
Alignment via reinforcement learning from human feedback (RLHF) has become the dominant paradigm for controlling the quality of outputs from large language models (LLMs). However, existing theories do not provide strong justification for the RLHF objective itself and do not allow comparisons of the guarantees between various methods because different methods are often analyzed under different frameworks. Toward a unified framework for alignment, we ask under what assumptions we can derive existing or new training objectives and obtain theoretical guarantees. To this end, we reframe alignment as distribution learning from pairwise preferences, which makes a probabilistic assumption describing how preferences reveal information about the target LM. This leads us to propose three principled alignment objectives: preference maximum likelihood estimation, preference distillation, and reverse…
Taxonomy
Topics: Natural Language Processing Techniques
