PIPA: Preference Alignment as Prior-Informed Statistical Estimation
Junbo Li, Zhangyang Wang, Qiang Liu

TL;DR
PIPA introduces a unified probabilistic framework for offline preference alignment in language models. By integrating prior information, it improves performance on the GSM8K and MATH benchmarks without extra training or computational cost.
Contribution
It formulates preference alignment as a prior-informed MLE problem, unifying existing algorithms and enabling new variations with enhanced performance.
Findings
Achieves 3-10% performance improvements on GSM8K and MATH benchmarks.
Unifies existing offline preference algorithms under a probabilistic framework.
Enhances performance without additional training or computational costs.
Abstract
Offline preference alignment methods for language models, such as Direct Preference Optimization (DPO), are favored for their effectiveness and simplicity, as they eliminate the need for costly reinforcement learning. Various offline algorithms have been developed for different data settings, yet they lack a unified understanding. In this study, we introduce Prior-Informed Preference Alignment (PIPA), a unified, RL-free probabilistic framework that formulates language model preference alignment as a Maximum Likelihood Estimation (MLE) problem with prior constraints. This method effectively accommodates both paired and unpaired data, as well as answer and step-level annotations. We illustrate that DPO and KTO are special cases with different prior constraints within our framework. By integrating different types of prior information, we developed two variations of PIPA: PIPA-M and PIPA-N. Both algorithms…
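For readers unfamiliar with DPO, which the abstract names as a special case of the framework, the following is a minimal sketch of the standard per-pair DPO loss. It is not the PIPA objective and is not taken from the paper; the function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions. Inputs are summed log-probabilities of a full response under the trained policy and under a frozen reference model.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All names here are hypothetical; beta=0.1 is an assumed default.
    """
    # Log-ratio of policy to reference for each response.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Example: the policy prefers the chosen response more than the reference does,
# so the loss falls below log(2), its value when the margin is zero.
example = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss decreases as the policy raises the chosen response's likelihood relative to the rejected one, measured against the reference model.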
Taxonomy
Topics: Multi-Criteria Decision Making · Data Management and Algorithms · Bayesian Modeling and Causal Inference
