TL;DR
This paper introduces MMPO, a simpler and more stable method for aligning large language models with human preferences. It maximizes a marginal likelihood and needs no explicit reward model.
Contribution
The paper proposes a novel preference optimization approach based on maximum marginal likelihood, simplifying and stabilizing LLM alignment without explicit reward models.
Findings
MMPO is more stable across hyperparameters than baselines.
MMPO achieves comparable or better preference alignment.
Improved performance is due to implicit preference optimization.
Abstract
Aligning Large Language Models (LLMs) with human preferences is crucial, but standard methods like Reinforcement Learning from Human Feedback (RLHF) are often complex and unstable. In this work, we propose a new, simpler approach that recasts alignment through the lens of Maximum Marginal Likelihood (MML) estimation. Our new MML-based Preference Optimization (MMPO) maximizes the marginal log-likelihood of a preferred text output, using the preference pair as samples for approximation, and forgoes the need for both an explicit reward model and entropy maximization. We theoretically demonstrate that MMPO implicitly performs preference optimization, producing a weighted gradient that naturally up-weights chosen responses over rejected ones. Across models ranging from 135M to 8B parameters, we empirically show that MMPO: 1) is more stable with respect to the hyperparameter compared…
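The excerpt above does not state the MMPO objective, so the following is only a generic sketch of why the gradient of a marginal log-likelihood is a weighted sum of per-sample gradients. It assumes a two-sample marginal of the form log(w_chosen * p(chosen) + w_rejected * p(rejected)). The mixture weights `prior_weights`, the function names, and all numbers are hypothetical and are not taken from the paper.

```python
import math


def marginal_log_likelihood(log_probs, prior_weights):
    """Return log(sum_i w_i * exp(log_probs[i])), computed with log-sum-exp."""
    terms = [math.log(w) + lp for w, lp in zip(prior_weights, log_probs)]
    largest = max(terms)
    return largest + math.log(sum(math.exp(t - largest) for t in terms))


def gradient_weights(log_probs, prior_weights):
    """Derivative of the marginal log-likelihood w.r.t. each log_probs[i].

    Each weight equals w_i * p_i / sum_j (w_j * p_j), so the weights are
    positive and sum to one.
    """
    total = marginal_log_likelihood(log_probs, prior_weights)
    return [
        math.exp(math.log(w) + lp - total)
        for w, lp in zip(prior_weights, log_probs)
    ]


# Hypothetical sequence log-likelihoods for [chosen, rejected] responses.
log_probs = [-12.0, -11.5]
# Hypothetical mixture weights (an assumption, not the paper's definition).
prior_weights = [0.9, 0.1]

weights = gradient_weights(log_probs, prior_weights)
print("gradient weight on chosen:  ", weights[0])
print("gradient weight on rejected:", weights[1])
```

In this toy setting the chosen response receives the larger gradient weight even though its log-likelihood is slightly lower, because its assumed mixture weight is larger. Whether MMPO's weighting takes this exact form cannot be determined from the truncated abstract.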