Towards Bridging the Reward-Generation Gap in Direct Alignment Algorithms
Zeguan Xiao, Yun Chen, Guanhua Chen, Ke Tang

TL;DR
This paper identifies a mismatch between the training objectives of direct alignment algorithms (DAAs) and autoregressive decoding, which the authors call the reward-generation gap. It proposes POET, a simple training method intended to bring training closer to how the model actually generates text.
Contribution
The paper introduces Prefix-Oriented Equal-length Training (POET), a simple method that reduces the reward-generation gap in DAAs by truncating the longer response in each preference pair to the length of the shorter one.
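The truncation step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `truncate_pair_to_equal_length` is hypothetical, and it assumes each response is already tokenized into a list of token IDs with the prompt excluded.

```python
from typing import List, Tuple


def truncate_pair_to_equal_length(
    chosen_ids: List[int], rejected_ids: List[int]
) -> Tuple[List[int], List[int]]:
    """Cut both responses of a preference pair to the shorter one's length.

    Hypothetical helper for illustration. Inputs are assumed to be
    response-only token IDs (prompt tokens are not included).
    """
    # Length of the shorter response in the pair.
    target_len = min(len(chosen_ids), len(rejected_ids))
    # Keep only the leading tokens (the prefix) of each response.
    return chosen_ids[:target_len], rejected_ids[:target_len]


if __name__ == "__main__":
    chosen = [11, 12, 13, 14, 15]
    rejected = [21, 22, 23]
    short_chosen, short_rejected = truncate_pair_to_equal_length(chosen, rejected)
    print(short_chosen, short_rejected)
```

In a DAA training pipeline, a step like this would presumably run on each preference pair before the loss (for example DPO or SimPO) is computed, so that both responses contribute the same number of tokens.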
Findings
POET improves DPO and SimPO performance by up to 11.8 points on AlpacaEval 2.
POET achieves overall improvements across downstream tasks.
Addressing the reward-generation gap is crucial for better alignment in DAAs.
Abstract
Direct Alignment Algorithms (DAAs), such as Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO), have emerged as efficient alternatives to Reinforcement Learning from Human Feedback (RLHF) algorithms for aligning large language models (LLMs) with human preferences. However, DAAs suffer from a fundamental limitation we identify as the "reward-generation gap", a discrepancy between training objectives and autoregressive decoding dynamics. In this paper, we consider that one contributor to the reward-generation gap is the mismatch between the inherent importance of prefix tokens during the LLM generation process and how this importance is reflected in the implicit reward functions of DAAs. To bridge the gap, we adopt a token-level MDP perspective of DAAs to analyze its limitations and introduce a simple yet effective approach called Prefix-Oriented Equal-length…
