Towards Bridging the Reward-Generation Gap in Direct Alignment Algorithms

Zeguan Xiao; Yun Chen; Guanhua Chen; Ke Tang

arXiv:2506.09457·cs.CL·April 17, 2026

TL;DR

This paper identifies a "reward-generation gap" in direct alignment algorithms (DAAs) such as DPO and SimPO, a mismatch between their training objectives and autoregressive decoding, and proposes POET, a simple training method that narrows this gap and improves alignment performance.

Contribution

The paper introduces Prefix-Oriented Equal-length Training (POET), a simple method that reduces the reward-generation gap in DAAs by truncating both the preferred and dispreferred responses in each pair to the length of the shorter one.
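The truncation step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `poet_truncate` and the use of plain lists of token IDs are assumptions made here, and the sketch assumes the lists hold response tokens only (the prompt is excluded).

```python
from typing import List, Tuple


def poet_truncate(chosen: List[int], rejected: List[int]) -> Tuple[List[int], List[int]]:
    """Cut both responses of a preference pair to the shorter one's length.

    `chosen` and `rejected` are hypothetical names for the token IDs of the
    preferred and dispreferred responses (prompt tokens not included).
    """
    # Length of the shorter response in the pair.
    keep = min(len(chosen), len(rejected))
    # Keep only the first `keep` tokens (the prefix) of each response.
    return chosen[:keep], rejected[:keep]


if __name__ == "__main__":
    preferred = [1, 2, 3, 4, 5]
    dispreferred = [9, 8, 7]
    print(poet_truncate(preferred, dispreferred))  # ([1, 2, 3], [9, 8, 7])
```

The truncated pair would then be passed to an unchanged DPO or SimPO loss, so the preprocessing is the only part that differs from a standard pipeline in this sketch.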

Findings

1. POET improves DPO and SimPO performance by up to 11.8 points in AlpacaEval 2.

2. POET achieves overall improvements across downstream tasks.

3. Addressing the reward-generation gap is crucial for better alignment in DAAs.

Abstract

Direct Alignment Algorithms (DAAs), such as Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO), have emerged as efficient alternatives to Reinforcement Learning from Human Feedback (RLHF) algorithms for aligning large language models (LLMs) with human preferences. However, DAAs suffer from a fundamental limitation we identify as the "reward-generation gap", a discrepancy between training objectives and autoregressive decoding dynamics. In this paper, we consider that one contributor to the reward-generation gap is the mismatch between the inherent importance of prefix tokens during the LLM generation process and how this importance is reflected in the implicit reward functions of DAAs. To bridge the gap, we adopt a token-level MDP perspective of DAAs to analyze its limitations and introduce a simple yet effective approach called Prefix-Oriented Equal-length…
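For reference, the implicit rewards mentioned in the abstract have the following standard forms in the DPO and SimPO formulations. The notation is an assumption made here rather than taken from this page: \(\pi_\theta\) is the policy being trained, \(\pi_{\mathrm{ref}}\) the reference model, \(\beta\) a scaling coefficient, and \(|y|\) the response length. The DPO reward is shown up to a prompt-only term that cancels when two responses to the same prompt are compared.

```latex
\begin{align}
r_{\mathrm{DPO}}(x, y)
  &= \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
   = \beta \sum_{t=1}^{|y|}
     \log \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\mathrm{ref}}(y_t \mid x, y_{<t})}, \\
r_{\mathrm{SimPO}}(x, y)
  &= \frac{\beta}{|y|} \log \pi_\theta(y \mid x)
   = \frac{\beta}{|y|} \sum_{t=1}^{|y|} \log \pi_\theta(y_t \mid x, y_{<t}).
\end{align}
```

Written token by token, both sums give every position \(t\) the same weight, which is one way to read the mismatch with prefix importance that the abstract describes.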
