Earlier Tokens Contribute More: Learning Direct Preference Optimization From Temporal Decay Perspective

Ruichen Shao; Bei Li; Gangao Liu; Yang Chen; Xiang Zhou; Jingang Wang; Xunliang Cai; Peng Li

arXiv:2502.14340·cs.CL·February 21, 2025

TL;DR

This paper introduces a temporal decay mechanism into Direct Preference Optimization (DPO) to prioritize earlier tokens in sequences, improving alignment and performance of large language models across various benchmarks.

Contribution

The paper proposes a temporal decay factor for DPO that weights each token's reward by its position in the sequence, addressing length bias and improving model alignment.

Findings

01. Outperforms vanilla DPO by 5.9-8.8 points on AlpacaEval 2
02. Improves scores by 3.3-9.7 points on Arena-Hard
03. Enhances performance on mathematical and reasoning benchmarks

Abstract

Direct Preference Optimization (DPO) has gained attention as an efficient alternative to reinforcement learning from human feedback (RLHF) for aligning large language models (LLMs) with human preferences. Despite its advantages, DPO suffers from a length bias, generating responses longer than those from the reference model. Existing solutions like SimPO and SamPO address this issue but uniformly treat the contribution of rewards across sequences, overlooking temporal dynamics. To this end, we propose an enhanced preference optimization method that incorporates a temporal decay factor controlled by a gamma parameter. This dynamic weighting mechanism adjusts the influence of each reward based on its position in the sequence, prioritizing earlier tokens that are more critical for alignment. By adaptively focusing on more relevant feedback, our approach mitigates overfitting to less…
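
The position-dependent weighting described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the exponential form gamma**t, the function names, and the default values of beta and gamma below are all assumptions made for illustration and should not be read as the authors' implementation.

```python
import math


def decayed_log_ratio(policy_logps, ref_logps, gamma):
    """Sum per-token log-ratios, weighting position t by gamma**t.

    The exponential weight is an assumed form of the temporal decay factor.
    """
    if len(policy_logps) != len(ref_logps):
        raise ValueError("policy and reference sequences must have equal length")
    return sum(
        (gamma ** t) * (p - r)
        for t, (p, r) in enumerate(zip(policy_logps, ref_logps))
    )


def decayed_dpo_loss(chosen_policy, chosen_ref, rejected_policy, rejected_ref,
                     beta=0.1, gamma=0.98):
    """DPO-style loss on one preference pair with decayed token rewards.

    beta and gamma defaults are placeholders, not values from the paper.
    """
    margin = beta * (
        decayed_log_ratio(chosen_policy, chosen_ref, gamma)
        - decayed_log_ratio(rejected_policy, rejected_ref, gamma)
    )
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical per-token log-probabilities for one preference pair.
loss = decayed_dpo_loss(
    chosen_policy=[-0.5, -0.7, -0.9], chosen_ref=[-1.0, -1.0, -1.0],
    rejected_policy=[-1.2, -1.1, -1.0], rejected_ref=[-1.0, -1.0, -1.0],
)
print(loss)
```

With gamma set to 1 the weights are all equal and the sketch reduces to the usual sequence-level sum of log-ratios used in vanilla DPO; with gamma below 1, a log-ratio difference at an early position moves the loss more than the same difference at a later position.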

Code & Models

Repositories

lotusrc/d2po (PyTorch, official)

Videos

Earlier Tokens Contribute More: Learning Direct Preference Optimization From Temporal Decay Perspective · SlidesLive

Taxonomy

Topics: Game Theory and Voting Systems · Constraint Satisfaction and Optimization

Methods: Softmax · Attention Is All You Need · Direct Preference Optimization