TGDPO: Harnessing Token-Level Reward Guidance for Enhancing Direct Preference Optimization

Mingkang Zhu; Xi Chen; Zhongdao Wang; Bei Yu; Hengshuang Zhao; Jiaya Jia

arXiv:2506.14574 · cs.LG · June 18, 2025

TL;DR

This paper introduces TGDPO, a method that uses token-level reward guidance to improve Direct Preference Optimization (DPO), yielding win-rate gains over standard DPO when aligning large language models with human preferences.

Contribution

The work decomposes sequence-level PPO into token-level problems and derives a framework for token-level reward guidance in DPO, enabling more fine-grained policy optimization.

Findings

01. Improves win rate by up to 7.5 points on MT-Bench

02. Demonstrates significant performance gains over standard DPO

03. Provides a practical reward guidance framework for token-level optimization

Abstract

Recent advancements in reinforcement learning from human feedback have shown that utilizing fine-grained token-level reward models can substantially enhance the performance of Proximal Policy Optimization (PPO) in aligning large language models. However, it is challenging to leverage such token-level reward as guidance for Direct Preference Optimization (DPO), since DPO is formulated as a sequence-level bandit problem. To address this challenge, this work decomposes the sequence-level PPO into a sequence of token-level proximal policy optimization problems and then frames the problem of token-level PPO with token-level reward guidance, from which closed-form optimal token-level policy and the corresponding token-level reward can be derived. Using the obtained reward and Bradley-Terry model, this work establishes a framework of computable loss functions with token-level reward guidance…
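To make the sequence-level versus token-level distinction concrete, here is a minimal sketch of the standard DPO loss under the Bradley-Terry model, written so that the sequence log-ratio is visibly a sum of per-token terms. The optional `chosen_weights` and `rejected_weights` arguments are a hypothetical illustration of where a per-token signal could enter. They are an assumption made for this example and are not the loss derived in the paper. All function names are likewise invented for illustration.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def sequence_log_ratio(policy_logps, ref_logps, weights=None):
    """Sum over tokens of w_t * (log pi(a_t|s_t) - log pi_ref(a_t|s_t)).

    With weights=None every token counts equally, which recovers the
    sequence-level log-ratio used by standard DPO.
    """
    if weights is None:
        weights = [1.0] * len(policy_logps)
    if not (len(policy_logps) == len(ref_logps) == len(weights)):
        raise ValueError("per-token inputs must have the same length")
    return sum(w * (p - r) for w, p, r in zip(weights, policy_logps, ref_logps))


def dpo_loss(chosen, rejected, beta=0.1,
             chosen_weights=None, rejected_weights=None):
    """Bradley-Terry negative log-likelihood for one preference pair.

    chosen and rejected are (policy_logps, ref_logps) tuples of per-token
    log-probabilities. The *_weights arguments are hypothetical per-token
    weights (an assumption of this sketch, not the paper's formulation).
    """
    margin = beta * (
        sequence_log_ratio(chosen[0], chosen[1], chosen_weights)
        - sequence_log_ratio(rejected[0], rejected[1], rejected_weights)
    )
    return -log_sigmoid(margin)


# Toy per-token log-probabilities (made-up numbers).
chosen = ([-0.5, -1.0], [-0.7, -1.2])
rejected = ([-1.5, -0.9], [-1.0, -0.8])

uniform = dpo_loss(chosen, rejected)
reweighted = dpo_loss(chosen, rejected, chosen_weights=[2.0, 1.0])
print(uniform, reweighted)
```

With uniform weights the loss depends only on the two sequence totals, which is the bandit view the abstract refers to. Non-uniform weights let individual tokens contribute differently to the preference margin.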

Code & Models

Repositories

dvlab-research/tgdpo (PyTorch, official)

Taxonomy

Topics: Formal Methods in Verification

Methods: Proximal Policy Optimization · Direct Preference Optimization