TGDPO: Harnessing Token-Level Reward Guidance for Enhancing Direct Preference Optimization
Mingkang Zhu, Xi Chen, Zhongdao Wang, Bei Yu, Hengshuang Zhao, Jiaya Jia

TL;DR
This paper introduces TGDPO, a method that leverages token-level reward guidance to improve Direct Preference Optimization, resulting in significant performance gains in aligning large language models with human preferences.
Contribution
The work decomposes sequence-level PPO into token-level problems and derives a framework for token-level reward guidance in DPO, enabling more fine-grained policy optimization.
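For background, the sequence-level problem being decomposed can be written in the standard form used to derive DPO. The block below is a sketch of that well-known sequence-level result only, not of the paper's token-level derivation. The symbols (reward $r$, reference policy $\pi_{\mathrm{ref}}$, KL coefficient $\beta$, partition function $Z$) follow common DPO notation and are assumptions, since the summary above does not define them.

```latex
% KL-regularized objective (sequence level)
\max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  - \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)

% Closed-form optimal policy
\pi^{*}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\Big( \tfrac{1}{\beta}\, r(x, y) \Big)

% Reward expressed through the optimal policy
r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)
```

According to the summary, the paper derives analogous closed-form expressions at the token level, with token-level reward guidance added to the problem.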
Findings
Achieves win rate gains of up to 7.5 points over DPO on MT-Bench
Demonstrates significant performance gains over standard DPO
Provides a practical reward guidance framework for token-level optimization
Abstract
Recent advancements in reinforcement learning from human feedback have shown that utilizing fine-grained token-level reward models can substantially enhance the performance of Proximal Policy Optimization (PPO) in aligning large language models. However, it is challenging to leverage such token-level reward as guidance for Direct Preference Optimization (DPO), since DPO is formulated as a sequence-level bandit problem. To address this challenge, this work decomposes the sequence-level PPO into a sequence of token-level proximal policy optimization problems and then frames the problem of token-level PPO with token-level reward guidance, from which closed-form optimal token-level policy and the corresponding token-level reward can be derived. Using the obtained reward and Bradley-Terry model, this work establishes a framework of computable loss functions with token-level reward guidance…
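To make the "sequence-level bandit" point concrete, here is a minimal sketch of the standard DPO loss under the Bradley-Terry model, written so that the per-token log-probabilities are visible. It is not the TGDPO loss, which the truncated abstract does not spell out. The function names, the toy inputs, and the default `beta` are illustrative assumptions.

```python
import math


def sequence_log_ratio(policy_token_logps, ref_token_logps):
    """Sum of per-token log(pi / pi_ref) over one response.

    Standard DPO only ever uses this sum, so every token in a
    response contributes with the same weight.
    """
    return sum(p - r for p, r in zip(policy_token_logps, ref_token_logps))


def dpo_loss(chosen_policy, chosen_ref, rejected_policy, rejected_ref, beta=0.1):
    """Standard sequence-level DPO loss for one preference pair.

    Under the Bradley-Terry model, P(chosen > rejected) = sigmoid(margin),
    and the loss is -log of that probability.
    """
    margin = beta * (
        sequence_log_ratio(chosen_policy, chosen_ref)
        - sequence_log_ratio(rejected_policy, rejected_ref)
    )
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Toy per-token log-probabilities (hypothetical numbers).
untrained = dpo_loss([-1.0, -2.0], [-1.0, -2.0], [-0.5], [-0.5])
improved = dpo_loss([-0.5, -1.0], [-1.0, -2.0], [-1.5], [-0.5])
```

When the policy equals the reference model, the margin is zero and the loss equals log 2, as in `untrained`. Raising the chosen response's log-probabilities and lowering the rejected one's, as in `improved`, reduces the loss. Because only the summed log-ratio enters the loss, a token-level reward signal has no direct place in this formulation, which is the difficulty the abstract describes.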
Taxonomy
Topics: Formal Methods in Verification
Methods: Proximal Policy Optimization · Direct Preference Optimization
