TIS-DPO: Token-level Importance Sampling for Direct Preference Optimization With Estimated Weights

Aiwei Liu; Haoping Bai; Zhiyun Lu; Yanchao Sun; Xiang Kong; Simon Wang; Jiulong Shan; Albin Madappally Jose; Xiaojiang Liu; Lijie Wen; Philip S. Yu; Meng Cao

arXiv:2410.04350·cs.CL·April 16, 2025

TL;DR

This paper introduces TIS-DPO, a token-level importance sampling method for Direct Preference Optimization (DPO) in LLMs. It weights each token by its estimated importance, which the authors report improves alignment and performance on multiple tasks.

Contribution

It proposes a novel token-level importance sampling approach for DPO, estimating token importance weights using contrastive LLMs to enhance optimization efficiency and alignment results.
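As a rough illustration of how contrastive LLMs could yield token weights, the sketch below assumes two models, one biased toward preferred outputs and one toward dispreferred outputs, that each return per-token log-probabilities for the same response. The function name, the clamped log-ratio form, and the parameters `mu`, `lo`, and `hi` are assumptions made for this example; the summary above does not state the exact estimator.

```python
import math

def estimate_token_weights(logp_pos, logp_neg, is_winning, mu=1.0, lo=-1.0, hi=1.0):
    """Hypothetical sketch: per-token weights from two contrastive models.

    logp_pos / logp_neg: per-token log-probabilities of the same response
    under the "positive" and "negative" model (assumed inputs).
    """
    # Winning responses up-weight tokens the positive model prefers;
    # losing responses up-weight tokens the negative model prefers.
    sign = 1.0 if is_winning else -1.0
    weights = []
    for lp, ln in zip(logp_pos, logp_neg):
        # Clamped log-ratio used as a proxy for the token-level reward.
        score = max(lo, min(hi, lp - ln))
        weights.append(math.exp(sign * mu * score))
    return weights
```

Clamping keeps a single extreme log-ratio from dominating the weights; whether and how the paper bounds its estimates is not stated in this summary.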

Findings

01. TIS-DPO outperforms baseline methods on alignment tasks.

02. Estimated token weights effectively identify key token positions.

03. The method improves optimization efficiency and model alignment quality.

Abstract

Direct Preference Optimization (DPO) has been widely adopted for preference alignment of Large Language Models (LLMs) due to its simplicity and effectiveness. However, DPO is derived as a bandit problem in which the whole response is treated as a single arm, ignoring the importance differences between tokens, which may affect optimization efficiency and make it difficult to achieve optimal results. In this work, we propose that the optimal data for DPO has equal expected rewards for each token in winning and losing responses, as there is no difference in token importance. However, since the optimal dataset is unavailable in practice, we propose using the original dataset for importance sampling to achieve unbiased optimization. Accordingly, we propose a token-level importance sampling DPO objective named TIS-DPO that assigns importance weights to each token based on its reward. Inspired…
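To make the idea of a token-weighted objective concrete, here is a minimal sketch of a DPO-style loss in which each token's log-ratio against the reference model is scaled by an importance weight. It is a simplified illustration, not the paper's exact TIS-DPO objective: the function name, argument layout, and `beta` default are assumptions, and any additional terms in the published loss are omitted. With all weights equal to 1 it reduces to the standard sequence-level DPO loss.

```python
import math

def weighted_dpo_loss(pol_w, ref_w, wts_w, pol_l, ref_l, wts_l, beta=0.1):
    """Hypothetical sketch of a token-weighted DPO loss for one preference pair.

    pol_* / ref_*: per-token log-probs under the policy and reference model.
    wts_*: per-token importance weights (assumed to be given).
    Suffix _w is the winning response, _l the losing response.
    """
    # Weighted sum of per-token log-ratios for each response.
    margin_w = sum(w * (p - r) for w, p, r in zip(wts_w, pol_w, ref_w))
    margin_l = sum(w * (p - r) for w, p, r in zip(wts_l, pol_l, ref_l))
    x = -beta * (margin_w - margin_l)
    # -log(sigmoid(-x)) = softplus(x), computed in a numerically stable way.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

In a training loop the per-token log-probabilities would come from the policy and a frozen reference model, and the weights would be treated as constants, so gradients flow only through the policy log-probabilities.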

Taxonomy

Topics: Data Management and Algorithms

Methods: Direct Preference Optimization