Optimal Transport-Based Token Weighting scheme for Enhanced Preference Optimization
Meng Li, Guangda Huzhang, Haibo Zhang, Xiting Wang, Anxiang Zeng

TL;DR
This paper introduces OTPO, a novel token weighting scheme based on optimal transport, to improve preference optimization in LLMs by emphasizing meaningful tokens and reducing noise influence.
Contribution
It proposes an adaptive, context-aware token weighting method using optimal transport to enhance preference optimization in large language models.
Findings
OTPO improves instruction-following performance.
It makes preference models more interpretable.
It increases reward stability during training.
Abstract
Direct Preference Optimization (DPO) has emerged as a promising framework for aligning Large Language Models (LLMs) with human preferences by directly optimizing the log-likelihood difference between chosen and rejected responses. However, existing methods assign equal importance to all tokens in the response, while humans focus on more meaningful parts. This leads to suboptimal preference optimization, as irrelevant or noisy tokens disproportionately influence DPO loss. To address this limitation, we propose Optimal Transport-based token weighting scheme for enhancing direct Preference Optimization (OTPO). By emphasizing semantically meaningful token pairs and de-emphasizing less relevant ones, our method introduces a context-aware token weighting scheme that yields a more contrastive reward difference estimate. This adaptive weighting enhances…
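To make the idea of token weighting concrete, here is a minimal NumPy sketch of how optimal-transport weights could enter a DPO-style loss. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the cost function, the transport formulation, or the normalization. The sketch assumes a cosine-distance cost between token hidden states, a generic entropic optimal transport problem with KL-relaxed marginals (so the marginals of the plan are non-uniform and can serve as weights), and weights rescaled to sum to the sequence length. The function names `unbalanced_sinkhorn`, `token_weights`, and `weighted_dpo_loss` and the parameters `eps`, `rho`, and `beta` are hypothetical.

```python
import numpy as np


def unbalanced_sinkhorn(cost, eps=0.1, rho=1.0, n_iter=200):
    """Entropic OT with KL-relaxed marginals (generic scaling iterations).

    Assumption: uniform reference marginals; eps is the entropic
    regularization strength and rho the marginal-relaxation strength.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)
    b = np.full(m, 1.0 / m)
    K = np.exp(-cost / eps)          # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    power = rho / (rho + eps)        # exponent < 1 softens the marginal constraints
    for _ in range(n_iter):
        u = (a / (K @ v)) ** power
        v = (b / (K.T @ u)) ** power
    return u[:, None] * K * v[None, :]   # transport plan, shape (n, m)


def token_weights(h_chosen, h_rejected, eps=0.1, rho=1.0):
    """Per-token weights from the marginals of the transport plan.

    h_chosen: (n, d) and h_rejected: (m, d) token hidden states (assumed inputs).
    """
    hc = h_chosen / (np.linalg.norm(h_chosen, axis=1, keepdims=True) + 1e-12)
    hr = h_rejected / (np.linalg.norm(h_rejected, axis=1, keepdims=True) + 1e-12)
    cost = 1.0 - hc @ hr.T           # cosine distance, in [0, 2]
    plan = unbalanced_sinkhorn(cost, eps=eps, rho=rho)
    w_c = plan.sum(axis=1)           # mass kept by each chosen token
    w_r = plan.sum(axis=0)           # mass kept by each rejected token
    # Assumption: rescale so uniform weights would recover the unweighted sum.
    w_c = w_c * len(w_c) / w_c.sum()
    w_r = w_r * len(w_r) / w_r.sum()
    return w_c, w_r


def weighted_dpo_loss(logratio_c, logratio_r, w_c, w_r, beta=0.1):
    """DPO-style loss with per-token weights.

    logratio_*: per-token log(pi_theta / pi_ref) for each response.
    With all weights equal to 1 this reduces to the standard DPO loss.
    """
    margin = beta * (np.dot(w_c, logratio_c) - np.dot(w_r, logratio_r))
    return np.logaddexp(0.0, -margin)    # equals -log(sigmoid(margin))


# Usage with random stand-in data (no real model involved).
rng = np.random.default_rng(0)
h_c, h_r = rng.normal(size=(5, 8)), rng.normal(size=(7, 8))
w_c, w_r = token_weights(h_c, h_r)
loss = weighted_dpo_loss(rng.normal(size=5), rng.normal(size=7), w_c, w_r)
```

In this sketch, a token whose hidden state has a close counterpart in the other response keeps more transport mass and therefore receives a larger weight. Whether that matches the weighting direction used in OTPO is not stated in the abstract.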
Taxonomy
Topics: DNA and Biological Computing
Methods: Focus · Direct Preference Optimization
