Selective Preference Optimization via Token-Level Reward Function Estimation
Kailai Yang, Zhiwei Liu, Qianqian Xie, Jimin Huang, Erxue Min, Sophia Ananiadou

TL;DR
SePO introduces an efficient token-level preference optimization method that selects key tokens based on a learned reward function, significantly improving alignment performance with reduced computational cost.
Contribution
The paper presents SePO, a novel token selection strategy using Direct Preference Optimization to improve large language model alignment efficiently.
Findings
Outperforms baseline methods while optimizing only the 30% of tokens selected as key tokens.
Enables weak oracle models to supervise larger policy models effectively.
Enhances out-of-distribution token selection and reduces over-optimization.
Abstract
Recent advancements in large language model alignment leverage token-level supervision to perform fine-grained preference optimization. However, existing token-level alignment methods either optimize on all available tokens, which can be noisy and inefficient, or perform selective training with complex and expensive key token selection strategies. In this work, we propose Selective Preference Optimization (SePO), a novel selective alignment strategy that centers on efficient key token selection. SePO proposes the first token selection method based on Direct Preference Optimization (DPO), which trains an oracle model to estimate a token-level reward function on the target data. This method applies to any existing alignment datasets with response-level annotations and enables cost-efficient token selection with small-scale oracle models and training data. The estimated reward function is…
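The selection step described in the abstract can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation. It assumes the token-level reward takes the standard DPO form, beta times the log-probability ratio between the oracle and a reference model. It also assumes that key tokens are the highest-reward tokens in chosen responses and the lowest-reward tokens in rejected responses. The function names, the `beta` default, and the log-probability values are hypothetical; the 30% ratio is taken from the findings above.

```python
import math

def token_rewards(oracle_logps, ref_logps, beta=1.0):
    """Per-token DPO-style reward (assumed form): beta * (log pi_oracle - log pi_ref)."""
    if len(oracle_logps) != len(ref_logps):
        raise ValueError("log-prob sequences must have equal length")
    return [beta * (o - r) for o, r in zip(oracle_logps, ref_logps)]

def select_key_tokens(rewards, ratio=0.3, highest=True):
    """Return a 0/1 mask keeping the top (or bottom) `ratio` of tokens by reward."""
    k = max(1, math.ceil(ratio * len(rewards)))
    order = sorted(range(len(rewards)), key=lambda i: rewards[i], reverse=highest)
    keep = set(order[:k])
    return [1 if i in keep else 0 for i in range(len(rewards))]

# Hypothetical per-token log-probabilities for a 10-token response.
oracle = [-1.0, -0.5, -2.0, -0.25, -3.0, -0.75, -1.5, -0.25, -2.5, -1.0]
ref = [-1.25, -1.5, -2.0, -1.0, -2.0, -0.75, -1.0, -1.75, -2.25, -1.0]

rewards = token_rewards(oracle, ref)

# Assumed selection rule: highest rewards if the response is "chosen",
# lowest rewards if it is "rejected". Only masked tokens would enter the loss.
chosen_mask = select_key_tokens(rewards, ratio=0.3, highest=True)
rejected_mask = select_key_tokens(rewards, ratio=0.3, highest=False)

print(rewards)
print(chosen_mask)
print(rejected_mask)
```

In a real pipeline the log-probabilities would come from forward passes of the oracle and reference models over the same token sequence, which is why the reviews note that the oracle and policy must share a tokenizer.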
Peer Reviews
Decision·ICLR 2025 Conference Withdrawn Submission
SePO offers a cost-efficient alignment strategy by focusing on a subset of high-reward tokens, which reduces annotation costs. The method demonstrates better performance on several benchmarks, surpassing existing token-level and response-level alignment methods. SePO’s weak-to-strong generalization enables effective supervision from smaller, weaker oracle models, showing scalability across varying model sizes.
1. The method is limited by the requirement that oracle and policy models share the same vocabulary and tokenizer, which reduces flexibility across different model architectures. 2. The use of the DPO reward format as an automated credit assignment behaviour has been attempted by other works, and the paper's contribution is weaker as it only quantifies the results of this assignment to the weights of the DPO loss. 3. Suppose the confidence given by the Oracle model is used as the gold label for the
This paper introduces a novel token-level reward function estimator using DPO. SePO reduces the need for extensive token optimization, demonstrating improved alignment performance while training on only 30% of tokens. This is valuable for scaling LLMs and reducing computational overhead. The weak-to-strong generalization capability of SePO allows smaller models to supervise larger ones.
The experiments primarily involve relatively moderate-sized models. Testing SePO on stronger models, such as LLaMA2-Chat-70B, would provide further insights into its scalability and potential bottlenecks, especially for the weak-to-strong generalization experiment. Compared to other methods, the improvement seems to be slight.
- The idea is clear and novel. - The reported results indicate the promise of the approach.
- The proof of Theorem 1, which asserts that after training a DPO, the reward function can be expressed as a decoupled reward $\hat{r}$, inherits this property (Line 810) from the assumption that the reward can be written in such a manner (Assumption 1). This raises the question of whether all reward functions can be expressed in a decoupled way. From a naive perspective, a decoupled reward is not normalized, and longer texts might have larger absolute values of reward. In my attempts to learn r
The strengths of this paper are listed as follows: 1. This paper observes that the total reward of a generated utterance is usually dominated by a few tokens. This observation is interesting and motivates the method well. 2. This paper proposes a token-selection-based training method, which is new and interesting to me. 3. The experiments are comprehensive and the results look good.
My concerns are listed as follows: 1. My major concern is about the token selection mechanism. The motivation behind using $\hat{r}(s_t,a_t)$ as the proxy of the reward is unclear to me. Theorem 1 only proves that $\sum \hat{r} (s_t, a_t) + V^{*}(s_1) = \sum r(s_t, a_t)$, which only guarantees that the sum of $\hat{r}$ and the sum of $r$ are the same (up to a constant). However, the value distribution of $r$ and $\hat{r}$ might still be drastically different. Therefore, the token selection based
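The reviewer's point about matching sums can be illustrated with a toy example. The numbers below are invented for illustration and say nothing about how SePO's estimated rewards behave in practice. They only show that two per-token reward sequences can satisfy the sum identity quoted from Theorem 1 while ranking tokens differently. The names `r_true`, `r_hat`, `v_star`, and `top_k_indices` are hypothetical.

```python
def top_k_indices(values, k):
    """Indices of the k largest values, returned in ascending index order."""
    ranked = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return sorted(ranked[:k])

# Hypothetical per-token rewards for a 5-token response.
r_true = [0.0, 0.0, 3.0, 0.0, 1.0]  # stands in for r(s_t, a_t)
r_hat = [2.0, 1.0, 0.0, 0.0, 0.0]   # stands in for the estimated reward
v_star = 1.0                         # stands in for the constant V*(s_1)

# The sum identity holds for this construction...
sums_match = sum(r_hat) + v_star == sum(r_true)

# ...yet the top-2 tokens under each reward are disjoint.
top_true = top_k_indices(r_true, 2)
top_hat = top_k_indices(r_hat, 2)

print(sums_match, top_true, top_hat)
```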
Taxonomy
Topics: Advanced Database Systems and Queries
