Adaptive Batch-Wise Sample Scheduling for Direct Preference Optimization
Zixuan Huang, Yikun Ban, Lean Fu, Xiaojie Li, Zhongxiang Dai, Jianxin Li, Deqing Wang

TL;DR
This paper introduces SamS, an adaptive sample scheduling algorithm for DPO that dynamically selects training samples based on the model's evolving state, significantly enhancing LLM alignment performance.
Contribution
The paper proposes SamS, a novel adaptive sample scheduling method for DPO that improves training efficiency and model alignment without altering the core DPO algorithm.
Findings
SamS improves performance across multiple tasks.
Integrating SamS into DPO incurs minimal additional computational cost.
Sample scheduling based on model feedback enhances generalization.
Abstract
Direct Preference Optimization (DPO) has emerged as an effective approach for aligning large language models (LLMs) with human preferences. However, its performance is highly dependent on the quality of the underlying human preference data. To address this bottleneck, prior work has explored various data selection strategies, but these methods often overlook the impact of the evolving states of the language model during the optimization process. In this paper, we introduce a novel problem: Sample Scheduling for DPO, which aims to dynamically and adaptively schedule training samples based on the model's evolving batch-wise states throughout preference optimization. To solve this problem, we propose SamS, an efficient and effective algorithm that adaptively selects samples in each training batch based on the LLM's learning feedback to maximize the potential generalization performance.…
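To make the idea of batch-wise sample scheduling concrete, here is a minimal sketch in plain Python. It computes the standard per-pair DPO loss and then keeps only the top-k pairs of a batch for the update. The scoring rule used here (rank by current DPO loss) is an assumption chosen purely for illustration; the abstract does not specify how SamS scores samples, so this should not be read as the authors' method. The names `PreferencePair`, `dpo_loss`, and `schedule_batch` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    # Summed token log-probabilities of each response under the
    # trainable policy and the frozen reference model.
    policy_chosen: float
    policy_rejected: float
    ref_chosen: float
    ref_rejected: float


def dpo_loss(pair: PreferencePair, beta: float = 0.1) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(margin))."""
    margin = beta * (
        (pair.policy_chosen - pair.ref_chosen)
        - (pair.policy_rejected - pair.ref_rejected)
    )
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def schedule_batch(batch: list[PreferencePair], k: int, beta: float = 0.1) -> list[int]:
    """Return indices of the k pairs to train on in this step.

    ASSUMPTION: samples are ranked by their current DPO loss. This is a
    stand-in for "learning feedback", not the scoring rule used by SamS.
    """
    scores = [dpo_loss(p, beta) for p in batch]
    ranked = sorted(range(len(batch)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


batch = [
    PreferencePair(-10.0, -10.0, -10.0, -10.0),  # margin 0
    PreferencePair(-5.0, -10.0, -10.0, -10.0),   # policy already prefers chosen
    PreferencePair(-15.0, -10.0, -10.0, -10.0),  # policy prefers rejected
]
selected = schedule_batch(batch, k=2)
```

Because the scores depend on the policy's current log-probabilities, the selected subset changes as training progresses, which is the property that distinguishes scheduling from one-off data selection. In a real training loop, only the selected pairs would contribute to the gradient step.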
Taxonomy
Topics: Advanced Bandit Algorithms Research
