Adaptive Batch-Wise Sample Scheduling for Direct Preference Optimization

Zixuan Huang; Yikun Ban; Lean Fu; Xiaojie Li; Zhongxiang Dai; Jianxin Li; Deqing Wang

arXiv:2506.17252·cs.LG·March 10, 2026

Open Access

TL;DR

This paper introduces SamS, an adaptive sample scheduling algorithm for DPO that dynamically selects training samples based on the model's evolving state, significantly enhancing LLM alignment performance.

Contribution

The paper proposes a novel adaptive sample scheduling method, SamS, for DPO that improves training efficiency and model alignment without altering the core algorithm.

Findings

01. SamS improves performance across multiple tasks.

02. Integration of SamS requires minimal additional computational cost.

03. Sample scheduling based on model feedback enhances generalization.

Abstract

Direct Preference Optimization (DPO) has emerged as an effective approach for aligning large language models (LLMs) with human preferences. However, its performance is highly dependent on the quality of the underlying human preference data. To address this bottleneck, prior work has explored various data selection strategies, but these methods often overlook the impact of the evolving states of the language model during the optimization process. In this paper, we introduce a novel problem: Sample Scheduling for DPO, which aims to dynamically and adaptively schedule training samples based on the model's evolving batch-wise states throughout preference optimization. To solve this problem, we propose SamS, an efficient and effective algorithm that adaptively selects samples in each training batch based on the LLM's learning feedback to maximize the potential generalization performance.…
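
To make the idea of batch-wise scheduling concrete, here is a minimal sketch in Python. It is not the SamS algorithm, whose scoring rule is not given in this summary. It assumes a stand-in heuristic: score each preference pair by its current DPO loss and keep the highest-scoring fraction of the batch. The function names (`dpo_loss`, `schedule_batch`), the `keep_ratio` parameter, and the default `beta=0.1` are all hypothetical and chosen only for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard per-sample DPO loss: -log(sigmoid(beta * margin)).

    The margin is the policy-vs-reference log-ratio of the chosen response
    minus the same log-ratio for the rejected response.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    x = beta * margin
    # -log(sigmoid(x)) == softplus(-x), written in a numerically stable form.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def schedule_batch(batch, score_fn, keep_ratio=0.5):
    """Return the indices of the samples kept for this training step.

    ASSUMPTION: a plain top-k by score stands in for the paper's scheduler.
    `score_fn` maps one sample to a number computed from the model's
    current state, so the selection changes as the model trains.
    """
    k = max(1, int(len(batch) * keep_ratio))
    ranked = sorted(range(len(batch)),
                    key=lambda i: score_fn(batch[i]),
                    reverse=True)
    return sorted(ranked[:k])


# Each sample: (policy_chosen, policy_rejected, ref_chosen, ref_rejected)
# sequence log-probabilities. The values are made up for the example.
batch = [
    (-1.0, -2.0, -1.5, -1.5),  # policy already prefers the chosen response
    (-2.0, -1.0, -1.5, -1.5),  # policy prefers the rejected response
    (-1.5, -1.5, -1.5, -1.5),  # policy is indifferent
    (-1.0, -3.0, -1.5, -1.5),  # policy strongly prefers the chosen response
]
kept = schedule_batch(batch, lambda s: dpo_loss(*s), keep_ratio=0.5)
```

With this heuristic, `kept` holds the two pairs the current policy handles worst. Because the scores depend on the policy's log-probabilities, the same batch can yield a different selection later in training, which is the "evolving state" aspect the abstract emphasizes. The DPO loss itself is unchanged; only the choice of which samples contribute to each step differs.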

Taxonomy

Topics: Advanced Bandit Algorithms Research