OPTune: Efficient Online Preference Tuning

Lichang Chen; Jiuhai Chen; Chenxi Liu; John Kirchenbauer; Davit Soselia; Chen Zhu; Tom Goldstein; Tianyi Zhou; Heng Huang

arXiv:2406.07657·cs.LG·June 13, 2024

Open Access · 5 Reviews

TL;DR

OPTune introduces an efficient online preference tuning method for large language models that dynamically samples and reweights responses to improve alignment speed and quality without relying on human-curated data.

Contribution

The paper proposes a novel data exploration and reweighting strategy for online preference tuning that enhances training efficiency and maintains alignment quality.

Findings

01 · Achieves 1.27-1.56x faster training speed.

02 · Maintains instruction-following capabilities.

03 · Does not rely on human-curated responses.

Abstract

Reinforcement learning with human feedback (RLHF) is critical for aligning Large Language Models (LLMs) with human preference. Compared to the widely studied offline version of RLHF, e.g., direct preference optimization (DPO), recent works have shown that the online variants achieve even better alignment. However, online alignment requires on-the-fly generation of new training data, which is costly, hard to parallelize, and suffers from varying quality and utility. In this paper, we propose a more efficient data exploration strategy for online preference tuning (OPTune), which does not rely on human-curated or pre-collected teacher responses but dynamically samples informative responses for on-policy preference alignment. During data generation, OPTune only selects prompts whose (re)generated responses can potentially provide more informative and higher-quality training signals…
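The prompt-selection step mentioned in the abstract can be illustrated with a minimal sketch. It assumes, following the reviewers' descriptions on this page, that prompts are ranked by the reward of their latest response and that a fraction $\rho$ of the lowest-reward prompts is regenerated. The function name `select_prompts_to_regenerate`, the rounding rule, and the toy reward values are hypothetical and are not taken from the paper.

```python
from typing import List


def select_prompts_to_regenerate(rewards: List[float], rho: float) -> List[int]:
    """Return the indices of the lowest-reward prompts, keeping a fraction rho.

    Assumption: rewards[i] is a reward-model score for the current response
    to prompt i. This is an illustrative sketch, not the paper's procedure.
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must be in [0, 1]")
    # Number of prompts to regenerate (rounding rule is an assumption).
    k = round(rho * len(rewards))
    # Rank prompt indices by reward, lowest first.
    ranked = sorted(range(len(rewards)), key=lambda i: rewards[i])
    # Return the selected indices in their original order.
    return sorted(ranked[:k])


# Toy rewards for five prompts (made-up values).
rewards = [0.9, -1.2, 0.3, -0.4, 1.5]
print(select_prompts_to_regenerate(rewards, rho=0.4))  # -> [1, 3]
```

With `rho=0.4`, only the two lowest-scoring prompts are regenerated; the remaining three would reuse their earlier responses, which is where the generation-cost saving would come from under these assumptions.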

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 5 · Confidence 4

Strengths

1. OPTune achieves notable computational savings in data generation and training, reducing costs for online RLHF while preserving alignment quality.
2. By focusing on low-reward prompts, OPTune avoids unnecessary regeneration, which is a pragmatic approach to improve efficiency.
3. Using a weighted DPO loss turns binary signals into dense signals, improving alignment by prioritizing high-utility samples.
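To make the "binary to dense signal" point concrete, here is a minimal sketch of a reward-weighted DPO loss for a single preference pair. The standard DPO term is the usual negative log-sigmoid of the scaled log-ratio margin. The weighting function, a sigmoid of the reward gap, is a hypothetical choice made for illustration only; this page does not state the exact weight the paper uses.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(margin)


def weighted_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                      reward_w, reward_l, beta=0.1):
    """DPO loss scaled by a weight that grows with the reward gap.

    Assumption: weight = sigmoid(reward_w - reward_l). This is a
    hypothetical weighting, not necessarily the one used by OPTune.
    """
    weight = 1.0 / (1.0 + math.exp(-(reward_w - reward_l)))
    return weight * dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta)


# Same log-probabilities, different reward gaps (made-up values).
small_gap = weighted_dpo_loss(-5.0, -6.0, -5.5, -5.5, reward_w=0.2, reward_l=0.1)
large_gap = weighted_dpo_loss(-5.0, -6.0, -5.5, -5.5, reward_w=2.0, reward_l=-1.0)
print(small_gap < large_gap)  # -> True
```

Under this assumed weighting, a pair whose responses differ strongly in reward contributes more to the gradient than a pair with nearly tied rewards, whereas plain DPO treats both pairs identically.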

Weaknesses

1. The choice of the ratio of re-generated prompts, $\rho$, can be a key factor in OPTune. Though the authors conduct experiments with different values of $\rho$, they do not provide direct insights on how to choose $\rho$ to balance efficiency and performance.
2. Online DPO (without weighted loss) should be the most closely related baseline for this paper. Though some experiments are conducted, the authors do not sufficiently evaluate OPTune's superiority over online DPO.
3. In Table 3, the perform

Reviewer 02 · Rating 5 · Confidence 3

Strengths

The paper is well-written and well-organized. Considering that online DPO takes more time than the original offline method, improving its efficiency is of great significance.

Weaknesses

Given that iterative DPO often uses different prompts in different iterations [1] to avoid overfitting or overoptimization [2], it is not clear how the proposed method can be used in such scenarios. The performance of the models corresponding to different selection ratios in Table 2 differs little and is generally low, which does not demonstrate the effectiveness of the method.
References: [1] Meng Y, Xia M, Chen D. SimPO: Simple preference optimization with a reference-free reward[C]

Reviewer 03 · Rating 5 · Confidence 4

Strengths

- Shows, across multiple experiments, that the proposed strategy outperforms a random selection strategy.

Weaknesses

- Lack of relevant baselines on sample selection: a pretty common strategy in RLHF is to pick prompts that had the largest "margin" between the winner and the loser for further training (e.g. https://arxiv.org/abs/2404.03715). Could you compare your strategy against this technique?
- Lack of relevant baselines on policy optimization: a variety of papers have already noted that IPO / DPO ignore the gap in reward between the winning and losing completions. Could you compare against at least one o

Reviewer 04 · Rating 3 · Confidence 5

Strengths

The writing is relatively clear.

Weaknesses

1. Lack of Innovation. Over the past year, the alignment community has proposed numerous methods similar to those used in this paper. As early as the Llama 2 Technical Report, the approach of directly incorporating the score difference between two responses into the loss function was introduced. Although the Llama 2 Technical Report is cited in the Related Work section, there is no comparative discussion with Llama 2 or other similar works in Section 3.2.
2. Incomplete Experiments and Lack of Ana

Reviewer 05 · Rating 3 · Confidence 3

Strengths

1. The experiment is performed on a 7B model, and the result suggests that, with the right regeneration ratio, the proposed method indeed improves training efficiency without decreasing performance.
2. The experiment includes a reasonable comparison with random subselection and shows that random subselection does not work as well as the proposed method.

Weaknesses

1. It is unclear to me that ranking the prompts by absolute reward makes sense, especially if the reward model is trained with a BT loss. For each fixed prompt, the BT loss only cares about the difference between two responses, so different prompts may induce different biases in the rewards of the corresponding completions. Thus, having a low reward does not necessarily mean that the model is currently performing badly on the prompt. Honestly I might be describing the procedure wrong because I don't see a clear de
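The reviewer's point about the Bradley-Terry (BT) loss can be checked numerically with a small sketch. For one preference pair, the BT negative log-likelihood depends only on the reward difference, so adding any prompt-specific constant to both rewards leaves the loss unchanged. The function name `bt_loss` and the numeric values are illustrative assumptions; nothing here is taken from the paper's reward model.

```python
import math


def bt_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood for one preference pair.

    Equals -log(sigmoid(r_chosen - r_rejected)) = log(1 + exp(-diff)).
    """
    diff = r_chosen - r_rejected
    return math.log1p(math.exp(-diff))


# Made-up rewards for two responses to the same prompt.
base = bt_loss(1.0, 0.2)

# Shift both rewards by an arbitrary per-prompt offset.
offset = 5.0
shifted = bt_loss(1.0 + offset, 0.2 + offset)

# The loss is identical up to floating-point rounding.
print(math.isclose(base, shifted))  # -> True
```

Because the training objective cannot pin down this per-prompt offset, absolute reward values may not be comparable across prompts, which is the concern raised about ranking prompts by raw reward.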

Taxonomy

Topics · Algorithms and Data Compression · Multimedia Communication and Technology · Data Management and Algorithms

Methods · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings