Is On-Policy Data always the Best Choice for Direct Preference Optimization-based LM Alignment?

Zetian Sun; Dongfang Li; Xuhui Chen; Baotian Hu; Min Zhang

arXiv:2508.10530·cs.AI·January 28, 2026

TL;DR

This paper investigates the effectiveness of on-policy versus static preference data in language model alignment, revealing that on-policy data is not always optimal and proposing a stage-based framework for better alignment strategies.

Contribution

It introduces the alignment stage assumption, dividing the process into preference injection and fine-tuning stages, and develops an algorithm to identify optimal data boundaries for improved LM alignment.

Findings

01. On-policy data can be markedly more or less effective than static data depending on the model: roughly $3\times$ as effective for Llama-3, but only $0.4\times$ as effective for Zephyr.

02. The proposed boundary measurement algorithm identifies the transition point between the two alignment stages.

03. Experimental results across multiple models and methods support the generality of the alignment stage assumption.

Abstract

The alignment of language models (LMs) with human preferences is critical for building reliable AI systems. The problem is typically framed as optimizing an LM policy to maximize the expected reward that reflects human preferences. Recently, Direct Preference Optimization (DPO) was proposed as an LM alignment method that directly optimizes the policy from static preference data, and it was further improved by incorporating on-policy sampling (i.e., preference candidates generated during the training loop) for better LM alignment. However, we show that on-policy data is not always optimal, with systematic differences in effectiveness emerging between static and on-policy preference candidates. For example, on-policy data can be $3\times$ as effective as static data for Llama-3, but only $0.4\times$ as effective for Zephyr. To explain the phenomenon, we propose the alignment stage…
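
As a point of reference for the abstract, the standard per-pair DPO objective can be sketched as below. This is a minimal illustration of the widely used DPO loss, not the authors' code: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for this example. Static and on-policy training would typically use the same loss and differ only in where the chosen/rejected pair comes from (a fixed dataset versus samples drawn from the current policy during training).

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    Each *_logp argument is the summed log-probability of a full response
    under the trainable policy or the frozen reference model.
    All names and the default beta are illustrative assumptions.
    """
    # Implicit reward margin: how much more the policy (relative to the
    # reference) prefers the chosen response over the rejected one.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # Numerically stable evaluation of -log(sigmoid(z)) = log(1 + exp(-z)).
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# With identical policy and reference, the margin is zero and the loss is log 2.
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
```

The loss falls below log 2 once the policy favors the chosen response more strongly than the reference does, and rises above it in the opposite case.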

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- Taking the initiative on the varied results we see in the preference optimization literature is good to see. The breakdown of the preference optimization process at least makes sense qualitatively, if not theoretically.
- The boundary area measurement provides a single quantitative metric that really distills the core of the paper.
- The research questions are validated on multiple model families.

Weaknesses

- While the boundary area measurement is a good starting point, I still think the proposition given in the paper lacks practical application. The boundary area measurement is easy to calculate in retrospect, after we have the models and all versions of the preference data, but doesn't seem so easy when decisions need to be made on-the-fly.

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The paper investigates the learning dynamics of preference learning, which are understudied. The findings are novel and could be useful to practitioners working on human preference learning.
2. The paper is well written, with comprehensive experiments demonstrating the generalizability of its findings. The finding that off-policy DPO followed by on-policy DPO over two iterations provides a simple recipe for people to try out.
3. The discussion between on- vs. off-policy data could

Weaknesses

1. It seems that people are not really excited about DPO / RLHF anymore. For example, the latest open-source frontier models (Qwen3, GLM 4.5, Kimi-K2...) only adopt an RL process using rubric rewards and verifiable rewards on math / coding tasks. This is a good and interesting paper; I just don't know how much impact a DPO paper would make in 2025. It would be better if the authors could show that a curriculum of on- and off-policy data also generalizes to RLVR experiments.
2. The findings

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. The writing is clear and easy to follow.
2. Research on the problem of using on-policy data for DPO is interesting.
3. The authors provide a great literature review of related work.

Weaknesses

1. I have some concerns about the models used to measure diversity. As mentioned in the paper, the authors use Zephyr to measure diversity. To further confirm this conclusion, it would be better for the authors to check whether these diversity patterns are similar for different models.
2. This method can be used to recognize the training stage of LLMs. Does it have the potential to make DPO training more efficient (e.g., in convergence speed)?
3. I find that the authors conduct some experiments on DPO and SLi
