SIMPLEMIX: Frustratingly Simple Mixing of Off- and On-policy Data in Language Model Preference Learning

Tianjian Li; Daniel Khashabi

arXiv:2505.02363·cs.CL·May 6, 2025


TL;DR

This paper introduces SIMPLEMIX, a straightforward method that combines on-policy and off-policy data to enhance language model alignment, demonstrating significant improvements across various tasks and benchmarks.

Contribution

SIMPLEMIX is a simple yet effective approach that systematically combines on-policy and off-policy preference data, outperforming more complex methods in language model alignment.

Findings

01

SIMPLEMIX improves over on-policy DPO and off-policy DPO by an average of 6.03% on Alpaca Eval 2.0.

02

SIMPLEMIX outperforms more complex methods for combining on- and off-policy data, such as HyPO and DPO-Mix-P, by an average of 3.05%.

03

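On-policy data is more effective for reasoning tasks such as math and coding, while off-policy data performs better on open-ended tasks such as creative writing and personal recommendations.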

Abstract

Aligning language models with human preferences relies on pairwise preference datasets. While some studies suggest that on-policy data consistently outperforms off-policy data for preference learning, others indicate that the advantages of on-policy data may be task-dependent, highlighting the need for a systematic exploration of their interplay. In this work, we show that on-policy and off-policy data offer complementary strengths in preference optimization: on-policy data is particularly effective for reasoning tasks like math and coding, while off-policy data performs better on open-ended tasks such as creative writing and making personal recommendations. Guided by these findings, we introduce SIMPLEMIX, an approach to combine the complementary strengths of on-policy and off-policy preference learning by simply mixing these two data sources. Our empirical results across diverse…
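The mixing idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mix_preference_data`, the `on_policy_fraction` parameter, its default of 0.5, and the uniform sampling are all assumptions made for illustration, since this summary does not state the mixing ratio or sampling procedure the paper uses.

```python
import random
from typing import Dict, List, Optional

# A preference pair is assumed to be a dict with "prompt", "chosen", "rejected" keys.
PreferencePair = Dict[str, str]


def mix_preference_data(
    on_policy: List[PreferencePair],
    off_policy: List[PreferencePair],
    on_policy_fraction: float = 0.5,  # hypothetical default, not taken from the paper
    total: Optional[int] = None,
    seed: int = 0,
) -> List[PreferencePair]:
    """Build one training set by sampling from both sources and shuffling."""
    if not 0.0 <= on_policy_fraction <= 1.0:
        raise ValueError("on_policy_fraction must lie in [0, 1]")
    if total is None:
        # Default size: as many pairs as the smaller source provides.
        total = min(len(on_policy), len(off_policy))

    n_on = round(total * on_policy_fraction)
    n_off = total - n_on
    if n_on > len(on_policy) or n_off > len(off_policy):
        raise ValueError("not enough pairs in one of the sources")

    rng = random.Random(seed)  # local RNG keeps the mix reproducible
    mixed = rng.sample(on_policy, n_on) + rng.sample(off_policy, n_off)
    rng.shuffle(mixed)  # interleave the two sources before training
    return mixed
```

The resulting list could then be passed to any pairwise preference trainer, such as a DPO implementation, in place of a purely on-policy or purely off-policy dataset.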

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Direct Preference Optimization