SIMPLEMIX: Frustratingly Simple Mixing of Off- and On-policy Data in Language Model Preference Learning
Tianjian Li, Daniel Khashabi

TL;DR
This paper introduces SIMPLEMIX, a straightforward method that mixes on-policy and off-policy preference data to improve language model alignment, with reported gains on Alpaca Eval 2.0 over more complex mixing methods such as HyPO and DPO-Mix-P.
Contribution
SIMPLEMIX is a simple yet effective approach that systematically combines on-policy and off-policy preference data, outperforming more complex methods in language model alignment.
Findings
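SIMPLEMIX improves alignment performance by 6.03% on Alpaca Eval 2.0.
It outperforms more complex methods for combining the two data sources, such as HyPO and DPO-Mix-P, by 3.05%.
On-policy data is more effective on reasoning tasks such as math and coding; off-policy data performs better on open-ended tasks such as creative writing and personal recommendations.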
Abstract
Aligning language models with human preferences relies on pairwise preference datasets. While some studies suggest that on-policy data consistently outperforms off-policy data for preference learning, others indicate that the advantages of on-policy data may be task-dependent, highlighting the need for a systematic exploration of their interplay. In this work, we show that on-policy and off-policy data offer complementary strengths in preference optimization: on-policy data is particularly effective for reasoning tasks like math and coding, while off-policy data performs better on open-ended tasks such as creative writing and making personal recommendations. Guided by these findings, we introduce SIMPLEMIX, an approach to combine the complementary strengths of on-policy and off-policy preference learning by simply mixing these two data sources. Our empirical results across diverse…
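The idea of "simply mixing these two data sources" can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mix_preference_data`, the `on_policy_fraction` parameter, the record layout (`prompt`, `chosen`, `rejected`), and the equal-sized sampling scheme are all assumptions made for illustration, since the abstract does not specify the mixing ratio or procedure.

```python
import random
from typing import Dict, List

# Hypothetical record layout: one pairwise preference example.
PreferencePair = Dict[str, str]  # keys: "prompt", "chosen", "rejected"


def mix_preference_data(
    on_policy: List[PreferencePair],
    off_policy: List[PreferencePair],
    on_policy_fraction: float = 0.5,  # assumed ratio, not taken from the paper
    seed: int = 0,
) -> List[PreferencePair]:
    """Build one training set by sampling from both preference sources.

    The output size is capped at the smaller source so that any fraction
    in [0, 1] can be satisfied without sampling with replacement.
    """
    if not 0.0 <= on_policy_fraction <= 1.0:
        raise ValueError("on_policy_fraction must be in [0, 1]")

    rng = random.Random(seed)
    total = min(len(on_policy), len(off_policy))
    n_on = round(total * on_policy_fraction)
    n_off = total - n_on

    # Tag each pair with its origin so the mix can be inspected later.
    mixed = [{**p, "source": "on_policy"} for p in rng.sample(on_policy, n_on)]
    mixed += [{**p, "source": "off_policy"} for p in rng.sample(off_policy, n_off)]
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    on = [{"prompt": f"q{i}", "chosen": "a", "rejected": "b"} for i in range(4)]
    off = [{"prompt": f"r{i}", "chosen": "c", "rejected": "d"} for i in range(6)]
    for pair in mix_preference_data(on, off):
        print(pair["source"], pair["prompt"])
```

The mixed list could then be passed to any preference-optimization trainer that accepts prompt/chosen/rejected triples; the `source` tag is only there to make the composition of the mix easy to audit.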
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Direct Preference Optimization
