When Data is the Algorithm: A Systematic Study and Curation of Preference Optimization Datasets
Aladin Djuhera, Farhan Ahmed, Swanand Ravindra Kadhe, Syed Zawad, Heiko Ludwig, Holger Boche

TL;DR
This paper conducts a comprehensive data-centric analysis of open-source preference datasets for LLM alignment, introduces a new curated dataset, UltraMix, and provides insights for improving preference-based fine-tuning.
Contribution
It systematically annotates and compares existing DPO datasets using the Magpie framework and creates UltraMix, a refined dataset that enhances performance while reducing size.
Findings
UltraMix outperforms individual datasets on key benchmarks.
Annotations reveal structural and qualitative differences in datasets.
UltraMix is 30% smaller yet more effective than the best single dataset.
Abstract
Aligning large language models (LLMs) is a central objective of post-training, often achieved through reward modeling and reinforcement learning methods. Among these, direct preference optimization (DPO) has emerged as a widely adopted technique that fine-tunes LLMs on preferred completions over less favorable ones. While most frontier LLMs do not disclose their curated preference pairs, the broader LLM community has released several open-source DPO datasets, including TuluDPO, ORPO, UltraFeedback, HelpSteer, and Code-Preference-Pairs. However, systematic comparisons remain scarce, largely due to the high computational cost and the lack of rich quality annotations, making it difficult to understand how preferences were selected, which task types they span, and how well they reflect human judgment on a per-sample level. In this work, we present the first comprehensive, data-centric…
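For readers unfamiliar with the objective the abstract refers to, the following is a minimal sketch of the standard per-pair DPO loss, not code from the paper. It assumes the summed token log-probabilities of the preferred ("chosen") and less favorable ("rejected") completions are already available under both the policy and a frozen reference model. The function name, argument names, and the default `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    All argument names are hypothetical; each is the summed log-probability
    of one completion under the policy or the frozen reference model.
    """
    # How much more the policy favors each completion than the reference does.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_shift - rejected_shift)

    # -log(sigmoid(z)) equals log(1 + exp(-z)); branch for numerical stability.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

The loss equals log 2 when the policy and reference agree, and it falls as the policy shifts probability toward the chosen completion relative to the rejected one. This is why the choice and quality of the preference pairs directly shape the training signal.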
Peer Reviews
Decision·ICLR 2026 Poster
1. Data engineering for post-training is often overlooked but important. The paper presents a nice study and comparison of how to curate data for DPO and demonstrates that the authors' recipe indeed improves performance. 2. The paper is well written and clear, with comprehensive analysis and experiments and solid results. 3. It would be a good contribution if the authors released their UltraMix dataset for the community to use and study.
1. My main concern is the novelty of this paper. The main takeaway is that data filtering is important: we need to select high-quality and difficult data. Such findings are not new to the field, and I don't know how much people would learn from reading this paper. The resulting dataset (UltraMix) itself might be more useful. 2. The quality filtering is conducted by an "independent reward model", in this case FsfairX. I don't know how much of the gains come from using this partic
- This paper presents a thorough and extensive analysis of five open-source DPO datasets, examining aspects such as task type and prompt quality. The study offers valuable insights—for instance, highlighting that poorly written instructions can lead to subpar preference alignment. - The experimental evaluation is comprehensive, employing different benchmarks to assess model performance. Furthermore, the authors validate their findings using different models, enhancing the reliability of the re
- Unclear Motivation. The claim that "quality annotations are mostly missing" is not entirely accurate, as widely-used datasets like UltraFeedback and HelpSteer do provide fine-grained, human-annotated preference scores. Improving these details will help readers understand the contribution of this paper better. - The current taxonomy requires refinement. Categorizing UltraFeedback and HelpSteer (lines 100-101) as "instruction-following" datasets is imprecise, as they are primarily preference da
The paper’s detailed dataset analysis and creation of UltraMix reflect a high level of thoroughness. The inclusion of annotations for quality, difficulty, and preference rewards provides a nuanced and comprehensive look at the DPO process.
1. Comparison of different datasets: While the comparison of different datasets in Table 1 is useful, it does not account for size differences between datasets. For example, TuluDPO has 273k pairs, while ORPO only has 44k. 2. Preference Signal Accuracy: The authors mention that current preference signals have only about 70% accuracy, especially in datasets like UltraFeedback. Since this dataset is annotated by GPT-4o, it would be expected to have a higher level of accuracy. The paper should clar
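One plausible reading of the "about 70% accuracy" figure raised in the last review is an agreement rate: the fraction of pairs in which an independent reward model scores the chosen completion above the rejected one. The sketch below illustrates that reading only. It is an assumption about the metric, not the paper's stated definition, and the function name and input layout are hypothetical.

```python
def preference_agreement(reward_pairs):
    """Fraction of pairs where the chosen completion out-scores the rejected one.

    reward_pairs: list of (chosen_reward, rejected_reward) tuples, a
    hypothetical layout for scores produced by some external reward model.
    Ties count as disagreement.
    """
    if not reward_pairs:
        raise ValueError("reward_pairs must not be empty")
    agreeing = sum(1 for chosen, rejected in reward_pairs if chosen > rejected)
    return agreeing / len(reward_pairs)
```

For example, `preference_agreement([(1.0, 0.5), (0.2, 0.9), (0.7, 0.1), (0.3, 0.3)])` counts two agreeing pairs out of four.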
Taxonomy
Topics: Machine Learning and Data Classification · Recommender Systems and Techniques · Explainable Artificial Intelligence (XAI)
