TL;DR
The paper introduces Alignment Data Map, a data analysis tool for selecting effective preference data, so that aligned language models can be trained on fewer samples at lower annotation cost; it also flags likely label misannotations.
Contribution
It proposes a novel data analysis method that identifies effective preference data and detects label misannotations, enhancing alignment training efficiency.
Findings
Training on only 33% of the samples, those with high quality and low variability, achieves comparable or better alignment performance.
Alignment Data Map detects label misannotations by analyzing label-score correlations.
Experimental results on MT-Bench, Evol-Instruct, and AlpacaEval demonstrate the improved data efficiency.
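As an illustration of the misannotation finding, the sketch below flags preference pairs whose alignment scores disagree with their labels. This summary does not state the paper's actual detection criterion, so the rule used here (the rejected response outscores the chosen one by more than a margin) is an assumption, and the function name, the `margin` parameter, and the input layout are hypothetical.

```python
def flag_possible_misannotations(pairs, margin=0.0):
    """Return indices of pairs whose label disagrees with the scores.

    pairs: list of (chosen_score, rejected_score) tuples, one per
    preference pair (hypothetical input layout).
    margin: how much the rejected response must outscore the chosen one
    before the pair is flagged (assumed threshold, not from the paper).
    """
    flagged = []
    for index, (chosen_score, rejected_score) in enumerate(pairs):
        # A "chosen" response scoring below the "rejected" one suggests
        # the label may have been swapped.
        if rejected_score - chosen_score > margin:
            flagged.append(index)
    return flagged


example_pairs = [(8.0, 3.0), (4.0, 7.5), (6.0, 6.0), (2.0, 9.0)]
print(flag_possible_misannotations(example_pairs))
```

A flagged pair is only a candidate for re-inspection; a score disagreement can also come from a noisy scorer rather than a wrong label.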
Abstract
Human preference data is essential for aligning large language models (LLMs) with human values, but collecting such data is often costly and inefficient, motivating the need for efficient data selection methods that reduce annotation costs while preserving alignment effectiveness. To address this issue, we propose Alignment Data Map, a data analysis tool for identifying and selecting effective preference data. We first evaluate alignment scores of the preference data using LLM-as-a-judge, explicit reward model, and reference-based approaches. The Alignment Data Map considers both response quality and inter-response variability based on the alignment scores. From our experimental findings, training on only 33% of samples that exhibit high quality and low variability achieves comparable or superior alignment performance on MT-Bench, Evol-Instruct, and AlpacaEval, compared to training with…
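The selection step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes quality is the mean alignment score of a sample's responses, variability is their population standard deviation, and the two are combined by summing ranks. The function name, input layout, and combination rule are all hypothetical.

```python
from statistics import mean, pstdev


def select_high_quality_low_variability(scores_per_sample, fraction=0.33):
    """Return indices of the selected samples, best first.

    scores_per_sample: list of lists; each inner list holds the alignment
    scores of the responses for one prompt (hypothetical input layout).
    fraction: share of samples to keep (0.33 mirrors the 33% in the text).
    """
    # Assumed definitions: quality = mean score, variability = std of scores.
    stats = [(mean(scores), pstdev(scores)) for scores in scores_per_sample]
    n = len(stats)

    # Rank 0 = highest quality.
    by_quality = sorted(range(n), key=lambda i: -stats[i][0])
    # Rank 0 = lowest variability.
    by_variability = sorted(range(n), key=lambda i: stats[i][1])

    quality_rank = {i: rank for rank, i in enumerate(by_quality)}
    variability_rank = {i: rank for rank, i in enumerate(by_variability)}

    # Assumed combination rule: sum of the two ranks, lower is better;
    # ties are broken by sample index for determinism.
    combined = sorted(
        range(n), key=lambda i: (quality_rank[i] + variability_rank[i], i)
    )
    keep = max(1, round(fraction * n)) if n else 0
    return combined[:keep]


example_scores = [[9, 9], [2, 8], [3, 3], [8, 7], [1, 9], [5, 5]]
print(select_high_quality_low_variability(example_scores))  # -> [0, 3]
```

Rank summation is used here only because it needs no tuning; thresholding each axis separately would be an equally plausible reading of "high quality and low variability".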
