TL;DR
In the context of strong pre-trained language models, active preference learning offers minimal benefits over simple random sampling for data selection, with negligible improvements in proxy win-rates and no mitigation of capability degradation.
Contribution
This paper critically evaluates the effectiveness of active preference learning in large language models, showing that in strong-prior regimes it offers little or no advantage over simple random sampling.
Findings
Active preference learning yields negligible improvements over random sampling.
Win-rate improves even as overall model capability degrades.
Active preference learning does not significantly reduce variance or mitigate capability collapse.
Abstract
Modern LLMs inherit strong priors from web-scale pretraining, which can limit the headroom of post-training data-selection strategies. While Active Preference Learning (APL) seeks to optimize query efficiency in online Direct Preference Optimization (DPO), the inherent richness of on-policy candidate pools often renders simple Random sampling a surprisingly formidable baseline. We evaluate uncertainty-based APL against Random across harmlessness, helpfulness, and instruction-following settings, utilizing both reward models and LLM-as-a-judge proxies. We find that APL yields negligible improvements in proxy win-rates compared to Random. Crucially, we observe a dissociation where win-rate improves even as general capability -- measured by standard benchmarks -- degrades. APL fails to mitigate this capability collapse or reduce variance significantly better than random sampling. Our…
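To make the comparison concrete, here is a minimal sketch of the two selection strategies the abstract contrasts. The abstract does not say which uncertainty measure the paper uses, so the acquisition rule below is an assumption: it ranks candidate response pairs by the absolute DPO implicit reward margin, on the reasoning that a margin near zero means the policy is close to indifferent between the two responses. The names `CandidatePair`, `implicit_margin`, `select_uncertain`, and `select_random`, and the default `beta=0.1`, are all hypothetical and not taken from the paper.

```python
import random
from dataclasses import dataclass


@dataclass
class CandidatePair:
    """Two on-policy responses (a, b) to one prompt, with sequence log-probs."""
    prompt_id: int
    logp_policy_a: float
    logp_ref_a: float
    logp_policy_b: float
    logp_ref_b: float


def implicit_margin(pair: CandidatePair, beta: float = 0.1) -> float:
    """DPO implicit reward of response a minus that of response b."""
    reward_a = pair.logp_policy_a - pair.logp_ref_a
    reward_b = pair.logp_policy_b - pair.logp_ref_b
    return beta * (reward_a - reward_b)


def select_uncertain(pool, k, beta=0.1):
    """Assumed APL rule: keep the k pairs whose margin is closest to zero.

    The entropy of sigmoid(margin) decreases as |margin| grows, so sorting
    by |margin| gives the same ranking as sorting by predictive entropy.
    """
    return sorted(pool, key=lambda p: abs(implicit_margin(p, beta)))[:k]


def select_random(pool, k, seed=0):
    """Baseline: draw k pairs uniformly at random, without replacement."""
    return random.Random(seed).sample(pool, k)


pool = [
    CandidatePair(0, -10.0, -10.0, -12.0, -12.0),  # margin 0.0
    CandidatePair(1, -5.0, -10.0, -12.0, -12.0),   # margin 0.5
    CandidatePair(2, -10.0, -10.0, -9.0, -12.0),   # margin -0.3
    CandidatePair(3, -10.0, -12.0, -10.0, -11.9),  # margin about 0.01
]
to_label_apl = select_uncertain(pool, k=2)
to_label_random = select_random(pool, k=2, seed=0)
```

In an online DPO loop, either selected subset would then be sent to the reward model or LLM judge for preference labels before the next policy update. The two strategies differ only in which pairs receive labels.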
