TL;DR
This paper introduces Hölder-DPO, a novel alignment loss function with provable robustness to label noise, enabling scalable, automated valuation of human feedback and improved model alignment.
Contribution
It presents the first alignment method with a provable redescending property, allowing robust estimation from noisy human feedback and effective detection of mislabels.
Findings
Hölder-DPO achieves state-of-the-art robustness in alignment tasks.
It accurately detects and removes noisy labels in datasets.
Application to real-world data improves alignment performance.
Abstract
Despite the importance of aligning language models with human preferences, crowd-sourced human feedback is often noisy (for example, preferring less desirable responses), posing a fundamental challenge to alignment. A truly robust alignment objective should yield identical model parameters even under severe label noise, a property known as redescending. We prove that no existing alignment methods satisfy this property. To address this, we propose Hölder-DPO, the first principled alignment loss with a provable redescending property, enabling estimation of the clean data distribution from noisy feedback. The aligned model estimates the likelihood of clean data, providing a theoretically grounded metric for dataset valuation that identifies the location and fraction of mislabels. This metric is gradient-free, enabling scalable and automated human feedback valuation without costly…
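To give a rough intuition for the redescending idea mentioned in the abstract, the sketch below compares two per-example gradient weights as a function of a preference margin. It is an illustration under stated assumptions, not the paper's method. The abstract does not give the Hölder-DPO loss, so the sketch uses Tukey's biweight from classical robust statistics as a stand-in for a redescending influence function. The names `margin`, `beta`, `dpo_grad_weight`, and `tukey_biweight_psi` are chosen here for illustration, and the paper's formal definition of redescending may differ from this classical notion.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def dpo_grad_weight(margin: float, beta: float = 1.0) -> float:
    """Magnitude of d/d(margin) of the standard DPO loss -log(sigmoid(beta * margin)).

    `margin` stands for the implicit reward difference (chosen minus rejected).
    A very negative margin is what a flipped label tends to look like to a
    model that otherwise ranks the pair correctly.
    """
    return beta * sigmoid(-beta * margin)


def tukey_biweight_psi(r: float, c: float = 4.685) -> float:
    """Tukey's biweight influence function (stand-in, NOT the Hoelder-DPO loss).

    It returns exactly zero once |r| exceeds the cutoff c, so extreme
    outliers stop influencing the estimate.
    """
    if abs(r) > c:
        return 0.0
    return r * (1.0 - (r / c) ** 2) ** 2


if __name__ == "__main__":
    for m in (-50.0, -2.0, 0.0, 2.0):
        print(f"margin={m:6.1f}  dpo_weight={dpo_grad_weight(m):.4f}  "
              f"tukey_psi={tukey_biweight_psi(m):.4f}")
```

In this sketch the DPO weight approaches `beta` as the margin becomes very negative, so a badly mislabeled pair keeps pulling on the parameters, whereas the stand-in redescending function assigns such a pair zero influence.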