Scalable Valuation of Human Feedback through Provably Robust Model Alignment

Masahiro Fujisawa; Masaki Adachi; Michael A. Osborne

arXiv:2505.17859·cs.LG·October 27, 2025

TL;DR

This paper introduces Hölder-DPO, a novel alignment loss function with provable robustness to label noise, enabling scalable, automated valuation of human feedback and improved model alignment.

Contribution

It presents the first alignment method with a provable redescending property, allowing robust estimation from noisy human feedback and effective detection of mislabels.

Findings

01 Hölder-DPO achieves state-of-the-art robustness in alignment tasks.

02 It accurately detects and removes noisy labels in datasets.

03 Application to real-world data improves alignment performance.

Abstract

Despite the importance of aligning language models with human preferences, crowd-sourced human feedback is often noisy (for example, annotators may prefer the less desirable response), posing a fundamental challenge to alignment. A truly robust alignment objective should yield identical model parameters even under severe label noise, a property known as redescending. We prove that no existing alignment method satisfies this property. To address this, we propose Hölder-DPO, the first principled alignment loss with a provable redescending property, enabling estimation of the clean data distribution from noisy feedback. The aligned model estimates the likelihood of clean data, providing a theoretically grounded metric for dataset valuation that identifies the location and fraction of mislabels. This metric is gradient-free, enabling scalable and automated human feedback valuation without costly…
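
To make the idea of likelihood-based, gradient-free valuation concrete, here is a minimal sketch. It is not the paper's Hölder-DPO estimator, whose exact form is not given on this page. It assumes the standard DPO implied preference probability, sigmoid(beta * margin), as a stand-in likelihood, and it assumes that an already aligned policy and a reference model have produced sequence log-probabilities for each pair. The names `PairLogProbs`, `preference_probability`, `flag_suspected_mislabels`, and `estimated_noise_fraction`, and the 0.5 threshold, are hypothetical choices made for illustration.

```python
import math
from typing import List, NamedTuple


class PairLogProbs(NamedTuple):
    """Sequence log-probabilities for one preference pair (hypothetical container)."""
    policy_chosen: float    # log pi(y_chosen | x) under the aligned policy
    policy_rejected: float  # log pi(y_rejected | x) under the aligned policy
    ref_chosen: float       # log pi_ref(y_chosen | x)
    ref_rejected: float     # log pi_ref(y_rejected | x)


def sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def preference_probability(pair: PairLogProbs, beta: float = 0.1) -> float:
    """Standard DPO implied probability that the labeled 'chosen' response wins."""
    margin = (pair.policy_chosen - pair.ref_chosen) - (
        pair.policy_rejected - pair.ref_rejected
    )
    return sigmoid(beta * margin)


def flag_suspected_mislabels(
    pairs: List[PairLogProbs], beta: float = 0.1, threshold: float = 0.5
) -> List[int]:
    """Indices of pairs whose label the model finds unlikely (assumed threshold)."""
    return [
        i for i, p in enumerate(pairs)
        if preference_probability(p, beta) < threshold
    ]


def estimated_noise_fraction(
    pairs: List[PairLogProbs], beta: float = 0.1, threshold: float = 0.5
) -> float:
    """Fraction of pairs flagged as suspected mislabels."""
    if not pairs:
        return 0.0
    return len(flag_suspected_mislabels(pairs, beta, threshold)) / len(pairs)


# Toy data: the second pair looks flipped relative to the model's preference.
pairs = [
    PairLogProbs(-1.0, -3.0, -2.0, -2.0),
    PairLogProbs(-3.0, -1.0, -2.0, -2.0),
    PairLogProbs(-1.5, -2.5, -2.0, -2.0),
]
suspects = flag_suspected_mislabels(pairs)
noise = estimated_noise_fraction(pairs)
```

Only forward-pass log-probabilities are needed here, so no gradients are computed; this mirrors the gradient-free character described in the abstract, but the scoring rule itself is an assumption.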
