Pairwise Calibrated Rewards for Pluralistic Alignment
Daniel Halpern, Evi Micha, Ariel D. Procaccia, Itai Shapira

TL;DR
This paper introduces a method to model diverse human preferences in AI alignment by learning a distribution over multiple reward functions from pairwise preferences, improving calibration and representation of pluralistic values.
Contribution
The paper proposes pairwise calibration, a criterion for learning reward ensembles that reflect diverse human preferences without predefined groups or annotator IDs.
Findings
Improved calibration of reward ensembles to human preferences
Effective learning heuristic for training reward distributions
Accurate representation of pluralistic human values
Abstract
Current alignment pipelines presume a single, universal notion of desirable behavior. However, human preferences often diverge across users, contexts, and cultures. As a result, disagreement collapses into the majority signal and minority perspectives are discounted. To address this, we propose reflecting diverse human preferences through a distribution over multiple reward functions, each inducing a distinct aligned policy. The distribution is learned directly from pairwise preferences without annotator identifiers or predefined groups. Instead, annotator disagreements are treated as informative soft labels. Our central criterion is pairwise calibration: for every pair of candidate responses, the proportion of reward functions preferring one response matches the fraction of annotators with that preference. We prove that even a small outlier-free ensemble can accurately represent diverse…
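The pairwise calibration criterion described in the abstract can be illustrated with a small sketch. It compares, for each response pair, the weighted share of reward functions that prefer the first response against the fraction of annotators with the same preference. The function name `pairwise_calibration_error`, the array layout, and the use of a mean squared gap are assumptions made here for illustration; the abstract does not specify the exact error measure or training procedure the authors use.

```python
import numpy as np

def pairwise_calibration_error(rewards, weights, pairs, annotator_frac):
    """Mean squared gap between ensemble and annotator preference rates.

    rewards:        (k, n) array; rewards[j, i] is the score that reward
                    function j assigns to response i (hypothetical layout).
    weights:        (k,) probabilities of the reward functions, summing to 1.
    pairs:          sequence of (a, b) response-index pairs.
    annotator_frac: for each pair, the fraction of annotators preferring a to b.
    """
    rewards = np.asarray(rewards, dtype=float)
    weights = np.asarray(weights, dtype=float)
    a, b = np.asarray(pairs).T
    # prefers_a[j, m] is 1.0 when reward function j ranks a above b on pair m.
    prefers_a = (rewards[:, a] > rewards[:, b]).astype(float)
    # Weighted share of the ensemble that prefers a, per pair.
    ensemble_frac = weights @ prefers_a
    gap = ensemble_frac - np.asarray(annotator_frac, dtype=float)
    # Squared error is an assumed choice, not taken from the paper.
    return float(np.mean(gap ** 2))

# Two responses; two of three annotators prefer response 0.
pairs = [(0, 1)]
annotator_frac = [2 / 3]

# A three-member ensemble that mirrors the 2:1 split.
plural = pairwise_calibration_error(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]], [1 / 3, 1 / 3, 1 / 3],
    pairs, annotator_frac)

# A single reward function that sides with the majority.
single = pairwise_calibration_error([[1.0, 0.0]], [1.0], pairs, annotator_frac)
```

In this toy case the three-member ensemble matches the annotator split, so its error is essentially zero, while the single majority-aligned reward function leaves a gap of one third on the pair. This is the "collapse into the majority signal" that the abstract argues against.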
Taxonomy
Topics: Mobile Crowdsensing and Crowdsourcing · Recommender Systems and Techniques · Data Quality and Management
