Strong Preferences Affect the Robustness of Preference Models and Value Alignment

Ziwei Xu; Mohan Kankanhalli

arXiv:2410.02451 · cs.AI · March 11, 2025

TL;DR

This paper analyzes how small changes in some preference probabilities can significantly change the predictions of preference models used for value alignment in AI, highlighting potential safety concerns.

Contribution

It provides a theoretical analysis of the sensitivity of common preference models, revealing conditions under which they become highly sensitive to preference changes.

Findings

01

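In the Bradley-Terry and Plackett-Luce models, the predicted probability of one preference can change significantly when the probabilities of other preferences shift only slightly.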

02

Sensitivity is especially high when preferences are near certainty (0 or 1).

03

This sensitivity has implications for the robustness and safety of AI value alignment.
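
The findings above can be illustrated numerically. What follows is a minimal sketch, not the authors' analysis: it assumes a three-item Bradley-Terry model in which P(i preferred over j) is the logistic function of a score difference, so that P(1 over 3) is determined by P(1 over 2) and P(2 over 3). The function names (`implied_p13`, `sensitivity`) and the chosen probabilities are hypothetical and exist only for this example.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    """Logistic function, the inverse of logit."""
    return 1.0 / (1.0 + math.exp(-x))


def implied_p13(p12: float, p23: float) -> float:
    """P(1 over 3) implied by P(1 over 2) and P(2 over 3).

    Assumption: Bradley-Terry, where P(i over j) = sigmoid(s_i - s_j).
    Score differences add, so log-odds add: logit(p13) = logit(p12) + logit(p23).
    """
    return sigmoid(logit(p12) + logit(p23))


def sensitivity(p12: float, p23: float, eps: float = 1e-6) -> float:
    """Central finite-difference estimate of d p13 / d p12 (illustrative only)."""
    upper = implied_p13(p12 + eps, p23)
    lower = implied_p13(p12 - eps, p23)
    return (upper - lower) / (2.0 * eps)


if __name__ == "__main__":
    # Moderate preferences versus preferences close to certainty.
    for p12, p23 in [(0.5, 0.5), (0.1, 0.9), (0.01, 0.99)]:
        print(f"p12={p12}, p23={p23}: "
              f"p13={implied_p13(p12, p23):.4f}, "
              f"d p13 / d p12 = {sensitivity(p12, p23):.2f}")
```

Under these assumptions the derivative has the closed form p23(1 - p23) / (p12 p23 + (1 - p12)(1 - p23))^2. It equals 1 when both probabilities are 0.5 and grows to roughly 25 when p12 = 0.01 and p23 = 0.99, which is consistent with the claim that sensitivity is highest when preferences are near 0 or 1.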

Abstract

Value alignment, which aims to ensure that large language models (LLMs) and other AI agents behave in accordance with human values, is critical for ensuring safety and trustworthiness of these systems. A key component of value alignment is the modeling of human preferences as a representation of human values. In this paper, we investigate the robustness of value alignment by examining the sensitivity of preference models. Specifically, we ask: how do changes in the probabilities of some preferences affect the predictions of these models for other preferences? To answer this question, we theoretically analyze the robustness of widely used preference models by examining their sensitivities to minor changes in preferences they model. Our findings reveal that, in the Bradley-Terry and the Plackett-Luce models, the probability of a preference can change significantly as other preferences…

Videos

Strong Preferences Affect the Robustness of Preference Models and Value Alignment · SlidesLive

Taxonomy

Topics: Decision-Making and Behavioral Economics