A Revealed Preference Framework for AI Alignment
Elchin Suleymanov

TL;DR
This paper introduces the Luce Alignment Model, a revealed preference framework for assessing whether an AI agent implements its human principal's preferences or pursues its own, based on choices observed in laboratory and field settings.
Contribution
It develops a novel model that identifies AI-human preference alignment using revealed preference techniques in both laboratory and field scenarios.
Findings
AI alignment can be generically identified from observed choices.
Identification holds in both the laboratory setting, where human and AI choices are observed, and the field setting, where only AI choices are observed.
It provides a method to quantify AI's adherence to human preferences.
Abstract
Human decision makers increasingly delegate choices to AI agents, raising a natural question: does the AI implement the human principal's preferences or pursue its own? To study this question using revealed preference techniques, I introduce the Luce Alignment Model, where the AI's choices are a mixture of two Luce rules, one reflecting the human's preferences and the other the AI's. I show that the AI's alignment (similarity of human and AI preferences) can be generically identified in two settings: the laboratory setting, where both human and AI choices are observed, and the field setting, where only AI choices are observed.
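The mixture structure described in the abstract can be illustrated with a minimal sketch. It assumes the standard Luce rule, in which an option is chosen from a menu with probability proportional to its positive utility weight, and a single constant mixing weight on the human's rule. The function names, the `weight_human` parameter, and the utility values are hypothetical and chosen for illustration; the paper's own notation and parametrization may differ.

```python
def luce_probabilities(utilities, menu):
    """Luce rule: choice probability proportional to positive utility weights."""
    total = sum(utilities[x] for x in menu)
    return {x: utilities[x] / total for x in menu}


def mixture_probabilities(human_u, ai_u, weight_human, menu):
    """AI choice probabilities as a mixture of two Luce rules.

    weight_human is an assumed constant mixing weight in [0, 1] on the
    human's Luce rule; the remainder goes to the AI's own Luce rule.
    """
    human = luce_probabilities(human_u, menu)
    ai = luce_probabilities(ai_u, menu)
    return {
        x: weight_human * human[x] + (1.0 - weight_human) * ai[x]
        for x in menu
    }


# Illustrative (made-up) utility weights over three options.
human_u = {"a": 3.0, "b": 1.0, "c": 1.0}
ai_u = {"a": 1.0, "b": 1.0, "c": 2.0}

# Choice probabilities of the AI on the menu {a, b} with equal mixing.
probs = mixture_probabilities(human_u, ai_u, 0.5, ["a", "b"])
print(probs)
```

In this sketch, the laboratory setting corresponds to observing both the human's Luce choice frequencies and the mixture's frequencies across menus, while the field setting corresponds to observing only the mixture's frequencies.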
