Can Revealed Preferences Clarify LLM Alignment and Steering?
Khurram Yamin, Jingjing Tang, Eric Horvitz, Bryan Wilder

TL;DR
This paper introduces an empirical method to infer and evaluate the implicit preferences of LLMs in decision-making, enabling assessment of their goal coherence and steerability across medical domains.
Contribution
It presents a pipeline for estimating LLMs' implied preferences and demonstrates its use in evaluating model coherence and steerability in high-stakes decision tasks.
Findings
Models exhibit some internal goal coherence.
Models often fail to accurately report their own revealed preferences or to adopt user-specified ones.
Prompting can influence models' decision policies but with limitations.
Abstract
LLMs are increasingly used to make or support high-stakes decisions under uncertainty, where alignment depends not only on factual accuracy but on how models weigh tradeoffs between different outcomes. We present an empirical pipeline for estimating the implied preferences that an LLM's observed choices optimize: we elicit the model's probability distribution over unknowns along with the choice it would make for the decision task and then fit a discrete choice model to recover the cost function that best rationalizes the model's decisions. We show how this revealed-preference description allows rigorous evaluation of whether models behave in a consistently goal-directed way, whether they can verbalize a description of their objectives which matches their revealed decision policy, and whether prompting can reliably steer those policies to implement a user-specified cost function. We…
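To make the fitting step in the abstract concrete, here is a minimal sketch of recovering a cost function from (belief, choice) pairs. It is not the authors' implementation. It assumes the simplest possible setting: a binary unknown (condition present or absent), a binary action (act or do not act), and a logit choice rule. The function names, the synthetic data generator, and all numeric values are hypothetical; in practice `probs` and `choices` would hold the probabilities and decisions elicited from the LLM. Under these assumptions only the relative cost of a missed case versus an unnecessary action is identified, along with the belief threshold at which the fitted policy switches actions.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_binary_logit(probs, choices, steps=1500, lr=2.0):
    """Fit P(act | p) = sigmoid(w_fn * p - w_fp * (1 - p)) by gradient ascent.

    p is the stated probability that the condition is present.
    w_fn and w_fp are the costs of a missed case and of an unnecessary
    action, each multiplied by an unknown choice-precision factor, so
    only their ratio is interpretable.
    """
    w_fn, w_fp = 1.0, 1.0
    n = len(probs)
    for _ in range(steps):
        g_fn = g_fp = 0.0
        for p, a in zip(probs, choices):
            err = a - sigmoid(w_fn * p - w_fp * (1.0 - p))
            g_fn += err * p            # d(log-lik)/d(w_fn)
            g_fp += err * -(1.0 - p)   # d(log-lik)/d(w_fp)
        w_fn += lr * g_fn / n
        w_fp += lr * g_fp / n
    return w_fn, w_fp


def simulate(n, cost_fn, cost_fp, beta, seed=0):
    # Hypothetical stand-in for elicited LLM outputs: a noisy decision-maker
    # with known costs, used only to check that the fit recovers them.
    rng = random.Random(seed)
    probs, choices = [], []
    for _ in range(n):
        p = rng.random()
        utility_gap = beta * (cost_fn * p - cost_fp * (1.0 - p))
        probs.append(p)
        choices.append(1 if rng.random() < sigmoid(utility_gap) else 0)
    return probs, choices


probs, choices = simulate(n=2000, cost_fn=4.0, cost_fp=1.0, beta=2.0)
w_fn, w_fp = fit_binary_logit(probs, choices)

cost_ratio = w_fn / w_fp           # implied cost of a miss relative to a false alarm
threshold = w_fp / (w_fn + w_fp)   # belief at which the fitted policy is indifferent
print(f"implied cost ratio: {cost_ratio:.2f}, decision threshold: {threshold:.3f}")
```

With a fitted ratio in hand, the same quantities can be compared against what a model says its tradeoff is, or against a cost function supplied in the prompt, which is the kind of coherence and steerability check the abstract describes.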
