Alignment Revisited: Are Large Language Models Consistent in Stated and Revealed Preferences?
Zhuojun Gu, Quan Wang, Shuchu Han

TL;DR
This paper investigates the divergence between what large language models state as their preferences and how they actually behave in context, revealing significant variability that impacts trust and ethical deployment.
Contribution
It introduces a formal method to measure preference deviations in LLMs and demonstrates how minor prompt changes can significantly alter model choices across different preference categories.
Findings
An LLM's stated and revealed preferences often diverge depending on prompt format.
Minor prompt modifications can flip an LLM's choice.
Preference deviations appear across multiple LLMs.
Abstract
Recent advances in Large Language Models (LLMs) highlight the need to align their behaviors with human values. A critical, yet understudied, issue is the potential divergence between an LLM's stated preferences (its reported alignment with general principles) and its revealed preferences (inferred from decisions in contextualized scenarios). Such deviations raise fundamental concerns for the interpretability, trustworthiness, reasoning transparency, and ethical deployment of LLMs, particularly in high-stakes applications. This work formally defines and proposes a method to measure this preference deviation. We investigate how LLMs may activate different guiding principles in specific contexts, leading to choices that diverge from previously stated general principles. Our approach involves crafting a rich dataset of well-designed prompts as a series of forced binary choices and…
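As a rough illustration of the forced-binary-choice setup described in the abstract, the sketch below counts how often a model's choice in a contextualized scenario differs from the choice it stated for the matching general principle. The record layout, the function name `deviation_rates`, the category labels, and the simple mismatch-rate metric are all assumptions made for illustration; the paper's own formal definition of preference deviation is not reproduced in this summary and may differ.

```python
from collections import defaultdict


def deviation_rates(records):
    """Return the per-category fraction of scenarios where the revealed
    choice differs from the stated choice.

    `records` is an iterable of (category, stated_choice, revealed_choice)
    tuples, where each choice is one of two labels such as "A" or "B".
    This mismatch rate is a hypothetical stand-in, not the paper's metric.
    """
    totals = defaultdict(int)
    mismatches = defaultdict(int)
    for category, stated, revealed in records:
        totals[category] += 1
        if stated != revealed:
            mismatches[category] += 1
    return {cat: mismatches[cat] / totals[cat] for cat in totals}


# Toy data with made-up categories; not taken from the paper.
example = [
    ("risk", "A", "A"),
    ("risk", "A", "B"),
    ("moral", "B", "B"),
    ("moral", "B", "B"),
]
rates = deviation_rates(example)
print(rates)
```

A rate near 0 for a category would mean the contextualized choices match the stated principle in that category, while a higher rate would indicate more frequent divergence under this simplified measure.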
Taxonomy
Topics: Natural Language Processing Techniques
