To Whom Do Language Models Align? Measuring Principal Hierarchies Under High-Stakes Competing Demands
Fangyi Yu, Nabeel Seedat, Jonathan Richard Schwarz, Andrew M. Bean

TL;DR
This study investigates how language models prioritize conflicting demands from users, authorities, and norms in high-stakes settings, revealing frequent failures to uphold professional standards and unstable hierarchies across contexts.
Contribution
The paper provides empirical evidence, drawn from 7,136 legal and medical scenarios across ten frontier models, that current models often fail to adhere to professional standards when user instructions conflict with them, which points to weak alignment robustness.
Findings
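Models frequently fail to adhere to professional standards during task execution (such as drafting) when user instructions conflict with those standards, yet uphold them adequately when users seek advisory guidance.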
Hierarchies among users, authorities, and professional norms are unstable across domains and model types.
Knowledge omission is a primary failure mechanism leading to harmful outputs.
Abstract
Language models deployed in high-stakes professional settings face conflicting demands from users, institutional authorities, and professional norms. How models act when these demands conflict reveals a principal hierarchy -- an implicit ordering over competing stakeholders that determines, for instance, whether a medical AI receiving a cost-reduction directive from a hospital administrator complies at the expense of evidence-based care, or refuses because professional standards require it. Across 7,136 scenarios in legal and medical domains, we test ten frontier models and find that models frequently fail to adhere to professional standards during task execution, such as drafting, when user instructions conflict with those standards -- despite adequately upholding them when users seek advisory guidance. We further find that the hierarchies between user, authority, and professional…
