TL;DR
This paper systematically evaluates how well large language models enforce hierarchical instructions, revealing significant struggles and biases that challenge current control mechanisms and highlight the influence of societal hierarchies.
Contribution
It introduces a new evaluation framework for instruction hierarchy enforcement in LLMs and uncovers their limitations and biases in prioritizing instructions.
Findings
Models struggle with instruction prioritization.
System/user prompt separation is ineffective.
Societal hierarchy influences model behavior more than explicit instructions.
Abstract
Large language models (LLMs) are increasingly deployed with hierarchical instruction schemes, where certain instructions (e.g., system-level directives) are expected to take precedence over others (e.g., user messages). Yet, we lack a systematic understanding of how effectively these hierarchical control mechanisms work. We introduce a systematic evaluation framework based on constraint prioritization to assess how well LLMs enforce instruction hierarchies. Our experiments across six state-of-the-art LLMs reveal that models struggle with consistent instruction prioritization, even for simple formatting conflicts. We find that the widely-adopted system/user prompt separation fails to establish a reliable instruction hierarchy, and models exhibit strong inherent biases toward certain constraint types regardless of their priority designation. Interestingly, we also find that societal…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
