Hierarchical Alignment: Enforcing Hierarchical Instruction-Following in LLMs through Logical Consistency
Shu Yang, Zihao Zhou, Di Wang, Wenda Li

TL;DR
This paper introduces Neuro-Symbolic Hierarchical Alignment (NSHA), a method that helps large language models handle conflicting instructions by explicitly modeling instruction priorities and resolving conflicts through constraint satisfaction.
Contribution
The paper presents a novel neuro-symbolic approach that enforces hierarchical instruction-following in LLMs via solver-guided reasoning and training distillation.
Findings
NSHA improves model performance in conflicting instruction scenarios.
NSHA maintains utility while enhancing safety and compliance.
The approach is effective across multiple tasks and interaction types.
Abstract
Large language models increasingly operate under multiple instructions from heterogeneous sources with different authority levels, including system policies, user requests, tool outputs, and retrieved context. While prior work on instruction hierarchy highlights the importance of respecting instruction priorities, it mainly focuses on adversarial attacks and overlooks the benign but common instruction conflicts that arise in real-world applications. In such settings, models must not only avoid security violations but also preserve task utility and behavioral consistency when instructions partially or implicitly conflict. We propose Neuro-Symbolic Hierarchical Alignment (NSHA) for hierarchical instruction-following by explicitly modeling and enforcing instruction priorities. At inference time, we introduce solver-guided reasoning that formulates instruction resolution as a constraint…
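To make the idea of priority-aware conflict resolution concrete, here is a minimal toy sketch in Python. It is not the authors' implementation: the excerpt above is truncated and does not show their solver formulation. The `PRIORITY` ordering, the `Instruction` class, the `resolve` function, and the two response attributes (`language`, `format`) are all hypothetical names chosen for illustration, and brute-force enumeration stands in for a real constraint solver. The sketch treats each instruction as a constraint on the response, processes instructions from highest to lowest authority, and drops any instruction that cannot be satisfied together with the higher-priority ones already accepted.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, List, Tuple

# Assumed authority ranking (lower number = higher authority).
# The source lists these instruction sources; this exact ordering is an assumption.
PRIORITY = {"system": 0, "user": 1, "tool": 2, "retrieved": 3}


@dataclass(frozen=True)
class Instruction:
    source: str                              # one of the PRIORITY keys
    text: str                                # human-readable instruction
    check: Callable[[Dict[str, str]], bool]  # constraint over response attributes


def resolve(
    instructions: List[Instruction],
    domains: Dict[str, List[str]],
) -> Tuple[Dict[str, str], List[Instruction]]:
    """Greedy lexicographic resolution: keep each instruction, in priority
    order, only if some candidate response still satisfies it together with
    every instruction kept so far."""
    names = list(domains)
    # Enumerate every possible combination of response attributes.
    candidates = [
        dict(zip(names, values))
        for values in product(*(domains[n] for n in names))
    ]
    kept: List[Instruction] = []
    # sorted() is stable, so equal-priority instructions keep their input order.
    for inst in sorted(instructions, key=lambda i: PRIORITY[i.source]):
        narrowed = [c for c in candidates if inst.check(c)]
        if narrowed:  # consistent with what was accepted so far
            candidates = narrowed
            kept.append(inst)
        # otherwise the lower-priority instruction is overridden and dropped
    return candidates[0], kept


# Usage: a benign conflict between a system policy, a user, and retrieved text.
domains = {"language": ["en", "fr"], "format": ["json", "prose"]}
instructions = [
    Instruction("retrieved", "Reply in English", lambda r: r["language"] == "en"),
    Instruction("user", "Answer in French", lambda r: r["language"] == "fr"),
    Instruction("user", "Write a prose paragraph", lambda r: r["format"] == "prose"),
    Instruction("system", "Always respond in JSON", lambda r: r["format"] == "json"),
]
response, kept = resolve(instructions, domains)
print(response)                   # {'language': 'fr', 'format': 'json'}
print([i.text for i in kept])     # ['Always respond in JSON', 'Answer in French']
```

In this toy run the user's request for French survives because it does not conflict with the system policy, while the user's request for prose and the retrieved text's request for English are overridden by higher-authority instructions. This mirrors the goal stated above of preserving task utility where instructions only partially conflict.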
