Reasoning Up the Instruction Ladder for Controllable Language Models
Zishuo Zheng, Vidhisha Balachandran, Chan Young Park, Faeze Brahman, Sachin Kumar

TL;DR
This paper introduces a reasoning-based approach to enforce instruction hierarchies in large language models, improving their ability to follow prioritized instructions and resist safety attacks.
Contribution
The authors propose VerIH, a dataset for training models to reason about instruction priorities, and demonstrate effective finetuning that enhances instruction following and safety robustness.
Findings
Models show 20% improvement on instruction hierarchy benchmarks.
Finetuned models reduce attack success rate by up to 20%.
Reasoning over instruction hierarchies enhances controllability and safety of LLMs.
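The findings above do not say whether the 20% figures are absolute (percentage-point) or relative changes. The sketch below, using purely hypothetical outcomes and helper names that do not come from the paper, shows how an attack success rate (ASR) is usually computed and how far the two readings can differ.

```python
from typing import Sequence


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of attack attempts that succeeded (True = attack succeeded)."""
    if not outcomes:
        raise ValueError("need at least one attack attempt")
    return sum(outcomes) / len(outcomes)


def absolute_reduction(before: float, after: float) -> float:
    """Drop in ASR measured in rate units (0.20 means 20 percentage points)."""
    return before - after


def relative_reduction(before: float, after: float) -> float:
    """Drop in ASR as a fraction of the original ASR."""
    if before == 0:
        raise ValueError("relative reduction is undefined when the baseline ASR is 0")
    return (before - after) / before


# Hypothetical outcomes, NOT results from the paper:
# 40 of 100 attacks succeed on a base model, 20 of 100 after finetuning.
base_outcomes = [True] * 40 + [False] * 60
tuned_outcomes = [True] * 20 + [False] * 80

asr_base = attack_success_rate(base_outcomes)
asr_tuned = attack_success_rate(tuned_outcomes)
abs_drop = absolute_reduction(asr_base, asr_tuned)
rel_drop = relative_reduction(asr_base, asr_tuned)
```

With these made-up numbers the same change reads as a 20-point absolute drop or a 50% relative drop, so the reported figure is best checked against the paper's tables.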
Abstract
As large language model (LLM) based systems take on high-stakes roles in real-world decision-making, they must reconcile competing instructions from multiple sources (e.g., model developers, users, and tools) within a single prompt context. Thus, enforcing an instruction hierarchy (IH) in LLMs, where higher-level directives override lower-priority requests, is critical for the reliability and controllability of LLMs. In this work, we reframe instruction hierarchy resolution as a reasoning task. Specifically, the model must first "think" about the relationship between a given user prompt and higher-priority (system) instructions before generating a response. To enable this capability via training, we construct VerIH, an instruction hierarchy dataset of constraint-following tasks with verifiable answers. This dataset comprises ~7K aligned and conflicting system-user instructions. We show…
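To make "constraint-following tasks with verifiable answers" concrete, here is a minimal sketch of how a response could be scored programmatically against a system-level and a user-level constraint. The `Constraint` class, the example constraints, and the `hierarchy_reward` rule are illustrative assumptions, not the authors' implementation: the only idea taken from the abstract is that the higher-priority instruction overrides the lower-priority one when the two conflict.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Constraint:
    """A named, programmatically checkable predicate on a response (hypothetical)."""
    name: str
    check: Callable[[str], bool]


def max_words(n: int) -> Constraint:
    return Constraint(f"max_words({n})", lambda text: len(text.split()) <= n)


def min_words(n: int) -> Constraint:
    return Constraint(f"min_words({n})", lambda text: len(text.split()) >= n)


def all_lowercase() -> Constraint:
    return Constraint("all_lowercase", lambda text: text == text.lower())


def hierarchy_reward(response: str, system: Constraint, user: Constraint,
                     conflicting: bool) -> float:
    """Assumed scoring rule: the system constraint must always hold.

    When the two instructions conflict, the user constraint is overridden and
    ignored; when they are aligned, the response must satisfy both.
    """
    if not system.check(response):
        return 0.0
    if conflicting:
        return 1.0
    return 1.0 if user.check(response) else 0.0


# Conflicting pair: the system caps length, the user asks for a long answer.
conflict_ok = hierarchy_reward("a short reply", max_words(5), min_words(50), True)
conflict_bad = hierarchy_reward("word " * 60, max_words(5), min_words(50), True)

# Aligned pair: both constraints can be satisfied at once.
aligned_ok = hierarchy_reward("short reply", max_words(5), all_lowercase(), False)
aligned_bad = hierarchy_reward("Short Reply", max_words(5), all_lowercase(), False)
```

Because each check is a deterministic function of the response text, a score like this can serve as a verifiable reward signal without a learned judge.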
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The paper can be easily followed with coherent writing.
- Experiments show improved performance over the base models and CoT after full fine-tuning.
- Results also show generalization to other benchmarks, particularly to safety-related tasks.
- Ablation of not using the conflicting prompts is also shown.
- Reasoning traces are also qualitatively evaluated to assess the relationship between system and user prompt.
- Originality is limited since the key benefit and contribution of verifiability of instructions is established from Lambert et al., 2025. On the other hand, the idea of conflicting user prompts is also originally provided in Zhang et al., 2025. Thus, the only contribution is augmenting the RLVR-IFEval dataset with the basic scheme of conflicting user prompts.
- Simple Claude-based rewriting may introduce bias and may not generalize to new types of rewriting structures. More analysis to the dive
* **Novel and Effective Problem Framing:** The key insight to treat instruction hierarchy as a meta-reasoning task rather than a standard alignment problem is a strong and novel contribution. This moves the field beyond implicit learning and toward explicit, scrutable conflict resolution.
* **High-Quality Dataset (VerIH):** The creation of the VerIH dataset is a valuable contribution to the community. The methodology of generating conflicts from an existing verifiable dataset (RLVR-IFEval) is
* **Scalability of Hierarchy:** The paper simplifies the IH problem to two levels: system and user. While it claims the method is "inherently scalable", this is asserted without proof. Real-world applications involve more complex hierarchies (e.g., developer system prompts, user-level system prompts, tool instructions, user data) that may have more nuanced precedence rules. The experiments do not test this scalability.
* **Diversity of Conflicts:** The VerIH dataset's conflicts are generated
1. Concrete Reformulation of a Critical Problem: The paper offers a clear and well-motivated reformulation of instruction hierarchy resolution as a reasoning problem, supported by real-world motivating scenarios. This framing addresses a persistent weakness in LLM deployment around controllability and safety.
2. VerIH Dataset with Verifiable Constraints: Construction of the VerIH dataset enables systematic training and evaluation. By creating both aligned and conflicting system-user prompt pairs
1. Limited Theoretical Analysis and Justification of Meta-Reasoning Efficacy: While the intuition for meta-reasoning over instruction hierarchies is plausible, the paper lacks a formal or semi-formal analysis or even a taxonomy of potential failure cases for instruction prioritization. For instance, there is no attempt to systematically dissect why explicit reasoning works better than implicit mapping for instruction hierarchy, nor to quantify its limitations (Section 2 and analysis in Section 6
* A simple and effective method: Lightweight dataset + RLVR yields measurable performance gains with only ~7K examples.
* Broad benchmark coverage: Includes both in-domain (IHEval, IFBench) and out-of-domain (safety, jailbreak) tests.
* Incremental novelty: The paper extends earlier instruction hierarchy and reasoning-for-safety works but doesn’t fundamentally rethink model architecture or training beyond RLVR on synthetic conflicts.
* Synthetic dataset limitations: VerIH conflicts are LLM-generated and may lack realism or linguistic diversity; unclear if models overfit to the structure of these synthetic conflicts.
* Evaluation limitations:
  * Heavy reliance on automated verification or guard scoring; limited human evalu
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Security and Verification in Computing · Topic Modeling
