Stronger Enforcement of Instruction Hierarchy via Augmented Intermediate Representations

Sanjay Kariyappa; G. Edward Suh

arXiv:2505.18907 · cs.AI · March 10, 2026

Open Access

TL;DR

This paper strengthens instruction hierarchy enforcement in large language models by injecting the hierarchy signal into intermediate representations rather than only at the input layer, reducing prompt injection attack success rates by 1.6x to 9.2x without harming model utility.

Contribution

The paper introduces layer-specific trainable embeddings that inject the instruction hierarchy signal into the model's intermediate layers, improving robustness against prompt injection attacks.

Findings

01. Achieves a 1.6x to 9.2x reduction in attack success rate

02. Effective across multiple models and training methods

03. Maintains model utility while enhancing security

Abstract

Prompt injection attacks are a critical security vulnerability in large language models (LLMs), allowing attackers to hijack model behavior by injecting malicious instructions within the input context. Recent defense mechanisms have leveraged an Instruction Hierarchy (IH) Signal, often implemented through special delimiter tokens or additive embeddings to denote the privilege level of input tokens. However, these prior works typically inject the IH signal exclusively at the initial input layer, which we hypothesize limits its ability to effectively distinguish the privilege levels of tokens as it propagates through the different layers of the model. To overcome this limitation, we introduce a novel approach that injects the IH signal into the intermediate token representations within the network. Our method augments these representations with layer-specific trainable embeddings that…
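
The idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the three privilege levels, the additive combination, the stand-in `toy_layer`, and every function name here are assumptions made for illustration, and the embedding tables are random constants where a real model would use trained parameters. The sketch only contrasts injecting the IH signal before every layer with injecting it at the input layer alone.

```python
import random

# Hypothetical privilege levels; the names and count are assumptions.
SYSTEM, USER, DATA = 0, 1, 2
NUM_LEVELS = 3


def make_ih_tables(num_layers, hidden_dim, seed=0):
    """Build one embedding table per layer: tables[layer][level] is a vector.

    In a real model these would be trainable parameters; here they are
    small random constants.
    """
    rng = random.Random(seed)
    return [
        [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
         for _ in range(NUM_LEVELS)]
        for _ in range(num_layers)
    ]


def inject_ih(hidden, levels, table):
    """Add each token's privilege-level embedding to its hidden vector."""
    return [
        [h + e for h, e in zip(vec, table[level])]
        for vec, level in zip(hidden, levels)
    ]


def toy_layer(hidden):
    """Stand-in for a transformer block: mix each token with the sequence mean."""
    n = len(hidden)
    mean = [sum(col) / n for col in zip(*hidden)]
    return [[0.5 * h + 0.5 * m for h, m in zip(vec, mean)] for vec in hidden]


def forward(hidden, levels, tables, input_only=False):
    """Run the toy stack, injecting the IH signal before every layer.

    With input_only=True the signal is added only before the first layer,
    mimicking the input-layer-only baseline described in the abstract.
    """
    for i, table in enumerate(tables):
        if i == 0 or not input_only:
            hidden = inject_ih(hidden, levels, table)
        hidden = toy_layer(hidden)
    return hidden


# Usage: five tokens with identical starting states but different privileges.
tables = make_ih_tables(num_layers=4, hidden_dim=8)
tokens = [[0.0] * 8 for _ in range(5)]
levels = [SYSTEM, SYSTEM, USER, DATA, DATA]
out_all = forward(tokens, levels, tables)
out_first = forward(tokens, levels, tables, input_only=True)
```

In this toy setup, tokens that share a privilege level end with identical representations, while tokens at different levels diverge, and the per-layer variant re-applies that separation at every depth instead of relying on the input-layer signal surviving the mixing steps.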

Taxonomy

Topics: Machine Learning and Data Classification · Adversarial Robustness in Machine Learning · Imbalanced Data Classification Techniques