Stronger Enforcement of Instruction Hierarchy via Augmented Intermediate Representations
Sanjay Kariyappa, G. Edward Suh

TL;DR
This paper proposes a new method to enhance instruction hierarchy signals in large language models by injecting them into intermediate representations, significantly reducing prompt injection attack success rates without harming model utility.
Contribution
The paper injects instruction hierarchy signals into the model's intermediate layers through layer-specific trainable embeddings, rather than only at the input layer, improving security against prompt injection attacks.
Findings
Achieves a 1.6x to 9.2x reduction in prompt injection attack success rate
Remains effective across multiple models and training methods
Maintains model utility while enhancing security
Abstract
Prompt injection attacks are a critical security vulnerability in large language models (LLMs), allowing attackers to hijack model behavior by injecting malicious instructions within the input context. Recent defense mechanisms have leveraged an Instruction Hierarchy (IH) Signal, often implemented through special delimiter tokens or additive embeddings to denote the privilege level of input tokens. However, these prior works typically inject the IH signal exclusively at the initial input layer, which we hypothesize limits its ability to effectively distinguish the privilege levels of tokens as it propagates through the different layers of the model. To overcome this limitation, we introduce a novel approach that injects the IH signal into the intermediate token representations within the network. Our method augments these representations with layer-specific trainable embeddings that…
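The idea in the abstract can be illustrated with a toy forward pass. This is only a sketch under stated assumptions, not the authors' implementation: the abstract is truncated before it says how the embeddings are combined with the token representations, so element-wise addition is assumed here. The names `ih_tables`, `inject_ih`, `toy_layer`, and `forward`, the two privilege levels, and all sizes are hypothetical. The embedding values are randomly initialized stand-ins for parameters that would be trained.

```python
import random

NUM_LAYERS = 3   # hypothetical depth of the toy model
HIDDEN = 4       # hypothetical hidden size
NUM_LEVELS = 2   # assumed levels: 0 = privileged instruction, 1 = untrusted data

random.seed(0)

# One embedding table per layer: ih_tables[layer][level] is a vector of size HIDDEN.
# In a real model these would be trainable parameters; here they are random stand-ins.
ih_tables = [
    [[random.gauss(0.0, 0.02) for _ in range(HIDDEN)] for _ in range(NUM_LEVELS)]
    for _ in range(NUM_LAYERS)
]

def inject_ih(hidden_states, privilege_ids, layer):
    """Add the layer-specific privilege embedding to every token representation.

    Addition is an assumption; the source does not state the combination rule.
    """
    table = ih_tables[layer]
    return [
        [h + e for h, e in zip(token, table[level])]
        for token, level in zip(hidden_states, privilege_ids)
    ]

def toy_layer(hidden_states):
    """Placeholder for a transformer block (no attention, just a scaling)."""
    return [[0.5 * x for x in token] for token in hidden_states]

def forward(hidden_states, privilege_ids):
    """Re-inject the IH signal before every layer, not only at the input."""
    for layer in range(NUM_LAYERS):
        hidden_states = inject_ih(hidden_states, privilege_ids, layer)
        hidden_states = toy_layer(hidden_states)
    return hidden_states

# Two tokens with identical content but different privilege levels.
tokens = [[1.0] * HIDDEN, [1.0] * HIDDEN]
mixed = forward(tokens, [0, 1])
uniform = forward(tokens, [0, 0])
```

In this sketch, tokens with the same content but different privilege levels end up with different representations at every depth, whereas an input-only signal would have to survive all subsequent layers on its own.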
Taxonomy
Topics: Machine Learning and Data Classification · Adversarial Robustness in Machine Learning · Imbalanced Data Classification Techniques
