Instructional Segment Embedding: Improving LLM Safety with Instruction Hierarchy

Tong Wu; Shujian Zhang; Kaiqiang Song; Silei Xu; Sanqiang Zhao; Ravi Agrawal; Sathish Reddy Indurthi; Chong Xiang; Prateek Mittal; Wenxuan Zhou

arXiv:2410.09102 · cs.LG · May 28, 2025

TL;DR

This paper introduces Instructional Segment Embedding (ISE), a novel architectural technique that embeds instruction priority into large language models, significantly enhancing safety and instruction-following capabilities against malicious prompts.

Contribution

The paper proposes ISE, an architectural embedding method inspired by BERT, to explicitly encode instruction hierarchy in LLMs, addressing a key safety vulnerability.

Findings

01. Up to 15.75% increase in robust accuracy on safety benchmarks

02. Up to 4.1% improvement in instruction-following performance

03. Effective differentiation of instruction types improves model safety

Abstract

Large Language Models (LLMs) are susceptible to security and safety threats, such as prompt injection, prompt extraction, and harmful requests. One major cause of these vulnerabilities is the lack of an instruction hierarchy. Modern LLM architectures treat all inputs equally, failing to distinguish between and prioritize various types of instructions, such as system messages, user prompts, and data. As a result, lower-priority user prompts may override more critical system instructions, including safety protocols. Existing approaches to achieving instruction hierarchy, such as delimiters and instruction-based training, do not address this issue at the architectural level. We introduce the Instructional Segment Embedding (ISE) technique, inspired by BERT, to modern large language models, which embeds instruction priority information directly into the model. This approach enables models…
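The core idea described above, tagging each token with the segment it came from and adding a learned per-segment vector to its token embedding (as BERT does with its segment embeddings), can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the segment names follow the abstract, but the numeric ids, the helper names (`make_table`, `tag_segments`, `embed`), the table sizes, and the random initialization are all hypothetical, and a real model would learn these tables during training.

```python
import random

# Segment names follow the abstract (system messages, user prompts, data);
# the numeric ids are an assumption made for this sketch.
SEGMENT_IDS = {"system": 0, "user": 1, "data": 2}


def make_table(rows, dim, seed):
    """Build a small random lookup table standing in for learned embeddings."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(rows)]


def tag_segments(parts):
    """Flatten (role, token_ids) pairs into parallel token and segment id lists."""
    token_ids, segment_ids = [], []
    for role, tokens in parts:
        token_ids.extend(tokens)
        segment_ids.extend([SEGMENT_IDS[role]] * len(tokens))
    return token_ids, segment_ids


def embed(token_ids, segment_ids, token_table, segment_table):
    """Add the segment vector to each token vector, element by element."""
    if len(token_ids) != len(segment_ids):
        raise ValueError("each token needs exactly one segment id")
    return [
        [t + s for t, s in zip(token_table[tok], segment_table[seg])]
        for tok, seg in zip(token_ids, segment_ids)
    ]


DIM = 4  # hypothetical embedding width
token_table = make_table(rows=10, dim=DIM, seed=0)
segment_table = make_table(rows=len(SEGMENT_IDS), dim=DIM, seed=1)

# Token id 2 appears in all three segments.
parts = [("system", [1, 2]), ("user", [3, 2]), ("data", [2])]
token_ids, segment_ids = tag_segments(parts)
vectors = embed(token_ids, segment_ids, token_table, segment_table)

print(segment_ids)  # [0, 0, 1, 1, 2]
```

In this sketch the same token id yields a different input vector depending on whether it came from the system message, the user prompt, or the data, which is the signal that lets a model tell the segments apart without relying only on delimiters.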

Peer Reviews

Decision · ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

The paper is generally well written. The paper addresses an important issue about trustworthiness of LLMs.

Weaknesses

I had a difficult time understanding the problem statement at the beginning of the paper. The major weakness of the paper is the novelty of the approach. The approach merely adds an embedding layer to the LLM. Page 4: The authors state that the standard supervised fine-tuning approach “remains a fundamental limitation” (line 180). I think it would be nice to summarize the limitations/experimental results a bit here. Page 8: Robustness figures can be very misleading/look different based on ord

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The idea of introducing instructional segment embedding to directly enhance LLM’s safety is novel and promising.
2. The authors provide rigorous experimental validation on a range of tasks and demonstrate the method’s effectiveness.

Weaknesses

1. The paper employs full-parameter fine-tuning to learn the instructional segment embeddings, but it’s unclear if the baseline models are fine-tuned similarly. Is it a fair comparison? An ablation study evaluating baseline performance with the same fine-tuning across all datasets (Clean Alpaca, Adversarial Alpaca, UltraChat) would strengthen the paper.
2. The paper lacks an assessment of how well the segment embeddings generalize across datasets. Specifically, it would be valuable to see how emb

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- Clear problem framing and motivation: The paper addresses a significant problem with modern LLMs. The lack of separation between different levels of input makes models vulnerable to prompt injection, prompt extraction, and jailbreaks.
- Promising and intuitive solution: ISE is a simple and well-motivated solution. It is natural and straightforward, making it a very good idea.
- The paper's presentation in sections 1-4 is very clear.

Weaknesses

- Lack of clarity in experimental design and results: Sections 5 (Experimental Design) and 6 (Results) are difficult to understand, which makes it challenging to assess the performance of the method. I think the paper would benefit significantly from a refactoring of these sections to be as clear as possible.
- Concretely:
  - It is unclear how the malicious instructions in Adversarial Alpaca are generated, and how the instructions in training relate to those in testing.
  - It is unclea

Taxonomy

Topics: Quality and Safety in Healthcare · Pharmacy and Medical Practices · Safety Systems Engineering in Autonomy

Methods: Attention Is All You Need · Linear Layer · Softmax · Multi-Head Attention · WordPiece · Dropout · Layer Normalization · Adam · Attention Dropout