Evaluating the Instruction-Following Robustness of Large Language Models   to Prompt Injection

Zekun Li; Baolin Peng; Pengcheng He; Xifeng Yan

arXiv:2308.10819·cs.CL·November 28, 2023·2 cites

Evaluating the Instruction-Following Robustness of Large Language Models to Prompt Injection

Zekun Li, Baolin Peng, Pengcheng He, Xifeng Yan

PDF

Open Access 2 Repos 1 Video

TL;DR

This paper introduces a benchmark to evaluate the robustness of large language models against prompt injection attacks, revealing significant vulnerabilities and emphasizing the need for improved prompt comprehension.

Contribution

The work provides the first comprehensive benchmark for assessing LLM robustness to prompt injection, highlighting key vulnerabilities and guiding future robustness improvements.

Findings

01

Some models overly focus on injected instructions

02

Models with better context understanding are more vulnerable

03

Significant vulnerabilities found across leading LLMs

Abstract

Large Language Models (LLMs) have demonstrated exceptional proficiency in instruction-following, becoming increasingly crucial across various applications. However, this capability brings with it the risk of prompt injection attacks, where attackers inject instructions into LLMs' input to elicit undesirable actions or content. Understanding the robustness of LLMs against such attacks is vital for their safe implementation. In this work, we establish a benchmark to evaluate the robustness of instruction-following LLMs against prompt injection attacks. Our objective is to determine the extent to which LLMs can be influenced by injected instructions and their ability to differentiate between these injected and original target instructions. Through extensive experiments with leading instruction-following LLMs, we uncover significant vulnerabilities in their robustness to such attacks. Our…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

Videos

Evaluating the Instruction-Following Robustness of Large Language Models to Prompt Injection· underline

Taxonomy

TopicsAdversarial Robustness in Machine Learning · Software Engineering Research · Topic Modeling

MethodsFocus