Evaluating the Instruction-Following Robustness of Large Language Models to Prompt Injection
Zekun Li, Baolin Peng, Pengcheng He, Xifeng Yan

TL;DR
This paper introduces a benchmark to evaluate the robustness of large language models against prompt injection attacks, revealing significant vulnerabilities and emphasizing the need for improved prompt comprehension.
Contribution
The work provides the first comprehensive benchmark for assessing LLM robustness to prompt injection, highlighting key vulnerabilities and guiding future robustness improvements.
Findings
Some models overly focus on injected instructions
Models with better context understanding are more vulnerable
Significant vulnerabilities found across leading LLMs
Abstract
Large Language Models (LLMs) have demonstrated exceptional proficiency in instruction-following, becoming increasingly crucial across various applications. However, this capability brings with it the risk of prompt injection attacks, where attackers inject instructions into LLMs' input to elicit undesirable actions or content. Understanding the robustness of LLMs against such attacks is vital for their safe implementation. In this work, we establish a benchmark to evaluate the robustness of instruction-following LLMs against prompt injection attacks. Our objective is to determine the extent to which LLMs can be influenced by injected instructions and their ability to differentiate between these injected and original target instructions. Through extensive experiments with leading instruction-following LLMs, we uncover significant vulnerabilities in their robustness to such attacks. Our…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
Taxonomy
TopicsAdversarial Robustness in Machine Learning · Software Engineering Research · Topic Modeling
MethodsFocus
