Defending Against Prompt Injection With a Few DefensiveTokens
Sizhe Chen, Yizhu Wang, Nicholas Carlini, Chawin Sitawarin, David Wagner

TL;DR
This paper introduces DefensiveToken, a test-time defense mechanism for large language models that enhances prompt injection robustness by inserting specially optimized tokens, balancing security and utility.
Contribution
The paper proposes DefensiveToken, a novel test-time defense with optimized tokens that provides prompt injection robustness comparable to training-time methods.
Findings
DefensiveTokens improve prompt injection security with minimal utility loss.
The method allows flexible switching between high utility and security.
Code implementation is publicly available.
Abstract
When large language model (LLM) systems interact with external data to perform complex tasks, a new attack, namely prompt injection, becomes a significant threat. By injecting instructions into the data accessed by the system, the attacker is able to override the initial user task with an arbitrary task directed by the attacker. To secure the system, test-time defenses, e.g., defensive prompting, have been proposed for system developers to attain security only when needed in a flexible manner. However, they are much less effective than training-time defenses that change the model parameters. Motivated by this, we propose DefensiveToken, a test-time defense with prompt injection robustness comparable to training-time alternatives. DefensiveTokens are newly inserted as special tokens, whose embeddings are optimized for security. In security-sensitive cases, system developers can append a…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
