Defending Against Prompt Injection with DataFilter
Yizhu Wang, Sizhe Chen, Raghad Alkhudair, Basel Alomair, David Wagner

TL;DR
This paper introduces DataFilter, a model-agnostic, test-time defense that effectively removes malicious prompt injections from data, significantly reducing attack success rates while preserving the utility of large language models.
Contribution
DataFilter is a novel supervised fine-tuned model that selectively filters adversarial content from data before it reaches the LLM, providing a practical and effective defense against prompt injection attacks.
Findings
DataFilter reduces prompt injection success rates to near zero.
It maintains high utility of LLMs while providing security against attacks.
The model is easy to deploy and works across multiple benchmarks.
Abstract
When large language model (LLM) agents are increasingly deployed to automate tasks and interact with untrusted external data, prompt injection emerges as a significant security threat. By injecting malicious instructions into the data that LLMs access, an attacker can arbitrarily override the original user task and redirect the agent toward unintended, potentially harmful actions. Existing defenses either require access to model weights (fine-tuning), incur substantial utility loss (detection-based), or demand non-trivial system redesign (system-level). Motivated by this, we propose DataFilter, a test-time model-agnostic defense that removes malicious instructions from the data before it reaches the backend LLM. DataFilter is trained with supervised fine-tuning on simulated injections and leverages both the user's instruction and the data to selectively strip adversarial content while…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdversarial Robustness in Machine Learning · Security and Verification in Computing · Topic Modeling
