Defending Against Prompt Injection with DataFilter

Yizhu Wang; Sizhe Chen; Raghad Alkhudair; Basel Alomair; David Wagner

arXiv:2510.19207·cs.CR·February 5, 2026

Defending Against Prompt Injection with DataFilter

Yizhu Wang, Sizhe Chen, Raghad Alkhudair, Basel Alomair, David Wagner

PDF

Open Access 1 Models

TL;DR

This paper introduces DataFilter, a model-agnostic, test-time defense that effectively removes malicious prompt injections from data, significantly reducing attack success rates while preserving the utility of large language models.

Contribution

DataFilter is a novel supervised fine-tuned model that selectively filters adversarial content from data before it reaches the LLM, providing a practical and effective defense against prompt injection attacks.

Findings

01

DataFilter reduces prompt injection success rates to near zero.

02

It maintains high utility of LLMs while providing security against attacks.

03

The model is easy to deploy and works across multiple benchmarks.

Abstract

When large language model (LLM) agents are increasingly deployed to automate tasks and interact with untrusted external data, prompt injection emerges as a significant security threat. By injecting malicious instructions into the data that LLMs access, an attacker can arbitrarily override the original user task and redirect the agent toward unintended, potentially harmful actions. Existing defenses either require access to model weights (fine-tuning), incur substantial utility loss (detection-based), or demand non-trivial system redesign (system-level). Motivated by this, we propose DataFilter, a test-time model-agnostic defense that removes malicious instructions from the data before it reaches the backend LLM. DataFilter is trained with supervised fine-tuning on simulated injections and leverages both the user's instruction and the data to selectively strip adversarial content while…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Models

🤗
JoyYizhu/DataFilter
model· 355 dl· ♡ 4
355 dl♡ 4

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsAdversarial Robustness in Machine Learning · Security and Verification in Computing · Topic Modeling