Be Kind, Rewrite: Benign Projections via Rewriting Defend Against LLM Data Poisoning Attacks

John T. Halloran; Noopur S. Bhatt

arXiv:2605.19147·cs.CR·May 20, 2026


TL;DR

This paper proposes open-book benign rewriting (OBBR), an LLM rewriting method that defends against backdoor data poisoning attacks by projecting training samples into the space of benign prompts. It improves on existing defenses in both safety and computational efficiency.

Contribution

The paper introduces OBBR, a new rewriting technique that enhances defense against various backdoor attacks, outperforming existing methods in safety and computational efficiency.

Findings

01. OBBR increases safety performance by 51% on average across five known backdoor attacks and four widely used LLMs.

02. OBBR outperforms state-of-the-art defenses and closed-book rewriting methods.

03. OBBR does not degrade natural language task performance, and it also defends against non-trigger attacks.

Abstract

Large language models (LLMs) are highly susceptible to backdoor attacks (BAs), wherein training samples are poisoned using trigger-based harmful content. Furthermore, existing defenses have proven ineffective when extensively tested across BA patterns. To better combat BAs, we explore the use of LLM rewriting as a proactive defense against data poisoning. First, we theoretically show that when LLM rewriting utilizes open-book benign samples--termed open-book benign rewriting (OBBR)--the probability of a rewritten output being benign is strictly greater than that of closed-book rewriting. Thus, OBBR neutralizes harmful content by projecting training samples to the space of benign prompts. We then show that, in contrast to previous defenses, OBBR effectively mitigates a large number of existing BAs: across five known BAs and four widely used LLMs, OBBR increases safety performance by an…
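
To make the open-book versus closed-book distinction in the abstract concrete, here is a minimal sketch of how the two rewriting prompts might differ. It is an illustration under stated assumptions, not the authors' implementation: the prompt wording, the function names (`build_open_book_prompt`, `build_closed_book_prompt`, `rewrite_dataset`), and the `rewriter` callable standing in for an LLM call are all hypothetical.

```python
from typing import Callable, List, Sequence


def build_closed_book_prompt(sample: str) -> str:
    """Closed-book rewriting: the rewriter sees only the sample itself."""
    return f"Rewrite the following training sample:\n{sample}"


def build_open_book_prompt(sample: str, benign_examples: Sequence[str]) -> str:
    """Open-book rewriting: benign reference samples are placed in context.

    The prompt wording here is an assumption made for illustration only.
    """
    references = "\n".join(f"- {ex}" for ex in benign_examples)
    return (
        "Reference benign samples:\n"
        f"{references}\n\n"
        "Rewrite the following training sample so that it matches the style "
        "and intent of the references:\n"
        f"{sample}"
    )


def rewrite_dataset(
    samples: Sequence[str],
    benign_examples: Sequence[str],
    rewriter: Callable[[str], str],
) -> List[str]:
    """Rewrite every training sample before fine-tuning.

    `rewriter` is a hypothetical stand-in for any function that sends a
    prompt to an LLM and returns its completion.
    """
    return [rewriter(build_open_book_prompt(s, benign_examples)) for s in samples]
```

In this sketch the only difference between the two modes is whether benign references appear in the rewriter's context, which is the property the abstract's theoretical claim concerns. Every sample is rewritten, so no step tries to detect which samples are poisoned.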
