Be Kind, Rewrite: Benign Projections via Rewriting Defend Against LLM Data Poisoning Attacks
John T. Halloran, Noopur S. Bhatt

TL;DR
This paper proposes open-book benign rewriting (OBBR), a novel LLM rewriting method that effectively defends against backdoor data poisoning attacks by projecting samples into benign prompt space, improving safety and efficiency.
Contribution
The paper introduces OBBR, a new rewriting technique that enhances defense against various backdoor attacks, outperforming existing methods in safety and computational efficiency.
Findings
OBBR increases safety performance by 51% on average across five known backdoor attacks and four widely used LLMs.
OBBR outperforms state-of-the-art defenses and closed-book rewriting methods.
OBBR does not degrade natural language task performance and defends against non-trigger attacks.
Abstract
Large language models (LLMs) are highly susceptible to backdoor attacks (BAs), wherein training samples are poisoned using trigger-based harmful content. Furthermore, existing defenses have proven ineffective when extensively tested across BA patterns. To better combat BAs, we explore the use of LLM rewriting as a proactive defense against data poisoning. First, we theoretically show that when LLM rewriting utilizes open-book benign samples--termed open-book benign rewriting (OBBR)--the probability of a rewritten output being benign is strictly greater than that of closed-book rewriting. Thus, OBBR neutralizes harmful content by projecting training samples to the space of benign prompts. We then show that, in contrast to previous defenses, OBBR effectively mitigates a large number of existing BAs: across five known BAs and four widely used LLMs, OBBR increases safety performance by an…
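To make the open-book versus closed-book distinction in the abstract concrete, the following is a minimal sketch of how the two rewriting prompts might differ. It is not the authors' implementation: the instruction wording, the function names build_closed_book_prompt and build_open_book_prompt, and the prompt layout are all assumptions made for illustration. The only idea taken from the abstract is that open-book rewriting gives the rewriting LLM benign samples alongside the (possibly poisoned) training sample.

```python
from typing import Sequence

# Hypothetical instruction text; the paper's actual prompt is not given in this summary.
REWRITE_INSTRUCTION = (
    "Rewrite the following training sample so that it is benign."
)


def build_closed_book_prompt(sample: str) -> str:
    """Closed-book rewriting: the rewriter sees only the sample itself."""
    return f"{REWRITE_INSTRUCTION}\n\nSample:\n{sample}\n\nRewritten sample:"


def build_open_book_prompt(sample: str, benign_examples: Sequence[str]) -> str:
    """Open-book rewriting: the rewriter also sees benign reference samples."""
    if not benign_examples:
        raise ValueError("open-book rewriting needs at least one benign example")
    # List the benign samples so the rewriter can anchor its output to them.
    book = "\n".join(f"- {example}" for example in benign_examples)
    return (
        f"{REWRITE_INSTRUCTION}\n\n"
        f"Benign reference samples:\n{book}\n\n"
        f"Sample:\n{sample}\n\nRewritten sample:"
    )


if __name__ == "__main__":
    benign = ["How do I bake sourdough bread?", "Summarize this news article."]
    print(build_open_book_prompt("Describe the weather. <trigger>", benign))
```

Either prompt would then be sent to a rewriting LLM, and the rewritten sample would replace the original in the training set. That step is omitted here because it depends on the model and API in use.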
