Lethe: Purifying Backdoored Large Language Models with Knowledge Dilution

Chen Chen; Yuchen Sun; Jiaxin Gao; Xueluan Gong; Qian Wang; Ziyao Wang; Yongsen Zheng; Kwok-Yan Lam

arXiv:2508.21004 · cs.CL · August 29, 2025

TL;DR

Lethe is a defense that purges backdoors from large language models by diluting malicious behaviors internally and distracting attention externally, outperforming existing defenses across multiple scenarios.

Contribution

The paper introduces LETHE, a comprehensive backdoor defense for LLMs that combines knowledge dilution and external evidence to neutralize diverse attack types.

Findings

01 Reduces attack success rate by up to 98%

02 Outperforms 8 state-of-the-art defenses

03 Maintains model utility and robustness

Abstract

Large language models (LLMs) have seen significant advancements, achieving superior performance in various Natural Language Processing (NLP) tasks. However, they remain vulnerable to backdoor attacks, where models behave normally for standard queries but generate harmful responses or unintended output when specific triggers are activated. Existing backdoor defenses either lack comprehensiveness, focusing on narrow trigger settings, detection-only mechanisms, and limited domains, or fail to withstand advanced scenarios like model-editing-based, multi-trigger, and triggerless attacks. In this paper, we present LETHE, a novel method to eliminate backdoor behaviors from LLMs through knowledge dilution using both internal and external mechanisms. Internally, LETHE leverages a lightweight dataset to train a clean model, which is then merged with the backdoored model to neutralize malicious…
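The internal step described above, merging a clean model with the backdoored one, can be illustrated with a minimal sketch. The excerpt does not say which merge rule LETHE uses, so plain linear interpolation of parameters is an assumption here, and the names `merge_weights`, `alpha`, and the toy parameter dictionaries are hypothetical.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def merge_weights(backdoored: Weights, clean: Weights, alpha: float = 0.5) -> Weights:
    """Linearly interpolate two checkpoints with identical parameter layouts.

    ASSUMPTION: linear interpolation is used only for illustration; it is not
    confirmed to be the merge rule in the paper.
    alpha = 0.0 keeps the backdoored weights, alpha = 1.0 keeps the clean ones.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if backdoored.keys() != clean.keys():
        raise ValueError("checkpoints must share the same parameter names")

    merged: Weights = {}
    for name, w_bd in backdoored.items():
        w_cl = clean[name]
        if len(w_bd) != len(w_cl):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Each merged value is a weighted average of the two source values.
        merged[name] = [(1.0 - alpha) * b + alpha * c for b, c in zip(w_bd, w_cl)]
    return merged


# Hypothetical toy checkpoints, not real model weights.
backdoored = {"layer.weight": [1.0, -2.0], "layer.bias": [0.5]}
clean = {"layer.weight": [0.0, 2.0], "layer.bias": [0.5]}
merged = merge_weights(backdoored, clean, alpha=0.5)
```

In this sketch a larger `alpha` pulls the merged parameters further toward the clean model, which is one simple way to picture the backdoored behavior being diluted.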

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling