DUP: Detection-guided Unlearning for Backdoor Purification in Language Models

Man Hu; Yahui Ding; Yatao Yang; Liangyu Chen; Yanhao Jia; Shuai Zhao

arXiv:2508.01647·cs.CR·August 5, 2025

TL;DR

DUP is a unified framework in which backdoor detection guides unlearning-based purification of language models, removing backdoors without full retraining or an external clean model.

Contribution

The paper proposes a detection-guided unlearning method that pairs feature-level anomaly detection with knowledge-distillation-based purification to improve backdoor defense in language models.

Findings

01. DUP outperforms existing methods in detection accuracy.

02. It unlearns backdoors without full model retraining.

03. The approach is validated across multiple attack types and model architectures.

Abstract

As backdoor attacks become more stealthy and robust, they reveal critical weaknesses in current defense strategies: detection methods often rely on coarse-grained feature statistics, and purification methods typically require full retraining or additional clean models. To address these challenges, we propose DUP (Detection-guided Unlearning for Purification), a unified framework that integrates backdoor detection with unlearning-based purification. The detector captures feature-level anomalies by jointly leveraging class-agnostic distances and inter-layer transitions. These deviations are integrated through a weighted scheme to identify poisoned inputs, enabling more fine-grained analysis. Based on the detection results, we purify the model through a parameter-efficient unlearning mechanism that avoids full retraining and does not require any external clean model. Specifically, we…
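
The detection idea in the abstract (class-agnostic distances plus inter-layer transitions, combined by a weighted scheme) can be illustrated with a small sketch. This is not the paper's implementation: the helper names, the choice of Euclidean and cosine distances, the single weight `alpha`, the threshold, and the toy feature values are all assumptions made for illustration, and the sketch assumes every layer has the same hidden size.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm > 0 else 1.0

def layer_means(samples):
    # One mean vector per layer, pooled over all clean samples regardless of
    # label (this pooling is what "class-agnostic" is taken to mean here).
    n = len(samples)
    return [
        [sum(s[l][d] for s in samples) / n for d in range(len(samples[0][l]))]
        for l in range(len(samples[0]))
    ]

def transition_profile(sample):
    # Cosine distance between the features of each pair of consecutive layers.
    return [cosine_distance(sample[l], sample[l + 1]) for l in range(len(sample) - 1)]

def mean_profile(samples):
    profiles = [transition_profile(s) for s in samples]
    return [sum(p[i] for p in profiles) / len(profiles) for i in range(len(profiles[0]))]

def anomaly_score(sample, means, ref_profile, alpha=0.5):
    # Assumed weighted scheme: alpha * distance term + (1 - alpha) * transition term.
    dist = sum(euclidean(h, m) for h, m in zip(sample, means)) / len(means)
    prof = transition_profile(sample)
    trans = sum(abs(p - r) for p, r in zip(prof, ref_profile)) / len(ref_profile)
    return alpha * dist + (1 - alpha) * trans

# Toy data: each sample is a list of per-layer feature vectors (3 layers, 2 dims).
clean = [
    [[1.0, 0.0], [1.0, 0.1], [1.0, 0.2]],
    [[0.9, 0.1], [1.0, 0.2], [1.1, 0.3]],
]
suspect = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # features rotate sharply across layers

means = layer_means(clean)
ref = mean_profile(clean)
THRESHOLD = 0.5  # hypothetical cut-off; in practice it would be calibrated on held-out data

clean_score = anomaly_score(clean[0], means, ref)
suspect_score = anomaly_score(suspect, means, ref)
flagged = suspect_score > THRESHOLD
```

In this toy setup the suspect input scores well above the clean one because both its distance to the clean layer means and its layer-to-layer transitions deviate from the clean reference. How DUP actually weights and thresholds the two signals is not stated in the excerpt above.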

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Advanced Malware Detection Techniques