DUP: Detection-guided Unlearning for Backdoor Purification in Language Models
Man Hu, Yahui Ding, Yatao Yang, Liangyu Chen, Yanhao Jia, Shuai Zhao

TL;DR
DUP is a unified, detection-guided unlearning framework that identifies poisoned inputs and removes backdoors from language models without full retraining or an external clean model.
Contribution
The paper proposes a detection-guided unlearning method that pairs feature-level anomaly detection with knowledge-distillation-based purification to defend language models against backdoors.
Findings
DUP outperforms existing methods in detection accuracy.
It effectively unlearns backdoors without full model retraining.
The approach is validated across various attack types and model architectures.
Abstract
As backdoor attacks become more stealthy and robust, they reveal critical weaknesses in current defense strategies: detection methods often rely on coarse-grained feature statistics, and purification methods typically require full retraining or additional clean models. To address these challenges, we propose DUP (Detection-guided Unlearning for Purification), a unified framework that integrates backdoor detection with unlearning-based purification. The detector captures feature-level anomalies by jointly leveraging class-agnostic distances and inter-layer transitions. These deviations are integrated through a weighted scheme to identify poisoned inputs, enabling more fine-grained analysis. Based on the detection results, we purify the model through a parameter-efficient unlearning mechanism that avoids full retraining and does not require any external clean model. Specifically, we…
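The detection idea in the abstract (class-agnostic distances plus inter-layer transitions, combined by a weighted scheme) can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the Euclidean distance, the z-score normalization, the weight `alpha`, the quantile threshold, and the assumption that every layer yields features of the same dimension are all illustrative choices made here, and the abstract excerpt does not specify them.

```python
import numpy as np

def layer_scores(feats, ref_feats):
    """feats, ref_feats: arrays of shape (n_layers, n_samples, dim).

    Hypothetical helper: returns per-layer distance and transition magnitudes.
    """
    # Class-agnostic distance: distance to each layer's reference centroid,
    # computed without using any class labels.
    centroids = ref_feats.mean(axis=1, keepdims=True)         # (L, 1, d)
    dist = np.linalg.norm(feats - centroids, axis=-1)         # (L, n)
    # Inter-layer transition: how far a sample's representation moves
    # between consecutive layers.
    trans = np.linalg.norm(feats[1:] - feats[:-1], axis=-1)   # (L-1, n)
    return dist, trans

def zscore(x, ref):
    # Normalize each layer's values by the reference statistics of that layer.
    mu = ref.mean(axis=1, keepdims=True)
    sigma = ref.std(axis=1, keepdims=True) + 1e-8
    return (x - mu) / sigma

def anomaly_score(feats, ref_feats, alpha=0.5):
    # Weighted combination of the two deviation signals (alpha is assumed).
    dist, trans = layer_scores(feats, ref_feats)
    ref_dist, ref_trans = layer_scores(ref_feats, ref_feats)
    d = zscore(dist, ref_dist).mean(axis=0)
    t = zscore(trans, ref_trans).mean(axis=0)
    return alpha * d + (1 - alpha) * t

def flag_poisoned(feats, ref_feats, alpha=0.5, quantile=0.99):
    # Threshold chosen as a high quantile of scores on the reference set.
    ref_scores = anomaly_score(ref_feats, ref_feats, alpha)
    threshold = np.quantile(ref_scores, quantile)
    return anomaly_score(feats, ref_feats, alpha) > threshold
```

In this sketch, `ref_feats` would hold hidden-state features from a small set of trusted inputs, and `flag_poisoned` returns a boolean mask over the inputs in `feats`. How the flagged inputs then drive the unlearning step is not covered here, since the excerpt is cut off before those details.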
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Advanced Malware Detection Techniques
