TL;DR
BackFlush is a novel framework that detects and eliminates backdoors in large language models without harming watermarks, using a rotation-based parameter unlearning technique.
Contribution
The paper introduces two observations, the Backdoor Flushing Phenomenon and Backdoor Susceptibility Amplification, which enable backdoor removal and detection without prior knowledge of the trigger.
Findings
Achieves approximately 1% Attack Success Rate (ASR)
Maintains approximately 99% clean accuracy (CACC)
Preserves watermarking capabilities while eliminating backdoors
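To make the two metrics concrete, here is a minimal sketch of how ASR and CACC are commonly computed for a classification-style evaluation. The function names, the exact-match comparison, and the example labels are assumptions for illustration; the paper may measure both differently, particularly for generative outputs.

```python
from typing import Sequence


def attack_success_rate(preds_on_triggered: Sequence[str], target_output: str) -> float:
    """Fraction of trigger-bearing inputs that yield the attacker's target output.

    Assumption: success is an exact match with a single target output.
    """
    if not preds_on_triggered:
        raise ValueError("need at least one triggered example")
    hits = sum(1 for p in preds_on_triggered if p == target_output)
    return hits / len(preds_on_triggered)


def clean_accuracy(preds_on_clean: Sequence[str], gold_labels: Sequence[str]) -> float:
    """Fraction of clean (trigger-free) inputs that are answered correctly."""
    if not gold_labels or len(preds_on_clean) != len(gold_labels):
        raise ValueError("predictions and labels must be non-empty and equal length")
    correct = sum(1 for p, g in zip(preds_on_clean, gold_labels) if p == g)
    return correct / len(gold_labels)


# Hypothetical predictions, for illustration only.
asr = attack_success_rate(["neg", "pos", "pos", "pos"], target_output="neg")
cacc = clean_accuracy(["pos", "neg", "pos", "pos"], ["pos", "neg", "pos", "neg"])
print(f"ASR={asr:.2f}, CACC={cacc:.2f}")
```

A successful defense drives ASR toward 0 while keeping CACC close to that of the undefended model, which is the pattern the findings above report.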
Abstract
Large Language Models (LLMs) are increasingly exposed to backdoor attacks, in which malicious triggers are implanted during training or model editing to elicit harmful outputs on specific input patterns while maintaining clean performance on normal inputs. Legitimate watermarks used as ownership signatures share similar mechanisms with backdoors, creating a critical challenge: detecting and eliminating unknown backdoors without compromising watermark integrity. Existing defenses require prior knowledge of triggers or their payloads, depend on clean reference models, or sacrifice model utility without preserving the watermark. To address these limitations, we introduce BackFlush and its variants, a unified framework for backdoor detection and elimination that preserves watermarks. We establish two novel observations: Backdoor Flushing Phenomenon, where injecting and unlearning…
