BackFlush: Knowledge-Free Backdoor Detection and Elimination with Watermark Preservation in Large Language Models

Jagadeesh Rachapudi; Ritali Vatsi; Pranav Singh; Praful Hambarde; Amit Shukla

arXiv:2605.12529 · cs.CR · May 14, 2026

TL;DR

BackFlush is a novel framework that detects and eliminates backdoors in large language models without harming watermarks, using a rotation-based parameter unlearning technique.

Contribution

The paper introduces two observations, the Backdoor Flushing Phenomenon and Backdoor Susceptibility Amplification, which enable effective backdoor detection and removal without prior knowledge of the trigger.

Findings

01. Achieves approximately 1% Attack Success Rate (ASR)

02. Maintains approximately 99% clean accuracy (CACC)

03. Preserves watermarking capabilities while eliminating backdoors
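
The two metrics reported above can be illustrated with a minimal sketch. It assumes the usual definitions from the backdoor literature: ASR is the fraction of triggered inputs on which the model produces the attacker's target output, and CACC is ordinary accuracy on clean inputs. The function names, the string-label representation, and the toy data are hypothetical and are not taken from the paper, whose exact evaluation protocol is not shown on this page.

```python
from typing import Sequence


def attack_success_rate(preds_on_triggered: Sequence[str], target_output: str) -> float:
    """Fraction of triggered inputs whose prediction equals the attacker's target.

    Assumed definition; lower is better for a defense.
    """
    if not preds_on_triggered:
        raise ValueError("need at least one triggered example")
    hits = sum(1 for p in preds_on_triggered if p == target_output)
    return hits / len(preds_on_triggered)


def clean_accuracy(preds_on_clean: Sequence[str], gold_labels: Sequence[str]) -> float:
    """Fraction of clean inputs predicted correctly.

    Assumed definition; higher is better.
    """
    if not gold_labels or len(preds_on_clean) != len(gold_labels):
        raise ValueError("predictions and labels must be non-empty and equal in length")
    correct = sum(1 for p, g in zip(preds_on_clean, gold_labels) if p == g)
    return correct / len(gold_labels)


# Toy data for illustration only; these are not results from the paper.
triggered_preds = ["neg", "pos", "pos", "pos"]  # model outputs on triggered inputs
clean_preds = ["pos", "neg", "pos", "neg"]      # model outputs on clean inputs
clean_gold = ["pos", "neg", "pos", "pos"]       # ground-truth labels for clean inputs

asr = attack_success_rate(triggered_preds, target_output="neg")
cacc = clean_accuracy(clean_preds, clean_gold)
print(f"ASR={asr:.2%}  CACC={cacc:.2%}")
```

Under these definitions, a successful defense drives ASR toward 0 while keeping CACC close to that of the unmodified model, which is the pattern the findings describe.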

Abstract

Large Language Models (LLMs) are increasingly exposed to backdoor attacks, in which malicious triggers are added during training or model editing to elicit harmful outputs on specific input patterns while maintaining clean performance on normal inputs. Legitimate watermarks used as ownership signatures rely on mechanisms similar to those of backdoors, creating a critical challenge: detecting and eliminating unknown backdoors without compromising watermark integrity. Existing defenses require prior knowledge of triggers or their payloads, depend on clean reference models, or sacrifice model utility without preserving the watermark. To address these limitations, we introduce BackFlush and its variants, a unified framework for backdoor detection and elimination that preserves watermarks. We establish two novel observations: the Backdoor Flushing Phenomenon, where injecting and unlearning…

Code & Models

Repositories

JagadeeshAI/BackFlush IJCNN.git
github
