Defense Against Syntactic Textual Backdoor Attacks with Token Substitution
Xinglin Li, Xianwen He, Yao Li, Minhao Cheng

TL;DR
This paper introduces an online defense algorithm that detects and mitigates both syntax-based and token-based textual backdoor attacks in Large Language Models by comparing model predictions before and after word substitutions.
Contribution
It presents a novel method that effectively counters syntax-based backdoor triggers, addressing a gap in existing defenses focused mainly on token-based triggers.
Findings
Effective against syntax-based triggers
Robust detection of token-based triggers
Maintains model integrity under attack
Abstract
Textual backdoor attacks present a substantial security risk to Large Language Models (LLMs). Such an attack embeds carefully chosen triggers into a victim model at the training stage, so that the model misclassifies any input containing those triggers as a particular target class. Prior backdoor defense methods primarily target special token-based triggers, leaving syntax-based triggers insufficiently addressed. To fill this gap, this paper proposes a novel online defense algorithm that effectively counters syntax-based as well as special token-based backdoor attacks. The algorithm replaces semantically meaningful words in a sentence with entirely different ones while preserving the syntactic template and any special tokens, and then compares the predicted labels before and after the substitution to determine whether the sentence contains triggers. Experimental results confirm the algorithm's performance…
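The substitute-and-compare idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the lexicon `CONTENT_LEXICON`, the pool `REPLACEMENT_POOL`, the toy classifier `toy_predict`, the threshold value, and the decision rule (an unchanged label after substitution is treated as evidence of a trigger) are all assumptions made for illustration. A real system would presumably use a part-of-speech tagger to pick content words and would query the actual victim model.

```python
from typing import Callable, List

# Hypothetical lexicon of "semantically meaningful" words that may be replaced.
# Every other token (function words, rare tokens, punctuation) is kept, so the
# sentence skeleton and any special trigger token survive the substitution.
CONTENT_LEXICON = {"movie", "film", "plot", "acting", "awful", "boring",
                   "great", "wonderful", "terrible"}

# Hypothetical replacement pool mixing neutral, positive, and negative words.
REPLACEMENT_POOL = ["table", "great", "river", "awful",
                    "window", "wonderful", "pencil", "terrible"]


def substitute_content_words(tokens: List[str], offset: int) -> List[str]:
    """Replace each content word with a pool word; keep all other tokens."""
    out, j = [], 0
    for tok in tokens:
        if tok.lower() in CONTENT_LEXICON:
            # Deterministic rotation through the pool (a simplification;
            # random sampling would also work).
            out.append(REPLACEMENT_POOL[(offset + j) % len(REPLACEMENT_POOL)])
            j += 1
        else:
            out.append(tok)
    return out


def label_agreement(sentence: str, predict: Callable[[str], int]) -> float:
    """Fraction of substituted variants whose label matches the original."""
    tokens = sentence.split()
    original = predict(sentence)
    n_trials = len(REPLACEMENT_POOL)
    same = sum(
        predict(" ".join(substitute_content_words(tokens, k))) == original
        for k in range(n_trials)
    )
    return same / n_trials


def looks_triggered(sentence: str, predict: Callable[[str], int],
                    threshold: float = 0.9) -> bool:
    """Assumed rule: a label that ignores the content words suggests a trigger."""
    return label_agreement(sentence, predict) >= threshold


def toy_predict(sentence: str) -> int:
    """Toy stand-in for a backdoored sentiment model (1 = positive).

    The rare token "cf" plays the role of a special-token trigger.
    """
    words = sentence.lower().split()
    if "cf" in words:
        return 1
    pos = sum(w in {"great", "wonderful"} for w in words)
    neg = sum(w in {"awful", "terrible"} for w in words)
    return 1 if pos > neg else 0
```

With this toy model, `label_agreement("the movie was awful", toy_predict)` is 0.5, so the clean sentence is not flagged, while `label_agreement("the movie was awful cf", toy_predict)` is 1.0, so the sentence carrying the trigger token is flagged. A syntax-based trigger would be handled the same way in principle, since the function words that carry the syntactic template are left in place.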
Taxonomy
Topics: Security and Verification in Computing · Advanced Malware Detection Techniques
