Defusing the Trigger: Plug-and-Play Defense for Backdoored LLMs via Tail-Risk Intrinsic Geometric Smoothing
Kaisheng Fan, Weizhe Zhang, Yishu Gao, Tegawendé F. Bissyandé, Xunzhu Tang

TL;DR
TIGS is a plug-and-play inference-time defense for large language models that disrupts backdoor triggers by intrinsic geometric smoothing, requiring no retraining or external data.
Contribution
Introduces TIGS, an inference-time method that requires no parameter updates and that detects and disrupts backdoor triggers by exploiting trigger-induced attention collapse and applying geometric smoothing.
Findings
TIGS significantly reduces attack success rates across various LLM architectures.
TIGS maintains high reasoning accuracy and semantic consistency on clean inputs.
TIGS introduces minimal latency overhead, enabling practical deployment.
Abstract
Defending against backdoor attacks in large language models remains a critical practical challenge. Existing defenses mitigate these threats but typically incur high preparation costs and degrade utility via offline purification, or introduce severe latency via complex online interventions. To overcome this dichotomy, we present Tail-risk Intrinsic Geometric Smoothing (TIGS), a plug-and-play inference-time defense requiring no parameter updates, external clean data, or auxiliary generation. TIGS leverages the observation that successful backdoor triggers consistently induce localized attention collapse within the semantic content region. Operating entirely within the native forward pass, TIGS first performs content-aware tail-risk screening to identify suspicious attention heads and rows using sample-internal signals. It then applies intrinsic geometric smoothing: a weak content-domain…
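The two-stage idea in the abstract (screen for collapsed attention rows using only the sample's own statistics, then smooth them) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the entropy score, the `tail_fraction` and `alpha` parameters, and the function names are hypothetical. In particular, mixing with a uniform distribution is a simple stand-in for the smoothing step, whose description is truncated in the abstract above; it is not the paper's operator. The sketch also ignores the content-region restriction and per-head screening.

```python
import math

def row_entropy(row):
    """Shannon entropy (nats) of one attention row; low entropy means collapsed."""
    return -sum(p * math.log(p) for p in row if p > 0.0)

def screen_rows(attn, tail_fraction=0.25):
    """Flag rows whose entropy lies in the lowest tail of this sample.

    The cutoff is derived from the sample itself, so no clean reference
    data is needed (hypothetical scoring rule, not the paper's).
    """
    entropies = [row_entropy(r) for r in attn]
    k = max(1, int(len(attn) * tail_fraction))
    cutoff = sorted(entropies)[k - 1]
    return [i for i, h in enumerate(entropies) if h <= cutoff]

def smooth_rows(attn, flagged, alpha=0.3):
    """Mix each flagged row with a uniform distribution (illustrative stand-in)."""
    out = [list(r) for r in attn]
    for i in flagged:
        n = len(attn[i])
        out[i] = [(1 - alpha) * p + alpha / n for p in attn[i]]
    return out

# Toy 4x4 attention matrix: row 1 is collapsed onto a single position.
attn = [
    [0.25, 0.25, 0.25, 0.25],
    [0.97, 0.01, 0.01, 0.01],
    [0.40, 0.30, 0.20, 0.10],
    [0.30, 0.30, 0.20, 0.20],
]
flagged = screen_rows(attn)
smoothed = smooth_rows(attn, flagged)
```

In this toy input only the collapsed row is flagged, and its mass is spread out while the row still sums to one; unflagged rows pass through unchanged, which mirrors the stated goal of leaving clean behavior intact.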
