Defusing the Trigger: Plug-and-Play Defense for Backdoored LLMs via Tail-Risk Intrinsic Geometric Smoothing

Kaisheng Fan; Weizhe Zhang; Yishu Gao; Tegawendé F. Bissyandé; Xunzhu Tang

arXiv:2604.24162 · cs.CR · April 28, 2026

TL;DR

TIGS is a plug-and-play inference-time defense for large language models that disrupts backdoor triggers by intrinsic geometric smoothing, requiring no retraining or external data.

Contribution

The paper introduces TIGS, a method that requires no parameter updates and that detects and disrupts backdoor triggers at inference time by exploiting trigger-induced attention collapse and applying geometric smoothing.

Findings

01

TIGS significantly reduces attack success rates across various LLM architectures.

02

TIGS maintains high reasoning accuracy and semantic consistency on clean inputs.

03

TIGS introduces minimal latency overhead, enabling practical deployment.

Abstract

Defending against backdoor attacks in large language models remains a critical practical challenge. Existing defenses mitigate these threats but typically incur high preparation costs and degrade utility via offline purification, or introduce severe latency via complex online interventions. To overcome this dichotomy, we present Tail-risk Intrinsic Geometric Smoothing (TIGS), a plug-and-play inference-time defense requiring no parameter updates, external clean data, or auxiliary generation. TIGS leverages the observation that successful backdoor triggers consistently induce localized attention collapse within the semantic content region. Operating entirely within the native forward pass, TIGS first performs content-aware tail-risk screening to identify suspicious attention heads and rows using sample-internal signals. It then applies intrinsic geometric smoothing: a weak content-domain…
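To make the screen-then-smooth idea concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes attention collapse can be measured as low row entropy over the content keys, that the "tail" is a lower quantile computed within a single sample, and that smoothing is a weak blend toward a uniform distribution over the content keys. The abstract is cut off before it describes the actual smoothing operator, so the blend here is only a generic placeholder. The names `row_entropy`, `screen_collapsed_rows`, `smooth_flagged_rows`, `tail_q`, and `alpha` are all hypothetical.

```python
import numpy as np


def row_entropy(attn, eps=1e-12):
    """Shannon entropy of each row after renormalising it to sum to 1.

    attn has shape (heads, queries, keys); the result has shape (heads, queries).
    """
    p = attn / (attn.sum(axis=-1, keepdims=True) + eps)
    return -(p * np.log(p + eps)).sum(axis=-1)


def screen_collapsed_rows(attn, content_mask, tail_q=0.05):
    """Flag (head, query) rows whose content-region entropy lies in the lower tail.

    Assumption: the threshold is a quantile of this sample's own entropies,
    so no external clean data is needed.
    """
    content = attn[..., content_mask]          # keep only content-key columns
    ent = row_entropy(content)
    threshold = np.quantile(ent, tail_q)
    return ent <= threshold


def smooth_flagged_rows(attn, content_mask, flagged, alpha=0.2):
    """Blend flagged rows toward uniform over content keys.

    Each row's total attention mass on the content region is preserved,
    so every row still sums to what it did before.
    """
    out = attn.copy()
    content = out[..., content_mask]
    mass = content.sum(axis=-1, keepdims=True)
    uniform = mass / content_mask.sum()        # same mass, spread evenly
    blended = (1.0 - alpha) * content + alpha * uniform
    out[..., content_mask] = np.where(flagged[..., None], blended, content)
    return out
```

In this sketch, `content_mask` is a boolean vector over key positions that marks the semantic content region, and `alpha` controls how weak the smoothing is. Rows that are not flagged pass through unchanged, which is one simple way to leave clean inputs mostly untouched.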
