ConfGuard: A Simple and Effective Backdoor Detection for Large Language Models

Zihan Wang; Rui Zhang; Hongwei Li; Wenshu Fan; Wenbo Jiang; Qingchuan Zhao; Guowen Xu

arXiv:2508.01365 · cs.CR · November 12, 2025

TL;DR

ConfGuard is a lightweight, real-time backdoor detection method for large language models that leverages the sequence lock phenomenon, achieving near-perfect detection rates with minimal latency.

Contribution

The paper introduces ConfGuard, a detection approach that exploits behavioral discrepancies in output confidence to identify backdoors in LLMs and remains effective in real-time scenarios.

Findings

01. Achieves a near-100% true positive rate in detecting backdoors.

02. Maintains a negligible false positive rate across experiments.

03. Operates with minimal additional latency, making it suitable for real-world deployment.

Abstract

Backdoor attacks pose a significant threat to Large Language Models (LLMs), where adversaries can embed hidden triggers to manipulate an LLM's outputs. Most existing defense methods, primarily designed for classification tasks, are ineffective against the autoregressive nature and vast output space of LLMs, thereby suffering from poor performance and high latency. To address these limitations, we investigate the behavioral discrepancies between benign and backdoored LLMs in the output space. We identify a critical phenomenon, which we term sequence lock: a backdoored model generates the target sequence with abnormally high and consistent confidence compared to benign generation. Building on this insight, we propose ConfGuard, a lightweight and effective detection method that monitors a sliding window of token confidences to identify sequence lock. Extensive experiments demonstrate ConfGuard…
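The sliding-window idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `detect_sequence_lock`, the window size of 5, the threshold of 0.99, and the rule "flag when every confidence in some window reaches the threshold" are all assumptions made for illustration, and the abstract does not state the paper's actual decision rule or parameters. Here a "confidence" is assumed to be the probability the model assigned to each token it generated.

```python
from typing import Sequence


def detect_sequence_lock(
    confidences: Sequence[float],
    window: int = 5,          # assumed window size, not taken from the paper
    threshold: float = 0.99,  # assumed confidence threshold, not taken from the paper
) -> bool:
    """Return True if any `window` consecutive confidences all reach `threshold`.

    Tracking the current streak of high-confidence tokens is equivalent to
    sliding a window over the sequence and checking that its minimum value
    is at least `threshold`.
    """
    if window <= 0:
        raise ValueError("window must be positive")

    streak = 0  # number of consecutive tokens at or above the threshold
    for p in confidences:
        streak = streak + 1 if p >= threshold else 0
        if streak >= window:
            return True
    return False


# Illustrative, made-up confidence traces (not experimental data).
benign_trace = [0.62, 0.91, 0.45, 0.99, 0.73, 0.88, 0.97]
locked_trace = [0.58, 0.81, 0.999, 0.998, 1.0, 0.999, 0.997, 0.999]

benign_flag = detect_sequence_lock(benign_trace)
locked_flag = detect_sequence_lock(locked_trace)
```

Because the check needs only one pass over per-token probabilities that the model already computes during generation, a rule of this shape could run alongside decoding, which is consistent with the low-latency claim above. In practice the window and threshold would need to be calibrated on benign generations.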
