Chain-of-Scrutiny: Detecting Backdoor Attacks for Large Language Models

Xi Li; Ruofan Mao; Yusen Zhang; Renze Lou; Chen Wu; Jiaqi Wang

arXiv:2406.05948 · cs.CR · October 31, 2025 · 2 cites

TL;DR

This paper introduces Chain-of-Scrutiny, a novel method leveraging LLMs' reasoning to detect backdoor attacks in API-based large language models, offering an efficient, transparent, and user-friendly defense mechanism.

Contribution

It proposes Chain-of-Scrutiny, a new approach that uses LLM reasoning to identify backdoor attacks without requiring model access or extensive data, suitable for API-only deployments.

Findings

01. Effective detection of backdoor attacks across various tasks and LLMs.
02. Greater benefits observed with more powerful LLMs.
03. Low-cost, data-efficient, and user-friendly detection method.

Abstract

Large Language Models (LLMs), especially those accessed via APIs, have demonstrated impressive capabilities across various domains. However, users without technical expertise often turn to (untrustworthy) third-party services, such as prompt engineering, to enhance their LLM experience, creating vulnerabilities to adversarial threats like backdoor attacks. Backdoor-compromised LLMs generate malicious outputs to users when inputs contain specific "triggers" set by attackers. Traditional defense strategies, originally designed for small-scale models, are impractical for API-accessible LLMs due to limited model access, high computational costs, and data requirements. To address these limitations, we propose Chain-of-Scrutiny (CoS) which leverages LLMs' unique reasoning abilities to mitigate backdoor attacks. It guides the LLM to generate reasoning steps for a given input and scrutinizes…
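
As a rough illustration of the idea the abstract describes (have the model produce reasoning steps for an input, then scrutinize them), here is a minimal sketch. Everything in it is an assumption made for illustration: the `query_llm` callable, the prompt wording, the `Answer:` marker, and the rule of comparing the answer supported by the reasoning with the final output are hypothetical and are not taken from the paper or from the lixi1994/CoS repository.

```python
from typing import Callable

# Hypothetical prompt templates; the wording is an assumption, not the paper's prompts.
REASON_PROMPT = (
    "Question: {question}\n"
    "Think step by step, then give the final answer on a last line "
    "starting with 'Answer:'."
)
SCRUTINY_PROMPT = (
    "Reasoning steps:\n{reasoning}\n\n"
    "Based only on these steps, what answer do they support? "
    "Reply with the answer only."
)


def split_reasoning_and_answer(response: str) -> tuple[str, str]:
    """Split a response into (reasoning, final answer) at the last 'Answer:' marker."""
    reasoning, sep, answer = response.rpartition("Answer:")
    if not sep:  # no marker found: treat the whole text as reasoning
        return response.strip(), ""
    return reasoning.strip(), answer.strip()


def looks_suspicious(question: str, query_llm: Callable[[str], str]) -> bool:
    """Return True when the reasoning and the final answer disagree.

    `query_llm` is a hypothetical callable that sends a prompt to an
    API-accessed model and returns its text response.
    """
    # Step 1: ask for reasoning steps plus a final answer.
    response = query_llm(REASON_PROMPT.format(question=question))
    reasoning, final_answer = split_reasoning_and_answer(response)
    # Step 2: ask which answer the reasoning alone supports.
    supported = query_llm(SCRUTINY_PROMPT.format(reasoning=reasoning)).strip()
    # Step 3: flag the input if the two answers differ (simple string comparison).
    return supported.lower() != final_answer.lower()
```

The sketch needs only black-box query access, which is the setting the abstract targets. The exact string comparison in the last step is a deliberate simplification; a practical check would likely need a more tolerant comparison of the two answers.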

Code & Models

Repositories

lixi1994/CoS · PyTorch · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Sparse Evolutionary Training