Chain-of-Scrutiny: Detecting Backdoor Attacks for Large Language Models
Xi Li, Ruofan Mao, Yusen Zhang, Renze Lou, Chen Wu, Jiaqi Wang

TL;DR
This paper introduces Chain-of-Scrutiny, a novel method leveraging LLMs' reasoning to detect backdoor attacks in API-based large language models, offering an efficient, transparent, and user-friendly defense mechanism.
Contribution
It proposes Chain-of-Scrutiny, a new approach that uses LLM reasoning to identify backdoor attacks without requiring model access or extensive data, suitable for API-only deployments.
Findings
Effective detection of backdoor attacks across various tasks and LLMs.
Greater benefits observed with more powerful LLMs.
Low-cost, data-efficient, and user-friendly detection method.
Abstract
Large Language Models (LLMs), especially those accessed via APIs, have demonstrated impressive capabilities across various domains. However, users without technical expertise often turn to (untrustworthy) third-party services, such as prompt engineering, to enhance their LLM experience, creating vulnerabilities to adversarial threats like backdoor attacks. Backdoor-compromised LLMs generate malicious outputs to users when inputs contain specific "triggers" set by attackers. Traditional defense strategies, originally designed for small-scale models, are impractical for API-accessible LLMs due to limited model access, high computational costs, and data requirements. To address these limitations, we propose Chain-of-Scrutiny (CoS) which leverages LLMs' unique reasoning abilities to mitigate backdoor attacks. It guides the LLM to generate reasoning steps for a given input and scrutinizes…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Natural Language Processing Techniques
MethodsSparse Evolutionary Training
