LLMScan: Causal Scan for LLM Misbehavior Detection
Mengdi Zhang, Kai Kiat Goh, Peixin Zhang, Jun Sun, Rose Lin Xin, Hongyu Zhang

TL;DR
LLMScan introduces a causal inference-based monitoring method to detect untruthful, biased, and harmful responses in large language models by analyzing their internal causal contributions.
Contribution
This work presents LLMScan, a novel causality analysis approach for comprehensive detection of LLM misbehavior, surpassing existing targeted methods.
Findings
Effectively distinguishes normal behavior from misbehavior via causal distribution analysis.
Develops accurate, lightweight detectors for various misbehavior types.
Demonstrates robustness across multiple tasks and models.
Abstract
Despite the success of Large Language Models (LLMs) across various fields, their potential to generate untruthful, biased and harmful responses poses significant risks, particularly in critical applications. This highlights the urgent need for systematic methods to detect and prevent such misbehavior. While existing approaches target specific issues such as harmful responses, this work introduces LLMScan, an innovative LLM monitoring technique based on causality analysis, offering a comprehensive solution. LLMScan systematically monitors the inner workings of an LLM through the lens of causal inference, operating on the premise that the LLM's 'brain' behaves differently when misbehaving. By analyzing the causal contributions of the LLM's input tokens and transformer layers, LLMScan effectively detects misbehavior. Extensive experiments across various tasks and models reveal clear…
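The idea of measuring a layer's causal contribution can be illustrated with a minimal sketch. It is not the authors' implementation: a toy residual network stands in for the LLM, and the intervention (bypassing one layer and measuring how far the output moves) is an assumption chosen for illustration. All names here (`make_layer`, `forward`, `layer_causal_effects`) are hypothetical.

```python
import math
import random


def make_layer(dim, rng):
    """Random dense weights for one toy layer (stand-in for a transformer layer)."""
    return [[rng.gauss(0.0, 0.5) for _ in range(dim)] for _ in range(dim)]


def apply_layer(weights, h):
    """Residual update: h + tanh(W h)."""
    return [
        h_i + math.tanh(sum(w * x for w, x in zip(row, h)))
        for h_i, row in zip(h, weights)
    ]


def forward(layers, h, skip=None):
    """Run the toy model; if `skip` is a layer index, bypass that layer."""
    for idx, weights in enumerate(layers):
        if idx == skip:
            continue  # intervention: the residual stream passes through unchanged
        h = apply_layer(weights, h)
    return h


def layer_causal_effects(layers, h):
    """Per-layer effect: distance between the normal and the intervened output."""
    baseline = forward(layers, h)
    return [
        math.dist(baseline, forward(layers, h, skip=idx))
        for idx in range(len(layers))
    ]


# Assumed toy setup: 4 layers, hidden size 8, fixed seed for repeatability.
rng = random.Random(0)
layers = [make_layer(8, rng) for _ in range(4)]
h0 = [rng.gauss(0.0, 1.0) for _ in range(8)]
effects = layer_causal_effects(layers, h0)
print(effects)  # one non-negative score per layer
```

The resulting vector of per-layer scores is the kind of feature a lightweight detector could be trained on. An analogous intervention on individual input tokens would give a token-level counterpart.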
