LLMScan: Causal Scan for LLM Misbehavior Detection

Mengdi Zhang; Kai Kiat Goh; Peixin Zhang; Jun Sun; Rose Lin Xin; Hongyu Zhang

arXiv:2410.16638 · cs.AI · May 27, 2025

TL;DR

LLMScan is a monitoring method based on causal inference that detects untruthful, biased, and harmful responses from large language models by analyzing the causal contributions of their input tokens and transformer layers.

Contribution

This work presents LLMScan, a novel causality analysis approach for comprehensive detection of LLM misbehavior, surpassing existing targeted methods.

Findings

1. Distinguishes normal behavior from misbehavior by analyzing the distribution of causal effects.

2. Yields accurate, lightweight detectors for several types of misbehavior.

3. Remains robust across multiple tasks and models.

Abstract

Despite the success of Large Language Models (LLMs) across various fields, their potential to generate untruthful, biased and harmful responses poses significant risks, particularly in critical applications. This highlights the urgent need for systematic methods to detect and prevent such misbehavior. While existing approaches target specific issues such as harmful responses, this work introduces LLMScan, an innovative LLM monitoring technique based on causality analysis, offering a comprehensive solution. LLMScan systematically monitors the inner workings of an LLM through the lens of causal inference, operating on the premise that the LLM's "brain" behaves differently when misbehaving. By analyzing the causal contributions of the LLM's input tokens and transformer layers, LLMScan effectively detects misbehavior. Extensive experiments across various tasks and models reveal clear…
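The idea of measuring causal contributions of input tokens and transformer layers can be illustrated with a toy sketch. This is not the authors' implementation: the model below is a hypothetical scalar stand-in for an LLM, and the two interventions (replacing a token with a neutral value, and skipping a layer through its residual path) are assumptions chosen for illustration. All function names are invented for this example.

```python
import math


def run_toy_model(tokens, layer_weights, skip_layer=None):
    """Hypothetical stand-in for an LLM that returns one scalar output."""
    # "Embed" the prompt as the mean of its token values.
    h = sum(tokens) / len(tokens)
    for i, w in enumerate(layer_weights):
        if i == skip_layer:
            continue  # intervention: bypass this layer, keep the residual stream
        h = h + w * math.tanh(h)  # residual update
    return h


def layer_effects(tokens, layer_weights):
    """Absolute output change when each layer is skipped in turn."""
    base = run_toy_model(tokens, layer_weights)
    return [
        abs(base - run_toy_model(tokens, layer_weights, skip_layer=i))
        for i in range(len(layer_weights))
    ]


def token_effects(tokens, layer_weights, neutral=0.0):
    """Absolute output change when each token is replaced by a neutral value."""
    base = run_toy_model(tokens, layer_weights)
    effects = []
    for j in range(len(tokens)):
        intervened = tokens[:j] + [neutral] + tokens[j + 1:]
        effects.append(abs(base - run_toy_model(intervened, layer_weights)))
    return effects


def causal_map(tokens, layer_weights):
    """Concatenate token-level and layer-level effects into one feature vector."""
    return token_effects(tokens, layer_weights) + layer_effects(tokens, layer_weights)


if __name__ == "__main__":
    features = causal_map([1.0, 0.0, 2.0], [0.5, 0.0, -0.3])
    print(features)
```

A feature vector of this kind, computed for many prompts, is what a lightweight classifier could be trained on to separate normal responses from misbehaving ones. The choice of classifier and of the exact effect measure is left open here.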

