Caught in the Act: a mechanistic approach to detecting deception
Gerard Boxo, Ryan Socha, Daniel Yoo, Shivam Raval

TL;DR
This paper presents a mechanistic approach using linear probes to detect deception in large language models with high accuracy, revealing specific internal representations associated with deceptive responses.
Contribution
It introduces a novel linear probing method to identify deception in LLMs and characterizes the internal activation patterns linked to deceptive outputs.
Findings
Probes reach a maximum accuracy above 90% at distinguishing deceptive from non-deceptive arguments.
Detection accuracy varies with model size, from chance level in small models (1.5B) to over 90% in large models.
Multiple linear directions encode deception, with their number increasing in larger models.
Abstract
Sophisticated instrumentation for AI systems might have indicators that signal misalignment with human values, not unlike a "check engine" light in cars. One such indicator of misalignment is deceptiveness in generated responses. Future AI instrumentation may have the ability to detect when an LLM generates deceptive responses while reasoning about seemingly plausible but incorrect answers to factual questions. In this work, we demonstrate that linear probes on LLMs' internal activations can detect deception in their responses with extremely high accuracy. Our probes reach a maximum of greater than 90% accuracy in distinguishing between deceptive and non-deceptive arguments generated by Llama and Qwen models ranging from 1.5B to 14B parameters, including their DeepSeek-R1 fine-tuned variants. We observe that probes on smaller models (1.5B) achieve chance accuracy at detecting deception,…
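To make the idea of a linear probe concrete, here is a minimal sketch, not the authors' implementation. It trains a logistic-regression probe with plain NumPy on synthetic vectors that stand in for internal activations. The data generator, the single planted "deception" direction, the dimensions, and every function name are assumptions made for illustration. A real probe would be trained on activations extracted from a chosen layer of the model.

```python
import numpy as np


def make_synthetic_activations(n=400, d=32, seed=0):
    """Hypothetical stand-in for LLM activations (not real model data).

    Each row is a d-dimensional vector; label 1 means "deceptive", 0 means
    "non-deceptive". The two classes are shifted apart along one planted
    direction, which is an assumption of this sketch.
    """
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    labels = rng.integers(0, 2, size=n)
    signs = 2 * labels - 1  # map {0, 1} to {-1, +1}
    acts = rng.normal(size=(n, d)) + 3.0 * np.outer(signs, direction)
    return acts, labels


def train_linear_probe(X, y, lr=0.1, steps=500):
    """Fit logistic regression by full-batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(deceptive)
        grad = p - y                            # gradient of log loss w.r.t. logits
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def probe_accuracy(X, y, w, b):
    """Fraction of examples the probe classifies correctly."""
    preds = (X @ w + b > 0).astype(int)
    return float((preds == y).mean())


X, y = make_synthetic_activations()
split = 300  # first 300 rows for training, the rest held out
w, b = train_linear_probe(X[:split], y[:split])
test_acc = probe_accuracy(X[split:], y[split:], w, b)
print(f"held-out probe accuracy: {test_acc:.2f}")
```

The learned weight vector `w` is the probe's linear direction: an activation is classified by which side of the hyperplane `X @ w + b = 0` it falls on. The accuracy printed here only reflects how separable the synthetic data was made, and says nothing about the accuracies reported in the paper.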
