Probing the Limits of the Lie Detector Approach to LLM Deception
Tom-Felix Berger

TL;DR
This paper challenges the assumption that deception in LLMs always involves lying, showing models can deceive without false statements and that current truth probes often fail to detect such non-lying deception.
Contribution
It demonstrates that LLMs can deceive through misleading non-falsities and highlights the limitations of existing truth probes in detecting non-lying deception, proposing new directions for research.
Findings
Models can deceive without producing false statements.
Truth probes are better at detecting lies than non-lying deception.
Current mechanistic detection methods have a significant blind spot: deception that involves no false statement.
Abstract
Mechanistic approaches to deception in large language models (LLMs) often rely on "lie detectors", that is, truth probes trained to identify internal representations of model outputs as false. The lie detector approach to LLM deception implicitly assumes that deception is coextensive with lying. This paper challenges that assumption. It experimentally investigates whether LLMs can deceive without producing false statements and whether truth probes fail to detect such behavior. Across three open-source LLMs, it is shown that some models reliably deceive by producing misleading non-falsities, particularly when guided by few-shot prompting. It is further demonstrated that truth probes trained on standard true-false datasets are significantly better at detecting lies than at detecting deception without lying, confirming a critical blind spot of current mechanistic deception detection…
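To make the idea of a truth probe concrete, here is a minimal sketch of a linear probe, written as a logistic regression in plain Python. It is not the paper's code, models, or data. The "activations" are synthetic Gaussian vectors. The single "truth direction" along the first axis, the shift size, and all function names are illustrative assumptions. The sketch also hard-codes the assumption that a misleading but literally true statement is represented like any other true statement. That is the scenario under which the blind spot would arise, so the printed rates illustrate the argument and are not evidence for it.

```python
import math
import random


def make_activations(n, dim, sign, rng, shift=2.0):
    """Synthetic stand-in for hidden states (an assumption, not real model data):
    unit Gaussian noise plus a shift along axis 0, a hypothetical truth direction.
    sign=+1 mimics statements represented as true, sign=-1 as false."""
    rows = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        v[0] += sign * shift
        rows.append(v)
    return rows


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(w, b, x):
    # Probe output: estimated probability that the statement is false.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_probe(xs, ys, epochs=200, lr=0.1):
    """Fit a logistic-regression probe by full-batch gradient descent.
    Label 1 means 'false statement', label 0 means 'true statement'."""
    dim = len(xs[0])
    w, b, n = [0.0] * dim, 0.0, len(xs)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = score(w, b, x) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def flag_rate(probe, xs):
    """Fraction of examples the probe flags as false."""
    w, b = probe
    return sum(1 for x in xs if score(w, b, x) > 0.5) / len(xs)


rng = random.Random(0)
DIM = 8

# Train on a standard true/false split, as a lie-detector probe would be.
true_x = make_activations(200, DIM, +1, rng)
false_x = make_activations(200, DIM, -1, rng)
probe = train_probe(true_x + false_x, [0] * 200 + [1] * 200)

# Lies are false statements. Misleading non-falsities are literally true,
# so (by assumption) they carry the 'true' signature in this toy setup.
lies = make_activations(100, DIM, -1, rng)
misleading_truths = make_activations(100, DIM, +1, rng)

lie_rate = flag_rate(probe, lies)
misleading_rate = flag_rate(probe, misleading_truths)
print(f"flagged lies: {lie_rate:.2f}")
print(f"flagged misleading truths: {misleading_rate:.2f}")
```

In this toy setup the probe only tracks whether a statement is represented as false, so deceptive outputs that are literally true pass largely unflagged. Whether real model activations behave this way is the empirical question the paper investigates.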
Taxonomy
Topics: Deception detection and forensic psychology · Topic Modeling · Explainable Artificial Intelligence (XAI)
