Probing the Limits of the Lie Detector Approach to LLM Deception

Tom-Felix Berger

arXiv:2603.10003 · cs.CL · March 12, 2026

Open Access

TL;DR

This paper challenges the assumption that LLM deception is coextensive with lying. It shows that models can deceive without making false statements and that truth probes detect such non-lying deception significantly less well than they detect lies.

Contribution

The paper demonstrates that LLMs can deceive through misleading non-falsities, highlights the limitations of existing truth probes in detecting such non-lying deception, and proposes new directions for research.

Findings

01. Models can deceive without producing false statements.

02. Truth probes are better at detecting lies than non-lying deception.

03. Current mechanistic deception detection has a critical blind spot: deception without lying.

Abstract

Mechanistic approaches to deception in large language models (LLMs) often rely on "lie detectors", that is, truth probes trained to identify internal representations of model outputs as false. The lie detector approach to LLM deception implicitly assumes that deception is coextensive with lying. This paper challenges that assumption. It experimentally investigates whether LLMs can deceive without producing false statements and whether truth probes fail to detect such behavior. Across three open-source LLMs, it is shown that some models reliably deceive by producing misleading non-falsities, particularly when guided by few-shot prompting. It is further demonstrated that truth probes trained on standard true-false datasets are significantly better at detecting lies than at detecting deception without lying, confirming a critical blind spot of current mechanistic deception detection…
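
To make the notion of a truth probe concrete, here is a minimal sketch of a linear probe fitted on labelled true/false examples. It is an illustration under stated assumptions, not the paper's method: the "activations" are synthetic Gaussian vectors rather than hidden states from a real model, the difference-of-means classifier is one simple probe design chosen for brevity, and all names (`fit_mean_difference_probe`, `predict_true`, `synthetic_activations`) are hypothetical.

```python
import random


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fit_mean_difference_probe(activations, labels):
    """Fit a difference-of-means linear probe.

    Returns (direction, threshold). An activation is classified as
    "true" when dot(activation, direction) > threshold.
    """
    true_rows = [x for x, y in zip(activations, labels) if y == 1]
    false_rows = [x for x, y in zip(activations, labels) if y == 0]
    mu_true, mu_false = mean_vector(true_rows), mean_vector(false_rows)
    direction = [a - b for a, b in zip(mu_true, mu_false)]
    # Place the decision threshold halfway between the projected class means.
    threshold = 0.5 * (dot(mu_true, direction) + dot(mu_false, direction))
    return direction, threshold


def predict_true(activation, direction, threshold):
    return dot(activation, direction) > threshold


def synthetic_activations(n, dim, rng):
    """Assumption: a stand-in for model hidden states.

    Each vector is Gaussian noise plus a shift along axis 0 whose sign
    encodes the truth value of the underlying statement.
    """
    data, labels = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        vec[0] += 3.0 if label == 1 else -3.0
        data.append(vec)
        labels.append(label)
    return data, labels


rng = random.Random(0)
train_x, train_y = synthetic_activations(400, 16, rng)
test_x, test_y = synthetic_activations(200, 16, rng)

direction, threshold = fit_mean_difference_probe(train_x, train_y)
correct = sum(
    predict_true(x, direction, threshold) == bool(y)
    for x, y in zip(test_x, test_y)
)
accuracy = correct / len(test_y)
print(f"held-out accuracy: {accuracy:.2f}")
```

A probe of this kind only scores whether a statement is represented as true or false. It has no term for whether a true statement is misleading, which is the gap the abstract describes.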

Taxonomy

Topics: Deception detection and forensic psychology · Topic Modeling · Explainable Artificial Intelligence (XAI)