When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models

Kai Wang; Yihao Zhang; Meng Sun

arXiv:2506.04909·cs.AI·June 6, 2025

TL;DR

This paper investigates strategic deception in large language models with chain-of-thought reasoning, revealing how they can intentionally mislead and proposing methods to detect and control such behavior for trustworthy AI.

Contribution

The paper applies representation engineering techniques, including Linear Artificial Tomography (LAT) and activation steering, to systematically induce, detect, and control deception in reasoning models, providing tools for trustworthy AI alignment.
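To make the detection idea concrete, here is a minimal sketch of extracting a "deception vector" in the spirit of LAT: take paired hidden activations from honest and deceptive runs, find the dominant direction of their differences, and classify new activations by projecting onto it. Everything here is an assumption made for illustration: the synthetic Gaussian activations, the uncentered PCA via SVD, the midpoint threshold, and the function names `extract_direction`, `fit_threshold`, `detect`, and `make_synthetic`. This section does not specify the paper's layers, prompts, or exact procedure.

```python
import numpy as np


def extract_direction(honest_acts: np.ndarray, deceptive_acts: np.ndarray) -> np.ndarray:
    """Return a unit vector along the dominant axis of the paired differences.

    Both inputs have shape (n_pairs, hidden_dim); row i of each array is
    assumed to come from the same prompt under the two conditions.
    """
    diffs = deceptive_acts - honest_acts
    # Uncentered PCA: first right singular vector of the difference matrix.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    direction = vt[0]
    # SVD leaves the sign arbitrary; orient it so deceptive runs score higher.
    if np.mean(diffs @ direction) < 0:
        direction = -direction
    return direction / np.linalg.norm(direction)


def fit_threshold(honest_acts: np.ndarray, deceptive_acts: np.ndarray,
                  direction: np.ndarray) -> float:
    """Midpoint between the mean projections of the two classes."""
    return 0.5 * (np.mean(honest_acts @ direction)
                  + np.mean(deceptive_acts @ direction))


def detect(acts: np.ndarray, direction: np.ndarray, threshold: float) -> np.ndarray:
    """Boolean mask: True where the projection exceeds the threshold."""
    return acts @ direction > threshold


def make_synthetic(n: int, dim: int, true_dir: np.ndarray, shift: float,
                   rng: np.random.Generator):
    """Synthetic stand-in for real hidden states (purely illustrative)."""
    honest = rng.normal(size=(n, dim))
    deceptive = rng.normal(size=(n, dim)) + shift * true_dir
    return honest, deceptive


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_dir = np.zeros(16)
    true_dir[0] = 1.0
    honest, deceptive = make_synthetic(400, 16, true_dir, 3.0, rng)
    # Fit on the first half, evaluate on the held-out second half.
    v = extract_direction(honest[:200], deceptive[:200])
    t = fit_threshold(honest[:200], deceptive[:200], v)
    hits = np.concatenate([detect(deceptive[200:], v, t),
                           ~detect(honest[200:], v, t)])
    print(f"held-out accuracy on synthetic data: {hits.mean():.2f}")
```

With real models, `honest_acts` and `deceptive_acts` would be hidden states read from a chosen layer and token position; the accuracy printed here describes only the toy data and says nothing about the paper's 89% figure.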

Findings

01. Achieved 89% accuracy in deception detection using deception vectors extracted via LAT.

02. Elicited context-appropriate deception with a 40% success rate through activation steering, without explicit prompts.

03. Unveiled the honesty-related issue specific to reasoning models.

Abstract

The honesty of large language models (LLMs) is a critical alignment challenge, especially as advanced systems with chain-of-thought (CoT) reasoning may strategically deceive humans. Unlike traditional honesty issues in LLMs, which could possibly be explained as some kind of hallucination, these models' explicit thought paths enable us to study strategic deception: goal-driven, intentional misinformation in which reasoning contradicts outputs. Using representation engineering, we systematically induce, detect, and control such deception in CoT-enabled LLMs, extracting "deception vectors" via Linear Artificial Tomography (LAT) for 89% detection accuracy. Through activation steering, we achieve a 40% success rate in eliciting context-appropriate deception without explicit prompts, unveiling the specific honesty-related issue of reasoning models and providing tools for trustworthy AI alignment.
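As a rough illustration of activation steering, the sketch below adds a scaled direction to every token's hidden state and checks how the projection onto that direction changes. It is a toy under stated assumptions: the arrays are random stand-ins for a model's hidden states, and the names `steer` and `projection` and the strength `alpha` are hypothetical. The abstract does not say which layers or strengths the authors use.

```python
import numpy as np


def steer(hidden: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Add alpha times the unit-normalised direction to every token position.

    hidden has shape (seq_len, hidden_dim); the input array is not modified.
    """
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit


def projection(hidden: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Scalar projection of each token's hidden state onto the direction."""
    unit = direction / np.linalg.norm(direction)
    return hidden @ unit


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hidden = rng.normal(size=(5, 8))   # stand-in for one layer's hidden states
    direction = rng.normal(size=8)     # stand-in for an extracted vector
    steered = steer(hidden, direction, alpha=2.5)
    # Each token's projection rises by alpha; orthogonal components are untouched.
    print(projection(steered, direction) - projection(hidden, direction))
```

In a real model the same addition would typically be applied inside the forward pass at a chosen layer, so that later layers and the generated text are affected; how strongly, and at which layer, is a tuning choice this sketch does not address.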

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education · Adversarial Robustness in Machine Learning