When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models
Kai Wang, Yihao Zhang, Meng Sun

TL;DR
This paper investigates strategic deception in large language models with chain-of-thought reasoning, revealing how they can intentionally mislead and proposing methods to detect and control such behavior for trustworthy AI.
Contribution
It applies representation engineering techniques, including Linear Artificial Tomography (LAT) and activation steering, to systematically detect, induce, and control deception in reasoning models, advancing AI alignment efforts.
Findings
Achieved 89% accuracy in deception detection using LAT.
Elicited context-appropriate deception through activation steering, without explicit prompts, with a 40% success rate.
Identified an honesty issue specific to reasoning models: strategic deception, in which the model's reasoning contradicts its output.
Abstract
The honesty of large language models (LLMs) is a critical alignment challenge, especially as advanced systems with chain-of-thought (CoT) reasoning may strategically deceive humans. Unlike traditional honesty issues in LLMs, which could plausibly be explained as a form of hallucination, these models' explicit thought paths enable us to study strategic deception: goal-driven, intentional misinformation in which the reasoning contradicts the output. Using representation engineering, we systematically induce, detect, and control such deception in CoT-enabled LLMs, extracting "deception vectors" via Linear Artificial Tomography (LAT) and reaching 89% detection accuracy. Through activation steering, we achieve a 40% success rate in eliciting context-appropriate deception without explicit prompts, revealing an honesty issue specific to reasoning models and providing tools for trustworthy AI alignment.
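The idea of a "deception vector" and of activation steering can be illustrated with a toy sketch. This is not the authors' pipeline: it uses synthetic Gaussian vectors in place of real hidden states, and it swaps in a simple difference-of-means direction as a stand-in for LAT. All names (`make_activations`, `fit_direction`, `steer`, `DIM`) and all constants are hypothetical, and the accuracy it prints is unrelated to the 89% and 40% figures reported in the paper.

```python
import random

DIM = 8  # hypothetical hidden-state width; real models use thousands


def make_activations(rng, n, shift):
    """Synthetic stand-in for hidden states: Gaussian noise plus a class shift."""
    return [[rng.gauss(0.0, 1.0) + s for s in shift] for _ in range(n)]


def mean_vec(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_direction(honest, deceptive):
    """Unit-norm difference of class means, plus a midpoint threshold.

    Assumption: a difference-of-means probe, used here instead of LAT.
    """
    mu_h, mu_d = mean_vec(honest), mean_vec(deceptive)
    diff = [d - h for d, h in zip(mu_d, mu_h)]
    norm = sum(x * x for x in diff) ** 0.5
    direction = [x / norm for x in diff]
    threshold = (dot(mu_h, direction) + dot(mu_d, direction)) / 2
    return direction, threshold


def is_deceptive(activation, direction, threshold):
    """Detection: flag an activation whose projection exceeds the threshold."""
    return dot(activation, direction) > threshold


def steer(activation, direction, alpha):
    """Steering: shift an activation by alpha along the extracted direction."""
    return [a + alpha * d for a, d in zip(activation, direction)]


rng = random.Random(0)
true_shift = [4.0] + [0.0] * (DIM - 1)  # hypothetical "deception" offset
zero = [0.0] * DIM

# Fit the direction on one synthetic sample of each class.
train_h = make_activations(rng, 200, zero)
train_d = make_activations(rng, 200, true_shift)
direction, threshold = fit_direction(train_h, train_d)

# Evaluate detection on a fresh sample.
test_h = make_activations(rng, 200, zero)
test_d = make_activations(rng, 200, true_shift)
correct = sum(not is_deceptive(a, direction, threshold) for a in test_h)
correct += sum(is_deceptive(a, direction, threshold) for a in test_d)
accuracy = correct / (len(test_h) + len(test_d))

# Push one "honest" activation along the direction.
steered = steer(test_h[0], direction, alpha=6.0)
print(f"held-out accuracy: {accuracy:.2f}")
print("steered flagged:", is_deceptive(steered, direction, threshold))
```

Because the direction has unit norm, steering with strength `alpha` raises the projection score by exactly `alpha`, which is why a large enough shift moves an "honest" vector across the detection threshold. In a real model the direction would be added to a layer's hidden states during the forward pass rather than to a standalone vector.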
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education · Adversarial Robustness in Machine Learning
