Disentangling Deception and Hallucination Failures in LLMs
Haolang Lu, Hongrui Peng, WeiYe Fu, Guoshun Nan, Xinye Cao, Xingrui Li, Hongcan Guo, Kun Wang

TL;DR
This paper distinguishes hallucination from deception failures in large language models by analyzing their underlying mechanisms, using a controlled environment to systematically study four behavioral cases and their internal representations.
Contribution
It introduces a mechanism-oriented framework that separates Knowledge Existence from Behavior Expression, distinguishing hallucination from deception in LLM failures rather than relying on output behavior alone.
Findings
Hallucination and deception are distinct failure modes with different mechanisms.
The two failure types are separable in the model's internal representations.
Inference-time activation steering can influence failure modes.
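The separability and steering findings can be illustrated with a toy sketch. This is not the authors' method. It assumes hidden-state vectors have already been collected for two failure types and substitutes synthetic vectors for them. It then uses a simple difference-of-means direction (a common probing choice, swapped in here for illustration) in two ways: as a linear probe, and as a steering vector added to a hidden state at inference time. The names `class_a`, `class_b`, `difference_of_means`, `probe_accuracy`, and `steer` are hypothetical, and the data carry no information about the paper's results.

```python
import random

DIM = 8  # hypothetical hidden size; real models use thousands of dimensions

def synthetic_activations(center, n, rng):
    """Draw n vectors scattered around `center` (illustrative data only)."""
    return [[c + rng.gauss(0.0, 0.5) for c in center] for _ in range(n)]

def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def difference_of_means(group_a, group_b):
    """Direction pointing from the mean of group_b toward the mean of group_a."""
    return [x - y for x, y in zip(mean(group_a), mean(group_b))]

def probe_accuracy(direction, group_a, group_b):
    """Linear probe: project onto `direction`, threshold at the midpoint of the class means."""
    mid = (dot(direction, mean(group_a)) + dot(direction, mean(group_b))) / 2
    correct = sum(dot(direction, v) > mid for v in group_a)
    correct += sum(dot(direction, v) <= mid for v in group_b)
    return correct / (len(group_a) + len(group_b))

def steer(activation, direction, alpha):
    """Inference-time steering: add alpha * direction to a hidden state."""
    return [x + alpha * d for x, d in zip(activation, direction)]

rng = random.Random(0)
# Hypothetical stand-ins for activations on, e.g., hallucination vs. deception examples.
class_a = synthetic_activations([1.0] + [0.0] * (DIM - 1), 200, rng)
class_b = synthetic_activations([-1.0] + [0.0] * (DIM - 1), 200, rng)

direction = difference_of_means(class_a, class_b)
acc = probe_accuracy(direction, class_a, class_b)
print(f"probe accuracy on synthetic data: {acc:.2f}")

# Push one class_b state along the direction toward class_a.
x = class_b[0]
x_steered = steer(x, direction, alpha=1.0)
print(dot(direction, x), "->", dot(direction, x_steered))
```

Adding `alpha * direction` raises the projection by exactly `alpha * dot(direction, direction)`. That is why a large enough `alpha` can move a state across the probe's threshold. In practice, the layer and the strength of the intervention have to be chosen empirically.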
Abstract
Failures in large language models (LLMs) are often analyzed from a behavioral perspective, where incorrect outputs in factual question answering are commonly associated with missing knowledge. In this work, focusing on entity-based factual queries, we suggest that such a view may conflate different failure mechanisms, and propose an internal, mechanism-oriented perspective that separates Knowledge Existence from Behavior Expression. Under this formulation, hallucination and deception correspond to two qualitatively different failure modes that may appear similar at the output level but differ in their underlying mechanisms. To study this distinction, we construct a controlled environment for entity-centric factual questions in which knowledge is preserved while behavioral expression is selectively altered, enabling systematic analysis of four behavioral cases. We analyze these failure…
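The split between Knowledge Existence and Behavior Expression can be made concrete as a 2×2 grid. The sketch below shows one plausible reading under stated assumptions. "Knowledge exists" is treated as a boolean that would come from some internal signal, and "answer correct" is judged from the output alone. The four labels are hypothetical and may not match the four behavioral cases the paper actually defines. Only the hallucination/deception contrast is taken from the abstract.

```python
from dataclasses import dataclass
from enum import Enum

class Case(Enum):
    # Hypothetical labels for (knowledge exists?) x (answer correct?).
    # The paper's own four cases may be named or defined differently.
    FAITHFUL = "knowledge present, correct answer"
    DECEPTION = "knowledge present, incorrect answer"
    HALLUCINATION = "knowledge absent, incorrect answer"
    CORRECT_WITHOUT_KNOWLEDGE = "knowledge absent, correct answer"

@dataclass(frozen=True)
class Observation:
    knowledge_exists: bool  # assumed to come from an internal signal, not from the output
    answer_correct: bool    # judged from the output alone

def classify(obs: Observation) -> Case:
    if obs.knowledge_exists:
        return Case.FAITHFUL if obs.answer_correct else Case.DECEPTION
    return Case.CORRECT_WITHOUT_KNOWLEDGE if obs.answer_correct else Case.HALLUCINATION

for k in (True, False):
    for c in (True, False):
        print(k, c, classify(Observation(k, c)).name)
```

A purely behavioral view sees only `answer_correct`, so `DECEPTION` and `HALLUCINATION` look identical. Both are simply wrong answers. Only the `knowledge_exists` axis separates them, which is the distinction the abstract argues a behavioral analysis can conflate.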
Taxonomy
Topics: Topic Modeling · Advanced Graph Neural Networks · Logic, Reasoning, and Knowledge
