The Truthfulness Spectrum Hypothesis
Zhuofan Josh Ying, Shauli Ravfogel, Nikolaus Kriegeskorte, Peter Hase

TL;DR
This paper investigates how large language models encode different types of truth and falsehood, revealing a spectrum of truth directions in their representations that can be manipulated and reshaped through training and interventions.
Contribution
It introduces the truthfulness spectrum hypothesis, demonstrating the coexistence of domain-general and domain-specific truth directions in LLMs and analyzing their geometric and causal properties.
Findings
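Linear probes generalize well across most truth types but fail on sycophantic and expectation-inverted lying.
Training on all domains jointly recovers strong performance, indicating that domain-general directions exist despite poor pairwise transfer.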
Post-training reshapes truth geometry, affecting model tendencies.
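The transfer pattern in the findings above can be illustrated with a toy sketch. This is not the paper's pipeline: the "activations" are synthetic Gaussian vectors, the probe is a simple difference-of-means classifier, and every name here (`make_domain`, `fit_mean_diff_probe`, `run_demo`, the choice of axes) is a hypothetical stand-in. Each synthetic domain carries label signal along one shared axis and one domain-specific axis, so a probe fit on a single domain transfers only partially, while a jointly fit probe does better on the held-out domain.

```python
import numpy as np

def make_domain(rng, n, dim, specific_axis, noise=1.0):
    """Synthetic stand-in for activations from one truth domain (assumption)."""
    y = rng.integers(0, 2, size=n)           # 1 = true statement, 0 = false
    sign = 2.0 * y - 1.0
    signal = np.zeros(dim)
    signal[0] = 1.0                          # hypothetical domain-general axis
    signal[specific_axis] = 2.0              # hypothetical domain-specific axis
    x = sign[:, None] * signal + noise * rng.standard_normal((n, dim))
    return x, y

def fit_mean_diff_probe(x, y):
    """Linear probe: direction = class-mean difference, threshold at the midpoint."""
    mu1, mu0 = x[y == 1].mean(axis=0), x[y == 0].mean(axis=0)
    w = mu1 - mu0
    b = -w @ (mu1 + mu0) / 2.0
    return w, b

def accuracy(probe, x, y):
    w, b = probe
    return float(((x @ w + b > 0).astype(int) == y).mean())

def run_demo(seed=0, n=2000, dim=16):
    rng = np.random.default_rng(seed)
    xa, ya = make_domain(rng, n, dim, specific_axis=1)      # domain A (train)
    xb, yb = make_domain(rng, n, dim, specific_axis=2)      # domain B (train)
    xb_te, yb_te = make_domain(rng, n, dim, specific_axis=2)  # domain B (held out)

    probe_a = fit_mean_diff_probe(xa, ya)
    probe_joint = fit_mean_diff_probe(np.vstack([xa, xb]),
                                      np.concatenate([ya, yb]))
    return {
        "a_to_a": accuracy(probe_a, xa, ya),               # in-domain fit
        "a_to_b": accuracy(probe_a, xb_te, yb_te),         # pairwise transfer
        "joint_to_b": accuracy(probe_joint, xb_te, yb_te), # joint training
    }

if __name__ == "__main__":
    for name, acc in run_demo().items():
        print(f"{name}: {acc:.3f}")
```

In this toy setup the jointly trained probe puts less weight on the axis specific to domain A, which is why it scores higher on domain B than the probe trained on A alone. Real model activations need not behave this way.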
Abstract
Large language models (LLMs) have been reported to linearly encode truthfulness, yet recent work questions this finding's generality. We reconcile these views with the truthfulness spectrum hypothesis: the representational space contains directions ranging from broadly domain-general to narrowly domain-specific. To test this hypothesis, we systematically evaluate probe generalization across five truth types (definitional, empirical, logical, fictional, and ethical), sycophantic and expectation-inverted lying, and existing honesty benchmarks. Linear probes generalize well across most domains but fail on sycophantic and expectation-inverted lying. Yet training on all domains jointly recovers strong performance, confirming that domain-general directions exist despite poor pairwise transfer. The geometry of probe directions explains these patterns: Mahalanobis cosine similarity between…
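The abstract is cut off before it defines the similarity measure, so the following is only a minimal sketch of one common form of a Mahalanobis cosine similarity between two probe directions. It assumes the inner product is induced by the activation covariance matrix Σ, in which case the value equals the Pearson correlation between the two probes' readouts on the data. The paper may use a different matrix (for example Σ⁻¹) or normalization. The function name `mahalanobis_cosine` and the example covariance are illustrative.

```python
import numpy as np

def mahalanobis_cosine(u, v, cov):
    """Cosine similarity under the inner product <u, v> = u^T cov v.

    Assumption: `cov` is the covariance of the activations the probes read from,
    so the result equals the correlation between u^T x and v^T x.
    """
    num = u @ cov @ v
    den = np.sqrt((u @ cov @ u) * (v @ cov @ v))
    return float(num / den)

u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])

# With an identity covariance this reduces to ordinary cosine similarity.
print(mahalanobis_cosine(u, v, np.eye(2)))   # 0.0

# Correlated features make Euclidean-orthogonal probes similar in effect.
cov = np.array([[1.0, 0.8],
                [0.8, 1.0]])
print(mahalanobis_cosine(u, v, cov))         # 0.8
```

The second call shows why a covariance-aware measure can differ from the plain cosine: two directions that are orthogonal in weight space can still produce highly correlated predictions when the underlying features co-vary.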
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Misinformation and Its Impacts
