Probing the Geometry of Truth: Consistency and Generalization of Truth Directions in LLMs Across Logical Transformations and Question Answering Tasks

Yuntai Bao; Xuhong Zhang; Tianyu Du; Xinkui Zhao; Zhengwen Feng; Hao Peng; Jianwei Yin

arXiv:2506.00823 · cs.CL · June 3, 2025

TL;DR

This paper investigates whether large language models encode a consistent "truth direction" that can be used to assess truthfulness across tasks and logical transformations. It finds that both the consistency of this direction and how well it generalizes depend on the model.

Contribution

The paper demonstrates that truth directions are not universal and are stronger in more capable models. It also shows that probes trained on simple statements can generalize to more complex tasks and can be used to improve the trustworthiness of LLM outputs.

Findings

01. Stronger truth directions in more capable models.
02. Probes trained on atomic statements generalize well.
03. Truthfulness probes can enhance user trust in LLM outputs.

Abstract

Large language models (LLMs) are trained on extensive datasets that encapsulate substantial world knowledge. However, their outputs often include confidently stated inaccuracies. Earlier works suggest that LLMs encode truthfulness as a distinct linear feature, termed the "truth direction", which can classify truthfulness reliably. We address several open questions about the truth direction: (i) whether LLMs universally exhibit consistent truth directions; (ii) whether sophisticated probing techniques are necessary to identify truth directions; and (iii) how the truth direction generalizes across diverse contexts. Our findings reveal that not all LLMs exhibit consistent truth directions, with stronger representations observed in more capable models, particularly in the context of logical negation. Additionally, we demonstrate that truthfulness probes trained on declarative atomic…
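To make the idea of a linear "truth direction" concrete, here is a minimal sketch on synthetic data. It is not the authors' code or probing method: the activations are random vectors with a planted direction, the probe is a simple difference-in-means classifier, and every name (`fit_truth_direction`, `make_synthetic`, `planted`, `strength`) is hypothetical. The second evaluation set flips the planted direction to imitate a model whose representation of truth is not consistent across a transformation such as negation.

```python
import numpy as np


def fit_truth_direction(acts, labels):
    """Difference-in-means probe: unit vector from the false-class mean to the true-class mean."""
    mu_true = acts[labels == 1].mean(axis=0)
    mu_false = acts[labels == 0].mean(axis=0)
    direction = mu_true - mu_false
    direction /= np.linalg.norm(direction)
    midpoint = 0.5 * (mu_true + mu_false)  # decision threshold lies halfway between the class means
    return direction, midpoint


def predict(acts, direction, midpoint):
    """Label an activation as true (1) if it projects onto the positive side of the direction."""
    return ((acts - midpoint) @ direction > 0).astype(int)


def make_synthetic(rng, n, dim, planted, strength=2.0):
    """Assumption: stand-in for hidden states; Gaussian noise plus a signed shift along `planted`."""
    labels = rng.integers(0, 2, size=n)
    signs = 2 * labels - 1  # map {0, 1} to {-1, +1}
    acts = rng.normal(size=(n, dim)) + strength * signs[:, None] * planted
    return acts, labels


rng = np.random.default_rng(0)
dim = 64
planted = np.zeros(dim)
planted[0] = 1.0  # hypothetical ground-truth direction, used only to generate data

# "Train" the probe on simple statements.
train_acts, train_labels = make_synthetic(rng, 500, dim, planted)
direction, midpoint = fit_truth_direction(train_acts, train_labels)

# Case 1: held-out data that shares the same direction (consistent representation).
test_acts, test_labels = make_synthetic(rng, 500, dim, planted)
acc_consistent = (predict(test_acts, direction, midpoint) == test_labels).mean()

# Case 2: held-out data where the direction is flipped (inconsistent representation).
flip_acts, flip_labels = make_synthetic(rng, 500, dim, -planted)
acc_flipped = (predict(flip_acts, direction, midpoint) == flip_labels).mean()

print(f"accuracy, consistent direction: {acc_consistent:.3f}")
print(f"accuracy, flipped direction:    {acc_flipped:.3f}")
```

In this toy setup the probe transfers well when the held-out data shares the training direction and fails when the direction is flipped, which is the kind of gap a consistency check across logical transformations is meant to expose. With real models, the activations would come from a chosen layer of the LLM rather than from a random generator.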

Code & Models

Repositories

colored-dye/truthfulness_probe_generalization (PyTorch, official)

Taxonomy

Topics: Semantic Web and Ontologies · Natural Language Processing Techniques · Multi-Agent Systems and Negotiation