Testing the Limits of Truth Directions in LLMs
Angelos Poulis, Mark Crovella, Evimaria Terzi

TL;DR
This paper investigates how far the universality of truth directions in large language models actually extends, and finds substantial variability across layers, tasks, and instructions that challenges previous assumptions.
Contribution
It uncovers layer-dependent, task-specific, and instruction-sensitive variations in truth directions, highlighting the limited scope of their universality in LLMs.
Findings
Truth directions are highly layer-dependent.
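They depend on task type, emerging in earlier layers for factual tasks and in later layers for reasoning tasks, and their performance varies with task complexity.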
Model instructions significantly influence truth directions.
Abstract
Large language models (LLMs) have been shown to encode the truth of statements in their activation space along a linear truth direction. Previous studies have argued that these directions are universal in certain aspects, while more recent work has questioned this conclusion, drawing on limited generalization across some settings. In this work, we identify a number of limits of truth-direction universality that have not been previously understood. We first show that truth directions are highly layer-dependent, and that a full understanding of universality requires probing at many layers in the model. We then show that truth directions depend heavily on task type, emerging in earlier layers for factual tasks and in later layers for reasoning tasks; they also vary in performance across levels of task complexity. Finally, we show that model instructions dramatically affect truth directions; simple…
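To make the idea of probing for a truth direction at many layers concrete, here is a minimal sketch in Python. It is not the authors' method, which this summary does not describe: it assumes a simple difference-of-means probe, and it uses synthetic Gaussian "activations" in place of real model hidden states. The function names, the per-layer signal strengths, and the dimensions are all hypothetical and chosen only for illustration.

```python
import numpy as np


def mean_diff_direction(acts, labels):
    """Unit vector pointing from the mean 'false' activation to the mean 'true' one."""
    direction = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)
    return direction / np.linalg.norm(direction)


def fit_threshold(acts, labels, direction):
    """Midpoint between the two class means after projecting onto the direction."""
    proj = acts @ direction
    return 0.5 * (proj[labels == 1].mean() + proj[labels == 0].mean())


def probe_accuracy(acts, labels, direction, threshold):
    """Fraction of statements classified correctly by thresholding the projection."""
    preds = (acts @ direction > threshold).astype(int)
    return float((preds == labels).mean())


def synthetic_layer(rng, n, dim, signal):
    """Stand-in for one layer's activations (assumption: not real model data).

    True and false statements are shifted by +signal and -signal along axis 0.
    """
    labels = rng.integers(0, 2, size=n)
    axis = np.zeros(dim)
    axis[0] = 1.0
    acts = rng.normal(size=(n, dim)) + signal * np.outer(2 * labels - 1, axis)
    return acts, labels


rng = np.random.default_rng(0)
signals = [0.0, 0.5, 1.5, 3.0]  # hypothetical truth-signal strength per layer

layer_accuracy = []
for signal in signals:
    train_x, train_y = synthetic_layer(rng, 400, 16, signal)
    test_x, test_y = synthetic_layer(rng, 400, 16, signal)
    direction = mean_diff_direction(train_x, train_y)
    threshold = fit_threshold(train_x, train_y, direction)
    layer_accuracy.append(probe_accuracy(test_x, test_y, direction, threshold))

for layer, acc in enumerate(layer_accuracy):
    print(f"layer {layer}: held-out probe accuracy = {acc:.3f}")
```

With real models, each call to synthetic_layer would be replaced by hidden states extracted at that layer for labeled true and false statements. Probing only one layer would hide the pattern this loop exposes, where accuracy depends strongly on which layer is examined.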
