The Trilemma of Truth in Large Language Models
Germans Savcisens, Tina Eliassi-Rad

TL;DR
This paper introduces sAwMIL, a novel probing framework combining multiple-instance learning and conformal prediction to better assess the truthfulness of information encoded in large language models, revealing complex encoding patterns.
Contribution
The study identifies flaws in existing probing methods and proposes sAwMIL, a new approach that improves the reliability of truthfulness assessment in LLMs by leveraging internal activations.
Findings
Common probing methods are unreliable and sometimes worse than zero-shot prompting.
Truth and falsehood are not encoded symmetrically in LLMs.
LLMs encode a third signal, distinct from both true and false.
Abstract
The public often attributes human-like qualities to large language models (LLMs) and assumes they "know" certain things. In reality, LLMs encode information retained during training as internal probabilistic knowledge. This study examines existing methods for probing the veracity of that knowledge and identifies several flawed underlying assumptions. To address these flaws, we introduce sAwMIL (Sparse-Aware Multiple-Instance Learning), a multiclass probing framework that combines multiple-instance learning with conformal prediction. sAwMIL leverages internal activations of LLMs to classify statements as true, false, or neither. We evaluate sAwMIL across 16 open-source LLMs, including default and chat-based variants, on three new curated datasets. Our results show that (1) common probing methods fail to provide a reliable and transferable veracity direction and, in some settings, perform…
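The two ingredients named in the abstract, multiple-instance learning over internal activations and conformal prediction, can be illustrated with a small sketch. This is not the authors' sAwMIL implementation. It assumes a hypothetical linear probe (`W`, `b`) applied to per-token activations, max-pooling as the bag-level aggregation, an illustrative class order (0 = true, 1 = false, 2 = neither), and standard split conformal prediction with the nonconformity score 1 minus the probability of the correct class. All function names are made up for illustration.

```python
import numpy as np

# Assumed class order, for illustration only: 0 = true, 1 = false, 2 = neither.

def bag_scores(token_activations, W, b):
    """Score one statement, treated as a bag of token activations.

    token_activations: (n_tokens, d) hidden states for one statement.
    W: (d, n_classes), b: (n_classes,) -- a hypothetical linear probe.
    Multiple-instance assumption: the bag score for each class is the
    maximum token-level score, so one informative token is enough.
    """
    token_logits = token_activations @ W + b
    return token_logits.max(axis=0)

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split conformal calibration.

    Nonconformity score: 1 - probability assigned to the correct class.
    Returns the k-th smallest calibration score, where
    k = ceil((n + 1) * (1 - alpha)); infinity if k exceeds n.
    """
    cal_labels = np.asarray(cal_labels)
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        return np.inf  # too few calibration points: every class is kept
    return np.sort(scores)[k - 1]

def prediction_set(probs, qhat):
    """Return every class whose nonconformity score is within the threshold."""
    return [k for k in range(len(probs)) if 1.0 - probs[k] <= qhat]
```

Under this sketch, a prediction set with more than one class (or none) signals that the probe cannot commit to a single label at the chosen error level `alpha`, which is one way to read the abstract's claim of a more reliable truthfulness assessment.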
Taxonomy
Methods: Linear Layer · Support Vector Machine
