Epistemic Observability in Language Models
Tony Mason, Vaastav Anand

TL;DR
Language models tend to report their highest confidence when they are fabricating. The paper proves, under explicit formal assumptions, that a supervisor who sees only the output text cannot reliably distinguish fabrications from honest outputs, so detection requires additional computational signals.
Contribution
The paper introduces a tensor interface that exports entropy and probability distributions. These signals improve fabrication detection and give a practical map for allocating resources in verification systems.
Findings
Self-reported confidence inversely correlates with accuracy across four model families (AUC 0.28-0.36, where 0.5 is random guessing).
Per-token entropy improves detection performance (AUC 0.757).
The entropy signal generalizes across architectures (Spearman 0.762).
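To make the two quantities in these findings concrete, here is a minimal sketch of how per-token entropy and AUC are conventionally computed. It is not the paper's tensor interface or evaluation code: the function names (`token_entropy`, `mean_entropy`, `auc`) and the assumption that raw per-token logits are available are hypothetical, chosen only for illustration.

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) for one token position."""
    m = max(logits)                              # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy(per_token_logits):
    """Average entropy over all token positions of one generated answer."""
    return sum(token_entropy(l) for l in per_token_logits) / len(per_token_logits)

def auc(scores, labels):
    """Probability that a random positive example (label 1) scores higher
    than a random negative example (label 0); ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Under this definition, an AUC below 0.5 for "confidence predicts correctness" means the score ranks wrong answers above right ones more often than not, which is the sense in which the reported 0.28-0.36 range indicates an inverse relationship.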
Abstract
We find that models report highest confidence precisely when they are fabricating. Across four model families (OLMo-3, Llama-3.1, Qwen3, Mistral), self-reported confidence inversely correlates with accuracy, with AUC ranging from 0.28 to 0.36 where 0.5 is random guessing. We prove, under explicit formal assumptions, that this is not a capability gap but an observational one. Under text-only observation, where a supervisor sees only the model's output text, no monitoring system can reliably distinguish honest model outputs from plausible fabrications. We prove two results: first, that any policy conditioning only on the query cannot satisfy epistemic honesty across ambiguous world states; second, that no learning algorithm optimizing reward from a text-only supervisor can converge to honest behavior when the supervisor's observations are identical for both grounded and fabricated…
