Masked by Consensus: Disentangling Privileged Knowledge in LLM Correctness
Tomer Ashuach, Shai Gretz, Yoav Katz, Yonatan Belinkov, Liat Ein-Dor

TL;DR
This paper asks whether large language models hold privileged knowledge about the correctness of their own answers. It finds a self-representation advantage on factual knowledge tasks but not on math reasoning, and localizes that advantage within model layers.
Contribution
The paper introduces an evaluation on disagreement subsets, where models produce conflicting predictions, to isolate genuine privileged knowledge in LLMs, and it localizes the resulting domain-specific advantage across model layers.
Findings
Self-representations outperform peer representations in factual knowledge tasks.
Self-representations show no advantage in math reasoning tasks.
Privileged knowledge in factual tasks emerges from early-to-mid layers.
Abstract
Humans use introspection to evaluate their understanding through private internal states inaccessible to external observers. We investigate whether large language models possess similar privileged knowledge about answer correctness, that is, information unavailable through external observation. We train correctness classifiers on question representations from both a model's own hidden states and external models, testing whether self-representations provide a performance advantage. On standard evaluation, we find no advantage: self-probes perform comparably to peer-model probes. We hypothesize this is due to high inter-model agreement on answer correctness. To isolate genuine privileged knowledge, we evaluate on disagreement subsets, where models produce conflicting predictions. Here, we discover domain-specific privileged knowledge: self-representations consistently outperform peer…
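The disagreement-subset evaluation described in the abstract can be illustrated with a minimal sketch. It assumes that "disagreement" means the two models differ in whether they answered a question correctly, and that both probes are scored against the target model's correctness labels; the paper's exact protocol may differ. All function names and the toy values below are hypothetical and are not results from the paper.

```python
def disagreement_indices(correct_self, correct_peer):
    """Return indices where exactly one of the two models answered correctly.

    Assumption: disagreement is defined on per-question correctness labels.
    """
    return [i for i, (a, b) in enumerate(zip(correct_self, correct_peer)) if a != b]


def accuracy(preds, labels, indices):
    """Accuracy of probe predictions against labels, restricted to `indices`."""
    if not indices:
        raise ValueError("empty evaluation subset")
    hits = sum(1 for i in indices if preds[i] == labels[i])
    return hits / len(indices)


# Toy, made-up data: 1 = answered correctly, 0 = answered incorrectly.
correct_self = [1, 1, 0, 0, 1, 0]   # target model's actual correctness
correct_peer = [1, 0, 0, 1, 1, 1]   # peer model's actual correctness

# Hypothetical probe outputs, both predicting the TARGET model's correctness.
self_probe_preds = [1, 1, 0, 0, 1, 1]  # probe trained on the target's own hidden states
peer_probe_preds = [1, 0, 0, 1, 1, 1]  # probe trained on the peer's hidden states

all_idx = list(range(len(correct_self)))
subset = disagreement_indices(correct_self, correct_peer)

full_self = accuracy(self_probe_preds, correct_self, all_idx)
full_peer = accuracy(peer_probe_preds, correct_self, all_idx)
sub_self = accuracy(self_probe_preds, correct_self, subset)
sub_peer = accuracy(peer_probe_preds, correct_self, subset)

print("disagreement subset:", subset)
print("full set   self vs peer:", full_self, full_peer)
print("subset     self vs peer:", sub_self, sub_peer)
```

In this toy setup the peer probe effectively reports the peer's own correctness, so it scores well wherever the models agree and fails wherever they disagree. That is one way high inter-model agreement could mask a self-representation advantage on the full evaluation set.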
