Probing for Knowledge Attribution in Large Language Models
Ivo Brink, Alexander Boer, Dennis Ulmer

TL;DR
This paper introduces a method for identifying whether a large language model's output is based on its internal knowledge or on the user's prompt. It uses a simple probe trained on a new self-supervised dataset, with the aim of improving interpretability and trustworthiness.
Contribution
It presents AttriWiki, a self-supervised data pipeline for training probes to accurately attribute model outputs to their knowledge source, enhancing interpretability of LLMs.
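The idea of generating labelled attribution examples automatically can be illustrated with a minimal sketch. This is not the AttriWiki pipeline itself: the `AttributionExample` class, the `make_examples` helper, the prompt wording, and the label names "parametric" and "contextual" are all hypothetical, chosen only to show how one sentence with a withheld entity could yield one memory-recall example and one context-reading example.

```python
from dataclasses import dataclass
from typing import List

MASK = "____"  # hypothetical placeholder for the withheld entity


@dataclass
class AttributionExample:
    prompt: str   # text shown to the model
    answer: str   # the withheld entity the model should produce
    label: str    # "parametric" (from memory) or "contextual" (from prompt)


def make_examples(sentence: str, entity: str) -> List[AttributionExample]:
    """Build one memory-recall and one context-reading example.

    Assumption: the entity occurs verbatim in the sentence, and only its
    first occurrence is withheld.
    """
    if entity not in sentence:
        raise ValueError("entity must occur in the sentence")
    cloze = sentence.replace(entity, MASK, 1)
    question = f"Fill in the blank: {cloze}"
    # No supporting text: the model has to recall the entity from its weights.
    parametric = AttributionExample(question, entity, "parametric")
    # Supporting text included: the model can read the entity from the prompt.
    contextual = AttributionExample(
        f"Context: {sentence}\n{question}", entity, "contextual"
    )
    return [parametric, contextual]


examples = make_examples("Paris is the capital of France.", "Paris")
```

Because the label follows from how the prompt was built, no human annotation is needed in this toy setup. Whether the model actually used the intended source is a separate question that a sketch like this does not settle.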
Findings
Probes achieve up to 0.96 Macro-F1 on in-domain data.
Attribution accuracy transfers well to out-of-domain benchmarks.
Attribution mismatches significantly increase error rates.
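For readers unfamiliar with the Macro-F1 metric quoted in the findings above, the following is a small, dependency-free sketch of how it is computed: per-class F1 scores are averaged with equal weight, regardless of class frequency. The label names and the toy predictions are illustrative assumptions, not data from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels (hypothetical): which knowledge source dominated each output.
y_true = ["contextual", "contextual", "parametric", "parametric"]
y_pred = ["contextual", "parametric", "parametric", "parametric"]
score = macro_f1(y_true, y_pred)
```

Averaging per class rather than per example means a probe cannot score well simply by predicting the more common knowledge source.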
Abstract
Large language models (LLMs) often generate fluent but unfounded claims, or hallucinations, which fall into two types: (i) faithfulness violations - misusing user context - and (ii) factuality violations - errors from internal knowledge. Proper mitigation depends on knowing whether a model's answer is based on the prompt or its internal weights. This work focuses on the problem of contributive attribution: identifying the dominant knowledge source behind each output. We show that a probe, a simple linear classifier trained on model hidden representations, can reliably predict contributive attribution. For its training, we introduce AttriWiki, a self-supervised data pipeline that prompts models to recall withheld entities from memory or read them from context, generating labelled examples automatically. Probes trained on AttriWiki data reveal a strong attribution signal, achieving up to…
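The abstract describes the probe as a simple linear classifier trained on model hidden representations. The sketch below shows what such a probe looks like in the most basic case: logistic regression fitted by stochastic gradient descent. Everything concrete here is an assumption for illustration. The "hidden states" are synthetic Gaussian vectors rather than activations from a real LLM, and the dimensionality, learning rate, and label encoding (0 for parametric, 1 for contextual) are hypothetical.

```python
import math
import random


def synthetic_hidden_states(n, dim, seed=0):
    """Stand-in for LLM activations: only coordinate 0 carries the label signal."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        label = i % 2  # 0 = parametric, 1 = contextual (hypothetical encoding)
        shift = 2.0 if label == 1 else -2.0
        vec = [rng.gauss(shift if j == 0 else 0.0, 1.0) for j in range(dim)]
        data.append((vec, label))
    return data


def sigmoid(z):
    # Numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(data, dim, lr=0.1, epochs=50):
    """Fit a linear probe (weights w, bias b) with plain SGD on log loss."""
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of log loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


DIM = 8
train = synthetic_hidden_states(200, DIM, seed=0)
test = synthetic_hidden_states(100, DIM, seed=1)
w, b = train_probe(train, DIM)
accuracy = sum(predict(w, b, x) == y for x, y in test) / len(test)
```

The point of keeping the classifier linear is interpretive: if so simple a model can separate the two classes, the attribution signal is likely to be readily accessible in the representations rather than constructed by the probe.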
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Explainable Artificial Intelligence (XAI)
