Probing for Knowledge Attribution in Large Language Models

Ivo Brink; Alexander Boer; Dennis Ulmer

arXiv:2602.22787·cs.CL·February 27, 2026

Open Access

TL;DR

This paper introduces a method for identifying whether a large language model's output is based on its internal knowledge or on the user's prompt. The method uses a simple probe trained on a new self-supervised dataset, with the aim of improving interpretability and trustworthiness.

Contribution

It presents AttriWiki, a self-supervised data pipeline for training probes to accurately attribute model outputs to their knowledge source, enhancing interpretability of LLMs.
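To make the idea of automatically labelled examples concrete, here is a minimal sketch of how one passage could yield two prompts, one where the withheld entity can be read from context and one where it must be recalled from memory. This is an illustration only, not the AttriWiki pipeline itself: the names `AttributionExample` and `make_pair`, the `[MASK]` placeholder, and the prompt wording are all assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AttributionExample:
    prompt: str   # text shown to the model
    answer: str   # the withheld entity the model should produce
    source: str   # assumed label set: "context" or "memory"


def make_pair(passage: str, entity: str, mask: str = "[MASK]") -> List[AttributionExample]:
    """Build one context-labelled and one memory-labelled prompt (hypothetical format)."""
    if entity not in passage:
        raise ValueError("entity must occur in the passage")
    # Withhold the entity everywhere it appears in the question text.
    question = f"Fill in {mask}: {passage.replace(entity, mask)}"
    # Context variant: the unmasked passage is supplied, so the answer can be read.
    with_context = f"Context: {passage}\n{question}"
    return [
        AttributionExample(prompt=with_context, answer=entity, source="context"),
        # Memory variant: no supporting passage, so the answer must be recalled.
        AttributionExample(prompt=question, answer=entity, source="memory"),
    ]
```

Because the label follows from how the prompt was built, no human annotation is needed in this sketch, which is the sense in which such a pipeline is self-supervised.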

Findings

01. Probes achieve up to 0.96 Macro-F1 on in-domain data.

02. High transferability of attribution accuracy to out-of-domain benchmarks.

03. Attribution mismatches significantly increase error rates.
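For readers unfamiliar with the metric, Macro-F1 is the unweighted mean of the per-class F1 scores, so each knowledge source counts equally regardless of how many examples it has. The sketch below computes it from scratch; the function name `macro_f1` and the label strings are illustrative assumptions, not taken from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

With true labels `["context", "context", "memory", "memory"]` and predictions `["context", "memory", "memory", "memory"]`, the per-class scores are 2/3 and 4/5, giving a Macro-F1 of 11/15.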

Abstract

Large language models (LLMs) often generate fluent but unfounded claims, or hallucinations, which fall into two types: (i) faithfulness violations, where the model misuses user context, and (ii) factuality violations, where errors stem from internal knowledge. Proper mitigation depends on knowing whether a model's answer is based on the prompt or its internal weights. This work focuses on the problem of contributive attribution: identifying the dominant knowledge source behind each output. We show that a probe, a simple linear classifier trained on model hidden representations, can reliably predict contributive attribution. For its training, we introduce AttriWiki, a self-supervised data pipeline that prompts models to recall withheld entities from memory or read them from context, generating labelled examples automatically. Probes trained on AttriWiki data reveal a strong attribution signal, achieving up to…
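The probe described in the abstract is a linear classifier on hidden representations. The following is a minimal sketch of that idea using logistic regression trained by gradient descent in NumPy. It is not the authors' implementation: the hidden states are synthetic stand-ins rather than real LLM activations, and `make_synthetic_states`, `train_probe`, and `predict`, along with every hyperparameter, are assumptions made for illustration.

```python
import numpy as np


def make_synthetic_states(n=400, dim=16, seed=0):
    """Synthetic stand-in for hidden states: two classes shifted along one direction."""
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    y = rng.integers(0, 2, size=n)          # 0 = "memory", 1 = "context" (assumed labels)
    signs = (2 * y - 1)[:, None]            # map {0, 1} to {-1, +1}
    X = rng.normal(size=(n, dim)) + 3.0 * signs * direction
    return X, y


def train_probe(X, y, lr=0.1, steps=500):
    """Fit a linear probe (logistic regression) with full-batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        logits = np.clip(X @ w + b, -30.0, 30.0)   # clip to avoid overflow in exp
        p = 1.0 / (1.0 + np.exp(-logits))
        err = p - y                                # gradient of the log loss w.r.t. logits
        w -= lr * (X.T @ err) / len(y)
        b -= lr * err.mean()
    return w, b


def predict(X, w, b):
    """Return the predicted class (0 or 1) for each row of X."""
    return (X @ w + b > 0).astype(int)


X, y = make_synthetic_states()
w, b = train_probe(X[:300], y[:300])               # train on the first 300 examples
accuracy = (predict(X[300:], w, b) == y[300:]).mean()
```

In a real setting `X` would hold hidden states extracted from a chosen layer of the model and `y` the source labels produced by the data pipeline. A linear probe is deliberately low-capacity, so high accuracy suggests the attribution signal is linearly readable from the representations rather than learned by the probe itself.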

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Explainable Artificial Intelligence (XAI)