DataDignity: Training Data Attribution for Large Language Models
Xiaomin Li, Andrzej Banburski-Fahey, Jaron Lanier

TL;DR
This paper introduces FakeWiki, a controlled benchmark for evaluating provenance attribution in language models, and proposes two methods, ScoringModel and SteerFuse. ScoringModel raises mean Recall@10 from 35.0 to 52.2 across models.
Contribution
The paper presents FakeWiki, a new controlled benchmark for provenance attribution, and develops ScoringModel, a supervised contrastive ranker that outperforms the retrieval baselines at identifying source documents.
Findings
ScoringModel improves mean Recall@10 from 35.0 to 52.2 across models.
SteerFuse, a training-free method, performs competitively as a complement to retrieval.
ScoringModel enhances performance on jailbreak-inspired queries by 15.7 points.
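For readers unfamiliar with the metric behind these numbers, here is a minimal sketch of how mean Recall@k is commonly computed. It assumes each query has a set of gold source-document IDs and a ranked list of candidate IDs. The function names and the 0-100 scaling are illustrative assumptions, not the paper's evaluation code.

```python
from typing import Dict, List, Set


def recall_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int = 10) -> float:
    """Fraction of the relevant documents that appear in the top-k ranked IDs."""
    if not relevant_ids:
        raise ValueError("relevant_ids must be non-empty")
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def mean_recall_at_k(
    rankings: Dict[str, List[str]],
    gold: Dict[str, Set[str]],
    k: int = 10,
) -> float:
    """Average Recall@k over all queries in `gold`, reported on a 0-100 scale.

    The 0-100 scale is an assumption chosen to match how scores such as
    35.0 and 52.2 are usually reported.
    """
    scores = [recall_at_k(rankings[q], gold[q], k) for q in gold]
    return 100.0 * sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical toy data: two queries with one gold document each.
    rankings = {"q1": ["d3", "d1", "d2"], "q2": ["d9", "d8", "d7"]}
    gold = {"q1": {"d1"}, "q2": {"d5"}}
    print(mean_recall_at_k(rankings, gold, k=2))
```

With a single gold document per query, Recall@k reduces to a hit rate: the share of queries whose source document lands in the top k.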
Abstract
Auditing language-model outputs often requires more than judging correctness: an auditor may need to identify which source document most likely supports the knowledge expressed in a response. We study this as pinpoint provenance: given a prompt, a target-model response, and a candidate corpus, rank the documents that best support the response. We introduce FakeWiki, a controlled benchmark of 3,537 fabricated Wikipedia-style articles designed to preserve ground-truth provenance while weakening lexical shortcuts. FakeWiki includes QA probes, source-preserving paraphrases, retro-generated variants, hard anti-documents that remain topically similar while removing answer-critical facts, and five query conditions: clean prompting plus four jailbreak-inspired transformations. We evaluate seven retrieval baselines, a training-free activation-steering retrieval-fusion method, SteerFuse, and a…
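The pinpoint-provenance task described above can be illustrated with a deliberately simple ranker. The sketch below scores each candidate document by bag-of-words cosine similarity to the prompt and response. This is an assumed lexical baseline for illustration only: it is not ScoringModel or SteerFuse, it is not necessarily one of the paper's seven baselines, and the function names and example texts are made up. It is also exactly the kind of lexical shortcut that FakeWiki's paraphrases and anti-documents are designed to weaken.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def _bag_of_words(text: str) -> Counter:
    """Lowercase the text and count alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(
    prompt: str, response: str, corpus: Dict[str, str]
) -> List[Tuple[str, float]]:
    """Return (doc_id, score) pairs sorted from most to least similar."""
    query = _bag_of_words(prompt + " " + response)
    scored = [(doc_id, _cosine(query, _bag_of_words(text)))
              for doc_id, text in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical fabricated-article snippets, in the spirit of the benchmark.
    corpus = {
        "a": "The Varnholt bridge opened in 1931.",
        "b": "Recipes for lentil soup.",
    }
    ranking = rank_candidates(
        "When did the Varnholt bridge open?", "It opened in 1931.", corpus
    )
    print([doc_id for doc_id, _ in ranking])
```

A topically similar anti-document that omits the answer-critical fact would still score highly under this ranker, which is the failure mode that motivates a learned scorer.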
