Tracing the Data Trail: A Survey of Data Provenance, Transparency and Traceability in LLMs

Richard Hohensinner; Belgin Mutlu; Inti Gabriel Mendoza Estrada; Matej Vukovic; Simone Kopeinik; Roman Kern

arXiv:2601.14311·cs.CR·January 22, 2026

Open Access

TL;DR

This survey reviews a decade of research on data provenance, transparency, and traceability in large language models, highlighting methodologies, challenges, and a new taxonomy for the field.

Contribution

The survey introduces a comprehensive taxonomy of data provenance and transparency in LLMs, synthesizing 95 publications and identifying key methodologies and trade-offs.

Findings

01. Key methodologies include data watermarking and bias measurement

02. Trade-offs exist between transparency and data privacy

03. A taxonomy for data provenance in LLMs is proposed

Abstract

Large language models (LLMs) are deployed at scale, yet their training data life cycle remains opaque. This survey synthesizes research from the past ten years on three tightly coupled axes: (1) data provenance, (2) transparency, and (3) traceability, and three supporting pillars: (4) bias & uncertainty, (5) data privacy, and (6) tools and techniques that operationalize them. A central contribution is a proposed taxonomy defining the field's domains and listing corresponding artifacts. Through analysis of 95 publications, this work identifies key methodologies concerning data generation, watermarking, bias measurement, data curation, data privacy, and the inherent trade-off between transparency and opacity.

Taxonomy

Topics: Scientific Computing and Data Management · Research Data Management Practices · Machine Learning in Materials Science