Tracing the Roots: A Multi-Agent Framework for Uncovering Data Lineage in Post-Training LLMs

Yu Li; Xiaoran Shang; Qizhi Pei; Yun Zhu; Xin Gao; Honglin Lin; Zhanping Zhong; Zhuoshi Pan; Zheng Liu; Xiaoyang Wang; Conghui He; Dahua Lin; Feng Zhao; Lijun Wu

arXiv:2604.10480·cs.AI·April 14, 2026

TL;DR

This paper introduces an automated multi-agent framework that reconstructs the evolutionary graph of post-training datasets for LLMs, exposing systemic issues such as structural redundancy and benchmark contamination and enabling more diverse data curation.

Contribution

It introduces the concept of data lineage to the LLM ecosystem and proposes an automated multi-agent method to reconstruct dataset lineage graphs, which is then used to uncover systemic issues and improve data diversity in post-training data.

Findings

01

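Identified domain-specific structural patterns in data lineage, such as vertical refinement in math-oriented datasets and horizontal aggregation in general-domain corpora.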

02

Discovered pervasive systemic issues: structural redundancy induced by implicit dataset intersections, and benchmark contamination that propagates along lineage paths.

03

Created a lineage-aware dataset to enhance diversity.
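
The findings above treat datasets as nodes in a directed graph whose edges record which dataset was derived from which. The following is a minimal sketch of that idea, not the authors' framework: the dataset names, the `LINEAGE` dictionary, and the helpers `downstream` and `shared_ancestors` are all hypothetical and exist only to illustrate how contamination could reach descendants and how two datasets could overlap implicitly through common ancestors.

```python
from collections import deque

# Hypothetical toy lineage (parent -> datasets derived from it).
# All names are invented for illustration.
LINEAGE = {
    "seed_math": ["refined_math_v1"],
    "refined_math_v1": ["refined_math_v2"],
    "refined_math_v2": ["general_mix"],
    "seed_chat": ["general_mix"],
    "seed_code": ["general_mix"],
    "general_mix": [],
}


def downstream(graph, source):
    """Return every dataset reachable from `source` along derivation edges."""
    seen = set()
    queue = deque(graph.get(source, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(graph.get(node, []))
    return seen


def ancestors(graph, target):
    """Return every dataset that `target` was directly or indirectly derived from."""
    return {node for node in graph if target in downstream(graph, node)}


def shared_ancestors(graph, a, b):
    """Return the datasets that both `a` and `b` descend from."""
    return ancestors(graph, a) & ancestors(graph, b)


# If refined_math_v1 were contaminated, these descendants would inherit the risk.
at_risk = downstream(LINEAGE, "refined_math_v1")

# Common ancestors hint at overlap that the dataset names alone do not reveal.
overlap = shared_ancestors(LINEAGE, "refined_math_v2", "general_mix")
```

In this toy graph the math chain is a single-parent path (a vertical pattern), while `general_mix` merges several parents (a horizontal pattern). A real lineage graph would also need edge metadata, such as the kind of transformation applied, which this sketch leaves out.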

Abstract

Post-training data plays a pivotal role in shaping the capabilities of Large Language Models (LLMs), yet datasets are often treated as isolated artifacts, overlooking the systemic connections that underlie their evolution. To disentangle these complex relationships, we introduce the concept of data lineage to the LLM ecosystem and propose an automated multi-agent framework to reconstruct the evolutionary graph of dataset development. Through large-scale lineage analysis, we characterize domain-specific structural patterns, such as vertical refinement in math-oriented datasets and horizontal aggregation in general-domain corpora. Moreover, we uncover pervasive systemic issues, including structural redundancy induced by implicit dataset intersections and the propagation of benchmark contamination along lineage paths. To demonstrate the practical value of lineage…

Code & Models

Repositories

leey21/data-lineage (GitHub)
