TL;DR
This paper introduces an automated multi-agent framework that reconstructs the evolution graph of LLM post-training datasets, revealing systemic issues in how datasets derive from one another and enabling more diverse data curation.
Contribution
It proposes an automated method to reconstruct dataset lineage graphs, uncovering systemic issues and improving data diversity in LLM training.
Findings
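Identified domain-specific structural patterns in data lineage: vertical refinement in math-oriented datasets and horizontal aggregation in general-domain corpora.
Discovered systemic issues: structural redundancy caused by implicit dataset intersections, and benchmark contamination that propagates along lineage paths.
Created a lineage-aware dataset intended to improve data diversity.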
Abstract
Post-training data plays a pivotal role in shaping the capabilities of Large Language Models (LLMs), yet datasets are often treated as isolated artifacts, overlooking the systemic connections that underlie their evolution. To disentangle these complex relationships, we introduce the concept of data lineage to the LLM ecosystem and propose an automated multi-agent framework to reconstruct the evolutionary graph of dataset development. Through large-scale lineage analysis, we characterize domain-specific structural patterns, such as vertical refinement in math-oriented datasets and horizontal aggregation in general-domain corpora. Moreover, we uncover pervasive systemic issues, including structural redundancy induced by implicit dataset intersections and the propagation of benchmark contamination along lineage paths. To demonstrate the practical value of lineage…
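To make the two systemic issues concrete, here is a minimal sketch of how a lineage graph could be queried. It is not the paper's framework: the graph, the dataset names, and the helper functions are all hypothetical, and the only assumption is that lineage can be stored as a directed graph from a parent dataset to the datasets derived from it.

```python
from collections import deque

# Hypothetical lineage graph: parent dataset -> datasets derived from it.
# Every dataset name here is invented for illustration.
LINEAGE = {
    "raw_math": ["math_filtered"],
    "math_filtered": ["math_cot"],
    "web_qa": ["general_mix"],
    "math_cot": ["general_mix"],
    "general_mix": [],
}


def descendants(graph, source):
    """Return every node reachable from `source` by following edges."""
    seen = set()
    queue = deque(graph.get(source, []))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(graph.get(node, []))
    return seen


def propagate_contamination(graph, contaminated):
    """Flag the contaminated datasets plus everything derived from them."""
    flagged = set(contaminated)
    for source in contaminated:
        flagged |= descendants(graph, source)
    return flagged


def ancestors(graph, target):
    """Return every dataset that `target` was directly or indirectly derived from."""
    reverse = {}
    for parent, children in graph.items():
        for child in children:
            reverse.setdefault(child, []).append(parent)
    # Ancestors are the descendants of `target` in the reversed graph.
    return descendants(reverse, target)


def shared_ancestors(graph, a, b):
    """Common upstream sources of two datasets: a hint of implicit overlap."""
    return ancestors(graph, a) & ancestors(graph, b)
```

In this toy graph, flagging `math_filtered` as contaminated also flags `math_cot` and `general_mix`, which mirrors the vertical path from a refined math dataset into an aggregated general-domain mix. A non-empty `shared_ancestors` result only suggests that two datasets may overlap; confirming actual redundancy would still require comparing their samples.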
