Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units
Jianhui Chen, Yuzhang Luo, Liangming Pan

TL;DR
This paper introduces a scalable influence-based framework to trace interpretable units in large language models back to specific training data, demonstrating causal effects of data interventions on model interpretability and capabilities.
Contribution
We propose Mechanistic Data Attribution (MDA), an influence-function-based method for identifying and manipulating the training samples that shape interpretable model circuits.
Findings
Targeted data removal modulates interpretable head emergence.
Repetitive structural data acts as a mechanistic catalyst.
Interventions on induction heads affect in-context learning.
Abstract
While Mechanistic Interpretability has identified interpretable circuits in LLMs, their causal origins in training data remain elusive. We introduce Mechanistic Data Attribution (MDA), a scalable framework that employs Influence Functions to trace interpretable units back to specific training samples. Through extensive experiments on the Pythia family, we causally validate that targeted intervention (removing or augmenting a small fraction of high-influence samples) significantly modulates the emergence of interpretable heads, whereas random interventions show no effect. Our analysis reveals that repetitive structural data (e.g., LaTeX, XML) acts as a mechanistic catalyst. Furthermore, we observe that interventions targeting induction head formation induce a concurrent change in the model's in-context learning (ICL) capability. This provides direct causal evidence for the long-standing…
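The abstract does not say which influence estimator MDA uses, so the following is only a toy sketch of the classical influence-function score, -∇f(θ)ᵀ H⁻¹ ∇L(z, θ), on a two-parameter ridge regression. Everything in it is an assumption made for illustration: the data, the function names, and `unit_metric`, a hypothetical scalar standing in for the score of an interpretable unit. At LLM scale the Hessian cannot be solved exactly as it is here and would need some approximation.

```python
# Toy sketch (not the paper's implementation): classical influence scores
# for a 2-parameter ridge regression, using only the standard library.

def solve_2x2(a, b):
    """Solve the 2x2 linear system a @ x = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [
        (b[0] * a[1][1] - a[0][1] * b[1]) / det,
        (a[0][0] * b[1] - a[1][0] * b[0]) / det,
    ]

def hessian(xs, lam, n):
    """Hessian of (1/n) * sum_i 0.5*(w.x_i - y_i)^2 + 0.5*lam*|w|^2."""
    h = [[lam, 0.0], [0.0, lam]]
    for x in xs:
        for r in range(2):
            for c in range(2):
                h[r][c] += x[r] * x[c] / n
    return h

def fit(xs, ys, lam, n):
    """Closed-form minimiser; n stays explicit so a sample can be dropped
    without renormalising the remaining terms."""
    b = [sum(x[k] * y for x, y in zip(xs, ys)) / n for k in range(2)]
    return solve_2x2(hessian(xs, lam, n), b)

def unit_metric(w):
    """Hypothetical stand-in for a scalar 'interpretable unit' score."""
    return w[0]

def influence_scores(xs, ys, lam):
    """Influence of upweighting each sample on unit_metric:
    -grad_f^T H^{-1} grad_L(z_i)."""
    n = len(xs)
    w = fit(xs, ys, lam, n)
    grad_f = [1.0, 0.0]  # gradient of unit_metric with respect to w
    v = solve_2x2(hessian(xs, lam, n), grad_f)  # H^{-1} grad_f (H is symmetric)
    scores = []
    for x, y in zip(xs, ys):
        resid = w[0] * x[0] + w[1] * x[1] - y
        g = [resid * x[0], resid * x[1]]  # per-sample loss gradient
        scores.append(-(v[0] * g[0] + v[1] * g[1]))
    return scores

# Made-up training set; the fourth sample is a deliberate outlier.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [1.0, 2.0]]
ys = [1.0, 2.0, 3.0, 9.0, 5.0]
lam = 0.1

scores = influence_scores(xs, ys, lam)
# Rank samples by absolute influence, most influential first.
ranking = sorted(range(len(xs)), key=lambda i: abs(scores[i]), reverse=True)
print(ranking)
```

Removing sample `i` corresponds to upweighting it by `-1/n`, so the predicted change in `unit_metric` is `-scores[i] / n`. In this quadratic toy problem that first-order prediction has the same sign as the true leave-one-out change and never exceeds it in magnitude, which is the kind of check a removal intervention would test empirically.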
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Software Engineering Research · Advanced Graph Neural Networks
