Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units

Jianhui Chen; Yuzhang Luo; Liangming Pan

arXiv:2601.21996 · cs.CL · January 30, 2026

TL;DR

This paper introduces a scalable influence-function-based framework that traces interpretable units in large language models back to specific training samples, and shows that intervening on those samples causally affects both the emergence of interpretable heads and the model's in-context learning capability.

Contribution

We propose Mechanistic Data Attribution (MDA), a novel influence function-based method to identify and manipulate training samples that shape interpretable model circuits.

Findings

01. Targeted data removal modulates interpretable head emergence.

02. Repetitive structural data acts as a mechanistic catalyst.

03. Interventions on induction heads affect in-context learning.

Abstract

While Mechanistic Interpretability has identified interpretable circuits in LLMs, their causal origins in training data remain elusive. We introduce Mechanistic Data Attribution (MDA), a scalable framework that employs Influence Functions to trace interpretable units back to specific training samples. Through extensive experiments on the Pythia family, we causally validate that targeted intervention (removing or augmenting a small fraction of high-influence samples) significantly modulates the emergence of interpretable heads, whereas random interventions show no effect. Our analysis reveals that repetitive structural data (e.g., LaTeX, XML) acts as a mechanistic catalyst. Furthermore, we observe that interventions targeting induction head formation induce a concurrent change in the model's in-context learning (ICL) capability. This provides direct causal evidence for the long-standing…
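To make the idea of influence-based attribution concrete, here is a minimal toy sketch of a classical influence-function score. It is not the paper's implementation: it assumes a small linear regression model with squared loss so that the Hessian can be formed exactly, whereas an LLM-scale method would need an approximation that this page does not describe. The function names (`influence_scores`, `top_k_influential`), the `damping` parameter, and the use of a generic target gradient `grad_target` as a stand-in for "the gradient of some interpretable-unit metric" are all assumptions made for illustration.

```python
import numpy as np


def influence_scores(X, y, theta, grad_target, damping=1e-3):
    """Toy influence-function scores for linear regression (illustrative only).

    Scores each training sample i by  -grad_target^T H^{-1} grad_L_i,
    where L_i = 0.5 * (x_i . theta - y_i)^2 and H is the (damped) Hessian
    of the mean training loss. grad_target is the gradient of whatever
    scalar quantity we want to attribute (a hypothetical unit metric).
    """
    n, d = X.shape
    residuals = X @ theta - y                    # shape (n,)
    per_sample_grads = residuals[:, None] * X    # shape (n, d)
    hessian = X.T @ X / n + damping * np.eye(d)  # shape (d, d), symmetric
    # Solve H v = grad_target instead of inverting H explicitly.
    ihvp = np.linalg.solve(hessian, grad_target)
    return -per_sample_grads @ ihvp              # shape (n,)


def top_k_influential(scores, k):
    """Indices of the k samples with the largest absolute influence."""
    return np.argsort(-np.abs(scores))[:k]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)
    theta = np.zeros(5)                          # untrained parameters
    grad_target = rng.normal(size=5)             # hypothetical metric gradient
    scores = influence_scores(X, y, theta, grad_target)
    print(top_k_influential(scores, k=5))
```

In this toy setting, a targeted intervention of the kind the abstract describes would correspond to dropping or duplicating the rows returned by `top_k_influential` before retraining, with a random subset of the same size as the control.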

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Software Engineering Research · Advanced Graph Neural Networks