Navigating by Old Maps: The Pitfalls of Static Mechanistic Localization in LLM Post-Training
Hang Chen, Jiaying Zhu, Hongyang Chen, Hongxu Liu, Xinyu Yang, Wenya Wang

TL;DR
This paper examines the limits of static mechanistic localization in LLM post-training, showing that circuits keep evolving during supervised fine-tuning, so mechanisms located from a static parameter snapshot are insufficient for guiding future updates.
Contribution
It introduces three metrics (Circuit Distance, Circuit Stability, and Circuit Conflict) to analyze circuit evolution and demonstrates the need for predictive, foresight-based approaches in mechanistic interpretability.
Findings
Circuits exhibit 'Free Evolution' during parameter updates.
Static mechanisms suffer from temporal latency and are inadequate for future guidance.
A predictive framework for circuit evolution is proposed.
Abstract
The "Locate-then-Update" paradigm has become a predominant approach in the post-training of large language models (LLMs), identifying critical components via mechanistic interpretability for targeted parameter updates. However, this paradigm rests on a fundamental yet unverified assumption: can mechanisms derived from current static parameters reliably guide future dynamic parameter updates? To investigate this, we systematically track the structural evolution of Transformer circuits throughout the supervised fine-tuning (SFT) process, revealing the underlying dynamics of task mechanisms. We introduce three novel metrics (Circuit Distance, Circuit Stability, and Circuit Conflict) to analyze circuit evolution across three dimensions: neural migration, semantic stability, and cross-task interference. Our empirical results reveal that circuits inherently exhibit "Free Evolution" during…
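To make the concern about static localization concrete, here is a minimal toy sketch in Python. It is not the paper's method and does not reproduce its Circuit Distance metric, whose definition is not given in this summary. It assumes a circuit can be approximated as the top-k components by some importance score, and it uses a plain Jaccard distance as a stand-in for circuit drift between two checkpoints. The head names, scores, and helper functions are all hypothetical.

```python
def top_k_components(scores, k):
    """Return the set of the k component names with the highest score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def jaccard_distance(a, b):
    """Jaccard distance between two sets: 0.0 if identical, 1.0 if disjoint."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical importance scores per attention head at two checkpoints
# (illustrative numbers only, not results from the paper).
scores_before = {"L0H1": 0.9, "L1H3": 0.8, "L2H0": 0.7, "L2H5": 0.2, "L3H2": 0.1}
scores_after = {"L0H1": 0.85, "L1H3": 0.3, "L2H0": 0.1, "L2H5": 0.75, "L3H2": 0.6}

# "Locate" step performed on the old checkpoint versus the new one.
circuit_before = top_k_components(scores_before, k=3)
circuit_after = top_k_components(scores_after, k=3)

# Assumed stand-in for circuit drift; not the paper's Circuit Distance.
drift = jaccard_distance(circuit_before, circuit_after)

print("located before:", sorted(circuit_before))
print("located after: ", sorted(circuit_after))
print("drift:", drift)
```

In this toy setting, only one of the three heads located on the old checkpoint is still in the top three afterwards. An update plan restricted to the old set would therefore mostly target components that are no longer central, which is the kind of temporal latency the findings describe.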
