Mechanistic?
Naomi Saphra, Sarah Wiegreffe

TL;DR
This paper clarifies the varied meanings of 'mechanistic interpretability' in neural model research, exploring its technical, cultural, and historical contexts to address semantic confusion and community divides.
Contribution
It provides a comprehensive analysis of the different definitions and cultural interpretations of 'mechanistic' in interpretability research, highlighting community dynamics.
Findings
Identifies four uses of 'mechanistic' in interpretability
Traces the history of NLP and interpretability communities
Explains the semantic drift and community divide
Abstract
The rise of the term "mechanistic interpretability" has accompanied increasing interest in understanding neural models -- particularly language models. However, this jargon has also led to a fair amount of confusion. So, what does it mean to be "mechanistic"? We describe four uses of the term in interpretability research. The most narrow technical definition requires a claim of causality, while a broader technical definition allows for any exploration of a model's internals. However, the term also has a narrow cultural definition describing a cultural movement. To understand this semantic drift, we present a history of the NLP interpretability community and the formation of the separate, parallel "mechanistic" interpretability community. Finally, we discuss the broad cultural definition -- encompassing the entire field of interpretability -- and why the traditional NLP interpretability…
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning
