Mechanistic?

Naomi Saphra; Sarah Wiegreffe

arXiv:2410.09087 · cs.AI · October 15, 2024 · 2 cites

Open Access

TL;DR

This paper clarifies the varied meanings of 'mechanistic interpretability' in neural model research, exploring its technical, cultural, and historical contexts to address semantic confusion and community divides.

Contribution

It provides a comprehensive analysis of the different definitions and cultural interpretations of 'mechanistic' in interpretability research, highlighting community dynamics.

Findings

01. Identifies four uses of 'mechanistic' in interpretability

02. Traces the history of NLP and interpretability communities

03. Explains the semantic drift and community divide

Abstract

The rise of the term "mechanistic interpretability" has accompanied increasing interest in understanding neural models -- particularly language models. However, this jargon has also led to a fair amount of confusion. So, what does it mean to be "mechanistic"? We describe four uses of the term in interpretability research. The most narrow technical definition requires a claim of causality, while a broader technical definition allows for any exploration of a model's internals. However, the term also has a narrow cultural definition describing a cultural movement. To understand this semantic drift, we present a history of the NLP interpretability community and the formation of the separate, parallel "mechanistic" interpretability community. Finally, we discuss the broad cultural definition -- encompassing the entire field of interpretability -- and why the traditional NLP interpretability…

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning