Position: Mechanistic Interpretability Must Disclose Identification Assumptions for Causal Claims
Zezheng Lin, Fengming Liu

TL;DR
Mechanistic interpretability papers often make causal claims without disclosing the identification assumptions those claims depend on, which risks treating validation metrics as causal evidence.
Contribution
The paper documents the absence of dedicated identification-assumptions sections in interpretability papers and proposes a norm of explicit disclosure for causal claims.
Findings
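None of the 10 audited papers includes a dedicated identification-assumptions section.
Validation metrics (faithfulness, completeness, monosemanticity, alignment, ablation effects) are reported as causal support without stating the assumptions that make them identifying.
A two-human-coder audit reproduces the direction of both findings, though exact Dim B/D counts are sensitive to coding rules.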
Abstract
Mechanistic interpretability papers increasingly use causal vocabulary: circuits, mediators, causal abstraction, monosemanticity. Such claims require explicit identification assumptions. A purposive audit of 10 papers across four methodological strands finds no dedicated identification-assumptions section and a recurring pattern: validation metrics such as faithfulness, completeness, monosemanticity, alignment, or ablation effects are reported as causal support without stating the assumptions that make them identifying. A two-human-coder audit reproduces the direction of the main finding: dedicated identification sections are absent, and validation-metric substitution is common, though exact Dim B/D counts are coding-rule sensitive. The paper proposes a disclosure norm: state whether the claim is causal, name the identification strategy, enumerate assumptions, stress at least…
