Causality is Key for Interpretability Claims to Generalise
Shruti Joshi, Aaron Mueller, David Klindt, Wieland Brendel, Patrik Reizinger, Dhanya Sridhar

TL;DR
This paper emphasizes the importance of causal inference in interpretability studies of large language models, proposing a framework that clarifies what claims can be supported and how to ensure their generalizability.
Contribution
It introduces a causal hierarchy-based diagnostic framework for interpretability, guiding practitioners in selecting appropriate methods and evaluations for valid, generalizable claims.
Findings
Causal inference clarifies what interpretability claims can justify.
Interventions support claims about average causal effects on model behaviour.
Counterfactual claims require controlled supervision for verification.
Abstract
Interpretability research on large language models (LLMs) has yielded important insights into model behaviour, yet recurring pitfalls persist: findings that do not generalise, and causal interpretations that outrun the evidence. Our position is that causal inference specifies what constitutes a valid mapping from model activations to invariant high-level structures, the data or assumptions needed to achieve it, and the inferences it can support. Specifically, Pearl's causal hierarchy clarifies what an interpretability study can justify. Observations establish associations between model behaviour and internal components. Interventions (e.g., ablations or activation patching) support claims about how these edits affect a behavioural metric (e.g., average change in token probabilities) over a set of prompts. However, counterfactual claims -- i.e., asking what the model output would have been for…
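As a rough illustration of the interventional claim described in the abstract, the sketch below mean-ablates one hidden unit of a toy two-layer network and reports the average change in a target token's probability over a set of inputs. Everything here is an assumption made for illustration: the network, the random "prompts", and the names `forward` and `average_ablation_effect` are hypothetical and are not taken from the paper. The quantity computed is an average effect over the input set, not a statement about what the output would have been for any single input.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(x, W1, W2, ablate_unit=None, ablate_value=0.0):
    """Toy two-layer network (hypothetical stand-in for an LLM).

    If `ablate_unit` is given, that hidden unit is overwritten with
    `ablate_value` for every input, which is the intervention.
    """
    h = np.tanh(x @ W1)
    if ablate_unit is not None:
        h = h.copy()
        h[:, ablate_unit] = ablate_value
    return softmax(h @ W2)

def average_ablation_effect(X, W1, W2, unit, target):
    """Mean change in the target token's probability under mean-ablation of `unit`."""
    h_mean = np.tanh(X @ W1)[:, unit].mean()  # replacement value: mean activation
    p_clean = forward(X, W1, W2)[:, target]
    p_ablated = forward(X, W1, W2, ablate_unit=unit, ablate_value=h_mean)[:, target]
    return float((p_ablated - p_clean).mean())

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))   # 32 stand-in "prompts" with 4 features each
W1 = rng.normal(size=(4, 3))   # input -> 3 hidden units
W2 = rng.normal(size=(3, 5))   # hidden -> 5 output "tokens"
W2[2, :] = 0.0                 # by construction, unit 2 has no path to the output

effects = [average_ablation_effect(X, W1, W2, unit=u, target=0) for u in range(3)]
print(effects)
```

Because hidden unit 2 is disconnected from the output by construction, its measured effect is zero, while the other units generally show a nonzero average change. A result like this supports a claim about the average effect of the edit over these inputs only.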
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Machine Learning in Healthcare
