Causality is Key for Interpretability Claims to Generalise

Shruti Joshi; Aaron Mueller; David Klindt; Wieland Brendel; Patrik Reizinger; Dhanya Sridhar

arXiv:2602.16698 · cs.LG · March 20, 2026

Open Access

TL;DR

This paper emphasizes the importance of causal inference in interpretability studies of large language models, proposing a framework that clarifies what claims can be supported and how to ensure their generalizability.

Contribution

It introduces a causal hierarchy-based diagnostic framework for interpretability, guiding practitioners in selecting appropriate methods and evaluations for valid, generalizable claims.

Findings

01

Causal inference clarifies what interpretability claims can justify.

02

Interventions (e.g., ablations or activation patching) support claims about how edits affect a behavioural metric over a set of prompts.

03

Counterfactual claims require controlled supervision for verification.

Abstract

Interpretability research on large language models (LLMs) has yielded important insights into model behaviour, yet recurring pitfalls persist: findings that do not generalise, and causal interpretations that outrun the evidence. Our position is that causal inference specifies what constitutes a valid mapping from model activations to invariant high-level structures, the data or assumptions needed to achieve it, and the inferences it can support. Specifically, Pearl's causal hierarchy clarifies what an interpretability study can justify. Observations establish associations between model behaviour and internal components. Interventions (e.g., ablations or activation patching) support claims about how these edits affect a behavioural metric (e.g., average change in token probabilities) over a set of prompts. However, counterfactual claims -- i.e., asking what the model output would have been for…
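As a rough illustration of the interventional claim described in the abstract, the sketch below zero-ablates one hidden unit in a toy two-layer network and reports the average change in a target token's probability over a set of prompts. Everything here is an assumption made for illustration and is not taken from the paper: the toy network stands in for an LLM, random vectors stand in for prompts, and the names `forward`, `average_ablation_effect`, `ablate_unit`, and `target_token` are hypothetical.

```python
import numpy as np


def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def forward(x, W1, W2, ablate_unit=None):
    # Toy two-layer network (a stand-in for an LLM, purely illustrative).
    # If ablate_unit is given, that hidden unit is set to zero (zero-ablation).
    h = np.tanh(x @ W1)
    if ablate_unit is not None:
        h = h.copy()
        h[:, ablate_unit] = 0.0
    return softmax(h @ W2)


def average_ablation_effect(x, W1, W2, unit, target_token):
    # Behavioural metric: mean change in the target token's probability
    # across all prompts when `unit` is ablated.
    p_clean = forward(x, W1, W2)[:, target_token]
    p_ablated = forward(x, W1, W2, ablate_unit=unit)[:, target_token]
    return float(np.mean(p_ablated - p_clean))


# Hypothetical setup: random "prompt" encodings and random weights.
rng = np.random.default_rng(0)
n_prompts, d_in, d_hidden, vocab = 32, 8, 4, 5
x = rng.normal(size=(n_prompts, d_in))
W1 = rng.normal(size=(d_in, d_hidden))
W2 = rng.normal(size=(d_hidden, vocab))

# One average effect per hidden unit, for target token 0.
effects = [
    average_ablation_effect(x, W1, W2, unit=u, target_token=0)
    for u in range(d_hidden)
]
for u, eff in enumerate(effects):
    print(f"unit {u}: mean change in P(token 0) = {eff:+.4f}")
```

The quantity returned is an average over the chosen prompt set, which is the kind of claim the abstract attributes to interventions; on its own it says nothing about other prompt distributions or about what would have happened on any single prompt.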

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Machine Learning in Healthcare