From Mechanistic to Compositional Interpretability
Ward Gauderis, Thomas Dooms, Steven T. Holmer, Kola Ayonrinde, Geraint A. Wiggins

TL;DR
This paper introduces a formal, compositional framework for neural interpretability using category theory, enabling systematic, verifiable, and concise explanations of model behavior.
Contribution
The paper develops a formal framework for interpretability that connects mechanistic explanations with compositionality and minimum description length, and introduces compressive refinement, a method for restructuring models into simpler parts without altering their function.
Findings
Unifies mechanistic interpretability with compositionality.
Proves a parsimony criterion for concise explanations.
Situates existing methods as special cases within the framework.
Abstract
Mechanistic interpretability aims to explain neural model behaviour by reverse-engineering learned computational structure into human-understandable components. Without a formal framework, however, mechanistic explanations cannot be objectively verified, compared, or composed. We introduce compositional interpretability, a category-theoretic framework grounded in the principles of compositionality and minimum description length. Compositional interpretations are pairs of syntactic and semantic mappings that must commute to enforce consistency between a model's decomposition and its observed behaviour. We deconstruct explanation quality into measures of faithfulness and complexity to cast interpretability as a constrained optimisation problem, and introduce compressive refinement to systematically restructure models into simpler parts without altering their function. Finally, we prove a…
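The commuting requirement and the faithfulness/complexity split described in the abstract can be illustrated with a toy sketch. This is not the paper's construction: the names `PARTS`, `model`, `syntax`, `semantics`, `commutes`, `faithfulness`, and `complexity` are all hypothetical, the "model" is a two-step integer function, and part count is used as a crude stand-in for description length.

```python
from typing import Callable, Dict, List

# Hypothetical meanings of named parts (an assumption for illustration only).
PARTS: Dict[str, Callable[[int], int]] = {
    "double": lambda x: 2 * x,
    "shift": lambda x: x + 3,
}


def model(x: int) -> int:
    """Observed behaviour of the whole model, treated as a black box."""
    return 2 * x + 3


def syntax() -> List[str]:
    """Assumed syntactic mapping: decompose the model into named parts, in order."""
    return ["double", "shift"]


def semantics(names: List[str]) -> Callable[[int], int]:
    """Assumed semantic mapping: compose the meanings of the named parts."""
    def composed(x: int) -> int:
        for name in names:
            x = PARTS[name](x)
        return x
    return composed


def commutes(inputs: List[int]) -> bool:
    """Check that interpreting the decomposition agrees with the model's behaviour."""
    explained = semantics(syntax())
    return all(explained(x) == model(x) for x in inputs)


def faithfulness(names: List[str], inputs: List[int]) -> float:
    """Fraction of inputs on which a candidate explanation matches the model."""
    explained = semantics(names)
    return sum(explained(x) == model(x) for x in inputs) / len(inputs)


def complexity(names: List[str]) -> int:
    """Crude proxy for description length: the number of parts used."""
    return len(names)
```

In this toy setting, `commutes(list(range(-5, 6)))` holds because composing the two parts reproduces `model` exactly, whereas the shorter candidate `["double"]` has lower complexity but is unfaithful. Choosing the simplest explanation among those that stay faithful is the kind of constrained trade-off the abstract describes.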
