Explaining Neural Networks with Reasons
Levin Hornischer, Hannes Leitgeb

TL;DR
This paper introduces a novel interpretability method for neural networks based on a philosophically grounded reasons vector, enabling logical and Bayesian explanations of neuron functions across architectures.
Contribution
It presents a scalable, uniform, and faithful interpretability approach that combines philosophical notions of explanation with practical neural network analysis.
Findings
The method is grounded in a philosophically established notion of explanation.
It applies to the common neural network architectures and modalities.
Interventions based on reasons vectors lead to predictable changes in the output.
Abstract
We propose a new interpretability method for neural networks, which is based on a novel mathematico-philosophical theory of reasons. Our method computes a vector for each neuron, called its reasons vector. We then can compute how strongly this reasons vector speaks for various propositions, e.g., the proposition that the input image depicts digit 2 or that the input prompt has a negative sentiment. This yields an interpretation of neurons, and groups thereof, that combines a logical and a Bayesian perspective, and accounts for polysemanticity (i.e., that a single neuron can figure in multiple concepts). We show, both theoretically and empirically, that this method is: (1) grounded in a philosophically established notion of explanation, (2) uniform, i.e., applies to the common neural network architectures and modalities, (3) scalable, since computing reason vectors only involves…
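The abstract is cut off before it says how a reasons vector is computed, so the following is only a rough illustration of the Bayesian reading of "speaking for a proposition", not the authors' definition. It assumes a hypothetical probability-raising score: how much more probable a proposition is on inputs where a neuron fires than on inputs where it does not. The function name, the min-max rescaling of activations, and the toy data are all assumptions made for this sketch.

```python
import numpy as np


def probability_raising_score(activations, proposition):
    """Hypothetical score: P(A | neuron fires) - P(A | neuron is silent).

    activations: one activation value of a single neuron per input.
    proposition: 1 where proposition A holds for that input, else 0.
    This is an illustrative stand-in, not the paper's reasons vector.
    """
    a = np.asarray(activations, dtype=float)
    p = np.asarray(proposition, dtype=float)

    # Rescale activations to [0, 1] so they act as soft "fires" weights.
    span = a.max() - a.min()
    w = (a - a.min()) / span if span > 0 else np.zeros_like(a)

    fires, silent = w.sum(), (1.0 - w).sum()
    if fires == 0 or silent == 0:
        # A neuron that never varies carries no evidence either way.
        return 0.0

    p_given_fires = (w * p).sum() / fires
    p_given_silent = ((1.0 - w) * p).sum() / silent
    return float(p_given_fires - p_given_silent)


# Toy data (assumed): one neuron observed on six inputs.
acts = [0.0, 0.1, 0.9, 1.0, 0.8, 0.2]
is_digit_2 = [0, 0, 1, 1, 0, 0]   # proposition "the image depicts digit 2"
is_digit_7 = [0, 0, 0, 0, 1, 0]   # proposition "the image depicts digit 7"

score_2 = probability_raising_score(acts, is_digit_2)
score_7 = probability_raising_score(acts, is_digit_7)
print(score_2, score_7)
```

On this toy data both scores are positive, which is one simple way to picture polysemanticity: the same neuron counts as evidence for more than one proposition. A score near zero would mean the neuron is uninformative about the proposition, and a negative score would mean it speaks against it.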
Taxonomy
Topics: Neural Networks and Applications · Explainable Artificial Intelligence (XAI) · Anomaly Detection Techniques and Applications
