Interpreting Neural Networks through the Polytope Lens
Sid Black, Lee Sharkey, Leo Grinsztajn, Eric Winsor, Dan Braun, Jacob Merizian, Kip Parker, Carlos Ramón Guevara, Beren Millidge, Gabriel Alfour, Connor Leahy

TL;DR
This paper introduces the polytope lens, an approach to interpreting neural networks that analyzes how activation space is partitioned into polytopes. The authors argue that this gives clearer insights than traditional neuron- or direction-based methods.
Contribution
The paper proposes polytopes in activation space as a fundamental unit of description for neural network interpretability (the polytope lens), addressing limitations of neuron- and direction-based descriptions.
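To make the idea of activation-space polytopes concrete, here is a minimal sketch, assuming a small fully connected ReLU network. In such a network, the on/off pattern of the ReLUs identifies the linear region (polytope) an input falls into, and the network is affine within that region. The helper name `polytope_code` and the toy weights are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import numpy as np

def polytope_code(x, weights, biases):
    """Return the ReLU on/off pattern (0/1 per neuron) for input x.

    Hypothetical helper: inputs that share the same pattern lie in the
    same linear region (polytope) of a piecewise-linear ReLU network.
    """
    pattern = []
    h = x
    for W, b in zip(weights, biases):
        pre = W @ h + b                # pre-activations of this layer
        mask = pre > 0                 # which ReLUs are active
        pattern.append(mask)
        h = np.where(mask, pre, 0.0)   # apply ReLU
    return np.concatenate(pattern).astype(int)

# Toy two-layer network (illustrative weights, not from the paper).
weights = [
    np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]),
    np.array([[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]),
]
biases = [np.zeros(4), np.array([-1.0, -1.0])]

code_a = polytope_code(np.array([2.0, 3.0]), weights, biases)
code_b = polytope_code(np.array([2.5, 3.5]), weights, biases)   # nearby input
code_c = polytope_code(np.array([-2.0, 3.0]), weights, biases)  # across a boundary

print(code_a, code_b, code_c)
```

In this toy setup the first two inputs share a code, so they sit in the same polytope, while the third crosses a ReLU boundary and lands in a different one. Comparing codes like this is one simple way to probe how a network partitions its input or activation space.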
Findings
Polytopes identify monosemantic regions of activation space.
Density of polytope boundaries correlates with semantic boundaries.
Polytope analysis predicts neural network behavior effectively.
Abstract
Mechanistic interpretability aims to explain what a neural network has learned at a nuts-and-bolts level. What are the fundamental primitives of neural network representations? Previous mechanistic descriptions have used individual neurons or their linear combinations to understand the representations a network has learned. But there are clues that neurons and their linear combinations are not the correct fundamental units of description: directions cannot describe how neural networks use nonlinearities to structure their representations. Moreover, many instances of individual neurons and their combinations are polysemantic (i.e. they have multiple unrelated meanings). Polysemanticity makes interpreting the network in terms of neurons or directions challenging since we can no longer assign a specific feature to a neural unit. In order to find a basic unit of description that does not…
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Neural Networks and Applications · Adversarial Robustness in Machine Learning
