PURE: Turning Polysemantic Neurons Into Pure Features by Identifying Relevant Circuits
Maximilian Dreyer, Erblina Purelku, Johanna Vielhaben, Wojciech Samek, Sebastian Lapuschkin

TL;DR
This paper introduces PURE, a method to decompose polysemantic neurons in deep neural networks into multiple monosemantic units by identifying relevant circuits, enhancing interpretability of neuron functions.
Contribution
PURE is a novel approach that disentangles polysemantic neurons into pure features by circuit identification, improving interpretability of neural network representations.
Findings
Successfully disentangles polysemantic neurons in ResNet models.
Yields more clearly disentangled feature visualizations, as evaluated with CLIP, than methods based on neuron activations.
Demonstrates effectiveness on ImageNet-trained models.
Abstract
The field of mechanistic interpretability aims to study the role of individual neurons in Deep Neural Networks. Single neurons, however, have the capability to act polysemantically and encode for multiple (unrelated) features, which renders their interpretation difficult. We present a method for disentangling polysemanticity of any Deep Neural Network by decomposing a polysemantic neuron into multiple monosemantic "virtual" neurons. This is achieved by identifying the relevant sub-graph ("circuit") for each "pure" feature. We demonstrate how our approach allows us to find and disentangle various polysemantic units of ResNet models trained on ImageNet. While evaluating feature visualizations using CLIP, our method effectively disentangles representations, improving upon methods based on neuron activations. Our code is available at https://github.com/maxdreyer/PURE.
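To make the idea of splitting one neuron into "virtual" neurons concrete, here is a minimal toy sketch in Python. It is not the authors' implementation (see the linked repository for that). It rests on several assumptions that the abstract does not state: the "circuit" of a sample is approximated by input-times-weight attributions over the lower-layer units of a single linear neuron, samples are grouped with a plain k-means, and the data, the weights `w`, and the helper names `lower_layer_attributions` and `kmeans` are all hypothetical.

```python
import numpy as np

def lower_layer_attributions(x, w):
    """Assumed attribution: input times weight per lower-layer unit,
    normalised so each sample's attributions sum to 1 in absolute value."""
    r = x * w                                   # shape (n_samples, n_lower)
    return r / np.abs(r).sum(axis=1, keepdims=True)

def kmeans(points, k, n_iter=20):
    """Plain k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.stack(centers)
    for _ in range(n_iter):
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

# Toy polysemantic neuron: it responds to two unrelated (hypothetical) features.
rng = np.random.default_rng(0)
w = np.ones(6)                                  # weights onto 6 lower-layer units
strong = lambda: rng.uniform(0.5, 1.0, (20, 3))
weak = lambda: rng.uniform(0.0, 0.05, (20, 3))
x_a = np.concatenate([strong(), weak()], axis=1)   # feature A uses units 0-2
x_b = np.concatenate([weak(), strong()], axis=1)   # feature B uses units 3-5
x = np.concatenate([x_a, x_b])

# Cluster the per-sample attribution vectors; each cluster is one virtual neuron.
attr = lower_layer_attributions(x, w)
virtual_neuron = kmeans(attr, k=2)
print(virtual_neuron)
```

In this toy setting both groups of samples activate the neuron to a similar degree, so the activation value alone cannot tell them apart, while their attribution vectors over the lower layer fall into two well-separated clusters. Each cluster then plays the role of one monosemantic virtual neuron.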
Taxonomy
Topics: Neural Networks and Applications
Methods: Average Pooling · Convolution · Kaiming Initialization · Max Pooling · Global Average Pooling · Contrastive Language-Image Pre-training
