PURE: Turning Polysemantic Neurons Into Pure Features by Identifying Relevant Circuits

Maximilian Dreyer; Erblina Purelku; Johanna Vielhaben; Wojciech Samek; Sebastian Lapuschkin

arXiv:2404.06453·cs.CV·April 10, 2024·2 cites

TL;DR

This paper introduces PURE, a method to decompose polysemantic neurons in deep neural networks into multiple monosemantic units by identifying relevant circuits, enhancing interpretability of neuron functions.

Contribution

PURE is a novel approach that disentangles polysemantic neurons into pure features by circuit identification, improving interpretability of neural network representations.

Findings

01. Disentangles polysemantic neurons in ResNet models into monosemantic "virtual" neurons.

02. Yields better-disentangled feature visualizations, as evaluated with CLIP, than methods based on neuron activations.

03. Demonstrates the approach on models trained on ImageNet.

Abstract

The field of mechanistic interpretability aims to study the role of individual neurons in Deep Neural Networks. Single neurons, however, have the capability to act polysemantically and encode for multiple (unrelated) features, which renders their interpretation difficult. We present a method for disentangling polysemanticity of any Deep Neural Network by decomposing a polysemantic neuron into multiple monosemantic "virtual" neurons. This is achieved by identifying the relevant sub-graph ("circuit") for each "pure" feature. We demonstrate how our approach allows us to find and disentangle various polysemantic units of ResNet models trained on ImageNet. While evaluating feature visualizations using CLIP, our method effectively disentangles representations, improving upon methods based on neuron activations. Our code is available at https://github.com/maxdreyer/PURE.
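
The decomposition described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract only states that a relevant sub-graph ("circuit") is identified for each pure feature. The sketch assumes a single linear neuron with ReLU, uses activation-times-weight as a stand-in attribution rule, and groups the attribution vectors of the top-activating samples with a small k-means. All function names, the toy data, and the choice of two clusters are hypothetical.

```python
import numpy as np

def farthest_point_init(points, k):
    # Deterministic init: start at the first point, then repeatedly add
    # the point farthest from all centers chosen so far.
    centers = [points[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[int(d.argmax())])
    return np.stack(centers).astype(float)

def kmeans(points, k, n_iter=25):
    # Plain Lloyd iterations; returns one cluster label per point.
    centers = farthest_point_init(points, k)
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels

def split_into_virtual_neurons(inputs, weights, k=2, top_n=20):
    # Activation of the (assumed) neuron: ReLU of a weighted sum.
    acts = np.maximum(inputs @ weights, 0.0)
    top = np.argsort(acts)[::-1][:top_n]  # top-activating samples
    # Assumed attribution rule: contribution of each lower-layer unit.
    attr = inputs[top] * weights
    attr = attr / (np.linalg.norm(attr, axis=1, keepdims=True) + 1e-12)
    return top, kmeans(attr, k)

# Toy polysemantic neuron: it fires both when lower units 0-1 are active
# (feature A) and when lower units 2-3 are active (feature B).
rng = np.random.default_rng(0)
feature_a = np.array([1.0, 1.0, 0.0, 0.0]) + 0.05 * rng.normal(size=(10, 4))
feature_b = np.array([0.0, 0.0, 1.0, 1.0]) + 0.05 * rng.normal(size=(10, 4))
inputs = np.vstack([feature_a, feature_b])
weights = np.ones(4)

top, labels = split_into_virtual_neurons(inputs, weights, k=2, top_n=20)
sample_labels = np.empty(len(inputs), dtype=int)
sample_labels[top] = labels  # map cluster labels back to sample order
print(sample_labels)
```

In this toy setting, samples that activate the neuron through different lower-layer units fall into different clusters, and each cluster plays the role of one "virtual" neuron. For a real network, the attribution step and the number of clusters would need to follow the paper and the official repository.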

Code & Models

Repositories

maxdreyer/PURE (PyTorch, official)

Taxonomy

Topics: Neural Networks and Applications

Methods: Average Pooling · Convolution · Kaiming Initialization · Max Pooling · Global Average Pooling · Contrastive Language-Image Pre-training