Disentangling Polysemantic Neurons with a Null-Calibrated Polysemanticity Index and Causal Patch Interventions
Manan Gupta, Dhruv Kumar

TL;DR
This paper introduces the Polysemanticity Index (PSI), a metric for identifying and analyzing neurons that respond to multiple unrelated features, in support of mechanistic interpretability.
Contribution
The paper presents PSI, a null-calibrated metric combining geometric, categorical, and semantic components to quantify neuron polysemanticity, validated with causal interventions.
Findings
On a pretrained ResNet-50 evaluated with Tiny-ImageNet images, PSI identifies neurons whose top activations split into coherent, nameable prototypes.
Later layers show higher PSI, indicating more polysemanticity.
Causal patch interventions confirm the functional significance of identified neurons.
Abstract
Neural networks often contain polysemantic neurons that respond to multiple, sometimes unrelated, features, complicating mechanistic interpretability. We introduce the Polysemanticity Index (PSI), a null-calibrated metric that quantifies when a neuron's top activations decompose into semantically distinct clusters. PSI multiplies three independently calibrated components: geometric cluster quality (S), alignment to labeled categories (Q), and open-vocabulary semantic distinctness via CLIP (D). On a pretrained ResNet-50 evaluated with Tiny-ImageNet images, PSI identifies neurons whose activation sets split into coherent, nameable prototypes, and reveals strong depth trends: later layers exhibit substantially higher PSI than earlier layers. We validate our approach with robustness checks (varying hyperparameters, random seeds, and cross-encoder text heads), breadth analyses (comparing…
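To make the multiplicative structure concrete, here is a minimal sketch of how three null-calibrated components could be combined into a single score. The abstract does not spell out the calibration procedure, so the scheme below (excess over a quantile of a null distribution, rescaled to [0, 1]) is an assumption, and the names `null_calibrate` and `psi` are hypothetical rather than taken from the paper. Raw scores are assumed to be bounded above by 1.

```python
import numpy as np

def null_calibrate(raw, null_samples, q=0.95):
    """Rescale a raw score to [0, 1] as its excess over a null quantile.

    Assumed scheme, not the paper's: scores at or below the q-quantile of the
    null distribution map to 0, and a raw score of 1 maps to 1.
    """
    null_samples = np.asarray(null_samples, dtype=float)
    threshold = float(np.quantile(null_samples, q))
    upper = 1.0  # assumption: raw scores are bounded above by 1
    if upper <= threshold:
        return 0.0
    return float(np.clip((raw - threshold) / (upper - threshold), 0.0, 1.0))

def psi(raw_scores, null_scores, q=0.95):
    """Multiply the calibrated S, Q, and D components into one index.

    raw_scores:  dict with keys "S", "Q", "D" holding raw component scores.
    null_scores: dict with the same keys holding samples from each null.
    """
    calibrated = {
        key: null_calibrate(raw_scores[key], null_scores[key], q)
        for key in ("S", "Q", "D")
    }
    return calibrated["S"] * calibrated["Q"] * calibrated["D"], calibrated

# Toy example with made-up numbers (not results from the paper).
toy_null = {key: np.full(100, 0.2) for key in ("S", "Q", "D")}
toy_raw = {"S": 0.6, "Q": 1.0, "D": 0.6}
score, parts = psi(toy_raw, toy_null)
```

Because the components are multiplied, a neuron scores highly only if all three exceed their null baselines; a single component at or below its null threshold drives the index to zero.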
