Disentangling Polysemantic Neurons with a Null-Calibrated Polysemanticity Index and Causal Patch Interventions

Manan Gupta; Dhruv Kumar

arXiv:2508.16950 · cs.LG · August 26, 2025

TL;DR

This paper introduces the Polysemanticity Index (PSI), a null-calibrated metric for identifying and analyzing neurons that respond to multiple unrelated features, with the aim of improving interpretability.

Contribution

The paper presents PSI, a null-calibrated metric combining geometric, categorical, and semantic components to quantify neuron polysemanticity, validated with causal interventions.

Findings

01. PSI identifies ResNet-50 neurons whose top activations split into coherent, nameable prototypes.
02. Later layers show substantially higher PSI than earlier layers, indicating more polysemanticity with depth.
03. Causal patch interventions confirm the functional significance of the identified neurons.

Abstract

Neural networks often contain polysemantic neurons that respond to multiple, sometimes unrelated, features, complicating mechanistic interpretability. We introduce the Polysemanticity Index (PSI), a null-calibrated metric that quantifies when a neuron's top activations decompose into semantically distinct clusters. PSI multiplies three independently calibrated components: geometric cluster quality (S), alignment to labeled categories (Q), and open-vocabulary semantic distinctness via CLIP (D). On a pretrained ResNet-50 evaluated with Tiny-ImageNet images, PSI identifies neurons whose activation sets split into coherent, nameable prototypes, and reveals strong depth trends: later layers exhibit substantially higher PSI than earlier layers. We validate our approach with robustness checks (varying hyperparameters, random seeds, and cross-encoder text heads), breadth analyses (comparing…
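The abstract describes PSI as the product of three independently null-calibrated components but does not spell out the calibration procedure. The sketch below illustrates the idea for the geometric component S only, under assumptions that are ours rather than the paper's: S is taken as the mean silhouette of a given cluster assignment of a neuron's top-activating image embeddings, the null distribution comes from random relabelings of the same points, and calibration maps a raw score to its empirical percentile in that null. Q and D are stubbed with placeholder values, and `silhouette`, `null_calibrate`, and `psi` are hypothetical helper names.

```python
# Illustrative sketch only: the null distribution and calibration scheme are
# assumptions, not the method reported in the paper.
import numpy as np


def silhouette(X: np.ndarray, labels: np.ndarray) -> float:
    """Mean Euclidean silhouette coefficient of a clustering of the rows of X."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    clusters = np.unique(labels)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself from its own cluster mean
        if not same.any():
            scores.append(0.0)  # singleton cluster: silhouette is 0 by convention
            continue
        a = D[i, same].mean()  # mean distance to the point's own cluster
        # mean distance to the nearest other cluster
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        scores.append((b - a) / max(a, b, 1e-12))
    return float(np.mean(scores))


def null_calibrate(observed: float, null_samples) -> float:
    """Fraction of null samples strictly below the observed score, in [0, 1]."""
    return float((np.asarray(null_samples) < observed).mean())


def psi(s_cal: float, q_cal: float, d_cal: float) -> float:
    """Product of the three calibrated components, as the abstract describes."""
    return s_cal * q_cal * d_cal


# Toy usage: hypothetical 2-D embeddings of one neuron's top-20 activating images.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (10, 2)),    # hypothetical prototype A
               rng.normal(10.0, 0.5, (10, 2))])  # hypothetical prototype B
labels = np.repeat([0, 1], 10)                   # hypothetical cluster assignment

s_raw = silhouette(X, labels)
# Assumed null: silhouette of the same points under random relabelings.
null_s = [silhouette(X, rng.permutation(labels)) for _ in range(200)]
s_cal = null_calibrate(s_raw, null_s)

# Q (category alignment) and D (CLIP distinctness) are placeholders here.
q_cal, d_cal = 0.5, 0.5
score = psi(s_cal, q_cal, d_cal)
```

Because the components are multiplied, a neuron can score high only if all three calibrated signals are high; a near-zero value in any single component pulls PSI toward zero.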
