Neurons Speak in Ranges: Breaking Free from Discrete Neuronal Attribution
Muhammad Umair Haider, Hammad Rizwan, Hassan Sajjad, Peizhong Ju, A.B. Siddique

TL;DR
This paper introduces NeuronLens, a range-based interpretability framework for LLMs that leverages neuron activation ranges to improve concept attribution and targeted manipulation with less collateral impact.
Contribution
The paper presents a range-based approach to neuron interpretation and intervention that addresses polysemanticity in large language models.
Findings
Neuron activation magnitudes for concepts form distinct, often Gaussian-like distributions.
Range-based interventions effectively manipulate target concepts with less collateral damage.
NeuronLens outperforms traditional masking methods in preserving overall model performance.
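The contrast between range-based and discrete masking in the findings above can be illustrated with a small sketch. This is not the authors' implementation. It assumes a single hypothetical neuron whose activations on two concepts are drawn from synthetic Gaussians, a range defined as mean ± k standard deviations (k = 2.5 is an illustrative choice), and zeroing as the intervention. The function names `concept_range`, `range_mask`, and `discrete_mask` are invented for this example.

```python
import numpy as np


def concept_range(activations, k=2.5):
    """Return (low, high) = mean -/+ k * std of one neuron's activations
    on a single concept. The width k is an assumption, not a value taken
    from the paper summary."""
    mu = float(np.mean(activations))
    sigma = float(np.std(activations))
    return mu - k * sigma, mu + k * sigma


def range_mask(activations, low, high, fill=0.0):
    """Range-based intervention: overwrite only activations that fall
    inside [low, high]; all other activations pass through unchanged."""
    acts = np.asarray(activations, dtype=float)
    in_range = (acts >= low) & (acts <= high)
    return np.where(in_range, fill, acts)


def discrete_mask(activations, fill=0.0):
    """Discrete (whole-neuron) intervention: overwrite every activation."""
    acts = np.asarray(activations, dtype=float)
    return np.full_like(acts, fill)


# Synthetic activations of one polysemantic neuron on two concepts.
# The means and spreads are made up so the two distributions barely overlap.
rng = np.random.default_rng(0)
concept_a = rng.normal(loc=2.0, scale=0.2, size=1000)   # target concept
concept_b = rng.normal(loc=-1.0, scale=0.2, size=1000)  # unrelated concept

low, high = concept_range(concept_a)

masked_a = range_mask(concept_a, low, high)
masked_b = range_mask(concept_b, low, high)

# Fraction of activations suppressed by the range-based mask per concept.
frac_a = float(np.mean(masked_a == 0.0))
frac_b = float(np.mean(masked_b == 0.0))

print(f"range for concept A: [{low:.2f}, {high:.2f}]")
print(f"suppressed on A: {frac_a:.1%}, suppressed on B: {frac_b:.1%}")
```

In this toy setup the range-based mask suppresses most of the target concept's activations and leaves the unrelated concept untouched, whereas `discrete_mask` would zero the neuron for both concepts. Real activation distributions overlap to varying degrees, so the separation here is cleaner than what a trained model would show.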
Abstract
Pervasive polysemanticity in large language models (LLMs) undermines discrete neuron-concept attribution, posing a significant challenge for model interpretation and control. We systematically analyze both encoder- and decoder-based LLMs across diverse datasets and observe that even highly salient neurons for specific semantic concepts consistently exhibit polysemantic behavior. Importantly, we uncover a consistent pattern: concept-conditioned activation magnitudes of neurons form distinct, often Gaussian-like distributions with minimal overlap. Building on this observation, we hypothesize that interpreting and intervening on concept-specific activation ranges can enable more precise interpretability and targeted manipulation in LLMs. To this end, we introduce NeuronLens, a novel range-based interpretation and manipulation framework that localizes concept attribution to activation…
