Capturing Polysemanticity with PRISM: A Multi-Concept Feature Description Framework
Laura Kopf, Nils Feldhus, Kirill Bykov, Philine Lou Bommer, Anna Hedström, Marina M.-C. Höhne, Oliver Eberle

TL;DR
PRISM is a new framework for interpreting large language models that captures both single and multiple concepts encoded in neurons, improving the accuracy and depth of feature descriptions over existing methods.
Contribution
PRISM introduces a novel approach to identifying and scoring polysemantic features in LLMs, addressing the limits of the monosemanticity assumption made by existing interpretability methods.
Findings
PRISM outperforms existing methods in description quality.
PRISM effectively captures polysemantic neuron behaviors.
Benchmark results show improved interpretability metrics.
Abstract
Automated interpretability research aims to identify concepts encoded in neural network features to enhance human understanding of model behavior. Within the context of large language models (LLMs) for natural language processing (NLP), current automated neuron-level feature description methods face two key challenges: limited robustness and the assumption that each neuron encodes a single concept (monosemanticity), despite increasing evidence of polysemanticity. This assumption restricts the expressiveness of feature descriptions and limits their ability to capture the full range of behaviors encoded in model internals. To address this, we introduce Polysemantic FeatuRe Identification and Scoring Method (PRISM), a novel framework specifically designed to capture the complexity of features in LLMs. Unlike approaches that assign a single description per neuron, common in many automated…
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Multimodal Machine Learning Applications
