Polysemanticity or Polysemy? Lexical Identity Confounds Superposition Metrics
Iyad Ait Hou, Rebecca Hwa

TL;DR
This paper investigates how a lexical confound, in which a neuron fires for a shared word form rather than for two compressed concepts, inflates superposition metrics in neural models. It finds that much of the measured overlap comes from shared words rather than from concept compression.
Contribution
It introduces a 2x2 factorial decomposition (same or different word, same or different meaning) that separates lexical overlap from semantic overlap, and shows that the lexical confound is widespread and affects both interpretability tools and downstream tasks.
Findings
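Overlap in the lexical-only condition (same word, different meaning) consistently exceeds overlap in the semantic-only condition (different word, same meaning) across models from 110M to 70B parameters.
18-36% of sparse autoencoder features blend senses because of the lexical confound, which sits in <=1% of activation dimensions.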
Filtering out lexical confounds improves word sense disambiguation and knowledge editing.
Abstract
If the same neuron activates for both "lender" and "riverside," standard metrics attribute the overlap to superposition: the neuron must be compressing two unrelated concepts. This work explores how much of the overlap is due to a lexical confound: neurons fire for a shared word form (such as "bank") rather than for two compressed concepts. A 2x2 factorial decomposition reveals that the lexical-only condition (same word, different meaning) consistently exceeds the semantic-only condition (different word, same meaning) across models spanning 110M-70B parameters. The confound carries into sparse autoencoders (18-36% of features blend senses), sits in <=1% of activation dimensions, and hurts downstream tasks: filtering it out improves word sense disambiguation and makes knowledge edits more selective (p = 0.002).
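The 2x2 factorial decomposition described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the overlap measure (cosine similarity), the helper names `factorial_overlap` and `toy_pair`, and the synthetic activations are all assumptions made for illustration. The toy data is deliberately built so that the word-form component is stronger than the meaning component, so it demonstrates how the four cells are compared and does not reproduce the paper's results.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 64  # hypothetical activation width


def cosine(a, b):
    """Cosine similarity, used here as a stand-in overlap metric (assumption)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def factorial_overlap(pairs):
    """Average overlap in each cell of the 2x2 design.

    pairs: iterable of (act_a, act_b, same_word, same_meaning).
    Returns a dict keyed by (same_word, same_meaning).
    """
    cells = {}
    for act_a, act_b, same_word, same_meaning in pairs:
        cells.setdefault((same_word, same_meaning), []).append(cosine(act_a, act_b))
    return {key: float(np.mean(vals)) for key, vals in cells.items()}


def toy_pair(same_word, same_meaning, word_weight=2.0, sense_weight=1.0):
    """Synthetic activation pair: word-form part + meaning part + noise.

    The weights are arbitrary toy values, not estimates from any model.
    """
    word_a, sense_a = rng.normal(size=DIM), rng.normal(size=DIM)
    word_b = word_a if same_word else rng.normal(size=DIM)
    sense_b = sense_a if same_meaning else rng.normal(size=DIM)
    act_a = word_weight * word_a + sense_weight * sense_a + 0.5 * rng.normal(size=DIM)
    act_b = word_weight * word_b + sense_weight * sense_b + 0.5 * rng.normal(size=DIM)
    return act_a, act_b, same_word, same_meaning


# 200 synthetic pairs for each of the four conditions.
pairs = [
    toy_pair(same_word, same_meaning)
    for same_word in (True, False)
    for same_meaning in (True, False)
    for _ in range(200)
]
overlaps = factorial_overlap(pairs)

lexical_only = overlaps[(True, False)]   # same word, different meaning
semantic_only = overlaps[(False, True)]  # different word, same meaning
print(f"lexical-only overlap:  {lexical_only:.3f}")
print(f"semantic-only overlap: {semantic_only:.3f}")
```

With real data, `act_a` and `act_b` would be activations for two contexts, and the two boolean labels would come from a sense-annotated corpus. The comparison of interest is the lexical-only cell against the semantic-only cell: a standard overlap metric that ignores the `same_word` factor would count both as evidence of superposition.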
