Encourage or Inhibit Monosemanticity? Revisit Monosemanticity from a Feature Decorrelation Perspective

Hanqi Yan; Yanzheng Xiang; Guangyi Chen; Yifei Wang; Lin Gui; Yulan He

arXiv:2406.17969 · cs.CL · October 17, 2024

Open Access · 1 Repo

TL;DR

This paper investigates the role of monosemanticity in large language models, finding that encouraging monosemanticity through feature decorrelation improves model capacity, representation diversity, and preference alignment.

Contribution

It revisits monosemanticity from a feature decorrelation perspective and proposes a regularizer that enhances model performance by promoting monosemanticity.
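To make the idea of a feature decorrelation regularizer concrete, here is a minimal sketch in plain Python. It is not the paper's regularizer: the exact formulation is not given on this page, so the function name `decorrelation_penalty` and the choice of penalty (the mean squared pairwise correlation between feature dimensions) are assumptions made for illustration only.

```python
import math

def decorrelation_penalty(activations):
    """Mean squared correlation over distinct feature pairs.

    Hypothetical illustration, not the paper's method.
    activations: list of rows, one per sample; each row lists feature values.
    """
    n = len(activations)
    d = len(activations[0])

    # Standardize each feature (column) to zero mean and unit variance.
    columns = []
    for j in range(d):
        col = [row[j] for row in activations]
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / n
        std = math.sqrt(var) + 1e-8  # small constant avoids division by zero
        columns.append([(x - mean) / std for x in col])

    # Accumulate squared correlations over distinct feature pairs.
    total = 0.0
    pairs = 0
    for i in range(d):
        for j in range(i + 1, d):
            corr = sum(a * b for a, b in zip(columns[i], columns[j])) / n
            total += corr ** 2
            pairs += 1
    return total / pairs if pairs else 0.0


# Two features that move together are penalized heavily ...
correlated = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
# ... while two features that vary independently are not.
uncorrelated = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]

penalty_correlated = decorrelation_penalty(correlated)
penalty_uncorrelated = decorrelation_penalty(uncorrelated)
```

In a training loop, a term like this would typically be scaled by a weight (a hypothetical hyperparameter here) and added to the task loss, so that minimizing the total pushes feature dimensions toward being uncorrelated.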

Findings

01  Monosemanticity positively correlates with model capacity.

02  The feature decorrelation regularizer improves representation diversity.

03  The regularizer also improves preference alignment performance.

Abstract

To better interpret the intrinsic mechanism of large language models (LLMs), recent studies focus on the monosemanticity of their basic units. A monosemantic neuron is dedicated to a single, specific concept, which forms a one-to-one correlation between neurons and concepts. Despite extensive research in monosemanticity probing, it remains unclear whether monosemanticity is beneficial or harmful to model capacity. To explore this question, we revisit monosemanticity from the feature decorrelation perspective and advocate for its encouragement. We experimentally observe that the current conclusion by Wang et al. (2024), which suggests that decreasing monosemanticity enhances model performance, does not hold when the model changes. Instead, we demonstrate that monosemanticity consistently exhibits a positive correlation with model capacity in the preference alignment process. Consequently, we…

Code & Models

Repositories

hanqi-qi/revisit_monosemanticity
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Text Readability and Simplification · Syntax, Semantics, Linguistic Variation

Methods: Focus