Measuring and Guiding Monosemanticity

Ruben Härle; Felix Friedrich; Manuel Brack; Stephan Wäldchen; Björn Deiseroth; Patrick Schramowski; Kristian Kersting

arXiv:2506.19382 · cs.CL · December 2, 2025

TL;DR

This paper introduces a new metric for measuring feature monosemanticity in language models and proposes a method to improve interpretability and control by conditioning autoencoders on labeled concepts, demonstrating enhanced localization and disentanglement.

Contribution

The paper presents the Feature Monosemanticity Score (FMS), a metric for quantifying feature monosemanticity, and introduces Guided Sparse Autoencoders (G-SAE), which condition latent representations on labeled concepts during training to improve the interpretability and control of LLMs.
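To make the idea of conditioning a latent on a labeled concept concrete, here is a minimal sketch of what a guided training objective could look like. It is an illustration under stated assumptions, not the paper's formulation: the function name `guided_sae_loss`, the choice of binary cross-entropy as the guidance term, the L1 sparsity penalty, and the weights `l1_weight` and `guide_weight` are all hypothetical.

```python
import math


def sigmoid(value):
    """Logistic function, used to read a latent activation as a probability."""
    return 1.0 / (1.0 + math.exp(-value))


def guided_sae_loss(x, x_hat, latents, concept_label, concept_index=0,
                    l1_weight=1e-3, guide_weight=1.0):
    """Hypothetical guided objective for one example (an assumption, not the paper's loss).

    x, x_hat      : input activation and its reconstruction (lists of floats)
    latents       : autoencoder latent activations (list of floats)
    concept_label : 1 if the labeled concept is present, else 0
    concept_index : latent reserved for the labeled concept (assumed)
    """
    # Standard autoencoder term: mean squared reconstruction error.
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

    # Sparsity term: L1 penalty on the latent activations.
    sparsity = sum(abs(z) for z in latents)

    # Guidance term (assumed): binary cross-entropy tying one reserved
    # latent to the concept label, so that latent localizes the concept.
    eps = 1e-12
    p = sigmoid(latents[concept_index])
    guidance = -(concept_label * math.log(p + eps)
                 + (1 - concept_label) * math.log(1.0 - p + eps))

    return reconstruction + l1_weight * sparsity + guide_weight * guidance


if __name__ == "__main__":
    x = [1.0, 0.0]
    # Reserved latent fires for a labeled example: small guidance penalty.
    print(guided_sae_loss(x, x, [4.0, 0.0], concept_label=1))
    # Reserved latent is silent for a labeled example: large guidance penalty.
    print(guided_sae_loss(x, x, [-4.0, 0.0], concept_label=1))
```

In this sketch the guidance term only touches the reserved latent, so the remaining latents are still shaped by reconstruction and sparsity alone; whether the paper uses the same split is not stated in the summary above.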

Findings

01 G-SAE improves feature localization and disentanglement.

02 Enhanced interpretability and control with less quality loss.

03 Effective in toxicity, style, and privacy detection.

Abstract

There is growing interest in leveraging mechanistic interpretability and controllability to better understand and influence the internal dynamics of large language models (LLMs). However, current methods face fundamental challenges in reliably localizing and manipulating feature representations. Sparse Autoencoders (SAEs) have recently emerged as a promising direction for feature extraction at scale, yet they, too, are limited by incomplete feature isolation and unreliable monosemanticity. To systematically quantify these limitations, we introduce the Feature Monosemanticity Score (FMS), a novel metric for feature monosemanticity in latent representations. Building on these insights, we propose Guided Sparse Autoencoders (G-SAE), a method that conditions latent representations on labeled concepts during training. We demonstrate that reliable localization and disentanglement of target…

Videos

Measuring and Guiding Monosemanticity · SlidesLive

Taxonomy

Topics: Linguistics, Language Diversity, and Identity · Multilingual Education and Policy · Gender Studies in Language