MonoLoss: A Training Objective for Interpretable Monosemantic Representations

Ali Nasiri-Sarvi; Anh Tien Nguyen; Hassan Rivaz; Dimitris Samaras; Mahdi S. Hosseini

arXiv:2602.12403·cs.CV·February 16, 2026

Open Access

TL;DR

MonoLoss is a training objective that makes neural representations more interpretable by directly encouraging monosemantic features. It improves class purity, and it relies on a linear-time MonoScore computation that is cheap enough to use during both training and evaluation.

Contribution

The paper proposes MonoLoss, an efficient training objective that directly optimizes monosemanticity in neural representations, yielding more interpretable features and accuracy gains of up to 0.6% on ImageNet-1K.

Findings

01

MonoLoss increases MonoScore across various models and features.

02

MonoLoss significantly improves class purity of latent representations.

03

MonoLoss yields up to 0.6% accuracy gains on ImageNet-1K.

Abstract

Sparse autoencoders (SAEs) decompose polysemantic neural representations, where neurons respond to multiple unrelated concepts, into monosemantic features that capture single, interpretable concepts. However, standard training objectives only weakly encourage this decomposition, and existing monosemanticity metrics require pairwise comparisons across all dataset samples, making them inefficient during training and evaluation. We study a recent MonoScore metric and derive a single-pass algorithm that computes exactly the same quantity, but with a cost that grows linearly, rather than quadratically, with the number of dataset images. On OpenImagesV7, we achieve up to a 1200x wall-clock speedup in evaluation and 159x during training, while adding only ~4% per-epoch overhead. This allows us to treat MonoScore as a training signal: we introduce the Monosemanticity Loss (MonoLoss), a…
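To illustrate how a pairwise metric can collapse into a single pass, here is a minimal sketch. It assumes a score of the form "activation-weighted mean cosine similarity over all pairs n != m", with weights a_n * a_m; this form is an assumption made for illustration, and the paper's exact MonoScore definition and algorithm may differ. The function names and the batch_size parameter are hypothetical. The trick is the identity sum over n != m of a_n a_m <e_n, e_m> = ||sum_n a_n e_n||^2 - sum_n a_n^2 for unit-norm embeddings e_n, so only three running accumulators are needed.

```python
import numpy as np

def pairwise_score(acts, embs):
    """Quadratic reference: weighted mean cosine similarity over pairs n != m.
    Assumed form of the score, not necessarily the paper's exact definition."""
    e = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sim = e @ e.T                       # (N, N) cosine similarities
    w = np.outer(acts, acts)            # (N, N) pair weights a_n * a_m
    off_diag = ~np.eye(len(acts), dtype=bool)
    return (w * sim)[off_diag].sum() / w[off_diag].sum()

def single_pass_score(acts, embs, batch_size=256):
    """Linear-time equivalent: one pass over the data with three accumulators."""
    weighted_sum = np.zeros(embs.shape[1])  # running sum_n a_n * e_n
    act_sum = 0.0                           # running sum_n a_n
    act_sq_sum = 0.0                        # running sum_n a_n^2
    for start in range(0, len(acts), batch_size):
        a = acts[start:start + batch_size]
        e = embs[start:start + batch_size]
        e = e / np.linalg.norm(e, axis=1, keepdims=True)
        weighted_sum += a @ e
        act_sum += a.sum()
        act_sq_sum += np.sum(a ** 2)
    # Subtract the n == m terms, which the pairwise sum excludes.
    numerator = weighted_sum @ weighted_sum - act_sq_sum
    denominator = act_sum ** 2 - act_sq_sum
    return numerator / denominator

rng = np.random.default_rng(0)
acts = rng.random(500)                  # non-negative activations in [0, 1)
embs = rng.normal(size=(500, 16))       # stand-in image embeddings
slow = pairwise_score(acts, embs)
fast = single_pass_score(acts, embs)
```

Under these assumptions the two functions agree up to floating-point error, while the second never materializes an N x N matrix. Because the accumulators are simple sums, they can be updated batch by batch, which is what makes a score of this shape usable as a training signal.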

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning