MonoLoss: A Training Objective for Interpretable Monosemantic Representations
Ali Nasiri-Sarvi, Anh Tien Nguyen, Hassan Rivaz, Dimitris Samaras, Mahdi S. Hosseini

TL;DR
MonoLoss is a training objective that directly encourages monosemantic features, improving the class purity of learned representations. It builds on a linear-time computation of the MonoScore metric that makes evaluation far cheaper.
Contribution
The paper proposes MonoLoss, a novel, efficient training objective that directly optimizes monosemanticity in neural representations, enabling more interpretable features and improved performance.
Findings
MonoLoss increases MonoScore across various models and features.
MonoLoss significantly improves class purity of latent representations.
MonoLoss yields up to 0.6% accuracy gains on ImageNet-1K.
Abstract
Sparse autoencoders (SAEs) decompose polysemantic neural representations, where neurons respond to multiple unrelated concepts, into monosemantic features that capture single, interpretable concepts. However, standard training objectives only weakly encourage this decomposition, and existing monosemanticity metrics require pairwise comparisons across all dataset samples, making them inefficient during training and evaluation. We study a recent MonoScore metric and derive a single-pass algorithm that computes exactly the same quantity, but with a cost that grows linearly, rather than quadratically, with the number of dataset images. On OpenImagesV7, we achieve up to a 1200x wall-clock speedup in evaluation and 159x during training, while adding only ~4% per-epoch overhead. This allows us to treat MonoScore as a training signal: we introduce the Monosemanticity Loss (MonoLoss), a…
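The abstract's claim that a pairwise metric can be computed in a single pass can be illustrated with a small sketch. This summary does not state the MonoScore formula, so the code below assumes one plausible form: for a single feature, the activation-weighted mean cosine similarity between image embeddings over all pairs n != m. Under that assumption, the double sum collapses because the sum over all pairs of a_n * a_m * <u_n, u_m> equals the squared norm of the sum of a_n * u_n, with the diagonal terms subtracted afterwards. The function names `pairwise_score` and `single_pass_score` are hypothetical and are not taken from the paper.

```python
import math
import random


def _unit(v):
    """Scale a vector to unit L2 norm so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def pairwise_score(acts, embs):
    """O(N^2) reference under the ASSUMED metric form (not the paper's definition):
    activation-weighted mean cosine similarity over all pairs n != m."""
    units = [_unit(e) for e in embs]
    num = 0.0
    den = 0.0
    for n in range(len(acts)):
        for m in range(len(acts)):
            if n == m:
                continue
            w = acts[n] * acts[m]
            num += w * sum(p * q for p, q in zip(units[n], units[m]))
            den += w
    return num / den


def single_pass_score(acts, embs):
    """O(N) version of the same assumed quantity, using running sums only."""
    dim = len(embs[0])
    vec_sum = [0.0] * dim  # running sum of a_n * u_n
    a_sum = 0.0            # running sum of a_n
    a_sq_sum = 0.0         # running sum of a_n^2 (the diagonal terms)
    for a, e in zip(acts, embs):
        u = _unit(e)
        for i in range(dim):
            vec_sum[i] += a * u[i]
        a_sum += a
        a_sq_sum += a * a
    # ||sum_n a_n u_n||^2 counts every pair including n == m; since each
    # u_n has unit norm, the diagonal contributes exactly sum_n a_n^2.
    num = sum(x * x for x in vec_sum) - a_sq_sum
    den = a_sum * a_sum - a_sq_sum
    return num / den


if __name__ == "__main__":
    rng = random.Random(0)
    acts = [rng.random() for _ in range(200)]  # non-negative activations
    embs = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(200)]
    print(pairwise_score(acts, embs), single_pass_score(acts, embs))
```

Both functions return the same value up to floating-point error, but the second touches each sample once. Because the single-pass form is built from sums and products, it is also a natural candidate for a differentiable training signal, which is the role the abstract assigns to MonoLoss. The paper's actual derivation and loss may differ from this sketch.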
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning
