Beyond Interpretability: The Gains of Feature Monosemanticity on Model Robustness
Qi Zhang, Yifei Wang, Jingyi Cui, Xiang Pan, Qi Lei, Stefanie Jegelka, Yisen Wang

TL;DR
This paper demonstrates that feature monosemanticity enhances both interpretability and robustness of deep learning models, achieving performance gains across various challenging scenarios without sacrificing accuracy.
Contribution
The paper challenges the belief in an accuracy-interpretability tradeoff by showing that monosemantic features improve both robustness and accuracy, supported by empirical and theoretical analysis.
Findings
Monosemantic features lead to better robustness against noise and out-of-domain data.
Models that use monosemantic features achieve higher accuracy than models that rely on polysemantic features.
Empirical and theoretical evidence links monosemanticity to improved decision boundary separation.
Abstract
Deep learning models often suffer from a lack of interpretability due to polysemanticity, where individual neurons are activated by multiple unrelated semantics, resulting in unclear attributions of model behavior. Recent advances in monosemanticity, where neurons correspond to consistent and distinct semantics, have significantly improved interpretability but are commonly believed to compromise accuracy. In this work, we challenge the prevailing belief of the accuracy-interpretability tradeoff, showing that monosemantic features not only enhance interpretability but also bring concrete gains in model performance. Across multiple robust learning scenarios, including input and label noise, few-shot learning, and out-of-domain generalization, our results show that models leveraging monosemantic features significantly outperform those relying on polysemantic features. Furthermore, we provide…
Taxonomy
Topics: Natural Language Processing Techniques
