Beyond Interpretability: The Gains of Feature Monosemanticity on Model Robustness

Qi Zhang; Yifei Wang; Jingyi Cui; Xiang Pan; Qi Lei; Stefanie Jegelka; Yisen Wang

arXiv:2410.21331·cs.LG·October 30, 2024

Open Access 1 Repo

TL;DR

This paper demonstrates that feature monosemanticity enhances both interpretability and robustness of deep learning models, achieving performance gains across various challenging scenarios without sacrificing accuracy.

Contribution

The paper challenges the commonly assumed accuracy-interpretability tradeoff by showing that monosemantic features improve both robustness and accuracy, with support from empirical and theoretical analysis.

Findings

01

Monosemantic features lead to better robustness against noise and out-of-domain data.

02

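Models that rely on monosemantic features outperform those that rely on polysemantic features across the robust learning scenarios studied (input and label noise, few-shot learning, and out-of-domain generalization).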

03

Empirical and theoretical evidence links monosemanticity to improved decision boundary separation.

Abstract

Deep learning models often suffer from a lack of interpretability due to polysemanticity, where individual neurons are activated by multiple unrelated semantics, resulting in unclear attributions of model behavior. Recent advances in monosemanticity, where neurons correspond to consistent and distinct semantics, have significantly improved interpretability but are commonly believed to compromise accuracy. In this work, we challenge the prevailing belief of the accuracy-interpretability tradeoff, showing that monosemantic features not only enhance interpretability but also bring concrete gains in model performance. Across multiple robust learning scenarios (including input and label noise, few-shot learning, and out-of-domain generalization), our results show that models leveraging monosemantic features significantly outperform those relying on polysemantic features. Furthermore, we provide…

Code & Models

Repositories

pku-ml/beyond_interpretability (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques