SAFR: Neuron Redistribution for Interpretability

Ruidi Chang; Chunyuan Deng; Hanjie Chen

arXiv:2501.16374·cs.LG·February 12, 2025

TL;DR

SAFR is a regularization method that enhances neural network interpretability by promoting clearer neuron feature representations, while maintaining prediction accuracy, through targeted loss function modifications.

Contribution

This paper introduces SAFR, a novel regularization technique that improves the interpretability of transformer models by controlling neuron superposition, a property not previously exploited for this purpose.

Findings

01. SAFR improves interpretability without reducing accuracy

02. Neuron representations become more monosemantic with SAFR

03. SAFR enables visualization of neuron allocations in models

Abstract

Superposition refers to encoding representations of multiple features within a single neuron, which is common in deep neural networks. This property allows neurons to combine and represent multiple features, enabling the model to capture intricate information and handle complex tasks. Although this yields promising performance, it diminishes the model's interpretability. This paper presents a novel approach to enhance model interpretability by regularizing feature superposition. We introduce SAFR, which adds regularization terms to the loss function to promote monosemantic representations for important tokens while encouraging polysemanticity for correlated token pairs. Important tokens are identified via VMASK, and correlated token pairs via attention weights. We evaluate SAFR with a transformer model on two classification tasks. Experiments demonstrate the…
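The abstract describes the loss only in words, so the following is a minimal sketch of how such a regularized objective could be assembled, not the paper's actual formulation. Everything here is an assumption made for illustration: the function names, the use of activation entropy as a monosemanticity penalty, the use of cosine similarity as a polysemanticity reward, and the weights `lam_mono` and `lam_poly`. The important-token indices and correlated pairs are taken as given inputs; in SAFR they would come from VMASK and attention weights respectively.

```python
import math


def neuron_entropy(activations):
    """Entropy of one token's normalized absolute activations across neurons.

    Low entropy means the token is concentrated on few neurons
    (a stand-in for "more monosemantic" in this sketch).
    """
    mags = [abs(a) for a in activations]
    total = sum(mags)
    if total == 0.0:
        return 0.0
    probs = [m / total for m in mags]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def cosine(u, v):
    """Cosine similarity between two tokens' neuron activation vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def regularized_loss(task_loss, acts, important, pairs,
                     lam_mono=0.1, lam_poly=0.1):
    """Hypothetical combined objective: task loss plus two regularizers.

    acts:      list of per-token activation vectors (one list per token)
    important: indices of important tokens (assumed given, e.g. from VMASK)
    pairs:     (i, j) index pairs of correlated tokens (assumed given,
               e.g. from attention weights)
    """
    # Penalize spread-out activations for important tokens.
    mono = 0.0
    if important:
        mono = sum(neuron_entropy(acts[i]) for i in important) / len(important)
    # Reward shared neurons for correlated token pairs.
    poly = 0.0
    if pairs:
        poly = sum(cosine(acts[i], acts[j]) for i, j in pairs) / len(pairs)
    return task_loss + lam_mono * mono - lam_poly * poly


# Token 0 is important and uses a single neuron; tokens 1 and 2 are a
# correlated pair that share the same neurons.
acts = [[1.0, 0.0], [0.5, 0.5], [0.5, 0.5]]
loss = regularized_loss(1.0, acts, important=[0], pairs=[(1, 2)])
```

In this toy input the important token already has zero entropy and the correlated pair is perfectly aligned, so both regularizers are at their best values and the total falls below the task loss alone. A real implementation would compute these terms on differentiable tensors so they can be backpropagated.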

Code & Models

Repositories

chili-lab/safr (PyTorch, official)

Videos

SAFR: Neuron Redistribution for Interpretability · Underline

Taxonomy

Topics: Neural Networks and Applications

Methods: Softmax · Attention Is All You Need