Tokenized SAEs: Disentangling SAE Reconstructions
Thomas Dooms, Daniel Wilhelm

TL;DR
This paper investigates how closely sparse auto-encoder (SAE) features track computationally important directions in language models. It finds that many features mainly reflect simple input statistics, and it proposes a method that disentangles token reconstruction from feature reconstruction so that more meaningful features are learned.
Contribution
The paper introduces a per-token bias that separates token reconstruction from feature reconstruction, giving the SAE a stronger reconstruction baseline and improving the features it learns.
Findings
Many RES-JB SAE features predominantly correspond to simple input statistics
Disentangling token reconstruction from feature reconstruction yields significantly more interesting features
Reconstruction improves in sparse regimes
Abstract
Sparse auto-encoders (SAEs) have become a prevalent tool for interpreting language models' inner workings. However, it is unknown how tightly SAE features correspond to computationally important directions in the model. This work empirically shows that many RES-JB SAE features predominantly correspond to simple input statistics. We hypothesize this is caused by a large class imbalance in training data combined with a lack of complex error signals. To reduce this behavior, we propose a method that disentangles token reconstruction from feature reconstruction. This improvement is achieved by introducing a per-token bias, which provides an enhanced baseline for interesting reconstruction. As a result, significantly more interesting features and improved reconstruction in sparse regimes are learned.
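The per-token bias described in the abstract can be pictured with a small sketch. The code below is not the authors' implementation: the class name `TokenizedSAE`, the ReLU encoder, the initialization, and the choice to add the token bias only on the decoder side are all assumptions made for illustration. It shows one plausible reading, in which the reconstruction is the sum of a feature part (the usual SAE decoder output) and a token part (a bias vector looked up by the input token id).

```python
import numpy as np


class TokenizedSAE:
    """Toy SAE with an added per-token bias (illustrative sketch, not the paper's code)."""

    def __init__(self, d_model, d_hidden, vocab_size, seed=0):
        rng = np.random.default_rng(seed)
        # Standard SAE parameters (the initialization scale is an arbitrary choice).
        self.W_enc = rng.normal(0.0, 0.1, size=(d_model, d_hidden))
        self.b_enc = np.zeros(d_hidden)
        self.W_dec = rng.normal(0.0, 0.1, size=(d_hidden, d_model))
        self.b_dec = np.zeros(d_model)
        # Assumed form of the per-token bias: one vector per vocabulary entry.
        self.token_bias = np.zeros((vocab_size, d_model))

    def encode(self, x):
        # ReLU feature activations; sparsity penalties are omitted in this sketch.
        return np.maximum(0.0, x @ self.W_enc + self.b_enc)

    def forward(self, x, token_ids):
        features = self.encode(x)
        feature_part = features @ self.W_dec + self.b_dec
        # Look up the bias of the token at each position.
        token_part = self.token_bias[token_ids]
        return feature_part + token_part, features


# Usage: five positions, two of which share token id 3.
sae = TokenizedSAE(d_model=8, d_hidden=32, vocab_size=100)
x = np.random.default_rng(1).normal(size=(5, 8))
token_ids = np.array([3, 17, 3, 42, 99])
x_hat, features = sae.forward(x, token_ids)
```

Under this reading, the token lookup can absorb whatever part of the activation is predictable from the token alone, so the sparse features are left to reconstruct the remainder. How the bias is initialized and trained, and whether the encoder also sees it, is not stated in the abstract and is left open here.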
Taxonomy
Topics: Digital and Cyber Forensics
