Tokenized SAEs: Disentangling SAE Reconstructions

Thomas Dooms; Daniel Wilhelm

arXiv:2502.17332·cs.LG·February 25, 2025

Open Access

TL;DR

This paper examines sparse auto-encoders (SAEs) used to interpret language models, shows that many of their features track simple input statistics, and proposes separating token reconstruction from feature reconstruction so that the SAE learns more meaningful features.

Contribution

The paper introduces a per-token bias that disentangles token reconstruction from feature reconstruction, improving the features learned by sparse auto-encoders for language models.

Findings

01 Many RES-JB SAE features predominantly correspond to simple input statistics

02 Disentangling token reconstruction from feature reconstruction yields more interesting features

03 Reconstruction improves in sparse regimes

Abstract

Sparse auto-encoders (SAEs) have become a prevalent tool for interpreting language models' inner workings. However, it is unknown how tightly SAE features correspond to computationally important directions in the model. This work empirically shows that many RES-JB SAE features predominantly correspond to simple input statistics. We hypothesize this is caused by a large class imbalance in training data combined with a lack of complex error signals. To reduce this behavior, we propose a method that disentangles token reconstruction from feature reconstruction. This improvement is achieved by introducing a per-token bias, which provides an enhanced baseline for interesting reconstruction. As a result, significantly more interesting features and improved reconstruction in sparse regimes are learned.
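
The per-token bias described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give the exact architecture, so the sketch assumes a plain ReLU SAE whose reconstruction adds a lookup-table vector indexed by the input token id. All names (`token_bias`, `tokenized_sae_forward`), the sizes, and the random weights are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
vocab_size, d_model, n_features = 50, 16, 64

# Standard SAE parameters (random here; they would normally be trained).
W_enc = rng.normal(0.0, 0.1, size=(d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(0.0, 0.1, size=(n_features, d_model))
b_dec = np.zeros(d_model)

# Assumed per-token bias: one d_model vector per vocabulary entry.
token_bias = rng.normal(0.0, 0.1, size=(vocab_size, d_model))


def tokenized_sae_forward(x, token_ids):
    """Return (reconstruction, feature activations) for a batch.

    x:         (batch, d_model) activations being reconstructed
    token_ids: (batch,) integer id of the input token at each position
    """
    baseline = token_bias[token_ids]             # token reconstruction (lookup)
    feats = np.maximum(x @ W_enc + b_enc, 0.0)   # feature code (ReLU encoder)
    recon = feats @ W_dec + b_dec + baseline     # feature part plus token part
    return recon, feats


# Usage: four activations with their (made-up) token ids.
x = rng.normal(size=(4, d_model))
recon, feats = tokenized_sae_forward(x, np.array([3, 7, 7, 42]))
```

In this sketch the token lookup supplies a baseline for each position, so the learned features only need to account for what the token identity alone does not explain. Whether the encoder should also see the token baseline is a design choice the abstract leaves open; here it does not.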

Taxonomy

Topics: Digital and Cyber Forensics