FaithfulSAE: Towards Capturing Faithful Features with Sparse Autoencoders without External Dataset Dependencies

Seonglae Cho; Harryn Oh; Donghyun Lee; Luis Eduardo Rodrigues Vieira; Andrew Bermingham; Ziad El Sayed

arXiv:2506.17673 · cs.LG · June 24, 2025

TL;DR

FaithfulSAE trains sparse autoencoders (SAEs) on a model's own synthetic data instead of external datasets. This improves the stability and interpretability of the learned features and reduces hallucinated features.

Contribution

This paper presents FaithfulSAE, an approach that trains SAEs on synthetic data generated by the model itself, improving feature fidelity and stability without depending on external datasets.

Findings

01 SAEs trained on internal synthetic data are more stable across seeds.

02 FaithfulSAE outperforms SAEs trained on web data in probing tasks.

03 FaithfulSAE yields a lower Fake Feature Ratio in most models.

Abstract

Sparse Autoencoders (SAEs) have emerged as a promising solution for decomposing large language model representations into interpretable features. However, Paulo and Belrose (2025) have highlighted instability across different initialization seeds, and Heap et al. (2025) have pointed out that SAEs may not capture model-internal features. These problems likely stem from training SAEs on external datasets - either collected from the Web or generated by another model - which may contain out-of-distribution (OOD) data beyond the model's generalisation capabilities. This can result in hallucinated SAE features, which we term "Fake Features", that misrepresent the model's internal activations. To address these issues, we propose FaithfulSAE, a method that trains SAEs on the model's own synthetic dataset. Using FaithfulSAEs, we demonstrate that training SAEs on less-OOD instruction datasets…
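The seed-instability problem described above can be made concrete with a small sketch. It compares the decoder directions of two SAEs and reports what fraction of features in one has a close match in the other. This is an illustration only, not necessarily the metric the paper uses: the function name `matched_feature_fraction`, the cosine threshold of 0.7, and the random matrices standing in for trained decoders are all assumptions.

```python
import numpy as np


def matched_feature_fraction(dec_a, dec_b, threshold=0.7):
    """Fraction of rows in dec_a whose best cosine match in dec_b exceeds threshold.

    dec_a, dec_b: arrays of shape (n_features, d_model), one decoder
    direction per row. The threshold is an illustrative choice.
    """
    # Normalize each feature direction to unit length.
    a = dec_a / np.linalg.norm(dec_a, axis=1, keepdims=True)
    b = dec_b / np.linalg.norm(dec_b, axis=1, keepdims=True)
    # Pairwise cosine similarities, shape (n_features_a, n_features_b).
    sims = a @ b.T
    # For every feature in A, keep its best match in B.
    best = sims.max(axis=1)
    return float((best > threshold).mean())


# Stand-ins for trained decoders (hypothetical data, not real SAE weights).
rng = np.random.default_rng(0)
base = rng.normal(size=(64, 32))

# "Stable" case: the same features, reordered and slightly perturbed.
perm = rng.permutation(64)
same = base[perm] + 0.01 * rng.normal(size=(64, 32))

# "Unstable" case: unrelated random directions.
other = rng.normal(size=(64, 32))

ratio_same = matched_feature_fraction(base, same)
ratio_other = matched_feature_fraction(base, other)
print(ratio_same, ratio_other)
```

Taking the maximum over all candidate matches makes the score insensitive to feature ordering, which matters because two SAEs trained from different seeds have no reason to place the same feature at the same index.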
