Resurrecting the Salmon: Rethinking Mechanistic Interpretability with Domain-Specific Sparse Autoencoders
Charles O'Neill, Mudith Jayasekara, Max Kirkby

TL;DR
This paper demonstrates that restricting sparse autoencoder training to a specific domain like medical text improves the interpretability and fidelity of latent features, revealing more meaningful structure in language model activations.
Contribution
It introduces a domain-specific training approach for SAEs that enhances interpretability and reconstruction accuracy over broad-domain methods, challenging the need for large-scale general-purpose autoencoders.
Findings
Domain-specific SAEs explain up to 20% more variance than broad-domain SAEs.
They achieve higher loss recovery and lower linear residual error.
Their features align with clinically meaningful concepts.
Abstract
Sparse autoencoders (SAEs) decompose large language model (LLM) activations into latent features that reveal mechanistic structure. Conventional SAEs train on broad data distributions, forcing a fixed latent budget to capture only high-frequency, generic patterns. This often results in significant linear "dark matter" in reconstruction error and produces latents that fragment or absorb each other, complicating interpretation. We show that restricting SAE training to a well-defined domain (medical text) reallocates capacity to domain-specific features, improving both reconstruction fidelity and interpretability. Training JumpReLU SAEs on layer-20 activations of Gemma-2 models using 195k clinical QA examples, we find that domain-confined SAEs explain up to 20% more variance, achieve higher loss recovery, and reduce linear residual error compared to broad-domain SAEs. Automated and…
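To make the terms in the abstract concrete, here is a minimal sketch of a JumpReLU SAE forward pass and a variance-explained score, in plain Python. It is an illustration only, not the authors' implementation: the function names, the weight layout (`W_enc` and `W_dec` with one row per latent), and the exact variance-explained formula (one minus residual sum of squares over total sum of squares around the mean activation) are assumptions, since the summary above does not specify them.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def jump_relu(pre: Vector, theta: Vector) -> Vector:
    """JumpReLU: pass a pre-activation through only if it exceeds its threshold."""
    return [p if p > t else 0.0 for p, t in zip(pre, theta)]


def sae_forward(
    x: Vector,
    W_enc: Matrix,   # assumed layout: one row per latent, length = activation dim
    b_enc: Vector,
    W_dec: Matrix,   # assumed layout: one row per latent, length = activation dim
    b_dec: Vector,
    theta: Vector,   # per-latent JumpReLU thresholds
) -> Tuple[Vector, Vector]:
    """Encode an activation into sparse latents, then reconstruct it."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W_enc, b_enc)]
    z = jump_relu(pre, theta)
    x_hat = [
        b_dec[j] + sum(z[i] * W_dec[i][j] for i in range(len(z)))
        for j in range(len(b_dec))
    ]
    return z, x_hat


def explained_variance(xs: List[Vector], x_hats: List[Vector]) -> float:
    """Assumed definition: 1 - (residual sum of squares) / (total sum of squares)."""
    d = len(xs[0])
    mean = [sum(x[j] for x in xs) / len(xs) for j in range(d)]
    resid = sum((x[j] - xh[j]) ** 2 for x, xh in zip(xs, x_hats) for j in range(d))
    total = sum((x[j] - mean[j]) ** 2 for x in xs for j in range(d))
    return 1.0 - resid / total
```

Under this definition, a reconstruction that matches every activation scores 1.0 and one that always predicts the mean activation scores 0.0, so a claim such as "up to 20% more variance" would compare this score between a domain-specific and a broad-domain SAE on the same activations.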
