Teach Old SAEs New Domain Tricks with Boosting

Nikita Koriagin; Yaroslav Aksenov; Daniil Laptev; Gleb Gerasimov; Nikita Balagansky; Daniil Gavrilov

arXiv:2507.12990 · cs.LG · July 18, 2025

TL;DR

This paper presents a residual learning method that enhances Sparse Autoencoders' ability to capture domain-specific features in Large Language Models without full retraining, improving interpretability across specialized domains.

Contribution

The paper introduces a secondary SAE trained on the reconstruction error of a pretrained SAE on domain-specific texts. The secondary SAE captures domain features the primary SAE misses, so the primary SAE does not need to be retrained.

Findings

01. Significant improvements in both LLM cross-entropy and explained variance across multiple specialized domains.

02. Efficient incorporation of new domain knowledge into existing SAEs.

03. Performance on general tasks is maintained while domain-specific interpretability improves.

Abstract

Sparse Autoencoders have emerged as powerful tools for interpreting the internal representations of Large Language Models, yet they often fail to capture domain-specific features not prevalent in their training corpora. This paper introduces a residual learning approach that addresses this feature blindness without requiring complete retraining. We propose training a secondary SAE specifically to model the reconstruction error of a pretrained SAE on domain-specific texts, effectively capturing features missed by the primary model. By summing the outputs of both models during inference, we demonstrate significant improvements in both LLM cross-entropy and explained variance metrics across multiple specialized domains. Our experiments show that this method efficiently incorporates new domain knowledge into existing SAEs while maintaining their performance on general tasks. This approach…
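
The residual scheme described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the synthetic activations, the identity-based stand-in for the pretrained SAE, the ReLU architecture, the L1 penalty, the hand-written gradient descent, and every name and hyperparameter below are assumptions made for illustration. In particular, the sketch assumes the secondary SAE reads the original activations and is trained to predict the primary SAE's reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k_general, n_latents, n = 16, 8, 16, 512  # hypothetical toy sizes

# Synthetic "domain" activations: general features in the first 8 dims,
# sparse domain-only features in the last 8 dims.
general = rng.random((n, k_general))
mask = rng.random((n, d - k_general)) < 0.3
domain = rng.random((n, d - k_general)) * mask
x = np.concatenate([general, domain], axis=1)

# Stand-in for a frozen pretrained SAE that only represents general features.
W_enc_primary = np.eye(d)[:, :k_general]
W_dec_primary = np.eye(d)[:k_general, :]

def primary_sae(x):
    latents = np.maximum(x @ W_enc_primary, 0.0)
    return latents @ W_dec_primary

def explained_variance(x, x_hat):
    return 1.0 - ((x - x_hat) ** 2).sum() / ((x - x.mean(0)) ** 2).sum()

# Secondary (residual) SAE parameters, trained from scratch.
W_enc = rng.normal(0.0, 0.1, (d, n_latents))
b_enc = np.zeros(n_latents)
W_dec = rng.normal(0.0, 0.1, (n_latents, d))
b_dec = np.zeros(d)

def residual_sae(x):
    return np.maximum(x @ W_enc + b_enc, 0.0) @ W_dec + b_dec

# Training target: what the frozen primary SAE fails to reconstruct.
target = x - primary_sae(x)

lr, l1 = 0.02, 1e-3  # assumed hyperparameters
for _ in range(300):
    pre = x @ W_enc + b_enc
    f = np.maximum(pre, 0.0)
    err = f @ W_dec + b_dec - target
    # Gradients of mean squared error plus L1 penalty on the latents.
    g_out = 2.0 * err / n
    g_pre = (g_out @ W_dec.T + l1 / n) * (pre > 0)
    W_dec -= lr * (f.T @ g_out)
    b_dec -= lr * g_out.sum(0)
    W_enc -= lr * (x.T @ g_pre)
    b_enc -= lr * g_pre.sum(0)

# Inference: sum the outputs of both SAEs.
ev_primary = explained_variance(x, primary_sae(x))
ev_boosted = explained_variance(x, primary_sae(x) + residual_sae(x))
print(f"explained variance, primary only: {ev_primary:.3f}")
print(f"explained variance, boosted:      {ev_boosted:.3f}")
```

Only the secondary SAE's parameters are updated in the loop; the primary SAE stays frozen throughout, which is the property that lets the method avoid full retraining. The toy example measures explained variance on the domain data only and says nothing about general-task performance or LLM cross-entropy.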

Taxonomy

Topics: Natural Language Processing Techniques