Ensembling Sparse Autoencoders

Soham Gadgil; Chris Lin; Su-In Lee

arXiv:2505.16077·cs.LG·May 23, 2025

Open Access 3 Reviews

TL;DR

This paper explores ensembling multiple sparse autoencoders to enhance feature diversity, stability, and downstream task performance in language models, demonstrating significant empirical improvements over single autoencoder approaches.

Contribution

It introduces ensemble methods for sparse autoencoders, including bagging and boosting, to improve feature extraction and downstream task effectiveness.

Findings

01

Ensembling SAEs improves activation reconstruction.

02

Ensembling increases feature diversity and stability.

03

Ensembling outperforms single SAEs in downstream tasks.

Abstract

Sparse autoencoders (SAEs) are used to decompose neural network activations into human-interpretable features. Typically, features learned by a single SAE are used for downstream applications. However, it has recently been shown that SAEs trained with different initial weights can learn different features, demonstrating that a single SAE captures only a limited subset of features that can be extracted from the activation space. Motivated by this limitation, we propose to ensemble multiple SAEs through naive bagging and boosting. Specifically, SAEs trained with different weight initializations are ensembled in naive bagging, whereas SAEs sequentially trained to minimize the residual error are ensembled in boosting. We evaluate our ensemble approaches with three settings of language models and SAE architectures. Our empirical results demonstrate that ensembling SAEs can improve the…
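The two strategies named in the abstract can be sketched in a few lines. The following is a minimal toy illustration, not the authors' implementation. It makes several assumptions that the abstract does not state: `TinySAE` is a hypothetical pure-Python ReLU autoencoder trained with plain SGD and an L1 penalty; naive bagging combines members by averaging their reconstructions; boosting combines members by summing their outputs, with each member fit to the residual left by the previous ones. All names, hyperparameters, and the random data are illustrative.

```python
import random

class TinySAE:
    """Toy ReLU sparse autoencoder (hypothetical stand-in for a real SAE)."""

    def __init__(self, d, k, seed):
        rng = random.Random(seed)  # the seed controls the weight initialization
        self.W_enc = [[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(k)]
        self.W_dec = [[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(k)]
        self.b = [0.0] * k

    def encode(self, x):
        return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + bj)
                for row, bj in zip(self.W_enc, self.b)]

    def forward(self, x):
        h = self.encode(x)
        return [sum(hj * row[i] for hj, row in zip(h, self.W_dec))
                for i in range(len(x))]

    def fit(self, data, epochs=100, lr=0.02, l1=1e-3):
        # Plain SGD on 0.5 * ||x_hat - x||^2 + l1 * sum(h).
        for _ in range(epochs):
            for x in data:
                h = self.encode(x)
                err = [xh - xi for xh, xi in zip(self.forward(x), x)]
                for j, hj in enumerate(h):
                    if hj <= 0.0:
                        continue  # inactive ReLU unit: zero gradient
                    g_h = sum(e * w for e, w in zip(err, self.W_dec[j])) + l1
                    for i, xi in enumerate(x):
                        self.W_dec[j][i] -= lr * hj * err[i]
                        self.W_enc[j][i] -= lr * g_h * xi
                    self.b[j] -= lr * g_h
        return self

def naive_bagging(data, d, k, seeds):
    # Same data, different initializations; assumed combination rule: average.
    saes = [TinySAE(d, k, s).fit(data) for s in seeds]
    def reconstruct(x):
        outs = [m.forward(x) for m in saes]
        return [sum(col) / len(outs) for col in zip(*outs)]
    return saes, reconstruct

def boosting(data, d, k, n_rounds, seed=0):
    # Each SAE is trained on the residual left by the SAEs before it.
    saes, residuals = [], [list(x) for x in data]
    for r in range(n_rounds):
        m = TinySAE(d, k, seed + r).fit(residuals)
        saes.append(m)
        residuals = [[ri - oi for ri, oi in zip(res, m.forward(res))]
                     for res in residuals]
    def reconstruct(x):
        # Assumed combination rule: sum of stage outputs on successive residuals.
        total, res = [0.0] * d, list(x)
        for m in saes:
            out = m.forward(res)
            total = [t + o for t, o in zip(total, out)]
            res = [ri - oi for ri, oi in zip(res, out)]
        return total
    return saes, reconstruct

def mse(reconstruct, data):
    return sum(sum((a - b) ** 2 for a, b in zip(reconstruct(x), x))
               for x in data) / len(data)

# Random stand-in "activations": 8 samples of dimension 4.
rng = random.Random(0)
data = [[rng.random() for _ in range(4)] for _ in range(8)]

bag_saes, bag_fn = naive_bagging(data, d=4, k=3, seeds=[1, 2, 3])
boost_saes, boost_fn = boosting(data, d=4, k=3, n_rounds=3)
print("bagging MSE:", mse(bag_fn, data))
print("boosting MSE:", mse(boost_fn, data))
```

The structural difference is that the bagged members are trained independently and could be fit in parallel, whereas the boosted members must be trained in sequence because each one depends on the residual of its predecessors. The printed errors on this toy data say nothing about the paper's reported results.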

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

The authors present a nice formalization of SAE ensembling. The ensembled SAEs show significantly better performance, even when compared with single SAEs that have the same number of features as the ensemble. The authors look not only at reconstruction loss but also at whether these techniques improve training stability, whether the ensembled SAEs can be used for concept detection, and whether they can detect spurious correlations.

Weaknesses

Even though the SAEs were compared at an equal number of features, they were not compared at the same training compute or training time. Both boosting and bagging are significantly slower to train, with training times almost an order of magnitude larger for some of the model sizes. Boosting seems very similar to matching pursuit SAEs (From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit), if I'm understanding it correctly, but that work is never mentioned in the paper.

Reviewer 02 · Rating 6 · Confidence 4

Strengths

The paper is clearly written overall, and the evaluations cover what is expected for work on SAEs. I encourage the authors to open-source naive bagging and boosting by implementing them in one of the popular dictionary learning repos, such as decoderesearch/SAELens or saprmarks/dictionary_learning.

Weaknesses

I will defer to the AC to judge whether the contribution is big enough to meet the ICLR bar. Idea for follow-up: apply bagging across Specialized SAEs fine-tuned on subdomains (SSAE, https://arxiv.org/abs/2411.00743).

Reviewer 03 · Rating 4 · Confidence 3

Strengths

- Clear writing style
- Related work shows a good knowledge of the relevant literature
- Figure 1 is helpful in understanding the core technique
- The propositions and proofs are helpful in justifying the argument
- The mathematical framework is mostly clear (though the notation is non-standard; see below)
- Presents a useful way to improve the performance of SAEs
- Shows results on two downstream tasks
- The evaluation uses multiple models, architectures and tasks/metrics (though see below fo

Weaknesses

- Nits:
- In Section 3.1 the activation functions and the citations are in different orders
- Would suggest against using the letter k for the dimension of the SAE hidden layer as this conflicts with the k in the activation function for top k. Perhaps using f would be clearer.
- All of the notation section uses quite non-standard notation - it would be good to use similar notation to e.g. Gao et al. or a similar paper for readability
- A Pareto plot with Sparsity or Description Length (

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning