Improving Robustness In Sparse Autoencoders via Masked Regularization
Vivek Narayanaswamy, Kowshik Thopalli, Bhavya Kailkhura, Wesam Sakla

TL;DR
This paper introduces a masking-based regularization technique for sparse autoencoders that enhances robustness, reduces feature absorption, and improves interpretability and out-of-distribution performance.
Contribution
The paper proposes a masking-based regularization method that randomly replaces tokens during training to disrupt co-occurrence patterns, improving the robustness and interpretability of sparse autoencoders.
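The token-replacement step can be illustrated with a minimal sketch. This summary does not state the replacement rate, the distribution replacements are drawn from, or how the corrupted sequence enters SAE training, so the uniform sampling, the `replace_prob` parameter, and the function name `randomly_replace_tokens` below are all assumptions for illustration, not the authors' implementation.

```python
import random


def randomly_replace_tokens(token_ids, vocab_size, replace_prob, rng):
    """Return (corrupted_ids, replaced_flags) for one token sequence.

    Assumption: each position is independently replaced, with probability
    replace_prob, by a token id drawn uniformly from range(vocab_size).
    A draw may coincide with the original token; the flag is still True.
    """
    if not 0.0 <= replace_prob <= 1.0:
        raise ValueError("replace_prob must be in [0, 1]")
    corrupted = []
    replaced = []
    for tok in token_ids:
        if rng.random() < replace_prob:
            # Swap in a random token to break co-occurrence with neighbours.
            corrupted.append(rng.randrange(vocab_size))
            replaced.append(True)
        else:
            corrupted.append(tok)
            replaced.append(False)
    return corrupted, replaced


# Hypothetical usage: corrupt a short sequence with a seeded generator.
rng = random.Random(0)
corrupted_ids, flags = randomly_replace_tokens([5, 17, 42, 8], 100, 0.25, rng)
```

In a training loop, one plausible arrangement (again an assumption) would be to run the language model on `corrupted_ids` and train the SAE on the resulting activations, so that features cannot rely on tokens that usually appear together.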
Findings
Reduces feature absorption in sparse autoencoders.
Improves out-of-distribution robustness.
Enhances probing performance of latent representations.
Abstract
Sparse autoencoders (SAEs) are widely used in mechanistic interpretability to project LLM activations onto sparse latent spaces. However, sparsity alone is an imperfect proxy for interpretability, and current training objectives often result in brittle latent representations. SAEs are known to be prone to feature absorption, where general features are subsumed by more specific ones due to co-occurrence, degrading interpretability despite high reconstruction fidelity. Recent negative results on out-of-distribution (OOD) performance further underscore broader robustness-related failures tied to under-specified training objectives. We address this by proposing a masking-based regularization that randomly replaces tokens during training to disrupt co-occurrence patterns. This improves robustness across SAE architectures and sparsity levels, reducing absorption, enhancing probing performance,…
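For readers unfamiliar with the setup, the sketch below shows what projecting an activation vector onto a sparse latent space can look like. It assumes a top-k SAE, which is only one of several architectures in use; the abstract does not say which variants the paper evaluates, and the helper names (`matvec`, `topk_relu`, `sae_forward`) are hypothetical. It is written in plain Python for readability rather than speed.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def topk_relu(pre_acts, k):
    """Apply ReLU, then keep the k largest values and zero out the rest."""
    relu = [max(0.0, a) for a in pre_acts]
    order = sorted(range(len(relu)), key=lambda i: relu[i], reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(relu)]


def sae_forward(x, w_enc, b_enc, w_dec, b_dec, k):
    """Encode x into a sparse latent z, decode it, and report the error.

    Returns (z, x_hat, mse), where mse is the mean squared reconstruction
    error between the input activation x and its reconstruction x_hat.
    """
    pre = [a + b for a, b in zip(matvec(w_enc, x), b_enc)]
    z = topk_relu(pre, k)
    x_hat = [a + b for a, b in zip(matvec(w_dec, z), b_dec)]
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    return z, x_hat, mse
```

A low `mse` with few active latents is the usual training target. The abstract's point is that this target alone does not stop a specific latent from absorbing a more general one when the two co-occur.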
