On the transferability of Sparse Autoencoders for interpreting compressed models
Suchit Gupte, Vishnu Kabir Chhabra, Mohammad Mahdi Khalili

TL;DR
This paper investigates whether Sparse Autoencoders (SAEs) can interpret compressed large language models. It finds that SAEs trained on the original model can interpret the compressed model with only slight performance degradation, which reduces the cost of training new SAEs.
Contribution
The paper demonstrates that SAEs transfer from original to compressed models and that retraining an SAE on the compressed model is often unnecessary, which saves training resources.
Findings
SAEs trained on the original model can interpret the compressed model, with only a slight performance loss relative to an SAE trained on the compressed model.
Pruning the original SAE yields performance comparable to training a new SAE on the pruned model.
Transferability of SAEs reduces the need for extensive retraining on compressed models.
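The two evaluation settings in these findings can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the ReLU encoder with a linear decoder, the magnitude-based pruning criterion, the 50% sparsity level, and all names (`sae_forward`, `magnitude_prune`, `acts_compressed`) are assumptions made for illustration, and the random weights and activations are placeholders for a trained SAE and for activations collected from a compressed model.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """One-hidden-layer SAE: ReLU encoder followed by a linear decoder."""
    features = np.maximum(0.0, x @ W_enc + b_enc)  # non-negative feature activations
    x_hat = features @ W_dec + b_dec               # reconstruction of the input
    return features, x_hat

def magnitude_prune(W, sparsity):
    """Return a copy of W with the `sparsity` fraction of smallest-|w| entries zeroed."""
    k = int(sparsity * W.size)
    if k == 0:
        return W.copy()
    # k-th smallest absolute value acts as the pruning threshold
    threshold = np.partition(np.abs(W).ravel(), k - 1)[k - 1]
    return np.where(np.abs(W) <= threshold, 0.0, W)

def reconstruction_mse(x, x_hat):
    """Mean squared reconstruction error over all entries."""
    return float(np.mean((x - x_hat) ** 2))

rng = np.random.default_rng(0)
d_model, d_sae, n_tokens = 16, 64, 128

# Placeholder for an SAE trained on the ORIGINAL model (random weights here).
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_enc = np.zeros(d_sae)
b_dec = np.zeros(d_model)

# Placeholder for activations collected from the COMPRESSED model.
acts_compressed = rng.normal(size=(n_tokens, d_model))

# Setting 1: apply the original SAE directly to compressed-model activations.
_, x_hat_transfer = sae_forward(acts_compressed, W_enc, b_enc, W_dec, b_dec)
mse_transfer = reconstruction_mse(acts_compressed, x_hat_transfer)

# Setting 2: prune the original SAE's weights, then apply it the same way.
W_enc_p = magnitude_prune(W_enc, 0.5)
W_dec_p = magnitude_prune(W_dec, 0.5)
features_pruned, x_hat_pruned = sae_forward(acts_compressed, W_enc_p, b_enc, W_dec_p, b_dec)
mse_pruned = reconstruction_mse(acts_compressed, x_hat_pruned)

print(f"transfer MSE: {mse_transfer:.4f}, pruned-SAE MSE: {mse_pruned:.4f}")
```

With real data, both errors would be compared against the error of an SAE trained from scratch on the compressed model's activations. The numbers printed here carry no meaning because the weights are random.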
Abstract
Modern LLMs face inference efficiency challenges due to their scale. To address this, many compression methods have been proposed, such as pruning and quantization. However, the effect of compression on a model's interpretability remains elusive. While several model interpretation approaches exist, such as circuit discovery, Sparse Autoencoders (SAEs) have proven particularly effective in decomposing a model's activation space into its feature basis. In this work, we explore the differences in SAEs for the original and compressed models. We find that SAEs trained on the original model can interpret the compressed model, albeit with slight performance degradation compared to an SAE trained on the compressed model. Furthermore, simply pruning the original SAE itself achieves performance comparable to training a new SAE on the pruned model. This finding enables us to mitigate the extensive…
Taxonomy
Topics: Image and Signal Denoising Methods · Computational Physics and Python Applications · Time Series Analysis and Forecasting
