On the transferability of Sparse Autoencoders for interpreting compressed models

Suchit Gupte; Vishnu Kabir Chhabra; Mohammad Mahdi Khalili

arXiv:2507.15977·cs.LG·July 23, 2025

TL;DR

This paper investigates whether Sparse Autoencoders (SAEs) can interpret compressed large language models. It finds that SAEs trained on the original model can interpret the compressed model with only slight performance degradation, which reduces the computational cost of training new SAEs.

Contribution

The paper demonstrates that SAEs transfer from original to compressed models, so retraining an SAE on the compressed model is often unnecessary, which saves training resources.

Findings

01. SAEs trained on original models can interpret compressed models with slight performance loss.

02. Pruning the original SAE yields comparable results to retraining on compressed models.

03. Transferability of SAEs reduces the need for extensive retraining on compressed models.

Abstract

Modern LLMs face inference efficiency challenges due to their scale. To address this, many compression methods have been proposed, such as pruning and quantization. However, the effect of compression on a model's interpretability remains elusive. While several model interpretation approaches exist, such as circuit discovery, Sparse Autoencoders (SAEs) have proven particularly effective in decomposing a model's activation space into its feature basis. In this work, we explore the differences in SAEs for the original and compressed models. We find that SAEs trained on the original model can interpret the compressed model, albeit with slight performance degradation compared to an SAE trained on the compressed model. Furthermore, simply pruning the original SAE itself achieves performance comparable to training a new SAE on the pruned model. This finding enables us to mitigate the extensive…
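
To make the comparison in the abstract concrete, here is a minimal sketch of what "transferring" and "pruning" an SAE could look like. Everything in it is an assumption for illustration: the abstract does not specify the SAE architecture, the pruning criterion, or the evaluation metric, so the sketch uses a plain ReLU autoencoder, magnitude pruning, and reconstruction MSE, with hypothetical names (`sae_forward`, `reconstruction_mse`, `magnitude_prune`) and synthetic random arrays standing in for model activations.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations into non-negative features, then decode them back."""
    features = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU feature activations
    x_hat = features @ W_dec + b_dec               # reconstruction of x
    return features, x_hat

def reconstruction_mse(x, W_enc, b_enc, W_dec, b_dec):
    """Mean squared error between activations and their SAE reconstruction."""
    _, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
    return float(np.mean((x - x_hat) ** 2))

def magnitude_prune(W, sparsity):
    """Zero the `sparsity` fraction of entries with the smallest magnitude.

    Assumption: magnitude pruning is used here only as a simple stand-in;
    entries tied with the threshold are zeroed as well.
    """
    k = int(sparsity * W.size)
    if k == 0:
        return W.copy()
    threshold = np.partition(np.abs(W).ravel(), k - 1)[k - 1]
    return np.where(np.abs(W) <= threshold, 0.0, W)

# Synthetic stand-ins (hypothetical): untrained SAE weights and random
# "activations", so the printed numbers carry no empirical meaning.
rng = np.random.default_rng(0)
d_model, d_features = 8, 32
W_enc = rng.normal(scale=0.1, size=(d_model, d_features))
W_dec = rng.normal(scale=0.1, size=(d_features, d_model))
b_enc = np.zeros(d_features)
b_dec = np.zeros(d_model)

acts_original = rng.normal(size=(64, d_model))
# Pretend the compressed model yields slightly perturbed activations.
acts_compressed = acts_original + 0.05 * rng.normal(size=(64, d_model))

# Option 1: apply the original SAE directly to the compressed model.
mse_transfer = reconstruction_mse(acts_compressed, W_enc, b_enc, W_dec, b_dec)

# Option 2: prune the original SAE's weights, then apply it.
W_enc_p = magnitude_prune(W_enc, 0.5)
W_dec_p = magnitude_prune(W_dec, 0.5)
mse_pruned = reconstruction_mse(acts_compressed, W_enc_p, b_enc, W_dec_p, b_dec)

print(f"transfer MSE: {mse_transfer:.4f}, pruned-SAE MSE: {mse_pruned:.4f}")
```

In a real experiment, the weights would come from an SAE trained on the original model, the activations would be collected from the original and compressed models at the same layer, and both options would be compared against an SAE trained from scratch on the compressed model.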

Taxonomy

Topics: Image and Signal Denoising Methods · Computational Physics and Python Applications · Time Series Analysis and Forecasting