How Pruning Reshapes Features: Sparse Autoencoder Analysis of Weight-Pruned Language Models
Hector Borobia, Elies Seguí-Mas, Guillermina Tormo-Carbó

TL;DR
This study systematically analyzes how weight pruning affects internal representations of language models using Sparse Autoencoders, revealing that pruning preferentially preserves rare, specialized features over frequent ones, with implications for interpretability.
Contribution
The paper presents the first systematic analysis of how unstructured pruning changes feature geometry in language models, showing that rare features survive better than frequent ones and comparing magnitude and Wanda pruning.
Findings
Rare features survive pruning better than frequent ones.
Wanda pruning preserves feature structure more effectively than magnitude pruning.
Pre-trained SAEs remain viable on pruned models at up to 50% sparsity.
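The two pruning methods compared above differ only in how each weight is scored before the lowest-scoring ones are removed. The following is a minimal sketch, not the authors' code: magnitude pruning scores a weight by |w|, while Wanda (as commonly described) scores it by |w| times the L2 norm of the matching input activation over a calibration batch, compared within each output row. The helper names, the toy matrices, and the choice to apply magnitude pruning per row as well are assumptions made for illustration.

```python
import math

def prune_lowest(scores, k):
    """Return a keep-mask that drops the k lowest-scoring entries of a list."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    drop = set(order[:k])
    return [i not in drop for i in range(len(scores))]

def magnitude_mask(W, sparsity):
    """Magnitude pruning: score each weight by |w| (applied per row here, an assumption)."""
    return [prune_lowest([abs(w) for w in row], int(len(row) * sparsity))
            for row in W]

def wanda_mask(W, X, sparsity):
    """Wanda-style pruning: score = |w_ij| * ||X_j||_2, compared within each output row."""
    n_in = len(W[0])
    # L2 norm of each input feature across the calibration samples in X
    norms = [math.sqrt(sum(x[j] ** 2 for x in X)) for j in range(n_in)]
    return [prune_lowest([abs(w) * norms[j] for j, w in enumerate(row)],
                         int(len(row) * sparsity))
            for row in W]

# Toy data (hypothetical): one output row, four inputs, two calibration samples.
W = [[0.1, -2.0, 0.5, 1.0]]
X = [[10.0, 0.01, 1.0, 1.0],
     [10.0, 0.01, 1.0, 1.0]]

mag_keep = magnitude_mask(W, 0.5)
wanda_keep = wanda_mask(W, X, 0.5)
print(mag_keep)
print(wanda_keep)
```

In this toy case the large weight at index 1 is kept by magnitude pruning but dropped by the Wanda-style score, because its input is almost never active, while the small weight at index 0 is kept because its input is large.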
Abstract
Weight pruning is a standard technique for compressing large language models, yet its effect on learned internal representations remains poorly understood. We present the first systematic study of how unstructured pruning reshapes the feature geometry of language models, using Sparse Autoencoders (SAEs) as interpretability probes. Across three model families (Gemma 3 1B, Gemma 2 2B, Llama 3.2 1B), two pruning methods (magnitude and Wanda), and six sparsity levels (0–60%), we investigate five research questions spanning seed stability, feature survival, SAE transferability, feature fragility, and causal relevance. Our most striking finding is that rare SAE features (those with low firing rates) survive pruning far better than frequent ones, with within-condition Spearman correlations of ρ = -1.0 in 11 of 17 experimental conditions. This counter-intuitive result suggests that pruning…
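To make the reported Spearman value concrete: a correlation of -1.0 means that survival falls every time firing rate rises, so the two rankings are exact mirror images. The sketch below computes a Spearman rank correlation for tie-free data. The firing rates and survival fractions are invented purely for illustration and are not taken from the paper, and how the authors group features within a condition is not stated in this summary.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman_rho(x, y):
    """Spearman rank correlation via the tie-free formula 1 - 6*sum(d^2)/(n(n^2-1))."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical numbers, NOT from the paper: four firing-rate groups and the
# fraction of features in each group that survive pruning.
firing_rate = [0.001, 0.01, 0.05, 0.2]
survival = [0.9, 0.7, 0.4, 0.1]

rho = spearman_rho(firing_rate, survival)
print(rho)
```

Because Spearman correlation depends only on rank order, it reaches -1.0 for any strictly decreasing relationship, regardless of how steep the decline is.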
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis · Topic Modeling
