How Pruning Reshapes Features: Sparse Autoencoder Analysis of Weight-Pruned Language Models

Hector Borobia; Elies Seguí-Mas; Guillermina Tormo-Carbó

arXiv:2603.25325 · cs.LG · March 27, 2026

TL;DR

This study uses Sparse Autoencoders (SAEs) to analyze systematically how weight pruning changes the internal representations of language models. It finds that pruning preferentially preserves rare, specialized features over frequent ones, a result with implications for interpretability.

Contribution

The paper presents the first systematic study of how unstructured pruning changes feature geometry in language models. It documents the better survival of rare features and compares magnitude pruning with Wanda.

Findings

01

Rare features survive pruning better than frequent ones.

02

Wanda pruning preserves feature structure more effectively than magnitude pruning.

03

Pre-trained SAEs remain viable on highly pruned models up to 50% sparsity.
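
The first finding is a rank-correlation claim, so a small worked example may help. The sketch below computes Spearman's rho between firing rate and survival rate. It assumes the correlation is taken across a handful of firing-rate bins, which is a guess about the setup rather than the paper's documented procedure. The numbers are invented for illustration and the helper names `average_ranks` and `spearman_rho` are hypothetical.

```python
import numpy as np


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    # Replace the ranks of tied entries with their mean rank.
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks


def spearman_rho(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    return float(np.corrcoef(rx, ry)[0, 1])


# Hypothetical values, NOT taken from the paper:
# mean firing rate per bin, and the fraction of features in that bin
# that still have a close match after pruning.
firing_rate_bin = [0.001, 0.01, 0.05, 0.2, 0.5]
survival_rate = [0.95, 0.90, 0.70, 0.55, 0.30]

rho = spearman_rho(firing_rate_bin, survival_rate)
print(f"Spearman rho = {rho:.3f}")
```

Because survival falls strictly as firing rate rises in this toy data, the ranks are perfectly reversed and rho comes out at approximately -1. With only a few bins, a perfectly monotone trend is enough to reach that value, so it says nothing about the size of the effect.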

Abstract

Weight pruning is a standard technique for compressing large language models, yet its effect on learned internal representations remains poorly understood. We present the first systematic study of how unstructured pruning reshapes the feature geometry of language models, using Sparse Autoencoders (SAEs) as interpretability probes. Across three model families (Gemma 3 1B, Gemma 2 2B, Llama 3.2 1B), two pruning methods (magnitude and Wanda), and six sparsity levels (0-60%), we investigate five research questions spanning seed stability, feature survival, SAE transferability, feature fragility, and causal relevance. Our most striking finding is that rare SAE features, those with low firing rates, survive pruning far better than frequent ones, with within-condition Spearman correlations of ρ = -1.0 in 11 of 17 experimental conditions. This counter-intuitive result suggests that pruning…
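
For readers unfamiliar with the two pruning methods named in the abstract, here is a minimal NumPy sketch of how their keep-masks differ. It is not the authors' code. Magnitude pruning is shown dropping the smallest-|w| weights across one whole matrix, and Wanda is shown scoring each weight as |w| times the L2 norm of its input activation and pruning within each output row, following the commonly described formulation of that method. The function names, matrix shapes, and random calibration data are all assumptions made for illustration.

```python
import numpy as np


def magnitude_mask(W, sparsity):
    """Keep-mask that drops the smallest-|w| fraction of the whole matrix."""
    k = int(round(sparsity * W.size))
    mask = np.ones(W.size, dtype=bool)
    if k > 0:
        drop = np.argsort(np.abs(W).ravel(), kind="stable")[:k]
        mask[drop] = False
    return mask.reshape(W.shape)


def wanda_mask(W, X, sparsity):
    """Keep-mask using Wanda-style scores |W_ij| * ||X_j||_2.

    W has shape (out_features, in_features).
    X has shape (tokens, in_features) and holds calibration activations.
    Weights are compared within each output row.
    """
    act_norm = np.linalg.norm(X, axis=0)   # one norm per input channel
    scores = np.abs(W) * act_norm          # broadcasts over rows
    k = int(round(sparsity * W.shape[1]))
    mask = np.ones_like(W, dtype=bool)
    if k > 0:
        drop = np.argsort(scores, axis=1, kind="stable")[:, :k]
        np.put_along_axis(mask, drop, False, axis=1)
    return mask


# Hypothetical toy layer and calibration batch.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
X = rng.normal(size=(16, 8))
X[:, 0] *= 10.0  # one input channel with unusually large activations

m_mag = magnitude_mask(W, 0.5)
m_wanda = wanda_mask(W, X, 0.5)
W_pruned = W * m_wanda  # pruned weights are set to exactly zero
```

The two masks can disagree on which weights to keep: a small weight attached to a high-activation input channel can score well under the Wanda-style rule but poorly under pure magnitude.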

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis · Topic Modeling