Fundamental Limits of Neural Network Sparsification: Evidence from Catastrophic Interpretability Collapse
Dip Roy, Rajiv Misra, Sanjay Kumar Singh

TL;DR
This paper demonstrates that aggressive neural network sparsification leads to a systematic collapse of feature interpretability, despite stable global representation quality, revealing fundamental limits of interpretability under compression.
Contribution
It introduces an adaptive sparsity framework and provides empirical evidence of intrinsic interpretability collapse during severe network compression across benchmark datasets.
Findings
Global representation quality remains stable despite interpretability collapse.
Dead neuron rates exceed 60% at high sparsity levels.
The collapse depends on dataset complexity and is more severe on more complex datasets.
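To make the dead-neuron finding concrete, here is a minimal sketch of how such a rate could be computed. The summary does not state the paper's exact definition, so this assumes a neuron counts as "dead" if its activation magnitude never exceeds a small threshold on any sample in an evaluation set. The function name `dead_neuron_rate` and the threshold `eps` are hypothetical.

```python
def dead_neuron_rate(activations, eps=1e-8):
    """Fraction of neurons that never activate on any sample.

    activations: list of rows, one per sample; each row holds one value
    per latent neuron. A neuron is treated as dead when |value| <= eps
    for every sample (an assumed definition, not taken from the paper).
    """
    if not activations:
        raise ValueError("need at least one sample")
    n_neurons = len(activations[0])
    alive = [False] * n_neurons
    for row in activations:
        for j, value in enumerate(row):
            if abs(value) > eps:
                alive[j] = True
    # sum() counts the True entries, i.e. the neurons that fired at least once
    return 1.0 - sum(alive) / n_neurons


# Two samples, four neurons: neurons 0 and 2 never fire.
example = [
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.5],
]
rate = dead_neuron_rate(example)
```

Under this definition, a rate above 0.6 would mean that more than 60% of latent units contribute nothing on the evaluation data.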
Abstract
Extreme neural network sparsification (90% activation reduction) presents a critical challenge for mechanistic interpretability: understanding whether interpretable features survive aggressive compression. This work investigates feature survival under severe capacity constraints in hybrid Variational Autoencoder-Sparse Autoencoder (VAE-SAE) architectures. We introduce an adaptive sparsity scheduling framework that progressively reduces active neurons from 500 to 50 over 50 training epochs, and provide empirical evidence for fundamental limits of the sparsification-interpretability relationship. Testing across two benchmark datasets (dSprites and Shapes3D) with both Top-k and L1 sparsification methods, our key finding reveals a pervasive paradox: while global representation quality (measured by Mutual Information Gap) remains stable, local feature interpretability collapses…
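The scheduling idea in the abstract can be sketched as follows. The abstract gives only the endpoints (500 active neurons down to 50 over 50 epochs, which matches the stated 90% reduction), so the linear interpolation below is an assumption, as are the names `k_schedule` and `top_k_mask`. The Top-k step here keeps the k largest activation values and zeroes the rest; the paper's exact variant may differ.

```python
def k_schedule(epoch, k_start=500, k_end=50, n_epochs=50):
    """Number of active neurons allowed at a 0-indexed epoch.

    Linear interpolation between k_start and k_end is an assumption;
    the abstract only says the reduction is progressive.
    """
    if n_epochs < 2:
        return k_end
    epoch = min(max(epoch, 0), n_epochs - 1)  # clamp to the valid range
    frac = epoch / (n_epochs - 1)
    return round(k_start + frac * (k_end - k_start))


def top_k_mask(values, k):
    """Keep the k largest activation values and zero out the others."""
    k = max(0, min(k, len(values)))
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(values)]


first_k = k_schedule(0)    # budget at the first epoch
last_k = k_schedule(49)    # budget at the final epoch
masked = top_k_mask([0.1, 3.0, -2.0, 0.5], 2)
```

In a training loop, `k_schedule(epoch)` would be evaluated once per epoch and passed to the Top-k step applied to each latent vector, so the sparsity constraint tightens gradually rather than being imposed all at once.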
