Variance Is Not Importance: Structural Analysis of Transformer Compressibility Across Model Scales
Samuel Salfati

TL;DR
This paper empirically analyzes transformer compressibility at two model scales, GPT-2 (124M parameters) and Mistral 7B, identifying structural properties that determine how well compression works and showing where static post-training methods reach their limits.
Contribution
It systematically uncovers five structural properties affecting transformer compression and demonstrates the potential of adaptive, per-token computation methods.
Findings
High-variance directions are uncorrelated with predictive directions.
Transformer blocks are approximately linear only under the correct upstream distribution.
Single-block linear replacement achieves 34x compression with minimal perplexity increase.
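The first finding can be illustrated with a toy example. The sketch below is not the paper's experiment: it uses synthetic data (an assumption made here for illustration) in which four directions carry almost all of the variance while the prediction target depends on a single low-variance direction. Projecting onto the top principal components then keeps well over 90 percent of the variance yet discards nearly all of the predictive signal. The names `r2_linear`, `var_kept`, `r2_full`, and `r2_proj` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 5000, 16, 4

# Synthetic "activations": 4 high-variance directions, 12 low-variance ones.
scales = np.array([10.0] * 4 + [0.3] * 12)
X = rng.normal(size=(n, d)) * scales

# Assumption for illustration: the target depends only on a low-variance direction.
y = X[:, -1]

# PCA via SVD of the centered data.
Xc = X - X.mean(axis=0)
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Fraction of total variance captured by the top-k principal directions.
var_kept = float((s[:k] ** 2).sum() / (s ** 2).sum())

# Project the activations onto the top-k principal subspace.
X_proj = Xc @ Vt[:k].T @ Vt[:k]


def r2_linear(A, t):
    """In-sample R^2 of a least-squares affine fit from A to t."""
    A1 = np.column_stack([A, np.ones(len(A))])
    w, *_ = np.linalg.lstsq(A1, t, rcond=None)
    resid = t - A1 @ w
    return float(1.0 - resid.var() / t.var())


r2_full = r2_linear(Xc, y)      # full activations: target is recoverable
r2_proj = r2_linear(X_proj, y)  # high-variance subspace only: signal is lost

print(f"variance kept by top-{k} PCs: {var_kept:.3f}")
print(f"R^2 from full activations:   {r2_full:.3f}")
print(f"R^2 after projection:        {r2_proj:.3f}")
```

In this construction the outcome is built in by design; the paper's claim is that real transformer activations behave similarly, which the toy data cannot confirm.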
Abstract
We present a systematic empirical study of transformer compression through over 40 experiments on GPT-2 (124M parameters) and Mistral 7B (7.24B parameters). Our analysis covers spectral compression, block-level function replacement, rotation-based quantization, activation geometry, and adaptive early exit. We identify five structural properties relevant to compression. (1) Variance is not importance: high-variance activation directions are approximately 96 percent uncorrelated with predictive directions (measured via CCA), and projecting onto these subspaces preserves over 90 percent of variance while degrading perplexity. (2) Block linearity is conditional: transformer blocks are approximately linear (R^2 ~ 0.95 on GPT-2, 0.93 on Mistral block 31) only under the correct upstream distribution; modifying earlier blocks induces distribution shift that degrades downstream approximations.…
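Property (2) can be sketched on a toy problem. The code below is a minimal illustration, not the paper's setup: `block` is a hypothetical stand-in for a transformer block (a residual connection plus a tanh nonlinearity), and the input distributions are assumptions chosen for the demonstration. An affine map is fitted on inputs from one distribution, then scored with R^2 on held-out inputs from the same distribution and on inputs whose mean has been shifted, mimicking what happens when an earlier block is modified.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.normal(size=(d, d)) / np.sqrt(d)


def block(x):
    # Hypothetical stand-in for a transformer block: residual + nonlinearity.
    return x + np.tanh(x @ W)


def fit_affine(X, Y):
    """Least-squares affine map from X to Y (weights plus bias row)."""
    X1 = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    return coef


def r2(coef, X, Y):
    """R^2 of the fitted affine map on (X, Y), pooled over output dimensions."""
    X1 = np.column_stack([X, np.ones(len(X))])
    ss_res = ((Y - X1 @ coef) ** 2).sum()
    ss_tot = ((Y - Y.mean(axis=0)) ** 2).sum()
    return float(1.0 - ss_res / ss_tot)


# Fit the linear replacement on the "correct" upstream distribution.
X_train = rng.normal(size=(4000, d)) * 0.5
coef = fit_affine(X_train, block(X_train))

# Held-out inputs from the same distribution.
X_test = rng.normal(size=(2000, d)) * 0.5
# Inputs with a shifted mean, as if an earlier block had been altered.
X_shift = rng.normal(size=(2000, d)) * 0.5 + 3.0

r2_in = r2(coef, X_test, block(X_test))
r2_shift = r2(coef, X_shift, block(X_shift))

print(f"R^2 in distribution: {r2_in:.3f}")
print(f"R^2 after shift:     {r2_shift:.3f}")
```

The fitted map is a good approximation only where it was trained; under the shifted inputs the nonlinearity saturates and the same map fits much worse, which is the qualitative behavior the abstract describes.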
