Structural Sensitivity in Compressed Transformers: Relative Error Propagation and Layer Removal
Abhinaba Basu, Kumkum Basu, Koushik Deb

TL;DR
This paper investigates how errors from compressing transformer models propagate through layers. It characterizes how those errors accumulate, compares pruning strategies, and provides a training-free method for making model compression decisions.
Contribution
The paper analyzes error propagation in compressed transformers, compares pruning techniques, and proposes an error contraction profile to guide model compression.
Findings
Errors introduced at layer t scale downstream by the product of the later layers' rho values, which predicts representation drift (Spearman r = -0.44, p < 10^-4).
Activation-aware pruning reduces component sensitivity spread from 600x to 3-7x.
Ranking layers by rho improves depth pruning, achieving lower perplexity together with a speed-up.
Abstract
Compressing transformer weights makes large language models cheaper to deploy. But each layer's compression introduces an error. These errors accumulate as the signal passes through later layers, and how they accumulate is not well understood. We measure this directly: at each layer, we take the ratio of output to input error, calling it rho. A value below one means the layer absorbs the error; above one means it grows. Computing rho on six transformers (117M to 8B parameters) yields three findings. (i) Errors at layer t scale downstream by the product of later rho values, predicting representation drift (Spearman r = -0.44, p < 10^-4). This explains why compressing early layers hurts more than late ones, and why depth-decreasing sparsity schedules outperform uniform ones. Across architecture families, however, model width and redundancy matter more than rho alone. (ii) Within a layer,…
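The rho measurement and the downstream product described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the L2 norm, the function names `layer_rho` and `downstream_gain`, and the toy scaling "layers" are all assumptions chosen for clarity, and a real measurement would run a compressed and an uncompressed model on actual hidden states.

```python
import math

def l2_norm(v):
    """Euclidean norm of a list of floats (the choice of norm is an assumption)."""
    return math.sqrt(sum(x * x for x in v))

def layer_rho(layer, h_clean, h_noisy):
    """Ratio of output error to input error for one layer.

    rho < 1: the layer absorbs the error; rho > 1: the error grows.
    """
    err_in = l2_norm([a - b for a, b in zip(h_noisy, h_clean)])
    out_clean, out_noisy = layer(h_clean), layer(h_noisy)
    err_out = l2_norm([a - b for a, b in zip(out_noisy, out_clean)])
    return err_out / err_in

def downstream_gain(rhos):
    """gain[t] = product of rho values of the layers after layer t.

    The last layer has no later layers, so its gain is 1.0.
    """
    gain = [1.0] * len(rhos)
    for t in range(len(rhos) - 2, -1, -1):
        gain[t] = gain[t + 1] * rhos[t + 1]
    return gain

# Hypothetical toy "layers": pure scalings, so rho is known in advance.
def half(h):
    return [0.5 * x for x in h]

def double(h):
    return [2.0 * x for x in h]

layers = [half, double, half]
h_clean = [1.0, -2.0, 0.5]
h_noisy = [1.1, -2.0, 0.4]   # clean state plus a small injected error

rhos = []
for layer in layers:
    rhos.append(layer_rho(layer, h_clean, h_noisy))
    h_clean, h_noisy = layer(h_clean), layer(h_noisy)

gains = downstream_gain(rhos)
print("rho per layer:  ", rhos)
print("downstream gain:", gains)
```

Under this sketch, a layer with a large downstream gain is one where compression error would be amplified most by later layers, which is the kind of ranking signal the abstract describes.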
