No Free Swap: Protocol-Dependent Layer Redundancy in Transformers
Gabriel Garcia

TL;DR
This paper shows that two protocols for testing layer redundancy in transformers, replacement and interchange, can disagree by several-fold on which layers look safe to prune, so the choice of protocol materially affects pruning decisions.
Contribution
It measures replacement and interchange redundancy across checkpoints and architectures, including a Pythia training trajectory (410M and 1.4B) and an 8B-scale comparison that includes Qwen3-8B, and shows how the choice of protocol changes pruning decisions.
Findings
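The gap between the replacement and interchange protocols can change which layers look safe to prune by several-fold under the same evaluator.
Interchange-guided removal can be several-fold safer than replacement-guided removal in divergent regimes, as observed for Qwen3-8B.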
Metric gaps between protocols do not always correspond to pruning costs.
Abstract
When researchers ask whether two transformer layers are "equivalent" for compression, they often conflate distinct tests. Replacement asks whether one layer's map can substitute for another's in place; interchange asks whether two layers approximately commute when their positions are swapped. Both are output-grounded swap-KL probes, but they need not agree: on pretrained transformers the protocol gap can change which layers look safe to prune by several-fold under the same evaluator, especially when replacement distances are high. We measure both protocols across checkpoints and architectures. On a Pythia training trajectory (410M and 1.4B), the replacement-interchange gap grows from initialization to convergence. Under one matched WikiText-2 contract at 8B scale, Qwen3-8B enters a divergent regime: interchange-guided removal is several-fold safer than replacement-guided at the same…
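The distinction between the two probes can be illustrated with a minimal sketch on a toy residual stack. Everything here is an assumption for illustration: the toy layers, the random readout, and the function names `replacement_kl` and `interchange_kl` are hypothetical and are not the paper's evaluator or models. Replacement puts layer i's map into position j while layer i also stays in place; interchange swaps the positions of layers i and j. Both are scored by the mean KL divergence from the unmodified model's output distribution.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(x, weights, order, readout):
    # Toy residual stack (an assumption): h <- h + tanh(h @ W) for each
    # layer index in `order`, followed by a linear readout and softmax.
    h = x
    for idx in order:
        h = h + np.tanh(h @ weights[idx])
    return softmax(h @ readout)

def mean_kl(p, q, eps=1e-12):
    # Mean KL(p || q) over the batch of output distributions.
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

def replacement_kl(x, weights, readout, i, j):
    # Replacement: layer i's map is used at position j (layer i stays in place).
    order = list(range(len(weights)))
    base = forward(x, weights, order, readout)
    order[j] = i
    return mean_kl(base, forward(x, weights, order, readout))

def interchange_kl(x, weights, readout, i, j):
    # Interchange: layers i and j swap positions.
    order = list(range(len(weights)))
    base = forward(x, weights, order, readout)
    order[i], order[j] = order[j], order[i]
    return mean_kl(base, forward(x, weights, order, readout))

# Hypothetical toy setup: random weights, random inputs.
rng = np.random.default_rng(0)
d, n_layers, vocab = 16, 4, 32
weights = [rng.normal(scale=0.3, size=(d, d)) for _ in range(n_layers)]
readout = rng.normal(size=(d, vocab))
x = rng.normal(size=(64, d))

rep = replacement_kl(x, weights, readout, 1, 2)
inter = interchange_kl(x, weights, readout, 1, 2)
print(f"replacement KL: {rep:.4f}  interchange KL: {inter:.4f}")
```

In this toy setting the two scores are generally different numbers for the same layer pair, which is the sense in which the protocols "need not agree". Nothing about the size or direction of the gap in pretrained transformers can be inferred from random weights.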
