TL;DR
This paper investigates why network pruning preserves performance on non-generative language tasks but frequently fails on generative ones, by analyzing how pruning perturbs a language model's internal representations in the embedding, logit, and probability spaces.
Contribution
It introduces a representation-hierarchy perspective that explains why the impact of pruning varies across language tasks, and it provides practical guidance for applying pruning.
Findings
Representations in the embedding and logit spaces are largely robust to pruning-induced perturbations.
The nonlinear transformation from logits to probabilities amplifies these deviations, which accumulate across time steps and affect generation.
Pruning is effective for non-generative tasks due to stable probability subspaces.
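One way to see why the logit-to-probability step can amplify small deviations is the standard first-order expansion of the softmax. This is a textbook identity offered as a sketch, not necessarily the analysis used in the paper; the symbols $z$ (logits), $p$ (probabilities), and $\Delta z$ (a pruning-induced logit perturbation) are notation assumed here for illustration.

```latex
p_i = \frac{e^{z_i}}{\sum_k e^{z_k}},
\qquad
\frac{\partial p_i}{\partial z_j} = p_i\,(\delta_{ij} - p_j)

\Delta p_i \;\approx\; \sum_j \frac{\partial p_i}{\partial z_j}\,\Delta z_j
\;=\; p_i\Bigl(\Delta z_i - \sum_j p_j\,\Delta z_j\Bigr)

\frac{\Delta p_i}{p_i} \;\approx\; \Delta z_i - \sum_j p_j\,\Delta z_j
```

Under this approximation, the relative change in a probability is set by the absolute shift in the logits, not the relative one. For example, a shift of 0.2 on a logit near 10 is only a 2% relative change in logit space, yet if only that logit moves, the relative change in its probability is about $0.2\,(1 - p_i)$, that is, up to roughly 20%.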
Abstract
Network pruning, which removes less important parameters or architectures, is often expected to improve efficiency while preserving performance. However, this expectation does not consistently hold across language tasks: pruned models can perform well on non-generative tasks but frequently fail in generative settings. To understand this discrepancy, we analyze network pruning from a representation-hierarchy perspective, decomposing the internal computation of language models into three sequential spaces: embedding (hidden representations), logit (pre-softmax outputs), and probability (post-softmax distributions). We find that representations in the embedding and logit spaces are largely robust to pruning-induced perturbations. However, the nonlinear transformation from logits to probabilities amplifies these deviations, which accumulate across time steps and lead to substantial…
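The gap between logit-space and probability-space robustness can be illustrated with a toy example. The sketch below uses hand-picked logit vectors (`dense_logits` and `pruned_logits` are hypothetical values, not outputs of any real model or of the paper's experiments) and compares cosine similarity before and after the softmax, along with the greedy token choice.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def argmax(values):
    """Index of the largest entry (the greedy decoding choice)."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical logits over a 6-token vocabulary (illustrative values only).
dense_logits = [10.0, 9.9, 0.0, 0.0, 0.0, 0.0]
# Assumed small perturbation standing in for the effect of pruning.
pruned_logits = [9.8, 10.1, 0.0, 0.0, 0.0, 0.0]

dense_probs = softmax(dense_logits)
pruned_probs = softmax(pruned_logits)

logit_cos = cosine(dense_logits, pruned_logits)
prob_cos = cosine(dense_probs, pruned_probs)
dense_token = argmax(dense_probs)
pruned_token = argmax(pruned_probs)

print("logit-space cosine similarity:", logit_cos)
print("probability-space cosine similarity:", prob_cos)
print("greedy token (dense, pruned):", dense_token, pruned_token)
```

In this constructed case the two logit vectors are nearly parallel, while the probability vectors are less similar and the greedy token differs. A non-generative task that only compares a few candidate scores may tolerate such a shift, whereas in autoregressive decoding a changed token feeds into every later step.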
