The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter
Ajay Jaiswal, Shiwei Liu, Tianlong Chen, Zhangyang Wang

TL;DR
This paper investigates the phenomenon of essential sparsity in large pre-trained transformers, revealing a sharp drop in performance beyond a certain sparsity level and uncovering emergent sparsification during pre-training.
Contribution
It introduces the concept of essential sparsity, demonstrates its validity across models and sparsity patterns, and uncovers emergent sparsification phenomena during BERT pre-training.
Findings
Essential sparsity marks a threshold beyond which performance declines sharply when the smallest-magnitude weights are removed in one shot without re-training.
Abrupt sparsification emerges during BERT pre-training after a certain number of iterations.
Self-supervised learning enhances emergent sparsification compared to supervised learning.
Abstract
Large pre-trained transformers are the show-stealers of modern deep learning, and as they grow in scale it becomes crucial to understand the parsimonious patterns that exist within them. With exploding parameter counts, the Lottery Ticket Hypothesis (LTH) and its variants have lost their pragmatism for sparsifying these models, owing to the high computation and memory cost of the repetitive train-prune-retrain routine of iterative magnitude pruning (IMP), which worsens with increasing model size. This paper comprehensively studies induced sparse patterns across multiple large pre-trained vision and language transformers. We propose the existence of essential sparsity, defined by a sharp dropping point beyond which performance declines much faster with respect to the rise in sparsity level when we directly remove the weights with the smallest magnitudes in one shot without re-training. We also find essential…
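The one-shot procedure the abstract describes (remove the smallest-magnitude weights, then evaluate without re-training) can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the function name `one_shot_magnitude_prune`, the dict-of-arrays stand-in for a model, and the choice to rank magnitudes globally across all tensors rather than per layer are assumptions made for the example.

```python
import numpy as np


def one_shot_magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of entries with the smallest |w|.

    `weights` is a hypothetical dict mapping parameter names to NumPy arrays.
    Magnitudes are pooled across all arrays (global ranking, an assumption).
    Nothing is re-trained; the input arrays are left unmodified.
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")

    # Pool every weight magnitude into one flat vector.
    magnitudes = np.concatenate([np.abs(w).ravel() for w in weights.values()])
    num_to_prune = int(sparsity * magnitudes.size)
    if num_to_prune == 0:
        return {name: w.copy() for name, w in weights.items()}

    # The k-th smallest magnitude is the pruning threshold.
    threshold = np.partition(magnitudes, num_to_prune - 1)[num_to_prune - 1]

    # Keep only weights strictly above the threshold. If several weights tie
    # exactly at the threshold, slightly more than k entries are zeroed.
    return {
        name: np.where(np.abs(w) > threshold, w, 0.0)
        for name, w in weights.items()
    }


def sparsity_of(weights):
    """Fraction of entries that are exactly zero across all arrays."""
    total = sum(w.size for w in weights.values())
    zeros = sum(int(np.count_nonzero(w == 0)) for w in weights.values())
    return zeros / total


# Toy stand-in for a pre-trained model: two random weight tensors.
rng = np.random.default_rng(0)
toy_model = {
    "layer1": rng.normal(size=(4, 4)),
    "layer2": rng.normal(size=(8,)),
}
pruned_model = one_shot_magnitude_prune(toy_model, sparsity=0.5)
```

Sweeping `sparsity` over a grid and evaluating each pruned model on a downstream task would trace the performance-versus-sparsity curve on which the sharp dropping point is read off. The evaluation step is omitted here because it depends on the model and task.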
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Speech Recognition and Synthesis · Topic Modeling
