The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter

Ajay Jaiswal; Shiwei Liu; Tianlong Chen; Zhangyang Wang

arXiv:2306.03805 · cs.LG · August 11, 2023 · 6 citations

Open Access · 1 repository

TL;DR

This paper studies essential sparsity in large pre-trained transformers: a sparsity level beyond which performance drops sharply when the smallest-magnitude weights are removed in one shot without re-training. It also reports emergent sparsification during pre-training.

Contribution

The paper introduces the concept of essential sparsity, shows that it holds across multiple large pre-trained vision and language transformers and across sparsity patterns, and reports emergent sparsification during BERT pre-training.

Findings

01. Performance declines sharply once sparsity exceeds the essential-sparsity threshold.

02. BERT sparsifies abruptly during pre-training after a certain number of iterations.

03. Self-supervised learning produces stronger emergent sparsification than supervised learning.

Abstract

Large pre-trained transformers are the show-stealers of modern-day deep learning, and it becomes crucial to comprehend the parsimonious patterns that exist within them as they grow in scale. With exploding parameter counts, the Lottery Ticket Hypothesis (LTH) and its variants have lost their pragmatism in sparsifying them, due to the high computation and memory bottleneck of the repetitive train-prune-retrain routine of iterative magnitude pruning (IMP), which worsens with increasing model size. This paper comprehensively studies induced sparse patterns across multiple large pre-trained vision and language transformers. We propose the existence of essential sparsity, defined by a sharp dropping point beyond which performance declines much faster w.r.t. the rise of sparsity level, when we directly remove the weights with the smallest magnitudes in one shot without re-training. We also find essential…
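The one-shot procedure described in the abstract (remove the smallest-magnitude weights once, with no re-training, then check performance at each sparsity level) can be sketched in plain Python. This is only an illustrative sketch, not the authors' implementation: the function names, the toy weight list, and the user-supplied `evaluate` callback are all assumptions made for the example, and a real model would prune tensors rather than a flat list.

```python
from typing import Callable, List, Sequence, Tuple


def one_shot_magnitude_prune(weights: Sequence[float], sparsity: float) -> List[float]:
    """Zero out the `sparsity` fraction of weights with the smallest magnitudes.

    This is a single pruning pass; no re-training follows.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    num_to_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:num_to_prune]:
        pruned[i] = 0.0
    return pruned


def zero_fraction(weights: Sequence[float]) -> float:
    """Fraction of entries that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


def sweep(
    weights: Sequence[float],
    levels: Sequence[float],
    evaluate: Callable[[List[float]], float],
) -> List[Tuple[float, float]]:
    """Prune at each sparsity level and record a score for the pruned weights.

    `evaluate` is a hypothetical stand-in for a downstream metric.
    """
    return [(s, evaluate(one_shot_magnitude_prune(weights, s))) for s in levels]


# Hypothetical toy weights, not taken from any real model.
toy_weights = [0.9, -0.05, 0.3, 0.01, -0.7, 0.02, 0.4, -0.1]
pruned_half = one_shot_magnitude_prune(toy_weights, sparsity=0.5)
```

Plotting the scores returned by `sweep` against the sparsity levels is one way to look for the kind of sharp dropping point that the abstract calls essential sparsity, provided `evaluate` measures real task performance.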

Code & Models

Repositories

vita-group/essential_sparsity
PyTorch · Official

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Speech Recognition and Synthesis · Topic Modeling