From Dense to Sparse: Contrastive Pruning for Better Pre-trained Language Model Compression
Runxin Xu, Fuli Luo, Chengyu Wang, Baobao Chang, Jun Huang, Songfang Huang, Fei Huang

TL;DR
This paper introduces ContrAstive Pruning (CAP), a novel framework for compressing pre-trained language models by preserving both task-agnostic and task-specific knowledge through contrastive learning, leading to high sparsity with minimal performance loss.
Contribution
CAP is a general pruning framework that effectively maintains knowledge during compression by leveraging contrastive learning and model snapshots, outperforming prior methods especially at high sparsity levels.
Findings
CAP achieves 99.2% of the original BERT's performance on QQP with only 3% of the parameters retained (97% sparsity).
CAP outperforms existing pruning methods at high sparsity levels.
Models pruned with CAP show improved generalization ability.
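To make the 3% figure above concrete, here is a minimal sketch of unstructured magnitude pruning on a toy weight list. It only illustrates what keeping 3% of the weights (97% sparsity) means; it is not the CAP method. The function names `magnitude_prune` and `sparsity`, the toy weights, and the magnitude-based criterion are all assumptions made for illustration.

```python
import random


def magnitude_prune(weights, keep_ratio):
    """Zero out all but the largest-magnitude fraction `keep_ratio` of weights.

    Hypothetical helper for illustration; generic magnitude pruning,
    not the pruning criterion used by CAP.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, round(len(weights) * keep_ratio))
    # Indices sorted by descending absolute value; the first n_keep survive.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    keep = set(order[:n_keep])
    return [w if i in keep else 0.0 for i, w in enumerate(weights)]


def sparsity(weights):
    """Fraction of entries that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


random.seed(0)
toy_weights = [random.gauss(0.0, 1.0) for _ in range(1000)]  # stand-in for model weights
pruned = magnitude_prune(toy_weights, keep_ratio=0.03)
kept = sum(1 for w in pruned if w != 0.0)
print(f"kept {kept} of {len(pruned)} weights, sparsity = {sparsity(pruned):.2f}")
```

In a real model the same idea would be applied to weight tensors rather than a flat list, and the choice of which weights to keep is where pruning methods differ.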
Abstract
Pre-trained Language Models (PLMs) have achieved great success in various Natural Language Processing (NLP) tasks under the pre-training and fine-tuning paradigm. With large quantities of parameters, PLMs are computation-intensive and resource-hungry. Hence, model pruning has been introduced to compress large-scale PLMs. However, most prior approaches only consider task-specific knowledge towards downstream tasks, but ignore the essential task-agnostic knowledge during pruning, which may cause catastrophic forgetting problem and lead to poor generalization ability. To maintain both task-agnostic and task-specific knowledge in our pruned model, we propose ContrAstive Pruning (CAP) under the paradigm of pre-training and fine-tuning. It is designed as a general framework, compatible with both structured and unstructured pruning. Unified in contrastive learning, CAP enables the pruned model…
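The abstract is cut off before it describes the training objective, so the following is only a generic sketch of how a contrastive term could pull a pruned model's representation toward that of a dense reference model. It uses the standard InfoNCE form with cosine similarity. The function names, the temperature value, and the choice of positives (the dense model's representation of the same example) and negatives (representations of other examples) are assumptions, not the paper's exact formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss: -log softmax score of the positive pair.

    anchor:    representation from the pruned model (assumed role)
    positive:  representation of the same example from a dense reference model
    negatives: representations of other examples
    """
    pos_logit = cosine(anchor, positive) / temperature
    logits = [pos_logit] + [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - pos_logit


# Toy 2-d representations: the loss is small when the pruned model agrees
# with the reference on the same example, and large when it does not.
aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
print(f"aligned loss = {aligned:.4f}, misaligned loss = {misaligned:.4f}")
```

Under this reading, using a pre-trained model as the reference would target task-agnostic knowledge and using a fine-tuned model would target task-specific knowledge, which matches the two kinds of knowledge the abstract says CAP aims to preserve.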
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
Methods: Attention Is All You Need · Pruning · Linear Layer · Adam · Multi-Head Attention · Residual Connection · Layer Normalization · Attention Dropout · Linear Warmup With Linear Decay · Dense Connections
