VTrans: Accelerating Transformer Compression with Variational Information Bottleneck based Pruning

Oshin Dutta; Ritvik Gupta; Sumeet Agarwal

arXiv:2406.05276 · cs.LG · June 13, 2024 · 1 citation

Open Access

TL;DR

VTrans introduces a variational information bottleneck-based iterative pruning framework that compresses all transformer components, including embeddings, achieving significant size reduction with minimal performance loss.

Contribution

The paper presents a novel VIB-guided pruning method that compresses all transformer parts, including embeddings, and introduces faster variants requiring less data and time.

Findings

01

Achieves up to 70% more compression than prior methods.

02

Faster-VTrans accelerates compression by up to 25 times with minimal performance impact.

03

Effectively scales to large models like LLaMA-2-7B.

Abstract

In recent years, there has been a growing emphasis on compressing large pre-trained transformer models for resource-constrained devices. However, traditional pruning methods often leave the embedding layer untouched, leading to model over-parameterization. Additionally, they require extensive compression time with large datasets to maintain performance in pruned models. To address these challenges, we propose VTrans, an iterative pruning framework guided by the Variational Information Bottleneck (VIB) principle. Our method compresses all structural components, including embeddings, attention heads, and layers using VIB-trained masks. This approach retains only essential weights in each layer, ensuring compliance with specified model size or computational constraints. Notably, our method achieves up to 70% more compression than prior state-of-the-art approaches, both task-agnostic and…
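
To make the idea of VIB-trained masks concrete, here is a minimal sketch of a generic VIB-style gate, assuming the formulation common in earlier VIB pruning work: each prunable unit has a learned mean `mu` and log-variance `log_sigma2`, and a unit is dropped when its noise-to-signal ratio `sigma^2 / mu^2` is high. The abstract does not give VTrans's exact mask formulation, so the function names, the threshold, and the budget rule below are all illustrative assumptions rather than the authors' implementation.

```python
import math

EPS = 1e-8  # assumption: small constant to avoid log(0)


def log_alpha(mu, log_sigma2):
    """Per-unit log noise-to-signal ratio: log(sigma^2 / mu^2)."""
    return [ls2 - math.log(m * m + EPS) for m, ls2 in zip(mu, log_sigma2)]


def vib_keep_mask(mu, log_sigma2, threshold=0.0):
    """Return 1 for units to keep and 0 for units to prune.

    A unit is kept when its log noise-to-signal ratio is below
    `threshold` (hypothetical default of 0.0).
    """
    return [1 if a < threshold else 0 for a in log_alpha(mu, log_sigma2)]


def vib_kl_penalty(mu, log_sigma2):
    """Regularizer that pushes uninformative units toward being prunable.

    Uses the closed form 0.5 * log(1 + mu^2 / sigma^2), summed over units.
    """
    return sum(
        0.5 * math.log1p(m * m / math.exp(ls2))
        for m, ls2 in zip(mu, log_sigma2)
    )


def prune_to_budget(mu, log_sigma2, keep):
    """Keep exactly `keep` units with the lowest noise-to-signal ratio.

    This is one simple way to meet a fixed size constraint; it is an
    illustrative rule, not necessarily the one used by VTrans.
    """
    scores = log_alpha(mu, log_sigma2)
    order = sorted(range(len(scores)), key=scores.__getitem__)
    kept = set(order[:keep])
    return [1 if i in kept else 0 for i in range(len(scores))]


# Toy mask parameters for four units (made-up values).
mu = [1.0, 0.01, -2.0, 0.1]
log_sigma2 = [-2.0, 0.0, -1.0, -1.0]

threshold_mask = vib_keep_mask(mu, log_sigma2)
budget_mask = prune_to_budget(mu, log_sigma2, keep=3)
penalty = vib_kl_penalty(mu, log_sigma2)
```

In this toy example the threshold rule keeps units 0 and 2, while the budget rule with `keep=3` also retains unit 3. In a real training loop the penalty would be added to the task loss so that the mask parameters are learned jointly with the model.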

Taxonomy

Topics: Optical Network Technologies · Advanced Data Compression Techniques · PAPR reduction in OFDM

Methods: Attention Is All You Need · Cosine Annealing · WordPiece · Byte Pair Encoding · Linear Warmup With Linear Decay · Adam · Attention Dropout · Weight Decay · Linear Warmup With Cosine Annealing