VTrans: Accelerating Transformer Compression with Variational Information Bottleneck based Pruning
Oshin Dutta, Ritvik Gupta, Sumeet Agarwal

TL;DR
VTrans introduces a variational information bottleneck-based iterative pruning framework that compresses all transformer components, including embeddings, achieving significant size reduction with minimal performance loss.
Contribution
The paper presents a novel VIB-guided pruning method that compresses all transformer parts, including embeddings, and introduces faster variants requiring less data and time.
Findings
Achieves up to 70% more compression than prior methods.
Faster-VTrans accelerates compression by up to 25 times with minimal performance impact.
Effectively scales to large models like LLaMA-2-7B.
Abstract
In recent years, there has been a growing emphasis on compressing large pre-trained transformer models for resource-constrained devices. However, traditional pruning methods often leave the embedding layer untouched, leading to model over-parameterization. Additionally, they require extensive compression time with large datasets to maintain performance in pruned models. To address these challenges, we propose VTrans, an iterative pruning framework guided by the Variational Information Bottleneck (VIB) principle. Our method compresses all structural components, including embeddings, attention heads, and layers, using VIB-trained masks. This approach retains only essential weights in each layer, ensuring compliance with specified model size or computational constraints. Notably, our method achieves up to 70% more compression than prior state-of-the-art approaches, both task-agnostic and…
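The abstract says that learned masks decide which structural units survive under a size constraint. As a rough illustration only, the sketch below ranks units by the signal-to-noise ratio of a Gaussian gate and keeps the top fraction. The function name `vib_keep_mask`, the parameters `mu`, `sigma`, and `keep_fraction`, and the ratio-based ranking rule are all assumptions made for this example; the abstract does not specify how VTrans scores or thresholds its masks.

```python
import math


def vib_keep_mask(mu, sigma, keep_fraction):
    """Return a 0/1 mask that keeps the units with the highest mu^2 / sigma^2.

    mu, sigma: per-unit mean and standard deviation of a learned Gaussian
    gate (hypothetical names, one entry per head, layer, or embedding dim).
    keep_fraction: share of units to retain, standing in for a size budget.
    """
    if len(mu) != len(sigma):
        raise ValueError("mu and sigma must have the same length")
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")

    # Signal-to-noise ratio of each gate. A low ratio suggests the unit
    # passes little information and is a candidate for pruning.
    ratios = [(m * m) / (s * s) for m, s in zip(mu, sigma)]

    # Always keep at least one unit so the component is never emptied.
    n_keep = max(1, math.floor(keep_fraction * len(ratios)))

    # Indices of the n_keep units with the highest ratio.
    ranked = sorted(range(len(ratios)), key=lambda i: ratios[i], reverse=True)
    keep = set(ranked[:n_keep])

    return [1 if i in keep else 0 for i in range(len(ratios))]


# Example: four hypothetical attention heads, keep half of them.
mask = vib_keep_mask([2.0, 0.1, 1.5, 0.05], [1.0, 1.0, 1.0, 1.0], 0.5)
```

In an iterative scheme, a selection like this would be repeated with a gradually tighter `keep_fraction`, with fine-tuning between rounds; that schedule is likewise an assumption here rather than something stated in the abstract.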
Taxonomy
Topics: Optical Network Technologies · Advanced Data Compression Techniques · PAPR reduction in OFDM
Methods: Attention Is All You Need · Cosine Annealing · WordPiece · Byte Pair Encoding · Linear Warmup With Linear Decay · Adam · Attention Dropout · Weight Decay · Linear Warmup With Cosine Annealing
