Mixed Sparsity Training: Achieving 4× FLOP Reduction for Transformer Pretraining
Pihe Hu, Shaolong Li, Longbo Huang

TL;DR
This paper introduces Mixed Sparsity Training (MST), a novel method that reduces transformer pretraining FLOPs by 75% through dynamic sparsity and hybrid attention, maintaining GPT-2 performance.
Contribution
MST combines dynamic sparse training, sparsity variation, and hybrid sparse attention to significantly cut computational costs during transformer pretraining.
Findings
Achieves 4x FLOP reduction on GPT-2 without performance loss.
Demonstrates the effectiveness of combining DST, SV, and HSA during pretraining.
Reduces training resource requirements for large language models.
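The "4x" and "75%" figures above describe the same saving. A minimal sketch of the arithmetic, where the helper name `flop_reduction_fraction` is hypothetical and not taken from the paper:

```python
def flop_reduction_fraction(reduction_factor: float) -> float:
    """Fraction of FLOPs saved when total FLOPs shrink by `reduction_factor`.

    A k-times reduction leaves 1/k of the original FLOPs, so the saved
    fraction is 1 - 1/k.
    """
    if reduction_factor <= 0:
        raise ValueError("reduction_factor must be positive")
    return 1.0 - 1.0 / reduction_factor


# A 4x reduction leaves one quarter of the FLOPs, i.e. saves three quarters.
print(flop_reduction_fraction(4))
```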
Abstract
Large language models (LLMs) have made significant strides in complex tasks, yet their widespread adoption is impeded by substantial computational demands. With hundreds of billions of parameters, transformer-based LLMs necessitate months of pretraining across a high-end GPU cluster. However, this paper reveals a compelling finding: transformers exhibit considerable redundancy in pretraining computations, which motivates our proposed solution, Mixed Sparsity Training (MST), an efficient pretraining method that can reduce about 75% of Floating Point Operations (FLOPs) while maintaining performance. MST integrates dynamic sparse training (DST) with Sparsity Variation (SV) and Hybrid Sparse Attention (HSA) during pretraining, involving three distinct phases: warm-up, ultra-sparsification, and restoration. The warm-up phase transforms the dense model into a sparse one, and the restoration…
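The three phases named in the abstract can be pictured as a sparsity schedule over training steps. The sketch below is only an illustration under stated assumptions: the abstract is truncated before it describes restoration, so the assumption here is that restoration lowers sparsity again. The function name, the linear ramps, and every default value (`warmup_steps`, `restore_steps`, `peak_sparsity`, `final_sparsity`) are hypothetical and are not taken from the paper.

```python
def sparsity_schedule(step: int, total_steps: int,
                      warmup_steps: int = 100, restore_steps: int = 200,
                      peak_sparsity: float = 0.9,
                      final_sparsity: float = 0.5) -> float:
    """Hypothetical three-phase sparsity schedule (not the paper's exact one).

    warm-up:              sparsity ramps linearly from 0 (dense) to peak_sparsity
    ultra-sparsification: sparsity is held at peak_sparsity
    restoration:          sparsity ramps linearly down to final_sparsity
                          (assumed; the abstract is cut off at this point)
    """
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    if warmup_steps + restore_steps > total_steps:
        raise ValueError("phases are longer than the training run")

    restore_start = total_steps - restore_steps
    if step < warmup_steps:
        # Warm-up: dense model gradually becomes sparse.
        return peak_sparsity * step / warmup_steps
    if step < restore_start or restore_steps == 0:
        # Ultra-sparsification: train at the highest sparsity.
        return peak_sparsity
    # Restoration: move linearly from peak_sparsity to final_sparsity.
    progress = (step - restore_start) / restore_steps
    return peak_sparsity + (final_sparsity - peak_sparsity) * progress


# Sample the schedule at a few points of a 1000-step run.
for s in (0, 50, 100, 500, 900, 1000):
    print(s, round(sparsity_schedule(s, 1000), 3))
```

In a dynamic sparse training loop, a value like this would typically set the fraction of weights masked out at each step; how MST actually chooses and updates its masks is not covered by the text above.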
Taxonomy
Topics: Power Transformer Diagnostics and Insulation · Advanced Surface Polishing Techniques · Advanced machining processes and optimization
Methods: Linear Layer · Residual Connection · Multi-Head Attention · Cosine Annealing · Adam · Layer Normalization · Weight Decay · Attention Is All You Need · Dense Connections
