Mixed Sparsity Training: Achieving 4$\times$ FLOP Reduction for Transformer Pretraining

Pihe Hu; Shaolong Li; Longbo Huang

arXiv:2408.11746·cs.LG·August 22, 2024

TL;DR

This paper introduces Mixed Sparsity Training (MST), a pretraining method that cuts transformer pretraining FLOPs by about 75% (a 4$\times$ reduction) by combining dynamic sparse training, Sparsity Variation, and Hybrid Sparse Attention, while maintaining performance on GPT-2.

Contribution

MST combines dynamic sparse training (DST), Sparsity Variation (SV), and Hybrid Sparse Attention (HSA) to significantly cut computational costs during transformer pretraining.

Findings

01. Achieves a 4$\times$ FLOP reduction on GPT-2 without performance loss.

02. Demonstrates the effectiveness of combining DST, SV, and HSA during pretraining.

03. Reduces training resource requirements for large language models.

Abstract

Large language models (LLMs) have made significant strides in complex tasks, yet their widespread adoption is impeded by substantial computational demands. With hundreds of billions of parameters, transformer-based LLMs necessitate months of pretraining across a high-end GPU cluster. However, this paper reveals a compelling finding: transformers exhibit considerable redundancy in pretraining computations, which motivates our proposed solution, Mixed Sparsity Training (MST), an efficient pretraining method that can reduce about $75\%$ of Floating Point Operations (FLOPs) while maintaining performance. MST integrates dynamic sparse training (DST) with Sparsity Variation (SV) and Hybrid Sparse Attention (HSA) during pretraining, involving three distinct phases: warm-up, ultra-sparsification, and restoration. The warm-up phase transforms the dense model into a sparse one, and the restoration…
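The "4$\times$" in the title and the "about 75%" in the abstract are two ways of stating the same saving. The short sketch below shows only that arithmetic; the function names are hypothetical and are not taken from the paper or its code.

```python
def reduction_factor(fraction_saved: float) -> float:
    """Convert a fraction of FLOPs saved (e.g. 0.75) into an N-times reduction factor."""
    if not 0.0 <= fraction_saved < 1.0:
        raise ValueError("fraction_saved must be in [0, 1)")
    # If 75% of the FLOPs are removed, 25% remain, so the cost shrinks by 1 / 0.25.
    return 1.0 / (1.0 - fraction_saved)


def fraction_saved(factor: float) -> float:
    """Convert an N-times reduction factor (e.g. 4.0) into the fraction of FLOPs saved."""
    if factor < 1.0:
        raise ValueError("factor must be at least 1")
    # An N-times reduction leaves 1/N of the original FLOPs.
    return 1.0 - 1.0 / factor


if __name__ == "__main__":
    print(reduction_factor(0.75))  # factor implied by a 75% saving
    print(fraction_saved(4.0))     # saving implied by a 4x reduction
```

Because the abstract says "about" 75%, the 4$\times$ figure should be read as approximate as well.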

Taxonomy

Topics: Power Transformer Diagnostics and Insulation · Advanced Surface Polishing Techniques · Advanced machining processes and optimization

Methods: Linear Layer · Residual Connection · Multi-Head Attention · Cosine Annealing · Adam · Layer Normalization · Weight Decay · Attention Is All You Need · Dense Connections