Training Large Language Models Efficiently with Sparsity and Dataflow
Venkat Srinivasan, Darshan Gandhi, Urmish Thakker, Raghu Prabhakar

TL;DR
This paper presents an end-to-end training approach for large language models using sparsity and dataflow techniques, achieving a 4.5x speedup over a dense A100 baseline while maintaining model quality.
Contribution
It introduces a novel dataflow execution model and architecture that efficiently handles sparsity in training large language models, enabling faster training without quality loss.
Findings
Achieved a 4.5x speedup over a dense A100 baseline.
Trained a sparse GPT 13B model to match the quality of its dense counterpart.
Demonstrated efficient on-chip handling of irregular memory accesses.
Abstract
Large foundation language models have shown their versatility in adapting to a wide variety of downstream tasks, such as text generation, sentiment analysis, and semantic search. However, training such large foundation models is a non-trivial exercise that requires a significant amount of compute power and expertise from machine learning and systems experts. As models get larger, these demands only increase. Sparsity is a promising technique for relieving the compute requirements of training. However, sparsity introduces new challenges in training the sparse model to the same quality as its dense counterpart. Furthermore, sparsity lowers the operation intensity and introduces irregular memory access patterns that make it challenging to utilize compute resources efficiently. This paper demonstrates an end-to-end training flow on a large language model - 13…
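The abstract's point that sparsity lowers operation intensity can be illustrated with a back-of-the-envelope calculation. The sketch below is not taken from the paper: it assumes a simple model in which each operand is moved once, the sparse weight matrix is stored in CSR format, values take 2 bytes, and indices take 4 bytes. The function names and default parameters are hypothetical and chosen only for illustration.

```python
def dense_intensity(m, k, n, value_bytes=2):
    """FLOPs per byte for a dense (m x k) @ (k x n) matmul.

    Assumption: each matrix is read or written exactly once.
    """
    flops = 2 * m * k * n  # one multiply and one add per inner-product term
    bytes_moved = (m * k + k * n + m * n) * value_bytes
    return flops / bytes_moved


def csr_sparse_intensity(m, k, n, density, value_bytes=2, index_bytes=4):
    """FLOPs per byte when the (m x k) weight matrix is sparse and stored in CSR.

    Assumption: CSR keeps one value and one column index per nonzero,
    plus a row-pointer array of length m + 1.
    """
    nnz = int(density * m * k)
    flops = 2 * nnz * n  # only nonzero weights contribute work
    weight_bytes = nnz * (value_bytes + index_bytes) + (m + 1) * index_bytes
    activation_bytes = (k * n + m * n) * value_bytes  # dense input and output
    return flops / (weight_bytes + activation_bytes)


if __name__ == "__main__":
    size = 1024
    print("dense:", dense_intensity(size, size, size))
    for density in (0.5, 0.25, 0.125):
        print(density, csr_sparse_intensity(size, size, size, density))
```

Under these assumptions the work shrinks in proportion to the density, while the dense activations and the index metadata still have to be moved, so the ratio of FLOPs to bytes falls as the weights get sparser. Real hardware behavior also depends on caching, tiling, and the sparsity pattern, none of which this model captures.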
Taxonomy
Topics: Topic Modeling · Speech Recognition and Synthesis · Advanced Neural Network Applications
Methods: Attention Is All You Need · Linear Layer · Cosine Annealing · Linear Warmup With Cosine Annealing · Dense Connections · Attention Dropout · Multi-Head Attention · Discriminative Fine-Tuning · Weight Decay · Dropout
