Balancing Pipeline Parallelism with Vocabulary Parallelism
Man Tsung Yeung, Penghui Qi, Min Lin, Xinyi Wan

TL;DR
This paper balances computation and memory in pipeline parallelism for large language models by partitioning the vocabulary layers across pipeline devices and reducing the communication barriers inside them, yielding 5% to 51% higher throughput and lower peak memory usage.
Contribution
It introduces a new approach to evenly partition vocabulary layers and integrate vocabulary parallelism with pipeline schedules, addressing imbalance issues in large language model training.
Findings
Achieves 5% to 51% throughput improvement
Reduces peak memory usage significantly
Balances computation and memory across pipeline stages
Abstract
Pipeline parallelism is widely used to scale the training of transformer-based large language models, and various works have sought to improve its throughput and memory footprint. In this paper, we address a frequently overlooked issue: the vocabulary layers can cause imbalanced computation and memory usage across pipeline stages, worsening pipeline bubbles and the memory bottleneck. To tackle this, we partition the vocabulary layers evenly across pipeline devices and group the computation into pipeline passes. To reduce the activation memory overhead, we propose several algorithms to reduce communication barriers within vocabulary layers. Additionally, we utilize a generalizable method to integrate Vocabulary Parallelism with existing pipeline schedules. By combining these techniques, our methods effectively balance the computation and parameter memory, with only a small constant…
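To make the idea of partitioning a vocabulary layer concrete, here is a minimal single-process sketch of a softmax computed over an evenly sharded vocabulary. It is not the paper's algorithm or code: the function names (shard_bounds, sharded_softmax) are hypothetical, the "devices" are simulated as list slices, and the two reductions stand in for the cross-device all-reduce operations a real implementation would need. Those reductions are the kind of synchronization point the abstract refers to as communication barriers.

```python
import math


def shard_bounds(vocab_size, num_shards):
    """Split [0, vocab_size) into num_shards contiguous, near-equal ranges."""
    if num_shards < 1 or vocab_size < num_shards:
        raise ValueError("need at least one vocabulary entry per shard")
    base, extra = divmod(vocab_size, num_shards)
    bounds, start = [], 0
    for rank in range(num_shards):
        # The first `extra` shards take one additional entry.
        size = base + (1 if rank < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def sharded_softmax(logits, num_shards):
    """Softmax over one token's logits, computed shard by shard.

    Each shard only ever reads its own slice of the logits; the max and
    the sum are combined across shards (simulated all-reduce).
    """
    bounds = shard_bounds(len(logits), num_shards)

    # Reduction 1: each shard finds its local max, then all shards agree
    # on the global max (needed for numerical stability).
    local_max = [max(logits[lo:hi]) for lo, hi in bounds]
    global_max = max(local_max)

    # Reduction 2: each shard sums its shifted exponentials, then all
    # shards agree on the global normalizer.
    local_sum = [
        sum(math.exp(x - global_max) for x in logits[lo:hi])
        for lo, hi in bounds
    ]
    global_sum = sum(local_sum)

    # Each shard normalizes its own slice; concatenating the slices gives
    # the full distribution.
    probs = []
    for lo, hi in bounds:
        probs.extend(math.exp(x - global_max) / global_sum for x in logits[lo:hi])
    return probs
```

For example, shard_bounds(10, 4) gives shard sizes 3, 3, 2, 2, and sharded_softmax returns the same probabilities as an unsharded softmax regardless of the shard count. The sketch covers only the forward pass for a single token; how the paper schedules this work inside pipeline passes is not shown here.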
Taxonomy
Topics: Natural Language Processing Techniques
