Balancing Pipeline Parallelism with Vocabulary Parallelism

Man Tsung Yeung; Penghui Qi; Min Lin; Xinyi Wan

arXiv:2411.05288 · cs.DC · May 6, 2025

TL;DR

This paper proposes a novel method to balance computation and memory in pipeline parallelism for large language models by partitioning vocabulary layers and optimizing communication, leading to significant throughput and memory improvements.

Contribution

It introduces a new approach to evenly partition vocabulary layers and integrate vocabulary parallelism with pipeline schedules, addressing imbalance issues in large language model training.
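As a rough illustration of what an even partition of a vocabulary layer could look like, the sketch below assigns each pipeline device a contiguous range of token ids. The function name `partition_vocab` and the way leftover rows are spread across the first devices are assumptions made for this example; they are not taken from the paper or its repository, which may pad the vocabulary instead.

```python
from typing import List, Tuple


def partition_vocab(vocab_size: int, num_devices: int) -> List[Tuple[int, int]]:
    """Return half-open [start, end) token-id ranges, one per device.

    Hypothetical helper: shard sizes differ by at most one row.
    """
    if num_devices <= 0:
        raise ValueError("num_devices must be positive")
    base, remainder = divmod(vocab_size, num_devices)
    ranges = []
    start = 0
    for device in range(num_devices):
        # The first `remainder` devices take one extra row each.
        size = base + (1 if device < remainder else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


if __name__ == "__main__":
    for device, (lo, hi) in enumerate(partition_vocab(10, 4)):
        print(f"device {device}: token ids [{lo}, {hi})")
```

With shards of near-equal size, each device holds roughly the same share of the embedding and output-layer parameters, instead of the first and last stages carrying them alone.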

Findings

01 Achieves 5% to 51% throughput improvement

02 Reduces peak memory usage significantly

03 Balances computation and memory across pipeline stages

Abstract

Pipeline parallelism is widely used to scale the training of transformer-based large language models, and various works have sought to improve its throughput and memory footprint. In this paper, we address a frequently overlooked issue: the vocabulary layers can cause imbalanced computation and memory usage across pipeline stages, worsening pipeline bubbles and the memory bottleneck. To tackle this, we partition the vocabulary layers evenly across pipeline devices and group the computation into pipeline passes. To reduce the activation memory overhead, we propose several algorithms to reduce communication barriers within vocabulary layers. Additionally, we utilize a generalizable method to integrate Vocabulary Parallelism with existing pipeline schedules. By combining these techniques, our methods effectively balance the computation and parameter memory, with only a small constant…
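To make the communication inside a partitioned output layer concrete, the sketch below computes a softmax cross-entropy loss when the logits for one token are split across devices. It shows the standard numerically stable formulation, which needs one reduction for the maximum and one for the sum of exponentials; it is not the paper's optimized algorithm, whose details are not given in the abstract above. The function name `sharded_cross_entropy` is hypothetical, and the devices and all-reduce operations are simulated with plain Python lists and built-ins.

```python
import math
from typing import List


def sharded_cross_entropy(logit_shards: List[List[float]], target: int) -> float:
    """Cross-entropy for one token whose logits are split into shards.

    Each inner list stands in for the logits held by one device.
    """
    # Reduction 1 (simulated all-reduce max): global maximum for stability.
    global_max = max(max(shard) for shard in logit_shards)

    # Reduction 2 (simulated all-reduce sum): sum of shifted exponentials.
    global_sum = sum(
        sum(math.exp(x - global_max) for x in shard) for shard in logit_shards
    )

    # Only the shard that owns the target token id contributes its logit.
    offset = 0
    target_logit = None
    for shard in logit_shards:
        if offset <= target < offset + len(shard):
            target_logit = shard[target - offset]
        offset += len(shard)
    if target_logit is None:
        raise ValueError("target id is outside the vocabulary")

    # loss = log(sum_j exp(z_j)) - z_target, with the max factored out.
    return math.log(global_sum) + global_max - target_logit


if __name__ == "__main__":
    shards = [[1.0, 2.0], [3.0, 4.0]]  # a 4-token vocabulary on 2 devices
    print(sharded_cross_entropy(shards, target=2))
```

Each reduction is a point where all devices must synchronize, which is the kind of communication barrier the abstract says the proposed algorithms aim to reduce.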

Code & Models

Repositories

sail-sg/vocabularyparallelism (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques