PipeFill: Using GPUs During Bubbles in Pipeline-parallel LLM Training

Daiyaan Arfeen; Zhen Zhang; Xinwei Fu; Gregory R. Ganger; Yida Wang

arXiv:2410.07192·cs.DC·October 11, 2024

TL;DR

PipeFill fills the pipeline bubbles of pipeline-parallel LLM training with other pending jobs, raising GPU utilization by up to 63% while slowing the main training job by less than 2%.

Contribution

This paper introduces PipeFill, which fills the pipeline bubbles of pipeline-parallel LLM training with execution of other pending jobs, recovering GPU time that would otherwise sit idle as training scales up.

Findings

01 GPU utilization increased by up to 63%

02 Training slowdown kept below 2%

03 Additional work equivalent to 2,600 GPUs at 8K GPU scale

Abstract

Training Deep Neural Networks (DNNs) with billions of parameters generally involves pipeline-parallel (PP) execution. Unfortunately, PP model training can use GPUs inefficiently, especially at large scale, due to idle GPU time caused by pipeline bubbles, which are often 15-30% and can exceed 60% of the training job's GPU allocation. To improve the GPU utilization of PP model training, this paper describes PipeFill, which fills pipeline bubbles with execution of other pending jobs. By leveraging bubble GPU time, PipeFill reduces the GPU utilization sacrifice associated with scaling-up of large-model training. To context-switch between fill jobs and the main training job with minimal overhead to the main job, and maximize fill job efficiency, PipeFill carefully fits fill job work to measured bubble durations and GPU memory availability, introduces explicit pipeline-bubble instructions,…
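
To make the bubble sizes and the fitting idea concrete, here is a minimal Python sketch. It is not PipeFill's implementation. `bubble_fraction` uses the common idealized estimate for a synchronous pipeline schedule with equal-time stages, which is an assumption and not a formula taken from the abstract. `FillChunk` and `fit_chunks` are hypothetical names that illustrate, with a simple greedy rule, what fitting fill-job work to a measured bubble duration and to free GPU memory could look like.

```python
from dataclasses import dataclass


def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idealized idle fraction of a synchronous pipeline schedule.

    Assumes equal-time stages; a textbook estimate, not PipeFill's measurement.
    """
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


@dataclass
class FillChunk:
    """Hypothetical unit of fill-job work with a profiled cost."""
    name: str
    duration_ms: float
    memory_gb: float


def fit_chunks(chunks, bubble_ms, free_memory_gb):
    """Greedily pick chunks, in order, that fit the bubble.

    Chunks run one after another, so each chunk must fit the free memory
    on its own, and the chosen durations must sum to at most bubble_ms.
    Returns the chosen chunks and the bubble time left unused.
    """
    chosen = []
    remaining_ms = bubble_ms
    for chunk in chunks:
        fits_memory = chunk.memory_gb <= free_memory_gb
        fits_time = chunk.duration_ms <= remaining_ms
        if fits_memory and fits_time:
            chosen.append(chunk)
            remaining_ms -= chunk.duration_ms
    return chosen, remaining_ms


if __name__ == "__main__":
    # Fewer microbatches per stage means a larger bubble.
    print(f"8 stages, 32 microbatches: {bubble_fraction(8, 32):.1%}")
    print(f"16 stages, 8 microbatches: {bubble_fraction(16, 8):.1%}")

    candidates = [
        FillChunk("a", duration_ms=40, memory_gb=4),
        FillChunk("b", duration_ms=70, memory_gb=2),
        FillChunk("c", duration_ms=30, memory_gb=10),
        FillChunk("d", duration_ms=50, memory_gb=3),
    ]
    picked, unused_ms = fit_chunks(candidates, bubble_ms=100, free_memory_gb=8)
    print([c.name for c in picked], unused_ms)
```

Under this estimate, shrinking the number of microbatches per pipeline stage (as happens when a fixed batch is spread over more GPUs) moves the bubble from the 15-30% range toward the 60%+ range the abstract mentions. The greedy rule is only one possible policy; a real scheduler would likely also account for context-switch cost.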
