Pipette: Automatic Fine-grained Large Language Model Training Configurator for Real-World Clusters

Jinkyu Yim; Jaeyong Song; Yerim Choi; Jaebeen Lee; Jaewon Jung; Hongsun Jang; Jinho Lee

arXiv:2405.18093·cs.DC·May 29, 2024

TL;DR

Pipette is an automatic configurator that optimizes large language model training on real-world GPU clusters by considering heterogeneity, communication, and memory constraints, leading to faster and feasible configurations.

Contribution

Pipette introduces a fine-grained, performance-aware configuration method that accounts for the heterogeneity and memory limits of real-world clusters, aspects that prior automatic configurators often ignored.

Findings

01 Achieves significant speedup over previous methods.

02 Provides configurations that satisfy memory constraints.

03 Effectively models heterogeneous interconnect bandwidths.

Abstract

Training large language models (LLMs) is known to be challenging because of the huge computational and memory capacity requirements. To address these issues, it is common to use a cluster of GPUs with 3D parallelism, which splits a model along the data batch, pipeline stage, and intra-layer tensor dimensions. However, the use of 3D parallelism produces the additional challenge of finding the optimal number of ways on each dimension and mapping the split models onto the GPUs. Several previous studies have attempted to automatically find the optimal configuration, but many of these lacked several important aspects. For instance, the heterogeneous nature of the interconnect speeds is often ignored. While the peak bandwidths for the interconnects are usually made equal, the actual attained bandwidth varies per link in real-world clusters. Combined with the critical path modeling that does…
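To make the configuration problem in the abstract concrete, the sketch below enumerates the candidate 3D-parallel splits for a given cluster: every triple of data-, pipeline-, and tensor-parallel ways whose product equals the GPU count. This is only an illustration of the search space, not Pipette's method. The function name `enumerate_3d_configs`, its parameters, and the two pruning rules (tensor parallelism kept within one node, no more pipeline stages than layers) are assumptions made for this example.

```python
def enumerate_3d_configs(num_gpus, num_layers, gpus_per_node=8):
    """List (dp, pp, tp) triples with dp * pp * tp == num_gpus.

    Hypothetical helper for illustration only. Assumed pruning rules:
      - tp <= gpus_per_node (tensor parallelism stays inside one node)
      - pp <= num_layers   (each pipeline stage holds at least one layer)
    """
    configs = []
    for tp in range(1, num_gpus + 1):
        # tp must divide the GPU count and fit inside a single node.
        if num_gpus % tp != 0 or tp > gpus_per_node:
            continue
        remaining = num_gpus // tp
        for pp in range(1, remaining + 1):
            # pp must divide what is left and not exceed the layer count.
            if remaining % pp != 0 or pp > num_layers:
                continue
            dp = remaining // pp
            configs.append((dp, pp, tp))
    return configs


if __name__ == "__main__":
    for dp, pp, tp in enumerate_3d_configs(num_gpus=16, num_layers=24):
        print(f"data={dp} pipeline={pp} tensor={tp}")
```

Even this small 16-GPU example yields more than a dozen candidates, and each one can still be mapped onto the physical GPUs in many ways. A configurator has to rank these candidates by estimated iteration time and discard those that would exceed GPU memory, which is where per-link bandwidth differences and memory estimation come into play.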

Code & Models

Repositories

yimjinkyu1/date2024_pipette
PyTorch · Official

Datasets

Groq/mtob
dataset · 47 dl

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques