Characterization of Large Language Model Development in the Datacenter
Qinghao Hu, Zhisheng Ye, Zerui Wang, Guoteng Wang, Meng Zhang, Qiaoling Chen, Peng Sun, Dahua Lin, Xiaolin Wang, Yingwei Luo, Yonggang Wen, Tianwei Zhang

TL;DR
This paper analyzes large language model development workloads in a datacenter environment. It highlights the challenges encountered, characterizes resource utilization patterns, and proposes system improvements for fault tolerance and scheduling efficiency.
Contribution
The paper characterizes LLM development workloads using a six-month trace from a GPU datacenter and introduces system techniques for fault-tolerant pretraining and decoupled scheduling of evaluation.
Findings
Identifies key resource utilization patterns and failure impacts in LLM training.
Proposes fault-tolerant pretraining to improve robustness against hardware failures.
Introduces decoupled scheduling for more efficient evaluation feedback.
Abstract
Large Language Models (LLMs) have presented impressive performance across several transformative tasks. However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs, often riddled with numerous challenges such as frequent hardware failures, intricate parallelization strategies, and imbalanced resource utilization. In this paper, we present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme. Specifically, we investigate discrepancies between LLMs and prior task-specific Deep Learning (DL) workloads, explore resource utilization patterns, and identify the impact of various job failures. Our analysis summarizes hurdles we encountered and uncovers potential opportunities to optimize systems tailored for LLMs. Furthermore, we introduce our system efforts: (1) fault-tolerant pretraining,…
Taxonomy
Topics: Distributed and Parallel Computing Systems · Topic Modeling · Scientific Computing and Data Management
