Hecaton: Training Large Language Models with Scalable Chiplet Systems

Zongle Huang; Shupei Fan; Chen Tang; Xinyuan Lin; Shuwen Deng; Yongpan Liu

arXiv:2407.05784 · cs.AR · November 28, 2024

Open Access

TL;DR

Hecaton introduces a scalable chiplet system tailored for large language model training, reducing communication overheads and improving performance and energy efficiency compared to traditional tensor parallelism methods.

Contribution

This work presents the first chiplet architecture specifically designed for LLM training, with tailored scheduling and distributed training methods to enhance scalability and efficiency.

Findings

01

Achieves a 5.29x performance improvement on Llama3.1-405B relative to tensor parallelism.

02

Reduces energy consumption by a factor of 3.46.

03

Maintains weak scaling when the workload and hardware resources grow proportionally.

Abstract

Large Language Models (LLMs) have achieved remarkable success in various fields, but their training and finetuning require massive computation and memory, necessitating parallelism which introduces heavy communication overheads. Driven by advances in packaging, the chiplet architecture emerges as a potential solution, as it can integrate computing power, as well as utilize on-package links with better signal integrity, higher bandwidth, and lower energy consumption. However, most existing chiplet-related works focus on DNN inference. Directly porting them to LLM training introduces significantly large quantities of DRAM access and network-on-package (NoP) overheads which make state-of-the-art chiplet designs fail, highlighting a research gap. This work proposes Hecaton, a scalable and cost-effective chiplet system for LLM training. We first provide a chiplet architecture with tailored…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques