Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning

Wei An; Xiao Bi; Guanting Chen; Shanhuang Chen; Chengqi Deng; Honghui Ding; Kai Dong; Qiushi Du; Wenjun Gao; Kang Guan; Jianzhong Guo; Yongqiang Guo; Zhe Fu; Ying He; Panpan Huang; Jiashi Li; Wenfeng Liang; Xiaodong Liu; Xin Liu; Yiyuan Liu; Yuxuan Liu; Shanghao Lu; Xuan Lu; Xiaotao Nie; Tian Pei; Junjie Qiu; Hui Qu; Zehui Ren; Zhangli Sha; Xuecheng Su; Xiaowen Sun; Yixuan Tan; Minghui Tang; Shiyu Wang; Yaohui Wang; Yongji Wang; Ziwei Xie; Yiliang Xiong; Yanhong Xu; Shengfeng Ye; Shuiping Yu; Yukun Zha; Liyue Zhang; Haowei Zhang; Mingchuan Zhang; Wentao Zhang; Yichao Zhang; Chenggang Zhao; Yao Zhao; Shangyan Zhou; Shunfeng Zhou; Yuheng Zou

arXiv:2408.14158 · cs.DC · September 4, 2024

Open Access

TL;DR

The paper presents Fire-Flyer AI-HPC, a hardware-software co-design framework for deep learning HPC systems that halves cost and cuts energy consumption by 40% relative to DGX-A100 while achieving comparable performance.

Contribution

It introduces a co-designed architecture and software stack (HFReduce, HaiScale, 3FS, and HAI-Platform) aimed at scalability and efficiency in AI-HPC, deployed in practice on a cluster of 10,000 PCIe A100 GPUs.

Findings

01. Cost reduced by 50% compared to DGX-A100
02. Energy consumption reduced by 40%
03. High scalability achieved by overlapping computation and communication

Abstract

The rapid progress in Deep Learning (DL) and Large Language Models (LLMs) has exponentially increased demand for computational power and bandwidth. This, combined with the high cost of faster computing chips and interconnects, has significantly inflated High Performance Computing (HPC) construction costs. To address these challenges, we introduce the Fire-Flyer AI-HPC architecture, a synergistic hardware-software co-design framework, and its best practices. For DL training, we deployed the Fire-Flyer 2 with 10,000 PCIe A100 GPUs, achieving performance approximating that of the DGX-A100 while reducing costs by half and energy consumption by 40%. We specifically engineered HFReduce to accelerate allreduce communication and implemented numerous measures to keep our Computation-Storage Integrated Network congestion-free. Through our software stack, including HaiScale, 3FS, and HAI-Platform, we…
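For readers unfamiliar with the allreduce collective that HFReduce targets, the following is a minimal single-process sketch of what the operation computes: every worker ends up with the elementwise sum of all workers' buffers. It simulates a generic ring allreduce (reduce-scatter followed by allgather); this is a textbook scheme chosen for illustration, not HFReduce's algorithm, which the abstract does not describe. The function name `ring_allreduce` and the requirement that the buffer length be divisible by the worker count are assumptions of this sketch.

```python
def ring_allreduce(buffers):
    """Simulate a sum-allreduce across len(buffers) workers arranged in a ring.

    Illustrative only: a generic ring algorithm, not the paper's HFReduce.
    Assumes equal-length buffers whose length is divisible by the worker count.
    """
    n = len(buffers)
    length = len(buffers[0])
    assert all(len(b) == length for b in buffers), "buffers must have equal length"
    assert length % n == 0, "sketch assumes length divisible by worker count"
    chunk = length // n
    data = [list(b) for b in buffers]  # copy so the inputs are not mutated

    def span(c):
        # Index range covered by chunk c.
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: in step s, worker r sends chunk (r - s) mod n
    # to its right neighbour, which adds it into its own copy of that chunk.
    for s in range(n - 1):
        sends = [(r, (r - s) % n) for r in range(n)]
        payloads = [data[r][span(c)] for r, c in sends]  # snapshot before applying
        for (r, c), payload in zip(sends, payloads):
            dst = (r + 1) % n
            data[dst][span(c)] = [a + b for a, b in zip(data[dst][span(c)], payload)]

    # Worker r now holds the fully reduced chunk (r + 1) mod n.
    # Phase 2, allgather: in step s, worker r forwards chunk (r + 1 - s) mod n
    # to its right neighbour, which overwrites its stale copy.
    for s in range(n - 1):
        sends = [(r, (r + 1 - s) % n) for r in range(n)]
        payloads = [data[r][span(c)] for r, c in sends]
        for (r, c), payload in zip(sends, payloads):
            data[(r + 1) % n][span(c)] = payload

    return data


if __name__ == "__main__":
    grads = [[1, 2, 3, 4, 5, 6], [10, 20, 30, 40, 50, 60], [100, 200, 300, 400, 500, 600]]
    for worker, result in enumerate(ring_allreduce(grads)):
        print(worker, result)
```

In this ring scheme each worker sends 2(n-1) chunks of size 1/n of the buffer, so per-worker traffic stays below twice the buffer size regardless of the number of workers.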

Taxonomy

Topics: Parallel Computing and Optimization Techniques