FPGA-based AI Smart NICs for Scalable Distributed AI Training Systems
Rui Ma, Evangelos Georganas, Alexander Heinecke, Andrew Boutros, Eriko Nurvitadhi

TL;DR
This paper introduces FPGA-based AI smart NICs that accelerate collective communication, notably all-reduce, in distributed AI training, improving performance by 1.6x on a 6-node system and by an estimated 2.5x at 32 nodes.
Contribution
The paper presents a novel FPGA-based smart NIC design that accelerates all-reduce operations and optimizes bandwidth, enabling scalable and efficient distributed AI training.
Findings
Achieved 1.6x performance improvement on 6 nodes
Validated an analytical model for projecting performance at larger system scales
Estimated 2.5x performance gain at 32 nodes
Abstract
Rapid advances in artificial intelligence (AI) technology have led to significant accuracy improvements in a myriad of application domains at the cost of larger and more compute-intensive models. Training such models on massive amounts of data typically requires scaling to many compute nodes and relies heavily on collective communication algorithms, such as all-reduce, to exchange the weight gradients between different nodes. The overhead of these collective communication operations in a distributed AI training system can bottleneck its performance, with more pronounced effects as the number of nodes increases. In this paper, we first characterize the all-reduce operation overhead by profiling distributed AI training. Then, we propose a new smart network interface card (NIC) for distributed AI training systems using field-programmable gate arrays (FPGAs) to accelerate all-reduce…
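To make the all-reduce step concrete, here is a minimal pure-Python sketch of a ring all-reduce that sums gradient vectors across simulated nodes. The ring algorithm is one common way to implement all-reduce and is an assumption made for illustration; the summary above does not say which algorithm the proposed smart NIC implements. The function name `ring_all_reduce` and the list-of-lists representation of per-node gradients are hypothetical.

```python
def ring_all_reduce(node_grads):
    """Simulate a sum all-reduce over a logical ring of nodes.

    node_grads: one equally sized list of numbers per node.
    Returns a new list per node, each holding the element-wise sum.
    Illustrative only; not the design described in the paper.
    """
    n = len(node_grads)
    size = len(node_grads[0])
    assert all(len(g) == size for g in node_grads), "gradient sizes differ"

    # Work on copies so the caller's inputs are left untouched.
    data = [list(g) for g in node_grads]

    # Split each vector into n chunks; chunk c covers [bounds[c], bounds[c+1]).
    bounds = [c * size // n for c in range(n + 1)]

    def chunk(c):
        c %= n
        return range(bounds[c], bounds[c + 1])

    # Phase 1, reduce-scatter: in step s, node i sends chunk (i - s) to its
    # right neighbour, which adds it to its own copy. After n - 1 steps,
    # node i holds the fully reduced chunk (i + 1) mod n.
    for s in range(n - 1):
        # Snapshot all payloads first so the sends behave as simultaneous.
        msgs = [(i, i - s, [data[i][k] for k in chunk(i - s)]) for i in range(n)]
        for i, c, payload in msgs:
            dst = (i + 1) % n
            for k, v in zip(chunk(c), payload):
                data[dst][k] += v

    # Phase 2, all-gather: in step s, node i forwards the completed chunk
    # (i + 1 - s) to its right neighbour, which overwrites its own copy.
    for s in range(n - 1):
        msgs = [(i, i + 1 - s, [data[i][k] for k in chunk(i + 1 - s)]) for i in range(n)]
        for i, c, payload in msgs:
            dst = (i + 1) % n
            for k, v in zip(chunk(c), payload):
                data[dst][k] = v

    return data


if __name__ == "__main__":
    grads = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
    for node, reduced in enumerate(ring_all_reduce(grads)):
        print(node, reduced)
```

Run as a script, every node ends up with `[111, 222, 333, 444]`. Each node exchanges 2(n - 1) chunk-sized messages per all-reduce, which illustrates why the cost of this step grows with the number of nodes.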
Taxonomy
Topics: Advanced Memory and Neural Computing · Advanced Neural Network Applications · Machine Learning and ELM
