SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts

Raghu Prabhakar; Ram Sivaramakrishnan; Darshan Gandhi; Yun Du; Mingran Wang; Xiangyu Song; Kejie Zhang; Tianren Gao; Angela Wang; Karen Li; Yongning Sheng; Joshua Brot; Denis Sokolov; Apurv Vivek; Calvin Leung; Arjun Sabnis; Jiayu Bai; Tuowen Zhao; Mark Gottscho; David Jackson; Mark Luttrell; Manish K. Shah; Edison Chen; Kaizhao Liang; Swayambhoo Jain; Urmish Thakker; Dawei Huang; Sumti Jairath; Kevin J. Brown; Kunle Olukotun

arXiv:2405.07518 · cs.AR · November 6, 2024 · 1 citation

Open Access

TL;DR

This paper presents Samba-CoE, a scalable, efficient system combining Composition of Experts, dataflow architecture, and a three-tier memory system on SambaNova hardware, significantly improving AI deployment performance and cost-effectiveness.

Contribution

It introduces Samba-CoE, a novel modular AI system with 150 experts and a trillion parameters, optimized for enterprise inference on SambaNova hardware, addressing memory and utilization challenges.

Findings

01 Achieved 2x to 13x speedups on benchmarks with 8 RDU sockets.

02 Reduced machine footprint by up to 19x for CoE inference.

03 Sped up model switching by 15x to 31x, outperforming traditional hardware.

Abstract

Monolithic large language models (LLMs) like GPT-4 have paved the way for modern generative AI applications. Training, serving, and maintaining monolithic LLMs at scale, however, remains prohibitively expensive and challenging. The disproportionate increase in the compute-to-memory ratio of modern AI accelerators has created a memory wall, necessitating new methods to deploy AI. Composition of Experts (CoE) is an alternative modular approach that lowers the cost and complexity of training and serving. However, this approach presents two key challenges when using conventional hardware: (1) without fused operations, smaller models have lower operational intensity, which makes high utilization more challenging to achieve; and (2) hosting a large number of models can be either prohibitively expensive or slow when dynamically switching between them. In this paper, we describe how combining…
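
The first challenge in the abstract, lower operational intensity without fused operations, can be illustrated with a small back-of-the-envelope calculation. The sketch below is not taken from the paper: it assumes a single matrix multiply followed by one elementwise operation, a 2-byte data type, and an idealized memory model in which every operand is read from and written to off-chip memory exactly once. The function names and shapes are hypothetical and chosen only for illustration.

```python
def total_flops(m: int, k: int, n: int) -> int:
    """FLOPs for an (m x k) @ (k x n) matmul plus one elementwise op on the result."""
    return 2 * m * k * n + m * n


def unfused_bytes(m: int, k: int, n: int, dtype_bytes: int = 2) -> int:
    """Bytes moved when the matmul and the elementwise op run as separate kernels.

    Assumption: the intermediate result is written to memory and read back.
    """
    matmul = (m * k + k * n + m * n) * dtype_bytes  # read A and B, write C
    elementwise = (m * n + m * n) * dtype_bytes     # read C, write D
    return matmul + elementwise


def fused_bytes(m: int, k: int, n: int, dtype_bytes: int = 2) -> int:
    """Bytes moved when both ops are fused and the intermediate stays on chip."""
    return (m * k + k * n + m * n) * dtype_bytes    # read A and B, write D


def operational_intensity(m: int, k: int, n: int, fused: bool,
                          dtype_bytes: int = 2) -> float:
    """FLOPs per byte of memory traffic under the idealized model above."""
    moved = fused_bytes(m, k, n, dtype_bytes) if fused else unfused_bytes(m, k, n, dtype_bytes)
    return total_flops(m, k, n) / moved


if __name__ == "__main__":
    # Hypothetical shapes: a small batch (m) against a 4096 x 4096 weight matrix.
    for m in (1, 64, 4096):
        unfused = operational_intensity(m, 4096, 4096, fused=False)
        fused = operational_intensity(m, 4096, 4096, fused=True)
        print(f"m={m}: unfused={unfused:.2f} FLOPs/byte, fused={fused:.2f} FLOPs/byte")
```

Under these assumptions the fused variant always moves fewer bytes for the same work, so its operational intensity is higher. Real kernels involve caching, tiling, and longer operator chains, so the numbers are indicative only.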

Taxonomy

Topics: Scientific Computing and Data Management

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Dense Connections · Position-Wise Feed-Forward Layer · Dropout · Label Smoothing · Residual Connection · Absolute Position Encodings · Byte Pair Encoding