A Switch-Centric In-Network Architecture for Accelerating LLM Inference in Shared-Memory Network
Aojie Jiang, Kang Zhu, Zhiheng Zhang, Zhengxu Su, Juntao Liu, Yuan Du, Li Du

TL;DR
This paper introduces SCIN, a switch-centric in-network architecture that accelerates collective communication for large language model inference, reducing latency and bandwidth usage with a novel in-switch accelerator and in-network quantization.
Contribution
SCIN is the first switch-centric architecture enabling direct memory access for in-network processing, improving All-Reduce performance and supporting low-precision quantization for LLM inference.
Findings
SCIN reduces All-Reduce latency by up to 8.7x for small messages.
SCIN nearly doubles effective All-Reduce bandwidth by quantizing data to 8 bits in the network, compared with 16-bit All-Reduce.
Simulation shows up to 1.74x speedup in LLaMA-2 inference.
Abstract
In-network computing techniques, exemplified by NVLink SHARP (NVLS), offer a promising approach to addressing the communication bottlenecks in LLM inference by offloading collective operations such as All-Reduce to switches. However, the accelerator-centric architecture of NVLS suffers from two fundamental limitations: 1) it relies on GPU load instructions to trigger in-switch reduction, which means that the data reduced in the switch must be transferred back to the initiating GPU rather than being broadcast directly, thereby introducing unnecessary communication overhead; 2) due to its architectural constraints, NVLS cannot offload operators that are not decomposable into memory-semantic instructions, such as the in-network quantization (INQ) proposed in this work. As a result, All-Reduce in NVLS during inference still operates at 16-bit precision, leading to substantial bandwidth…
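To make the bandwidth argument concrete, here is a minimal Python sketch of why moving an All-Reduce payload from 16-bit to 8-bit values nearly halves the bytes on the wire. It is not the paper's INQ design: the block size of 64, the symmetric per-block scaling scheme, the 16-bit scale per block, and all function names are assumptions chosen for illustration, and the summation merely stands in for the in-switch reduction.

```python
BLOCK = 64  # hypothetical block size; the source does not state one


def all_reduce_sum(per_gpu):
    """Element-wise sum across GPUs, standing in for the in-switch reduction."""
    return [sum(column) for column in zip(*per_gpu)]


def quantize_block(values):
    """Symmetric 8-bit quantization of one block (assumed scheme).

    Returns (scale, q) where each q[i] is an integer in [-127, 127].
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return scale, q


def dequantize_block(scale, q):
    """Recover approximate values from the 8-bit codes."""
    return [scale * x for x in q]


def payload_bytes_fp16(n):
    """Bytes needed to send n values at 16-bit precision."""
    return 2 * n


def payload_bytes_int8(n, block=BLOCK):
    """Bytes for n 8-bit values plus one assumed 16-bit scale per block."""
    n_blocks = -(-n // block)  # ceiling division
    return n + 2 * n_blocks


# Four hypothetical GPUs each contribute a small partial result.
per_gpu = [[0.5 * i, -1.0, 2.0, 0.25] for i in range(1, 5)]
reduced = all_reduce_sum(per_gpu)
scale, codes = quantize_block(reduced)
restored = dequantize_block(scale, codes)

# Payload comparison for a 4096-element message.
ratio = payload_bytes_fp16(4096) / payload_bytes_int8(4096)
```

Under these assumptions a 4096-element message shrinks from 8192 bytes to 4224 bytes, a ratio of about 1.94. The per-block scale metadata is why the gain in this sketch is "nearly" rather than exactly 2x, and each restored value differs from the reduced value by at most half a quantization step.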
