A Switch-Centric In-Network Architecture for Accelerating LLM Inference in Shared-Memory Network

Aojie Jiang; Kang Zhu; Zhiheng Zhang; Zhengxu Su; Juntao Liu; Yuan Du; Li Du

arXiv:2603.28239 · cs.AR · April 9, 2026

TL;DR

This paper introduces SCIN, a switch-centric in-network architecture that accelerates collective communication for large language model inference, reducing latency and bandwidth usage with a novel in-switch accelerator and in-network quantization.

Contribution

SCIN is the first switch-centric architecture enabling direct memory access for in-network processing, improving All-Reduce performance and supporting low-precision quantization for LLM inference.

Findings

01 SCIN reduces All-Reduce latency by up to 8.7x for small messages.

02 SCIN nearly doubles effective All-Reduce bandwidth by using 8-bit in-network quantization in place of 16-bit transfers.

03 Simulation shows up to 1.74x speedup in LLaMA-2 inference.

Abstract

In-network computing techniques, exemplified by NVLink SHARP (NVLS), offer a promising approach to addressing the communication bottlenecks in LLM inference by offloading collective operations such as All-Reduce to switches. However, the accelerator-centric architecture of NVLS suffers from two fundamental limitations: 1) it relies on GPU load instructions to trigger in-switch reduction, which means that the data reduced in the switch must be transferred back to the initiating GPU rather than being broadcast directly, thereby introducing unnecessary communication overhead; 2) due to its architectural constraints, NVLS cannot offload operators that are not decomposable into memory-semantic instructions, such as the in-network quantization (INQ) proposed in this work. As a result, All-Reduce in NVLS during inference still operates at 16-bit precision, leading to substantial bandwidth…
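To make the bandwidth argument concrete, the following is a minimal sketch of an All-Reduce in which each GPU sends 8-bit codes and the switch reduces them before broadcasting. It is not the paper's INQ scheme, which this summary does not describe; it assumes generic symmetric int8 quantization with one scale per block. The function names, the block size of 128, and the 2-byte scale are hypothetical choices for illustration only.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric int8 quantization of one block: returns (codes, scale)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes: List[int], scale: float) -> List[float]:
    """Map int8 codes back to floats using the block's scale."""
    return [c * scale for c in codes]


def switch_all_reduce(per_gpu_blocks: List[List[float]]) -> List[float]:
    """Hypothetical switch-side sum reduction over quantized inputs.

    Each GPU uploads int8 codes plus one scale. The switch dequantizes,
    accumulates in higher precision, requantizes the sum, and broadcasts it.
    The returned list is what every GPU would see after dequantizing.
    """
    packets = [quantize_int8(block) for block in per_gpu_blocks]
    acc = [0.0] * len(per_gpu_blocks[0])
    for codes, scale in packets:
        for i, x in enumerate(dequantize_int8(codes, scale)):
            acc[i] += x
    out_codes, out_scale = quantize_int8(acc)
    return dequantize_int8(out_codes, out_scale)


def payload_bytes(num_elements: int, bits_per_element: int,
                  block_size: int = 0, scale_bytes: int = 2) -> int:
    """Bytes on the wire for one message; block_size=0 means no scales."""
    data = num_elements * bits_per_element // 8
    num_blocks = 0 if block_size == 0 else -(-num_elements // block_size)
    return data + num_blocks * scale_bytes


if __name__ == "__main__":
    blocks = [[1.0, -2.0, 0.5], [0.25, 1.0, -0.5]]
    print("reduced:", switch_all_reduce(blocks))
    fp16 = payload_bytes(4096, 16)
    int8 = payload_bytes(4096, 8, block_size=128)
    print("16-bit bytes:", fp16, "8-bit bytes:", int8)
```

Under these assumed parameters, a 4096-element message takes 8192 bytes at 16 bits and 4160 bytes at 8 bits with per-block scales. The scale overhead is why the gain falls just short of 2x, which is consistent with the "nearly doubles" finding above.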
