Towards Compute-Aware In-Switch Computing for LLMs Tensor-Parallelism on Multi-GPU Systems

Chen Zhang; Qijun Zhang; Zhuoshan Zhou; Yijia Diao; Haibo Wang; Zhe Zhou; Zhipeng Tu; Zhiyao Li; Guangyu Sun; Zhuoran Song; Zhigang Ji; Jingwen Leng; Minyi Guo

arXiv:2605.05628·cs.AR·May 8, 2026

TL;DR

This paper introduces CAIS, a compute-aware in-switch computing framework that improves tensor parallelism efficiency in multi-GPU LLM training by better aligning communication with computation semantics.

Contribution

CAIS is the first framework to integrate compute-aware in-switch computing with techniques for improved resource utilization and overlap in multi-GPU systems.

Findings

01. CAIS achieves 1.38× average end-to-end training speedup over NVLS.

02. CAIS outperforms T3 with 1.61× speedup in tensor parallelism.

03. Evaluation on LLM workloads demonstrates significant acceleration benefits.

Abstract

Tensor parallelism (TP) in large-scale LLM inference and training introduces frequent collective operations that dominate inter-GPU communication. In-switch computing, exemplified by NVLink SHARP (NVLS), accelerates collective operations by reducing redundant data transfer, but its communication-centric design creates a mismatch between its communication mode and the memory-semantics requirements of LLM computation kernels. This mismatch isolates the compute and communication phases, resulting in underutilized resources and limited overlap in multi-GPU systems. To address this limitation, we propose CAIS, the first Compute-Aware In-Switch computing framework that aligns communication modes with the memory-semantics requirements of computation. CAIS consists of three integral techniques: (1) compute-aware ISA and microarchitecture extension to enable compute-aware in-switch…
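To make concrete why tensor parallelism produces the collective operations the abstract refers to, the following is a minimal illustrative sketch of a row-parallel linear layer, in which each rank holds a slice of the weight matrix and the partial outputs must be summed across ranks. It is not CAIS or NVLS code: plain Python lists stand in for GPU tensors, and the helper names (`matmul`, `all_reduce_sum`, `row_parallel_linear`) are hypothetical, chosen only for this example.

```python
# Illustrative sketch only: simulated ranks, no real GPUs, switches, or NVLS calls.

def matmul(a, b):
    """Multiply matrix a (m x k) by matrix b (k x n); both are lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def all_reduce_sum(partials):
    """Element-wise sum of same-shaped matrices.

    This models the result every rank holds after an all-reduce; it does not
    model how the data actually moves between devices.
    """
    rows, cols = len(partials[0]), len(partials[0][0])
    return [[sum(p[r][c] for p in partials) for c in range(cols)] for r in range(rows)]

def row_parallel_linear(x, w, num_ranks):
    """Shard the inner dimension across ranks, compute partials, then all-reduce."""
    k = len(w)
    assert k % num_ranks == 0, "inner dimension must divide evenly across ranks"
    shard = k // num_ranks
    partials = []
    for rank in range(num_ranks):
        lo, hi = rank * shard, (rank + 1) * shard
        x_shard = [row[lo:hi] for row in x]  # columns of x owned by this rank
        w_shard = w[lo:hi]                   # matching rows of w
        partials.append(matmul(x_shard, w_shard))
    return all_reduce_sum(partials)

x = [[1, 2, 3, 4], [5, 6, 7, 8]]
w = [[1, 0], [0, 1], [1, 1], [2, 0]]
reference = matmul(x, w)                      # unsharded result
sharded = row_parallel_linear(x, w, num_ranks=2)
```

In this sketch the sharded and unsharded results agree only because of the final `all_reduce_sum` step. That summation across ranks is the kind of collective the abstract describes in-switch computing as accelerating.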
