TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference

Raja Gond; Nipun Kwatra; Ramachandran Ramjee

arXiv:2505.11329 · cs.DC · May 4, 2026

TL;DR

TokenWeave is a system that enables efficient compute-communication overlap for distributed LLM inference at small token lengths, significantly reducing latency and increasing throughput.

Contribution

TokenWeave introduces a novel fused AllReduce-RMSNorm kernel that leverages GPU features to improve the performance of tensor-parallel inference.
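To make the fused operation concrete, here is a minimal pure-Python sketch of what an AllReduce followed by RMSNorm computes, next to a variant in which each rank reduces and normalizes only its own share of the tokens. It illustrates the arithmetic only and is not the authors' kernel: the function names, the strided token-to-rank assignment, and the sample data are all assumptions made for illustration.

```python
import math

def rmsnorm(row, weight, eps=1e-6):
    """RMSNorm over one token's hidden vector."""
    mean_sq = sum(x * x for x in row) / len(row)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [x * scale * w for x, w in zip(row, weight)]

def reduce_token(partials, t):
    """Sum the per-rank partial outputs for token t (the reduction step)."""
    hidden = len(partials[0][t])
    return [sum(p[t][h] for p in partials) for h in range(hidden)]

def allreduce_then_rmsnorm(partials, weight):
    """Unfused baseline: reduce every token, then normalize every token."""
    num_tokens = len(partials[0])
    summed = [reduce_token(partials, t) for t in range(num_tokens)]
    return [rmsnorm(row, weight) for row in summed]

def fused_sharded(partials, weight):
    """Hypothetical fused variant: each rank reduces and normalizes only
    the tokens it owns, and the results are gathered into one output."""
    world = len(partials)
    num_tokens = len(partials[0])
    out = [None] * num_tokens
    for rank in range(world):
        # Strided ownership of tokens is an assumption of this sketch.
        for t in range(rank, num_tokens, world):
            out[t] = rmsnorm(reduce_token(partials, t), weight)
    return out

# Two simulated ranks, four tokens, hidden size 3 (made-up values).
partials = [
    [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5], [2.0, 0.0, 1.0], [1.0, 1.0, 1.0]],
    [[0.0, 1.0, 1.0], [1.5, 0.5, 2.5], [1.0, 1.0, 1.0], [3.0, 2.0, 1.0]],
]
weight = [1.0, 1.0, 1.0]

baseline = allreduce_then_rmsnorm(partials, weight)
fused = fused_sharded(partials, weight)
print(baseline == fused)
```

Because RMSNorm acts on each token independently, splitting the tokens across ranks does not change the result; the sharded form simply avoids having every rank normalize every token.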

Findings

01. Up to 1.28x latency speedup over baseline.

02. Up to 1.19x higher throughput compared to baseline.

03. Effective for token lengths as small as 1024.

Abstract

Distributed inference of large language models (LLMs) using tensor parallelism can introduce communication overheads of 20% even over GPUs connected via NVLink, a high-speed GPU interconnect. Several techniques have been proposed to mitigate these overheads by decomposing computations into smaller tasks and overlapping communication with these subtasks. However, none of these techniques are turned on by default during tensor-parallel serving in systems like vLLM, SGLang and TensorRT-LLM. This is because the number of tokens processed per iteration is typically kept small to support low-latency serving, and decomposing such smaller workloads to enable communication overlap results in worse performance. Further, the communication itself uses many streaming multiprocessors (SMs) that would otherwise be available for computation, increasing overhead. We present TokenWeave, the first…
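The trade-off described in the abstract can be sketched with a toy timing model. The code below compares an unsplit iteration against a two-way split in which the second half's compute overlaps the first half's communication. Every number and name here is hypothetical: the 80/20 compute-to-communication split assumes the 20% overhead is a share of total iteration time, and `split_penalty` is an invented knob standing in for the efficiency lost when small workloads are decomposed further.

```python
def baseline_time(compute, comm):
    """No overlap: communication starts only after all compute finishes."""
    return compute + comm

def split_overlap_time(compute, comm, split_penalty=0.0):
    """Two-way split: compute A, then (compute B overlapped with comm A),
    then comm B. split_penalty inflates compute to model the efficiency
    lost by running smaller kernels (a hypothetical parameter)."""
    half_compute = compute * (1.0 + split_penalty) / 2.0
    half_comm = comm / 2.0
    return half_compute + max(half_compute, half_comm) + half_comm

# Made-up units: communication is 20% of a 100-unit iteration.
compute, comm = 80.0, 20.0

unsplit = baseline_time(compute, comm)
ideal_split = split_overlap_time(compute, comm, split_penalty=0.0)
costly_split = split_overlap_time(compute, comm, split_penalty=0.3)

print(unsplit, ideal_split, costly_split)
```

In this model the split helps only while the penalty stays small: with no penalty the split iteration is shorter than the unsplit one, while a 30% penalty makes it longer, which mirrors the abstract's point that decomposing small workloads can make performance worse.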

Code & Models

Repositories

microsoft/tokenweave (GitHub)