TL;DR
TokenWeave is a system that enables efficient compute-communication overlap for distributed LLM inference at small token lengths, significantly reducing latency and increasing throughput.
Contribution
TokenWeave introduces a novel fused AllReduce-RMSNorm kernel that leverages GPU hardware features to improve the performance of tensor-parallel inference.
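To make the fused operation concrete, the following plain-Python sketch shows what an AllReduce followed by RMSNorm computes across tensor-parallel ranks. It is a reference for the semantics only, not TokenWeave's kernel: the function names, the list-based tensors, and the `eps` value are all assumptions made for illustration, and a real implementation would run on the GPU and combine both steps in a single kernel.

```python
import math

def all_reduce_sum(partials):
    """Element-wise sum of per-rank partial outputs.

    Models the result every rank holds after an AllReduce (sum).
    `partials` is a list with one vector per hypothetical rank.
    """
    return [sum(values) for values in zip(*partials)]

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: scale x by 1 / sqrt(mean(x^2) + eps), then by a learned weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * w for v, w in zip(x, weight)]

def all_reduce_rms_norm(partials, weight, eps=1e-6):
    """Unfused reference: reduce across ranks first, then normalize.

    A fused kernel would produce the same values without materializing
    the intermediate reduced vector in a separate pass.
    """
    return rms_norm(all_reduce_sum(partials), weight, eps)

# Two hypothetical ranks, each holding a partial result for one token
# with hidden size 2.
partials = [[1.0, 2.0], [3.0, 2.0]]
normed = all_reduce_rms_norm(partials, weight=[1.0, 1.0])
```

Because the normalization needs the fully reduced vector, the two steps are natural candidates for fusion: the reduced values can be normalized as soon as they arrive instead of being written out and read back.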
Findings
Up to 1.28x latency speedup over baseline.
Up to 1.19x higher throughput compared to baseline.
Effective for token lengths as small as 1024.
Abstract
Distributed inference of large language models (LLMs) using tensor parallelism can introduce significant communication overheads even over GPUs connected via NVLink, a high-speed GPU interconnect. Several techniques have been proposed to mitigate these overheads by decomposing computations into smaller tasks and overlapping communication with these subtasks. However, none of these techniques is turned on by default during tensor-parallel serving in systems like vLLM, SGLang and TensorRT-LLM. This is because the number of tokens processed per iteration is typically kept small to support low-latency serving, and decomposing such small workloads to enable communication overlap results in worse performance. Further, the communication itself uses many streaming multiprocessors (SMs) that would otherwise be available for computation, increasing overhead. We present TokenWeave, the first…
