Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs

Octavian Alexandru Trifan; Karthik Sangaiah; Muhammad Awad; Muhammad Osama; Sumanth Gudaparthi; Alexandru Nicolau; Alexander Veidenbaum; Ganesh Dasika

arXiv:2511.02168·cs.DC·November 5, 2025

TL;DR

This paper introduces a systems approach that moves beyond the traditional bulk synchronous parallel (BSP) model to optimize distributed GPU execution for large language models, reducing the synchronization, data-locality, and kernel-launch overheads that BSP imposes.

Contribution

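It proposes fine-grained programming patterns, built on in-kernel communication primitives, that eliminate three performance taxes in distributed GPU workloads (Bulk Synchronous, Inter-Kernel Data Locality, and Kernel Launch Overhead), enabling more efficient LLM training and inference.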

Findings

01. Achieved 10-20% speedup in end-to-end latency.

02. Demonstrated effectiveness on kernels like All-Gather and matrix multiplication.

03. Provided a flexible, programmable framework for distributed LLM workloads.

Abstract

As large language models (LLMs) continue to scale, their workloads increasingly rely on distributed execution across multiple GPUs. However, the conventional bulk synchronous parallel (BSP) model used in such settings introduces significant performance inefficiencies. To characterize these bottlenecks, we introduce the "Three Taxes" (Bulk Synchronous, Inter-Kernel Data Locality, and Kernel Launch Overhead) as an analytical framework. We propose moving beyond the rigid BSP model to address key inefficiencies in distributed GPU execution. By exploiting libraries like Iris for Triton, we gain access to in-kernel communication primitives that enable the design of novel fine-grained programming patterns, offering greater flexibility and performance than traditional BSP-based approaches. These patterns systematically eliminate the three taxes by creating direct, tile-level producer-consumer…
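
To make the Bulk Synchronous tax concrete, the following toy timing model contrasts a BSP schedule (communicate every tile, hit a barrier, then compute) with a tile-level producer-consumer schedule in which each tile is consumed as soon as it arrives. This is only an illustrative sketch, not the paper's method or the Iris API: the function names `bsp_time` and `pipelined_time`, the single-producer, single-consumer setup, and the unit per-tile costs are all assumptions, and the model ignores the Inter-Kernel Data Locality and Kernel Launch Overhead taxes entirely.

```python
# Toy timing model (hypothetical, not taken from the paper).
# comm    = assumed time to deliver one tile from the producer
# compute = assumed time for the consumer to process one tile

def bsp_time(n_tiles, comm, compute):
    """BSP: all tiles are communicated, then a barrier, then all tiles are computed."""
    return n_tiles * comm + n_tiles * compute


def pipelined_time(n_tiles, comm, compute):
    """Tile-level producer-consumer: tile i is consumed once it has arrived
    and the consumer has finished the previous tile."""
    finish = 0.0
    for i in range(n_tiles):
        arrival = (i + 1) * comm      # tiles arrive back to back
        start = max(arrival, finish)  # wait for the data and for the consumer
        finish = start + compute
    return finish


if __name__ == "__main__":
    n, comm, compute = 8, 1.0, 1.0   # made-up unit costs
    print("BSP:      ", bsp_time(n, comm, compute))
    print("Pipelined:", pipelined_time(n, comm, compute))
```

Under these assumed costs the pipelined schedule hides all but one tile's worth of communication behind computation. How much of that benefit carries over to real hardware depends on factors this model leaves out.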

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Big Data and Digital Economy · Cloud Computing and Resource Management