SPD: Sync-Point Drop for Efficient Tensor Parallelism of Large Language Models

Han-Byul Kim; Duc Hoang; Arnav Kundu; Mohammad Samragh; Minsik Cho

arXiv:2502.20727 · cs.DC · June 3, 2025


TL;DR

This paper introduces Sync-Point Drop (SPD), a technique that reduces communication overhead in tensor parallelism for large language models, significantly improving inference speed with minimal accuracy loss.

Contribution

The paper presents SPD, a novel method that selectively drops synchronization points in tensor parallelism to enhance scalability and reduce latency in LLM inference.

Findings

01

SPD achieves roughly a 20% reduction in inference latency.

02

Minimal accuracy degradation (<1%) on LLaMA2-70B.

03

Effective across diverse distributed environments.

Abstract

With the rapid expansion in the scale of large language models (LLMs), efficient distributed inference across multiple computing units has become increasingly critical. However, communication overheads from popular distributed inference techniques such as Tensor Parallelism pose a significant challenge to achieving scalability and low latency. We therefore introduce a novel optimization technique, Sync-Point Drop (SPD), which reduces communication overheads in tensor parallelism by selectively dropping synchronization on attention outputs. Specifically, we first propose a block design that allows execution to proceed without communication through SPD. Second, we apply different SPD strategies to attention blocks based on their sensitivity to model accuracy. The proposed methods effectively alleviate communication bottlenecks while minimizing accuracy degradation during LLM…
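To make the idea of a sync point on attention outputs concrete, here is a minimal pure-Python sketch. It assumes the common tensor-parallel layout in which the attention output projection is split row-wise across devices and followed by an all-reduce. The function names (`matvec`, `all_reduce_sum`, `row_parallel_projection`) and the `sync` flag are hypothetical, introduced only for illustration. This is not the paper's block design, which the abstract does not detail; it only shows what each device holds with and without the synchronization.

```python
import random


def matvec(x, w):
    """Multiply a row vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def all_reduce_sum(partials):
    """Simulated all-reduce: every shard ends up with the elementwise sum."""
    total = [sum(values) for values in zip(*partials)]
    return [list(total) for _ in partials]


def row_parallel_projection(x, w, num_shards, sync=True):
    """Row-parallel projection over simulated shards (hypothetical helper).

    Each shard owns a contiguous slice of x and the matching rows of w.
    With sync=True the partial outputs are all-reduced (the sync point).
    With sync=False each shard keeps only its own partial output.
    """
    assert len(x) % num_shards == 0, "input width must divide evenly"
    step = len(x) // num_shards
    partials = []
    for shard in range(num_shards):
        lo, hi = shard * step, (shard + 1) * step
        partials.append(matvec(x[lo:hi], w[lo:hi]))
    return all_reduce_sum(partials) if sync else partials


if __name__ == "__main__":
    random.seed(0)
    x = [random.uniform(-1, 1) for _ in range(8)]
    w = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(8)]

    synced = row_parallel_projection(x, w, num_shards=2, sync=True)
    local = row_parallel_projection(x, w, num_shards=2, sync=False)

    print("with sync, shard 0 holds:   ", synced[0])
    print("without sync, shard 0 holds:", local[0])
```

With the sync in place, every shard holds the full projection output. Without it, each shard holds only a partial sum, so any nonlinear computation that follows sees a different input on each device. That gap is the source of the accuracy sensitivity the abstract refers to, and presumably why the drop is applied selectively per block.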


Taxonomy

Methods: Softmax · Attention Is All You Need