ISO: Overlap of Computation and Communication within Seqenence For LLM Inference

Bin Xiao; Lei Su

arXiv:2409.11155 · cs.DC · September 18, 2024

Open Access

TL;DR

This paper proposes a sequence-level strategy for overlapping computation and communication in LLM inference. It reduces time consumption in the prefill stage by 35% on the 4090 GPU and by 15% on the A800 GPU.

Contribution

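It introduces a sequence-level overlap method that targets the shortcomings of existing techniques: overlapping matrix computation with communication, or interleaving micro-batches across requests, either falls short of ideal overlap or restricts where it can be applied.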

Findings

01 35% time reduction on the 4090 GPU

02 15% time reduction on the A800 GPU

03 Both reductions apply to the prefill stage of LLM inference

Abstract

In the realm of Large Language Model (LLM) inference, the inherent structure of transformer models coupled with the multi-GPU tensor parallelism strategy leads to a sequential execution of computation and communication. This results in substantial underutilization of computing resources during the communication phase. To mitigate this inefficiency, various techniques have been developed to optimize the use of computational power throughout the communication process. These strategies primarily involve overlapping matrix computations and communications, as well as interleaving micro-batches across different requests. Nonetheless, these approaches either fall short of achieving ideal overlap or impose certain limitations on their application. To overcome these challenges, this paper introduces a novel strategy for computation-communication overlap that operates at the sequence level. This…
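To make the overlap idea concrete, here is a minimal timing sketch in Python. It is not the authors' implementation: it is a toy model that assumes each transformer layer costs a fixed compute time followed by a fixed communication time, that both costs split evenly when the sequence is cut into chunks, and that one compute stream and one communication link can run concurrently. The names `serial_time` and `overlapped_time` and all numbers are hypothetical.

```python
def serial_time(layers: int, compute: float, comm: float) -> float:
    """Baseline: every layer computes, then communicates, with no overlap."""
    return layers * (compute + comm)


def overlapped_time(layers: int, compute: float, comm: float, chunks: int = 2) -> float:
    """Toy schedule: split the sequence into `chunks` pieces so that one
    chunk's communication can run while another chunk computes.

    Assumption (not from the paper): per-chunk costs scale linearly with
    chunk length, and splitting adds no kernel overhead.
    """
    c = compute / chunks          # per-chunk compute time in one layer
    m = comm / chunks             # per-chunk communication time in one layer
    compute_free = 0.0            # time at which the compute stream is idle
    comm_free = 0.0               # time at which the communication link is idle
    ready = [0.0] * chunks        # time at which each chunk's layer input is available

    for _ in range(layers):
        for k in range(chunks):
            # A chunk can compute once the stream is idle and its input has arrived.
            start = max(compute_free, ready[k])
            compute_free = start + c
            # Its communication starts after its compute and after the link is idle.
            comm_start = max(comm_free, compute_free)
            comm_free = comm_start + m
            ready[k] = comm_free
    return comm_free


if __name__ == "__main__":
    base = serial_time(layers=2, compute=4.0, comm=2.0)
    split = overlapped_time(layers=2, compute=4.0, comm=2.0, chunks=2)
    print(f"serial: {base}, overlapped: {split}, saved: {1 - split / base:.0%}")
```

With these made-up costs the model gives 12 time units for the serial schedule and 9 with two chunks, a 25% saving. Real gains depend on the actual ratio of compute to communication on a given GPU and interconnect, and on effects this sketch ignores, such as reduced kernel efficiency for shorter chunks and the dependence of later tokens on earlier ones in attention.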

Taxonomy

Topics: Natural Language Processing Techniques