ISO: Overlap of Computation and Communication within Sequence for LLM Inference
Bin Xiao, Lei Su

TL;DR
This paper proposes a sequence-level strategy for overlapping computation and communication in LLM inference. It reports a reduction in prefill-stage time of about 35% on 4090 GPUs and 15% on A800 GPUs.
Contribution
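It introduces a sequence-level method for overlapping computation with communication in LLM inference, aimed at the two shortcomings the authors identify in existing techniques: incomplete overlap and restrictions on where they can be applied.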
Findings
35% time reduction on 4090 GPU
15% time reduction on A800 GPU
Effective in prefill stage of LLM inference
Abstract
In the realm of Large Language Model (LLM) inference, the inherent structure of transformer models coupled with the multi-GPU tensor parallelism strategy leads to a sequential execution of computation and communication. This results in substantial underutilization of computing resources during the communication phase. To mitigate this inefficiency, various techniques have been developed to optimize the use of computational power throughout the communication process. These strategies primarily involve overlapping matrix computations and communications, as well as interleaving micro-batches across different requests. Nonetheless, these approaches either fall short of achieving ideal overlap or impose certain limitations on their application. To overcome these challenges, this paper introduces a novel strategy for computation-communication overlap that operates at the sequence level. This…
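The idea of overlapping at the sequence level can be illustrated with a toy timing model. This is only a sketch under stated assumptions, not the authors' implementation: the function names, the two-stream model (one compute stream, one communication stream), and the assumption that compute and communication cost scale linearly with chunk length are all hypothetical. It also ignores the attention dependency between chunks. Each "step" stands for one compute block followed by its all-reduce under tensor parallelism.

```python
def sequential_time(n_steps, compute, comm):
    """Baseline: every step runs compute, then communication, back to back."""
    return n_steps * (compute + comm)


def overlapped_time(n_steps, compute, comm, n_chunks=2):
    """Toy model: split the sequence into n_chunks and pipeline them.

    Assumption (hypothetical): per-chunk cost is the full cost divided by
    n_chunks. One compute stream and one communication stream exist, so a
    chunk's communication can run while another chunk computes.
    """
    c = compute / n_chunks
    m = comm / n_chunks
    compute_free = 0.0           # time at which the compute stream is idle
    comm_free = 0.0              # time at which the communication stream is idle
    ready = [0.0] * n_chunks     # time at which each chunk can start its next step
    for _ in range(n_steps):
        for k in range(n_chunks):
            # A chunk computes once the stream is free and its previous
            # communication has finished.
            end_c = max(compute_free, ready[k]) + c
            compute_free = end_c
            # Its communication starts after its own compute and after any
            # earlier communication on the shared link.
            end_m = max(comm_free, end_c) + m
            comm_free = end_m
            ready[k] = end_m
    return max(ready)


if __name__ == "__main__":
    steps, compute, comm = 4, 2.0, 2.0
    print("sequential:", sequential_time(steps, compute, comm))
    print("overlapped:", overlapped_time(steps, compute, comm))
```

In this model, the benefit is largest when compute and communication take similar time, because each stream can then hide almost all of the other's work. When one dominates, only the shorter of the two is hidden, which is one plausible reading of why the reported gains differ between GPU types.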
Taxonomy
Topics: Natural Language Processing Techniques
