Flash Communication: Reducing Tensor Parallelization Bottleneck for Fast Large Language Model Inference

Qingyuan Li; Bo Zhang; Liang Ye; Yifan Zhang; Wei Wu; Yerui Sun; Lin Ma; Yuchen Xie

arXiv:2412.04964·cs.AI·December 12, 2024

TL;DR

This paper presents Flash Communication, a low-bit compression technique that reduces the communication bottleneck of tensor parallelism during large language model inference, speeding up intra-node communication by more than 3x and reducing time-to-first-token by 2x with nearly no accuracy loss.

Contribution

The paper introduces a novel low-bit compression method that alleviates the communication bottleneck of tensor parallelism in LLM inference, improving speed with nearly no sacrifice in model accuracy.

Findings

01 Intra-node communication speed increased by over 3x

02 Time-to-first-token reduced by 2x

03 Model accuracy remains nearly unaffected

Abstract

The ever-increasing sizes of large language models necessitate distributed solutions for fast inference that exploit multi-dimensional parallelism, where computational loads are split across various accelerators such as GPU clusters. However, this approach often introduces significant communication overhead, especially on devices with limited bandwidth. In this paper, we introduce Flash Communication, a novel low-bit compression technique designed to alleviate the tensor-parallelism communication bottleneck during inference. Our method substantially boosts intra-node communication speed by more than 3x and reduces the time-to-first-token by 2x, with nearly no sacrifice in model accuracy. Extensive experiments on various up-to-date LLMs demonstrate the effectiveness of our approach.
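
To make the idea of low-bit compressed communication concrete, here is a minimal sketch of a simulated sum all-reduce in which every rank sends quantized values instead of full-precision ones. The abstract does not specify the quantization scheme, so this assumes plain symmetric per-tensor quantization (INT8 by default). The function names `quantize`, `dequantize`, `compressed_all_reduce`, and `payload_bytes` are hypothetical and are not taken from the paper, and the code runs in a single process rather than across real GPUs.

```python
from typing import List, Tuple


def quantize(values: List[float], bits: int = 8) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization to signed `bits`-bit integer codes.

    Assumption: this scheme is chosen for illustration only; the paper's
    actual compression format may differ.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest code and clamp into the representable range.
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]


def compressed_all_reduce(partials: List[List[float]], bits: int = 8) -> List[float]:
    """Simulate a sum all-reduce where each rank transmits quantized data.

    `partials[r]` stands in for the partial output held by rank r
    under tensor parallelism.
    """
    # Each rank compresses its partial result before "sending" it.
    messages = [quantize(p, bits) for p in partials]
    # The receiver decompresses every message and accumulates the sum.
    total = [0.0] * len(partials[0])
    for codes, scale in messages:
        for i, v in enumerate(dequantize(codes, scale)):
            total[i] += v
    return total


def payload_bytes(num_values: int, bits: int) -> float:
    """Bytes needed to send `num_values` elements at `bits` bits each (scale excluded)."""
    return num_values * bits / 8


partials = [[0.5, -1.0, 0.25, 2.0], [1.5, 0.5, -0.75, -2.0]]
exact = [sum(col) for col in zip(*partials)]
approx = compressed_all_reduce(partials, bits=8)
max_error = max(abs(a - e) for a, e in zip(approx, exact))
```

In this sketch, sending 8-bit codes instead of 16-bit values halves the payload per element, at the cost of a small rounding error bounded by half a quantization step per rank. How that volume reduction translates into the reported speedups depends on the hardware and kernel details, which this toy example does not model.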
