FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines

Jiaao He; Jidong Zhai

arXiv:2403.11421 · cs.DC · March 19, 2024 · 2 cites

TL;DR

FastDecode introduces a heterogeneous pipeline approach that decomposes transformer models to leverage CPU clusters for memory-bound components, significantly boosting GPU throughput and reducing costs in large language model serving.

Contribution

The paper proposes a novel model decomposition and scheduling method that effectively utilizes CPU resources to improve LLM serving efficiency on GPUs.

Findings

1. Achieves 1.88x to 5.04x throughput improvement over vLLM.
2. Effectively offloads memory-bound KV-Cache to CPU clusters.
3. Reduces data transmission overhead and enhances GPU processing efficiency.

Abstract

Serving large language models (LLMs) is costly, and expensive, scarce GPUs are used inefficiently when generating tokens sequentially unless the batch of sequences is enlarged. However, the batch size is limited by constantly reused intermediate results, namely the KV-Cache, which occupies too much memory to fit more sequences into a GPU simultaneously. While the KV-Cache could be offloaded to host memory, the CPU-GPU bandwidth is an inevitable bottleneck. We find a way to decompose transformer models into two parts with different characteristics, one of which includes the memory-bound KV-Cache accessing. Our key insight is that the aggregated memory capacity, bandwidth, and computing power of CPUs across multiple nodes is an efficient option to process this part. Performance improvement comes from reduced data transmission overhead and boosted GPU throughput to process the other…
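The memory argument in the abstract can be made concrete with back-of-the-envelope arithmetic. The sketch below is not taken from the paper: it assumes standard multi-head attention, fp16 storage, and a hypothetical 7B-class configuration (32 layers, hidden size 4096), and all function names are illustrative. It estimates how much KV-Cache one sequence needs, how many sequences fit in a given memory budget, and why attention over the cache stays memory-bound while the weight-sharing linear layers benefit from larger batches.

```python
def kv_cache_bytes_per_token(n_layers, hidden_size, bytes_per_elem=2):
    """Bytes of KV-Cache stored per generated token.

    Assumes standard multi-head attention: each layer keeps one key vector
    and one value vector of width hidden_size per token.
    """
    return 2 * n_layers * hidden_size * bytes_per_elem


def max_batch_size(free_bytes, seq_len, n_layers, hidden_size, bytes_per_elem=2):
    """Largest number of sequences whose KV-Cache fits in free_bytes."""
    per_sequence = seq_len * kv_cache_bytes_per_token(
        n_layers, hidden_size, bytes_per_elem
    )
    return free_bytes // per_sequence


def decode_attention_intensity(seq_len, hidden_size, bytes_per_elem=2):
    """Approximate FLOPs per byte for one decode step of attention in one layer.

    The new token's query is multiplied against seq_len cached keys and the
    resulting weights against seq_len cached values. Every sequence reads its
    own cache, so batching does not raise this ratio.
    """
    flops = 4 * seq_len * hidden_size  # 2 per multiply-add, for QK^T and AV
    bytes_read = 2 * seq_len * hidden_size * bytes_per_elem  # K and V
    return flops / bytes_read


def linear_layer_intensity(batch_size, hidden_size, bytes_per_elem=2):
    """Approximate FLOPs per byte for a square linear layer during decoding.

    Only weight traffic is counted (activations are ignored as a
    simplification); the weights are shared by every sequence in the batch.
    """
    flops = 2 * batch_size * hidden_size * hidden_size
    bytes_read = hidden_size * hidden_size * bytes_per_elem
    return flops / bytes_read


if __name__ == "__main__":
    # Hypothetical configuration, chosen only for illustration.
    layers, hidden = 32, 4096
    per_token = kv_cache_bytes_per_token(layers, hidden)
    print("KV-Cache per token (KiB):", per_token / 1024)
    print("Sequences of 1024 tokens in 16 GiB:",
          max_batch_size(16 * 2**30, 1024, layers, hidden))
    print("Attention FLOPs/byte:", decode_attention_intensity(1024, hidden))
    print("Linear FLOPs/byte at batch 64:", linear_layer_intensity(64, hidden))
```

Under these assumptions the attention ratio stays constant regardless of batch size, while the linear-layer ratio grows in proportion to it, which is consistent with the abstract's point that the KV-Cache part and the remaining part of the model have different characteristics.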

Taxonomy

Topics: Underwater Vehicles and Communication Systems