LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization

Juntao Zhao; Borui Wan; Yanghua Peng; Haibin Lin; Chuan Wu

arXiv:2403.01136 · cs.LG · March 5, 2024 · 3 cites

TL;DR

This paper introduces LLM-PQ, a system that optimizes large language model serving on heterogeneous GPU clusters through adaptive quantization and phase-aware partitioning, improving inference throughput by up to 2.88x and lowering serving cost.

Contribution

The paper presents a novel approach combining mixed-precision quantization and phase-aware partitioning tailored for heterogeneous GPU clusters, enhancing LLM inference efficiency.

Findings

01. Achieves up to 2.88x throughput improvement in inference.
02. Demonstrates effectiveness across 11 different production clusters.
03. Outperforms state-of-the-art methods in LLM serving efficiency.

Abstract

Recent breakthroughs in large-scale language models (LLMs) have demonstrated impressive performance on various tasks. The immense sizes of LLMs have led to very high resource demand and cost for running the models. Though the models are largely served on uniform high-caliber GPUs nowadays, a heterogeneous cluster with a mix of available high- and low-capacity GPUs can potentially reduce the serving cost substantially. Designs that support efficient LLM serving on a heterogeneous cluster are lacking, as current solutions focus on model partition and uniform compression among homogeneous devices. This paper proposes LLM-PQ, a system that advocates adaptive model quantization and phase-aware partition to improve LLM serving efficiency on heterogeneous GPU clusters. We carefully decide on mixed-precision model quantization together with phase-aware model…

Code & Models

Repositories

tonyzhao-jt/LLM-PQ (PyTorch, official)

Taxonomy

Topics: Advanced Data Storage Technologies · Distributed and Parallel Computing Systems · Parallel Computing and Optimization Techniques
