Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity

Tyler Griggs; Xiaoxuan Liu; Jiaxiang Yu; Doyoung Kim; Wei-Lin Chiang; Alvin Cheung; Ion Stoica

arXiv:2404.14527·cs.DC·July 23, 2024·1 cites

TL;DR

This paper introduces Mélange, a framework that optimally allocates heterogeneous GPU types for large language model serving, significantly reducing deployment costs by tailoring GPU choices to specific service characteristics.

Contribution

It presents a novel cost-aware bin packing formulation for GPU allocation that considers service-specific factors and GPU heterogeneity, enabling cost-efficient LLM deployment.
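The idea of cost-aware packing can be illustrated with a small sketch. This is not the paper's formulation: it is a brute-force toy that assumes each request-size bucket is served entirely by one GPU type, and every name and number in it (`GPU_COST`, `MAX_RATE`, `cheapest_allocation`, the prices and rates) is hypothetical. Each bucket's load on a GPU type is its request rate divided by the rate one GPU of that type can sustain within the SLO, and the per-type loads are rounded up to whole GPUs.

```python
import itertools
import math

# Hypothetical hourly prices in USD (illustrative, not taken from the paper).
GPU_COST = {"small": 1.0, "large": 4.0}

# Hypothetical max requests/s that ONE GPU sustains within the SLO,
# per request-size bucket (illustrative, not measured).
MAX_RATE = {
    "small": {"short": 10.0, "long": 1.0},
    "large": {"short": 20.0, "long": 8.0},
}


def cheapest_allocation(workload):
    """Find the cheapest GPU mix for a workload.

    workload maps bucket name -> requests/s.
    Returns (hourly cost, GPU counts per type, bucket -> GPU type).
    Simplifying assumption: a bucket is never split across GPU types.
    """
    buckets = list(workload)
    gpus = list(GPU_COST)
    best = None
    # Enumerate every way of assigning buckets to GPU types.
    for choice in itertools.product(gpus, repeat=len(buckets)):
        load = {g: 0.0 for g in gpus}
        for bucket, gpu in zip(buckets, choice):
            # Fraction of one GPU's capacity consumed by this bucket.
            load[gpu] += workload[bucket] / MAX_RATE[gpu][bucket]
        # Round up to whole GPUs; the epsilon guards against float noise.
        counts = {g: math.ceil(load[g] - 1e-9) for g in gpus}
        cost = sum(counts[g] * GPU_COST[g] for g in gpus)
        if best is None or cost < best[0]:
            best = (cost, counts, dict(zip(buckets, choice)))
    return best


cost, counts, assignment = cheapest_allocation({"short": 20.0, "long": 8.0})
print(cost, counts, assignment)
```

With these made-up numbers the cheapest plan mixes both GPU types (short requests on the cheaper GPU, long requests on the larger one), which mirrors the page's claim that a heterogeneous allocation can cost less than any single-type allocation. Exhaustive enumeration grows exponentially with the number of buckets, so a realistic version would hand the same objective and constraints to an integer programming solver.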

Findings

01. Mélange reduces deployment costs by up to 77% in conversational settings.

02. It reduces deployment costs by up to 33% in document-based settings.

03. The framework adapts to diverse service requirements and GPU options.

Abstract

Large language models (LLMs) are increasingly integrated into many online services, yet they remain cost-prohibitive to deploy due to the requirement of expensive GPU instances. Prior work has addressed the high cost of LLM serving by improving the inference engine, but less attention has been given to selecting the most cost-efficient GPU type(s) for a specific LLM service. There is a large and growing landscape of GPU types and, within these options, higher cost does not always lead to increased performance. Instead, through a comprehensive investigation, we find that three key LLM service characteristics (request size, request rate, SLO) strongly influence GPU cost efficiency, and differing GPU types are most cost efficient for differing LLM service settings. As a result, the most cost-efficient allocation for a given service is typically a mix of heterogeneous GPU types. Based on…

Code & Models

Repositories

tyler-griggs/melange-release (official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis