Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity
Tyler Griggs, Xiaoxuan Liu, Jiaxiang Yu, Doyoung Kim, Wei-Lin Chiang, Alvin Cheung, Ion Stoica

TL;DR
This paper introduces Mélange, a framework that optimally allocates heterogeneous GPU types for large language model serving, significantly reducing deployment costs by tailoring GPU choices to specific service characteristics.
Contribution
It presents a novel cost-aware bin packing formulation for GPU allocation that considers service-specific factors and GPU heterogeneity, enabling cost-efficient LLM deployment.
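As a rough illustration of why a mix of GPU types can be cheaper than any single type, the sketch below brute-forces a toy allocation problem. It is not the paper's formulation or solver: the GPU names, hourly prices, throughput numbers, and the `cheapest_allocation` helper are all hypothetical, and it simplifies by assigning each request-size bucket wholly to one GPU type.

```python
from itertools import product
from math import ceil

# Hypothetical hourly prices (USD) for two illustrative GPU types.
COST_PER_HOUR = {"small_gpu": 1.0, "large_gpu": 4.0}

# Hypothetical maximum request rate (req/s) that one GPU of each type
# sustains within the SLO, per request-size bucket.
MAX_RATE = {
    "small_gpu": {"short": 10.0, "long": 0.5},
    "large_gpu": {"short": 20.0, "long": 4.0},
}


def cheapest_allocation(workload):
    """Return (hourly_cost, gpu_counts, bucket_assignment) with minimum cost.

    `workload` maps a request-size bucket to its arrival rate in req/s.
    Simplification: each bucket is served entirely by one GPU type, and
    load is pooled across all instances of that type.
    """
    buckets = list(workload)
    gpu_types = list(COST_PER_HOUR)
    best = None
    # Enumerate every way of assigning buckets to GPU types.
    for choice in product(gpu_types, repeat=len(buckets)):
        load = {g: 0.0 for g in gpu_types}
        for bucket, gpu in zip(buckets, choice):
            # Fraction of one GPU's capacity that this bucket consumes.
            load[gpu] += workload[bucket] / MAX_RATE[gpu][bucket]
        # Round up to whole GPUs; the epsilon guards against float noise.
        counts = {g: ceil(load[g] - 1e-9) for g in gpu_types}
        cost = sum(counts[g] * COST_PER_HOUR[g] for g in gpu_types)
        if best is None or cost < best[0]:
            best = (cost, counts, dict(zip(buckets, choice)))
    return best


if __name__ == "__main__":
    print(cheapest_allocation({"short": 20.0, "long": 4.0}))
```

With these made-up numbers, serving short requests on the cheaper GPU and long requests on the larger one costs less per hour than using either type alone. A realistic formulation would also allow splitting a bucket across GPU types and would use an integer programming solver rather than enumeration.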
Findings
Mélange reduces deployment costs by up to 77% in conversational settings.
It achieves up to 33% cost savings in document-based settings.
The framework effectively adapts to diverse service requirements and GPU options.
Abstract
Large language models (LLMs) are increasingly integrated into many online services, yet they remain cost-prohibitive to deploy due to the requirement of expensive GPU instances. Prior work has addressed the high cost of LLM serving by improving the inference engine, but less attention has been given to selecting the most cost-efficient GPU type(s) for a specific LLM service. There is a large and growing landscape of GPU types and, within these options, higher cost does not always lead to increased performance. Instead, through a comprehensive investigation, we find that three key LLM service characteristics (request size, request rate, SLO) strongly influence GPU cost efficiency, and differing GPU types are most cost efficient for differing LLM service settings. As a result, the most cost-efficient allocation for a given service is typically a mix of heterogeneous GPU types. Based on…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
