BOute: Cost-Efficient LLM Serving with Heterogeneous LLMs and GPUs via Multi-Objective Bayesian Optimization
Youhe Jiang, Fangcheng Fu, Eiko Yoneki

TL;DR
BOute is a system that uses multi-objective Bayesian optimization to jointly optimize query routing and GPU deployment, significantly reducing costs and improving efficiency in large language model serving.
Contribution
It introduces a novel co-optimization framework for heterogeneous LLM and GPU deployment, addressing complex management challenges in cost-efficient serving systems.
Findings
BOute outperforms state-of-the-art systems by up to 157% in efficiency.
It reduces serving costs by 15%-61% while maintaining quality.
It demonstrates effective joint optimization of query routing and GPU deployment.
Abstract
The rapid growth of large language model (LLM) deployments has made cost-efficient serving systems essential. Recent efforts to enhance system cost-efficiency adopt two main perspectives: (i) an algorithmic perspective that exploits heterogeneous model capabilities to route simpler queries to lower-cost models and complex queries to higher-cost models (i.e., heterogeneous query routing); and (ii) a systems perspective that utilizes heterogeneous GPU resources as cost-effective alternatives to homogeneous high-end GPUs (i.e., heterogeneous model deployment). However, algorithm-system co-design for cost-efficient LLM serving necessitates sophisticated management: (i) determining optimal query routing strategies under latency and quality requirements, (ii) configuring model deployment across heterogeneous GPUs with appropriate resource allocation and parallelism strategies, and (iii)…
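As an illustration of the heterogeneous query routing idea described in the abstract, here is a minimal sketch, assuming each query already comes with a difficulty score in [0, 1] and that a single fixed threshold decides between two models. The names `ModelTier`, `route`, `total_cost`, the model names, and the cost figures are all hypothetical. This is not BOute's algorithm, which the abstract says uses multi-objective Bayesian optimization to choose routing and deployment jointly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    """A hypothetical model option with a relative per-query cost."""
    name: str
    cost_per_query: float  # made-up cost units, for illustration only


def route(difficulty: float, threshold: float,
          small: ModelTier, large: ModelTier) -> ModelTier:
    """Send queries scoring below the threshold to the cheaper model."""
    return small if difficulty < threshold else large


def total_cost(difficulties, threshold, small, large) -> float:
    """Sum the per-query cost of whichever model each query is routed to."""
    return sum(route(d, threshold, small, large).cost_per_query
               for d in difficulties)


SMALL = ModelTier("small-llm", 1.0)
LARGE = ModelTier("large-llm", 10.0)

# Assumed difficulty scores for four queries.
scores = [0.1, 0.4, 0.9, 0.2]

routed_cost = total_cost(scores, 0.5, SMALL, LARGE)     # mixed routing
all_large_cost = total_cost(scores, 0.0, SMALL, LARGE)  # everything to large
```

With these made-up numbers, only the hardest query goes to the large model, so the routed cost is lower than sending every query there. A real system would also have to account for response quality and latency, which this sketch ignores.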
Taxonomy
Topics: Natural Language Processing Techniques · Big Data and Digital Economy · Topic Modeling
