GUIDE: A Global Unified Inference Engine for Deploying Large Language Models in Heterogeneous Environments
Yanyu Chen, Ganhong Huang

TL;DR
GUIDE is a framework that uses dynamic modeling and simulation-based optimization to improve the deployment efficiency of large language models across diverse hardware and workload scenarios, addressing memory, latency, and throughput bottlenecks.
Contribution
We introduce GUIDE, a systematic inference engine that predicts and optimizes LLM performance in heterogeneous environments, enabling non-experts to deploy models efficiently.
Findings
Prediction errors range from 9.9% to 42.3% for key metrics.
Effectively bridges the gap between theoretical and practical performance.
Addresses memory, latency, and throughput bottlenecks in LLM deployment.
Abstract
Efficiently deploying large language models (LLMs) in real-world scenarios remains a critical challenge, primarily due to hardware heterogeneity, inference framework limitations, and workload complexities. These challenges often lead to inefficiencies in memory utilization, latency, and throughput, hindering the effective deployment of LLMs, especially for non-experts. Through extensive experiments, we identify key performance bottlenecks, including sudden drops in memory utilization, latency fluctuations with varying batch sizes, and inefficiencies in multi-GPU configurations. These insights reveal a vast optimization space shaped by the intricate interplay of hardware, frameworks,…
Taxonomy
Topics: Topic Modeling
