GUIDE: A Global Unified Inference Engine for Deploying Large Language Models in Heterogeneous Environments

Yanyu Chen; Ganhong Huang

arXiv:2412.04788·cs.AI·January 28, 2025

TL;DR

GUIDE is a comprehensive framework that uses dynamic modeling and simulation-based optimization to improve the deployment efficiency of large language models across diverse hardware and workload scenarios, addressing key bottlenecks.

Contribution

We introduce GUIDE, a systematic inference engine that predicts and optimizes LLM performance in heterogeneous environments, enabling non-experts to deploy models efficiently.

Findings

01. GUIDE's predictions show errors between 9.9% and 42.3% for key metrics.

02. GUIDE effectively bridges the gap between theoretical and practical performance.

03. GUIDE addresses memory, latency, and throughput bottlenecks in LLM deployment.

Abstract

Efficiently deploying large language models (LLMs) in real-world scenarios remains a critical challenge, primarily due to hardware heterogeneity, inference framework limitations, and workload complexities. These challenges often lead to inefficiencies in memory utilization, latency, and throughput, hindering the effective deployment of LLMs, especially for non-experts. Through extensive experiments, we identify key performance bottlenecks, including sudden drops in memory utilization, latency fluctuations with varying batch sizes, and inefficiencies in multi-GPU configurations. These insights reveal a vast optimization space shaped by the intricate interplay of hardware, frameworks,…

Taxonomy

Topics: Topic Modeling