Efficient Serving of LLM Applications with Probabilistic Demand Modeling
Yifei Liu, Zuo Gan, Zhenghao Gan, Weiye Wang, Chen Chen, Yizhou Shan, Xusheng Chen, Zhenhua Han, Yifei Zhu, Shixuan Sun, Minyi Guo

TL;DR
This paper introduces Hermes, a system that models LLM application demands with a probabilistic graph to optimize scheduling and prewarming, significantly improving serving efficiency and reducing completion times.
Contribution
The paper presents PDGraph, a probabilistic graph for accurate demand modeling, and Hermes, a serving system that combines this model with the Gittins policy to optimize scheduling and backend prewarming for LLM applications.
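To make the idea of a probabilistic demand graph concrete, here is a minimal sketch. It is not the paper's PDGraph definition: the graph shape (one demand distribution per task, probabilistic transitions between tasks), the task names, and all numbers are assumptions chosen for illustration. It estimates an application's total demand by random walks over the graph.

```python
import random
from typing import Dict, List, Optional, Tuple

# Hypothetical application graph. Task names, demand values, and
# probabilities are illustrative assumptions, not taken from the paper.
DEMAND: Dict[str, List[Tuple[float, float]]] = {
    "plan":   [(50.0, 0.7), (200.0, 0.3)],   # (demand units, probability)
    "search": [(20.0, 1.0)],
    "answer": [(100.0, 0.5), (300.0, 0.5)],
}
NEXT: Dict[str, List[Tuple[Optional[str], float]]] = {
    "plan":   [("search", 0.6), ("answer", 0.4)],
    "search": [("plan", 0.2), ("answer", 0.8)],
    "answer": [(None, 1.0)],                  # None marks application exit
}


def _pick(rng: random.Random, options):
    """Draw one value from a list of (value, probability) pairs."""
    values = [value for value, _ in options]
    weights = [weight for _, weight in options]
    return rng.choices(values, weights=weights, k=1)[0]


def sample_total_demand(rng: random.Random, start: str = "plan",
                        max_steps: int = 1000) -> float:
    """Walk the graph once and return the total demand along the path."""
    total = 0.0
    node: Optional[str] = start
    steps = 0
    while node is not None and steps < max_steps:
        total += _pick(rng, DEMAND[node])   # demand of the current task
        node = _pick(rng, NEXT[node])       # probabilistic next task
        steps += 1
    return total


def estimate_mean_demand(n_samples: int = 10_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the expected total demand."""
    rng = random.Random(seed)
    return sum(sample_total_demand(rng) for _ in range(n_samples)) / n_samples
```

The collected samples form an empirical distribution of total demand, which is the kind of probabilistic description a scheduler could consume in place of a single point estimate.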
Findings
Over 70% reduction in average completion time
Over 80% reduction in P95 completion time
Effective demand modeling improves overall efficiency
Abstract
Applications based on Large Language Models (LLMs) contain a series of tasks to address real-world problems with boosted capability, and these tasks have dynamic demand volumes on diverse backends. Existing serving systems treat the resource demands of LLM applications as a black box, compromising end-to-end efficiency due to improper queuing order and backend warm-up latency. We find that the resource demands of LLM applications can be modeled in a general and accurate manner with a Probabilistic Demand Graph (PDGraph). We then propose Hermes, which leverages PDGraph for efficient serving of LLM applications. Confronting probabilistic demand description, Hermes applies the Gittins policy to determine the scheduling order that can minimize the average application completion time. It also uses the PDGraph model to help prewarm cold backends at proper moments. Experiments with diverse LLM…
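The Gittins policy mentioned in the abstract ranks jobs whose remaining demand is known only as a distribution. The sketch below shows the textbook Gittins index for a discrete demand distribution. It is not Hermes's implementation, and the function names and example distributions are assumptions. The index is the best achievable ratio of completion probability to expected service invested, and the job with the highest index runs first.

```python
from typing import Dict, List, Sequence, Tuple

Distribution = Sequence[Tuple[float, float]]  # (total demand, probability)


def gittins_index(dist: Distribution, attained: float = 0.0) -> float:
    """Gittins index of a job that has already received `attained` service.

    For each candidate quantum `delta`, compute
        P(job finishes within delta) / E[service spent within delta],
    conditioned on the job not having finished yet, and take the maximum.
    For a discrete distribution the maximum is reached at a support point.
    """
    remaining = [(s - attained, p) for s, p in dist if s > attained]
    mass = sum(p for _, p in remaining)
    if mass <= 0:
        raise ValueError("no probability mass beyond the attained service")
    remaining = sorted((r, p / mass) for r, p in remaining)

    best = 0.0
    for delta, _ in remaining:
        p_done = sum(p for r, p in remaining if r <= delta)
        expected_spent = sum(p * min(r, delta) for r, p in remaining)
        best = max(best, p_done / expected_spent)
    return best


def schedule(jobs: Dict[str, Distribution]) -> List[str]:
    """Order job names by descending Gittins index (highest runs first)."""
    return sorted(jobs, key=lambda name: gittins_index(jobs[name]), reverse=True)


# Hypothetical applications: "steady" always needs 10 units, while
# "bimodal" needs 2 units half the time and 100 units otherwise.
jobs = {
    "steady": [(10.0, 1.0)],
    "bimodal": [(2.0, 0.5), (100.0, 0.5)],
}
order = schedule(jobs)
```

In this example "bimodal" is ordered first even though its mean demand (51) exceeds that of "steady" (10), because a quantum of 2 units completes it with probability 0.5. Once it has consumed those 2 units without finishing, its index drops to 1/98 and "steady" takes priority. Ordering by mean demand alone would miss this.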
Taxonomy
Topics: Power Line Communications and Noise · Smart Grid Energy Management · Advanced Wireless Network Optimization
