RT-LM: Uncertainty-Aware Resource Management for Real-Time Inference of Language Models
Yufei Li, Zexin Li, Wei Yang, Cong Liu

TL;DR
RT-LM introduces an uncertainty-aware resource management system that dynamically optimizes real-time language model inference by quantifying input uncertainties and adjusting system resources, significantly reducing response time and increasing throughput.
Contribution
This work is the first to quantify and utilize input uncertainty to optimize real-time language model inference through a dynamic, uncertainty-aware scheduling system.
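The idea of scheduling by input uncertainty can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the authors' system: `uncertainty_score` is a hypothetical proxy (word count plus a bonus for open-ended cue words) standing in for whatever estimator RT-LM actually uses, and the scheduler is plain shortest-predicted-first ordering on a single worker.

```python
import heapq
from dataclasses import dataclass, field

# Hypothetical cue words that hint at long, open-ended answers (an assumption).
OPEN_ENDED_CUES = {"why", "how", "explain", "describe"}


@dataclass(order=True)
class Request:
    predicted_cost: float          # compared first
    arrival: int                   # tie-breaker: earlier arrivals go first
    text: str = field(compare=False)


def uncertainty_score(text: str) -> float:
    """Toy proxy for predicted inference cost; not the paper's estimator."""
    words = text.split()
    cues = sum(w.lower().strip("?.,!") in OPEN_ENDED_CUES for w in words)
    return len(words) + 5.0 * cues


def schedule(texts):
    """Return the texts ordered by ascending predicted cost."""
    heap = [Request(uncertainty_score(t), i, t) for i, t in enumerate(texts)]
    heapq.heapify(heap)
    ordered = []
    while heap:
        ordered.append(heapq.heappop(heap).text)
    return ordered


def average_response_time(costs):
    """Mean completion time when jobs run back to back on one worker."""
    elapsed = 0.0
    total = 0.0
    for cost in costs:
        elapsed += cost
        total += elapsed
    return total / len(costs)


if __name__ == "__main__":
    queue = ["Explain how transformers work in detail", "Hi", "What time is it?"]
    fifo_costs = [uncertainty_score(t) for t in queue]
    sorted_costs = [uncertainty_score(t) for t in schedule(queue)]
    print("FIFO:", average_response_time(fifo_costs))
    print("Uncertainty-aware:", average_response_time(sorted_costs))
```

Running low-cost requests first lowers the mean completion time because fewer requests wait behind an expensive one. A real system would also have to handle deadlines, batching, and estimator error, which this sketch ignores.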
Findings
Reduces average response time significantly.
Improves throughput across multiple models and hardware.
Maintains low runtime overhead.
Abstract
Recent advancements in language models (LMs) have attracted substantial attention for their capability to generate human-like responses. Although they show promise for applications such as conversational AI, these LMs face deployment challenges on various devices due to their extreme computational cost and unpredictable inference latency. Such varied inference latency, identified as a consequence of uncertainty intrinsic to the nature of language, can lead to computational inefficiency and degrade the overall performance of LMs, especially under high-traffic workloads. Unfortunately, the range of these uncertainty sources is broad, complicating the prediction of latency and of the effects emanating from such uncertainties. To understand and mitigate the impact of uncertainty on real-time response-demanding systems, we take the first step to comprehend, quantify and…
Taxonomy
Topics: Ferroelectric and Negative Capacitance Devices · Topic Modeling · Advanced Neural Network Applications
