RT-LM: Uncertainty-Aware Resource Management for Real-Time Inference of Language Models

Yufei Li; Zexin Li; Wei Yang; Cong Liu

arXiv:2309.06619 · cs.LG · September 14, 2023 · 1 citation

TL;DR

RT-LM introduces an uncertainty-aware resource management system that dynamically optimizes real-time language model inference by quantifying input uncertainties and adjusting system resources, significantly reducing response time and increasing throughput.

Contribution

This work is the first to quantify and utilize input uncertainty to optimize real-time language model inference through a dynamic, uncertainty-aware scheduling system.
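The scheduling idea can be illustrated with a minimal sketch. It is not the paper's implementation: the uncertainty score below (word count plus a penalty per question mark) is a hypothetical stand-in for whatever measure RT-LM uses, and the names `Request`, `uncertainty_score`, `schedule`, and `average_response_time` are invented for this example. The sketch only shows why serving requests with lower predicted cost first can reduce average response time on a single worker.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Request:
    text: str


def uncertainty_score(text: str) -> float:
    # Hypothetical proxy (an assumption, not the paper's metric):
    # more words and more question marks suggest a longer, less
    # predictable response.
    return len(text.split()) + 5.0 * text.count("?")


def schedule(requests: List[Request]) -> List[Request]:
    # Serve requests with the lowest predicted cost first
    # (shortest-predicted-job-first ordering).
    return sorted(requests, key=lambda r: uncertainty_score(r.text))


def average_response_time(
    order: List[Request], service_time: Callable[[Request], float]
) -> float:
    # Single worker, all requests present at time 0: each request's
    # response time is the time at which it finishes.
    clock = 0.0
    total = 0.0
    for request in order:
        clock += service_time(request)
        total += clock
    return total / len(order)


if __name__ == "__main__":
    queue = [Request("a b c d e f"), Request("hi"), Request("why? how?")]
    # Assumption for the demo: true service time equals the score.
    cost = lambda r: uncertainty_score(r.text)
    print("FIFO:     ", average_response_time(queue, cost))
    print("Scheduled:", average_response_time(schedule(queue), cost))
```

In this toy setting the predicted cost is assumed to equal the true service time, which a real system cannot guarantee. The benefit of the reordering depends on how well the score tracks actual inference latency.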

Findings

01 Reduces average response time significantly.

02 Improves throughput across multiple models and hardware.

03 Maintains low runtime overhead.

Abstract

Recent advancements in language models (LMs) have attracted substantial attention for their capability to generate human-like responses. Although they show promise for applications such as conversational AI, these LMs face deployment challenges on various devices due to their extreme computational cost and unpredictable inference latency. Such varied inference latency, identified as a consequence of uncertainty intrinsic to the nature of language, can lead to computational inefficiency and degrade the overall performance of LMs, especially under high-traffic workloads. Unfortunately, the range of these uncertainty sources is extensive, complicating the prediction of latency and of the effects emanating from such uncertainties. To understand and mitigate the impact of uncertainty on real-time response-demanding systems, we take the first step to comprehend, quantify and…

Taxonomy

Topics: Ferroelectric and Negative Capacitance Devices · Topic Modeling · Advanced Neural Network Applications