Enabling Efficient Batch Serving for LMaaS via Generation Length Prediction

Ke Cheng; Wen Hu; Zhi Wang; Peng Du; Jianguo Li; Sheng Zhang

arXiv:2406.04785 · cs.DC · June 10, 2024

TL;DR

This paper introduces Magnus, a method that predicts the generation lengths of LMaaS requests and uses the predictions to optimize batching, increasing request throughput by up to 234% and reducing response time by up to 89.7%.

Contribution

Magnus is a novel approach that accurately predicts request generation lengths using semantic features, enabling efficient batch scheduling for LMaaS.

Findings

01 Request throughput increased by up to 234%

02 Response time reduced by up to 89.7%

03 Effective batching based on length prediction improves server utilization

Abstract

Nowadays, large language models (LLMs) are published as a service and can be accessed by various applications via APIs, also known as language-model-as-a-service (LMaaS). Without knowing the generation length of requests, existing serving systems serve requests in a first-come, first-served (FCFS) manner with a fixed batch size, which leads to two problems that affect batch serving efficiency. First, the generation lengths of requests in a batch vary, and requests with short generation lengths must wait for requests with long generation lengths to finish during the batch serving procedure. Second, requests with longer generation lengths consume more memory during serving. Without knowing the generation lengths of batched requests, the batch size is always set small to avoid the out-of-memory (OOM) error, thus preventing the GPU from being fully utilized. In this paper, we find that a…
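The first problem described in the abstract, short requests waiting on long ones in the same batch, can be illustrated with a toy calculation. The sketch below is not Magnus's scheduler: it assumes generation lengths are known exactly (a stand-in for a perfect predictor), ignores memory limits, and simply sorts requests by length before cutting them into fixed-size batches. All function names and the sample lengths are hypothetical.

```python
from typing import List


def fcfs_batches(lengths: List[int], batch_size: int) -> List[List[int]]:
    """Cut requests into fixed-size batches in arrival order (FCFS)."""
    return [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]


def length_grouped_batches(lengths: List[int], batch_size: int) -> List[List[int]]:
    """Illustrative alternative: sort by (assumed known) generation length,
    then cut into fixed-size batches so similar lengths share a batch."""
    return fcfs_batches(sorted(lengths), batch_size)


def wasted_slots(batches: List[List[int]]) -> int:
    """Count idle decoding slots: each request occupies its batch slot until
    the longest request in that batch finishes generating."""
    total = 0
    for batch in batches:
        if batch:
            total += max(batch) * len(batch) - sum(batch)
    return total


# Hypothetical generation lengths (in tokens), listed in arrival order.
arrivals = [10, 200, 12, 190, 8, 210, 11, 205]

fcfs = fcfs_batches(arrivals, batch_size=4)
grouped = length_grouped_batches(arrivals, batch_size=4)

print("FCFS wasted slots:   ", wasted_slots(fcfs))     # 794
print("Grouped wasted slots:", wasted_slots(grouped))  # 42
```

With mixed short and long requests, FCFS batching leaves most slots idle while the longest request finishes; grouping similar lengths removes most of that waiting. The second problem, choosing a batch size that avoids OOM, is not modeled here.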

Taxonomy

Topics: Energy Efficient Wireless Sensor Networks · Wireless Communication Networks Research · Optical Wireless Communication Technologies