ALISE: Accelerating Large Language Model Serving with Speculative Scheduling

Youpeng Zhao; Jun Wang

arXiv:2410.23537·cs.PF·November 1, 2024

TL;DR

ALISE is an LLM inference serving framework that combines speculative scheduling with adaptive memory management, reporting up to 1.8x (Alpaca) and 2.1x (ShareGPT) higher throughput and reduced queuing delays compared to existing solutions.

Contribution

ALISE introduces a speculative scheduler with execution time estimation and a priority-based memory management protocol for efficient LLM serving.

Findings

01 Up to 1.8x throughput improvement on the Alpaca dataset

02 Up to 2.1x throughput improvement on the ShareGPT dataset

03 Reduces queuing delays and mitigates memory overhead in LLM inference serving

Abstract

Large Language Models (LLMs) represent a revolutionary advancement in the contemporary landscape of artificial general intelligence (AGI). As exemplified by ChatGPT, LLM-based applications necessitate minimal response latency and maximal throughput for inference serving. However, due to the unpredictability of LLM execution, the first-come-first-serve (FCFS) scheduling policy employed by current LLM serving systems suffers from head-of-line (HoL) blocking issues and long job response times. In this paper, we propose a new efficient LLM inference serving framework, named ALISE. The key design paradigm of ALISE is to leverage a novel speculative scheduler by estimating the execution time for each job and exploiting such prior knowledge to assign appropriate job priority orders, thus minimizing potential queuing delays for heterogeneous workloads. Furthermore, to mitigate the memory…
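
To make the head-of-line blocking argument concrete, the following is a minimal sketch that compares FCFS with ordering jobs by an estimated execution time. It is not ALISE's actual scheduler: it assumes a single server, non-preemptive execution, and all jobs arriving at time zero. The `Job` class, the `estimated_time` field (standing in for the output of some execution time predictor), and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    arrival_order: int     # position in the arrival queue
    true_time: float       # actual execution time, unknown to the scheduler
    estimated_time: float  # hypothetical predictor output (assumption)


def mean_completion_time(jobs):
    """Mean completion time when jobs run back to back in the given order."""
    clock = 0.0
    total = 0.0
    for job in jobs:
        clock += job.true_time  # the job finishes at the current clock value
        total += clock
    return total / len(jobs)


def fcfs_order(jobs):
    """First-come-first-serve: run jobs strictly in arrival order."""
    return sorted(jobs, key=lambda j: j.arrival_order)


def speculative_order(jobs):
    """Run the job with the smallest estimated time first (ties: arrival order)."""
    return sorted(jobs, key=lambda j: (j.estimated_time, j.arrival_order))


# Illustrative workload: one long job arrives ahead of two short ones.
jobs = [
    Job("long", 0, true_time=10.0, estimated_time=9.0),
    Job("short_a", 1, true_time=1.0, estimated_time=1.5),
    Job("short_b", 2, true_time=2.0, estimated_time=1.8),
]

fcfs_mean = mean_completion_time(fcfs_order(jobs))
spec_mean = mean_completion_time(speculative_order(jobs))
print(f"FCFS mean completion time:        {fcfs_mean:.2f}")
print(f"Speculative mean completion time: {spec_mean:.2f}")
```

Under FCFS the two short jobs wait behind the long one, so their completion times are dominated by it. Ordering by the estimate lets them finish first even though the estimates are imperfect, as long as the estimates preserve the relative ranking of the jobs.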

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques