Semi-Clairvoyant Scheduling of Speculative Decoding Requests to Minimize LLM Inference Latency
Ruixiao Li, Fahao Chen, and Peng Li

TL;DR
This paper introduces LAPS-SD, a semi-clairvoyant scheduling algorithm that adaptively manages speculative decoding requests to significantly reduce LLM inference latency, especially under uncertain execution times.
Contribution
The paper proposes a novel semi-clairvoyant scheduling algorithm for speculative decoding that dynamically adapts to changing execution conditions to optimize inference latency.
Findings
LAPS-SD reduces inference latency by approximately 39%.
It effectively handles dynamic token acceptance rates.
The method outperforms existing scheduling approaches.
Abstract
Speculative decoding accelerates Large Language Model (LLM) inference by using a small speculative model (SSM) to generate multiple candidate tokens, which the LLM then verifies in parallel. This technique has been widely integrated into LLM inference serving systems. However, inference requests typically have uncertain execution times, which makes it challenging to schedule requests efficiently in these systems. Existing work estimates execution time solely from the predicted output length, which can be inaccurate because execution time depends on both the output length and the token acceptance rate of the LLM's verification. In this paper, we propose a semi-clairvoyant request scheduling algorithm called Least-Attained/Perceived-Service for Speculative Decoding (LAPS-SD). Given a number of inference requests, LAPS-SD can effectively minimize average inference latency by…
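To illustrate why output length alone is a weak predictor of execution time, here is a minimal sketch. It is not the paper's estimator. It assumes each drafted token is accepted independently with a fixed probability `alpha` and that the SSM drafts `gamma` tokens per round, which gives the standard expected yield of (1 - alpha^(gamma+1)) / (1 - alpha) tokens per draft-and-verify round. The function names and the example values (200 output tokens, `gamma = 4`) are hypothetical.

```python
import math


def expected_tokens_per_round(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per draft-and-verify round.

    Assumption (not from the paper): each of the `gamma` drafted tokens is
    accepted independently with probability `alpha`, and the LLM always
    contributes one token of its own per round.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every drafted token is accepted, plus the LLM's own token.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def estimated_rounds(output_len: int, alpha: float, gamma: int) -> int:
    """Rough number of verification rounds needed to emit `output_len` tokens."""
    return math.ceil(output_len / expected_tokens_per_round(alpha, gamma))


if __name__ == "__main__":
    # Same predicted output length, different acceptance rates.
    for alpha in (0.3, 0.6, 0.9):
        print(f"alpha={alpha}: ~{estimated_rounds(200, alpha, gamma=4)} rounds")
```

Under these assumptions, two requests with the same predicted output length can need very different numbers of verification rounds, so a scheduler that ranks requests by output length alone may order them incorrectly.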
