AugServe: Adaptive Request Scheduling for Augmented Large Language Model Inference Serving

Ying Wang; Zhen Jin; Jiexiong Xu; Wenhai Lin; Yiquan Chen; Wenzhi Chen

arXiv:2512.04013 · cs.CL · December 17, 2025

TL;DR

AugServe is an adaptive request scheduling framework for augmented LLM inference that raises effective throughput and reduces latency by dynamically optimizing request order and batching based on runtime conditions.

Contribution

The paper introduces a two-stage adaptive scheduling strategy and a dynamic batching mechanism to improve inference efficiency for augmented LLM services.

Findings

01 Achieves 4.7x higher effective throughput than vLLM

02 Reduces time-to-first-token by up to 96.3%

03 Outperforms existing systems in latency and throughput

Abstract

As augmented large language models (LLMs) with external tools become increasingly popular in web applications, improving the efficiency of augmented LLM inference serving and meeting service-level objectives (SLOs) are critical for user experience. To achieve this, inference systems must maximize the number of requests handled within latency constraints, referred to as increasing effective throughput. However, existing systems face two major challenges: (i) reliance on first-come-first-served (FCFS) scheduling causes severe head-of-line blocking, leading to queueing delays that exceed the SLOs for many requests; and (ii) a static batch token limit fails to adapt to fluctuating loads and hardware conditions. Both factors degrade effective throughput and service quality. This paper presents AugServe, an efficient inference framework designed to reduce queueing latency and enhance…
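The head-of-line blocking problem described in the abstract can be illustrated with a toy queue. The sketch below is not AugServe's scheduling algorithm: it assumes a single server, all requests arriving at time zero, known service times, and a hypothetical helper named `effective_throughput` that counts the requests finishing within the SLO. It compares FCFS order with a simple shortest-first reordering, used here only as a stand-in for any length-aware policy.

```python
from typing import List


def effective_throughput(service_times: List[float], slo: float) -> int:
    """Count requests that complete within `slo` seconds.

    Assumptions (illustrative only): one server, requests processed one at a
    time in the given order, and every request arrives at t = 0, so a
    request's latency is its queueing delay plus its own service time.
    """
    clock = 0.0
    met = 0
    for service in service_times:
        clock += service  # this request finishes at the running clock
        if clock <= slo:
            met += 1
    return met


# One long request (e.g. waiting on a slow tool call) arrives first,
# followed by four short ones. Values are made up for illustration.
requests = [10.0, 1.0, 1.0, 1.0, 1.0]
slo = 5.0

fcfs = effective_throughput(requests, slo)            # arrival order
reordered = effective_throughput(sorted(requests), slo)  # shortest first

print(f"FCFS meets SLO for {fcfs} requests")
print(f"Shortest-first meets SLO for {reordered} requests")
```

Under FCFS the long request at the head of the queue pushes every later request past the 5-second SLO, while the reordered queue lets the four short requests finish in time. The total work done is identical in both cases; only the number of requests served within the SLO changes, which is what effective throughput measures.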

Taxonomy

Topics: Big Data and Digital Economy · Software System Performance and Reliability · Natural Language Processing Techniques