Staggered Batch Scheduling: Co-optimizing Time-to-First-Token and Throughput for High-Efficiency LLM Inference

Jian Tian; Shuailong Li; Yang Cao; Wenbo Cui; Minghan Zhu; Wenkang Wu; Jianming Zhang; Yanpeng Wang; Zhiwen Xiao; Zhenyu Hou; Dou Shen

arXiv:2512.16134·cs.DC·December 19, 2025

Open Access

TL;DR

This paper introduces Staggered Batch Scheduling (SBS), which deliberately buffers incoming requests so they can be dispatched as well-formed execution batches, reducing Time-to-First-Token (TTFT) by 30%-40% and improving throughput by 15%-20% in large-scale distributed LLM inference systems.

Contribution

The paper presents SBS, a new scheduling mechanism that buffers requests to improve LLM inference efficiency, and a load-aware global allocation strategy for better load balancing.
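To make the idea of load-aware allocation concrete, here is a minimal sketch, not the paper's algorithm. It assumes that prompt token count is a reasonable proxy for prefill cost, and it uses the classic longest-processing-time greedy heuristic: place the largest request first, always on the currently least-loaded data-parallel rank. The function name `allocate` and its parameters are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple


def allocate(prompt_tokens: List[int], num_ranks: int) -> Tuple[Dict[int, List[int]], Dict[int, int]]:
    """Greedy least-loaded assignment of buffered requests to DP ranks.

    Assumption (not from the paper): load is measured in prompt tokens.
    Returns (rank -> request indices, rank -> total tokens).
    """
    # Min-heap of (current_load, rank) so the least-loaded rank pops first.
    heap = [(0, rank) for rank in range(num_ranks)]
    heapq.heapify(heap)
    assignment: Dict[int, List[int]] = {rank: [] for rank in range(num_ranks)}

    # Visit requests from largest to smallest prompt.
    order = sorted(range(len(prompt_tokens)), key=lambda i: -prompt_tokens[i])
    for idx in order:
        load, rank = heapq.heappop(heap)
        assignment[rank].append(idx)
        heapq.heappush(heap, (load + prompt_tokens[idx], rank))

    loads = {rank: sum(prompt_tokens[i] for i in idxs) for rank, idxs in assignment.items()}
    return assignment, loads


assignment, loads = allocate([900, 100, 500, 400, 300], num_ranks=2)
print(assignment)
print(loads)
```

A scheduler can only make this kind of global decision if it sees several requests at once, which is what a buffering window provides; with immediate dispatch each request would be placed without knowledge of the ones arriving just after it.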

Findings

01. Reduces TTFT by 30%-40%

02. Improves throughput by 15%-20%

03. Effective in large-scale distributed LLM serving environments

Abstract

The evolution of Large Language Model (LLM) serving towards complex, distributed architectures--specifically the P/D-separated, large-scale DP+EP paradigm--introduces distinct scheduling challenges. Unlike traditional deployments where schedulers can treat instances as black boxes, DP+EP architectures exhibit high internal synchronization costs. We identify that immediate request dispatching in such systems leads to severe in-engine queuing and parallelization bubbles, degrading Time-to-First-Token (TTFT). To address this, we propose Staggered Batch Scheduling (SBS), a mechanism that deliberately buffers requests to form optimal execution batches. This temporal decoupling eliminates internal queuing bubbles without compromising throughput. Furthermore, leveraging the scheduling window created by buffering, we introduce a Load-Aware Global Allocation strategy that balances computational…
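The buffering mechanism described in the abstract can be illustrated with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: it assumes a fixed dispatch interval (`window_ms`, a hypothetical parameter) and holds every request until the next interval boundary, so the engine receives whole batches rather than a trickle of individual requests. The `Request` type and `staggered_batches` function are invented for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Request:
    req_id: int
    arrival_ms: int  # arrival time in integer milliseconds


def staggered_batches(requests: List[Request], window_ms: int) -> List[Tuple[int, List[Request]]]:
    """Group requests by the first dispatch boundary strictly after arrival.

    Assumption (not from the paper): boundaries occur at multiples of window_ms.
    Returns a list of (dispatch_time_ms, batch) sorted by dispatch time.
    """
    buckets: Dict[int, List[Request]] = {}
    for req in sorted(requests, key=lambda r: r.arrival_ms):
        # A request arriving in [k*window, (k+1)*window) waits until (k+1)*window.
        tick = req.arrival_ms // window_ms + 1
        buckets.setdefault(tick, []).append(req)
    return [(tick * window_ms, batch) for tick, batch in sorted(buckets.items())]


arrivals = [10, 40, 120, 190, 310]
reqs = [Request(i, t) for i, t in enumerate(arrivals)]
for dispatch_ms, batch in staggered_batches(reqs, window_ms=100):
    print(dispatch_ms, [r.req_id for r in batch])
```

In this toy model each request waits at most one window before dispatch. That bounded delay is the cost traded for the removal of in-engine queuing that the abstract describes; how the real system sizes the window is not stated in this excerpt.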

Taxonomy

Topics: Big Data and Digital Economy · Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques