CascadeInfer: Length-Aware Scheduling of LLM Serving with Low Latency and Load Balancing

Yitao Yuan (1; 2); Chenqi Zhao (1); Bohan Zhao (2); Zane Cao (2); Yongchao He (2); Wenfei Wu (1) ((1) Peking University; (2) ScitiX AI)

arXiv:2512.19179·cs.DC·May 18, 2026

TL;DR

CascadeInfer is a runtime system that reschedules requests across multiple instances serving the same LLM so that each instance handles requests of similar length, reducing latency and increasing throughput.

Contribution

CascadeInfer introduces a length-aware, multi-instance scheduling approach that uses a dynamic programming algorithm for optimal stage partitioning and load balancing.
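
To make the idea of dynamic-programming stage partitioning concrete, here is a minimal sketch, not the paper's algorithm. It assumes a hypothetical cost model in which a stage's cost is the total gap between each request's length and the longest request in that stage, and it splits the sorted request lengths into a fixed number of contiguous stages that minimize the summed cost. The function name `partition_lengths` and the cost model are illustrative assumptions.

```python
from itertools import accumulate


def partition_lengths(lengths, num_stages):
    """Split request lengths into contiguous length ranges (stages).

    Assumed cost model (hypothetical, not from the paper): a stage costs
    sum(max_len_in_stage - len) over its requests, a rough proxy for
    length heterogeneity inside a batch.
    Returns (total_cost, list_of_groups).
    """
    xs = sorted(lengths)
    n = len(xs)
    if not 1 <= num_stages <= n:
        raise ValueError("num_stages must be between 1 and len(lengths)")
    prefix = [0] + list(accumulate(xs))  # prefix[j] = sum(xs[:j])

    def cost(i, j):
        # Cost of grouping xs[i:j] (j exclusive, non-empty) into one stage.
        return xs[j - 1] * (j - i) - (prefix[j] - prefix[i])

    inf = float("inf")
    # best[k][j]: minimum cost of splitting xs[:j] into k non-empty stages.
    best = [[inf] * (n + 1) for _ in range(num_stages + 1)]
    cut = [[0] * (n + 1) for _ in range(num_stages + 1)]
    best[0][0] = 0
    for k in range(1, num_stages + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                candidate = best[k - 1][i] + cost(i, j)
                if candidate < best[k][j]:
                    best[k][j] = candidate
                    cut[k][j] = i

    # Walk the recorded cut points backwards to recover the stages.
    groups, j = [], n
    for k in range(num_stages, 0, -1):
        i = cut[k][j]
        groups.append(xs[i:j])
        j = i
    groups.reverse()
    return best[num_stages][n], groups


total, stages = partition_lengths([100, 120, 110, 8000, 8200, 64000], 3)
print(total, stages)
```

With these sample lengths the short, medium, and long requests end up in separate stages. A real system would also have to weigh per-stage load and instance counts, which this sketch leaves out.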

Findings

01 Reduces end-to-end latency by up to 67%

02 Improves tail latency by up to 69%

03 Increases system throughput by up to 2.89 times

Abstract

Efficiently harnessing GPU compute is critical to improving user experience and reducing operational costs in large language model (LLM) services. However, current inference engine schedulers overlook the attention backend's sensitivity to request-length heterogeneity within a batch. As state-of-the-art models now support context windows exceeding 128K tokens, this once-tolerable inefficiency has escalated into a primary system bottleneck, causing severe performance degradation through GPU underutilization and increased latency. We present CascadeInfer, a runtime system that dynamically reschedules requests across multiple instances serving the same LLM to mitigate per-instance length heterogeneity. CascadeInfer partitions these instances into length-specialized groups, each handling requests within a designated length range, naturally forming a pipeline as requests flow through them.…
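
The length-specialized grouping described in the abstract can be illustrated with a small routing sketch. It is written under stated assumptions and is not CascadeInfer's implementation: the class names (`Instance`, `Stage`, `Router`), the length thresholds, and the "fewest active tokens" load metric are all hypothetical. A request is mapped to the stage whose length range contains its current length, then placed on the least-loaded instance in that stage.

```python
import bisect
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    active_tokens: int = 0  # assumed load metric (hypothetical)


@dataclass
class Stage:
    upper_bound: int  # exclusive upper length bound for this stage
    instances: list


class Router:
    def __init__(self, stages):
        # Keep stages ordered by length range so bisect can locate one.
        self.stages = sorted(stages, key=lambda s: s.upper_bound)
        self.bounds = [s.upper_bound for s in self.stages]

    def stage_index(self, length):
        # Requests longer than every bound fall into the last stage.
        idx = bisect.bisect_right(self.bounds, length)
        return min(idx, len(self.stages) - 1)

    def place(self, length):
        # Pick the least-loaded instance within the matching stage.
        stage = self.stages[self.stage_index(length)]
        target = min(stage.instances, key=lambda inst: inst.active_tokens)
        target.active_tokens += length
        return target


router = Router([
    Stage(1024, [Instance("short-0"), Instance("short-1")]),
    Stage(16384, [Instance("mid-0")]),
    Stage(131072, [Instance("long-0")]),
])
for length in (500, 600, 20000):
    print(length, "->", router.place(length).name)
```

As a request's context grows during decoding, calling `stage_index` again with the new length would indicate when it has crossed into the next stage, which is one way to picture the pipeline the abstract mentions. How and when CascadeInfer actually migrates requests is not covered in the text above.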

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Natural Language Processing Techniques · Embedded Systems Design Techniques