CascadeInfer: Length-Aware Scheduling of LLM Serving with Low Latency and Load Balancing
Yitao Yuan (1, 2), Chenqi Zhao (1), Bohan Zhao (2), Zane Cao (2), Yongchao He (2), Wenfei Wu (1) ((1) Peking University, (2) ScitiX AI)

TL;DR
CascadeInfer is a runtime system that improves LLM serving efficiency by dynamically rescheduling requests across instances so that each batch holds requests of similar length, which lowers latency and raises throughput.
Contribution
CascadeInfer introduces a length-aware, multi-instance scheduling approach that uses a dynamic programming algorithm to partition instances into stages and balance load across them.
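To make the idea of dynamic-programming stage partitioning concrete, here is a minimal sketch. It is not the paper's algorithm: this summary does not state the objective being optimized, so the sketch assumes a simple stand-in cost (minimize the load of the heaviest stage). The names `partition_stages`, `bucket_loads`, and `num_stages` are hypothetical, and the loads are made-up numbers for illustration.

```python
from itertools import accumulate


def partition_stages(bucket_loads, num_stages):
    """Split length-sorted buckets into contiguous stages.

    ASSUMPTION: the cost is the load of the heaviest stage, which the
    dynamic program minimizes. The paper's real objective may differ.
    bucket_loads[i] is the estimated load of the i-th length bucket,
    with buckets ordered from shortest to longest requests.
    """
    n = len(bucket_loads)
    if not 1 <= num_stages <= n:
        raise ValueError("need 1 <= num_stages <= number of buckets")

    prefix = [0, *accumulate(bucket_loads)]  # prefix[i] = sum of first i loads
    inf = float("inf")
    # best[k][i]: smallest achievable max stage load when the first i
    # buckets are split into exactly k contiguous stages.
    best = [[inf] * (n + 1) for _ in range(num_stages + 1)]
    cut = [[0] * (n + 1) for _ in range(num_stages + 1)]
    best[0][0] = 0

    for k in range(1, num_stages + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):  # last stage covers buckets j..i-1
                cost = max(best[k - 1][j], prefix[i] - prefix[j])
                if cost < best[k][i]:
                    best[k][i] = cost
                    cut[k][i] = j

    # Walk the recorded cuts backwards to recover stage boundaries.
    bounds = []
    i = n
    for k in range(num_stages, 0, -1):
        j = cut[k][i]
        bounds.append((j, i))
        i = j
    bounds.reverse()
    return best[num_stages][n], bounds


if __name__ == "__main__":
    loads = [10, 20, 30, 40, 50, 60, 70, 80, 90]  # illustrative values only
    print(partition_stages(loads, 3))
```

Keeping stages contiguous in length order is what makes each stage "length-specialized": every stage covers one unbroken range of request lengths, and the dynamic program only has to choose where the range boundaries fall.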
Findings
Reduces end-to-end latency by up to 67%
Reduces tail latency by up to 69%
Increases system throughput by up to 2.89 times
Abstract
Efficiently harnessing GPU compute is critical to improving user experience and reducing operational costs in large language model (LLM) services. However, current inference engine schedulers overlook the attention backend's sensitivity to request-length heterogeneity within a batch. As state-of-the-art models now support context windows exceeding 128K tokens, this once-tolerable inefficiency has escalated into a primary system bottleneck, causing severe performance degradation through GPU underutilization and increased latency. We present CascadeInfer, a runtime system that dynamically reschedules requests across multiple instances serving the same LLM to mitigate per-instance length heterogeneity. CascadeInfer partitions these instances into length-specialized groups, each handling requests within a designated length range, naturally forming a pipeline as requests flow through them.…
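The length-specialized groups described in the abstract can be pictured with a small routing sketch. Everything here is an assumption made for illustration: the class `LengthRouter`, the helper `stage_path`, and the boundary values 2048 and 16384 do not come from the paper, and the abstract as shown is truncated before it explains how requests actually move between groups.

```python
from bisect import bisect_right


class LengthRouter:
    """Map a request's current token length to a length-specialized group.

    ASSUMPTION: groups are defined by sorted, exclusive upper bounds, and
    the last group is unbounded. The paper may define ranges differently.
    """

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)

    def group_for(self, length):
        # Group i handles lengths in [upper_bounds[i-1], upper_bounds[i]).
        return bisect_right(self.upper_bounds, length)


def stage_path(router, prompt_len, output_len):
    """List the groups a request visits as decoding grows its length.

    ASSUMPTION: a request moves to the next group as soon as its length
    crosses a boundary. Real migration policies are likely more involved.
    """
    path = []
    for length in range(prompt_len, prompt_len + output_len + 1):
        group = router.group_for(length)
        if not path or path[-1] != group:
            path.append(group)
    return path


if __name__ == "__main__":
    router = LengthRouter([2048, 16384])  # hypothetical boundaries
    print(stage_path(router, prompt_len=2000, output_len=100))
```

Because a request only grows during decoding, it can only move toward groups that serve longer ranges, which is one way to read the abstract's remark that the groups naturally form a pipeline.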
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Natural Language Processing Techniques · Embedded Systems Design Techniques
