BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching
Yilong Zhao, Shuo Yang, Kan Zhu, Lianmin Zheng, Baris Kasikci, Yang Zhou, Jiarong Xing, Ion Stoica

TL;DR
BlendServe is a system that raises offline inference throughput for large auto-regressive models by reordering requests so that resource overlapping and prefix sharing are exploited together, rather than one at the expense of the other.
Contribution
It introduces a resource-aware prefix tree to optimize request scheduling, balancing resource utilization and prefix sharing in offline batch inference.
Findings
Achieves up to 1.44x throughput increase over industry standards
Effectively balances resource overlapping and prefix sharing
Demonstrates benefits on synthetic multi-modal workloads
Abstract
Offline batch inference, which leverages the flexibility of request batching to achieve higher throughput and lower costs, is becoming more popular for latency-insensitive applications. Meanwhile, recent progress in model capability and modality makes requests more diverse in compute and memory demands, creating unique opportunities for throughput improvement by resource overlapping. However, a request schedule that maximizes resource overlapping can conflict with the schedule that maximizes prefix sharing, a widely-used performance optimization, causing sub-optimal inference throughput. We present BlendServe, a system that maximizes resource utilization of offline batch inference by combining the benefits of resource overlapping and prefix sharing using a resource-aware prefix tree. BlendServe exploits the relaxed latency requirements in offline batch inference to reorder and overlap…
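To make the tension between prefix sharing and resource overlapping concrete, here is a minimal Python sketch of one way a prefix tree could carry resource information. It is a toy illustration under stated assumptions, not BlendServe's actual algorithm: the `Request` fields `compute` and `memory`, the `density` measure, and the two-ended pairing in `blend` are all hypothetical names and choices introduced for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    rid: str
    tokens: tuple     # prompt token ids
    compute: float    # hypothetical estimate of compute-bound work
    memory: float     # hypothetical estimate of memory-bound work


@dataclass
class Node:
    children: dict = field(default_factory=dict)
    requests: list = field(default_factory=list)
    compute: float = 0.0   # summed over every request in this subtree
    memory: float = 0.0

    @property
    def density(self):
        # Share of the subtree's work that is compute-bound (assumed metric).
        total = self.compute + self.memory
        return self.compute / total if total else 0.0


def build_tree(requests):
    """Insert each request into a token-level prefix tree, aggregating demands."""
    root = Node()
    for r in requests:
        node = root
        node.compute += r.compute
        node.memory += r.memory
        for tok in r.tokens:
            node = node.children.setdefault(tok, Node())
            node.compute += r.compute
            node.memory += r.memory
        node.requests.append(r)
    return root


def ordered(node):
    """Depth-first order, compute-heavy subtrees first.

    Requests that share a prefix stay adjacent, so prefix reuse is preserved.
    """
    out = list(node.requests)
    for child in sorted(node.children.values(), key=lambda c: -c.density):
        out.extend(ordered(child))
    return out


def blend(order):
    """Pair requests from both ends of the order: compute-heavy with memory-heavy."""
    pairs = []
    lo, hi = 0, len(order) - 1
    while lo < hi:
        pairs.append((order[lo].rid, order[hi].rid))
        lo += 1
        hi -= 1
    if lo == hi:
        pairs.append((order[lo].rid,))
    return pairs


# Made-up workload: "a" and "b" share a prefix and are compute-heavy,
# "c" and "d" share a different prefix and are memory-heavy.
requests = [
    Request("a", (1, 2, 3), compute=9, memory=1),
    Request("b", (1, 2, 4), compute=8, memory=2),
    Request("c", (5, 6), compute=1, memory=9),
    Request("d", (5, 7), compute=2, memory=8),
]
order = ordered(build_tree(requests))
pairs = blend(order)
```

In this sketch the depth-first order keeps prefix-sharing requests next to each other, while taking work from opposite ends mixes compute-heavy and memory-heavy requests in the same step. A real scheduler would also have to account for batch sizes, KV-cache capacity, and measured rather than assumed demands.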
Taxonomy
Topics: Machine Learning and Data Classification · Gaussian Processes and Bayesian Inference · AI in cancer detection
