Symphony: Optimized DNN Model Serving using Deferred Batch Scheduling

Lequn Chen; Weixin Deng; Anirudh Canumalla; Yu Xin; Danyang Zhuo; Matthai Philipose; Arvind Krishnamurthy

arXiv:2308.07470·cs.DC·March 1, 2024·1 cites

Open Access

TL;DR

Symphony is a DNN model serving system that uses deferred batch scheduling to improve efficiency and throughput, achieving higher goodput and better GPU utilization while supporting autoscaling.

Contribution

It introduces a novel deferred batch scheduling approach that optimizes GPU usage and system throughput in DNN inference serving.

Findings

1. Symphony achieves 5x higher goodput compared to prior systems.

2. It reduces GPU usage by 60% for the same workload.

3. The system can schedule millions of requests per second across thousands of GPUs.

Abstract

Having large batch sizes is one of the most critical aspects of increasing the accelerator efficiency and the performance of DNN model inference. However, existing model serving systems cannot achieve adequate batch sizes while meeting latency objectives as these systems eagerly dispatch requests to accelerators to minimize the accelerator idle time. We propose Symphony, a DNN serving system that explores deferred batch scheduling to optimize system efficiency and throughput. Further, unlike other prior systems, Symphony's GPU usage is load-proportional: it consolidates workloads on the appropriate number of GPUs and works smoothly with cluster auto-scaling tools. Symphony consists of two core design points. First, Symphony defines a schedulable window in which a batch of inference requests can be dispatched. This window is computed in order to improve accelerator efficiency while…
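
The idea of a schedulable window can be illustrated with a minimal sketch. The abstract is truncated before it gives Symphony's actual definition, so everything here is an assumption made for illustration: the `Request` type, the linear latency model in `batch_latency_ms` (with made-up constants `alpha` and `beta`), and the rule that the window closes at the last moment the whole queue, run as one batch, still meets the tightest deadline. It is not the paper's algorithm.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Request:
    arrival_ms: float   # when the request entered the queue
    deadline_ms: float  # absolute time by which its result is due


def batch_latency_ms(batch_size: int, alpha: float = 2.0, beta: float = 5.0) -> float:
    """Hypothetical latency model: fixed cost beta plus alpha per request."""
    return beta + alpha * batch_size


def latest_dispatch_ms(queue: List[Request]) -> float:
    """Latest start time at which the queue, run as one batch, meets every deadline."""
    tightest_deadline = min(r.deadline_ms for r in queue)
    return tightest_deadline - batch_latency_ms(len(queue))


def schedulable_window(queue: List[Request], now_ms: float) -> Optional[Tuple[float, float]]:
    """Return (earliest, latest) dispatch times, or None if no feasible start remains."""
    latest = latest_dispatch_ms(queue)
    if now_ms > latest:
        return None  # the batch can no longer finish before the tightest deadline
    return (now_ms, latest)


if __name__ == "__main__":
    queue = [Request(arrival_ms=0.0, deadline_ms=50.0),
             Request(arrival_ms=4.0, deadline_ms=60.0)]
    # An eager scheduler would dispatch at the start of the window; a deferred
    # one may wait toward the end of it, hoping more requests join the batch.
    print(schedulable_window(queue, now_ms=10.0))
    print(schedulable_window(queue, now_ms=45.0))
```

Under these assumptions, deferring dispatch toward the end of the window gives later arrivals a chance to join the batch, which is the trade-off the abstract contrasts with eager dispatch. Note that each added request lengthens the batch and so moves the window's end earlier.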

Taxonomy

Topics: Advanced Neural Network Applications · Age of Information Optimization · Stochastic Gradient Optimization Techniques