Harpagon: Minimizing DNN Serving Cost via Efficient Dispatching, Scheduling and Splitting
Zhixin Zhao, Yitao Hu, Ziqi Gong, Guotao Yang, Wenxin Li, Xiulong Liu, Keqiu Li, Hao Wang

TL;DR
Harpagon is a DNN inference system that reduces cloud serving costs by optimizing dispatching, scheduling, and latency splitting to meet real-time application constraints.
Contribution
It introduces a three-level optimization framework for DNN inference that significantly lowers serving costs while satisfying latency requirements.
Findings
Harpagon lowers serving cost by a factor of up to 2.37 relative to existing systems.
It achieves near-optimal cost solutions for over 91% of workloads within milliseconds.
The system effectively balances throughput and latency through its novel dispatching and scheduling policies.
Abstract
Advances in deep neural networks (DNNs) have significantly contributed to the development of real-time video processing applications. Efficient scheduling of DNN workloads in cloud-hosted inference systems is crucial to minimizing serving costs while meeting application latency constraints. However, existing systems suffer from excessive module latency during request dispatching, low execution throughput during module scheduling, and wasted latency budget during latency splitting for multi-DNN applications, all of which undermine their ability to minimize the serving cost. In this paper, we design a DNN inference system called Harpagon, which minimizes the serving cost under latency constraints with a three-level design. It first maximizes the batch collection rate with a batch-aware request dispatch policy to minimize the module latency. It then maximizes the module throughput with…
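The link the abstract draws between batch collection rate and module latency can be illustrated with a small sketch. It is not Harpagon's algorithm: it assumes a generic batching model in which a request waits, in the worst case, for the batch to fill (batch size divided by collection rate) and then for the batch to execute. The function names, the latency profile, the request rates, and the 150 ms budget are all hypothetical values chosen for illustration.

```python
def worst_case_latency(batch_size, collect_rate, exec_time):
    """Worst-case seconds a request spends in a module under the assumed model:
    time to collect a full batch plus time to execute it."""
    if collect_rate <= 0:
        raise ValueError("collect_rate must be positive")
    return batch_size / collect_rate + exec_time


def largest_feasible_batch(profile, collect_rate, budget):
    """Return the largest batch size whose worst-case latency fits the budget.

    profile maps batch size -> execution time in seconds (hypothetical numbers).
    Returns None when no batch size fits.
    """
    feasible = [
        b for b, exec_time in profile.items()
        if worst_case_latency(b, collect_rate, exec_time) <= budget
    ]
    return max(feasible, default=None)


# Hypothetical profile: execution time grows sublinearly with batch size,
# so larger batches give higher throughput per worker.
profile = {1: 0.010, 2: 0.014, 4: 0.022, 8: 0.040, 16: 0.075}
budget = 0.150  # hypothetical module latency budget in seconds

# 200 requests/s split evenly over 4 workers: each batch fills at 50 requests/s.
per_worker = largest_feasible_batch(profile, collect_rate=50.0, budget=budget)

# Same traffic, but each batch is filled from the whole stream at 200 requests/s.
aggregate = largest_feasible_batch(profile, collect_rate=200.0, budget=budget)

print(per_worker, aggregate)
```

Under these assumed numbers, the faster collection rate leaves room for a larger batch within the same budget, which is the general reason a higher batch collection rate can translate into higher throughput per worker and therefore fewer workers to pay for.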
