PARD: Enhancing Goodput for Inference Pipeline via Proactive Request Dropping
Zhixin Zhao, Yitao Hu, Simin Chen, Mingfang Ji, Wei Yang, Yuhao Zhang, Laiping Zhao, Wenxin Li, Xiulong Liu, Wenyu Qu, Hao Wang

TL;DR
PARD introduces a proactive request dropping approach for DNN inference pipelines, significantly improving goodput and resource efficiency by making timely, informed dropping decisions based on runtime data.
Contribution
PARD proposes a novel proactive dropping method and an adaptive priority mechanism to optimize request handling in inference pipelines, outperforming reactive strategies.
Findings
Achieves 16%-176% higher goodput than existing methods.
Reduces request drop rate and wasted resources by up to 17x and 62x, respectively.
Effectively manages latency and workload variability in real-world settings.
Abstract
Modern deep neural network (DNN) applications integrate multiple DNN models into inference pipelines with stringent latency requirements for customized tasks. To mitigate extensive request timeouts caused by accumulation, systems for inference pipelines commonly drop a subset of requests so the remaining ones can satisfy latency constraints. Since it is commonly believed that request dropping adversely affects goodput, existing systems only drop requests when they have to, which we call reactive dropping. However, this reactive policy cannot maintain high goodput, as it neither makes timely dropping decisions nor identifies the proper set of requests to drop, leading to issues of dropping requests too late or dropping the wrong set of requests. We propose that the inference system should proactively drop certain requests in advance to enhance the goodput across the entire workload.…
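The difference between the two policies described in the abstract can be illustrated with a minimal sketch. This is not PARD's actual algorithm. The `Request` type, the function names, and the `est_remaining` latency estimate are all hypothetical and introduced only for illustration. The sketch assumes each request has a single end-to-end latency budget and that the system can estimate how long the remaining pipeline stages will take.

```python
from dataclasses import dataclass


@dataclass
class Request:
    arrival: float  # arrival time in ms (hypothetical field)
    slo: float      # end-to-end latency budget in ms (hypothetical field)


def reactive_should_drop(req: Request, now: float) -> bool:
    """Reactive policy: drop only once the deadline has already passed."""
    return now > req.arrival + req.slo


def proactive_should_drop(req: Request, now: float, est_remaining: float) -> bool:
    """Proactive policy (illustrative): drop as soon as the estimated
    completion time of the remaining stages exceeds the deadline."""
    return now + est_remaining > req.arrival + req.slo


# A request with a 100 ms budget, 60 ms into the pipeline, with an
# assumed 50 ms of work still ahead of it.
req = Request(arrival=0.0, slo=100.0)
print(reactive_should_drop(req, now=60.0))                       # False
print(proactive_should_drop(req, now=60.0, est_remaining=50.0))  # True
```

In this example the reactive check keeps the request because its deadline has not yet passed, so later stages would spend compute on a request that is expected to time out anyway. The proactive check drops it at 60 ms, freeing that capacity for requests that can still meet their deadlines.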
Taxonomy
Topics: Advanced Neural Network Applications · Big Data and Digital Economy · Parallel Computing and Optimization Techniques
