Dynamic Space-Time Scheduling for GPU Inference
Paras Jain, Xiangxi Mo, Ajay Jain, Harikaran Subbaraj, Rehan Sohail Durrani, Alexey Tumanov, Joseph Gonzalez, Ion Stoica

TL;DR
This paper introduces a dynamic space-time scheduling approach for GPU inference that significantly improves GPU utilization, throughput, and latency predictability by leveraging both temporal and spatial multiplexing techniques.
Contribution
The paper explores techniques that combine temporal and spatial multiplexing of a GPU for deep learning inference, and prototypes a dynamic space-time scheduler that outperforms time-only multiplexing in floating-point throughput.
Findings
Up to 5x potential for GPU utilization improvement
3.23x increase in floating-point throughput with the prototype
7.73x increase in floating-point throughput over time-only multiplexing
Abstract
Serving deep neural networks in latency-critical interactive settings often requires GPU acceleration. However, the small batch sizes typical in online inference result in poor GPU utilization, a potential performance gap which GPU resource sharing can address. In this paper, we explore several techniques to leverage both temporal and spatial multiplexing to improve GPU utilization for deep learning inference workloads. We evaluate the performance trade-offs of each approach with respect to resource efficiency, latency predictability, and isolation when compared with conventional batched inference. Our experimental analysis suggests up to a 5x potential for improved utilization through the exploration of more advanced spatial and temporal multiplexing strategies. Our preliminary prototype of a dynamic space-time scheduler demonstrates a 3.23x floating-point throughput increase over…
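The difference between time-only and space-time multiplexing can be illustrated with a toy model. The sketch below is not the paper's scheduler: the `Kernel` fields, the greedy packing rule, and the idea that a kernel occupies a fixed number of streaming multiprocessors (SMs) for a fixed time are all simplifying assumptions made for illustration. It only shows why running small kernels side by side can raise utilization when each one alone cannot fill the device.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Kernel:
    sms: int       # SMs the kernel can keep busy (hypothetical figure)
    millis: float  # run time when given those SMs (hypothetical figure)


def time_multiplexed(kernels, total_sms):
    """Run kernels one after another; each has the whole GPU to itself."""
    makespan = sum(k.millis for k in kernels)
    busy = sum(k.sms * k.millis for k in kernels)  # SM-milliseconds of real work
    return makespan, busy / (total_sms * makespan)


def space_time_multiplexed(kernels, total_sms):
    """Greedily pack kernels into waves that run concurrently on disjoint SMs.

    Assumes a non-empty kernel list and that no kernel needs more than
    total_sms. A wave lasts as long as its slowest kernel.
    """
    waves = []
    for k in sorted(kernels, key=lambda k: k.millis, reverse=True):
        for wave in waves:
            if sum(w.sms for w in wave) + k.sms <= total_sms:
                wave.append(k)  # fits next to the kernels already in this wave
                break
        else:
            waves.append([k])   # no wave has room, start a new one
    makespan = sum(max(w.millis for w in wave) for wave in waves)
    busy = sum(k.sms * k.millis for k in kernels)
    return makespan, busy / (total_sms * makespan)


if __name__ == "__main__":
    # Four small-batch kernels, each able to use only a quarter of an 80-SM GPU.
    kernels = [Kernel(sms=20, millis=1.0)] * 4
    print("time-only :", time_multiplexed(kernels, total_sms=80))
    print("space-time:", space_time_multiplexed(kernels, total_sms=80))
```

In this toy setting the four kernels take four time slots and leave three quarters of the SMs idle when time-multiplexed, but fit into a single slot when packed spatially. Real GPUs do not partition this cleanly, and the model ignores contention and isolation effects, which the abstract lists among the trade-offs under evaluation.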
Taxonomy
Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Stochastic Gradient Optimization Techniques
