ECLIP: Energy-efficient and Practical Co-Location of ML Inference on Spatially Partitioned GPUs
Ryan Quach, Yidi Wang, Ali Jahanshahi, Daniel Wong, Hyoseung Kim

TL;DR
ECLIP is a framework that improves the energy efficiency and throughput of co-located ML inference on GPUs. It minimizes repartitioning overheads through kernel-wise resource partitioning and optimized compute unit (CU) assignment.
Contribution
ECLIP introduces a low-overhead, kernel-wise resource partitioning framework with a resource optimizer to improve GPU utilization and energy efficiency during ML inference co-location.
Findings
13% throughput improvement
25% energy efficiency gain
Reduced repartitioning overheads
Abstract
As AI inference becomes mainstream, research has begun to focus on reducing the energy consumption of inference servers. Inference kernels commonly underutilize a GPU's compute resources and waste power on idle components. To improve utilization and energy efficiency, multiple models can be co-located to share the GPU. However, typical GPU spatial partitioning techniques often incur significant overheads when reconfiguring spatial partitions, and energy is wasted either on the repartitioning itself or on non-optimal partition configurations. In this paper, we present ECLIP, a framework that enables low-overhead, energy-efficient, kernel-wise resource partitioning between co-located inference kernels. ECLIP minimizes repartitioning overheads by pre-allocating pools of CU-masked streams and assigns optimal CU allocations to groups of kernels through our resource allocation…
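The idea of pre-allocating a pool of CU-masked streams can be illustrated with a small simulation. This is a minimal sketch, not ECLIP's implementation: the names `MaskedStream`, `build_stream_pool`, and `pick_stream`, the CU count of 60, the partition sizes, and the "smallest stream that fits" selection rule are all assumptions made for illustration, and no real GPU API is called.

```python
from dataclasses import dataclass

TOTAL_CUS = 60  # assumed CU count, for illustration only


@dataclass(frozen=True)
class MaskedStream:
    """Hypothetical stand-in for a GPU stream restricted to a subset of CUs."""
    stream_id: int
    cu_mask: int  # bit i set => CU i may be used by kernels on this stream

    @property
    def num_cus(self) -> int:
        return bin(self.cu_mask).count("1")


def build_stream_pool(sizes, total_cus=TOTAL_CUS):
    """Create one masked stream per partition size, once, up front.

    Paying the creation cost at startup means no stream has to be
    created or reconfigured while inference kernels are running.
    """
    pool = {}
    for stream_id, n in enumerate(sorted(set(sizes))):
        if not 0 < n <= total_cus:
            raise ValueError(f"partition size {n} is outside 1..{total_cus}")
        pool[n] = MaskedStream(stream_id, (1 << n) - 1)  # enable the lowest n CUs
    return pool


def pick_stream(pool, requested_cus):
    """Return the smallest pre-allocated stream with at least requested_cus CUs.

    Falls back to the largest stream if the request exceeds every pool entry.
    (Assumed selection rule; the paper's optimizer is not reproduced here.)
    """
    candidates = [n for n in pool if n >= requested_cus]
    key = min(candidates) if candidates else max(pool)
    return pool[key]


pool = build_stream_pool([15, 30, 45, 60])
stream = pick_stream(pool, requested_cus=20)
print(stream.num_cus)  # 30
```

In this sketch, changing a kernel's CU budget is a dictionary lookup rather than a partition reconfiguration, which is the intuition behind avoiding repartitioning overhead at dispatch time.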
Taxonomy
Topics: Advanced Neural Network Applications · Medical Image Segmentation Techniques · Brain Tumor Detection and Classification
