PREBA: A Hardware/Software Co-Design for Multi-Instance GPU based AI Inference Servers

Gwangoo Yeo; Jiin Kim; Yujeong Choi; Minsoo Rhu

arXiv:2411.19114·cs.DC·December 2, 2024

Open Access

TL;DR

PREBA is a hardware/software co-design for Multi-Instance GPU (MIG) based AI inference servers. It offloads data preprocessing to an FPGA accelerator and adds a dynamic batching system, improving throughput by 3.7x, reducing tail latency by 3.4x, and improving energy efficiency by 3.5x, along with better cost efficiency.

Contribution

The paper introduces PREBA, which pairs an FPGA-based data preprocessing accelerator with a dynamic batching system. Both are designed for MIG-based AI inference servers, where the data preprocessing stage causes significant performance bottlenecks.

Findings

01 3.7x throughput improvement

02 3.4x tail latency reduction

03 3.5x energy-efficiency gain

Abstract

NVIDIA's Multi-Instance GPU (MIG) is a feature that enables system designers to reconfigure one large GPU into multiple smaller GPU slices. This work characterizes this emerging GPU and evaluates its effectiveness in designing high-performance AI inference servers. Our study reveals that the data preprocessing stage of AI inference causes significant performance bottlenecks to MIG. To this end, we present PREBA, which is a hardware/software co-design targeting MIG inference servers. Our first proposition is an FPGA-based data preprocessing accelerator that unlocks the full potential of MIG with domain-specific acceleration of data preprocessing. The MIG inference server unleashed from preprocessing overheads is then augmented with our dynamic batching system that enables high-performance inference. PREBA is implemented end-to-end in real systems, providing a 3.7x improvement in…
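
The abstract mentions a dynamic batching system but does not describe its policy here. As a rough illustration of the general idea only, the sketch below groups incoming requests into batches that close either when a size limit is reached or when a waiting window expires. This is a generic size-or-timeout policy, not PREBA's actual algorithm; the names `Request`, `form_batches`, `max_batch_size`, and `max_wait_ms` are hypothetical and chosen for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    arrival_ms: float  # hypothetical arrival timestamp in milliseconds
    payload: str       # hypothetical request body


def form_batches(requests: List[Request],
                 max_batch_size: int,
                 max_wait_ms: float) -> List[List[Request]]:
    """Group requests into batches using a generic size-or-timeout rule.

    A batch is closed when it already holds max_batch_size requests, or
    when the next request arrives more than max_wait_ms after the first
    request in the batch. This is an assumed policy for illustration,
    not the policy used in the paper.
    """
    batches: List[List[Request]] = []
    current: List[Request] = []
    for req in sorted(requests, key=lambda r: r.arrival_ms):
        if current:
            batch_full = len(current) >= max_batch_size
            window_expired = req.arrival_ms - current[0].arrival_ms > max_wait_ms
            if batch_full or window_expired:
                batches.append(current)
                current = []
        current.append(req)
    if current:
        batches.append(current)
    return batches


# Example: five requests, at most two per batch, 5 ms waiting window.
reqs = [Request(t, f"img{i}") for i, t in enumerate([0, 1, 2, 10, 11])]
batch_sizes = [len(b) for b in form_batches(reqs, max_batch_size=2, max_wait_ms=5)]
```

In this sketch a larger `max_batch_size` tends to favor throughput, while a smaller `max_wait_ms` bounds how long an early request waits for its batch to fill. How a real MIG server should pick these values per GPU slice is outside what the abstract states.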

Taxonomy

Topics: IoT and Edge/Fog Computing · Parallel Computing and Optimization Techniques