PREBA: A Hardware/Software Co-Design for Multi-Instance GPU based AI Inference Servers
Gwangoo Yeo, Jiin Kim, Yujeong Choi, Minsoo Rhu

TL;DR
PREBA is a hardware/software co-design for Multi-Instance GPU (MIG) AI inference servers. It offloads data preprocessing to an FPGA accelerator and adds a dynamic batching system, improving throughput, tail latency, energy efficiency, and cost efficiency.
Contribution
The paper introduces PREBA, which combines a novel FPGA-based data preprocessing accelerator with a dynamic batching system to remove the data preprocessing bottleneck in MIG-based AI inference servers.
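To make the idea of dynamic batching concrete, the following is a minimal sketch of a generic size-or-timeout batching policy. It is not PREBA's batching algorithm, which this summary does not describe; the names `Request`, `form_batches`, `max_batch_size`, and `max_wait_ms` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    request_id: int
    arrival_ms: float  # time at which the request reached the server


def form_batches(
    requests: List[Request], max_batch_size: int, max_wait_ms: float
) -> List[List[int]]:
    """Group requests into batches using a generic size-or-timeout rule.

    A batch is closed when it holds max_batch_size requests, or when the
    next request arrives more than max_wait_ms after the batch's first
    request. This is an illustrative policy, not the one used by PREBA.
    """
    batches: List[List[int]] = []
    current: List[Request] = []
    for req in sorted(requests, key=lambda r: r.arrival_ms):
        # Timeout rule: the oldest queued request has waited too long.
        if current and req.arrival_ms - current[0].arrival_ms > max_wait_ms:
            batches.append([r.request_id for r in current])
            current = []
        current.append(req)
        # Size rule: the batch is full, so dispatch it immediately.
        if len(current) == max_batch_size:
            batches.append([r.request_id for r in current])
            current = []
    if current:
        # Flush whatever is left once the request stream ends.
        batches.append([r.request_id for r in current])
    return batches


if __name__ == "__main__":
    arrivals = [0.0, 1.0, 2.0, 3.0, 20.0, 21.0]
    reqs = [Request(i, t) for i, t in enumerate(arrivals)]
    print(form_batches(reqs, max_batch_size=4, max_wait_ms=5.0))
```

Under this assumed policy, a larger `max_batch_size` favors throughput while a smaller `max_wait_ms` bounds how long an early request can wait, which is the throughput/tail-latency trade-off a batching system for small GPU slices has to manage.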
Findings
3.7x throughput improvement
3.4x tail latency reduction
3.5x energy-efficiency gain
Abstract
NVIDIA's Multi-Instance GPU (MIG) is a feature that enables system designers to reconfigure one large GPU into multiple smaller GPU slices. This work characterizes this emerging GPU and evaluates its effectiveness in designing high-performance AI inference servers. Our study reveals that the data preprocessing stage of AI inference causes significant performance bottlenecks to MIG. To this end, we present PREBA, which is a hardware/software co-design targeting MIG inference servers. Our first proposition is an FPGA-based data preprocessing accelerator that unlocks the full potential of MIG with domain-specific acceleration of data preprocessing. The MIG inference server unleashed from preprocessing overheads is then augmented with our dynamic batching system that enables high-performance inference. PREBA is implemented end-to-end in real systems, providing a 3.7x improvement in…
Taxonomy
Topics: IoT and Edge/Fog Computing · Parallel Computing and Optimization Techniques
