DxPU: Large Scale Disaggregated GPU Pools in the Datacenter

Bowen He; Xiao Zheng; Yuan Chen; Weinan Li; Yajin Zhou; Xin Long; Pengcheng Zhang; Xiaowei Lu; Linquan Jiang; Qiang Liu; Dennis Cai; Xiantao Zhang

arXiv:2310.04648 · cs.DC · October 10, 2023

TL;DR

DxPU introduces a scalable GPU disaggregation system for datacenters that improves resource utilization and flexibility, with minimal performance overhead for AI workloads, enabling more efficient cloud GPU management.

Contribution

The paper presents DxPU, a novel datacenter-scale GPU disaggregation system that addresses compatibility, scope, and capacity issues of existing solutions, with a performance model and real-world deployment.

Findings

01. Overhead of DxPU is less than 10% in most scenarios.

02. DxPU effectively allocates GPU resources based on user demand.

03. Prototype deployed in a leading cloud provider's datacenter demonstrates practical viability.

Abstract

The rapid adoption of AI and the convenience offered by cloud services have resulted in growing demand for GPUs in the cloud. Generally, GPUs are physically attached to host servers as PCIe devices. However, this fixed pairing of host servers and GPUs is extremely inefficient in resource utilization, upgrades, and maintenance. To address these issues, GPU disaggregation has been proposed to decouple GPUs from host servers. It aggregates GPUs into a pool and allocates GPU node(s) according to user demands. However, existing GPU disaggregation systems have flaws in software-hardware compatibility, disaggregation scope, and capacity. In this paper, we present a new implementation of datacenter-scale GPU disaggregation, named DxPU. DxPU efficiently solves the above problems and can flexibly allocate as many GPU node(s) as users demand. In order to understand the…
