DxPU: Large Scale Disaggregated GPU Pools in the Datacenter
Bowen He, Xiao Zheng, Yuan Chen, Weinan Li, Yajin Zhou, Xin Long, Pengcheng Zhang, Xiaowei Lu, Linquan Jiang, Qiang Liu, Dennis Cai, Xiantao Zhang

TL;DR
DxPU introduces a scalable GPU disaggregation system for datacenters that improves resource utilization and flexibility, with minimal performance overhead for AI workloads, enabling more efficient cloud GPU management.
Contribution
The paper presents DxPU, a novel datacenter-scale GPU disaggregation system that addresses compatibility, scope, and capacity issues of existing solutions, with a performance model and real-world deployment.
Findings
Overhead of DxPU is less than 10% in most scenarios.
DxPU effectively allocates GPU resources based on user demand.
Prototype deployed in a leading cloud provider's datacenter demonstrates practical viability.
Abstract
The rapid adoption of AI and convenience offered by cloud services have resulted in the growing demands for GPUs in the cloud. Generally, GPUs are physically attached to host servers as PCIe devices. However, the fixed assembly combination of host servers and GPUs is extremely inefficient in resource utilization, upgrade, and maintenance. Due to these issues, the GPU disaggregation technique has been proposed to decouple GPUs from host servers. It aggregates GPUs into a pool, and allocates GPU node(s) according to user demands. However, existing GPU disaggregation systems have flaws in software-hardware compatibility, disaggregation scope, and capacity. In this paper, we present a new implementation of datacenter-scale GPU disaggregation, named DxPU. DxPU efficiently solves the above problems and can flexibly allocate as many GPU node(s) as users demand. In order to understand the…
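The pooling idea in the abstract (aggregate GPUs into a pool, then hand out GPU nodes according to user demand) can be illustrated with a toy allocator. This is only a minimal sketch under stated assumptions, not DxPU's implementation: the class `GpuPool`, its methods, and the host and GPU names are all hypothetical, and the real system works at the PCIe level rather than with Python lists.

```python
from dataclasses import dataclass, field


@dataclass
class GpuPool:
    """Toy model of a disaggregated GPU pool (hypothetical, not DxPU's code)."""

    free: list[str]                                   # GPU nodes not bound to any host
    bound: dict[str, list[str]] = field(default_factory=dict)  # host -> GPU nodes

    def allocate(self, host: str, count: int) -> list[str]:
        """Bind `count` free GPU nodes to `host`, or fail if the pool is too small."""
        if count > len(self.free):
            raise RuntimeError(f"requested {count} GPUs, only {len(self.free)} free")
        granted, self.free = self.free[:count], self.free[count:]
        self.bound.setdefault(host, []).extend(granted)
        return granted

    def release(self, host: str) -> None:
        """Return every GPU node bound to `host` to the free pool."""
        self.free.extend(self.bound.pop(host, []))


# Hypothetical usage: an 8-GPU pool shared by two hosts with different demands.
pool = GpuPool(free=[f"gpu{i}" for i in range(8)])
a = pool.allocate("host-a", 3)
b = pool.allocate("host-b", 4)
pool.release("host-a")
```

Because GPUs are bound per request rather than fixed to a server, `host-a` and `host-b` can hold different numbers of GPUs, and the nodes released by `host-a` become available to any other host.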
