Computron: Serving Distributed Deep Learning Models with Model Parallel Swapping

Daniel Zou; Xinchen Jin; Xueyang Yu; Hao Zhang; James Demmel

arXiv:2306.13835 · cs.DC · June 27, 2023 · 1 cites

TL;DR

Computron is a system that enables efficient serving of large, distributed deep learning models by using memory swapping and model parallelism across GPU clusters, improving resource utilization and handling variable workloads.

Contribution

We introduce Computron, a novel system that leverages model parallel swapping to serve large models efficiently on shared GPU clusters, addressing scalability and workload variability.

Findings

01 Successfully parallelizes model swapping on multiple GPUs

02 Handles bursty and skewed request patterns effectively

03 Improves resource utilization for large model serving
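To make "bursty and skewed request patterns" concrete, here is a minimal sketch of one way such a randomized workload could be generated. It is an illustration only: the Gamma-distributed inter-arrival times, the power-law popularity weights, and every name (`generate_workload`, `cv`, `skew`) are assumptions made here, not details taken from the paper.

```python
import random


def generate_workload(num_models, total_rate, cv, skew, duration, seed=0):
    """Return a list of (arrival_time, model_id) pairs.

    Hypothetical generator, not Computron's own code.
    - total_rate: mean requests per second across all models.
    - cv: coefficient of variation of inter-arrival times
      (cv = 1 gives Poisson arrivals, cv > 1 gives burstier traffic).
    - skew: exponent of the popularity weights 1 / (rank ** skew)
      (skew = 0 spreads requests uniformly over models).
    """
    rng = random.Random(seed)
    weights = [1.0 / (rank + 1) ** skew for rank in range(num_models)]

    # Gamma inter-arrivals with mean 1 / total_rate and the requested cv:
    # mean = shape * scale, cv = 1 / sqrt(shape).
    shape = 1.0 / (cv * cv)
    scale = (cv * cv) / total_rate

    requests = []
    now = 0.0
    while True:
        now += rng.gammavariate(shape, scale)
        if now >= duration:
            break
        model_id = rng.choices(range(num_models), weights=weights)[0]
        requests.append((now, model_id))
    return requests


if __name__ == "__main__":
    # Three models, 10 requests/s on average, bursty (cv = 4), skewed.
    for arrival, model_id in generate_workload(3, 10.0, 4.0, 1.0, 2.0, seed=1):
        print(f"{arrival:.3f}s -> model {model_id}")
```

Raising `cv` clusters arrivals into bursts without changing the mean rate, and raising `skew` concentrates requests on a few models, which are the two variability factors named in the finding.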

Abstract

Many of the most performant deep learning models today in fields like language and image understanding are fine-tuned models that contain billions of parameters. In anticipation of workloads that involve serving many such large models to handle different tasks, we develop Computron, a system that uses memory swapping to serve multiple distributed models on a shared GPU cluster. Computron implements a model parallel swapping design that takes advantage of the aggregate CPU-GPU link bandwidth of a cluster to speed up model parameter transfers. This design makes swapping large models feasible and can improve resource utilization. We demonstrate that Computron successfully parallelizes model swapping on multiple GPUs, and we test it on randomized workloads to show how it can tolerate real-world variability factors like burstiness and skewed request rates. Computron's source code is…
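The bandwidth argument in the abstract can be illustrated with a back-of-the-envelope estimate. The sketch below assumes an idealized setting in which a model is sharded evenly across GPUs and each GPU loads its shard over its own CPU-GPU link at full speed. The function name, the parameter count, and the 16 GB/s link figure are hypothetical values chosen for illustration, not measurements from the paper.

```python
def swap_in_seconds(model_bytes, link_gb_per_s, num_gpus=1):
    """Idealized time to copy a model's parameters from CPU to GPU memory.

    Assumes the parameters are split evenly over `num_gpus` shards and that
    every shard moves over a separate link of `link_gb_per_s` gigabytes per
    second, with no contention or software overhead (a simplification).
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    shard_bytes = model_bytes / num_gpus
    return shard_bytes / (link_gb_per_s * 1e9)


if __name__ == "__main__":
    # Hypothetical model: 6.7 billion parameters stored in 2-byte floats.
    model_bytes = 6.7e9 * 2
    for gpus in (1, 2, 4):
        seconds = swap_in_seconds(model_bytes, link_gb_per_s=16.0, num_gpus=gpus)
        print(f"{gpus} GPU(s): {seconds:.3f} s")
```

Under these assumptions the transfer time falls in proportion to the number of links used, which is the intuition behind using aggregate link bandwidth. Real transfers would also pay synchronization and launch overheads that this estimate ignores.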

Code & Models

Repositories

dlzou/computron (PyTorch, official)

Taxonomy

Topics: Advanced Neural Network Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings