Computron: Serving Distributed Deep Learning Models with Model Parallel Swapping
Daniel Zou, Xinchen Jin, Xueyang Yu, Hao Zhang, James Demmel

TL;DR
Computron is a system that enables efficient serving of large, distributed deep learning models by using memory swapping and model parallelism across GPU clusters, improving resource utilization and handling variable workloads.
Contribution
We introduce Computron, a novel system that leverages model parallel swapping to serve large models efficiently on shared GPU clusters, addressing scalability and workload variability.
Findings
Successfully parallelizes model swapping on multiple GPUs
Handles bursty and skewed request patterns effectively
Improves resource utilization for large model serving
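To make "bursty and skewed request patterns" concrete, the following is a minimal sketch of how such a randomized workload could be generated. The page does not describe the authors' workload generator, so everything here is an assumption for illustration: the function name `generate_workload`, the use of Gamma-distributed inter-arrival times (where a coefficient of variation `cv` above 1 gives burstier traffic), and the per-model weights that skew requests toward some models.

```python
import random

def generate_workload(num_requests, rate, cv, model_weights, seed=0):
    """Return a list of (arrival_time, model_name) pairs.

    Hypothetical helper, not taken from Computron.
    rate          -- mean requests per second across all models
    cv            -- coefficient of variation of inter-arrival times;
                     cv == 1 is a Poisson process, cv > 1 is burstier
    model_weights -- dict mapping model name to relative request share
    """
    rng = random.Random(seed)
    # A Gamma distribution with shape k and scale s has mean k*s and CV 1/sqrt(k).
    shape = 1.0 / (cv * cv)
    scale = (cv * cv) / rate  # keeps the mean inter-arrival time at 1/rate
    models = list(model_weights)
    weights = [model_weights[m] for m in models]

    t = 0.0
    trace = []
    for _ in range(num_requests):
        t += rng.gammavariate(shape, scale)
        model = rng.choices(models, weights=weights, k=1)[0]
        trace.append((t, model))
    return trace

# Example: bursty arrivals (cv = 4) with most traffic going to one model.
trace = generate_workload(
    num_requests=1000, rate=10.0, cv=4.0,
    model_weights={"model_a": 8, "model_b": 1, "model_c": 1},
)
```

Raising `cv` concentrates requests into bursts separated by long idle gaps, while uneven weights mean some models are requested far more often than others. A swapping system has to cope with both effects at once.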
Abstract
Many of the most performant deep learning models today in fields like language and image understanding are fine-tuned models that contain billions of parameters. In anticipation of workloads that involve serving many such large models to handle different tasks, we develop Computron, a system that uses memory swapping to serve multiple distributed models on a shared GPU cluster. Computron implements a model parallel swapping design that takes advantage of the aggregate CPU-GPU link bandwidth of a cluster to speed up model parameter transfers. This design makes swapping large models feasible and can improve resource utilization. We demonstrate that Computron successfully parallelizes model swapping on multiple GPUs, and we test it on randomized workloads to show how it can tolerate real-world variability factors like burstiness and skewed request rates. Computron's source code is…
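The benefit of using aggregate CPU-GPU link bandwidth can be illustrated with a back-of-the-envelope estimate. This is a sketch under idealized assumptions, not the authors' performance model: the model is sharded evenly across GPUs, each GPU has its own CPU-GPU link, all links transfer concurrently at full speed, and synchronization overhead is ignored. The function name, model size, and link bandwidth below are hypothetical.

```python
def swap_time_seconds(model_bytes, num_gpus, link_gb_per_s):
    """Idealized time to load one model from CPU memory into GPU memory.

    Hypothetical helper, not taken from Computron. Each GPU loads only
    its own shard over its own link, so the per-GPU transfer shrinks in
    proportion to the number of GPUs.
    """
    shard_bytes = model_bytes / num_gpus
    return shard_bytes / (link_gb_per_s * 1e9)

# Assumed figures: 6.7e9 parameters at 2 bytes each, 16 GB/s per link.
model_bytes = 6.7e9 * 2
for gpus in (1, 2, 4):
    t = swap_time_seconds(model_bytes, gpus, link_gb_per_s=16.0)
    print(f"{gpus} GPU(s): {t:.3f} s")
```

Under these assumptions the load time falls linearly with the number of GPUs. Real transfers would also pay for synchronization and for any links that GPUs share, so measured speedups are expected to be smaller.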
Taxonomy
Topics: Advanced Neural Network Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
