Enabling Elastic Model Serving with MultiWorld

Myungjin Lee; Akshay Jajoo; Ramana Rao Kompella

arXiv:2407.08980·cs.DC·July 15, 2024

TL;DR

This paper introduces MultiWorld, a system that enables elastic, fault-tolerant, and scalable deployment of large machine learning models across multiple GPUs, addressing the limitations of existing collective communication libraries for inference workloads.

Contribution

MultiWorld provides a novel approach to fault tolerance and online scaling for large model serving, bridging the gap between inference workload characteristics and collective communication libraries.

Findings

01 Small overheads (1.4-4.3% throughput loss) in various scenarios
02 Enables elastic scaling and fault tolerance for large models
03 Improves GPU resource utilization during inference

Abstract

The parameter counts of machine learning models have grown exponentially over the past few years, and trillion-parameter models are now emerging. Such large models cannot fit into a single GPU and thus require partitioned deployment across GPUs and even hosts. A high-performance collective communication library (CCL) such as NCCL is essential to fully utilize expensive GPU resources. However, a CCL is not a great fit for inference. Unlike training, for which a fixed amount of GPU resources is used for fixed workloads (e.g., input datasets), inference workloads can change dynamically over time. Failures at serving time can also directly affect individual users' experience. In contrast, workers in a CCL process group share a single fault domain, and the process group cannot grow as the workloads increase. The gap between the unique characteristics of model…

Code & Models

Repositories

cisco-open/pymultiworld (PyTorch, official)

Taxonomy

Topics: AI-based Problem Solving and Planning