A Simulation Platform for Multi-tenant Machine Learning Services on Thousands of GPUs

Ruofan Liang; Bingsheng He; Shengen Yan; Peng Sun

arXiv:2201.03175·cs.DC·January 11, 2022

Open Access

TL;DR

AnalySIM is a trace-driven cluster simulator designed for efficient evaluation of multi-tenant machine learning services on thousands of GPUs, enabling policy testing without real cluster deployment.

Contribution

The paper introduces AnalySIM, a scalable simulation platform that models large GPU clusters for performance analysis of scheduling policies in multi-tenant ML workloads.

Findings

01. Preemption and migration reduce job completion time.

02. AnalySIM effectively models large GPU clusters.

03. Scheduling policies impact resource utilization.

Abstract

Multi-tenant machine learning services have become emerging data-intensive workloads in data centers with heavy usage of GPU resources. Due to the large scale, many tuning parameters and heavy resource usage, it is usually impractical to evaluate and benchmark those machine learning services on real clusters. In this demonstration, we present AnalySIM, a cluster simulator that allows efficient design explorations for multi-tenant machine learning services. Specifically, by trace-driven cluster workload simulation, AnalySIM can easily test and analyze various scheduling policies in a number of performance metrics such as GPU resource utilization. AnalySIM simulates the cluster computational resource based on both physical topology and logical partition. The tool has been used in SenseTime to understand the impact of different scheduling policies with the trace from a real production…
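
To make the idea of trace-driven scheduling simulation concrete, here is a minimal sketch in Python. It is not AnalySIM's implementation: the `Job` fields, the `simulate_fifo` function, and the strict FIFO policy are all assumptions chosen for illustration, and the cluster is treated as a flat pool of GPUs with no topology or logical partitioning. It replays a list of job records and reports two metrics of the kind the abstract mentions: mean GPU utilization and mean job completion time.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Job:
    """One trace record. All field names are hypothetical."""
    job_id: str
    submit_time: float  # seconds since the start of the trace
    num_gpus: int       # GPUs requested by the job
    duration: float     # run time in seconds once the job starts


def simulate_fifo(trace, total_gpus):
    """Replay a trace under strict FIFO scheduling on a flat GPU pool.

    Returns (mean GPU utilization, mean job completion time).
    Utilization is measured from the first submission to the last finish.
    """
    pending = sorted(trace, key=lambda j: j.submit_time)
    if not pending:
        return 0.0, 0.0
    if any(j.num_gpus > total_gpus for j in pending):
        raise ValueError("a job requests more GPUs than the cluster has")

    running = []               # min-heap of (finish_time, num_gpus)
    free = total_gpus
    now = start = pending[0].submit_time
    makespan_end = start
    busy_gpu_seconds = 0.0
    completion_times = []

    for job in pending:
        now = max(now, job.submit_time)
        # Release GPUs held by jobs that have already finished.
        while running and running[0][0] <= now:
            _, gpus = heapq.heappop(running)
            free += gpus
        # Strict FIFO: the head job blocks the queue until it fits,
        # so advance the clock to the next completions as needed.
        while free < job.num_gpus:
            finish, gpus = heapq.heappop(running)
            now = max(now, finish)
            free += gpus
        free -= job.num_gpus
        finish = now + job.duration
        heapq.heappush(running, (finish, job.num_gpus))
        busy_gpu_seconds += job.num_gpus * job.duration
        completion_times.append(finish - job.submit_time)
        makespan_end = max(makespan_end, finish)

    elapsed = makespan_end - start
    utilization = busy_gpu_seconds / (total_gpus * elapsed) if elapsed > 0 else 0.0
    mean_jct = sum(completion_times) / len(completion_times)
    return utilization, mean_jct


if __name__ == "__main__":
    # A tiny made-up trace, not data from the paper.
    demo_trace = [
        Job("a", 0.0, 2, 10.0),
        Job("b", 0.0, 2, 10.0),
        Job("c", 5.0, 4, 10.0),
    ]
    util, jct = simulate_fifo(demo_trace, total_gpus=4)
    print(f"utilization={util:.2f} mean_jct={jct:.2f}s")
```

Comparing policies in this style would mean swapping the FIFO loop for another rule (for example, one that allows preemption or backfilling) and replaying the same trace, which is the kind of comparison a trace-driven simulator is meant to make cheap.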

Taxonomy

Topics: Cloud Computing and Resource Management · IoT and Edge/Fog Computing · Distributed and Parallel Computing Systems