Auto-Parallelizing Large Models with Rhino: A Systematic Approach on Production AI Platform

Shiwei Zhang; Lansong Diao; Siyu Wang; Zongyan Cao; Yiliang Gu; Chang Si; Ziji Shi; Zhen Zheng; Chuan Wu; Wei Lin

arXiv:2302.08141·cs.DC·February 17, 2023·1 cites

Open Access

TL;DR

Rhino is a system that automatically parallelizes tensor programs for large models on AI platforms, enabling scalable distributed execution without user configuration, and discovering superior strategies for diverse models.

Contribution

It introduces a systematic approach to parallelization that generalizes across applications and explores a comprehensive strategy space with efficient search heuristics.

Findings

01

Rhino can replicate expert-crafted strategies for existing models.

02

It discovers novel parallelization strategies surpassing current systems.

03

It demonstrates scalability to thousands of devices in production environments.

Abstract

We present Rhino, a system for accelerating tensor programs with automatic parallelization on an AI platform for real production environments. It transforms a tensor program written for a single device into an equivalent distributed program that can scale up to thousands of devices with no user configuration. Rhino first works on a semantically independent intermediate representation of tensor programs, which facilitates its generalization to unprecedented applications. Additionally, it implements a task-oriented controller and a distributed runtime for optimal performance. Rhino explores a complete and systematic parallelization strategy space that comprises all the paradigms commonly employed in deep learning (DL), in addition to strided partitioning and pipeline parallelism on non-linear models. Aiming to efficiently search for a near-optimal parallel execution plan, our…
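
To make the idea of an "equivalent distributed program" concrete, the sketch below splits one matrix multiplication across several hypothetical devices by columns, using either contiguous or strided partitioning, and checks that the reassembled result matches the single-device result. This is an illustration only: the function names (`matmul`, `column_shards`, `sharded_matmul`) are invented for this example, the "devices" are simulated in a loop, and nothing here reflects Rhino's actual intermediate representation, API, or search procedure.

```python
# Illustrative sketch only; NOT Rhino's API or algorithm.
# All names below are hypothetical and devices are simulated sequentially.

def matmul(a, b):
    """Single-device reference: multiply matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def column_shards(b, num_devices, strided):
    """Return, for each simulated device, the column indices of b it owns."""
    n_cols = len(b[0])
    if strided:
        # Strided partitioning: device d owns columns d, d + k, d + 2k, ...
        return [list(range(d, n_cols, num_devices))
                for d in range(num_devices)]
    # Contiguous partitioning: device d owns one consecutive block of columns.
    block = -(-n_cols // num_devices)  # ceiling division
    return [list(range(d * block, min((d + 1) * block, n_cols)))
            for d in range(num_devices)]

def sharded_matmul(a, b, num_devices, strided=False):
    """Compute a @ b from per-device column slices of b, then reassemble."""
    out = [[0] * len(b[0]) for _ in a]
    for cols in column_shards(b, num_devices, strided):
        b_local = [[row[c] for c in cols] for row in b]  # this device's shard
        partial = matmul(a, b_local)                     # local computation
        for i, row in enumerate(partial):
            for j, c in enumerate(cols):
                out[i][c] = row[j]                       # gather into place
    return out

a = [[1, 2], [3, 4]]
b = [[1, 0, 2, 1], [0, 1, 1, 3]]
reference = matmul(a, b)
contiguous = sharded_matmul(a, b, num_devices=2, strided=False)
strided = sharded_matmul(a, b, num_devices=2, strided=True)
```

Both partitionings produce the same output as the single-device program; they differ only in which columns each device holds, which is the kind of choice an automatic parallelization system has to make across a whole model.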

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems · Tensor decomposition and applications

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings