Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates

Insu Jang; Zhenning Yang; Zhen Zhang; Xin Jin; Mosharaf Chowdhury

arXiv:2309.08125·cs.DC·November 9, 2023

TL;DR

Oobleck is a fault-tolerant distributed training framework for large DNNs. It combines heterogeneous pipeline templates with at least f + 1 pipeline replicas to guarantee tolerance of up to f simultaneous failures while sustaining high throughput, and it outperforms existing solutions such as Bamboo and Varuna by up to 29.6x.

Contribution

It proposes a planning-execution co-design approach with heterogeneous pipeline templates and multiple replicas to ensure fault tolerance without resource idling during large model training.

Findings

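01. Achieves up to 29.6x higher throughput than state-of-the-art fault tolerance solutions such as Bamboo and Varuna.

02. Guarantees tolerance of up to f simultaneous failures while keeping all remaining resources in use, avoiding resource idling.

03. Maintains consistently high throughput when training large DNN models with billions of parameters.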

Abstract

Oobleck enables resilient distributed training of large DNN models with guaranteed fault tolerance. It takes a planning-execution co-design approach, where it first generates a set of heterogeneous pipeline templates and instantiates at least $f + 1$ logically equivalent pipeline replicas to tolerate any $f$ simultaneous failures. During execution, it relies on already-replicated model states across the replicas to provide fast recovery. Oobleck provably guarantees that some combination of the initially created pipeline templates can be used to cover all available resources after $f$ or fewer simultaneous failures, thereby avoiding resource idling at all times. Evaluation on large DNN models with billions of parameters shows that Oobleck provides consistently high throughput, and it outperforms state-of-the-art fault tolerance solutions like Bamboo and Varuna by up to $29.6\times$.
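The coverage property described in the abstract can be illustrated with a small feasibility check. The sketch below is not Oobleck's planner: it assumes each pipeline template is characterized only by the number of nodes it uses, and it asks whether some multiset of templates uses exactly the available nodes while keeping at least f + 1 pipelines. The function name `cover_nodes`, the template sizes, and the node counts are all hypothetical, and the tie-breaking rule (prefer more pipelines) is chosen for simplicity rather than for throughput.

```python
from typing import List, Optional


def cover_nodes(template_sizes: List[int], num_nodes: int,
                min_pipelines: int) -> Optional[List[int]]:
    """Find template sizes that sum to exactly num_nodes.

    Returns a sorted list of sizes (one entry per pipeline) that uses the
    largest possible number of pipelines, or None if no exact cover exists
    or the best cover has fewer than min_pipelines pipelines.
    """
    # best[n] is a combination summing to n with the most pipelines,
    # or None when n nodes cannot be covered exactly.
    best: List[Optional[List[int]]] = [None] * (num_nodes + 1)
    best[0] = []
    for n in range(1, num_nodes + 1):
        for size in template_sizes:
            prev = best[n - size] if size <= n else None
            if prev is not None and (best[n] is None or len(prev) + 1 > len(best[n])):
                best[n] = prev + [size]
    combo = best[num_nodes]
    if combo is None or len(combo) < min_pipelines:
        return None
    return sorted(combo)


if __name__ == "__main__":
    f = 1                      # assumed number of failures to tolerate
    templates = [2, 3, 4]      # hypothetical template sizes, in nodes
    for alive in (13, 12):     # before and after losing one node
        print(alive, cover_nodes(templates, alive, min_pipelines=f + 1))
```

With these assumed sizes, every node count from 4 upward has an exact cover with at least two pipelines, so losing a single node never leaves a node idle. Smaller counts such as 3 fail the check because only one pipeline fits.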
