Mesh-TensorFlow: Deep Learning for Supercomputers

Noam Shazeer; Youlong Cheng; Niki Parmar; Dustin Tran; Ashish Vaswani; Penporn Koanantakool; Peter Hawkins; HyoukJoong Lee; Mingsheng Hong; Cliff Young; Ryan Sepassi; Blake Hechtman

arXiv:1811.02084·cs.LG·November 7, 2018·52 cites

TL;DR

Mesh-TensorFlow is a flexible language for specifying distributed tensor computations. It enables efficient training of large models on large clusters, and models trained with it surpass previous state-of-the-art results in translation and language modeling.

Contribution

The paper presents Mesh-TensorFlow, a language for defining general tensor distribution strategies, which supports scalable model parallelism and data parallelism on large clusters.

Findings

01 Trained Transformer models with up to 5 billion parameters.

02 Surpassed state-of-the-art results on the WMT'14 English-to-French translation task and the one-billion-word language modeling benchmark.

03 Demonstrated efficient training on TPU meshes of up to 512 cores.

Abstract

Batch-splitting (data-parallelism) is the dominant distributed Deep Neural Network (DNN) training strategy, due to its universal applicability and its amenability to Single-Program-Multiple-Data (SPMD) programming. However, batch-splitting suffers from problems including the inability to train very large models (due to memory constraints), high latency, and inefficiency at small batch sizes. All of these can be solved by more general distribution strategies (model-parallelism). Unfortunately, efficient model-parallel algorithms tend to be complicated to discover, describe, and to implement, particularly on large clusters. We introduce Mesh-TensorFlow, a language for specifying a general class of distributed tensor computations. Where data-parallelism can be viewed as splitting tensors and operations along the "batch" dimension, in Mesh-TensorFlow, the user can specify any…
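The contrast the abstract draws between splitting only the "batch" dimension and splitting other tensor dimensions can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the tensorflow/mesh API: the function `local_shape`, the mesh dimension names `x` and `y`, the layout dictionaries, and all tensor sizes are hypothetical and chosen only for the example. It assumes a tensor dimension mapped to a mesh dimension is divided evenly across it, and that unmapped dimensions are replicated on every processor.

```python
# Illustrative sketch only. All names and sizes here are hypothetical;
# this is NOT the tensorflow/mesh API.

def local_shape(tensor_shape, mesh_shape, layout):
    """Return the shape of the slice of a tensor held by one processor.

    tensor_shape: dict mapping tensor dimension name -> size
    mesh_shape:   dict mapping mesh dimension name -> number of processors
    layout:       dict mapping tensor dimension name -> mesh dimension name
    """
    local = {}
    used_mesh_dims = set()
    for dim_name, size in tensor_shape.items():
        mesh_dim = layout.get(dim_name)
        if mesh_dim is None:
            # Dimension is not split, so every processor holds all of it.
            local[dim_name] = size
            continue
        if mesh_dim in used_mesh_dims:
            # Assumed rule: one tensor cannot split two of its
            # dimensions across the same mesh dimension.
            raise ValueError(f"mesh dimension {mesh_dim!r} used twice")
        parts = mesh_shape[mesh_dim]
        if size % parts != 0:
            raise ValueError(f"{dim_name!r} does not divide evenly by {parts}")
        used_mesh_dims.add(mesh_dim)
        local[dim_name] = size // parts
    return local


# A hypothetical 4 x 2 mesh of 8 processors.
mesh_shape = {"x": 4, "y": 2}

# A hypothetical feed-forward layer: activations and a weight matrix.
activations = {"batch": 512, "hidden": 4096}
weights = {"hidden": 4096, "ff": 16384}

# Batch-splitting only (data parallelism): weights are fully replicated.
batch_only = {"batch": "x"}

# Batch-splitting plus splitting the "ff" dimension (model parallelism).
batch_and_model = {"batch": "x", "ff": "y"}

weights_batch_only = local_shape(weights, mesh_shape, batch_only)
weights_mixed = local_shape(weights, mesh_shape, batch_and_model)
activations_mixed = local_shape(activations, mesh_shape, batch_and_model)
```

With the batch-only layout, every processor stores the full weight matrix, which is the memory constraint the abstract describes. Under the mixed layout in this sketch, each processor stores only half of the `ff` dimension of the weights, while the activations are still split along `batch`.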

Code & Models

Repositories

tensorflow/mesh (official; TensorFlow)

Taxonomy

Topics: Advanced Neural Network Applications · Tensor decomposition and applications · Parallel Computing and Optimization Techniques

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Mesh-TensorFlow · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam