Highly Available Data Parallel ML training on Mesh Networks

Sameer Kumar; Norm Jouppi

arXiv:2011.03605·cs.LG·November 10, 2020

Open Access

TL;DR

This paper introduces fault-tolerant routing techniques for data parallel ML training on mesh networks, enabling resilient gradient aggregation with minimal throughput impact despite chip failures.

Contribution

It presents novel methods for routing allreduce traffic around failed chips in 2-D mesh networks, improving fault tolerance in large-scale ML training.

Findings

01  Minimal impact on training throughput on 512 and 1024 TPU-v3 chips

02  Gradient summation allreduce traffic routed around failed chips on 2-D meshes

03  Evaluated with the MLPerf-v0.7 ResNet-50 and BERT benchmarks

Abstract

Data parallel ML models can take several days or weeks to train on several accelerators. The long duration of training relies on the cluster of resources to be available for the job to keep running for the entire duration. On a mesh network this is challenging because failures will create holes in the mesh. Packets must be routed around the failed chips for full connectivity. In this paper, we present techniques to route gradient summation allreduce traffic around failed chips on 2-D meshes. We evaluate performance of our fault tolerant allreduce techniques via the MLPerf-v0.7 ResNet-50 and BERT benchmarks. Performance results show minimal impact to training throughput on 512 and 1024 TPU-v3 chips.
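To make the "holes in the mesh" problem concrete, here is a minimal sketch of finding a path around failed chips on a 2-D mesh. It is not the paper's allreduce technique: it uses a plain breadth-first search between two chips, assumes a mesh without wraparound links, and the function name `route_around_failures` and its parameters are hypothetical names chosen for this illustration.

```python
from collections import deque


def route_around_failures(rows, cols, failed, src, dst):
    """Return a shortest list of (row, col) hops from src to dst on a
    rows x cols mesh, skipping chips in `failed`; None if unreachable.

    Illustrative assumption: chips link only to their four mesh
    neighbours, with no wraparound (torus) links.
    """
    if src in failed or dst in failed:
        return None
    parent = {src: None}  # visited chips mapped to their predecessor
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            # Walk predecessors back to src, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        r, c = node
        for nxt in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            nr, nc = nxt
            in_mesh = 0 <= nr < rows and 0 <= nc < cols
            if in_mesh and nxt not in failed and nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


# A 3 x 3 mesh whose centre chip has failed: the route from the left
# edge to the right edge has to detour along another row.
path = route_around_failures(3, 3, {(1, 1)}, (1, 0), (1, 2))
```

With the centre chip healthy the route takes 2 hops; with it failed the shortest detour takes 4. A real allreduce schedule has to reroute whole rings or trees of traffic rather than a single source-destination pair, which is the problem the paper addresses.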

Taxonomy

Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies

Methods: Linear Layer · Attention Dropout · Dropout · Softmax · Multi-Head Attention · Residual Connection · Dense Connections · WordPiece · Linear Warmup With Linear Decay