Highly Available Data Parallel ML training on Mesh Networks
Sameer Kumar, Norm Jouppi

TL;DR
This paper introduces fault-tolerant routing techniques for data parallel ML training on mesh networks, enabling resilient gradient aggregation with minimal throughput impact despite chip failures.
Contribution
It presents novel methods for routing allreduce traffic around failed chips in 2-D mesh networks, improving fault tolerance in large-scale ML training.
Findings
Minimal impact on training throughput on 512 and 1024 TPU-v3 chips
Effective routing around failed chips in large meshes
Evaluated on the MLPerf-v0.7 ResNet-50 and BERT benchmarks
Abstract
Data parallel ML models can take several days or weeks to train on several accelerators. The long duration of training relies on the cluster of resources to be available for the job to keep running for the entire duration. On a mesh network this is challenging because failures will create holes in the mesh. Packets must be routed around the failed chips for full connectivity. In this paper, we present techniques to route gradient summation allreduce traffic around failed chips on 2-D meshes. We evaluate performance of our fault tolerant allreduce techniques via the MLPerf-v0.7 ResNet-50 and BERT benchmarks. Performance results show minimal impact to training throughput on 512 and 1024 TPU-v3 chips.
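To make the idea of summing gradients across a mesh with holes concrete, here is a minimal sketch. It is not the paper's routing scheme: it swaps in a simple tree-based allreduce over a breadth-first spanning tree of the healthy chips, assuming a 2-D mesh without wraparound links and one scalar gradient per chip. The function names (`healthy_neighbors`, `bfs_tree`, `tree_allreduce`) are hypothetical and chosen only for illustration.

```python
from collections import deque


def healthy_neighbors(node, rows, cols, failed):
    """Yield mesh neighbors of `node` that are in bounds and not failed."""
    r, c = node
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nxt = (r + dr, c + dc)
        if 0 <= nxt[0] < rows and 0 <= nxt[1] < cols and nxt not in failed:
            yield nxt


def bfs_tree(root, rows, cols, failed):
    """Return {node: parent} for a BFS spanning tree over reachable healthy chips."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in healthy_neighbors(node, rows, cols, failed):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return parent


def tree_allreduce(grads, rows, cols, failed, root=(0, 0)):
    """Sum per-chip gradients over healthy chips; return the sum for each chip.

    Assumption: this is an illustrative tree allreduce, not the paper's method.
    """
    if root in failed:
        raise ValueError("root chip has failed")
    parent = bfs_tree(root, rows, cols, failed)
    healthy = {(r, c) for r in range(rows) for c in range(cols)} - set(failed)
    if set(parent) != healthy:
        raise ValueError("failures disconnect the mesh")
    partial = {node: grads[node] for node in parent}
    # Reduce phase: reverse BFS order visits every child before its parent.
    for node in reversed(list(parent)):
        if parent[node] is not None:
            partial[parent[node]] += partial[node]
    total = partial[root]
    # Broadcast phase: every healthy chip ends up with the same sum.
    return {node: total for node in parent}


if __name__ == "__main__":
    failed = {(1, 1), (1, 2)}  # a hole in a 4x4 mesh
    grads = {(r, c): 1.0 for r in range(4) for c in range(4)}
    result = tree_allreduce(grads, 4, 4, failed)
    print(len(result), result[(0, 0)])
```

A spanning tree keeps the sketch short, but it funnels all traffic through the root, so it says nothing about the throughput behavior reported in the abstract.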
Taxonomy
Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies
Methods: Linear Layer · Attention Dropout · Dropout · Softmax · Multi-Head Attention · Residual Connection · Dense Connections · WordPiece · Linear Warmup With Linear Decay
