Pathways: Asynchronous Distributed Dataflow for ML
Paul Barham, Aakanksha Chowdhery, Jeff Dean, Sanjay Ghemawat, Steven Hand, Dan Hurt, Michael Isard, Hyeontaek Lim, Ruoming Pang, Sudip Roy, Brennan Saeta, Parker Schuh, Ryan Sepassi, Laurent El Shafey, Chandramohan A. Thekkath, Yonghui Wu

TL;DR
Pathways introduces a novel asynchronous distributed dataflow system for large-scale ML training, enabling flexible exploration of new system designs while maintaining high performance on thousands of accelerators.
Contribution
It presents a new asynchronous distributed dataflow architecture that simplifies complex parallelism patterns and achieves high utilization on large-scale accelerator clusters.
Findings
Achieves performance parity (~100% accelerator utilization) with state-of-the-art systems when running SPMD computations on 2048 TPUs.
Maintains throughput comparable to traditional SPMD for pipelined Transformer models.
Supports complex parallelism patterns with a single-controller model.
Abstract
We present the design of a new large scale orchestration layer for accelerators. Our system, Pathways, is explicitly designed to enable exploration of new systems and ML research ideas, while retaining state of the art performance for current models. Pathways uses a sharded dataflow graph of asynchronous operators that consume and produce futures, and efficiently gang-schedules heterogeneous parallel computations on thousands of accelerators while coordinating data transfers over their dedicated interconnects. Pathways makes use of a novel asynchronous distributed dataflow design that lets the control plane execute in parallel despite dependencies in the data plane. This design, with careful engineering, allows Pathways to adopt a single-controller model that makes it easier to express complex new parallelism patterns. We demonstrate that Pathways can achieve performance parity (~100%…
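The idea of operators that consume and produce futures can be illustrated with a small sketch. This is not the Pathways API; it uses Python's standard asyncio library as a stand-in, and the names run_op and run_graph are hypothetical. It assumes a three-node graph (A feeds B, and A and B feed C) and shows that every operator can be dispatched before any input value exists, while the actual computation still respects the data dependencies.

```python
import asyncio


async def run_op(name, fn, input_futures, output_future, log):
    # "Control plane" step: the operator is dispatched immediately,
    # even though its inputs may not have been produced yet.
    log.append(f"dispatched {name}")
    # "Data plane" step: block only this operator until its inputs resolve.
    inputs = [await fut for fut in input_futures]
    output_future.set_result(fn(*inputs))
    log.append(f"finished {name}")


async def run_graph():
    loop = asyncio.get_running_loop()
    log = []
    # One future per operator output (hypothetical toy graph).
    a, b, c = (loop.create_future() for _ in range(3))
    tasks = [
        # Consumers are deliberately dispatched before their producers.
        asyncio.create_task(run_op("C", lambda x, y: x + y, [a, b], c, log)),
        asyncio.create_task(run_op("B", lambda x: x * 10, [a], b, log)),
        asyncio.create_task(run_op("A", lambda: 2, [], a, log)),
    ]
    await asyncio.gather(*tasks)
    return c.result(), log


if __name__ == "__main__":
    result, log = asyncio.run(run_graph())
    print(result)
    print(log)
```

In this toy version all three operators are dispatched before any of them finishes, which mirrors (in a very simplified, single-process form) the abstract's claim that the control plane can run ahead of data-plane dependencies.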
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Distributed and Parallel Computing Systems
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Residual Connection · Position-Wise Feed-Forward Layer · Dense Connections · Softmax · Label Smoothing · Dropout
