TF-Replicator: Distributed Machine Learning for Researchers

Peter Buchlovsky; David Budden; Dominik Grewe; Chris Jones; John Aslanides; Frederic Besse; Andy Brock; Aidan Clark; Sergio Gómez Colmenarejo; Aedan Pope; Fabio Viola; Dan Belov

arXiv:1902.00465 · cs.LG · February 4, 2019 · 21 citations

TL;DR

TF-Replicator is a framework built on TensorFlow that lets researchers write data-parallel and model-parallel code once and deploy it to different cluster architectures (one or many machines with CPUs, GPUs, or TPUs), using either synchronous or asynchronous training.

Contribution

The paper introduces TF-Replicator, an abstraction over TensorFlow that simplifies writing distributed machine learning models and deploying them to different cluster architectures.

Findings

01 Achieves strong scalability performance

02 Supports diverse models, including ResNet-50, SN-GAN, and D4PG

03 Requires no distributed systems expertise from the user

Abstract

We describe TF-Replicator, a framework for distributed machine learning designed for DeepMind researchers and implemented as an abstraction over TensorFlow. TF-Replicator simplifies writing data-parallel and model-parallel research code. The same models can be deployed with little effort to different cluster architectures (i.e., one or many machines containing CPUs, GPUs, or TPU accelerators) using synchronous or asynchronous training regimes. To demonstrate the generality and scalability of TF-Replicator, we implement and benchmark three very different models: (1) a ResNet-50 for ImageNet classification, (2) an SN-GAN for class-conditional ImageNet image generation, and (3) a D4PG reinforcement learning agent for continuous control. Our results show strong scalability performance without demanding any distributed systems expertise of the user. The TF-Replicator programming model will be…
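As a rough illustration of the synchronous data-parallel regime the abstract mentions, the sketch below splits a batch across simulated replicas, computes a gradient on each shard, averages the gradients (standing in for an all-reduce), and applies one shared update. It is plain Python and does not use or reproduce the TF-Replicator API; the toy one-parameter regression model and the names `grad_mse`, `split_batch`, `all_reduce_mean`, and `sync_data_parallel_step` are hypothetical, chosen only for this example.

```python
from typing import List, Sequence, Tuple

Example = Tuple[float, float]  # (x, y) pair for the toy model y ≈ w * x


def grad_mse(w: float, shard: Sequence[Example]) -> float:
    """Gradient of the mean squared error of y ≈ w * x over one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def split_batch(batch: Sequence[Example], num_replicas: int) -> List[Sequence[Example]]:
    """Split a batch into equal-sized shards, one per (simulated) replica."""
    if len(batch) % num_replicas != 0:
        raise ValueError("batch size must be divisible by num_replicas")
    size = len(batch) // num_replicas
    return [batch[i * size:(i + 1) * size] for i in range(num_replicas)]


def all_reduce_mean(values: Sequence[float]) -> float:
    """Stand-in for a cross-replica all-reduce: average the per-replica values."""
    return sum(values) / len(values)


def sync_data_parallel_step(w: float, batch: Sequence[Example],
                            num_replicas: int, lr: float) -> float:
    """One synchronous step: local gradients, averaged, then one shared update."""
    shards = split_batch(batch, num_replicas)
    local_grads = [grad_mse(w, shard) for shard in shards]  # one per replica
    return w - lr * all_reduce_mean(local_grads)


# Toy data generated by y = 2 * x (an assumption made for this illustration).
batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

w = 0.0
for _ in range(200):
    w = sync_data_parallel_step(w, batch, num_replicas=2, lr=0.01)

# With equal-sized shards, the averaged gradient equals the full-batch gradient,
# so one step with two replicas matches one step with a single replica.
one_replica = sync_data_parallel_step(0.0, batch, num_replicas=1, lr=0.01)
two_replicas = sync_data_parallel_step(0.0, batch, num_replicas=2, lr=0.01)
```

Because the shards are the same size, averaging the per-replica gradients reproduces the full-batch gradient, which is why adding replicas in this sketch changes where the work happens but not the update itself. An asynchronous regime would instead let each replica apply its update without waiting for the others.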

Code & Models

Repositories

tensorflow/community (TensorFlow, official)

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Neural Network Applications · Explainable Artificial Intelligence (XAI)

Methods: N-step Returns · Prioritized Experience Replay · Adam · Batch Normalization · Distributed Distributional DDPG