TicTac: Accelerating Distributed Deep Learning with Communication Scheduling
Sayed Hadi Hashemi, Sangeetha Abdu Jyothi, Roy H. Campbell

TL;DR
TicTac is a system that optimizes communication scheduling in distributed deep learning. By enforcing the order of network transfers, it reduces iteration time and straggler effects without requiring model changes.
Contribution
TicTac introduces a communication scheduling method for distributed training that guarantees near-optimal overlap of communication and computation and improves performance without modifying models.
Findings
Up to 37.7% throughput increase in inference
Up to 19.2% throughput increase in training
Straggler effects reduced by up to 2.3 times
Abstract
State-of-the-art deep learning systems rely on iterative distributed training to tackle the increasing complexity of models and input data. The iteration time in these communication-heavy systems depends on the computation time, the communication time, and the extent to which computation and communication overlap. In this work, we identify a shortcoming in systems that represent computation as a graph, such as TensorFlow and PyTorch, that results in high variance in iteration time: parameters are received in a random order across workers. We develop a system, TicTac, to improve the iteration time by fixing this issue in distributed deep learning with Parameter Servers while guaranteeing near-optimal overlap of communication and computation. TicTac uses prioritization to identify and enforce an order of network transfers that improves the iteration time. Our system is implemented over…
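To illustrate why the order of parameter transfers affects iteration time, here is a minimal toy simulation. It is not TicTac's scheduling algorithm. It assumes a hypothetical worker that runs a simple chain of ops, where op i needs parameter i, and parameters arrive one at a time over a single link. The function name `iteration_time` and the unit transfer and compute times are illustrative assumptions, not values from the paper.

```python
from itertools import permutations


def iteration_time(order, transfer_time, compute_time):
    """Simulate one iteration of a chain of ops on a single worker.

    Assumptions (illustrative only): op i needs parameter i and runs after
    op i-1; parameters are transferred one at a time, in the sequence `order`.
    """
    # Time at which each parameter finishes arriving over the single link.
    arrival = {}
    clock = 0.0
    for p in order:
        clock += transfer_time[p]
        arrival[p] = clock

    # Each op starts once the previous op is done and its parameter is here.
    finish = 0.0
    for i in range(len(compute_time)):
        start = max(finish, arrival[i])
        finish = start + compute_time[i]
    return finish


transfer_time = [1.0, 1.0, 1.0, 1.0]  # hypothetical per-parameter transfer cost
compute_time = [1.0, 1.0, 1.0, 1.0]   # hypothetical per-op compute cost

# Transfers follow the order in which ops consume parameters.
in_order = iteration_time([0, 1, 2, 3], transfer_time, compute_time)
# Transfers arrive in the opposite order, so no compute overlaps communication.
reversed_order = iteration_time([3, 2, 1, 0], transfer_time, compute_time)

# Spread over every possible arrival order, as a stand-in for random ordering.
all_times = [
    iteration_time(list(perm), transfer_time, compute_time)
    for perm in permutations(range(4))
]
best, worst = min(all_times), max(all_times)

print(in_order, reversed_order, best, worst)
```

In this toy setting, the gap between the best and worst arrival orders is the kind of variance that a fixed, enforced transfer order removes. A real scheduler has to work on a general computation graph rather than a chain.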
Taxonomy
Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Privacy-Preserving Technologies in Data
