Tula: Optimizing Time, Cost, and Generalization in Distributed Large-Batch Training
Sahil Tyagi, Feiyi Wang

TL;DR
Tula is an online service that optimizes large-batch distributed training by balancing time, cost, and model quality, using performance prediction to find the optimal batch-size for various models and resources.
Contribution
The paper introduces a method that combines system modeling with statistical performance prediction to optimize large-batch training automatically, addressing both training efficiency and the generalization gap.
Findings
Predicts training time and cost with 7.5-14% error
Achieves up to 20x speedup in training
Improves test accuracy by 9% on average
Abstract
Distributed training increases the number of batches processed per iteration either by scaling-out (adding more nodes) or scaling-up (increasing the batch-size). However, the largest configuration does not necessarily yield the best performance. Horizontal scaling introduces additional communication overhead, while vertical scaling is constrained by computation cost and device memory limits. Thus, simply increasing the batch-size leads to diminishing returns: training time and cost decrease initially but eventually plateau, creating a knee-point in the time/cost versus batch-size Pareto curve. The optimal batch-size therefore depends on the underlying model, data and available compute resources. Large batches also suffer from worse model quality due to the well-known generalization gap. In this paper, we present Tula, an online service that automatically optimizes time, cost, and…
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Advanced Memory and Neural Computing
