Tula: Optimizing Time, Cost, and Generalization in Distributed Large-Batch Training

Sahil Tyagi; Feiyi Wang

arXiv:2603.18112 · cs.LG · March 20, 2026

TL;DR

Tula is an online service that optimizes large-batch distributed training by balancing time, cost, and model quality, using performance prediction to find the optimal batch-size for various models and resources.

Contribution

Tula combines system modeling with statistical performance prediction to tune large-batch training automatically, addressing both the efficiency limits and the generalization gap that come with larger batches.

Findings

01  Predicts training time and cost with 7.5-14% error

02  Achieves up to 20x speedup in training

03  Improves test accuracy by 9% on average

Abstract

Distributed training increases the number of batches processed per iteration either by scaling out (adding more nodes) or scaling up (increasing the batch-size). However, the largest configuration does not necessarily yield the best performance. Horizontal scaling introduces additional communication overhead, while vertical scaling is constrained by computation cost and device memory limits. Thus, simply increasing the batch-size leads to diminishing returns: training time and cost decrease initially but eventually plateau, creating a knee-point in the Pareto curve of time/cost versus batch-size. The optimal batch-size therefore depends on the underlying model, the data, and the available compute resources. Large batches also suffer from worse model quality due to the well-known generalization gap. In this paper, we present Tula, an online service that automatically optimizes time, cost, and…
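The knee-point idea in the abstract can be illustrated with a small sketch. This is not Tula's predictor: the cost model `epoch_time`, its constants, and the chord-distance heuristic in `knee_index` are all assumptions chosen only to show how a plateauing time-versus-batch-size curve produces a knee.

```python
import math

def epoch_time(batch_size, n_samples=50_000, per_sample_s=0.002, per_step_overhead_s=0.05):
    """Hypothetical cost model (an assumption, not Tula's): compute time scales
    with the number of samples, while a fixed per-step overhead is amortized
    over fewer steps as the batch grows."""
    steps = n_samples / batch_size
    return n_samples * per_sample_s + steps * per_step_overhead_s

def knee_index(xs, ys):
    """Return the index of the point farthest from the straight line (chord)
    joining the first and last points, after scaling both axes to [0, 1].
    Assumes xs is strictly increasing and ys is not constant."""
    x_lo, x_hi = xs[0], xs[-1]
    y_lo, y_hi = min(ys), max(ys)
    nx = [(x - x_lo) / (x_hi - x_lo) for x in xs]
    ny = [(y - y_lo) / (y_hi - y_lo) for y in ys]
    dx, dy = nx[-1] - nx[0], ny[-1] - ny[0]
    chord_len = math.hypot(dx, dy)
    # Perpendicular distance of each normalized point from the chord.
    dists = [abs(dy * (px - nx[0]) - dx * (py - ny[0])) / chord_len
             for px, py in zip(nx, ny)]
    return max(range(len(xs)), key=dists.__getitem__)

batch_sizes = [2 ** k for k in range(4, 13)]   # 16, 32, ..., 4096
log_bs = [math.log2(b) for b in batch_sizes]   # search on a log scale
times = [epoch_time(b) for b in batch_sizes]
knee = batch_sizes[knee_index(log_bs, times)]
print(f"knee-point batch size under the toy model: {knee}")
```

With these made-up constants, time per epoch falls steeply for small batches and flattens near the compute floor, so batch sizes past the knee buy little additional speedup. A real system would also have to account for communication overhead, device memory limits, and the accuracy impact of large batches, which this toy model ignores.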

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Advanced Memory and Neural Computing