TL;DR
This paper introduces HDET, a scalable ensemble training method that explores learning rate configurations automatically during large model training, improving optimization and generalization without extra hyperparameter tuning.
Contribution
HDET repurposes data-parallel replicas for simultaneous learning rate exploration with a novel auto-LR controller, enabling self-adapting hyperparameter schedules during training.
Findings
HDET improves training efficiency and model performance.
The auto-LR controller adapts hyperparameters without additional tuning.
The framework generalizes to other scalar hyperparameters beyond the learning rate.
Abstract
Training large neural networks with data-parallel stochastic gradient descent allocates N GPU replicas to compute effectively identical updates -- a practice that leaves the rich space of learning rate configurations entirely unexplored during training. We propose Hyperparameter-Divergent Ensemble Training (HDET), a method that repurposes these replicas for simultaneous learning rate exploration at negligible communication overhead. HDET operates in alternating phases: a fan-out stage in which replicas train independently under a structured, symmetric spread of learning rates, and a converge stage in which parameters are averaged across all replicas via AllReduce every T steps. Building on this ensemble substrate, we further propose an automatic learning rate (auto-LR) controller that treats the relative training loss across replicas as a performance signal, updating the shared base…
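The fan-out and converge phases described in the abstract can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the log-space form of the symmetric learning rate spread, the quadratic toy objective, and the names `lr_spread`, `hdet_toy`, `width`, and `sync_every` are all hypothetical, and a plain mean over a NumPy array stands in for AllReduce. The auto-LR controller is left out because its update rule is cut off in the abstract above.

```python
import numpy as np


def lr_spread(base_lr, n_replicas, width=0.5):
    """Learning rates spread symmetrically around base_lr in log space.

    Assumption: the abstract only says "structured, symmetric spread";
    the log-space form and the `width` parameter are illustrative.
    """
    exponents = np.linspace(-width, width, n_replicas)
    return base_lr * np.exp(exponents)


def hdet_toy(n_replicas=4, base_lr=0.1, sync_every=5, rounds=20, seed=0):
    """Simulate fan-out / converge on a noisy quadratic (toy objective)."""
    rng = np.random.default_rng(seed)
    target = np.array([3.0, -2.0])   # minimizer of the toy loss ||p - target||^2
    params = np.zeros(2)             # shared parameters after each converge stage
    lrs = lr_spread(base_lr, n_replicas)

    for _ in range(rounds):
        # Fan-out: every replica starts from the shared parameters and
        # trains independently with its own learning rate for T steps.
        replicas = np.tile(params, (n_replicas, 1))
        for _ in range(sync_every):
            noise = rng.normal(scale=0.1, size=replicas.shape)
            grads = 2.0 * (replicas - target) + noise
            replicas -= lrs[:, None] * grads

        # Converge: average parameters across replicas
        # (stand-in for an AllReduce mean every T steps).
        params = replicas.mean(axis=0)

    return params, lrs


if __name__ == "__main__":
    final_params, lrs = hdet_toy()
    print("learning rates:", lrs)
    print("final parameters:", final_params)
```

In this sketch the only cross-replica communication is the parameter average once every `sync_every` steps, which mirrors the abstract's description of averaging every T steps.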
