ADADELTA: An Adaptive Learning Rate Method
Matthew D. Zeiler

TL;DR
ADADELTA is a per-dimension adaptive learning rate method for gradient descent. The rate adjusts automatically over time, which removes manual learning rate tuning, and the method appears robust across tasks and architectures.
Contribution
The paper proposes an adaptive, per-dimension learning rate algorithm that requires no manual learning rate tuning, uses only first-order information, and adds minimal computational overhead beyond vanilla stochastic gradient descent.
Findings
Shows promising results compared to other methods on MNIST digit classification on a single machine.
Shows promising results on a large-scale voice dataset in a distributed cluster environment.
Requires no manual tuning of a learning rate.
Abstract
We present a novel per-dimension learning rate method for gradient descent called ADADELTA. The method dynamically adapts over time using only first order information and has minimal computational overhead beyond vanilla stochastic gradient descent. The method requires no manual tuning of a learning rate and appears robust to noisy gradient information, different model architecture choices, various data modalities and selection of hyperparameters. We show promising results compared to other methods on the MNIST digit classification task using a single machine and on a large scale voice dataset in a distributed cluster environment.
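The per-dimension update described in the abstract can be sketched as follows. This is a minimal pure-Python sketch, not the author's code: it follows the ADADELTA rule as it is commonly stated (running averages of squared gradients and of squared updates, combined as a ratio of RMS values). The function names `adadelta_step` and `run_quadratic_demo`, the toy quadratic objective, and the values `rho=0.95` and `eps=1e-6` are assumptions chosen for illustration.

```python
import math


def adadelta_step(x, grad, acc_grad, acc_delta, rho=0.95, eps=1e-6):
    """Apply one ADADELTA update to each dimension.

    x, grad, acc_grad, acc_delta are equal-length lists of floats.
    Returns (new_x, new_acc_grad, new_acc_delta). rho and eps are
    illustrative defaults, not tuned values.
    """
    new_x, new_acc_grad, new_acc_delta = [], [], []
    for xi, gi, eg, ed in zip(x, grad, acc_grad, acc_delta):
        # Running average of squared gradients.
        eg = rho * eg + (1.0 - rho) * gi * gi
        # Per-dimension step: RMS of past updates over RMS of gradients.
        delta = -math.sqrt(ed + eps) / math.sqrt(eg + eps) * gi
        # Running average of squared updates.
        ed = rho * ed + (1.0 - rho) * delta * delta
        new_x.append(xi + delta)
        new_acc_grad.append(eg)
        new_acc_delta.append(ed)
    return new_x, new_acc_grad, new_acc_delta


def run_quadratic_demo(steps=1000):
    """Minimize the toy objective f(x) = 0.5 * sum(c_i * x_i**2)."""
    curvature = [1.0, 100.0]  # hypothetical, badly scaled problem
    x = [1.0, -2.0]
    acc_grad = [0.0, 0.0]
    acc_delta = [0.0, 0.0]

    def loss(p):
        return 0.5 * sum(c * v * v for c, v in zip(curvature, p))

    initial = loss(x)
    for _ in range(steps):
        grad = [c * v for c, v in zip(curvature, x)]
        x, acc_grad, acc_delta = adadelta_step(x, grad, acc_grad, acc_delta)
    return initial, loss(x)


if __name__ == "__main__":
    initial, final = run_quadratic_demo()
    print(f"loss before: {initial:.4f}, after: {final:.4f}")
```

No learning rate appears anywhere in the sketch; the step size in each dimension comes from the ratio of the two running averages. On the first step of the toy problem, both coordinates move by nearly the same amount even though their gradients differ in magnitude by a factor of 200.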
Taxonomy
Topics: Neural Networks and Applications · Domain Adaptation and Few-Shot Learning · Machine Learning and Algorithms
Methods: AdaDelta
