ADADELTA: An Adaptive Learning Rate Method

Matthew D. Zeiler

arXiv:1212.5701 · cs.LG · December 27, 2012 · 5.5k cites



TL;DR

ADADELTA introduces an adaptive, per-dimension learning rate method for gradient descent that automatically adjusts over time, reducing the need for manual tuning and demonstrating robustness across tasks and architectures.

Contribution

The paper proposes an adaptive learning rate algorithm that requires no manual tuning of the learning rate and adds minimal computational overhead beyond vanilla stochastic gradient descent, with the aim of making training more robust.

Findings

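01 Shows promising results compared to other methods on the MNIST digit classification task, trained on a single machine.

02 Shows promising results on a large-scale voice dataset trained in a distributed cluster environment.

03 Requires no manual tuning of the learning rate and appears robust to noisy gradients, architecture choices, data modalities, and hyperparameter selection.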

Abstract

We present a novel per-dimension learning rate method for gradient descent called ADADELTA. The method dynamically adapts over time using only first order information and has minimal computational overhead beyond vanilla stochastic gradient descent. The method requires no manual tuning of a learning rate and appears robust to noisy gradient information, different model architecture choices, various data modalities and selection of hyperparameters. We show promising results compared to other methods on the MNIST digit classification task using a single machine and on a large scale voice dataset in a distributed cluster environment.
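The abstract describes the method only in words, so the following is a minimal sketch of the per-dimension update as it is commonly stated for ADADELTA. It keeps two decaying averages per parameter, one of squared gradients and one of squared updates, and scales each gradient by the ratio of their root-mean-square values. The function names, the list-based parameter representation, and the default values of `rho` and `eps` are assumptions made for illustration and are not taken from this page.

```python
import math


def adadelta_step(x, grad, avg_sq_grad, avg_sq_delta, rho=0.95, eps=1e-6):
    """Apply one ADADELTA-style update to a list of parameters.

    x, grad, avg_sq_grad, and avg_sq_delta are equal-length lists of floats.
    rho (decay rate) and eps (stabilizing constant) are illustrative defaults.
    Returns the new parameters and the two updated accumulators.
    """
    new_x, new_sq_grad, new_sq_delta = [], [], []
    for xi, gi, g2, d2 in zip(x, grad, avg_sq_grad, avg_sq_delta):
        # Decaying average of squared gradients for this dimension.
        g2 = rho * g2 + (1.0 - rho) * gi * gi
        # Update = -(RMS of past updates / RMS of gradients) * gradient.
        delta = -math.sqrt(d2 + eps) / math.sqrt(g2 + eps) * gi
        # Decaying average of squared updates, used by the next step.
        d2 = rho * d2 + (1.0 - rho) * delta * delta
        new_x.append(xi + delta)
        new_sq_grad.append(g2)
        new_sq_delta.append(d2)
    return new_x, new_sq_grad, new_sq_delta


def run_adadelta(grad_fn, x0, steps, rho=0.95, eps=1e-6):
    """Run `steps` updates from x0; grad_fn maps parameters to gradients."""
    x = list(x0)
    avg_sq_grad = [0.0] * len(x)
    avg_sq_delta = [0.0] * len(x)
    for _ in range(steps):
        x, avg_sq_grad, avg_sq_delta = adadelta_step(
            x, grad_fn(x), avg_sq_grad, avg_sq_delta, rho, eps
        )
    return x
```

For example, `run_adadelta(lambda x: [2.0 * xi for xi in x], [1.0], 50)` takes 50 steps on f(x) = x² starting from x = 1. No learning rate appears anywhere in the call; the step size comes entirely from the two accumulators.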


Taxonomy

Topics: Neural Networks and Applications · Domain Adaptation and Few-Shot Learning · Machine Learning and Algorithms

Methods: AdaDelta