Blockwise Adaptivity: Faster Training and Better Generalization in Deep Learning

Shuai Zheng; James T. Kwok

arXiv:1905.09899 · cs.LG · May 27, 2019 · 1 citation

Open Access

TL;DR

This paper introduces blockwise adaptive gradient descent, which balances adaptivity and generalization, leading to faster training and better generalization in deep learning compared to coordinate-wise methods.

Contribution

It proposes a blockwise adaptive stepsize method, provides a theoretical analysis of its convergence and uniform stability, and demonstrates improved empirical performance over Adam and Nesterov's accelerated gradient.

Findings

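01. Faster convergence than Adam and Nesterov's accelerated gradient.

02. Lower generalization error, attributed to blockwise adaptivity being less aggressive than coordinate-wise adaptivity.

03. A convergence rate comparable to that of coordinate-wise adaptive methods (faster up to a constant), with improved uniform stability.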

Abstract

Stochastic methods with coordinate-wise adaptive stepsize (such as RMSprop and Adam) have been widely used in training deep neural networks. Despite their fast convergence, they can generalize worse than stochastic gradient descent. In this paper, by revisiting the design of Adagrad, we propose to split the network parameters into blocks, and use a blockwise adaptive stepsize. Intuitively, blockwise adaptivity is less aggressive than adaptivity to individual coordinates, and can have a better balance between adaptivity and generalization. We show theoretically that the proposed blockwise adaptive gradient descent has a convergence rate comparable to that of its counterpart with coordinate-wise adaptive stepsize, but is faster up to some constant. We also study its uniform stability and show that blockwise adaptivity can lead to lower generalization error than coordinate-wise adaptivity.…
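
The difference between coordinate-wise and blockwise stepsizes described in the abstract can be illustrated with a minimal Adagrad-style sketch. This is not the authors' algorithm: the function name `adagrad_step`, the choice to average squared gradients within a block, the example block grouping, and the values of `lr` and `eps` are all assumptions made for illustration.

```python
import math


def adagrad_step(params, grads, accum, blocks, lr=0.1, eps=1e-8):
    """One Adagrad-style update with a single shared stepsize per block.

    params, grads: flat lists of floats of equal length.
    accum: one running accumulator per block.
    blocks: list of index lists that partition range(len(params)).
    """
    new_params = list(params)
    new_accum = list(accum)
    for b, idx in enumerate(blocks):
        # Assumption: accumulate the mean squared gradient over the block.
        sq = sum(grads[i] ** 2 for i in idx) / len(idx)
        new_accum[b] = accum[b] + sq
        # Every coordinate in the block shares this stepsize.
        step = lr / (math.sqrt(new_accum[b]) + eps)
        for i in idx:
            new_params[i] = params[i] - step * grads[i]
    return new_params, new_accum


params = [1.0, 1.0, 1.0, 1.0]
grads = [4.0, 0.0, 3.0, 3.0]

# Coordinate-wise adaptivity: every coordinate is its own block.
coord_blocks = [[0], [1], [2], [3]]
coord_params, coord_accum = adagrad_step(params, grads, [0.0] * 4, coord_blocks)

# Blockwise adaptivity: coordinates grouped into blocks
# (hypothetical grouping, e.g. one block per layer).
layer_blocks = [[0, 1], [2, 3]]
block_params, block_accum = adagrad_step(params, grads, [0.0] * 2, layer_blocks)

print("coordinate-wise:", coord_params)
print("blockwise:      ", block_params)
```

In this sketch, the first coordinate-wise step moves every coordinate with a nonzero gradient by roughly `lr`, regardless of the gradient's magnitude, whereas the blockwise step rescales each block as a whole and so preserves the relative gradient magnitudes inside it. That is one way to read "less aggressive" adaptivity, under the assumptions stated above.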

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications

Methods: Adam · RMSProp