How Do Adam and Training Strategies Help BNNs Optimization?
Zechun Liu, Zhiqiang Shen, Shichao Li, Koen Helwegen, Dong Huang, Kwang-Ting Cheng

TL;DR
This paper investigates why the Adam optimizer outperforms SGD when training Binary Neural Networks (BNNs). It identifies the roles of second-order momentum, adaptive learning rates, and weight decay, and uses these insights to derive an improved training scheme with higher accuracy on ImageNet.
Contribution
It provides analytical insights into Adam's effectiveness for BNNs and proposes a simple training scheme that enhances accuracy over existing methods.
Findings
Adam's second-order momentum regularizes dead weights.
Adaptive learning rates help navigate BNN loss surfaces.
Weight decay impacts BNN stability and sluggishness.
Abstract
The best performing Binary Neural Networks (BNNs) are usually attained using Adam optimization and its multi-step training variants. However, to the best of our knowledge, few studies explore the fundamental reasons why Adam is superior to other optimizers like SGD for BNN optimization or provide analytical explanations that support specific training strategies. To address this, in this paper we first investigate the trajectories of gradients and weights in BNNs during the training process. We show the regularization effect of second-order momentum in Adam is crucial to revitalize the weights that are dead due to the activation saturation in BNNs. We find that Adam, through its adaptive learning rate strategy, is better equipped to handle the rugged loss surface of BNNs and reaches a better optimum with higher generalization ability. Furthermore, we inspect the intriguing role of the…
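The contrast between SGD and Adam on "dead" weights can be illustrated with a small numerical sketch. It uses the standard Adam update rule with default hyperparameters and plain SGD without momentum; it is not the paper's experiment. The two gradient sequences (`active` and `dead`) and the helper names are hypothetical, chosen only to mimic a weight that receives healthy gradients versus one whose gradients are tiny because of activation saturation.

```python
import numpy as np

def adam_step_sizes(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Per-step Adam updates for one scalar weight (standard bias-corrected form)."""
    m, v = 0.0, 0.0
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        updates.append(lr * m_hat / (np.sqrt(v_hat) + eps))
    return np.array(updates)

def sgd_step_sizes(grads, lr=1e-3):
    """Per-step updates for plain SGD without momentum."""
    return lr * np.asarray(grads)

# Hypothetical gradient histories (illustrative values, not from the paper):
active = np.full(100, 1e-1)   # weight with healthy gradients
dead = np.full(100, 1e-5)     # weight starved by saturated activations

# How much larger is the active weight's update than the dead weight's?
sgd_ratio = sgd_step_sizes(active)[-1] / sgd_step_sizes(dead)[-1]
adam_ratio = adam_step_sizes(active)[-1] / adam_step_sizes(dead)[-1]

print(f"SGD  update ratio (active/dead): {sgd_ratio:.1f}")
print(f"Adam update ratio (active/dead): {adam_ratio:.4f}")
```

Under these assumptions, SGD's update for the dead weight is smaller than the active weight's by the same factor as the gradients, while Adam's division by the square root of the second-moment estimate brings the two updates to nearly the same size. This is one way to read the abstract's claim that second-order momentum helps revitalize dead weights.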
Taxonomy
Topics: Advanced Neural Network Applications · Neural Networks and Applications · Machine Learning and ELM
Methods: Stochastic Gradient Descent · Adam · Weight Decay
