Understanding Straight-Through Estimator in Training Activation Quantized Neural Nets
Penghang Yin, Jiancheng Lyu, Shuai Zhang, Stanley Osher, Yingyong Qi, Jack Xin

TL;DR
This paper provides a theoretical understanding of the straight-through estimator (STE) in training quantized neural networks, showing how proper STE choices lead to effective descent directions and convergence, while poor choices cause instability.
Contribution
It offers a theoretical justification for STE by analyzing its correlation with the true gradient and demonstrating convergence properties in a simplified neural network model.
Findings
Proper STE choice ensures positive correlation with the true gradient.
Negation of the coarse gradient acts as a descent direction.
Poor STE choices cause training instability near local minima.
Abstract
Training activation quantized neural networks involves minimizing a piecewise constant function whose gradient vanishes almost everywhere, which is undesirable for the standard back-propagation or chain rule. An empirical way around this issue is to use a straight-through estimator (STE) (Bengio et al., 2013) in the backward pass only, so that the "gradient" through the modified chain rule becomes non-trivial. Since this unusual "gradient" is certainly not the gradient of the loss function, the following question arises: why does searching in its negative direction minimize the training loss? In this paper, we provide the theoretical justification of the concept of STE by answering this question. We consider the problem of learning a two-linear-layer network with binarized ReLU activation and Gaussian input data. We shall refer to the unusual "gradient" given by the STE-modified chain rule as…
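To make the backward-pass substitution concrete, here is a minimal pure-Python sketch of a two-linear-layer network with a binarized activation. It is an illustration under stated assumptions, not the authors' implementation: the function names, the single-sample squared loss, the toy numbers, and the choice of the ReLU derivative as the surrogate are all hypothetical choices made for this example.

```python
def binary_act(x):
    """Forward activation: a step function, piecewise constant in x."""
    return 1.0 if x > 0 else 0.0

def zero_derivative(x):
    """True derivative of the step function (zero almost everywhere)."""
    return 0.0

def relu_ste(x):
    """Assumed surrogate for the backward pass only: derivative of ReLU."""
    return 1.0 if x > 0 else 0.0

def grad_w(w, v, Z, target, act_derivative):
    """Chain-rule gradient of 0.5 * (v . act(Z w) - target)^2 w.r.t. w,
    with the activation derivative supplied by the caller."""
    pre = [sum(z * wi for z, wi in zip(row, w)) for row in Z]
    out = sum(vj * binary_act(p) for vj, p in zip(v, pre))
    err = out - target
    g = [0.0] * len(w)
    for j, row in enumerate(Z):
        coeff = err * v[j] * act_derivative(pre[j])
        for i, z in enumerate(row):
            g[i] += coeff * z
    return g

# Toy values (hypothetical): first-layer weights, second-layer weights, input rows.
w = [1.0, -1.0]
v = [2.0, 1.0]
Z = [[0.5, 0.0], [0.0, 1.0]]
target = 0.0

true_grad = grad_w(w, v, Z, target, zero_derivative)  # vanishes: no training signal
coarse_grad = grad_w(w, v, Z, target, relu_ste)       # non-trivial search direction
print(true_grad, coarse_grad)
```

The forward pass is identical in both calls; only the derivative used in the chain rule changes. With the true derivative the gradient is zero, while the surrogate yields a non-zero vector that a training loop could step against.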
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Machine Learning and ELM
