Custom Gradient Estimators are Straight-Through Estimators in Disguise
Matt Schoenbauer, Daniele Moro, Lukasz Lew, Andrew Howard

TL;DR
This paper shows that, when the learning rate is sufficiently small, a large class of weight gradient estimators used in quantization-aware training behaves almost exactly like the straight-through estimator (STE), giving a unifying view of these methods.
Contribution
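It proves that a large class of weight gradient estimators is equivalent to the STE when the learning rate is sufficiently small: for SGD after adjusting the weight initialization and learning rate, and for adaptive optimizers such as Adam without any such adjustment.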
Findings
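A large class of weight gradient estimators is equivalent to the STE when the learning rate is sufficiently small.
For adaptive optimizers like Adam, the equivalence holds without modifying the weight initialization or learning rate.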
Experimental results on MNIST and ImageNet support the theory.
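One intuition for the Adam finding can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's proof: it uses the textbook Adam update for a single scalar parameter, a made-up gradient sequence, and a constant rescaling factor of 3 standing in for the extra factor a custom gradient estimator might introduce. With a very small `eps`, rescaling every gradient by a positive constant leaves the Adam updates essentially unchanged, which is consistent with no learning-rate adjustment being needed.

```python
import math

def adam_updates(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-12):
    """Return the textbook Adam updates for one scalar parameter.

    `grads` is the sequence of gradients seen at steps 1, 2, ...
    """
    m = 0.0  # first-moment (mean) estimate
    v = 0.0  # second-moment (uncentered variance) estimate
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)  # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        updates.append(-lr * m_hat / (math.sqrt(v_hat) + eps))
    return updates

# Hypothetical gradient sequence, chosen only for illustration.
grads = [0.5, -0.2, 0.8, 0.1]
# The same gradients multiplied by a positive constant (assumed factor).
scaled = [3.0 * g for g in grads]

base = adam_updates(grads)
rescaled = adam_updates(scaled)
max_gap = max(abs(a - b) for a, b in zip(base, rescaled))
print(max_gap)
```

The sketch covers only a constant factor. A real custom estimator rescales the gradient by a factor that depends on the weight, so the invariance shown here is a simplification of the setting the paper analyzes.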
Abstract
Quantization-aware training comes with a fundamental challenge: the derivative of quantization functions such as rounding is zero almost everywhere and nonexistent elsewhere. Various differentiable approximations of quantization functions have been proposed to address this issue. In this paper, we prove that when the learning rate is sufficiently small, a large class of weight gradient estimators is equivalent to the straight-through estimator (STE). Specifically, after swapping in the STE and adjusting both the weight initialization and the learning rate in SGD, the model will train in almost exactly the same way as it did with the original gradient estimator. Moreover, we show that for adaptive learning rate algorithms like Adam, the same result can be seen without any modifications to the weight initialization and learning rate. We experimentally show that these results hold for…
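The setup described in the abstract can be made concrete with a toy example. The following is a minimal sketch, not the paper's code: the one-weight squared-error loss, the integer rounding grid, and the `custom_grad` scaling factor `1 + 0.5*cos(2*pi*w)` are all hypothetical choices made for illustration. It shows the STE passing the gradient through the rounding step unchanged, and a custom estimator multiplying that same gradient by a positive, weight-dependent factor.

```python
import math

def quantize(w):
    """Forward pass: round the latent weight to the nearest integer."""
    return float(math.floor(w + 0.5))

def loss_grad_wrt_q(q, x, y):
    """Derivative of the toy loss (q*x - y)**2 with respect to q."""
    return 2.0 * (q * x - y) * x

def ste_grad(w, x, y):
    """STE: treat d quantize(w) / dw as 1 in the backward pass."""
    return loss_grad_wrt_q(quantize(w), x, y)

def custom_grad(w, x, y):
    """Hypothetical custom estimator: scale the STE gradient by the
    slope of an assumed smooth surrogate, which stays in [0.5, 1.5]."""
    slope = 1.0 + 0.5 * math.cos(2.0 * math.pi * w)
    return loss_grad_wrt_q(quantize(w), x, y) * slope

def train(grad_fn, w=0.3, x=2.0, y=6.0, lr=0.01, steps=200):
    """Plain SGD on the latent weight; returns the final quantized weight."""
    for _ in range(steps):
        w -= lr * grad_fn(w, x, y)
    return quantize(w)

g_ste = ste_grad(0.3, 2.0, 6.0)
g_custom = custom_grad(0.3, 2.0, 6.0)
q_ste = train(ste_grad)
q_custom = train(custom_grad)
print(g_ste, g_custom, q_ste, q_custom)
```

In this toy problem both estimators push the latent weight in the same direction and both runs end at the quantized weight that fits the data. The sketch does not reproduce the weight-initialization and learning-rate adjustment the paper uses to match the two SGD trajectories step by step.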
Peer Reviews
Decision: Submitted to ICLR 2025
1. The claim that other gradient estimators work similarly to the STE in QAT is a strong one. 2. Experiments show that the weight difference is small, which supports the claim.
1. The mirror room story does not appear closely connected to the theoretical analysis. 2. Assumption 5.1.1 conflicts with Figure 1, where the gradient could be zero. 3. From Table 4, Adam leads to a larger weight difference. 4. For a more complicated task like ImageNet, the weight difference is much larger than for MNIST.
1. The paper is overall well presented. 2. The concept of mirror effect is interesting.
A primary concern is that the key claims and several major concepts lack mathematical rigor. Additionally, the main theoretical results provided are too limited to substantiate the claims: 1. Contribution 1 states that '... all nonzero weight gradient estimators lead to approximately equivalent weight movement for non-adaptive learning rate optimizers ...'. However, the term 'approximately equivalent weight movement' lacks a precise mathematical definition. It would be helpful to formalize this
- The theoretical insights are interesting, unexpected and (to my knowledge) novel. They offer better understanding and insight into how gradient estimators work, which appeals to me. - The paper is generally well written and easy to read. I appreciate how the authors lead their result with an intuitive explanation and illustrative graphic. This makes the following theory much easier to intuit. - The experiments shown in the paper provide good evidence for the theoretical results.
Major 1. The claims relating to practical impact feel overstated ("practitioners can now confidently choose the STE"). The problem setting that the authors explore (full precision activations, quantized weights, uniform fixed point quantization, small learning rate) is rather specific, and practitioners may be interested in quantized activations, or low-precision floating point or larger learning rates etc. I would prefer if the authors tempered their claims. 2. The experiments, although they d
Taxonomy
Topics: Neural Networks and Applications · Face and Expression Recognition · Bayesian Methods and Mixture Models
Methods: Stochastic Gradient Descent · Adam
