Improving the Straight-Through Estimator with Zeroth-Order Information

Ningfeng Yang; Tor M. Aamodt

arXiv:2510.23926 · cs.LG · October 29, 2025

TL;DR

This paper introduces FOGZO, a new gradient estimation method that combines the benefits of STE and zeroth-order approaches, improving training efficiency and accuracy in quantized neural networks.

Contribution

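The paper proposes FOGZO (First-Order-Guided Zeroth-Order Gradient Descent), which uses first-order STE information to guide zeroth-order gradient estimates. This reduces the bias of the STE while requiring less computation than purely zeroth-order methods.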

Findings

01 FOGZO improves accuracy by 1-8% over STE on DeiT Tiny/Small at the same number of iterations.

02 FOGZO achieves a 1-22 point perplexity improvement on LLaMA.

03 FOGZO reduces computation by up to 796x compared to n-SPSA.

Abstract

We study the problem of training neural networks with quantized parameters. Learning low-precision quantized parameters by enabling computation of gradients via the Straight-Through Estimator (STE) can be challenging. While the STE enables back-propagation, which is a first-order method, recent works have explored the use of zeroth-order (ZO) gradient descent for fine-tuning. We note that the STE provides high-quality biased gradients, and ZO gradients are unbiased but can be expensive. We thus propose First-Order-Guided Zeroth-Order Gradient Descent (FOGZO) that reduces STE bias while reducing computations relative to ZO methods. Empirically, we show FOGZO improves the tradeoff between quality and training time in Quantization-Aware Pre-Training. Specifically, versus STE at the same number of iterations, we show a 1-8% accuracy improvement for DeiT Tiny/Small, 1-2% accuracy…
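To make the two baselines named in the abstract concrete, here is a minimal sketch of an STE gradient and an n-sample SPSA (zeroth-order) gradient on a toy quantized linear model. It is not the FOGZO algorithm, whose details the abstract does not give. The function names (`quantize`, `ste_grad`, `n_spsa_grad`), the uniform rounding quantizer, the squared-error loss, and the values of `step`, `eps`, and `n` are all assumptions chosen for illustration.

```python
import random

def quantize(w, step=0.25):
    """Round each weight to the nearest multiple of `step` (assumed uniform quantizer)."""
    return [step * round(v / step) for v in w]

def loss(w, xs, ys, step=0.25):
    """Mean squared error of a linear model evaluated with quantized weights."""
    q = quantize(w, step)
    total = 0.0
    for x, y in zip(xs, ys):
        pred = sum(qi * xi for qi, xi in zip(q, x))
        total += (pred - y) ** 2
    return total / len(xs)

def ste_grad(w, xs, ys, step=0.25):
    """First-order estimate: differentiate the loss with respect to the
    quantized weights and pass that gradient to w unchanged, i.e. treat
    the derivative of quantize() as 1."""
    q = quantize(w, step)
    g = [0.0] * len(w)
    for x, y in zip(xs, ys):
        err = sum(qi * xi for qi, xi in zip(q, x)) - y
        for i, xi in enumerate(x):
            g[i] += 2.0 * err * xi / len(xs)
    return g

def n_spsa_grad(w, xs, ys, n=64, eps=0.5, step=0.25, rng=None):
    """Zeroth-order estimate: average n two-point SPSA estimates, each using
    a random +/-1 perturbation and two loss evaluations (no back-propagation)."""
    rng = rng or random.Random(0)
    g = [0.0] * len(w)
    for _ in range(n):
        z = [rng.choice((-1.0, 1.0)) for _ in w]
        w_plus = [wi + eps * zi for wi, zi in zip(w, z)]
        w_minus = [wi - eps * zi for wi, zi in zip(w, z)]
        scale = (loss(w_plus, xs, ys, step) - loss(w_minus, xs, ys, step)) / (2.0 * eps)
        for i, zi in enumerate(z):
            g[i] += scale * zi / n
    return g

# Toy data: two samples, two weights.
xs = [[1.0, 0.0], [0.0, 1.0]]
ys = [1.0, -1.0]
w = [0.1, 0.1]

g_ste = ste_grad(w, xs, ys)
g_zo = n_spsa_grad(w, xs, ys)
print("STE gradient:   ", g_ste)
print("n-SPSA gradient:", g_zo)
```

The cost asymmetry is visible in the sketch: `ste_grad` needs one forward pass plus an analytic backward step, while `n_spsa_grad` needs `2 * n` loss evaluations. In this sketch `eps` is larger than the quantization step so that the perturbations actually cross quantization levels; with a much smaller `eps` the piecewise-constant loss would give a zero finite difference.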
