Learning Dynamics of Zeroth-Order Optimization: A Kernel Perspective

Zhe Li; Bicheng Ying; Zidong Liu; Haibo Yang

arXiv:2605.03373 · cs.LG · May 6, 2026

TL;DR

This paper explains why zeroth-order optimization methods can effectively fine-tune large language models despite theoretical dimension-dependent slowdowns, by analyzing the empirical Neural Tangent Kernel and its dimension-free properties.

Contribution

The paper introduces a kernel perspective to understand the learning dynamics of zeroth-order methods, highlighting their scalability to large models through a dimension-free analysis.

Findings

01. The empirical NTK governs the learning behavior of ZO SGD.

02. The approximation error depends on output size, not parameter dimension.

03. This explains the scalability of ZO methods to large models like LLMs.
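
The random-projection idea behind finding 02 can be checked numerically. The sketch below is not taken from the paper: it assumes Gaussian random directions, and the helper names (`projected_inner_product`, `unit_vector`, `demo`) are hypothetical. It projects two unit vectors of dimension `d` onto `k` random directions and compares the rescaled projected inner product with the exact one. Under the Johnson-Lindenstrauss argument, the error should be controlled by `k` and should not grow with `d`.

```python
import numpy as np


def projected_inner_product(u, v, k, rng):
    """Estimate u.v from k Gaussian random projections (illustrative only)."""
    d = u.shape[0]
    # Each row of P plays the role of one random direction (assumed Gaussian).
    P = rng.standard_normal((k, d))
    # Dividing by k makes the estimate unbiased, since E[P.T @ P] = k * I.
    return float((P @ u) @ (P @ v)) / k


def unit_vector(d, rng):
    """Draw a random vector of dimension d and normalize it to length 1."""
    x = rng.standard_normal(d)
    return x / np.linalg.norm(x)


def demo(k=256, dims=(500, 8000), seed=0):
    """Return the absolute estimation error for each dimension in dims."""
    rng = np.random.default_rng(seed)
    errors = {}
    for d in dims:
        u = unit_vector(d, rng)
        v = unit_vector(d, rng)
        exact = float(u @ v)
        approx = projected_inner_product(u, v, k, rng)
        errors[d] = abs(approx - exact)
    return errors


errors = demo()
for d, err in errors.items():
    print(f"d={d:5d}  k=256  |approx - exact| = {err:.4f}")
```

For unit vectors the standard deviation of this estimate is on the order of `1 / sqrt(k)`, so the two printed errors should be of similar size even though the dimensions differ by a factor of 16.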

Abstract

Classical optimization theory establishes that zeroth-order (ZO) algorithms suffer from a dimension-dependent slowdown, with convergence rates typically scaling with the model dimension compared to first-order methods. However, in contrast to these theoretical expectations, a growing body of recent work demonstrates the successful application of ZO methods to fine-tuning Large Language Models (LLMs) with billions of parameters. To explain this paradox, we derive the one-step learning dynamics of ZO SGD, where the empirical Neural Tangent Kernel (eNTK) naturally emerges as the key term governing the learning behavior. Inspection of the eNTK produced by ZO SGD reveals that each element corresponds to the inner product of neural tangent vectors projected onto a random low-dimensional subspace. Thus, by invoking the Johnson-Lindenstrauss Lemma, our analysis shows that the fidelity of the ZO…
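
To make the one-step dynamics concrete, here is a minimal sketch of a ZO SGD update. It assumes a standard two-point (central-difference) estimator with a single Gaussian direction per step, applied to a toy least-squares loss; the paper's exact estimator and setting may differ, and all names (`loss`, `zo_gradient`, `run_zo_sgd`) are hypothetical. For a quadratic loss the estimate equals `(z . grad) * z`, that is, the true gradient projected onto the random direction `z`, which is where the projected inner products in the kernel view come from.

```python
import numpy as np


def loss(theta, A, b):
    """Toy least-squares loss 0.5 * ||A @ theta - b||^2 (an assumption)."""
    r = A @ theta - b
    return 0.5 * float(r @ r)


def zo_gradient(theta, A, b, z, mu=1e-3):
    """Two-point ZO estimate: uses only two loss evaluations, no backprop."""
    slope = (loss(theta + mu * z, A, b) - loss(theta - mu * z, A, b)) / (2.0 * mu)
    # slope approximates the directional derivative z . grad, so the
    # estimate is the gradient projected onto the direction z.
    return slope * z


def run_zo_sgd(steps=2000, lr=0.01, d=20, n=30, seed=0):
    """Run ZO SGD from theta = 0 and return (initial loss, final loss)."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, d)) / np.sqrt(n)
    b = rng.standard_normal(n)
    theta = np.zeros(d)
    initial = loss(theta, A, b)
    for _ in range(steps):
        z = rng.standard_normal(d)  # fresh random direction each step
        theta = theta - lr * zo_gradient(theta, A, b, z)
    return initial, loss(theta, A, b)


initial_loss, final_loss = run_zo_sgd()
print(f"initial loss = {initial_loss:.4f}, final loss = {final_loss:.4f}")
```

The step size here is chosen conservatively for the toy problem, reflecting the classical dimension-dependent step-size restriction that the abstract refers to.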
