Learning Dynamics of Zeroth-Order Optimization: A Kernel Perspective
Zhe Li, Bicheng Ying, Zidong Liu, Haibo Yang

TL;DR
This paper explains why zeroth-order optimization methods can effectively fine-tune large language models despite theoretical dimension-dependent slowdowns, by analyzing the empirical Neural Tangent Kernel and its dimension-free properties.
Contribution
The paper introduces a kernel perspective to understand the learning dynamics of zeroth-order methods, highlighting their scalability to large models through a dimension-free analysis.
Findings
The empirical NTK governs the learning behavior of ZO SGD.
The error of the ZO approximation to the eNTK depends on the model's output size, not on the parameter dimension.
This explains the scalability of ZO methods to large models like LLMs.
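The dimension-free claim can be illustrated with a small sketch of the Johnson-Lindenstrauss effect. It assumes Gaussian random directions and uses random unit vectors as stand-ins for neural tangent vectors; the helper names (`unit_vector`, `projected_inner`, `projection_error`) are hypothetical and are not taken from the paper. The estimate averages (u·a)(u·b) over k shared directions u, which is the inner product of a and b after both are projected by the same k × d Gaussian matrix scaled by 1/√k.

```python
import math
import random


def unit_vector(d, rng):
    """Draw a random unit vector in R^d (a stand-in for a tangent vector)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def exact_inner(a, b):
    """Exact inner product in the full d-dimensional space."""
    return sum(x * y for x, y in zip(a, b))


def projected_inner(a, b, k, rng):
    """Estimate <a, b> by averaging (u.a)(u.b) over k Gaussian directions u."""
    total = 0.0
    for _ in range(k):
        ua = 0.0
        ub = 0.0
        for x, y in zip(a, b):
            u = rng.gauss(0.0, 1.0)  # one coordinate of the shared direction
            ua += u * x
            ub += u * y
        total += ua * ub
    return total / k


def projection_error(d, k, seed):
    """Absolute error of the projected inner product for two random unit vectors."""
    rng = random.Random(seed)
    a, b = unit_vector(d, rng), unit_vector(d, rng)
    return abs(projected_inner(a, b, k, rng) - exact_inner(a, b))


if __name__ == "__main__":
    # Same number of projections k, growing ambient dimension d.
    for d in (100, 1000):
        print(d, projection_error(d, k=300, seed=0))
```

For unit vectors, the variance of this estimator is at most 2/k whatever the value of d, so in this toy setting the error is controlled by the number of random directions rather than by the parameter count.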
Abstract
Classical optimization theory establishes that zeroth-order (ZO) algorithms suffer from a dimension-dependent slowdown, with convergence rates typically scaling with the model dimension compared to first-order methods. However, in contrast to these theoretical expectations, a growing body of recent work demonstrates the successful application of ZO methods to fine-tuning Large Language Models (LLMs) with billions of parameters. To explain this paradox, we derive the one-step learning dynamics of ZO SGD, where the empirical Neural Tangent Kernel (eNTK) naturally emerges as the key term governing the learning behavior. Inspection of the eNTK produced by ZO SGD reveals that each element corresponds to the inner product of neural tangent vectors projected onto a random low-dimensional subspace. Thus, by invoking the Johnson-Lindenstrauss Lemma, our analysis shows that the fidelity of the ZO…
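For readers unfamiliar with ZO SGD, the following is a minimal sketch of one common form of the method: a two-point finite-difference gradient estimate averaged over q Gaussian directions, followed by a plain SGD step. The quadratic loss, the function names, and all hyperparameters (`lr`, `q`, `mu`) are illustrative assumptions; the truncated abstract does not specify the exact estimator used in the paper.

```python
import random


def loss(theta, target):
    """Toy quadratic loss 0.5 * ||theta - target||^2 (hypothetical stand-in)."""
    return 0.5 * sum((t - s) ** 2 for t, s in zip(theta, target))


def zo_gradient(f, theta, q, mu, rng):
    """Two-point ZO gradient estimate averaged over q Gaussian directions."""
    d = len(theta)
    grad = [0.0] * d
    for _ in range(q):
        u = [rng.gauss(0.0, 1.0) for _ in range(d)]
        plus = [t + mu * ui for t, ui in zip(theta, u)]
        minus = [t - mu * ui for t, ui in zip(theta, u)]
        # Finite-difference estimate of the directional derivative along u.
        slope = (f(plus) - f(minus)) / (2.0 * mu)
        for i in range(d):
            grad[i] += slope * u[i] / q
    return grad


def zo_sgd(f, theta, steps, lr, q, mu, seed):
    """Run ZO SGD using only function evaluations of f (no backpropagation)."""
    rng = random.Random(seed)
    for _ in range(steps):
        g = zo_gradient(f, theta, q, mu, rng)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta


if __name__ == "__main__":
    target = [1.0] * 20
    f = lambda th: loss(th, target)
    start = [0.0] * 20
    end = zo_sgd(f, start, steps=200, lr=0.05, q=4, mu=1e-3, seed=0)
    print(f(start), f(end))
```

In the small-`mu` limit, each estimate equals the true gradient multiplied by the random matrix (1/q) Σ u uᵀ, which is how inner products of randomly projected tangent vectors enter the one-step dynamics described in the abstract.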
