Universally Empowering Zeroth-Order Optimization via Adaptive Layer-wise Sampling
Fei Wang, Li Shen, Liang Ding, Chao Xue, Ye Liu, Changxing Ding

TL;DR
This paper introduces AdaLeZO, an adaptive layer-wise zeroth-order optimization method that significantly accelerates large language model fine-tuning by dynamically allocating perturbations based on layer sensitivity.
Contribution
AdaLeZO formulates layer selection as a Multi-Armed Bandit problem and employs inverse probability weighting to reduce variance, improving efficiency without extra memory cost.
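The combination of bandit-style layer selection and inverse probability weighting can be illustrated with a minimal sketch. It is not the paper's algorithm: the softmax-with-floor sampling rule, the moving-average score update, and the names layer_probs, update_score, and ipw_estimate are all assumptions made for illustration. What it shows is the general principle that dividing each sampled layer's contribution by its sampling probability keeps the estimate unbiased even when layers are sampled non-uniformly.

```python
import math
import random

def layer_probs(scores, temperature=1.0, floor=0.1):
    """Hypothetical rule: softmax over scores, mixed with a uniform floor
    so that every layer keeps a non-zero sampling probability."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    z = sum(exps)
    n = len(scores)
    return [(1.0 - floor) * e / z + floor / n for e in exps]

def update_score(scores, layer, reward, decay=0.9):
    """Hypothetical update: an exponential moving average, which lets the
    score follow rewards that drift over training (non-stationarity)."""
    scores[layer] = decay * scores[layer] + (1.0 - decay) * reward

def ipw_estimate(layer_values, probs, k, rng):
    """Sample k layers with replacement and reweight each draw by
    1 / (k * p_i). The expectation of est[i] equals layer_values[i]."""
    est = [0.0] * len(layer_values)
    for i in rng.choices(range(len(probs)), weights=probs, k=k):
        est[i] += layer_values[i] / (k * probs[i])
    return est

# Usage: four layers, two sampled per step.
rng = random.Random(0)
scores = [2.0, 1.0, 0.0, 0.0]
probs = layer_probs(scores)
estimate = ipw_estimate([1.0, -2.0, 0.5, 3.0], probs, k=2, rng=rng)
```

The uniform floor matters for the reweighting step: a layer with probability close to zero would produce very large weights when it is eventually sampled, which inflates variance.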
Findings
Achieves 1.7x to 3.0x wall-clock speedup on large models.
Effectively reduces variance in gradient estimation.
Seamlessly integrates with existing ZO optimizers.
Abstract
Zeroth-Order (ZO) optimization presents a promising memory-efficient paradigm for fine-tuning Large Language Models by relying solely on forward passes. However, its practical adoption is severely constrained by slow wall-clock convergence and high estimation variance. In this work, we dissect the runtime characteristics of ZO algorithms and identify a critical system bottleneck where the generation of perturbations and parameter updates accounts for over 40% of the training latency. We argue that the standard uniform exploration strategy is fundamentally flawed as it fails to account for the heterogeneous sensitivity of layers in deep networks, resulting in computationally wasteful blind searches. To address this structural mismatch, we propose AdaLeZO, an Adaptive Layer-wise ZO optimization framework. By formulating the layer selection process as a non-stationary Multi-Armed Bandit…
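For readers unfamiliar with forward-pass-only training, the following is a minimal sketch of a generic two-point zeroth-order gradient estimate on a toy quadratic loss. It is the standard random-perturbation estimator, not AdaLeZO itself, and the function names, step size, and loss are illustrative assumptions. Each step perturbs every parameter and then updates every parameter, which is the uniform strategy the abstract criticizes.

```python
import random

def zo_gradient(loss, theta, eps, rng):
    """Two-point estimate: evaluate the loss at theta + eps*z and
    theta - eps*z for a random Gaussian direction z. No backpropagation."""
    z = [rng.gauss(0.0, 1.0) for _ in theta]
    plus = loss([t + eps * zi for t, zi in zip(theta, z)])
    minus = loss([t - eps * zi for t, zi in zip(theta, z)])
    scale = (plus - minus) / (2.0 * eps)
    return [scale * zi for zi in z]

def quadratic(theta):
    """Toy loss with its minimum at the origin."""
    return sum(t * t for t in theta)

rng = random.Random(0)
theta = [1.0, -2.0, 0.5]
for _ in range(500):
    g = zo_gradient(quadratic, theta, eps=1e-3, rng=rng)
    theta = [t - 0.05 * gi for t, gi in zip(theta, g)]

print(quadratic(theta))  # loss after 500 forward-only steps
```

In a real model, drawing z and applying the update touch every parameter on every step, so the cost of these two operations grows with model size independently of the forward passes.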
