KerZOO: Kernel Function Informed Zeroth-Order Optimization for Accurate and Accelerated LLM Fine-Tuning
Zhendong Mi, Qitao Tan, Xiaodong Yu, Zining Zhu, Geng Yuan, Shaoyi Huang

TL;DR
KerZOO is a kernel-function-based zeroth-order optimization framework that mitigates gradient-estimation bias and accelerates fine-tuning of large language models, cutting GPU training hours by up to 74% and improving accuracy over existing ZO methods in resource-constrained settings.
Contribution
The paper proposes a novel kernel-function approach to mitigate bias in zeroth-order optimization for LLM fine-tuning, enhancing efficiency and convergence speed.
Findings
KerZOO reduces GPU training hours by up to 74%.
KerZOO outperforms existing ZO methods in accuracy.
KerZOO accelerates convergence in LLM fine-tuning.
Abstract
Large language models (LLMs) have demonstrated impressive capabilities across numerous NLP tasks. Nevertheless, conventional first-order fine-tuning techniques impose heavy memory demands, creating practical obstacles to real-world applications. Zeroth-order (ZO) optimization has recently emerged as a promising memory-efficient alternative, as it circumvents the need for backpropagation by estimating gradients solely through forward passes, making it particularly suitable for resource-limited environments. Despite its efficiency, ZO optimization suffers from gradient estimation bias, which significantly hinders convergence speed. To address this, we analytically identify and characterize the lower-order bias introduced during ZO-based gradient estimation in LLM fine-tuning. Motivated by tools in mathematical physics, we introduce a kernel-function-based ZO framework aimed at mitigating…
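As a point of reference for the forward-pass-only gradient estimation the abstract describes, the following is a minimal sketch of a generic two-point ZO estimator. It is not the paper's algorithm: the function name `two_point_zo_grad`, the Gaussian direction, the step size `eps`, and the toy quadratic loss are all assumptions chosen for illustration.

```python
import numpy as np

def two_point_zo_grad(loss_fn, theta, eps=1e-3, rng=None):
    """Estimate the gradient of loss_fn at theta from two forward evaluations.

    Hypothetical helper for illustration; theta is a 1-D float array.
    Returns the estimate and the sampled direction u.
    """
    rng = np.random.default_rng() if rng is None else rng
    u = rng.standard_normal(theta.shape)  # random perturbation direction
    # Central finite difference along u: only loss evaluations, no backpropagation.
    slope = (loss_fn(theta + eps * u) - loss_fn(theta - eps * u)) / (2.0 * eps)
    return slope * u, u

# Toy quadratic loss standing in for an LLM loss (assumption for illustration).
a = np.array([1.0, -2.0, 0.5])
quadratic = lambda th: 0.5 * np.sum(th**2) + a @ th

theta0 = np.zeros(3)
g_hat, u = two_point_zo_grad(quadratic, theta0, rng=np.random.default_rng(0))
```

For a quadratic loss the central difference along `u` is exact, so `g_hat` equals the true directional derivative times `u`. For general losses the finite difference picks up higher-order Taylor terms, which is the kind of estimation bias the abstract refers to.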
Peer Reviews
Decision·ICLR 2026 Conference Desk Rejected Submission
1. The research problem in this paper is very important and interesting. ZO methods usually need more iterations to converge than first-order methods, so accelerating the training process will be important and interesting.
2. The proposed method in this paper is very easy to follow and the kernel-based method is very interesting.
3. The paper provides thorough experiments to verify the performance gain of the proposed method KerZOO.
1. I think the main concern is with the experiments. I noticed the paper provides results on LLaMA3, but most experiments focus on RoBERTa and OPT. I hope the authors can provide more results on state-of-the-art pre-trained models, because I also think a strong pre-trained model can narrow the performance gap between zeroth-order and first-order methods.
2. I think the paper focused on accelerating the ZO training process, and some related papers also focused on this
- Clear and targeted theory: the derivation isolates the lower-order bias in two-point ZO and gives explicit kernel moment conditions ($E[rK(r)]=C$, $E[r^{3}K(r)]=0$) that remove it in expectation, which is simple to check/implement.
- Practical algorithmic wrapper: KerZOO fits the standard two-point estimator with minimal changes (draw $u$, draw $r$, apply $K(r)$) and uses a small number of directions (default $n=3$), making adoption straightforward.
- Broad empirical wins with efficiency: cons
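The moment conditions quoted in this review can be checked numerically. The sketch below makes several assumptions that are not confirmed by the source: $r$ is uniform on $[-1,1]$, $C=1$, the kernel is the odd cubic $K(r) = C(75r - 105r^{3})/4$ (one polynomial that happens to satisfy both conditions), and the estimator follows the form common in the kernel-smoothing ZO literature. The paper's own kernel and Algorithm 1 may differ, and the names `kernel` and `kernel_zo_grad` are hypothetical.

```python
import numpy as np

C = 1.0  # target value of E[r K(r)]; chosen here for illustration

def kernel(r, c=C):
    """Odd cubic kernel (an assumed example) meeting both conditions for r ~ U[-1, 1]."""
    return c * (75.0 * r - 105.0 * r**3) / 4.0

# Exact expectations under r ~ U[-1, 1] via Gauss-Legendre quadrature.
# Eight nodes integrate polynomials up to degree 15 exactly; these integrands have degree <= 6.
nodes, weights = np.polynomial.legendre.leggauss(8)
first_moment = 0.5 * np.sum(weights * nodes * kernel(nodes))     # E[r K(r)]
third_moment = 0.5 * np.sum(weights * nodes**3 * kernel(nodes))  # E[r^3 K(r)]

def kernel_zo_grad(loss_fn, theta, h=1e-2, n_dirs=3, rng=None):
    """Kernel-weighted two-point estimate averaged over n_dirs directions.

    theta is a 1-D float array; this is a sketch, not the paper's Algorithm 1.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = theta.size
    grad = np.zeros_like(theta)
    for _ in range(n_dirs):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)      # unit direction, uniform on the sphere
        r = rng.uniform(-1.0, 1.0)  # scalar perturbation radius
        diff = loss_fn(theta + h * r * u) - loss_fn(theta - h * r * u)
        grad += (d / (2.0 * h)) * diff * kernel(r) * u
    return grad / n_dirs
```

Under these assumptions `first_moment` evaluates to `C` and `third_moment` to zero up to rounding. For a linear loss the estimator is unbiased because $E[rK(r)]=C=1$ and $E[d\,uu^{\top}]=I$ for unit directions.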
- Theoretical clarity/notation: the Taylor expansions mix notations (e.g., $∇^{2}f(x)$ vs. $∇^{2}L(θ)$) and use $D^{3}∇L$ (which suggests a fourth-order derivative) without a clean justification; please tighten the derivation around Eqs. (4)–(8) and state smoothness/independence assumptions precisely.
- Assumption–implementation gap: the theory relies on $r \in [-1,1]$ with moment constraints, but the method later “shrinks” the range of $r$ over training to reduce variance, breaking $E[rK(r)]=C$ a
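A small worked example illustrates how shrinking the range of $r$ can disturb the moment conditions. It rests on assumptions not taken from the paper: the kernel is the cubic $K(r) = C(75r - 105r^{3})/4$, which satisfies both conditions when $r$ is uniform on $[-1,1]$, and "shrinking" is modeled as sampling $r$ uniformly from $[-s,s]$ with $0 < s \le 1$ while keeping $K$ fixed. The paper's actual kernel and schedule may differ. Using $E[r^{2k}] = s^{2k}/(2k+1)$:

$$
E[rK(r)] = \frac{C}{4}\left(25s^{2} - 21s^{4}\right), \qquad
E[r^{3}K(r)] = \frac{15C}{4}\,s^{4}\left(1 - s^{2}\right).
$$

At $s=1$ these reduce to $C$ and $0$. At $s=0.5$ they become approximately $1.234\,C$ and $0.176\,C$, so under these assumptions neither condition holds once the range shrinks, unless the kernel is rescaled along with it.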
* Precisely targets ZO’s low-order bias and provides principled kernel conditions with expectation-level analysis.
* Kernel weighting and scalar perturbations integrate cleanly; works with very few directions $(n \approx 3)$.
* Substantial cuts in training steps/GPU hours while maintaining or improving accuracy.
* Results span multiple model families/sizes and tasks, under both full-finetuning and PEFT (LoRA).
* Sensitivity to $\beta$ and $C$, plus memory/time comparisons, aid reproducibility a
1. **Near-duplicate figures/tables.** The paper’s plotting/table template and ordering are *highly similar* to submission **#12350**, with only color changes: **#12282 Fig.1 / Fig.2 / Fig.3 ≈ #12350 Fig.3 / Fig.2 / Fig.4** (same axes styles, legend shapes, and layout).
2. **If a shared template is acceptable, how do you explain drifting baselines?** Under ostensibly comparable settings, baselines differ across the two papers in ways that **systematically favor each paper’s own method**.
   * E
The paper makes an interesting observation that the lower-order bias in ZO gradient estimation can be removed by incorporating an additional scalar random variable $r$ and applying a kernel function weighting. This approach is promising and has the potential to address the slow convergence commonly observed in existing ZO optimization methods.
The paper is not clearly written. There is limited discussion or intuition provided to explain the proposed kernel function and rationale behind Algorithm 1. Some experimental results are also difficult to interpret (e.g., Tables 4, 5, and 6). Overall, the writing needs to be significantly improved to meet publication standards. Additionally, the effect of kernel weighting on variance should be explored in more depth.
Taxonomy
Topics: Nuclear reactor physics and engineering · Numerical methods for differential equations · Magnetic confinement fusion research
