KerZOO: Kernel Function Informed Zeroth-Order Optimization for Accurate and Accelerated LLM Fine-Tuning

Zhendong Mi; Qitao Tan; Xiaodong Yu; Zining Zhu; Geng Yuan; Shaoyi Huang

arXiv:2505.18886·cs.LG·May 27, 2025

Open Access · 4 Reviews

TL;DR

KerZOO introduces a kernel-function-based zeroth-order optimization framework that reduces bias and accelerates fine-tuning of large language models, significantly saving training time and improving accuracy in resource-constrained settings.

Contribution

The paper proposes a novel kernel-function approach to mitigate bias in zeroth-order optimization for LLM fine-tuning, enhancing efficiency and convergence speed.

Findings

01. KerZOO reduces GPU training hours by up to 74%.

02. KerZOO outperforms existing ZO methods in accuracy.

03. KerZOO accelerates convergence in LLM fine-tuning.

Abstract

Large language models (LLMs) have demonstrated impressive capabilities across numerous NLP tasks. Nevertheless, conventional first-order fine-tuning techniques impose heavy memory demands, creating practical obstacles to real-world applications. Zeroth-order (ZO) optimization has recently emerged as a promising memory-efficient alternative, as it circumvents the need for backpropagation by estimating gradients solely through forward passes--making it particularly suitable for resource-limited environments. Despite its efficiency, ZO optimization suffers from gradient estimation bias, which significantly hinders convergence speed. To address this, we analytically identify and characterize the lower-order bias introduced during ZO-based gradient estimation in LLM fine-tuning. Motivated by tools in mathematical physics, we introduce a kernel-function-based ZO framework aimed at mitigating…
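To make the forward-pass-only idea concrete, here is a minimal sketch of a generic two-point ZO gradient estimator. It is not KerZOO itself: the function names, the Gaussian direction sampling, and the toy quadratic loss are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def two_point_zo_grad(loss, theta, eps=1e-3, n_dirs=1, rng=None):
    """Average of n_dirs two-point (central-difference) ZO gradient estimates."""
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(theta)
    for _ in range(n_dirs):
        u = rng.standard_normal(theta.shape)  # random perturbation direction
        # Two forward evaluations of the loss; no backpropagation is needed.
        delta = loss(theta + eps * u) - loss(theta - eps * u)
        grad += (delta / (2.0 * eps)) * u
    return grad / n_dirs

def quadratic_loss(theta):
    # Toy stand-in for a model loss (assumption); its true gradient is theta.
    return 0.5 * float(theta @ theta)

theta = np.array([1.0, -2.0, 0.5])
estimate = two_point_zo_grad(quadratic_loss, theta, n_dirs=20000,
                             rng=np.random.default_rng(0))
```

With many directions the average approaches the true gradient on this toy loss. In practice only a handful of directions are affordable per step, which is why estimator bias and variance matter for convergence speed.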

Peer Reviews

Decision·ICLR 2026 Conference Desk Rejected Submission

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The research problem in this paper is very important and interesting. ZO methods usually need more iterations to converge than first-order methods, so accelerating the training process will be important and interesting.
2. The proposed method in this paper is very easy to follow and the kernel-based method is very interesting.
3. The paper provides thorough experiments to verify the performance gain of the proposed method KerZOO.

Weaknesses

1. I think the main concern is with the experiments. I noticed the paper provides results on LLaMA3, but most experiments focus on RoBERTa and OPT. I hope the authors can provide more results on state-of-the-art pre-trained models, because I also think a strong pre-trained model can narrow the performance gap between zeroth-order and first-order methods.
2. I think the paper focused on accelerating the ZO training process, and some related papers also focused on this

Reviewer 02 · Rating 4 · Confidence 3

Strengths

- Clear and targeted theory: the derivation isolates the lower-order bias in two-point ZO and gives explicit kernel moment conditions ($E[rK(r)]=C$, $E[r^{3}K(r)]=0$) that remove it in expectation, which is simple to check/implement.
- Practical algorithmic wrapper: KerZOO fits the standard two-point estimator with minimal changes (draw $u$, draw $r$, apply $K(r)$) and uses a small number of directions (default $n=3$), making adoption straightforward.
- Broad empirical wins with efficiency: cons
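The recipe this reviewer describes (draw $u$, draw $r$, apply $K(r)$) can be sketched as follows. This is an illustration under stated assumptions, not the paper's Algorithm 1: the estimator form, the uniform sampling of $r$ on $[-1,1]$, and the polynomial kernel $K(r)=\tfrac{15}{4}r(5-7r^{2})$ are chosen only because that kernel satisfies the two quoted moment conditions with $C=1$. The paper's actual kernel may differ.

```python
import numpy as np

def kernel(r, C=1.0):
    """Illustrative odd polynomial kernel for r ~ Uniform[-1, 1] (assumption).

    Satisfies E[r K(r)] = C and E[r^3 K(r)] = 0.
    """
    return C * (15.0 / 4.0) * r * (5.0 - 7.0 * r**2)

def kernel_zo_grad(loss, theta, eps=1e-2, n_dirs=3, C=1.0, rng=None):
    """Kernel-weighted two-point ZO estimate; its mean is about C * gradient."""
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(theta)
    for _ in range(n_dirs):
        u = rng.standard_normal(theta.shape)  # draw direction u
        r = rng.uniform(-1.0, 1.0)            # draw scalar r
        delta = loss(theta + eps * r * u) - loss(theta - eps * r * u)
        grad += kernel(r, C) * (delta / (2.0 * eps)) * u  # apply K(r)
    return grad / n_dirs

# Check the two moment conditions with Gauss-Legendre quadrature, which is
# exact for these low-degree polynomials. The factor 0.5 is the Uniform[-1, 1]
# density.
nodes, weights = np.polynomial.legendre.leggauss(8)
m1 = 0.5 * np.sum(weights * nodes * kernel(nodes))      # E[r K(r)]
m3 = 0.5 * np.sum(weights * nodes**3 * kernel(nodes))   # E[r^3 K(r)]
```

Here `m1` evaluates to 1 and `m3` to 0 up to floating-point error, which is the "simple to check" property the reviewer points to.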

Weaknesses

- Theoretical clarity/notation: the Taylor expansions mix notations (e.g., $∇^{2}f(x)$ vs. $∇^{2}L(θ)$) and use $D^{3}∇L$ (which suggests a 4-th order derivative) without a clean justification; please tighten the derivation around Eqs. (4)–(8) and state smoothness/independence assumptions precisely.
- Assumption–implementation gap: the theory relies on $r \in [-1,1]$ with moment constraints, but the method later “shrinks” the range of $r$ over training to reduce variance, breaking $E[rK(r)]=C$ a
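The assumption–implementation gap can be illustrated with a small exact calculation. The setup is hypothetical: $r$ is taken to be uniform on $[-s, s]$, and the kernel $K(r) = (75r - 105r^{3})/4$ is an illustrative choice that satisfies both conditions with $C=1$ when $s=1$. The paper's actual kernel and shrink schedule are not shown in this excerpt and may differ.

```python
from fractions import Fraction

# Illustrative kernel K(r) = A*r + B*r**3 (assumption, with C = 1 at s = 1).
A, B = Fraction(75, 4), Fraction(-105, 4)

def kernel_moments(s):
    """Exact (E[r K(r)], E[r^3 K(r)]) for r ~ Uniform[-s, s]."""
    s = Fraction(s)

    def even_moment(k):
        # E[r^k] for even k under Uniform[-s, s] is s^k / (k + 1).
        return s**k / (k + 1)

    m1 = A * even_moment(2) + B * even_moment(4)
    m3 = A * even_moment(4) + B * even_moment(6)
    return m1, m3

full_range = kernel_moments(1)                # (1, 0): both conditions hold
half_range = kernel_moments(Fraction(1, 2))   # (79/64, 45/256): both violated
```

Under these assumptions, halving the range while keeping $K$ fixed moves $E[rK(r)]$ away from $C$ and makes $E[r^{3}K(r)]$ nonzero, which is the kind of mismatch the reviewer raises.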

Reviewer 03 · Rating 2 · Confidence 5

Strengths

* Precisely targets ZO’s low-order bias and provides principled kernel conditions with expectation-level analysis.
* Kernel weighting and scalar perturbations integrate cleanly; works with very few directions $(n \approx 3)$.
* Substantial cuts in training steps/GPU hours while maintaining or improving accuracy.
* Results span multiple model families/sizes and tasks, under both full-finetuning and PEFT (LoRA).
* Sensitivity to $\beta$ and $C$, plus memory/time comparisons, aid reproducibility a

Weaknesses

1. **Near-duplicate figures/tables.** The paper’s plotting/table template and ordering are *highly similar* to submission **#12350**, with only color changes: **#12282 Fig.1 / Fig.2 / Fig.3 ≈ #12350 Fig.3 / Fig.2 / Fig.4** (same axes styles, legend shapes, and layout).
2. **If a shared template is acceptable, how do you explain drifting baselines?** Under ostensibly comparable settings, baselines differ across the two papers in ways that **systematically favor each paper’s own method**.
   * E

Reviewer 04 · Rating 4 · Confidence 3

Strengths

The paper makes an interesting observation that the lower-order bias in ZO gradient estimation can be removed by incorporating an additional scalar random variable $r$ and applying a kernel function weighting. This approach is promising and has the potential to address the slow convergence commonly observed in existing ZO optimization methods.
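A short worked expansion shows why such a weighting can remove the lower-order bias. It is a reconstruction under assumptions, not the paper's own derivation: the estimator is assumed to have the form $\hat{g} = \frac{K(r)}{2\epsilon}\,[L(\theta+\epsilon r u) - L(\theta-\epsilon r u)]\,u$, with $L$ sufficiently smooth, $r$ independent of $u$, and $E[uu^{\top}] = I$.

$$
\begin{aligned}
L(\theta+\epsilon r u) - L(\theta-\epsilon r u)
  &= 2\epsilon r\, u^{\top}\nabla L(\theta)
   + \frac{(\epsilon r)^{3}}{3}\, D^{3}L(\theta)[u,u,u] + O(\epsilon^{5}), \\
\hat{g}
  &= rK(r)\,\big(u^{\top}\nabla L(\theta)\big)\,u
   + \frac{\epsilon^{2}}{6}\, r^{3}K(r)\, D^{3}L(\theta)[u,u,u]\,u + O(\epsilon^{4}), \\
E_{r}[\hat{g}]
  &= C\,\big(u^{\top}\nabla L(\theta)\big)\,u + O(\epsilon^{4})
  \quad \text{if } E[rK(r)] = C \text{ and } E[r^{3}K(r)] = 0 .
\end{aligned}
$$

Taking the expectation over $u$ then gives $C\,\nabla L(\theta)$ up to $O(\epsilon^{4})$. The $\epsilon^{2}$ term, which is the leading bias of the plain two-point estimator, vanishes in expectation under the second moment condition.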

Weaknesses

The paper is not clearly written. There is limited discussion or intuition provided to explain the proposed kernel function and rationale behind Algorithm 1. Some experimental results are also difficult to interpret (e.g., Tables 4, 5, and 6). Overall, the writing needs to be significantly improved to meet publication standards. Additionally, the effect of kernel weighting on variance should be explored in more depth.

Taxonomy

Topics: Nuclear reactor physics and engineering · Numerical methods for differential equations · Magnetic confinement fusion research