Towards Fast LLM Fine-tuning through Zeroth-Order Optimization with Projected Gradient-Aligned Perturbations

Zhendong Mi; Qitao Tan; Grace Li Zhang; Zhaozhuo Xu; Geng Yuan; Shaoyi Huang

arXiv:2510.18228 · cs.LG · October 22, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces P-GAP, a zeroth-order optimization method with gradient alignment that significantly accelerates large language model fine-tuning, reducing training time and resource usage while improving accuracy.

Contribution

P-GAP is a zeroth-order optimization technique that estimates a low-dimensional gradient space and aligns perturbations with the projected gradient direction within it, leading to faster and more efficient LLM fine-tuning.

Findings

1. Up to 6% accuracy improvement on classification tasks
2. Up to 12% higher accuracy on generation tasks
3. Reduced training iterations and GPU hours by over 70%

Abstract

Fine-tuning large language models (LLMs) using zeroth-order (ZO) optimization has emerged as a promising alternative to traditional gradient-based methods due to its reduced memory footprint. However, existing ZO methods suffer from high variance in gradient estimation, leading to slow convergence and suboptimal performance on large-scale models. In this work, we propose P-GAP, a fast LLM fine-tuning approach through zeroth-order optimization with Projected Gradient-Aligned Perturbations. Specifically, we first estimate a low-dimensional gradient space and then align perturbations with the direction of the projected gradients within that space. This approach reduces the number of perturbed parameters and lowers variance, thereby accelerating convergence for LLM fine-tuning. Experiments on LLMs show that P-GAP consistently surpasses the baselines, achieving up to 6% increase…
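As a rough illustration of the two ideas named in the abstract (a two-point ZO estimate, and perturbing only inside a low-dimensional subspace), the following NumPy sketch runs on a toy least-squares problem rather than an LLM. It is not the authors' implementation: the objective, the choice to estimate the subspace by an SVD of averaged probes, and all names (`zo_estimate`, `P_left`, `P_right`, `U_sub`) are assumptions made for illustration, and the paper's alignment step is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r, k = 16, 12, 2, 8  # weight shape (m x n), subspace rank, number of probes

# Toy objective standing in for a fine-tuning loss (assumption, not from the paper).
X = rng.standard_normal((32, m))
Y = rng.standard_normal((32, n))

def loss(W):
    R = X @ W - Y
    return 0.5 * float(np.sum(R * R))

def zo_estimate(W, U, eps=1e-3):
    """Two-point ZO estimate: finite-difference slope along U, times U."""
    slope = (loss(W + eps * U) - loss(W - eps * U)) / (2.0 * eps)
    return slope * U

W = rng.standard_normal((m, n))

# Step 1 (hypothetical): average a few full-space probes into a rough gradient
# estimate and keep its top-r singular directions as the low-dimensional space.
G_hat = np.mean(
    [zo_estimate(W, rng.standard_normal((m, n))) for _ in range(k)], axis=0
)
left, _, right_t = np.linalg.svd(G_hat, full_matrices=False)
P_left, P_right = left[:, :r], right_t[:r, :].T  # shapes (m, r) and (n, r)

# Step 2: draw the perturbation in the small r x r space and lift it back.
Z = rng.standard_normal((r, r))
U_sub = P_left @ Z @ P_right.T  # (m, n) perturbation confined to the subspace
G_sub = zo_estimate(W, U_sub)

print("perturbed coordinates:", Z.size, "instead of", W.size)
```

In this toy setup the perturbation has `r * r` free coordinates rather than `m * n`, which is one sense in which fewer parameters are perturbed; how P-GAP actually builds and refreshes its subspace is described in the paper, not here.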

Peer Reviews

Decision · ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 5

Strengths

* Clearly targets the known high-variance issue of ZO by combining dimensionality reduction with directional alignment of perturbations.
* The projection/alignment step is simple, uses standard linear algebra, and can be dropped into existing ZO pipelines (incl. PEFT/LoRA) with minimal changes.
* Provides analysis on variance scaling with dimensionality and argues why the proposed projection reduces estimator variance; includes a convergence discussion under stated assumptions.
* Uses a lazy-upd

Weaknesses

1. **Near-duplicate figures/tables vs. #12282.** The plotting/table template and ordering are *almost identical* to **#12282**, with colors changed: **#12350 Fig.3 / Fig.2 / Fig.4 ≈ #12282 Fig.1 / Fig.2 / Fig.3** (axes, legends, and layouts closely match).
2. **If shared templates are acceptable, how to explain drifting baselines?** Under ostensibly comparable settings, baselines diverge between the two papers in ways that **systematically favor each paper’s method**.
   * Example (**Table 2 ·

Reviewer 02 · Rating 2 · Confidence 3

Strengths

The paper presents an interesting extension of directionally aligned perturbations from vectors to matrices to reduce variance in ZO gradient estimation. The authors conduct comprehensive experiments to demonstrate the effectiveness of the proposed method.

Weaknesses

The paper argues that P-GAP reduces variance through low-dimensional perturbation spaces, but there are no explicit variance measurements or visualizations supporting this claim. The main algorithm (Algorithm 1) is only briefly referenced in the main text, and the appendix lacks a detailed explanation of their theoretical results.
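For context, one way a variance measurement of this kind could be set up is sketched below on a toy quadratic, not on an LLM. Everything here is an assumption for illustration: the objective, the random orthonormal basis `Q` (P-GAP estimates its subspace rather than drawing it at random), and the helper names `zo_estimate` and `total_variance`. The sketch compares the total variance of a two-point ZO estimate under full-space and subspace perturbations.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, rank, trials = 50, 4, 2000

# Toy least-squares objective (assumption, stands in for a model loss).
A = rng.standard_normal((80, dim))
b = rng.standard_normal(80)

def loss(w):
    res = A @ w - b
    return 0.5 * float(res @ res)

def zo_estimate(w, u, eps=1e-3):
    """Two-point ZO estimate: finite-difference slope along u, times u."""
    slope = (loss(w + eps * u) - loss(w - eps * u)) / (2.0 * eps)
    return slope * u

def total_variance(samples):
    """Sum of per-coordinate sample variances (trace of the covariance)."""
    return float(np.var(samples, axis=0).sum())

w = rng.standard_normal(dim)

# Random orthonormal basis of a rank-dimensional subspace, for illustration only.
Q, _ = np.linalg.qr(rng.standard_normal((dim, rank)))

# Repeated single-probe estimates with full-space and subspace perturbations.
full = np.array(
    [zo_estimate(w, rng.standard_normal(dim)) for _ in range(trials)]
)
sub = np.array(
    [zo_estimate(w, Q @ rng.standard_normal(rank)) for _ in range(trials)]
)

var_full = total_variance(full)
var_sub = total_variance(sub)
print(f"full-space total variance: {var_full:.3e}")
print(f"subspace total variance:   {var_sub:.3e}")
```

With Gaussian perturbations the total variance of this estimator grows roughly with the number of perturbed dimensions, so the subspace estimate is expected to be much less noisy. The subspace estimate is, however, only an estimate of the gradient's projection onto `span(Q)`, so lower variance alone does not show a better descent direction.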

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- The proposed projection-based, gradient-aligned perturbation design is useful for this task.
- The paper provides thorough theoretical analysis to support the method.

Weaknesses

- The paper lacks sufficient ablation studies on hyperparameters (e.g., subspace dimension, update frequency, perturbation scale). How sensitive is the method to these choices, and what guidelines should practitioners follow?
- Table 3 and Table 4 only report partial experiments. Could the authors provide results on more datasets to confirm robustness and generality?
- The experiments mainly focus on OPT models, which are relatively outdated. It would strengthen the claims to include newer LLM

Taxonomy

Topics: Topic Modeling · Machine Learning in Materials Science · Natural Language Processing Techniques