Sparse MeZO: Less Parameters for Better Performance in Zeroth-Order LLM Fine-Tuning

Yong Liu; Zirui Zhu; Chaoyu Gong; Minhao Cheng; Cho-Jui Hsieh; Yang You

arXiv:2402.15751 · cs.LG · February 17, 2026 · 1 citation



TL;DR

Sparse MeZO is a zeroth-order optimization method that perturbs and updates only a selected subset of parameters, improving fine-tuning accuracy and convergence speed for large language models while keeping memory use low.

Contribution

It proposes a novel parameter selection scheme for zeroth-order optimization, enabling effective sparse fine-tuning of large language models with reduced memory and improved convergence.

Findings

01

Achieves a 9% absolute accuracy improvement over MeZO on the RTE task

02

Provides 3.5x faster convergence compared to MeZO

03

Enables fine-tuning LLaMA-30b on a single GPU

Abstract

While fine-tuning large language models (LLMs) for specific tasks often yields impressive results, it comes at the cost of memory inefficiency due to back-propagation in gradient-based training. Memory-efficient Zeroth-order (MeZO) optimizers, recently proposed to address this issue, require only forward passes during training, making them more memory-friendly. However, compared with exact gradients, ZO-based gradients usually exhibit an estimation error, which can significantly hurt the optimization process, leading to slower convergence and suboptimal solutions. In addition, we find that the estimation error is more harmful when it is added to large weights than to small weights. Based on this observation, this paper introduces Sparse MeZO, a novel memory-efficient zeroth-order optimization approach that applies ZO only to a carefully chosen subset of parameters. We propose a simple yet…
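The idea in the abstract can be illustrated with a small sketch. The abstract is cut off before it states the selection rule, so the fixed magnitude threshold below is an assumption, as are the function name `sparse_zo_step`, its parameters, and the toy loss. The two-forward-pass estimate is the standard SPSA-style estimator used by zeroth-order methods, written here in NumPy; it is not the authors' implementation.

```python
import numpy as np

def sparse_zo_step(theta, loss_fn, lr=1e-2, eps=1e-3, threshold=0.5, seed=0):
    """One masked SPSA-style zeroth-order update (illustrative sketch only)."""
    # Assumption: treat weights with magnitude below `threshold` as the
    # selected subset; the source does not spell out the actual rule.
    mask = (np.abs(theta) < threshold).astype(theta.dtype)
    # Sample one Gaussian direction and zero it outside the selected subset.
    z = np.random.default_rng(seed).standard_normal(theta.shape) * mask
    # Two forward passes, no back-propagation.
    loss_plus = loss_fn(theta + eps * z)
    loss_minus = loss_fn(theta - eps * z)
    # Scalar estimate of the directional derivative along z.
    g = (loss_plus - loss_minus) / (2.0 * eps)
    # Only masked-in coordinates move, because z is zero elsewhere.
    return theta - lr * g * z, mask

# Toy usage: a quadratic loss with two small and two large weights.
theta = np.array([0.1, -0.2, 2.0, -3.0])
toy_loss = lambda w: float(np.sum((w - 1.0) ** 2))
new_theta, mask = sparse_zo_step(theta, toy_loss)
```

In this toy run the two large entries of `theta` are left untouched, while the two small entries receive the zeroth-order update. Deriving `z` from a seed mirrors the MeZO practice of regenerating the perturbation instead of storing it, which is what keeps memory use close to inference.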

Peer Reviews

Decision · NeurIPS 2025 poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

* **s1:** The paper is well-motivated and written, clearly balanced between motivation, methods, and empirical evaluation. The method is simple and does not require a complex procedure on top of existing zero-order optimization processes.
* **s2:** The experimental results are compelling, showing an average improvement of 3.7 points on the SuperGLUE tasks over vanilla MeZO without impact on the memory consumption or convergence speed.

Weaknesses

* **w1:** I am not fully conv

Reviewer 02 · Rating 3 · Confidence 5

Strengths

* Incorporating sparsity and MeZO is an interesting direction for performance improvement.
* The paper is structured logically, making it easy to follow the motivation and methodology.

Weaknesses

* The paper lacks discussion of and comparison with some important ZO works on sparsity, like [1]. Moreover, the method seems to work because its gradient estimates are less noisy, so it would be helpful to compare it with variance-reduction works for ZO, like [2] and [3].
* There are no specif

Reviewer 03 · Rating 4 · Confidence 3

Strengths

- The paper provides a new observation for zeroth-order methods: small weights have a bigger impact on training than large weights. This is very interesting and somewhat counter-intuitive. I think this should be studied more in the future.
- The experimental results are quite strong against vanilla MeZO and support the hypothesis.

Weaknesses

- The experiments could be more comprehensive. Some of the datasets are missing from some tables (for example, Table 2 and others in t

