Sparse MeZO: Less Parameters for Better Performance in Zeroth-Order LLM Fine-Tuning
Yong Liu, Zirui Zhu, Chaoyu Gong, Minhao Cheng, Cho-Jui Hsieh, Yang You

TL;DR
Sparse MeZO is a zeroth-order optimization method that updates only a selected subset of parameters, improving fine-tuning accuracy and convergence speed for large language models while keeping memory use low.
Contribution
It proposes a novel parameter selection scheme for zeroth-order optimization, enabling effective sparse fine-tuning of large language models with reduced memory and improved convergence.
Findings
Achieves a 9% accuracy improvement on the RTE task
Provides 3.5x faster convergence compared to MeZO
Enables fine-tuning LLaMA-30b on a single GPU
Abstract
While fine-tuning large language models (LLMs) for specific tasks often yields impressive results, it comes at the cost of memory inefficiency due to back-propagation in gradient-based training. Memory-efficient Zeroth-order (MeZO) optimizers, recently proposed to address this issue, only require forward passes during training, making them more memory-friendly. However, compared with exact gradients, ZO-based gradients usually exhibit an estimation error, which can significantly hurt the optimization process, leading to slower convergence and suboptimal solutions. In addition, we find that the estimation error is more harmful when added to large weights than to small weights. Based on this observation, this paper introduces Sparse MeZO, a novel memory-efficient zeroth-order optimization approach that applies ZO only to a carefully chosen subset of parameters. We propose a simple yet…
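The idea in the abstract, a forward-only gradient estimate applied to a chosen subset of weights, can be illustrated with a small sketch. This is not the authors' implementation. It assumes a two-point (SPSA-style) estimator, a fixed magnitude threshold that selects small weights, and a mask computed once up front. The names `small_weight_mask`, `sparse_zo_step`, `quadratic_loss`, and all hyperparameter values are hypothetical and chosen only for illustration.

```python
import random


def small_weight_mask(theta, threshold):
    # Assumed selection rule: keep weights with |w| <= threshold (1.0), skip the rest (0.0).
    return [1.0 if abs(w) <= threshold else 0.0 for w in theta]


def sparse_zo_step(theta, loss_fn, mask, lr=0.05, eps=1e-3, seed=0):
    # Seeded RNG so the perturbation is reproducible for a given step.
    rng = random.Random(seed)
    # Gaussian perturbation, zeroed for every unselected weight.
    z = [m * rng.gauss(0.0, 1.0) for m in mask]
    theta_plus = [w + eps * zi for w, zi in zip(theta, z)]
    theta_minus = [w - eps * zi for w, zi in zip(theta, z)]
    # Two forward passes give a scalar projected-gradient estimate; no back-propagation.
    g = (loss_fn(theta_plus) - loss_fn(theta_minus)) / (2.0 * eps)
    # Unselected weights have z == 0, so they are left unchanged.
    return [w - lr * g * zi for w, zi in zip(theta, z)]


def quadratic_loss(theta):
    # Toy objective standing in for a model's loss; minimum at all-ones.
    return sum((w - 1.0) ** 2 for w in theta)


theta = [0.1, -0.2, 3.0, 0.05]
mask = small_weight_mask(theta, threshold=0.5)  # mask fixed up front for simplicity
initial_loss = quadratic_loss(theta)
for step in range(200):
    theta = sparse_zo_step(theta, quadratic_loss, mask, seed=step)
final_loss = quadratic_loss(theta)
```

In this toy run the large weight (3.0) is never perturbed or updated, while the three small weights move toward the minimum using only loss evaluations. How the subset is actually chosen, and whether the mask is recomputed during training, is left to the paper.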
Peer Reviews
Decision: NeurIPS 2025 poster
**Strengths**
* **s1:** The paper is well-motivated and written, clearly balanced between motivation, methods, and empirical evaluation. The method is simple and does not require a complex procedure on top of existing zero-order optimization processes.
* **s2:** The experimental results are compelling, showing an average improvement of 3.7 points on the SuperGLUE tasks over vanilla MeZO without impact on the memory consumption or convergence speed.

**Weaknesses**
* **w1:** I am not fully conv

**Strengths:**
* Incorporating sparsity and MeZO is an interesting direction for performance improvement.
* The paper is structured logically, making it easy to follow the motivation and methodology.

**Weaknesses:**
* The paper lacks discussion and comparison with some important ZO works in sparsity, like [1]. Moreover, seems the method works due to gradient estimation with less noise, so it’s helpful to compare with some variance-reduction works of ZO, like [2] and [3].
* There are no specif

**Strengths**
- The paper provides a new observation for zeroth order methods that small weights have bigger impacts on training than large weights. This is very interesting and somewhat counter-intuitive. I think this should be studied more in the future.
- The experimental results are quite strong against vanilla mezo and support the hypothesis.

**Weaknesses**
- The experiments could be more comprehensive. Some of the datasets are missing from some tables (for example, table 2 and others in t
