Fine-tuning Quantized Neural Networks with Zeroth-order Optimization

Sifeng Shang; Jiayi Zhou; Chenyu Lin; Minxian Li; Kaiyang Zhou

arXiv:2505.13430·cs.LG·February 13, 2026

TL;DR

This paper introduces QZO, a memory-efficient method for fine-tuning large language models by combining zeroth-order optimization with quantization, significantly reducing memory usage and enabling training on limited hardware.

Contribution

The paper proposes Quantized Zeroth-order Optimization (QZO), a novel approach that allows fine-tuning quantized neural networks without storing gradients or optimizer states, reducing memory costs substantially.

Findings

1. QZO reduces memory cost by over 18× for 4-bit LLMs.
2. QZO enables fine-tuning Llama-2-13B on a single 24GB GPU.
3. QZO is compatible with various quantization methods.
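
To put the single-GPU finding in context, here is a back-of-the-envelope sketch of weight storage at different bit-widths. It is an illustration under stated assumptions, not the paper's accounting: the 13e9 parameter count is the nominal figure implied by the model name, the helper `weights_gb` is hypothetical, and the estimate ignores quantization scales, activations, and framework overhead. The first-order total further assumes that gradients and two Adam-style moment buffers are each stored at 16 bits. It does not attempt to reproduce the reported 18× figure.

```python
def weights_gb(n_params, bits_per_weight):
    # Bytes = params * bits / 8, reported in decimal gigabytes (1e9 bytes).
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 13e9  # assumption: nominal parameter count for a "13B" model

bf16_weights = weights_gb(N_PARAMS, 16)  # 16-bit weights only
int4_weights = weights_gb(N_PARAMS, 4)   # 4-bit weights only

# Assumption: first-order training keeps weights, gradients, and two
# optimizer moment buffers, all at 16 bits (four buffers of equal size).
first_order_total = 4 * bf16_weights

# Zeroth-order training on quantized weights needs neither gradients nor
# optimizer states, so weight storage dominates this rough estimate.
zeroth_order_int4_total = int4_weights

print(f"bf16 weights: {bf16_weights:.1f} GB")
print(f"int4 weights: {int4_weights:.1f} GB")
print(f"rough ratio:  {first_order_total / zeroth_order_int4_total:.0f}x")
```

Under these assumptions the weights alone take 26 GB at 16 bits and 6.5 GB at 4 bits, so only the 4-bit copy fits within a 24GB budget before any other buffers are counted.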

Abstract

As the size of large language models grows exponentially, GPU memory has become a bottleneck for adapting these models to downstream tasks. In this paper, we aim to push the limits of memory-efficient training by minimizing memory usage on model weights, gradients, and optimizer states, within a unified framework. Our idea is to eliminate both gradients and optimizer states using zeroth-order optimization, which approximates gradients by perturbing weights during forward passes to identify gradient directions. To minimize memory usage on weights, we employ model quantization, e.g., converting from bfloat16 to int4. However, directly applying zeroth-order optimization to quantized weights is infeasible due to the precision gap between discrete weights and continuous gradients, which would otherwise require de-quantization and re-quantization. To overcome this challenge, we propose…
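
The abstract describes zeroth-order optimization as estimating gradient directions from perturbed forward passes, and the reviews note that the method perturbs the quantization scales while the integer weights stay fixed, with clipping used for stability. The following is a minimal sketch of that idea on a toy problem, not the authors' implementation. All names (`CODES`, `zo_step`, `clip`, and so on) are hypothetical, the ±1 perturbation is the classic SPSA choice rather than something the source specifies, and clipping the scalar directional-derivative estimate is an assumption about what the reviews call clipping.

```python
import random

# Hypothetical toy setup: a 2-output linear layer whose weights are stored as
# fixed integer codes times one continuous scale per output row.
CODES = [[3, -2, 1], [-1, 4, 2]]   # integer weights, never updated
XS = [[1.0, 0.0, 2.0], [0.5, -1.0, 1.0], [2.0, 1.0, 0.0], [-1.0, 0.5, 1.5]]
TARGET_SCALES = [0.5, 0.25]        # used only to generate toy labels

def forward(scales, x):
    # De-quantize on the fly: effective weight = scale * integer code.
    return [s * sum(q * xi for q, xi in zip(row, x))
            for s, row in zip(scales, CODES)]

YS = [forward(TARGET_SCALES, x) for x in XS]

def loss(scales):
    # Mean over samples of the squared error summed over outputs.
    total = 0.0
    for x, y in zip(XS, YS):
        total += sum((p - t) ** 2 for p, t in zip(forward(scales, x), y))
    return total / len(XS)

def zo_step(scales, rng, eps=1e-3, lr=0.01, clip=1.0):
    # Random +/-1 direction for every scale (an assumption; Gaussian noise
    # is another common choice).
    z = [rng.choice((-1.0, 1.0)) for _ in scales]
    # Two forward passes with the scales nudged in opposite directions.
    plus = loss([s + eps * zi for s, zi in zip(scales, z)])
    minus = loss([s - eps * zi for s, zi in zip(scales, z)])
    d = (plus - minus) / (2 * eps)   # scalar directional-derivative estimate
    d = max(-clip, min(clip, d))     # clip the scalar to bound the update size
    # Move only the continuous scales; no gradients or optimizer states stored.
    return [s - lr * d * zi for s, zi in zip(scales, z)]

rng = random.Random(0)
scales = [0.4, 0.3]
initial_loss = loss(scales)
for _ in range(300):
    scales = zo_step(scales, rng)
final_loss = loss(scales)
print(f"loss {initial_loss:.4f} -> {final_loss:.2e}, scales {scales}")
```

Each step needs only two forward passes and one random direction, which is why no backpropagation buffers appear anywhere in the loop. The `clip` threshold is the kind of hyperparameter Reviewer 02 flags as sensitive; the value here was picked for the toy problem only.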

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. Q-SPSA operates on Δ, keeping integer weights fixed; avoids backprop and optimizer states.
2. DDC has theoretical backing and empirical impact (NaNs avoided; improved stability).
3. ~18× memory saving; single-GPU (24 GB) fine-tuning for large models; reduced trainable params/FLOPs vs MeZO.
4. Works across 4-bit (GPTQ) and 2-bit (AQLM) quantization; preliminary coverage of diffusion.

Overall, I think this is a good paper.

Weaknesses

1. The comparison set omits strong practical baselines such as LoRA/QLoRA/AdaLoRA on the same datasets/models/bit-widths. Since QZO and PEFT are orthogonal (QZO tunes scales; PEFT adds low-rank adapters), including them would contextualize accuracy-vs-memory/latency trade-offs more fairly than only MeZO/SGD. (Tables report MeZO and full FT but not PEFT.)
2. The method’s success is acknowledged to depend on PTQ quality, but there’s no systematic study across quantizers (e.g., AWQ/SmoothQuant vs

Reviewer 02 · Rating 8 · Confidence 4

Strengths

- This is a particularly relevant problem, as researchers without access to high-end GPUs often face significant challenges in conducting LLM research due to the models’ large memory requirements.
- The paper is well written, the proposed solution is elegant, and the experiments are carefully designed to evaluate different components of QZO, such as clipping.

Weaknesses

- Using clipping for variance reduction may make the proposed method sensitive to hyperparameter choices. As shown in Figure 3, the clipping parameter has a substantial impact on accuracy, and it is unclear how this parameter should be selected beyond trial and error.

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. This work proposes Quantized Zeroth-order Optimization (QZO), which significantly reduces training memory usage.
2. The optimization of quantized weights requires no de-quantization or re-quantization.
3. The memory optimization is effective: for example, it allows fine-tuning Llama-2-13B on a single 24GB GPU.

Weaknesses

1. In the Introduction, the paper lacks a clear distinction between full-parameter fine-tuning and parameter-efficient fine-tuning. The current discussion implicitly assumes full-parameter fine-tuning, which is insufficient.
2. The experiments are also inadequate. Comparisons with parameter-efficient fine-tuning methods (e.g., LoRA, QLoRA) are missing, both in terms of memory consumption and fine-tuning performance. As a result, the paper does not demonstrate the trade-off between performance an

Code & Models

Repositories

maifoundations/qzo
PyTorch · Official

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Parallel Computing and Optimization Techniques

Methods: Diffusion