Zero-Order Sharpness-Aware Minimization
Yao Fu, Yihang Jin, Chunxia Zhang, Junmin Liu, Guang Dai, Haishan Ye

TL;DR
ZOSA is a zero-order optimization method that incorporates sharpness-aware minimization to improve prompt-tuning efficiency and generalization in large language models, especially in resource-limited scenarios.
Contribution
It introduces ZOSA, integrating zero-order gradient estimation with sharpness-aware minimization, providing a computationally efficient and effective approach for prompt learning.
Findings
ZOSA outperforms existing prompt tuning methods in few-shot tasks.
It achieves better generalization by targeting flat minima.
The method is computationally efficient and stable across experiments.
Abstract
Prompt learning has become a key method for adapting large language models to specific tasks with limited data. However, traditional gradient-based optimization methods for tuning prompts are computationally intensive, posing challenges for efficiency. We introduce ZOSA (Zero-Order Sharpness-Aware Minimization), a novel optimization framework that integrates zero-order optimization with sharpness-aware minimization to enhance prompt tuning. ZOSA employs Rademacher perturbation vectors to estimate gradients without requiring backpropagation. By incorporating sharpness-aware principles, it targets flat minima in the loss landscape, improving generalization. An adaptive learning rate, guided by loss variability, further ensures stable convergence. Experiments on few-shot learning tasks, such as text classification and natural language inference, show that ZOSA significantly outperforms…
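The abstract's core idea (Rademacher-based gradient estimates combined with a sharpness-aware update) can be illustrated with a minimal sketch. This is not the authors' algorithm: the function names `zo_gradient` and `zo_sam_step`, the forward-difference estimator, the L2 normalization of the inner perturbation, and all hyperparameter values are assumptions chosen for illustration, and the adaptive learning rate described in the abstract is omitted.

```python
import math
import random


def rademacher(d, rng):
    """Draw a d-dimensional vector with independent entries in {-1, +1}."""
    return [rng.choice((-1.0, 1.0)) for _ in range(d)]


def zo_gradient(f, x, m, eps, rng):
    """Forward-difference zero-order gradient estimate (illustrative).

    Averages (f(x + eps*u) - f(x)) / eps * u over m Rademacher directions u.
    Only function evaluations are used; no backpropagation is needed.
    """
    d = len(x)
    f0 = f(x)
    grad = [0.0] * d
    for _ in range(m):
        u = rademacher(d, rng)
        fu = f([xi + eps * ui for xi, ui in zip(x, u)])
        scale = (fu - f0) / (eps * m)
        for i in range(d):
            grad[i] += scale * u[i]
    return grad


def zo_sam_step(f, x, lr, rho, m, eps, rng):
    """One SAM-style step using zero-order estimates (assumed form).

    1. Estimate the gradient at x.
    2. Move to x_adv = x + rho * g / ||g|| (ascent direction, radius rho).
    3. Estimate the gradient at x_adv and apply it to x.
    """
    g = zo_gradient(f, x, m, eps, rng)
    norm = math.sqrt(sum(gi * gi for gi in g)) or 1.0  # avoid division by zero
    x_adv = [xi + rho * gi / norm for xi, gi in zip(x, g)]
    g_adv = zo_gradient(f, x_adv, m, eps, rng)
    return [xi - lr * gi for xi, gi in zip(x, g_adv)]


def sphere(v):
    """Toy objective used only for this demonstration."""
    return sum(t * t for t in v)


if __name__ == "__main__":
    rng = random.Random(0)
    x = [1.0] * 5
    for _ in range(200):
        x = zo_sam_step(sphere, x, lr=0.05, rho=0.05, m=20, eps=1e-3, rng=rng)
    print("final loss:", sphere(x))
```

Each call to `zo_sam_step` in this sketch costs 2(m + 1) function evaluations, which is why comparisons against single-probe zero-order baselines are more informative when reported per query rather than per iteration.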
Peer Reviews
Decision: ICLR 2026 Conference Desk Rejected Submission
- The experiments are reasonably broad, covering standard synthetic objectives and black-box prompt tuning on popular NLP benchmarks, comparing to several ZO baselines.
- The method shows consistent gains over these black-box baselines in both convergence speed and downstream accuracy across tasks and dimensions.
- I think the main weakness of the paper is limited theoretical novelty. I am by no means an expert in this area, so I am willing to be corrected on any/all of the following points and their technical difficulty:
  - Properties 3.1 and 3.2 are proven in existing ZO literature.
  - The proof of Theorem 4.3 seems to follow standard smooth nonconvex analysis for ZO methods (e.g., in Ghadimi and Lan 2013), and obtains the same rate, with some modifications for the extra normalization term.
1. Minimal, clean SAM analogue in ZO with no gradient access; σ-normalization is used coherently for both the inner radius and the outer step.
2. Standard nonconvex ZO rate with explicit dependence on $m$, $d$, $\rho$ and a flatness argument (trace-Hessian bias).
3. Batched Rademacher directions are GPU-friendly; the algorithm is simple to implement.
1. Results are largely reported vs iterations, while ZOSA uses two probes/step and, on synthetic tasks, very large $m$ and per-step query counts. There are no fixed-budget (equalized function queries) plots, no wall-clock comparisons, and no per-step cost breakdown, unlike recent fast ZO work that foregrounds efficiency.
2. Synthetic functions under-specified and potentially biased. High-dimensional success at $d=10^4$ relies on very large query budgets per step, which likely masks estimator
1. Clean integration of SAM into ZO using a two-point estimator and loss-std normalization, yielding a normalized-SAM view. This offers a principled way to bias toward flatter minima without backprop.
2. The theoretical framework is well-developed with convergence analysis and generalization bounds.
3. Empirical results on both synthetic non-convex functions and real-world GLUE prompt tuning tasks demonstrate superior convergence speed and higher accuracy/F1 scores compared to a comprehensive
1. Limited novelty in components: Each individual component (batched Rademacher perturbations and variance reduction from FZOO, SAM-like perturbations from SABO) exists in prior work. The contribution is primarily in the combination, which while valuable, is somewhat incremental.
2. The generalization bound in Theorem 4.4 assumes convexity of the loss function, which is restrictive for neural networks and conflicts with the non-convex assumptions elsewhere in the paper.
3. Experimental compari
- The final method has good performance on synthetic functions and demonstrates on par performance with some selected baselines on a zero-order prompt tuning benchmark.
- Introducing SAM to zero-order optimization seems like an engineering trick to try, but I don't believe that it is a novel idea.
- SAM is introduced as an addition to zero-order optimization, but neither the theoretical results nor the experiments demonstrate strong evidence that this is a good addition. It is not clear which baselines differ from ZOSA in exactly just the SAM component. Seems like a missing ablation to me.
- Figures 1 and 2 are not well made and very hard to parse.
- The title is to
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · Speech Recognition and Synthesis
