SharpZO: Hybrid Sharpness-Aware Vision Language Model Prompt Tuning via Forward-Only Passes

Yifan Yang; Zhen Zhang; Rupak Vignesh Swaminathan; Jing Liu; Nathan Susanj; Zheng Zhang

arXiv:2506.20990·cs.LG·October 27, 2025

TL;DR

SharpZO introduces a memory-efficient, forward-only fine-tuning method for vision-language models that combines global sharpness-aware exploration with local zeroth-order optimization, significantly improving accuracy and convergence.

Contribution

The paper presents a novel hybrid sharpness-aware zeroth-order optimization method that enhances forward-only fine-tuning of vision-language models with theoretical analysis and superior experimental results.

Findings

01. Achieves up to 7% accuracy improvement over existing methods.

02. Significantly faster convergence in fine-tuning.

03. Effective for memory-constrained, inference-only edge devices.

Abstract

Fine-tuning vision language models (VLMs) has achieved remarkable performance across various downstream tasks; yet it requires access to model gradients through backpropagation (BP), making it unsuitable for memory-constrained, inference-only edge devices. To address this limitation, previous work has explored various BP-free fine-tuning methods. However, these approaches rely on high-variance evolutionary strategies (ES) or zeroth-order (ZO) optimization and often fail to achieve satisfactory performance. In this paper, we propose a hybrid Sharpness-aware Zeroth-order optimization (SharpZO) approach, specifically designed to enhance the performance of ZO VLM fine-tuning via sharpness-aware warm-up training. SharpZO features a two-stage optimization process: a sharpness-aware ES stage that globally explores and smooths the loss landscape to construct a strong initialization,…
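The abstract refers to zeroth-order (ZO) optimization, which estimates gradients from forward evaluations alone. The following is a minimal sketch of a generic two-point randomized ZO estimator driving plain gradient descent on a toy quadratic. It is not the SharpZO algorithm and does not include the sharpness-aware ES stage; the names `zo_gradient`, `zo_sgd`, and `toy_loss`, and all hyperparameter values, are assumptions made for illustration.

```python
import random


def zo_gradient(loss, params, mu=1e-3, num_samples=8, rng=None):
    """Estimate the gradient of `loss` at `params` using only loss evaluations.

    Generic two-point randomized estimator: for each random direction u,
    the finite difference (loss(x + mu*u) - loss(x - mu*u)) / (2*mu)
    approximates the directional derivative along u.
    """
    if rng is None:
        rng = random.Random(0)
    dim = len(params)
    grad = [0.0] * dim
    for _ in range(num_samples):
        # Sample a Gaussian perturbation direction.
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        plus = [p + mu * ui for p, ui in zip(params, u)]
        minus = [p - mu * ui for p, ui in zip(params, u)]
        # Two forward evaluations, no backpropagation.
        scale = (loss(plus) - loss(minus)) / (2.0 * mu)
        for i in range(dim):
            grad[i] += scale * u[i] / num_samples
    return grad


def zo_sgd(loss, params, lr=0.05, steps=300, seed=0):
    """Run gradient descent using the ZO estimate in place of a true gradient."""
    rng = random.Random(seed)
    params = list(params)
    for _ in range(steps):
        g = zo_gradient(loss, params, rng=rng)
        params = [p - lr * gi for p, gi in zip(params, g)]
    return params


def toy_loss(x):
    # Hypothetical stand-in for a prompt-tuning loss; minimum at x = (1, 1, 1).
    return sum((xi - 1.0) ** 2 for xi in x)


start = [0.0, 3.0, -2.0]
final = zo_sgd(toy_loss, start)
print(toy_loss(start), toy_loss(final))
```

Each step costs `2 * num_samples` forward evaluations and stores no activations for a backward pass, which is why estimators of this kind suit inference-only devices. The estimate is noisy, and its variance grows with the number of tuned parameters, which is the high-variance issue the abstract attributes to prior BP-free methods.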

Code & Models

Repositories

yifanycc/sharpzo (Official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Robotics and Automated Systems · Advanced Image and Video Retrieval Techniques

Methods: Contrastive Language-Image Pre-training