SharpZO: Hybrid Sharpness-Aware Vision Language Model Prompt Tuning via Forward-Only Passes
Yifan Yang, Zhen Zhang, Rupak Vignesh Swaminathan, Jing Liu, Nathan Susanj, Zheng Zhang

TL;DR
SharpZO is a memory-efficient, forward-only fine-tuning method for vision-language models. It combines global sharpness-aware exploration with local zeroth-order optimization, improving accuracy by up to 7% over existing methods and converging faster.
Contribution
The paper presents a novel hybrid sharpness-aware zeroth-order optimization method that enhances forward-only fine-tuning of vision-language models with theoretical analysis and superior experimental results.
Findings
Achieves up to 7% accuracy improvement over existing methods.
Significantly faster convergence in fine-tuning.
Effective for memory-constrained, inference-only edge devices.
Abstract
Fine-tuning vision language models (VLMs) has achieved remarkable performance across various downstream tasks; yet, it requires access to model gradients through backpropagation (BP), making it unsuitable for memory-constrained, inference-only edge devices. To address this limitation, previous work has explored various BP-free fine-tuning methods. However, these approaches often rely on high-variance evolutionary strategies (ES) or zeroth-order (ZO) optimization, and often fail to achieve satisfactory performance. In this paper, we propose a hybrid Sharpness-aware Zeroth-order optimization (SharpZO) approach, specifically designed to enhance the performance of ZO VLM fine-tuning via a sharpness-aware warm-up training. SharpZO features a two-stage optimization process: a sharpness-aware ES stage that globally explores and smooths the loss landscape to construct a strong initialization,…
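To make the phrase "forward-only" concrete, the sketch below shows a generic randomized two-point zeroth-order gradient estimator, which approximates a gradient from loss evaluations alone. It is not the SharpZO algorithm: the sharpness-aware ES warm-up stage is omitted, and the function names (`zo_gradient`, `zo_sgd`, `toy_loss`), the parameters (`mu`, `num_samples`, `lr`), and the quadratic toy loss standing in for a prompt-tuning objective are all assumptions made for illustration.

```python
import numpy as np


def zo_gradient(loss_fn, theta, mu=1e-3, num_samples=16, rng=None):
    """Generic two-point ZO gradient estimate (illustrative, not the paper's estimator).

    Uses only forward evaluations of loss_fn; no backpropagation is needed.
    """
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(theta)
    for _ in range(num_samples):
        # Random Gaussian direction in parameter space.
        u = rng.standard_normal(theta.shape)
        # Central finite difference of the loss along u (two forward passes).
        delta = loss_fn(theta + mu * u) - loss_fn(theta - mu * u)
        grad += (delta / (2.0 * mu)) * u
    return grad / num_samples


def zo_sgd(loss_fn, theta0, lr=0.05, steps=200, seed=0):
    """Plain gradient descent driven by the ZO estimate above (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    for _ in range(steps):
        theta -= lr * zo_gradient(loss_fn, theta, rng=rng)
    return theta


def toy_loss(theta):
    """Assumed stand-in for a prompt-tuning loss: a quadratic with its minimum at 1."""
    return float(np.sum((theta - 1.0) ** 2))


theta_final = zo_sgd(toy_loss, np.zeros(4))
print("final loss:", toy_loss(theta_final))
```

Each update costs `2 * num_samples` forward passes and stores no activations for a backward pass, which is the memory advantage the abstract refers to. The variance of such estimators grows with the number of tuned parameters, which is the weakness the abstract attributes to prior ZO and ES approaches.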
Taxonomy
Topics: Multimodal Machine Learning Applications · Robotics and Automated Systems · Advanced Image and Video Retrieval Techniques
Methods: Contrastive Language-Image Pre-training
