GLAD: Generalizable Tuning for Vision-Language Models
Yuqi Peng, Pengfei Wang, Jianzhuang Liu, Shifeng Chen

TL;DR
GLAD is a simple, effective, and generalizable tuning framework for vision-language models. It combines LoRA with gradient regularization to improve robustness and transferability across tasks and datasets.
Contribution
The paper proposes GLAD, a tuning method that combines LoRA with gradient regularization to improve the generalization and robustness of vision-language models.
Findings
GLAD outperforms previous methods on 15 benchmark datasets.
It improves base-to-novel class generalization.
It enhances image domain and cross-dataset generalization.
Abstract
Pre-trained vision-language models, such as CLIP, show impressive zero-shot recognition ability and can be easily transferred to specific downstream tasks via prompt tuning, even with limited training data. However, existing prompt tuning methods face two main challenges: (1) In few-shot scenarios, data scarcity often leads to overfitting, making the model sensitive to changes in the input domain. (2) To mitigate overfitting, these methods typically rely on complex task-specific model architectures and sensitive hyperparameter tuning, severely restricting their general applicability. To address these issues, we propose a simpler and more general framework called GLAD (Generalizable LoRA tuning with RegulArized GraDient). We show that merely applying LoRA achieves performance in downstream tasks comparable to current state-of-the-art prompt-based methods. While LoRA is effective and easy…
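The abstract's claim that "merely applying LoRA" is competitive rests on the standard LoRA idea. The pre-trained weight W is kept frozen, and only a low-rank update (alpha / r) * B A is trained. The sketch below is a minimal, dependency-free illustration of that generic formulation under stated assumptions. It is not GLAD's code, and it does not model GLAD's gradient regularization. The class name `LoRALinear`, the rank and `alpha` values, and the initialization (small Gaussian A, zero B) are illustrative choices. Which CLIP layers GLAD adapts is not specified here.

```python
import random


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class LoRALinear:
    """Hypothetical sketch: frozen weight W plus trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        self.weight = weight  # frozen pre-trained weight, shape d_out x d_in
        d_out, d_in = len(weight), len(weight[0])
        rng = random.Random(seed)
        self.rank = rank
        self.scale = alpha / rank
        # A: r x d_in with small random init (assumed); B: d_out x r with zero init,
        # so the adapted layer initially behaves exactly like the frozen one.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        # y = W x + (alpha / r) * B (A x); only A and B would receive gradients.
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        # r * (d_in + d_out) trainable values instead of d_in * d_out.
        return self.rank * (len(self.weight[0]) + len(self.weight))

    def merged_weight(self):
        # Fold the low-rank update into W so inference costs the same as the original layer.
        return [
            [w + self.scale * sum(self.B[i][k] * self.A[k][j] for k in range(self.rank))
             for j, w in enumerate(row)]
            for i, row in enumerate(self.weight)
        ]


if __name__ == "__main__":
    W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    layer = LoRALinear(W, rank=1, alpha=2.0)
    print(layer.forward([1.0, 2.0, 3.0]))  # B starts at zero, so this equals W x
    print(layer.trainable_params())
```

Zero-initializing B means training starts from the pre-trained model's behavior. Because the update can be merged into W after training, the adapted layer adds no inference overhead.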
Taxonomy
Topics: Natural Language Processing Techniques
