Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization

Jinlong Li; Dong Zhao; Zequn Jie; Elisa Ricci; Lin Ma; Nicu Sebe

arXiv:2407.08374·cs.CV·October 17, 2024

Open Access

TL;DR

This paper proposes an orthogonal fine-tuning method with self-regularization for vision-language models like CLIP, enhancing robustness, stability, and generalization, especially in few-shot scenarios, by injecting orthogonal matrices and employing data augmentation.

Contribution

Introduces OrthSR, a novel orthogonal fine-tuning approach with self-regularization, improving stability and generalization of VLMs during task-specific adaptation.

Findings

01. Enhanced zero-shot generalization with self-regularization.

02. Improved few-shot classification performance.

03. Faster convergence and stability during fine-tuning.

Abstract

Efficient fine-tuning of vision-language models (VLMs) like CLIP for specific downstream tasks is gaining significant attention. Previous works primarily focus on prompt learning to adapt CLIP to a variety of downstream tasks, but they suffer from task overfitting when fine-tuned on a small dataset. In this paper, we introduce OrthSR, an orthogonal fine-tuning method that efficiently fine-tunes pretrained weights for enhanced robustness and generalization, combined with a self-regularization strategy that maintains the stability of the VLM's zero-shot generalization. Specifically, trainable orthogonal matrices are injected seamlessly into the transformer architecture and kept orthogonal by a constraint enforced during training; their norm-preserving property leads to stable and faster convergence, while keeping the…
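
The norm-preserving property mentioned in the abstract can be illustrated with a minimal sketch. It assumes the Cayley transform as the way to keep a trainable matrix orthogonal, and a left-multiplication `Q @ W` as the way to inject that matrix into a frozen weight; neither choice is stated in the abstract, and the names `cayley_orthogonal`, `frozen_weight`, and `adapted_weight` are hypothetical.

```python
import numpy as np


def cayley_orthogonal(params: np.ndarray) -> np.ndarray:
    """Map an unconstrained square matrix to an orthogonal one (Cayley transform).

    Assumption: the abstract does not say how orthogonality is enforced;
    the Cayley transform is one common choice, used here for illustration.
    """
    skew = params - params.T                  # skew-symmetric: skew.T == -skew
    identity = np.eye(params.shape[0])
    # I + skew is always invertible because skew has purely imaginary eigenvalues.
    return np.linalg.solve(identity + skew, identity - skew)


rng = np.random.default_rng(0)
dim_out, dim_in = 8, 5

# Stand-in for one frozen pretrained projection weight (hypothetical shape).
frozen_weight = rng.normal(size=(dim_out, dim_in))

# Trainable parameters; only these would receive gradients during fine-tuning.
params = 0.1 * rng.normal(size=(dim_out, dim_out))

Q = cayley_orthogonal(params)
adapted_weight = Q @ frozen_weight            # hypothetical injection point

# Orthogonality: Q^T Q = I up to floating-point error.
orthogonality_error = np.abs(Q.T @ Q - np.eye(dim_out)).max()

# Norm preservation: every column of the weight keeps its Euclidean norm.
norms_before = np.linalg.norm(frozen_weight, axis=0)
norms_after = np.linalg.norm(adapted_weight, axis=0)

# Zero parameters give Q = I, so fine-tuning starts at the pretrained weight.
Q_init = cayley_orthogonal(np.zeros((dim_out, dim_out)))

print(f"max |Q^T Q - I| = {orthogonality_error:.2e}")
print(f"max norm change = {np.abs(norms_after - norms_before).max():.2e}")
```

Because `Q` is orthogonal, inner products between the columns of the weight are unchanged as well, so in this sketch the adaptation rotates the pretrained weight without rescaling it.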

Taxonomy

Topics: Multimodal Machine Learning Applications

Methods: Context Optimization · Contrastive Language-Image Pre-training · Focus · Cutout