Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization
Jinlong Li, Dong Zhao, Zequn Jie, Elisa Ricci, Lin Ma, Nicu Sebe

TL;DR
This paper proposes an orthogonal fine-tuning method with self-regularization for vision-language models like CLIP, enhancing robustness, stability, and generalization, especially in few-shot scenarios, by injecting orthogonal matrices and employing data augmentation.
Contribution
Introduces OrthSR, a novel orthogonal fine-tuning approach with self-regularization, improving stability and generalization of VLMs during task-specific adaptation.
Findings
Enhanced zero-shot generalization with self-regularization.
Improved few-shot classification performance.
Faster convergence and greater stability during fine-tuning.
Abstract
Efficient fine-tuning of vision-language models (VLMs) like CLIP for specific downstream tasks is gaining significant attention. Previous works primarily focus on prompt learning to adapt CLIP to a variety of downstream tasks, but they suffer from task overfitting when fine-tuned on a small dataset. In this paper, we introduce OrthSR, an orthogonal fine-tuning method that efficiently fine-tunes pretrained weights for enhanced robustness and generalization, combined with a self-regularization strategy that maintains the stability of the VLM's zero-shot generalization. Specifically, trainable orthogonal matrices are injected seamlessly into the transformer architecture and subjected to an orthogonality constraint during training; their norm-preserving property leads to stable and faster convergence, while keeping the…
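The norm-preserving property mentioned in the abstract can be illustrated with a small sketch. The visible part of the abstract does not say how the orthogonal matrices are parametrized, so the Householder reflection used here is an assumption chosen for simplicity, not the authors' construction; the weight `W` and the vector `v` are hypothetical toy values. The sketch shows that multiplying a frozen weight by an orthogonal matrix leaves column norms and pairwise inner products unchanged.

```python
import math


def householder(v):
    """Build H = I - 2 v v^T / (v^T v), which is orthogonal for any nonzero v."""
    n = len(v)
    vv = sum(x * x for x in v)
    if vv == 0.0:
        raise ValueError("v must be nonzero")
    return [
        [(1.0 if i == j else 0.0) - 2.0 * v[i] * v[j] / vv for j in range(n)]
        for i in range(n)
    ]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    return [list(row) for row in zip(*a)]


def column_norms(w):
    """Euclidean norm of each column of w."""
    return [math.sqrt(sum(x * x for x in col)) for col in zip(*w)]


def gram(w):
    """W^T W: pairwise inner products between the columns of w."""
    return matmul(transpose(w), w)


# Hypothetical frozen "pretrained" weight (3 x 2) and a trainable vector v.
W = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
v = [0.3, -1.2, 0.7]

R = householder(v)        # orthogonal by construction (assumed parametrization)
W_adapted = matmul(R, W)  # adapted weight: R applied to the frozen W

print("column norms before:", column_norms(W))
print("column norms after: ", column_norms(W_adapted))
```

Because `R` is orthogonal for every nonzero `v`, any update to `v` changes the direction of the weight columns but not their lengths or the angles between them, which is one way to read the stability claim in the abstract.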
Taxonomy
Topics: Multimodal Machine Learning Applications
Methods: Context Optimization · Contrastive Language-Image Pre-training · Focus · Cutout
