Difference Vector Equalization for Robust Fine-tuning of Vision-Language Models
Satoshi Suzuki, Shin'ya Yamaguchi, Shoichiro Takeda, Taiga Yamane, Naoki Makishima, Naotaka Kawata, Mana Ihori, Tomohiro Tanaka, Shota Orihashi, Ryo Masumura

TL;DR
This paper introduces Difference Vector Equalization (DiVE), a novel fine-tuning method for vision-language models that maintains geometric structure to improve robustness across in-distribution, out-of-distribution, and zero-shot tasks.
Contribution
DiVE preserves the geometric structure of the embeddings during fine-tuning, so that fine-tuning on ID data does not compromise generalization in OOD and zero-shot settings.
Findings
DiVE outperforms existing methods on ID, OOD, and zero-shot benchmarks.
The proposed average vector loss (AVL) and pairwise vector loss (PVL) effectively preserve geometric structure.
DiVE maintains the geometric structure of embeddings during fine-tuning.
Abstract
Contrastive pre-trained vision-language models, such as CLIP, demonstrate strong generalization abilities in zero-shot classification by leveraging embeddings extracted from image and text encoders. This paper aims to robustly fine-tune these vision-language models on in-distribution (ID) data without compromising their generalization abilities in out-of-distribution (OOD) and zero-shot settings. Current robust fine-tuning methods tackle this challenge by reusing contrastive learning, which was used in pre-training, for fine-tuning. However, we found that these methods distort the geometric structure of the embeddings, which plays a crucial role in the generalization of vision-language models, resulting in limited OOD and zero-shot performance. To address this, we propose Difference Vector Equalization (DiVE), which preserves the geometric structure during fine-tuning. The idea behind…
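To make the zero-shot setting mentioned in the abstract concrete, here is a minimal sketch of classification from embeddings: the image embedding is compared with one text embedding per class by cosine similarity, and the most similar class is predicted. The function names, class names, and the three-dimensional toy vectors are assumptions chosen for illustration; they are not outputs of CLIP or code from the paper, and the sketch does not implement DiVE itself.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_predict(image_embedding, text_embeddings):
    """Return (best_index, scores) using cosine similarity.

    image_embedding: list of floats from a (hypothetical) image encoder.
    text_embeddings: one list of floats per class from a (hypothetical)
    text encoder, e.g. for prompts such as "a photo of a cat".
    """
    img = l2_normalize(image_embedding)
    scores = []
    for text_vec in text_embeddings:
        txt = l2_normalize(text_vec)
        scores.append(sum(a * b for a, b in zip(img, txt)))
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores


# Toy values, made up for illustration only.
class_names = ["cat", "dog"]
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
image_embedding = [0.9, 0.1, 0.0]

pred, scores = zero_shot_predict(image_embedding, text_embeddings)
print(class_names[pred], scores)
```

Because the prediction depends only on the relative positions of image and text embeddings, any fine-tuning that changes those relative positions can change zero-shot predictions, which is why the abstract treats the geometric structure of the embeddings as important.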
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis
