Difference Vector Equalization for Robust Fine-tuning of Vision-Language Models

Satoshi Suzuki; Shin'ya Yamaguchi; Shoichiro Takeda; Taiga Yamane; Naoki Makishima; Naotaka Kawata; Mana Ihori; Tomohiro Tanaka; Shota Orihashi; Ryo Masumura

arXiv:2511.09973 · cs.CV · November 14, 2025

TL;DR

This paper introduces Difference Vector Equalization (DiVE), a novel fine-tuning method for vision-language models that maintains geometric structure to improve robustness across in-distribution, out-of-distribution, and zero-shot tasks.

Contribution

DiVE preserves the geometric structure of the embeddings during fine-tuning, improving out-of-distribution and zero-shot generalization without sacrificing in-distribution performance.

Findings

1. DiVE outperforms existing robust fine-tuning methods on in-distribution (ID), out-of-distribution (OOD), and zero-shot benchmarks.

2. The proposed AVL and PVL losses effectively preserve geometric structure.

3. DiVE maintains the geometric structure of embeddings during fine-tuning.

Abstract

Contrastive pre-trained vision-language models, such as CLIP, demonstrate strong generalization abilities in zero-shot classification by leveraging embeddings extracted from image and text encoders. This paper aims to robustly fine-tune these vision-language models on in-distribution (ID) data without compromising their generalization abilities in out-of-distribution (OOD) and zero-shot settings. Current robust fine-tuning methods tackle this challenge by reusing contrastive learning, which was used in pre-training, for fine-tuning. However, we found that these methods distort the geometric structure of the embeddings, which plays a crucial role in the generalization of vision-language models, resulting in limited OOD and zero-shot performance. To address this, we propose Difference Vector Equalization (DiVE), which preserves the geometric structure during fine-tuning. The idea behind…
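The zero-shot classification mentioned in the abstract can be illustrated with a minimal sketch: an image is assigned the class whose text embedding is most similar to the image embedding. This is not code from the paper. The function names, the toy 3-dimensional vectors, and the use of plain cosine similarity are assumptions made for illustration; a real CLIP model would produce the embeddings with its image and text encoders.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def zero_shot_classify(image_embedding, class_text_embeddings):
    """Pick the class whose text embedding is closest to the image embedding.

    class_text_embeddings maps a label to the embedding of a text prompt
    for that label (hypothetical values here, not encoder outputs).
    """
    scores = {
        label: cosine_similarity(image_embedding, text_embedding)
        for label, text_embedding in class_text_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Made-up toy embeddings standing in for encoder outputs.
text_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.1, 0.9, 0.0],
    "car": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.3, 0.1]

label, scores = zero_shot_classify(image_embedding, text_embeddings)
print(label, scores)
```

Because the prediction depends only on the relative positions of image and text embeddings, a fine-tuning procedure that moves these embeddings inconsistently can change predictions on classes it never saw, which is the concern the abstract raises about distorted geometric structure.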

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis