The Inter-Intra Modal Measure: A Predictive Lens on Fine-Tuning Outcomes in Vision-Language Models

Laura Niss; Kevin Vogt-Lowell; Theodoros Tsiligkaridis

arXiv:2407.15731 · cs.CV · August 5, 2025

Open Access

TL;DR

The paper introduces the Inter-Intra Modal Measure (IIMM), a metric that predicts fine-tuning outcomes in vision-language models. The authors report strong predictive power and a theoretical stability result, which can help practitioners decide how to adapt a model.

Contribution

We propose IIMM, a novel metric that predicts fine-tuning gains and forgetting in vision-language models, supported by empirical analysis and a theoretical bound based on Wasserstein distance.

Findings

01

Higher IIMM scores correlate strongly with larger in-domain performance improvements after fine-tuning.

02

Higher IIMM scores indicate greater out-of-domain degradation.

03

IIMM outperforms existing transferability measures in predictive accuracy.

Abstract

The fine-tuning of large vision-language foundation models remains an underexplored area, particularly regarding its impact on learning gains and catastrophic forgetting. Inspired by the significance of modality gaps in contrastive dual-encoders, we introduce the Inter-Intra Modal Measure (IIMM), a predictive metric that quantifies the relationship between intra-modal image embedding similarity and inter-modal misalignment. Through extensive empirical analysis across four state-of-the-art vision-language models and five fine-tuning techniques, we establish a strong linear relationship: tasks with higher IIMM scores yield greater in-domain performance improvements but suffer from more pronounced out-of-domain degradation, with some parameter-efficient fine-tuning (PEFT) methods exhibiting severe forgetting. Compared to existing transferability measures, the IIMM demonstrates…
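
To make the two ingredients named in the abstract concrete, here is a minimal sketch of how intra-modal image similarity and inter-modal alignment might be computed from precomputed embeddings. It is an illustration only: this page does not give the paper's exact IIMM formula, so the function names, the use of mean cosine similarity, and the reading of "misalignment" as one minus the image-text similarity are all assumptions, and the sketch deliberately does not combine the components into a single score.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def intra_modal_similarity(image_emb: np.ndarray) -> float:
    """Mean cosine similarity over all ordered pairs of distinct image embeddings."""
    z = l2_normalize(np.asarray(image_emb, dtype=float))
    sims = z @ z.T
    n = z.shape[0]
    # Drop the diagonal, which compares each image with itself.
    return float((sims.sum() - np.trace(sims)) / (n * (n - 1)))


def inter_modal_similarity(image_emb, text_emb, labels) -> float:
    """Mean cosine similarity between each image and the text embedding of its label."""
    zi = l2_normalize(np.asarray(image_emb, dtype=float))
    zt = l2_normalize(np.asarray(text_emb, dtype=float))
    idx = np.asarray(labels, dtype=int)
    return float(np.mean(np.sum(zi * zt[idx], axis=1)))


def modal_components(image_emb, text_emb, labels) -> dict:
    """Return both components; how the paper combines them is NOT assumed here."""
    intra = intra_modal_similarity(image_emb)
    inter = inter_modal_similarity(image_emb, text_emb, labels)
    return {
        "intra_similarity": intra,
        "inter_similarity": inter,
        # Hypothetical reading of "misalignment" as 1 - cosine similarity.
        "inter_misalignment": 1.0 - inter,
    }


if __name__ == "__main__":
    # Random stand-ins for encoder outputs: 8 images, 3 class prompts, dimension 16.
    rng = np.random.default_rng(0)
    images = rng.normal(size=(8, 16))
    prompts = rng.normal(size=(3, 16))
    labels = rng.integers(0, 3, size=8)
    print(modal_components(images, prompts, labels))
```

In practice the embeddings would come from the image and text encoders of the vision-language model being evaluated, with one text embedding per class prompt; the random arrays above only exercise the functions.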

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Human Pose and Action Recognition