The Inter-Intra Modal Measure: A Predictive Lens on Fine-Tuning Outcomes in Vision-Language Models
Laura Niss, Kevin Vogt-Lowell, Theodoros Tsiligkaridis

TL;DR
The paper introduces the Inter-Intra Modal Measure (IIMM), a metric for predicting fine-tuning outcomes in vision-language models. The authors report strong predictive power and a theoretical stability bound, which can help practitioners decide how to adapt a model.
Contribution
We propose IIMM, a novel metric that predicts fine-tuning gains and forgetting in vision-language models, supported by empirical analysis and a theoretical bound based on Wasserstein distance.
Findings
IIMM correlates strongly with in-domain performance improvements.
Higher IIMM scores indicate greater out-of-domain degradation.
IIMM outperforms existing transferability measures in predictive accuracy.
Abstract
The fine-tuning of large vision-language foundation models remains an underexplored area, particularly regarding its impact on learning gains and catastrophic forgetting. Inspired by the significance of modality gaps in contrastive dual-encoders, we introduce the Inter-Intra Modal Measure (IIMM), a predictive metric that quantifies the relationship between intra-modal image embedding similarity and inter-modal misalignment. Through extensive empirical analysis across four state-of-the-art vision-language models and five fine-tuning techniques, we establish a strong linear relationship: tasks with higher IIMM scores yield greater in-domain performance improvements but suffer from more pronounced out-of-domain degradation, with some parameter-efficient fine-tuning (PEFT) methods exhibiting severe forgetting. Compared to existing transferability measures, the IIMM demonstrates…
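The abstract describes IIMM only in words, as a quantity relating intra-modal image embedding similarity to inter-modal misalignment, and this page does not give the formula. The sketch below is therefore an illustration of the two ingredients, not the authors' definition. It assumes precomputed image embeddings, one text embedding per class label, and an unweighted average of the two similarity terms; the function names and the averaging rule are hypothetical choices made for this example.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def intra_modal_similarity(image_emb):
    """Mean cosine similarity over all pairs of distinct images."""
    z = l2_normalize(image_emb)
    sims = z @ z.T
    n = z.shape[0]
    # Drop the diagonal: every image has similarity 1 with itself.
    off_diag = sims[~np.eye(n, dtype=bool)]
    return float(off_diag.mean())


def inter_modal_mismatch_similarity(image_emb, text_emb, labels):
    """Mean cosine similarity between each image and the text embeddings
    of the classes it does NOT belong to (an assumed proxy for misalignment)."""
    zi = l2_normalize(image_emb)
    zt = l2_normalize(text_emb)
    sims = zi @ zt.T  # shape: (n_images, n_classes)
    mask = np.ones_like(sims, dtype=bool)
    mask[np.arange(len(labels)), labels] = False  # exclude the true class
    return float(sims[mask].mean())


def illustrative_score(image_emb, text_emb, labels):
    """Hypothetical combination: unweighted mean of the two terms.
    This is an assumption for illustration, not the published IIMM formula."""
    intra = intra_modal_similarity(image_emb)
    inter = inter_modal_mismatch_similarity(image_emb, text_emb, labels)
    return 0.5 * (intra + inter)
```

In practice the embeddings would come from the image and text encoders of a contrastive dual-encoder model. `labels` is expected to be an integer array of class indices into `text_emb`.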
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Human Pose and Action Recognition
