Meta-Contrastive Learning for Vision-Language Models via Task-Adaptive CLIP Training

Merham Fouladvand; Peuroly Batra

arXiv:2603.27091 · math.OC · March 31, 2026

TL;DR

This paper introduces a domain-conditioned meta-contrastive learning framework that improves the cross-domain generalization and few-shot adaptability of vision-language models such as CLIP under domain shift.

Contribution

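The paper formulates multimodal learning as a bilevel meta-learning problem over domain-conditioned tasks. Domain embeddings modulate the image and text representations, and a cross-domain alignment regularizer encourages domain-invariant features, with the aim of improving robustness and few-shot adaptation.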
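
One way such a bilevel problem could be written is sketched below. This is an illustrative formulation, not the paper's own notation: the symbols $\theta$ (shared encoder parameters), $e_d$ (embedding of domain $d$), $S_d$ and $Q_d$ (support and query sets), $\alpha$ (inner step size), $\lambda$ (regularization weight), and $\mathcal{R}_{\text{align}}$ are all assumptions, as is the choice of a single gradient step in the inner loop.

```latex
% Outer problem: minimize query loss after adaptation, plus alignment penalty
\min_{\theta,\,\{e_d\}} \; \sum_{d}
  \mathcal{L}_{\text{con}}\bigl(\phi_d' ;\, Q_d\bigr)
  \;+\; \lambda\, \mathcal{R}_{\text{align}}(\theta, \{e_d\})

% Inner problem: one gradient step on the support set of domain d
\text{where}\quad \phi_d = (\theta, e_d), \qquad
\phi_d' = \phi_d - \alpha\, \nabla_{\phi_d}
  \mathcal{L}_{\text{con}}\bigl(\phi_d ;\, S_d\bigr)
```

Here $\mathcal{L}_{\text{con}}$ stands for a CLIP-style contrastive loss. Whether the inner loop adapts all parameters or only the domain embedding is not stated in the summary above.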

Findings

01 Improved robustness under domain shift.

02 Enhanced few-shot adaptation performance.

03 Compatible with standard contrastive training pipelines.

Abstract

We propose Domain-Conditioned Meta-Contrastive Learning, a framework for improving the cross-domain generalization of vision-language models. While contrastive models such as CLIP achieve strong performance through large-scale training, they rely on a global objective that does not explicitly account for domain shift. To address this limitation, we formulate multimodal learning as a bilevel meta-learning problem over domain-conditioned tasks. Specifically, we introduce domain embeddings that modulate image and text representations, and optimize the model for rapid adaptation to domain-specific distributions via gradient-based inner-loop updates. In addition, we incorporate a cross-domain alignment regularization to encourage domain-invariant representations. Our approach is compatible with standard contrastive training pipelines and can be applied to heterogeneous datasets spanning…
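
The ingredients named in the abstract (domain embeddings that modulate both modalities, a gradient-based inner-loop update, and a cross-domain alignment term) can be illustrated with a small NumPy sketch. It is a toy under stated assumptions, not the authors' implementation: the additive modulation in `modulate`, the mean-matching form of `alignment_penalty`, adapting only the domain embedding in the inner loop, the finite-difference gradient standing in for autodiff, and every function name and hyperparameter are hypothetical. Only the evaluation of the outer objective is shown; the outer parameter update is omitted.

```python
import numpy as np

def l2_normalize(x):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def modulate(features, domain_emb):
    # Assumed conditioning: shift features by the domain embedding, renormalize.
    return l2_normalize(features + domain_emb)

def log_softmax(logits, axis):
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_loss(img, txt, temperature=0.07):
    # Symmetric InfoNCE: matched image-text pairs sit on the diagonal.
    logits = img @ txt.T / temperature
    i2t = -np.mean(np.diag(log_softmax(logits, axis=1)))
    t2i = -np.mean(np.diag(log_softmax(logits, axis=0)))
    return float(0.5 * (i2t + t2i))

def task_loss(domain_emb, img, txt):
    # Contrastive loss on one domain after modulating both modalities.
    return clip_loss(modulate(img, domain_emb), modulate(txt, domain_emb))

def numerical_grad(f, x, eps=1e-5):
    # Central finite differences; stands in for autodiff in this sketch.
    grad = np.zeros_like(x)
    for k in range(x.size):
        step = np.zeros_like(x)
        step[k] = eps
        grad[k] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

def inner_adapt(domain_emb, img_s, txt_s, lr=0.1, steps=1):
    # Inner loop (assumption): adapt only the domain embedding on the support set.
    adapted = domain_emb.copy()
    for _ in range(steps):
        grad = numerical_grad(lambda e: task_loss(e, img_s, txt_s), adapted)
        adapted = adapted - lr * grad
    return adapted

def alignment_penalty(domain_means):
    # Assumed regularizer: mean squared distance of domain means to their average.
    means = np.stack(domain_means)
    return float(np.mean(np.sum((means - means.mean(axis=0)) ** 2, axis=1)))

def meta_objective(tasks, domain_embs, lam=0.1):
    # Outer objective: query loss after adaptation plus the alignment term.
    query_losses, domain_means = [], []
    for task, emb in zip(tasks, domain_embs):
        adapted = inner_adapt(emb, task["img_support"], task["txt_support"])
        query_losses.append(task_loss(adapted, task["img_query"], task["txt_query"]))
        domain_means.append(modulate(task["img_query"], adapted).mean(axis=0))
    return float(np.mean(query_losses)) + lam * alignment_penalty(domain_means)

# Synthetic stand-ins for encoder outputs from two domains (hypothetical data).
rng = np.random.default_rng(0)
dim = 16

def make_task(offset):
    img = rng.normal(size=(8, dim)) + offset
    txt = img + 0.1 * rng.normal(size=(8, dim))
    return {"img_support": img[:4], "txt_support": txt[:4],
            "img_query": img[4:], "txt_query": txt[4:]}

tasks = [make_task(0.0), make_task(2.0)]
domain_embs = [np.zeros(dim), np.zeros(dim)]
outer_loss = meta_objective(tasks, domain_embs)
```

In a real pipeline the features would come from the CLIP image and text encoders, and an autodiff framework would differentiate `meta_objective` with respect to the shared parameters and the initial domain embeddings.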
