Learning Invariant Causal Mechanism from Vision-Language Models

Zeen Song; Siyu Zhao; Xingyu Zhang; Jiangmeng Li; Changwen Zheng; Wenwen Qiang

arXiv:2405.15289 · cs.CV · June 18, 2025

TL;DR

This paper models CLIP's prediction process using causal inference, demonstrating how focusing on invariant causal factors can improve out-of-distribution robustness, and proposes a new framework called CLIP-ICM.

Contribution

The paper introduces CLIP-ICM, a causal-inference-based method that enhances CLIP's robustness by leveraging invariant causal mechanisms across environments.

Findings

01. CLIP-ICM significantly improves the OOD performance of CLIP.

02. The paper proves theoretically that a linear mapping from CLIP embeddings to invariant factors exists.

03. Experiments on multiple OOD datasets show robustness gains.

Abstract

Contrastive Language-Image Pretraining (CLIP) has achieved remarkable success, but its performance can degrade when fine-tuned in out-of-distribution (OOD) scenarios. We model the prediction process using a Structural Causal Model (SCM) and show that the causal mechanism involving both invariant and variant factors in training environments differs from that in test environments. In contrast, the causal mechanism with solely invariant factors remains consistent across environments. We theoretically prove the existence of a linear mapping from CLIP embeddings to invariant factors, which can be estimated using interventional data. Additionally, we provide a condition to guarantee low OOD risk of the invariant predictor. Based on these insights, we propose the Invariant Causal Mechanism of CLIP (CLIP-ICM) framework. CLIP-ICM involves collecting interventional data, estimating a linear…
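The idea of estimating a linear map from embeddings to invariant factors using interventional data can be illustrated with a toy sketch. The following is not the paper's algorithm: it assumes synthetic embeddings that are an orthogonal mixture of invariant and variant factors, assumes that an intervention resamples only the variant factors, and uses a simple eigen-decomposition as a stand-in estimator. The function name `estimate_invariant_projection` and all variable names are hypothetical.

```python
import numpy as np


def estimate_invariant_projection(z, z_int, k):
    """Return a (d, k) matrix spanning the k directions least affected by interventions.

    z, z_int: (n, d) arrays of embeddings for the same samples before and
    after an intervention (hypothetical setup, not the paper's estimator).
    """
    diff = z_int - z
    # Second-moment matrix of the embedding shift caused by the interventions.
    shift_cov = diff.T @ diff / len(diff)
    # eigh returns eigenvalues in ascending order, so the first k eigenvectors
    # are the directions along which the embeddings change the least.
    _, eigvecs = np.linalg.eigh(shift_cov)
    return eigvecs[:, :k]


# Synthetic data: k invariant factors and d - k variant factors (assumption).
rng = np.random.default_rng(0)
n, d, k = 500, 8, 3
invariant = rng.normal(size=(n, k))
variant = rng.normal(size=(n, d - k))
variant_int = rng.normal(size=(n, d - k))  # the intervention resamples variant factors

# Orthogonal mixing stands in for the (unknown) embedding function.
mixing, _ = np.linalg.qr(rng.normal(size=(d, d)))
z = np.hstack([invariant, variant]) @ mixing.T
z_int = np.hstack([invariant, variant_int]) @ mixing.T

A = estimate_invariant_projection(z, z_int, k)

# Projected embeddings should be (numerically) unchanged by the intervention.
residual = np.abs((z_int - z) @ A).max()
print(A.shape, residual < 1e-8)
```

In this toy setting the projected embeddings `z @ A` are an invertible linear function of the invariant factors, so a classifier built on them is unaffected by changes in the variant factors. Real CLIP embeddings need not satisfy the orthogonal-mixing assumption used here.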

Videos

Learning Invariant Causal Mechanism from Vision-Language Models · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling

Methods: Causal inference · Contrastive Language-Image Pre-training