Rethinking Misalignment in Vision-Language Model Adaptation from a Causal Perspective

Yanan Zhang; Jiangmeng Li; Lixiang Liu; Wenwen Qiang

arXiv:2410.12816·cs.CV·November 6, 2024·2 cites

TL;DR

This paper introduces a causal perspective to address data misalignment in vision-language models like CLIP, proposing a novel method called CDC that decouples semantics and evaluates prediction uncertainty, improving adaptation to downstream tasks.

Contribution

The paper develops a structural causal model to analyze misalignment and proposes CDC, a causality-guided semantic decoupling method, to mitigate interference from task-irrelevant knowledge.

Findings

01. CDC improves downstream task performance across multiple settings.

02. Decoupling semantics reduces the impact of task-irrelevant knowledge.

03. Uncertainty evaluation enhances prediction reliability.

Abstract

Foundational Vision-Language models such as CLIP have exhibited impressive generalization in downstream tasks. However, CLIP suffers from a two-level misalignment issue, i.e., task misalignment and data misalignment, when adapting to specific tasks. Soft prompt tuning has mitigated the task misalignment, yet the data misalignment remains a challenge. To analyze the impacts of the data misalignment, we revisit the pre-training and adaptation processes of CLIP and develop a structural causal model. We discover that while we expect to capture task-relevant information for downstream tasks accurately, the task-irrelevant knowledge impacts the prediction results and hampers the modeling of the true relationships between the images and the predicted classes. As task-irrelevant knowledge is unobservable, we leverage the front-door adjustment and propose Causality-Guided Semantic Decoupling and…
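The abstract names the front-door adjustment as the tool for handling an unobservable confounder. As a rough illustration of that general technique only (not the CDC method itself, whose details are not given in this excerpt), the sketch below checks the standard front-door formula, P(y | do(x)) = sum over m of P(m | x) times the sum over x' of P(y | m, x') P(x'), on a toy binary model. Every variable role and every probability here is a hypothetical assumption made for the example: U is an unobserved confounder, X a cause, M a mediator, and Y an outcome.

```python
from itertools import product

# Hypothetical toy structural causal model (all numbers are made up):
#   U -> X, U -> Y  (U is unobserved), and X -> M -> Y.
P_U = [0.6, 0.4]                            # P(U = u)
P_X_GIVEN_U = [[0.8, 0.2], [0.3, 0.7]]      # indexed [u][x]
P_M_GIVEN_X = [[0.9, 0.1], [0.2, 0.8]]      # indexed [x][m]
P_Y1_GIVEN_MU = [[0.1, 0.5], [0.6, 0.9]]    # indexed [m][u], gives P(Y = 1 | m, u)


def p_y_given_mu(y, m, u):
    """P(Y = y | M = m, U = u) for binary Y."""
    p1 = P_Y1_GIVEN_MU[m][u]
    return p1 if y == 1 else 1.0 - p1


def observational_joint():
    """P(x, m, y) with U marginalized out: what an observer could estimate."""
    joint = {}
    for x, m, y in product((0, 1), repeat=3):
        joint[(x, m, y)] = sum(
            P_U[u] * P_X_GIVEN_U[u][x] * P_M_GIVEN_X[x][m] * p_y_given_mu(y, m, u)
            for u in (0, 1)
        )
    return joint


def front_door(joint, x, y):
    """Front-door estimate of P(y | do(x)) using only the observed joint."""
    p_x = {a: sum(joint[(a, m, b)] for m in (0, 1) for b in (0, 1)) for a in (0, 1)}
    total = 0.0
    for m in (0, 1):
        p_m_given_x = sum(joint[(x, m, b)] for b in (0, 1)) / p_x[x]
        inner = 0.0
        for x2 in (0, 1):
            p_x2_m = sum(joint[(x2, m, b)] for b in (0, 1))
            # P(y | m, x2) * P(x2)
            inner += joint[(x2, m, y)] / p_x2_m * p_x[x2]
        total += p_m_given_x * inner
    return total


def true_do(x, y):
    """Ground-truth P(y | do(x)), computed with access to the hidden U."""
    return sum(
        P_M_GIVEN_X[x][m] * sum(P_U[u] * p_y_given_mu(y, m, u) for u in (0, 1))
        for m in (0, 1)
    )
```

In this toy setting, `front_door(observational_joint(), x, y)` agrees with `true_do(x, y)` for every binary `x` and `y`, even though the estimator never reads `P_U`. That is the property that makes the adjustment attractive when the confounder cannot be observed.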

Videos

Rethinking Misalignment in Vision-Language Model Adaptation from a Causal Perspective · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications

Methods: Contrastive Language-Image Pre-training