Rethinking Misalignment in Vision-Language Model Adaptation from a Causal Perspective
Yanan Zhang, Jiangmeng Li, Lixiang Liu, Wenwen Qiang

TL;DR
This paper takes a causal perspective on data misalignment in vision-language models such as CLIP and proposes CDC, a method that decouples semantics and evaluates prediction uncertainty to improve adaptation to downstream tasks.
Contribution
It develops a structural causal model to analyze misalignment issues and proposes CDC, a causality-guided semantic decoupling method, to mitigate task-irrelevant knowledge interference.
Findings
CDC improves downstream task performance across multiple settings.
Decoupling semantics reduces task-irrelevant knowledge impact.
Uncertainty evaluation enhances prediction reliability.
Abstract
Foundational Vision-Language models such as CLIP have exhibited impressive generalization in downstream tasks. However, CLIP suffers from a two-level misalignment issue, i.e., task misalignment and data misalignment, when adapting to specific tasks. Soft prompt tuning has mitigated the task misalignment, yet the data misalignment remains a challenge. To analyze the impacts of the data misalignment, we revisit the pre-training and adaptation processes of CLIP and develop a structural causal model. We discover that while we expect to capture task-relevant information for downstream tasks accurately, the task-irrelevant knowledge impacts the prediction results and hampers the modeling of the true relationships between the images and the predicted classes. As task-irrelevant knowledge is unobservable, we leverage the front-door adjustment and propose Causality-Guided Semantic Decoupling and…
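For readers unfamiliar with the front-door adjustment mentioned in the abstract, the standard textbook form is given here as a sketch. The variable roles are assumptions made for illustration and are not taken from the paper: X is the input image, Y the predicted class, M an observed mediator on the path from X to Y (for example, a decoupled semantic representation), and the unobserved task-irrelevant knowledge is a confounder of X and Y that does not affect M directly.

```latex
% Front-door adjustment (standard form; the mapping of X, M, Y to the
% paper's variables is an assumption for illustration)
P\bigl(Y \mid do(X = x)\bigr)
  = \sum_{m} P(m \mid x) \sum_{x'} P(Y \mid m, x') \, P(x')
```

The outer sum averages over mediator values reached from x, and the inner sum re-weights over all inputs x' so that the confounder's influence on the M to Y link is averaged out without ever being observed.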
Taxonomy
Topics: Multimodal Machine Learning Applications
Methods: Contrastive Language-Image Pre-training
