Causal Attention for Vision-Language Tasks
Xu Yang, Hanwang Zhang, Guojun Qi, Jianfei Cai

TL;DR
This paper introduces Causal Attention (CATT), a novel attention mechanism that mitigates confounding bias in vision-language models using causal intervention, leading to improved generalization and performance.
Contribution
The paper proposes CATT, which combines In-Sample Attention (IS-ATT) and Cross-Sample Attention (CS-ATT) to remove confounding effects through front-door adjustment. This requires no prior knowledge of the confounder and improves vision-language model performance.
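Removing confounding without knowing the confounder is what the front-door adjustment from causal inference provides. As a reminder, its standard textbook form is given below; the symbols X (input), Z (mediator) and Y (output) are generic notation assumed here for illustration and are not taken from the paper.

```latex
P\bigl(Y \mid do(X)\bigr)
  = \sum_{z} P(Z = z \mid X) \sum_{x} P(X = x)\, P(Y \mid Z = z, X = x)
```

Read loosely, the outer sum selects a mediator from the current input, while the inner sum averages over other possible inputs x. That averaging over other inputs is the step that bringing other samples into the attention computation is meant to mimic.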
Findings
CATT improves various vision-language models significantly.
CATT enables lighter models to perform comparably to larger ones.
CATT is effective in large-scale pre-training scenarios.
Abstract
We present a novel attention mechanism: Causal Attention (CATT), to remove the ever-elusive confounding effect in existing attention-based vision-language models. This effect causes harmful bias that misleads the attention module to focus on the spurious correlations in training data, damaging the model generalization. As the confounder is unobserved in general, we use the front-door adjustment to realize the causal intervention, which does not require any knowledge of the confounder. Specifically, CATT is implemented as a combination of 1) In-Sample Attention (IS-ATT) and 2) Cross-Sample Attention (CS-ATT), where the latter forcibly brings other samples into every IS-ATT, mimicking the causal intervention. CATT abides by the Q-K-V convention and hence can replace any attention module such as top-down attention and self-attention in Transformers. CATT improves various popular…
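The combination of IS-ATT and CS-ATT described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes scaled dot-product attention for both branches, a hypothetical `global_dict` of feature vectors standing in for "other samples", and concatenation as the way the two outputs are combined. All function and variable names are invented for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention in the Q-K-V convention.

    Each argument is a list of equal-length vectors (lists of floats).
    """
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output dimension at a time.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


def causal_attention_sketch(queries, in_sample_feats, global_dict):
    """Hypothetical CATT-style layer: IS-ATT plus CS-ATT.

    Assumption: the two branch outputs are concatenated per query.
    """
    # IS-ATT: keys and values come from the current sample only.
    is_out = attention(queries, in_sample_feats, in_sample_feats)
    # CS-ATT: keys and values come from features of other samples.
    cs_out = attention(queries, global_dict, global_dict)
    return [a + b for a, b in zip(is_out, cs_out)]


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
queries = rand_matrix(4, 8)          # 4 query vectors of width 8
in_sample_feats = rand_matrix(6, 8)  # 6 features from the current sample
global_dict = rand_matrix(10, 8)     # 10 features from other samples (assumed)

out = causal_attention_sketch(queries, in_sample_feats, global_dict)
print(len(out), len(out[0]))  # 4 16
```

Because both branches use the same Q-K-V interface, a sketch like this could in principle stand in for an existing attention call, which matches the abstract's claim that CATT can replace top-down attention or self-attention.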
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
