Towards Deconfounded Image-Text Matching with Causal Inference
Wenhui Li, Xinqi Su, Dan Song, Lanjun Wang, Kun Zhang, An-An Liu

TL;DR
This paper introduces a causal inference approach to improve image-text matching by removing dataset bias and spurious correlations, leading to better generalization on benchmark datasets.
Contribution
It proposes a novel Deconfounded Causal Inference Network (DCIN) that uses Structural Causal Models and backdoor adjustment to mitigate intra- and inter-modal biases in image-text matching.
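For readers unfamiliar with backdoor adjustment, the general identity is P(Y | do(X)) = Σ_z P(Y | X, z) P(z): the outcome is averaged over the confounder's marginal distribution rather than its distribution conditioned on X. The sketch below illustrates that identity on a toy problem with a single binary confounder. It is not DCIN's architecture; the variable names and all probabilities are made-up assumptions chosen only to show how a confounder can inflate an observed association.

```python
# Toy illustration of backdoor adjustment with one binary confounder Z.
# All numbers are hypothetical and are NOT taken from the paper.

P_Z = {0: 0.5, 1: 0.5}                 # P(Z = z)
P_X1_GIVEN_Z = {0: 0.2, 1: 0.8}        # P(X = 1 | Z = z)
P_Y1_GIVEN_X_Z = {                     # P(Y = 1 | X = x, Z = z), keyed by (x, z)
    (0, 0): 0.1, (0, 1): 0.5,
    (1, 0): 0.3, (1, 1): 0.7,
}


def p_x_given_z(x, z):
    """P(X = x | Z = z) for binary X."""
    p1 = P_X1_GIVEN_Z[z]
    return p1 if x == 1 else 1.0 - p1


def observational(x):
    """P(Y = 1 | X = x): each stratum is weighted by P(z | x)."""
    joint = {z: p_x_given_z(x, z) * P_Z[z] for z in P_Z}   # P(x, z)
    p_x = sum(joint.values())                               # P(x)
    return sum(P_Y1_GIVEN_X_Z[(x, z)] * joint[z] / p_x for z in P_Z)


def backdoor_adjusted(x):
    """P(Y = 1 | do(X = x)) = sum_z P(Y = 1 | X = x, z) * P(z)."""
    return sum(P_Y1_GIVEN_X_Z[(x, z)] * P_Z[z] for z in P_Z)


def effect(estimator):
    """Difference in P(Y = 1) between x = 1 and x = 0 under an estimator."""
    return estimator(1) - estimator(0)


if __name__ == "__main__":
    print("observational difference:", round(effect(observational), 4))
    print("backdoor-adjusted difference:", round(effect(backdoor_adjusted), 4))
```

In this toy setup the observational difference is larger than the adjusted one because Z raises both the chance of X = 1 and the chance of Y = 1. That is the kind of spurious correlation the adjustment is meant to remove.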
Findings
DCIN outperforms existing methods on Flickr30K and MSCOCO datasets.
The approach effectively reduces dataset bias and improves matching accuracy.
Experimental results demonstrate superior generalization capabilities.
Abstract
Prior image-text matching methods have shown remarkable performance on many benchmark datasets, but most of them overlook the bias in the dataset, which exists at both the intra-modal and inter-modal levels, and tend to learn spurious correlations that severely degrade the generalization ability of the model. Furthermore, these methods often incorporate biased external knowledge from large-scale datasets as prior knowledge into the image-text matching model, which inevitably forces the model to learn further biased associations. To address the above limitations, this paper first uses Structural Causal Models (SCMs) to illustrate how intra- and inter-modal confounders damage image-text matching. Then, we employ backdoor adjustment to propose an innovative Deconfounded Causal Inference Network (DCIN) for the image-text matching task. DCIN (1) decomposes the intra- and inter-modal confounders and…
Taxonomy
Methods: Causal inference
