Deconfounded Image Captioning: A Causal Retrospect
Xu Yang, Hanwang Zhang, Jianfei Cai

TL;DR
This paper introduces a causal inference-based framework called Deconfounded Image Captioning (DIC) to address dataset bias in vision-language tasks, improving captioning models' robustness and performance.
Contribution
It presents a novel causal perspective on image captioning, proposes the DICv1.0 framework, and demonstrates performance improvements on the MS COCO dataset.
Findings
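DICv1.0 strengthens two prevailing captioning models on MS COCO.
The framework alleviates the negative effects of dataset bias.
DICv1.0 achieves state-of-the-art CIDEr-D scores: a single-model 131.1 on the Karpathy split and 128.4 (c40) on the online split of MS COCO.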
Abstract
Dataset bias in vision-language tasks is becoming one of the main problems hindering the progress of our community. Existing solutions lack a principled analysis of why modern image captioners easily collapse into dataset bias. In this paper, we present a novel perspective, Deconfounded Image Captioning (DIC), to answer this question; we then review modern neural image captioners in retrospect, and finally propose a DIC framework, DICv1.0, to alleviate the negative effects of dataset bias. DIC is based on causal inference, whose two principles, the backdoor and front-door adjustments, help us review previous studies and design new effective models. In particular, we show that DICv1.0 can strengthen two prevailing captioning models and achieve a single-model 131.1 CIDEr-D and 128.4 c40 CIDEr-D on the Karpathy split and the online split of the challenging MS COCO dataset,…
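The backdoor adjustment named in the abstract can be illustrated with a small numeric sketch. This is the textbook formula P(Y | do(X)) = Σ_z P(Y | X, z) P(z) applied to a toy discrete model, not the DICv1.0 architecture itself. The binary variables X, Y, Z, every probability value, and the function names below are assumptions made up for illustration; Z stands in for a confounder such as dataset co-occurrence statistics.

```python
# Toy illustration of the backdoor adjustment.
# All variables and probabilities are hypothetical, not taken from the paper.

P_Z = {0: 0.5, 1: 0.5}            # P(Z = z), the confounder prior
P_X1_GIVEN_Z = {0: 0.2, 1: 0.8}   # P(X = 1 | Z = z)
P_Y1_GIVEN_XZ = {                 # P(Y = 1 | X = x, Z = z), keyed by (x, z)
    (0, 0): 0.1, (1, 0): 0.3,
    (0, 1): 0.5, (1, 1): 0.7,
}


def p_x_given_z(x, z):
    """Return P(X = x | Z = z) for binary x."""
    p1 = P_X1_GIVEN_Z[z]
    return p1 if x == 1 else 1.0 - p1


def observational(x):
    """P(Y = 1 | X = x): weights each z by P(z | X = x), so it is confounded."""
    joint = {z: p_x_given_z(x, z) * P_Z[z] for z in P_Z}  # P(X = x, Z = z)
    p_x = sum(joint.values())                             # P(X = x)
    return sum(P_Y1_GIVEN_XZ[(x, z)] * joint[z] / p_x for z in P_Z)


def backdoor(x):
    """P(Y = 1 | do(X = x)): weights each z by the prior P(z) instead."""
    return sum(P_Y1_GIVEN_XZ[(x, z)] * P_Z[z] for z in P_Z)


if __name__ == "__main__":
    for x in (0, 1):
        print(f"x={x}  observational={observational(x):.2f}  backdoor={backdoor(x):.2f}")
```

In this toy model the observational gap between X = 1 and X = 0 is larger than the adjusted gap, because Z raises both the chance of X = 1 and the chance of Y = 1. That inflation is the kind of spurious correlation the abstract attributes to dataset bias.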
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
