Dependent Multi-Task Learning with Causal Intervention for Image Captioning
Wenqing Chen, Jidong Tian, Caoyun Fan, Hao He, and Yaohui Jin

TL;DR
This paper introduces a multi-task learning framework with causal intervention for image captioning. It targets two problems in generated captions, content inconsistency and insufficient informativeness, by adding intermediate tasks and applying Pearl's do-calculus.
Contribution
It proposes a novel dependent multi-task learning framework with causal intervention (DMTCI) that improves caption quality by reducing spurious correlations and enhancing visual understanding.
Findings
Outperforms baseline models on standard benchmarks.
Achieves competitive results with state-of-the-art methods.
Effectively reduces content inconsistency in generated captions.
Abstract
Recent work on image captioning has mainly followed an extract-then-generate paradigm, pre-extracting a sequence of object-based features and then formulating image captioning as a single sequence-to-sequence task. Although promising, we observed two problems in generated captions: 1) content inconsistency, where models generate contradictory facts; 2) insufficient informativeness, where models miss parts of important information. From a causal perspective, the reason is that models have captured spurious statistical correlations between visual features and certain expressions (e.g., visual features of "long hair" and "woman"). In this paper, we propose a dependent multi-task learning framework with causal intervention (DMTCI). First, we introduce an intermediate task, bag-of-categories generation, before the final task, image captioning. The intermediate task would help the model…
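The visible part of the abstract does not spell out how the intervention is computed. As a rough illustration only, the textbook backdoor adjustment shows the difference between conditioning on visual features and intervening on them. The symbols are assumptions introduced here, not the paper's notation: X stands for the visual features, Y for the generated caption, and Z for a discrete confounder that is assumed to satisfy the backdoor criterion.

```latex
% Assumed notation (not taken from the paper):
%   X = visual features, Y = caption, Z = discrete confounder

% Ordinary conditioning: the confounder is weighted by P(z | x),
% so correlations between X and Z leak into the prediction.
P(Y \mid X = x) = \sum_{z} P(Y \mid X = x, Z = z)\, P(Z = z \mid X = x)

% Intervention via backdoor adjustment: the confounder is weighted
% by its prior P(z), which cuts the dependence of Z on X.
P(Y \mid do(X = x)) = \sum_{z} P(Y \mid X = x, Z = z)\, P(Z = z)
```

The only difference between the two expressions is the weight on each confounder value. In the "long hair" and "woman" example, the interventional form would stop a frequently co-occurring confounder from dominating the caption simply because it is correlated with the visual feature.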
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
