Semi-Supervised Image Captioning by Adversarially Propagating Labeled Data
Dong-Jin Kim, Tae-Hyun Oh, Jinsoo Choi, In So Kweon

TL;DR
This paper introduces a semi-supervised image captioning framework that leverages large unpaired image and caption datasets through adversarial learning to improve captioning performance, especially when paired data is scarce.
Contribution
The paper proposes a novel adversarial semi-supervised learning approach that associates unpaired image and caption data, enhancing image captioning models' generalization capabilities.
Findings
Significant performance improvements on image captioning benchmarks.
Effective handling of out-of-task and web-crawled unpaired data.
Theoretically well-founded, with favorable properties at the global optimum.
Abstract
We present a novel data-efficient semi-supervised framework to improve the generalization of image captioning models. Constructing a large-scale labeled image captioning dataset is an expensive task in terms of labor, time, and cost. In contrast to manually annotating all the training samples, separately collecting uni-modal datasets is immensely easier, e.g., a large-scale image dataset and a sentence dataset. We leverage such massive unpaired image and caption data on top of standard paired data by learning to associate them. To this end, our proposed semi-supervised learning method assigns pseudo-labels to unpaired samples in an adversarial learning fashion, where the joint distribution of image and caption is learned. Our method trains a captioner to learn from paired data and to progressively associate unpaired data. This approach shows noticeable performance improvement even in…
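The pseudo-labeling idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not spell out the procedure, so everything here is an assumption made for illustration. In particular, `dot_score` is a hypothetical stand-in for a learned discriminator over (image, caption) pairs, `assign_pseudo_labels` and `min_score` are invented names, and the feature vectors are made-up numbers.

```python
from typing import Callable, List, Optional, Sequence

Vector = Sequence[float]


def dot_score(image_feat: Vector, caption_feat: Vector) -> float:
    """Hypothetical stand-in for a learned discriminator.

    A higher value means the pair looks more like a real (image, caption) pair.
    """
    return sum(a * b for a, b in zip(image_feat, caption_feat))


def assign_pseudo_labels(
    image_feats: Sequence[Vector],
    caption_feats: Sequence[Vector],
    score_fn: Callable[[Vector, Vector], float] = dot_score,
    min_score: float = 0.5,  # assumed confidence cutoff, not from the paper
) -> List[Optional[int]]:
    """For each unpaired image, pick the best-scoring unpaired caption.

    Returns the caption index per image, or None when no caption
    reaches min_score (the image stays unlabeled).
    """
    labels: List[Optional[int]] = []
    for img in image_feats:
        scores = [score_fn(img, cap) for cap in caption_feats]
        if not scores:
            labels.append(None)
            continue
        best = max(range(len(scores)), key=scores.__getitem__)
        labels.append(best if scores[best] >= min_score else None)
    return labels


# Toy features (invented): three unpaired images, two unpaired captions.
unpaired_images = [[1.0, 0.0], [0.0, 1.0], [0.1, 0.1]]
unpaired_captions = [[0.0, 0.9], [0.8, 0.1]]
pseudo = assign_pseudo_labels(unpaired_images, unpaired_captions)
```

In this sketch the third image matches neither caption well enough and is left unlabeled. In an adversarial setup of the kind the abstract describes, the scoring function would be trained jointly with the captioner rather than fixed as it is here.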
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
