Semi-Supervised Image Captioning Considering Wasserstein Graph Matching

Yang Yang

arXiv:2403.17995·cs.CV·March 28, 2024·1 cites

Open Access

TL;DR

This paper introduces a semi-supervised image captioning method that leverages scene graphs and Wasserstein distance to improve caption quality using limited labeled data and abundant unlabeled images.

Contribution

The novel SSIC-WGM approach uses scene graphs and Wasserstein graph matching for semi-supervised learning in image captioning, addressing cross-modal heterogeneity.
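To make the idea of comparing two graphs with a Wasserstein distance more concrete, here is a minimal sketch. It is not the paper's implementation. It assumes each scene graph has already been reduced to a set of node embedding vectors with uniform node weights, and it swaps in the entropy-regularized Sinkhorn iteration as an approximation of the Wasserstein distance. The function name `sinkhorn_distance` and the parameters `eps` and `n_iters` are hypothetical choices made for illustration.

```python
import numpy as np


def sinkhorn_distance(x, y, eps=0.1, n_iters=200):
    """Approximate optimal-transport cost between two sets of node embeddings.

    Assumptions (not taken from the paper): uniform node weights, squared
    Euclidean ground cost, entropy-regularized Sinkhorn iterations.
    x has shape (n, d); y has shape (m, d).
    """
    n, m = x.shape[0], y.shape[0]
    a = np.full(n, 1.0 / n)  # uniform mass on the nodes of the first graph
    b = np.full(m, 1.0 / m)  # uniform mass on the nodes of the second graph

    # Pairwise squared Euclidean cost between node embeddings, shape (n, m).
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)

    # Scale the regularization by the largest cost so the kernel stays
    # numerically well behaved regardless of the embedding magnitudes.
    scale = max(float(cost.max()), 1e-12)
    kernel = np.exp(-cost / (eps * scale))

    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        v = b / (kernel.T @ u)  # rescale to match column marginals
        u = a / (kernel @ v)    # rescale to match row marginals

    plan = u[:, None] * kernel * v[None, :]  # soft node-to-node matching
    return float((plan * cost).sum()), plan
```

In a semi-supervised setup of the kind summarized above, a distance like this could serve as a consistency term between the node embeddings of an image's scene graph and those of the generated sentence's scene graph, with the transport plan read as a soft node matching. How SSIC-WGM actually builds the graphs and the loss is not specified in this summary.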

Findings

01 Improves captioning accuracy with limited labeled data.

02 Effectively utilizes unlabeled images through graph-based consistency constraints.

03 Demonstrates superior performance over existing semi-supervised methods.

Abstract

Image captioning can automatically generate captions for the given images, and the key challenge is to learn a mapping function from visual features to natural language features. Existing approaches are mostly supervised, i.e., each image in the training set has a corresponding sentence. However, because describing images requires a huge amount of manpower, real-world applications usually have a limited number of described images (i.e., image-text pairs) and a large number of undescribed images. This gives rise to the problem of "Semi-Supervised Image Captioning". To solve this problem, we propose a novel Semi-Supervised Image Captioning method considering Wasserstein Graph Matching (SSIC-WGM), which uses the raw image inputs to supervise the generated sentences. Different from traditional single-modal semi-supervised methods, the difficulty of semi-supervised…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection