Cross-Modal Similarity-Based Curriculum Learning for Image Captioning

Hongkuan Zhang; Saku Sugawara; Akiko Aizawa; Lei Zhou; Ryohei Sasano; Koichi Takeda

arXiv:2212.07075 · cs.CV · December 15, 2022

TL;DR

This paper introduces a cross-modal similarity-based curriculum learning method for image captioning. It measures the difficulty of each image-caption pair with a pretrained vision-language model and reports improved performance and generalization.

Contribution

It proposes a difficulty measurement for image captioning based on cross-modal similarity computed by a pretrained vision-language model, so the curriculum needs neither domain-specific heuristics nor prior training of a captioning model.

Findings

01

Achieves superior performance on COCO and Flickr30k datasets.

02

Demonstrates improved generalization to unseen data.

03

Maintains competitive convergence speed.

Abstract

Image captioning models require high-level generalization ability to describe the contents of various images in words. Most existing approaches treat all image-caption pairs equally during training, without considering differences in their learning difficulty. Several image captioning approaches introduce curriculum learning methods that present training data in order of increasing difficulty. However, their difficulty measurements rely on either domain-specific features or prior model training. In this paper, we propose a simple yet efficient difficulty measurement for image captioning using cross-modal similarity calculated by a pretrained vision-language model. Experiments on the COCO and Flickr30k datasets show that our proposed approach achieves superior performance to baselines and competitive convergence speed, without requiring heuristics or incurring…
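The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact difficulty formula or the training schedule, so the sketch assumes that difficulty is `1 - cosine similarity` between an image embedding and its caption embedding, and that training uses a linear pacing schedule that gradually widens the pool of available examples. The embeddings are toy vectors standing in for the output of a pretrained vision-language model, and the names `difficulty_scores`, `curriculum_order`, and `available_pool` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def difficulty_scores(
    image_embs: Sequence[Sequence[float]],
    caption_embs: Sequence[Sequence[float]],
) -> List[float]:
    """Assumption: a pair with low cross-modal similarity is harder to learn."""
    return [
        1.0 - cosine_similarity(img, cap)
        for img, cap in zip(image_embs, caption_embs)
    ]


def curriculum_order(difficulties: Sequence[float]) -> List[int]:
    """Indices of training pairs sorted from easiest to hardest."""
    return sorted(range(len(difficulties)), key=lambda k: difficulties[k])


def available_pool(
    order: Sequence[int],
    step: int,
    total_steps: int,
    initial_fraction: float = 0.25,
) -> List[int]:
    """Assumed linear pacing: start with the easiest fraction, end with all data."""
    progress = min(1.0, step / total_steps)
    fraction = initial_fraction + (1.0 - initial_fraction) * progress
    size = max(1, math.ceil(fraction * len(order)))
    return list(order[:size])


# Toy embeddings standing in for a pretrained vision-language model's output.
image_embs = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
caption_embs = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [-1.0, 0.0]]

scores = difficulty_scores(image_embs, caption_embs)
order = curriculum_order(scores)

for step in (0, 5, 10):
    print(step, available_pool(order, step, total_steps=10))
```

In this sketch, the difficulty scores are computed once before training, so ordering the data adds only a single forward pass of the pretrained model per pair. Sampling mini-batches from `available_pool` at each step would then expose the captioning model to harder pairs as training progresses.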

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings