Cross-Modal Similarity-Based Curriculum Learning for Image Captioning
Hongkuan Zhang, Saku Sugawara, Akiko Aizawa, Lei Zhou, Ryohei Sasano, Koichi Takeda

TL;DR
This paper introduces a cross-modal similarity-based curriculum learning method for image captioning, which improves performance and generalization by adaptively measuring difficulty using a pretrained vision-language model.
Contribution
It proposes a difficulty measurement for image captioning based on cross-modal similarity computed by a pretrained vision-language model, so the curriculum needs neither domain-specific features nor prior model training.
Findings
Outperforms baselines on the COCO and Flickr30k datasets.
Demonstrates improved generalization to unseen data.
Maintains convergence speed competitive with baselines.
Abstract
Image captioning models require the high-level generalization ability to describe the contents of various images in words. Most existing approaches treat the image-caption pairs equally in their training without considering the differences in their learning difficulties. Several image captioning approaches introduce curriculum learning methods that present training data with increasing levels of difficulty. However, their difficulty measurements are either based on domain-specific features or prior model training. In this paper, we propose a simple yet efficient difficulty measurement for image captioning using cross-modal similarity calculated by a pretrained vision-language model. Experiments on the COCO and Flickr30k datasets show that our proposed approach achieves superior performance and competitive convergence speed to baselines without requiring heuristics or incurring…
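The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy embeddings stand in for the outputs of a pretrained vision-language model, and the names `cosine_similarity`, `order_by_difficulty`, and `curriculum_stages` are hypothetical. It also assumes that a higher image-caption similarity means an easier pair, and that training data is released in cumulative stages; neither detail is stated in the text above.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def order_by_difficulty(pairs):
    """Sort image-caption pairs from easiest to hardest.

    pairs: list of (pair_id, image_embedding, caption_embedding).
    Assumption: higher cross-modal similarity means an easier pair.
    """
    scored = [(pid, cosine_similarity(img, cap)) for pid, img, cap in pairs]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [pid for pid, _ in scored]


def curriculum_stages(ordered_ids, num_stages):
    """Split the ordered data into cumulative stages (an assumed schedule).

    Stage s contains the easiest ceil(n * s / num_stages) pairs, so the
    final stage covers the full training set.
    """
    n = len(ordered_ids)
    stages = []
    for s in range(1, num_stages + 1):
        cutoff = math.ceil(n * s / num_stages)
        stages.append(ordered_ids[:cutoff])
    return stages


# Toy embeddings standing in for pretrained vision-language model outputs.
toy_pairs = [
    ("a", [1.0, 0.0], [1.0, 0.0]),  # caption matches the image closely
    ("b", [1.0, 0.0], [0.0, 1.0]),  # caption unrelated to the image
    ("c", [1.0, 0.0], [1.0, 1.0]),  # caption partially related
]

ordered = order_by_difficulty(toy_pairs)
stages = curriculum_stages(ordered, num_stages=2)
print(ordered)
print(stages)
```

In practice the embeddings would come from the image and text encoders of a pretrained vision-language model, and each stage would feed the captioning model's training loop in turn.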
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
