STAIR Captions: Constructing a Large-Scale Japanese Image Caption Dataset
Yuya Yoshikawa, Yutaro Shigeto, Akikazu Takeuchi

TL;DR
This paper introduces STAIR Captions, a large-scale Japanese image caption dataset built on MS-COCO images, and shows that a neural network trained on it generates better Japanese captions than machine-translating generated English captions.
Contribution
STAIR Captions, a large-scale Japanese image caption dataset with 820,310 captions for 164,062 MS-COCO images, built to address the scarcity of Japanese caption data.
Findings
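A neural network trained on STAIR Captions generates more natural Japanese captions than a translation-based pipeline.
The translation-based baseline first generates English captions and then applies English-Japanese machine translation; the directly trained model produces better captions.
The dataset helps fill a gap in Japanese caption data, since most available caption datasets are in English.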
Abstract
In recent years, automatic generation of image descriptions (captions), that is, image captioning, has attracted a great deal of attention. In this paper, we particularly consider generating Japanese captions for images. Since most available caption datasets have been constructed for the English language, there are few datasets for Japanese. To tackle this problem, we construct a large-scale Japanese image caption dataset based on images from MS-COCO, which is called STAIR Captions. STAIR Captions consists of 820,310 Japanese captions for 164,062 images. In the experiment, we show that a neural network trained using STAIR Captions can generate more natural and better Japanese captions, compared to those generated using English-Japanese machine translation after generating English captions.
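The dataset figures in the abstract can be checked with a short sketch. The snippet below assumes the captions are stored in an MS-COCO-style annotation layout (a list of records with `image_id` and `caption` fields); that layout, the helper name `captions_by_image`, and the sample records are illustrative assumptions, not details taken from the paper. It groups captions by image and confirms that the reported totals work out to five captions per image on average.

```python
from collections import defaultdict


def captions_by_image(annotations):
    """Group caption strings by the id of the image they describe."""
    grouped = defaultdict(list)
    for ann in annotations:
        grouped[ann["image_id"]].append(ann["caption"])
    return dict(grouped)


# Hypothetical records in an assumed MS-COCO-style layout.
# These captions are made up for illustration, not drawn from the dataset.
sample = {
    "annotations": [
        {"id": 1, "image_id": 10, "caption": "猫がソファの上で寝ている"},
        {"id": 2, "image_id": 10, "caption": "ソファで眠る猫"},
        {"id": 3, "image_id": 11, "caption": "赤いバスが道路を走っている"},
    ]
}

grouped = captions_by_image(sample["annotations"])
sample_average = sum(len(caps) for caps in grouped.values()) / len(grouped)

# Totals stated in the abstract: 820,310 captions for 164,062 images.
reported_average = 820_310 / 164_062

print(f"captions per image in the sample: {sample_average}")
print(f"captions per image from the abstract's totals: {reported_average}")
```

With a real annotation file, the same helper could be applied to the parsed JSON's annotation list, provided the file follows the assumed layout.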
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Natural Language Processing Techniques
