DEJIMA: A Novel Large-scale Japanese Dataset for Image Captioning and Visual Question Answering

Toshiki Katsube; Taiga Fukuhara; Kenichiro Ando; Yusuke Mukuta; Kohei Uehara; Tatsuya Harada

arXiv:2512.00773 · cs.CV · December 2, 2025

TL;DR

This paper introduces DEJIMA, a large-scale Japanese dataset for image captioning and VQA built with a scalable pipeline. It is far larger than existing Japanese V&L datasets, is rated higher on cultural relevance than translated datasets, and improves the performance of models trained on it.

Contribution

The paper presents a scalable pipeline for constructing large-scale Japanese V&L datasets. The resulting dataset, DEJIMA, offers greater linguistic naturalness and cultural coverage than previous datasets.

Findings

01

DEJIMA-Cap and DEJIMA-VQA each contain 3.88 million image-text pairs.

02

Models trained on DEJIMA outperform baselines on Japanese V&L benchmarks.

03

DEJIMA demonstrates higher Japaneseness and cultural relevance than translated datasets.

Abstract

This work addresses the scarcity of high-quality, large-scale resources for Japanese Vision-and-Language (V&L) modeling. We present a scalable and reproducible pipeline that integrates large-scale web collection with rigorous filtering/deduplication, object-detection-driven evidence extraction, and Large Language Model (LLM)-based refinement under grounding constraints. Using this pipeline, we build two resources: an image-caption dataset (DEJIMA-Cap) and a VQA dataset (DEJIMA-VQA), each containing 3.88M image-text pairs, far exceeding the size of existing Japanese V&L datasets. Human evaluations demonstrate that DEJIMA achieves substantially higher Japaneseness and linguistic naturalness than datasets constructed via translation or manual annotation, while maintaining factual correctness at a level comparable to human-annotated corpora. Quantitative analyses of image feature…
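The filtering/deduplication stage named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Pair` type, the `looks_japanese` heuristic, its `min_ratio` threshold, and the exact-hash deduplication are all assumptions chosen for illustration, and the abstract does not specify which filters or deduplication method the pipeline uses.

```python
import hashlib
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    """Hypothetical container for one crawled image-text pair."""
    image_bytes: bytes
    text: str


def normalize(text: str) -> str:
    # NFKC folds half-width/full-width variants common in Japanese web text;
    # split/join collapses runs of whitespace and trims the ends.
    return " ".join(unicodedata.normalize("NFKC", text).split())


def looks_japanese(text: str, min_ratio: float = 0.3) -> bool:
    # Assumed heuristic: fraction of characters that are hiragana,
    # katakana, or common CJK ideographs.
    if not text:
        return False
    hits = sum(
        1 for ch in text
        if "\u3040" <= ch <= "\u30ff" or "\u4e00" <= ch <= "\u9fff"
    )
    return hits / len(text) >= min_ratio


def filter_and_dedup(pairs):
    """Keep Japanese-looking pairs, dropping exact (image, text) repeats."""
    seen = set()
    kept = []
    for p in pairs:
        text = normalize(p.text)
        if not looks_japanese(text):
            continue
        key = (hashlib.sha256(p.image_bytes).hexdigest(), text)
        if key in seen:
            continue
        seen.add(key)
        kept.append(Pair(p.image_bytes, text))
    return kept


sample = [
    Pair(b"a", "猫が窓辺で寝ている"),
    Pair(b"a", "猫が窓辺で寝ている "),  # duplicate after normalization
    Pair(b"b", "a cat sleeping"),        # filtered: not Japanese
    Pair(b"c", "ｶﾞﾗｽの器"),              # half-width katakana, folded by NFKC
]
kept = filter_and_dedup(sample)
```

Exact hashing only catches byte-identical images; a web-scale pipeline would likely need near-duplicate detection as well, which this sketch does not attempt.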

Code & Models

Datasets

MIL-UT/DEJIMA-dataset
dataset · 52 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Topic Modeling