Jagle: Building a Large-Scale Japanese Multimodal Post-Training Dataset for Vision-Language Models
Issa Sugiura, Keito Sasagawa, Keisuke Nakao, Koki Maeda, Ziqi Yin, Zhishen Yang, Shuhei Kurita, Yusuke Oda, Ryoko Tokuhisa, Daisuke Kawahara, Naoaki Okazaki

TL;DR
Jagle is the largest Japanese multimodal dataset for vision-language models, enabling improved multilingual performance and broad task coverage through diverse data collection and generation strategies.
Contribution
This work introduces Jagle, a large-scale Japanese multimodal dataset created from heterogeneous sources, enhancing Japanese VLM training and performance.
Findings
A 2.2B model trained on Jagle outperforms existing Japanese VLMs.
Combining Jagle with FineVision improves English performance.
Jagle achieves strong results on ten Japanese evaluation tasks.
Abstract
Developing vision-language models (VLMs) that generalize across diverse tasks requires large-scale training datasets with diverse content. In English, such datasets are typically constructed by aggregating and curating numerous existing visual question answering (VQA) resources. However, this strategy does not readily extend to other languages, where VQA datasets remain limited in both scale and domain coverage, posing a major obstacle to building high-quality multilingual and non-English VLMs. In this work, we introduce Jagle, the largest Japanese multimodal post-training dataset to date, comprising approximately 9.2 million instances across diverse tasks. Rather than relying on existing VQA datasets, we collect heterogeneous source data, including images, image-text pairs, and PDF documents, and generate VQA pairs through multiple strategies such as VLM-based QA generation,…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
