Jagle: Building a Large-Scale Japanese Multimodal Post-Training Dataset for Vision-Language Models

Issa Sugiura; Keito Sasagawa; Keisuke Nakao; Koki Maeda; Ziqi Yin; Zhishen Yang; Shuhei Kurita; Yusuke Oda; Ryoko Tokuhisa; Daisuke Kawahara; Naoaki Okazaki

arXiv:2604.02048 · cs.CV · April 3, 2026

TL;DR

Jagle is the largest Japanese multimodal post-training dataset to date, with approximately 9.2 million instances across diverse tasks. It is built by generating VQA pairs from heterogeneous sources (images, image-text pairs, and PDF documents) rather than by aggregating existing VQA datasets.

Contribution

This work introduces Jagle, a large-scale Japanese multimodal post-training dataset built from heterogeneous sources rather than from existing VQA datasets, intended to support training Japanese VLMs.

Findings

01. A 2.2B-parameter model trained on Jagle outperforms existing Japanese VLMs.

02. Combining Jagle with FineVision improves English performance.

03. Models trained on Jagle achieve strong results across ten Japanese evaluation tasks.

Abstract

Developing vision-language models (VLMs) that generalize across diverse tasks requires large-scale training datasets with diverse content. In English, such datasets are typically constructed by aggregating and curating numerous existing visual question answering (VQA) resources. However, this strategy does not readily extend to other languages, where VQA datasets remain limited in both scale and domain coverage, posing a major obstacle to building high-quality multilingual and non-English VLMs. In this work, we introduce Jagle, the largest Japanese multimodal post-training dataset to date, comprising approximately 9.2 million instances across diverse tasks. Rather than relying on existing VQA datasets, we collect heterogeneous source data, including images, image-text pairs, and PDF documents, and generate VQA pairs through multiple strategies such as VLM-based QA generation,…

Code & Models

Models

Datasets

llm-jp/Jagle
dataset · 978 downloads
