WAON: Large-Scale Japanese Image-Text Pair Dataset for Improving Model Performance on Japanese Cultural Tasks
Issa Sugiura, Shuhei Kurita, Yusuke Oda, Daisuke Kawahara, Yasuo Okabe, Naoaki Okazaki

TL;DR
WAON is a large-scale Japanese image-text dataset designed to enhance vision-language models' performance on Japanese cultural tasks, addressing data scarcity and improving benchmark results.
Contribution
The paper introduces WAON, the largest Japanese image-text dataset, and WAON-Bench, a curated benchmark, enabling better fine-tuning and evaluation for Japanese cultural AI applications.
Findings
Fine-tuning on WAON improves Japanese cultural task performance.
A model fine-tuned on WAON achieves state-of-the-art results among comparable models.
The dataset construction pipeline enhances quality through filtering and deduplication.
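As a rough illustration of what a filtering and deduplication pass over image-text pairs can look like, here is a minimal Python sketch. It is not the authors' pipeline: the summary above does not describe the actual filters or deduplication method. The function names, the `min_chars` threshold, the kana-based language check, and the exact-match hashing key are all assumptions chosen for illustration.

```python
import hashlib
import unicodedata


def contains_japanese(text: str) -> bool:
    """Crude heuristic (assumption): at least one hiragana or katakana character.

    Captions written only in kanji would be rejected by this check.
    """
    return any("\u3040" <= ch <= "\u30ff" for ch in text)


def normalize_caption(text: str) -> str:
    """NFKC-normalize, collapse whitespace, and lowercase the caption."""
    return " ".join(unicodedata.normalize("NFKC", text).split()).lower()


def filter_and_dedup(pairs, min_chars=5):
    """Keep (url, caption) pairs that pass the filters, dropping exact duplicates.

    `min_chars` is a hypothetical threshold, not a value from the paper.
    """
    seen = set()
    kept = []
    for url, caption in pairs:
        norm = normalize_caption(caption)
        # Filtering step: drop very short or non-Japanese captions.
        if len(norm) < min_chars or not contains_japanese(norm):
            continue
        # Deduplication step: hash the URL together with the normalized caption.
        key = hashlib.sha256(f"{url}\n{norm}".encode("utf-8")).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        kept.append((url, caption))
    return kept
```

Exact-match hashing only catches identical pairs after normalization. Near-duplicate images or paraphrased captions would need a similarity-based method, which this sketch does not attempt.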
Abstract
Contrastive pre-training on large-scale image-text pair datasets has driven major advances in vision-language representation learning. Recent work shows that pre-training on global data followed by language- or culture-specific fine-tuning is effective for improving performance in target domains. With the availability of strong open-weight multilingual models such as SigLIP2, this paradigm has become increasingly practical. However, for Japanese, the scarcity of large-scale, high-quality image-text pair datasets tailored to Japanese language and cultural content remains a key limitation. To address this gap, we introduce WAON, the largest Japanese image-text pair dataset constructed from Japanese web content in Common Crawl, containing approximately 155 million examples. Our dataset construction pipeline employs filtering and deduplication to improve dataset quality. To improve the…
