JaWildText: A Benchmark for Vision-Language Models on Japanese Scene Text Understanding
Koki Maeda, Naoaki Okazaki

TL;DR
JaWildText is a new benchmark for evaluating vision-language models (VLMs) on Japanese scene text understanding. It targets language-specific challenges such as mixed scripts, vertical writing, and a large character inventory, and it is designed for diagnostic use.
Contribution
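The paper introduces a Japanese scene text benchmark with three complementary tasks and 3,241 annotated instances from 2,961 newly captured images. It fills gaps left by multilingual benchmarks, which underrepresent Japanese-specific complexity, and by existing Japanese datasets, which focus mainly on scanned documents.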
Findings
The best model achieves an average score of 0.64 across tasks
Recognition, especially of kanji, remains the main bottleneck
JaWildText enables detailed diagnosis of Japanese scene text understanding
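To make the mixed-script point concrete, the sketch below shows one way to break a transcription down by script and count its unique character types. It is an illustration only, not the benchmark's own tooling: the function names `script_of` and `script_profile` are hypothetical, and the Unicode block ranges are a simplifying assumption (for example, rare kanji outside the two listed CJK blocks fall into "other").

```python
from collections import Counter


def script_of(ch: str) -> str:
    """Assign a character to a coarse script class using Unicode block ranges.

    The ranges are an assumption made for illustration, not a definition
    taken from the JaWildText paper.
    """
    cp = ord(ch)
    if 0x3041 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF or 0xFF66 <= cp <= 0xFF9F:
        return "katakana"  # full-width block plus half-width katakana
    if 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF:
        return "kanji"  # CJK Unified Ideographs and Extension A
    if ch.isascii() and ch.isalpha():
        return "latin"
    if ch.isdigit():
        return "digit"
    return "other"


def script_profile(text: str) -> tuple[Counter, int]:
    """Return per-script character counts and the number of unique characters.

    Whitespace is ignored in both figures.
    """
    chars = [ch for ch in text if not ch.isspace()]
    counts = Counter(script_of(ch) for ch in chars)
    return counts, len(set(chars))


if __name__ == "__main__":
    # Hypothetical signage-style string mixing four scripts and digits.
    counts, unique_types = script_profile("東京駅 Tokyo えき カフェ 24")
    print(dict(counts), unique_types)
```

Applying the same breakdown to reference and predicted transcriptions would be one way to check whether errors concentrate in kanji, as the second finding suggests.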
Abstract
Japanese scene text poses challenges that multilingual benchmarks often fail to capture, including mixed scripts, frequent vertical writing, and a character inventory far larger than the Latin alphabet. Although Japanese is included in several multilingual benchmarks, these resources do not adequately capture the language-specific complexities. Meanwhile, existing Japanese visual text datasets have primarily focused on scanned documents, leaving in-the-wild scene text underexplored. To fill this gap, we introduce JaWildText, a diagnostic benchmark for evaluating vision-language models (VLMs) on Japanese scene text understanding. JaWildText contains 3,241 instances from 2,961 images newly captured in Japan, with 1.12 million annotated characters spanning 3,643 unique character types. It comprises three complementary tasks that vary in visual organization, output format, and writing…
