JaWildText: A Benchmark for Vision-Language Models on Japanese Scene Text Understanding

Koki Maeda; Naoaki Okazaki

arXiv:2603.27942·cs.CV·April 1, 2026


TL;DR

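JaWildText is a diagnostic benchmark for evaluating vision-language models (VLMs) on Japanese scene text captured in the wild. It targets language-specific challenges that multilingual benchmarks often miss: mixed scripts, frequent vertical writing, and a character inventory far larger than the Latin alphabet.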

Contribution

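The paper introduces a Japanese scene text benchmark with three complementary tasks, built from 3,241 instances across 2,961 newly captured images with 1.12 million annotated characters. It covers in-the-wild scene text, which multilingual benchmarks capture inadequately and which existing Japanese datasets, focused mainly on scanned documents, leave underexplored.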

Findings

1. The best model achieves an average score of 0.64 across tasks.

2. Recognition, especially of kanji, remains the main bottleneck.

3. JaWildText enables detailed diagnosis of Japanese scene text understanding.
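Because the findings single out kanji recognition, it can help to see how text is split by script before any per-script diagnosis. The following is a minimal sketch, not the benchmark's own tooling: the function names `script_of` and `script_profile`, the bucket labels, and the sample string are all hypothetical, and the buckets rely only on standard Unicode block ranges (Hiragana U+3040–U+309F, Katakana U+30A0–U+30FF, CJK Unified Ideographs U+4E00–U+9FFF plus Extension A).

```python
from collections import Counter


def script_of(ch: str) -> str:
    """Return a coarse script label for one character (illustrative buckets)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:          # Hiragana block
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:          # Katakana block (includes the long-vowel mark)
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF:
        return "kanji"                  # CJK Unified Ideographs + Extension A
    if ch.isascii() and ch.isalpha():
        return "latin"
    if ch.isdigit():
        return "digit"
    return "other"                      # punctuation, symbols, rarer blocks


def script_profile(text: str) -> Counter:
    """Count non-whitespace characters per script bucket."""
    return Counter(script_of(ch) for ch in text if not ch.isspace())


# Hypothetical sign text mixing four scripts and digits.
profile = script_profile("東京タワー Tokyo 2026 へようこそ")
print(dict(profile))
```

A profile like this could be used to group recognition errors by script, for example to check whether mistakes concentrate in the kanji bucket. The ranges are deliberately coarse: characters such as the iteration mark 々 (U+3005) fall into "other" here and would need explicit handling in a real analysis.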

Abstract

Japanese scene text poses challenges that multilingual benchmarks often fail to capture, including mixed scripts, frequent vertical writing, and a character inventory far larger than the Latin alphabet. Although Japanese is included in several multilingual benchmarks, these resources do not adequately capture the language-specific complexities. Meanwhile, existing Japanese visual text datasets have primarily focused on scanned documents, leaving in-the-wild scene text underexplored. To fill this gap, we introduce JaWildText, a diagnostic benchmark for evaluating vision-language models (VLMs) on Japanese scene text understanding. JaWildText contains 3,241 instances from 2,961 images newly captured in Japan, with 1.12 million annotated characters spanning 3,643 unique character types. It comprises three complementary tasks that vary in visual organization, output format, and writing…
