Evaluating Multimodal Large Language Models on Vertically Written Japanese Text

Keito Sasagawa; Shuhei Kurita; Daisuke Kawahara

arXiv:2511.15059 · cs.CV · November 20, 2025

TL;DR

This paper evaluates how well multimodal large language models read vertically written Japanese text. It finds that existing models perform poorly on such text and that training on a synthetic dataset improves their performance.

Contribution

It introduces a synthetic Japanese OCR dataset that includes vertically written text and shows that training on it improves MLLMs' ability to read vertical writing.

Findings

01. Existing MLLMs perform poorly on vertical Japanese text.

02. Training on the synthetic dataset improves model performance on vertical writing.

03. The datasets and code are publicly available for further research.

Abstract

Multimodal Large Language Models (MLLMs) have seen rapid advances in recent years and are now being applied to visual document understanding tasks. They are expected to process a wide range of document images across languages, including Japanese. Understanding documents from images requires models to read what is written in them. Since some Japanese documents are written vertically, support for vertical writing is essential. However, research specifically focused on vertically written Japanese text remains limited. In this study, we evaluate the reading capability of existing MLLMs on vertically written Japanese text. First, we generate a synthetic Japanese OCR dataset by rendering Japanese texts into images, and use it for both model fine-tuning and evaluation. This dataset includes Japanese text in both horizontal and vertical writing. We also create an evaluation dataset sourced…
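The abstract mentions rendering Japanese text in both horizontal and vertical writing. The excerpt does not describe the authors' rendering pipeline, so the following is only a minimal sketch of the layout step, not their method. It arranges a string into a character grid that reads top-to-bottom within a column and right-to-left across columns. The function names and the fixed column height `chars_per_column` are assumptions introduced for illustration.

```python
from typing import List

# Hypothetical helper constant: pads short columns so the grid stays rectangular.
FULLWIDTH_SPACE = "\u3000"


def split_into_columns(text: str, chars_per_column: int) -> List[str]:
    """Cut text into consecutive chunks; each chunk becomes one vertical column."""
    if chars_per_column <= 0:
        raise ValueError("chars_per_column must be positive")
    return [text[i:i + chars_per_column] for i in range(0, len(text), chars_per_column)]


def to_vertical_rows(text: str, chars_per_column: int) -> List[str]:
    """Return grid rows in which text reads top-to-bottom, right-to-left."""
    columns = split_into_columns(text, chars_per_column)
    if not columns:
        return []
    # Pad the last column so every column has the same height.
    padded = [col.ljust(chars_per_column, FULLWIDTH_SPACE) for col in columns]
    # The first column of text must appear at the far right of each row.
    right_to_left = padded[::-1]
    return ["".join(col[r] for col in right_to_left) for r in range(chars_per_column)]


if __name__ == "__main__":
    for row in to_vertical_rows("あいうえおか", 3):
        print(row)
```

A real renderer would then draw each grid cell onto an image with a Japanese font. It would also need vertical glyph variants for punctuation and the long vowel mark, which this sketch does not handle.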

Taxonomy

Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Generative Adversarial Networks and Image Synthesis