Evaluating Multimodal Large Language Models on Vertically Written Japanese Text
Keito Sasagawa, Shuhei Kurita, Daisuke Kawahara

TL;DR
This paper evaluates how well multimodal large language models read vertically written Japanese text, showing that existing models have limitations and that training on a synthetic dataset improves their performance.
Contribution
It introduces a synthetic Japanese OCR dataset that covers both horizontal and vertical writing and demonstrates that training on it improves MLLMs' understanding of vertical text.
Findings
Existing MLLMs perform poorly on vertical Japanese text.
Training on the synthetic dataset improves model performance on vertical writing.
The datasets and code are publicly available for further research.
Abstract
Multimodal Large Language Models (MLLMs) have seen rapid advances in recent years and are now being applied to visual document understanding tasks. They are expected to process a wide range of document images across languages, including Japanese. Understanding documents from images requires models to read what is written in them. Since some Japanese documents are written vertically, support for vertical writing is essential. However, research specifically focused on vertically written Japanese text remains limited. In this study, we evaluate the reading capability of existing MLLMs on vertically written Japanese text. First, we generate a synthetic Japanese OCR dataset by rendering Japanese texts into images, and use it for both model fine-tuning and evaluation. This dataset includes Japanese text in both horizontal and vertical writing. We also create an evaluation dataset sourced…
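To make the difference between the two writing directions concrete, here is a minimal sketch of how the same string might be placed on a character grid before rendering. It is not the authors' pipeline: the function names `layout_cells` and `render_grid`, the fixed `chars_per_line` wrapping, and the text-grid output are all assumptions made for illustration. It assumes the usual convention for vertical Japanese, in which characters run top to bottom and lines (columns) advance from right to left.

```python
from typing import List, Tuple

Cell = Tuple[str, int, int]  # (character, column index, row index)


def layout_cells(text: str, chars_per_line: int, vertical: bool) -> List[Cell]:
    """Assign each character a grid cell (hypothetical helper).

    Horizontal: characters run left to right, lines stack top to bottom.
    Vertical:   characters run top to bottom, lines stack right to left.
    """
    if chars_per_line <= 0:
        raise ValueError("chars_per_line must be positive")
    # Wrap the text into fixed-length lines (an assumption; real layout
    # engines also apply line-breaking rules).
    lines = [text[i:i + chars_per_line] for i in range(0, len(text), chars_per_line)]
    cells: List[Cell] = []
    for line_idx, line in enumerate(lines):
        for char_idx, ch in enumerate(line):
            if vertical:
                # The first line becomes the rightmost column.
                col, row = len(lines) - 1 - line_idx, char_idx
            else:
                col, row = char_idx, line_idx
            cells.append((ch, col, row))
    return cells


def render_grid(cells: List[Cell]) -> str:
    """Draw the cells as text, padding empty cells with a full-width space."""
    if not cells:
        return ""
    width = max(col for _, col, _ in cells) + 1
    height = max(row for _, _, row in cells) + 1
    grid = [["\u3000"] * width for _ in range(height)]
    for ch, col, row in cells:
        grid[row][col] = ch
    return "\n".join("".join(row) for row in grid)


if __name__ == "__main__":
    sample = "あいうえお"
    print(render_grid(layout_cells(sample, 2, vertical=False)))
    print()
    print(render_grid(layout_cells(sample, 2, vertical=True)))
```

In an actual image renderer, each cell would be multiplied by a glyph size to get pixel coordinates. Vertical typesetting also uses rotated or alternate glyphs for some punctuation and for the long vowel mark, which this grid sketch does not model.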
Taxonomy
Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Generative Adversarial Networks and Image Synthesis
